Introduction

This is part of a series of posts in which I examine datasets available through the City of Buffalo’s Open Data portal. Previously I examined recycling data (post 1 and post 2).

The next dataset that caught my eye is a very impressive tree inventory from the Bureau of Forestry. It contains information on the species, size, location, and economic contributions of every single street tree in the city.

Get the Tree Inventory Data

First, I will load the required R packages. I’ll be using RSocrata to access the Socrata API from R, the tidyverse family of packages to manipulate the data, sf to handle spatial data, and ggplot2 for figures (ggplot2 is already attached by tidyverse, but I load it explicitly here).

library(RSocrata)
library(tidyverse)
library(sf)
library(ggplot2)

Identify the URL for the tree inventory dataset in CSV form:

buf_datasets = ls.socrata('https://data.buffalony.gov')
ind = which(buf_datasets$title == 'Tree Inventory')
buf_datasets$distribution[[ind]]$mediaType
## [1] "text/csv"            "application/rdf+xml" "application/json"   
## [4] "application/xml"
csv_url = buf_datasets$distribution[[ind]]$downloadURL[1]
csv_url
## [1] "https://data.buffalony.gov/api/views/n4ni-uuec/rows.csv?accessType=DOWNLOAD"
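Picking the first element of downloadURL works here because the CSV entry happens to be listed first. A slightly more defensive sketch matches on the media type instead. The data frame below is a hypothetical stand-in for buf_datasets$distribution[[ind]], with placeholder URLs, so that the snippet runs without network access:

```r
# Hypothetical stand-in for buf_datasets$distribution[[ind]];
# the URLs are placeholders, not real endpoints.
dist_toy = data.frame(
  mediaType = c('text/csv', 'application/rdf+xml',
                'application/json', 'application/xml'),
  downloadURL = c('https://example.org/rows.csv',
                  'https://example.org/rows.rdf',
                  'https://example.org/rows.json',
                  'https://example.org/rows.xml'),
  stringsAsFactors = FALSE
)

# Select the download URL by media type rather than by position
csv_url_toy = dist_toy$downloadURL[dist_toy$mediaType == 'text/csv']
csv_url_toy
```

The same indexing expression should carry over to the real object, assuming its mediaType and downloadURL entries line up one to one as the output above suggests.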

Then load the dataset into R using RSocrata as a tibble:

trees_raw = read.socrata(csv_url) %>% as_tibble()
trees_raw
## # A tibble: 131,469 x 27
##    Editing Botanical.Name Common.Name   DBH Total.Yearly.Ec…
##    <chr>   <chr>          <chr>       <int>            <dbl>
##  1 Buffal… VACANT         VACANT          0              0  
##  2 Olmste… PLATANUS X AC… LONDON PLA…     1             18.5
##  3 Olmste… PLATANUS X AC… LONDON PLA…     3             38.8
##  4 Olmste… PLATANUS X AC… LONDON PLA…     3             38.8
##  5 Olmste… PLATANUS X AC… LONDON PLA…     4             45.1
##  6 Olmste… ZELKOVA SERRA… ZELKOVA, J…    11            164. 
##  7 Olmste… ZELKOVA SERRA… ZELKOVA, J…    11            164. 
##  8 Olmste… SYRINGA RETIC… LILAC, JAP…    12             50.3
##  9 Olmste… SYRINGA RETIC… LILAC, JAP…     9             40.9
## 10 Olmste… STUMP          STUMP          19              0  
## # ... with 131,459 more rows, and 22 more variables:
## #   Stormwater.Benefits.... <dbl>, Stormwater.Gallons.Saved <dbl>,
## #   Greenhouse.CO2.Benefits.... <dbl>, CO2.Avoided..in.lbs.. <dbl>,
## #   CO2.Sequestered..in.lbs.. <dbl>, Energy.Benefits.... <dbl>,
## #   kWh.Saved <dbl>, Therms.Saved <dbl>, Air.Quality.Benefits.... <dbl>,
## #   Pollutants.Saved..in.lbs.. <dbl>, Property.Benefits.... <dbl>,
## #   Leaf.Surface.Area..in.sq..ft.. <dbl>, Address <int>, Street <chr>,
## #   Side <chr>, Site <int>, Council.District <chr>, Park.Name <chr>,
## #   Latitude <dbl>, Longitude <dbl>, Site.ID <int>, Location <chr>

A quick look at the first 10 rows reveals that the dataset includes rows for stumps (STUMP) and, presumably, for locations where a tree could be planted (VACANT). This information is potentially interesting, but not for what I’m planning to do with this dataset, so I removed these rows. I also renamed two columns for convenience:

trees = trees_raw %>%
  filter(! Common.Name %in% c('STUMP', 'VACANT')) %>%
  rename(Species = Common.Name,
         Eco_Value = Total.Yearly.Eco.Benefits....)
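To check that a filter like this behaves as intended, it can help to try it on a small table first. The following is a minimal sketch on toy data; the species names and DBH values are made up, and only the STUMP and VACANT labels mirror the real dataset:

```r
library(dplyr)

# Toy data with made-up species names; only STUMP and VACANT
# mirror labels that appear in the real inventory
toy = tibble(
  Common.Name = c('VACANT', 'STUMP', 'SPECIES A', 'SPECIES B', 'SPECIES A'),
  DBH = c(0, 19, 11, 9, 3)
)

# Same filter-and-rename pattern as above, applied to the toy table
toy_trees = toy %>%
  filter(!Common.Name %in% c('STUMP', 'VACANT')) %>%
  rename(Species = Common.Name)

# Number of rows dropped by the filter
n_dropped = nrow(toy) - nrow(toy_trees)
n_dropped

# Confirm that no placeholder rows remain
any(toy_trees$Species %in% c('STUMP', 'VACANT'))
```

Running the same two checks on trees_raw and trees would show how many of the original rows were stumps or vacant sites.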

Initial Visualizations

First, I wanted to get a rough idea of the spatial distribution of the trees. My first approach was to convert the trees dataset to a simple feature (sf) object. However, I found that geom_sf(), the ggplot2 function normally used to plot sf objects, was extremely slow for large spatial point datasets such as this one. Thus, I instead used a combination of geom_point() and coord_map(), as suggested here.

# Note: coord_map() requires the mapproj package to be installed
trees %>%
  ggplot(aes(x = Longitude, y = Latitude, color = Species)) +
  geom_point(size = 0.4) +
  coord_map() +
  theme(legend.position = 'none')
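For reference, the sf conversion mentioned above might look roughly like the sketch below. It uses a tiny made-up table rather than the full inventory, and it assumes the coordinates are WGS84 longitude/latitude (EPSG:4326), which the dataset itself does not state:

```r
library(sf)

# Toy coordinates, made up for illustration only
toy = data.frame(
  Species = c('SPECIES A', 'SPECIES B', 'SPECIES A'),
  Longitude = c(-78.87, -78.85, -78.83),
  Latitude = c(42.89, 42.91, 42.93)
)

# Convert to an sf point object; the CRS (EPSG:4326) is an assumption
toy_sf = st_as_sf(toy, coords = c('Longitude', 'Latitude'), crs = 4326)
toy_sf

# With ggplot2 loaded, the points could then be drawn with:
# ggplot(toy_sf) + geom_sf(aes(color = Species), size = 0.4)
```

On a handful of points this is quick; the slowdown described above only becomes noticeable with a point layer as large as the full inventory.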