Introduction

I’ve long been a stubborn proponent of base R for processing and visualizing data. It remains an incredibly robust and flexible tool for these purposes. However, I’ve recently taken a deep dive into the tidyverse family of R packages, which perform the same types of tasks in an arguably more streamlined and clear fashion.

My approach to learning the tidyverse and the related ggplot2 visualization package has involved several excellent books, but I always learn best by actually applying tools to questions I’m interested in. So, in addition to applications in my current research, I’ve been looking for other available datasets to explore.

The city of Buffalo, NY recently created a website called the Buffalo OpenData for publishing data on city services. As I’m from the Buffalo region, I decided this was a perfect source for datasets to work with.

Downloading the Data

The Buffalo OpenData portal uses the Socrata Open Data API (SODA). Luckily, there is an R client for Socrata, RSocrata, available on CRAN.

The RSocrata package has a variety of functions available, including one for identifying available datasets (ls.socrata) and one for downloading them directly into R (read.socrata). Here are a few of the available datasets on the Buffalo OpenData Portal:

library(RSocrata)
buf_datasets = ls.socrata('https://data.buffalony.gov')
head(buf_datasets$title)
## [1] "DataU Application Form"                                  
## [2] "Code Violations"                                         
## [3] "Monthly Uniform Crime Reporting (UCR) Program Statistics"
## [4] "Zip Codes"                                               
## [5] "Local Historic Districts"                                
## [6] "Assessment Residential Neighborhoods"

I’m going to explore the first dataset listed, which provides monthly recycling rates for different Buffalo neighborhoods. The dataset is available in several formats:

ind = which(buf_datasets$title == 'Neighborhood Curbside Recycling Rates')
buf_datasets$distribution[[ind]]$mediaType
## [1] "text/csv"            "application/rdf+xml" "application/json"   
## [4] "application/xml"

You can get the URL for the CSV dataset with $downloadURL:

csv_url = buf_datasets$distribution[[ind]]$downloadURL[1]
csv_url
## [1] "https://data.buffalony.gov/api/views/ug79-xatx/rows.csv?accessType=DOWNLOAD"

I’ll pass the URL to read.socrata, then immediately pipe the resulting data frame into a tibble. For that (and later data manipulation) I’ll also need the tidyverse R package.

library(tidyverse)
rec_raw = read.socrata(csv_url) %>% as_tibble()
rec_raw
## # A tibble: 624 x 5
##    NEIGHBORHOOD  DATE                CURBSIDE.RECYCLIN… CURBSIDE.GARBAGE.…
##    <chr>         <dttm>                           <dbl>              <dbl>
##  1 ABBOTT McKIN… 2017-05-31 00:00:00             107521             553681
##  2 ABBOTT McKIN… 2017-06-30 00:00:00              82487             487734
##  3 ABBOTT McKIN… 2017-07-31 00:00:00              95722             562587
##  4 ABBOTT McKIN… 2017-08-31 00:00:00              73604             392703
##  5 ABBOTT McKIN… 2017-09-30 00:00:00              74255             455175
##  6 ABBOTT McKIN… 2017-10-31 00:00:00              97040             533051
##  7 ABBOTT McKIN… 2017-11-30 00:00:00              93935             449784
##  8 ABBOTT McKIN… 2017-12-31 00:00:00              76653             315163
##  9 ABBOTT McKIN… 2018-01-31 00:00:00             111554             458403
## 10 ABBOTT McKIN… 2018-02-28 00:00:00              69018             322288
## # ... with 614 more rows, and 1 more variable:
## #   CURBSIDE.RECYCLING.RATE <dbl>

Each row of the dataset represents a neighborhood’s monthly recycling statistics. There are columns for neighborhood, date, total recycling, total garbage, and the recycling rate.

Data are available for roughly the past year:

min(rec_raw$DATE)
## [1] "2017-05-31 EDT"
max(rec_raw$DATE)
## [1] "2018-04-30 EDT"

Tidying the Dataset

Cleaning up Variables

The first step in cleaning up this dataset is to rename the final three columns, which are a little long.

names(rec_raw)[3:5] = c('RECYCLE','GARBAGE','RATE')

Two additional things jumped out at me to fix. First the neighborhood names were inconsistently capitalized, which could becoming frustrating later on. Second, the RATE column appeared to be rounded. To fix these issues I made everything in the NEIGHBORHOOD column uppercase, and calculated the recycling rate myself:

rec = rec_raw %>%
    mutate(NEIGHBORHOOD = str_to_upper(NEIGHBORHOOD)) %>%
    mutate(RATE = RECYCLE / (RECYCLE + GARBAGE))

Summarizing by Neighborhood

Currently I have monthly recycling stats. For visualization, I decided to summarize information for each neighborhood across all available months. I also calculated an additional statistic that represented the proportional change in recycling rate for each neighborhood over the time period of the study:

\[ \textrm{Percent Change} = \frac{Rate_{end} - Rate_{start}}{Rate_{start}} \]

The equivalent function in R:

perc_change = function(val,date){
    start = val[which.min(date)]
    end = val[which.max(date)]
    (end - start)/start
}

The tools in the tidyverse make generating summary statistics for each neighborhood straightforward:

rec = rec %>%
    group_by(NEIGHBORHOOD) %>%
    summarize(MnRecycle = mean(RECYCLE,na.rm=T),
              MnGarbage = mean(GARBAGE,na.rm=T),
              MnRate = mean(RATE,na.rm=T) * 100,
              Change = perc_change(RATE, DATE))

rec
## # A tibble: 52 x 5
##    NEIGHBORHOOD      MnRecycle MnGarbage MnRate   Change
##    <chr>                 <dbl>     <dbl>  <dbl>    <dbl>
##  1 ABBOTT MCKINLEY       87338    448631  16.4   0.00900
##  2 ALBRIGHT              45562    180554  20.6   0.181  
##  3 ALLEN                 44611    200154  18.3  -0.00788
##  4 BABCOCK               39707    277084  12.7   0.147  
##  5 BLACK ROCK            64720    438862  13.0   0.269  
##  6 BROADWAY FILLMORE     46048    430929   9.78  0.115  
##  7 BRYANT                55434    293794  16.0  -0.00551
##  8 CAZENOVIA PARK        61598    320110  16.5  -0.0636 
##  9 COLD SPRING           11136     72917  13.3   0.0732 
## 10 COLUMBUS              12172     69311  15.0  -0.0144 
## # ... with 42 more rows

Now I have one row per neighborhood with summary statistics for each.

Visualizing the Data

I used the ggplot2 library for visualizations. Note that the geom_sf function I use later to plot spatial data is not available in the current CRAN version of ggplot2, so I installed the latest version from Github instead:

devtools::install_github('tidyverse/ggplot2')

Recycling Rate

Here’s a quick and dirty histogram, using ggplot2, showing the distribution of recycling rates by neighborhood. Most neighborhoods have a rate less than 20%.

library(ggplot2)
rec %>% ggplot() +
    labs(title = 'Distribution of recycling rates by neighborhood',
         x = 'Recycling Rate (%)') +
    geom_histogram(aes(x=MnRate),bins=15) +
    geom_vline(xintercept=mean(rec$MnRate),size=2,color='red') +
    geom_text(aes(x=16.5,y=6,
                  label=paste('Overall mean = ',round(mean(rec$MnRate),2),'%',sep='')))

Change in Recycling Rate

Instead looking at proportional change in recycling rate over the time period of the dataset:

rec %>% 
    mutate(above0 = ifelse(Change > 0,'Increasing','Decreasing')) %>%
    ggplot() +
    labs(title = 'Change in neighborhood recycling rate, May 2017-April 2018',
         x = '% Change in Recycling Rate',
         fill = 'Trend') +
    geom_histogram(aes(x=Change,fill=above0),breaks=seq(-0.15,0.35,0.05)) +
    geom_vline(xintercept=0,size=2,color='white')

Most neighborhoods are showing an increasing trend.

Seasonal Patterns

Note that the percent change statistic ignores any seasonal patterns in recycling rate that may exist. Summarizing the raw data by month reveals that recycling rates may be slightly higher in the winter, but the trend is not strong.

library(lubridate) #For handling dates
rec_raw %>%
    mutate(MONTH = factor(month(DATE)),
           SEASON = ifelse(MONTH%in%c(11,12,1,2), 'Winter','Other')) %>%
    ggplot(aes(x=MONTH,y=RATE, fill=SEASON)) +
        labs(title = 'Recycling rate by month',
             x = 'Month', y = 'Recycling rate') +
        geom_boxplot()

Regardless, this dataset spans almost exactly one year (May 2017 - April 2018) so seasonal effects should be minimized.

Georeferencing the Neighborhoods

Finding Spatial Data

To plot these data spatially (and to connect them to other spatially-referenced datasets), I needed information about the shape and location of each neighborhood. The metadata for this dataset refers to the neighborhood boundaries in the Buffalo ArcGIS database, but strangely this spatial information does not appear to be available on the OpenData portal.

I was forced to turn to Google instead, and quickly found a database of neighborhood boundary shapefiles provided by Zillow. The Zillow shapefiles are at the state level, so I downloaded the zipped-up file for New York and extracted it:

src_url = 'https://www.zillowstatic.com/static-neighborhood-boundaries/LATEST/static-neighborhood-boundaries/shp/ZillowNeighborhoods-NY.zip'
if (!file.exists('../data/ZillowNeighborhoods-NY.zip')){
    dir.create('../data')
    download.file(src_url,'../data/ZillowNeighborhoods-NY.zip')
}
unzip('../data/ZillowNeighborhoods-NY.zip',exdir='../data/')

To read in the shapefile I used the sf package, which can handle multiple different spatial data formats and also supports piping in the tidyverse style. I selected only the neighborhoods in the city of Buffalo, and also converted the neighborhood names to be all uppercase (to match the recycling data).

library(sf)
nb = st_read('../data/ZillowNeighborhoods-NY.shp',quiet=T) %>%
    filter(City == 'Buffalo') %>%
    mutate(NEIGHBORHOOD = str_to_upper(Name)) %>%
    select(NEIGHBORHOOD)

Joining the Datasets

The next step was to join the recycling data to the neighborhood spatial data by matching NEIGHBORHOOD names. I expected that there would be some mismatches to fix. I checked which neighborhoods in the spatial dataset didn’t have recycling data:

nb %>% full_join(rec,by='NEIGHBORHOOD') %>% filter(is.na(MnRecycle)) %>% as_tibble()
## # A tibble: 6 x 6
##   NEIGHBORHOOD MnRecycle MnGarbage MnRate Change                  geometry
##   <chr>            <dbl>     <dbl>  <dbl>  <dbl>        <MULTIPOLYGON [°]>
## 1 WATERFRONT          NA        NA     NA     NA (((-78.89637 42.89648, -…
## 2 BROADWAY-FI…        NA        NA     NA     NA (((-78.83935 42.90364, -…
## 3 MLK PARK            NA        NA     NA     NA (((-78.82406 42.91276, -…
## 4 DELAWARE-WE…        NA        NA     NA     NA (((-78.86706 42.92356, -…
## 5 SQUAW ISLAND        NA        NA     NA     NA (((-78.90206 42.91276, -…
## 6 CENTRAL BUS…        NA        NA     NA     NA (((-78.87156 42.89326, -…

and which neighborhoods in recycling dataset didn’t have spatial data:

nb %>% full_join(rec,by='NEIGHBORHOOD') %>% filter(is.na(st_dimension(.))) %>% as_tibble()
## # A tibble: 6 x 6
##   NEIGHBORHOOD      MnRecycle MnGarbage MnRate  Change
##   <chr>                 <dbl>     <dbl>  <dbl>   <dbl>
## 1 BROADWAY FILLMORE     46048    430929   9.78  0.115 
## 2 DELAWARE W. FERRY     45877    235099  16.4   0.108 
## 3 JOHNSON                8020     76138   9.68  0.167 
## 4 M.L.K. PARK           28716    249231  10.5   0.0969
## 5 PERRY                  1551     10524  13.0   0.300 
## 6 STATE HOSPITAL         1118      6918  14.1  -0.0269
## # ... with 1 more variable: geometry <MULTIPOLYGON [°]>

There are a few obvious issues of inconsistent naming which I fixed before doing a final join of the recycling and spatial datasets.

nb = nb %>%
    mutate(NEIGHBORHOOD = recode(NEIGHBORHOOD,
                `MLK PARK` = 'M.L.K. PARK',
                `DELAWARE-WEST FERRY` = 'DELAWARE W. FERRY',
                `BROADWAY-FILLMORE` = 'BROADWAY FILLMORE'))

spatial_rec = nb %>%    
    left_join(rec,by='NEIGHBORHOOD')

Building Maps

Now that the neighborhoods are georeferenced, I can visualize them in space using ggplot2.The first step is to calculate centroids for each neighborhood and extract their coordinates. I need these coordinates later so I can label individual neighborhoods.

nb_centers = spatial_rec %>%
    st_centroid() %>%
    mutate(X = st_coordinates(.)[,1],
           Y = st_coordinates(.)[,2])

Recycling Rate

I used ggplot with geom_sf to plot the neighborhood polygons and color them according to recycling rate. I also added labels to the three highest and three lowest neighborhoods by rate with geom_label, using the coordinates I extracted above.

spatial_rec %>%
    ggplot() +
    geom_sf(aes(fill=MnRate)) +
    scale_fill_gradient(low='red',high='green') +
    labs(title = 'Recycling rate in Buffalo neighborhoods',
         fill = '% Waste \n Recycled',
         x=NULL, y=NULL) +  
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank()) +
    geom_label(data=nb_centers %>% top_n(3,MnRate),
               mapping=aes(X,Y,label=NEIGHBORHOOD),
               color='darkgreen',alpha=0.7) +
    geom_label(data=nb_centers %>% top_n(-3,MnRate),
               mapping=aes(X,Y,label=NEIGHBORHOOD),
               color='red',alpha=0.7)