Add PPP analysis of Buffalo tree data

author: Ken Kellner <ken@kenkellner.com> 2018-09-04 13:30:45 -0400
committer: Ken Kellner <ken@kenkellner.com> 2018-09-04 13:30:45 -0400
commit: e863f78a8742093408e1609d747cb3d355f53da5 (patch)
tree: 9cb90e2f3334b0780d12ad0e21ac0246c33852b6
parent: 83ca108f437a76098afe3fbdd026239269132ba5 (diff)
2 files changed, 360 insertions, 0 deletions
diff --git a/src/buffalo-trees-part1.Rmd b/src/buffalo-trees-part1.Rmd
index 3adc57d..cf8699a 100644
--- a/src/buffalo-trees-part1.Rmd
+++ b/src/buffalo-trees-part1.Rmd
@@ -259,6 +259,14 @@ rich_plot +
         alpha=0.6)
 ```
 
+```{r,echo=F}
+#Save density plot for use in another post
+dens_overlay = dens_plot +
+  geom_sf(data=park_bounds %>%
+          filter(acres > 5), alpha=0.6)
+saveRDS(dens_overlay, '../data/dens_overlay.rds')
+```
+
 As predicted, areas of high tree species richness tend to be located inside parks.
 
 # Future Directions
diff --git a/src/buffalo-trees-part2.Rmd b/src/buffalo-trees-part2.Rmd
new file mode 100644
index 0000000..63289b5
--- /dev/null
+++ b/src/buffalo-trees-part2.Rmd
@@ -0,0 +1,352 @@
+---
+title: Distribution of Urban Trees in Buffalo (Part 2)
+date: 2018-09-04
+output:
+  html_document:
+    css: style.css
+    highlight: pygments
+---
+
+```{r,echo=FALSE}
+options(tidyverse.quiet=TRUE)
+library(tidyverse)
+suppressMessages(library(ggplot2))
+```
+
+# Introduction
+
+This is part two of an examination of the city of Buffalo's [tree inventory](https://data.buffalony.gov/Quality-of-Life/Tree-Inventory/n4ni-uuec).
+Part one is [here](buffalo-trees-part1.html).
+There is an enormous amount of information in the dataset and I'm only going to scratch the surface of what could be done with it.
+
+In the previous post, I qualitatively examined the spatial distribution of tree density, richness, and economic value across the city.
+I'll now take a more quantitative approach using Poisson point process models.
+
+# Poisson Point Processes
+
+A Poisson point process (PPP) is a statistical model that's appropriate for a series of events that occur at a random, but reasonably predictable, rate.
+For example, if you counted the number of cars that drove by your house from 12:00 PM - 1:00 PM every day for a year, you'd likely find that the count each day varied, but that the counts fell into a predictable range.
+The average number of cars per day is the PPP *rate* (or *intensity*), often represented by $\lambda$.
+The larger the value of $\lambda$, the more cars you'd expect in that hour-long period; if you live in Manhattan the value of $\lambda$ will be much higher than if you live in rural Kansas.
+
+The car example above is a PPP in one dimension (time).
+You can also have PPPs in two (or more) dimensions.
+Points in geographic space are often modeled with a PPP.
+Instead of counting the number of events in a given time interval you count the number of points in a given spatial area.
+In this case, the higher the value of $\lambda$, the more points that will be in the designated area.
+
+For example, the following figure shows realizations of two PPPs in two-dimensional space, one with rate $\lambda=100$ and one with rate $\lambda=500$, five times higher.
+The red rectangle represents the spatial area of interest.
+
+```{r,echo=F,fig.height=3}
+suppressMessages(library(spatstat))
+par(mfrow=c(1,2),mar=c(1,2,1,1))
+plot(rpoispp(100),main=expression(paste(lambda,'= 100')),pch=19)
+rect(0.1,0.1,0.5,0.5, border='red',lwd=1.5)
+plot(rpoispp(500),main=expression(paste(lambda,'= 500')),pch=19)
+rect(0.1,0.1,0.5,0.5, border='red',lwd=1.5)
+```
+
+There are roughly five times as many points in the outlined area on the right figure relative to the left.
+Note that the shape and size of the "area of interest" doesn't matter, as long as it's the same in both cases - no matter how you drew it, you'd expect five times as many points on the right.
+
+## Inhomogeneous Poisson Point Processes
+
+The examples above are all *homogeneous* point processes - that is, for a given model, the rate $\lambda$ is constant.
+However in some cases we might expect $\lambda$ to vary.
+For the cars example, you might expect that the day of the week could impact the number of cars going by - there might be fewer cars on weekend days.
+This is an *inhomogeneous* (and arguably much more interesting) point process.
+
+The figure below shows a homogeneous PPP on the left, and an inhomogeneous PPP (where the rate $\lambda$ increases moving from bottom to top) on the right.
+
+```{r,echo=F,fig.height=3}
+par(mfrow=c(1,2),mar=c(1,2,1,4))
+plot(rpoispp(100),main='Homogeneous PPP', pch=19)
+p_func <- function(x,y){y*400}
+plot(rpoispp(p_func),main='Inhomogeneous PPP', pch=19)
+```
+
+# Back to the Buffalo Trees Dataset
+
+```{r,echo=F}
+#Cache the dataset if necessary
+if (file.exists('../data/trees_raw.Rdata')){
+  load('../data/trees_raw.Rdata')
+} else {
+  dir.create('../data')
+  trees_raw = read.socrata(csv_url) %>% as_tibble()
+  save(trees_raw,file='../data/trees_raw.Rdata')
+}
+```
+
+In part 1 I qualitatively observed what appeared to be a very non-random (i.e., inhomogeneous) distribution of trees across the city of Buffalo.
+
+```{r, echo=F}
+trees_raw %>%
+  filter(! Common.Name %in% c('STUMP', 'VACANT')) %>%
+  ggplot(aes(x = Longitude, y = Latitude, color = Common.Name)) +
+  geom_point(size=0.4) +
+  coord_map() +
+  theme(legend.position='none')
+```
+
+Using Poisson point process models, we can test that observation quantitatively - and also determine what factors might be driving the distribution.
+For example, in part 1 I also (unsurprisingly) observed that tree density appeared to be higher inside city parks:
+
+```{r, echo=F}
+#Re-make density figure if it doesn't exist
+if (!file.exists('../data/dens_overlay.rds')){
+  dir.create('../data')
+  tempR <- tempfile(fileext = ".R")
+  library(knitr)
+  invisible(purl("buffalo-trees-part1.Rmd", output=tempR))
+  invisible(source(tempR))
+  unlink(tempR)
+}
+dens_overlay <- readRDS('../data/dens_overlay.rds')
+dens_overlay
+```
+
+In this analysis, I tested three hypotheses using Poisson point process models:
+
+1. The distribution of trees in Buffalo is non-random (inhomogeneous)
+2. The density of trees is higher in city parks
+3. The density of trees in an area is positively related to mean income (i.e., richer areas of the city have more trees)
+
+## Read in and Clean the Tree Dataset
+
+My goal was to get UTM X-Y coordinates (in meters) for each tree in the dataset, to feed into the PPP model.
+As in part 1, I started by reading in the raw data from the Socrata API using `RSocrata`.
+I cleaned it up using `tidyverse` tools.
+
+```{r}
+library(RSocrata)
+library(tidyverse)
+buf_datasets = ls.socrata('https://data.buffalony.gov')
+ind = which(buf_datasets$title == 'Tree Inventory')
+csv_url = buf_datasets$distribution[[ind]]$downloadURL[1]
+```
+
+```{r, eval=F}
+trees_raw = read.socrata(csv_url) %>% as_tibble()
+```
+
+As I'm focused on living trees, I removed information about stumps and vacant areas.
+Since I'm not interested in individual tree species for this analysis, I also dropped all columns except for the spatial information.
+
+```{r}
+trees = trees_raw %>%
+  filter(! Common.Name %in% c('STUMP', 'VACANT')) %>%
+  select(Latitude, Longitude)
+trees
+```
+
+Finally I used the `Latitude` and `Longitude` columns to convert the dataset to a simple features object using the `sf` library.
+I then projected to UTM and add separate columns for the resulting X and Y coordinates.
+
+```{r,message=F,warning=F}
+library(sf)
+
+trees_proj = trees %>%
+  st_as_sf(coords = c("Longitude", "Latitude"),
+          crs = "+proj=longlat +datum=WGS84") %>%
+  st_transform(crs="+proj=utm +zone=17 +datum=WGS84") %>%
+  mutate(X = st_coordinates(.)[,1],
+         Y = st_coordinates(.)[,2]) %>%
+  select(X,Y,geometry)
+```
+
+# Fit Poisson Point Process Models
+
+To fit PPPs, I used the `spatstat` R package which contains a variety of tools for spatial analysis.
+
+The first step was to convert the project trees dataset to a `ppp` object suitable for input into `spatstat` functions.
+The `ppp` function requires vectors of X and Y coordinates, as well as minimum and maximum X and Y coordinates for the area of interest.
+
+```{r,message=F,warning=F}
+library(spatstat)
+trees_ppp = ppp(x = trees_proj$X, y = trees_proj$Y, 
+              range(trees_proj$X), range(trees_proj$Y))
+```
+
+Plotting the `trees_ppp` object shows a familiar distribution of points:
+
+```{r}
+plot(trees_ppp,cex=0.3,pch=19)
+```
+
+## Homogeneous PPP
+
+Model fitting is done using the `ppm` function.
+I'll start by fitting a homogeneous PPP, where I assume the rate $\lambda$ is constant.
+This can be done by fitting an intercept-only model, using the usual one-sided R formula interface (where `~1` represents the intercept):
+
+```{r,message=F,warning=F}
+mod_intercept = ppm(trees_ppp, ~1)
+mod_intercept
+```
+
+The fitted model has an estimated intercept $\beta_0$ of `r round(mod_intercept$coef[1],2)` on the log scale.
+To get the rate $\lambda$ on the natural scale, take $\exp\left(\beta_0\right)$:
+
+```{r}
+exp(mod_intercept$coef[1])
+```
+
+The exact value of the rate doesn't really have a practical use, as it represents the expected number of points in a spatial region of finite, but arbitrarily small size.
+
+Note also that the printed model summary reminds us that it is a stationary (i.e., homogeneous) point process model.
+Is the homogeneous PPP model appropriate for this dataset?
+Visually, it appears that trees are non-randomly distributed.
+
+I explicitly tested my first hypothesis (that the distribution is actually inhomogeneous) using a quadrat test.
+First, I divided area of interest is divided into a grid of 5x5 evenly-sized rectangles (*quadrats*), and the number of trees in each quadrat counted:
+
+```{r}
+tree_quads = quadratcount(trees_ppp)
+```
+
+Then, I used a chi-square test to determine if the tree counts were unequal among the quadrats:
+
+```{r}
+quadrat.test(tree_quads)
+```
+
+I obtained a significant result, indicating that tree counts differed among quadrats.
+This is evidence in favor of my first hypothesis that the distribution of trees in Buffalo is inhomogeneous.
+
+## Inhomogeneous PPPs (Covariate Effects)
+
+The next step was to add two covariates to my intercept-only PPP model: location relative to city parks (inside or outside), and local average income.
+In both cases, I obtained covariates from shapefile data.
+
+### *Park Boundary Data*
+
+As in part 1, I obtained the park boundary shapefile from the Socrata API, imported it as a simple feature, and transformed it to UTM:
+
+```{r}
+get_shapefile <- function(url){
+  dest_dir = tempdir()
+  dest <- tempfile(tmpdir=dest_dir)
+  download.file(url, dest, quiet=T)
+  unzip(dest, exdir = dest_dir)
+  shape_name = grep('.shp',list.files(dest_dir),value=T)
+  setwd(dest_dir)
+  sf::st_read(shape_name,quiet=TRUE) 
+}
+
+ind = which(buf_datasets$title == 'Parks')
+shape_url = buf_datasets$distribution[26][[1]]$downloadURL[3]
+park_bounds = get_shapefile(shape_url)
+
+park_bounds = park_bounds %>%
+  st_transform(crs="+proj=utm +zone=17 +datum=WGS84")
+```
+
+The next step was to convert it to the special raster image format required by `ppm`.
+Since I needed to do this twice, I made a dedicated function.
+The function first converts a simple feature to a raster using the `fasterize` library, then converts the raster to the format required by `ppm` using the `maptools` library.
+
+```{r}
+sf_to_im = function(sf_object, background=0, field=NULL){
+  r = fasterize::raster(sf_object, res=5)
+  r = fasterize::fasterize(sf_object, r, background=background, field=field)
+  maptools::as.im.RasterLayer(r)
+}
+```
+
+```{r}
+park_raster = sf_to_im(park_bounds)
+plot(park_raster)
+```
+
+The resulting raster takes on value 1 for cells inside a park, and 0 otherwise.
+
+### *Income Data*
+
+I obtained recent income data at the U.S. Census tract level from the American Community Survey (ACS) dataset, which I also used in a [post on recycling in Buffalo](buffalo-recycle-part2.html).
+As before, I used the Census API via the `tidycensus` package to pull ACS income data for Buffalo in simple feature format.
+
+```{r,message=F,warning=F,results='hide'}
+library(tidycensus)
+income_data = get_acs(geography = 'tract', variables = c(income='B07011_001'),
+                   state='NY',county='Erie',geometry='TRUE') %>%
+  st_transform(crs="+proj=utm +zone=17 +datum=WGS84") %>%
+  st_crop(st_bbox(trees_proj)) %>%
+  st_cast()
+```
+
+I converted the raw income data to a Z-score so that model parameter estimates would be comparable:
+
+```{r}
+income_data = income_data %>%
+  mutate(estimateZ = scale(estimate))
+```
+
+Finally, as with the parks data, I converted this simple feature to a raster image.
+Higher Z-scores correspond to higher mean incomes.
+
+```{r}
+income_raster = sf_to_im(income_data, NA, "estimateZ")
+plot(income_raster)
+```
+
+### *Fit the Full Model*
+
+I now had all the pieces to fit an inhomogeneous PPP model where the rate $\lambda$ was a function of location relative to parks and mean income level.
+Again I used the `ppm` function while adding covariates to the model formula:
+
+```{r,warning=F}
+mod_full = ppm(trees_ppp, ~park + income, 
+               covariates=list(park=park_raster, income=income_raster))
+mod_full
+```
+
+The estimated slope $\beta_1$ for the effect of park was `r round(mod_full$coef['park'],2)`, and for income $\beta_2$ it was `r round(mod_full$coef['income'],2)`.
+Both covariates had a statistically significant effect on rate, indicating that both contributed to the non-random distribution of trees in the city.
+
+```{r,echo=F}
+ef_sizes = round(exp(mod_full$coef),2)
+```
+
+To put things on a more understandable scale, the effect size of each covariate can be calculated as $\exp\left(\beta\right)$. 
+Thus, being inside a park increased the rate $\lambda$ by a factor of `r ef_sizes['park']`, or `r (ef_sizes['park'] - 1) * 100`%.
+At the same time, a 1-standard deviation increase in mean income (about $`r round(sd(income_data$estimate,na.rm=T))`) increased the rate by `r (ef_sizes['income'] - 1) * 100`%.
+
+These relationships can also be shown graphically:
+
+```{r}
+#Generare predicted values and SEs
+inc_vals = seq(min(income_data$estimate,na.rm=T),max(income_data$estimate,na.rm=T),10)
+sc_vals = (inc_vals - mean(income_data$estimate,na.rm=T))/sd(income_data$estimate,na.rm=T)
+
+X = rbind(cbind(1,1,sc_vals),cbind(1,0,sc_vals))
+loglam = X %*% mod_full$coef
+se = sqrt(diag(X %*% vcov(mod_full) %*% t(X)))
+figdata = data.frame(`Park Status` = rep(c('Inside','Outside'),each=length(inc_vals)),
+                     inc=inc_vals, lam=exp(loglam), upper=exp(loglam+1.96*se),
+                     lower=exp(loglam-1.96*se),check.names=F)
+
+#Plot predicted values and 95% confidence envelopes
+ggplot(data=figdata) +
+  geom_ribbon(aes(x=inc,ymin=lower,ymax=upper,group=`Park Status`), alpha=0.1) +
+  geom_line(aes(x=inc,y=lam, color=`Park Status`)) +
+  xlab('Mean income ($)') + ylab(expression(paste('Predicted rate (',lambda,')',sep='')))
+```
+
+### *Assess Model Fit*
+
+The final step is to see if the inhomogeneous point process model with covariates fits the dataset better than the intercept-only model, using analysis of deviance.
+
+```{r, message=F, warning=F}
+anova(mod_intercept, mod_full, test='Chi')
+```
+
+Based on the significant test result, the model with covariates is indeed a better fit.
+
+# Conclusion
+
+I found evidence to suppport all three of my hypotheses.
+First, the distribution of trees in Buffalo is non-random.
+Second, the density of trees is higher inside city parks than outside.
+Finally, the density of trees increases with mean income: richer neighborhoods have more trees, matching the findings of [at least one other U.S. study](http://journals.sagepub.com/doi/abs/10.1068/a41236).
author	Ken Kellner <ken@kenkellner.com>	2018-09-04 13:30:45 -0400
committer	Ken Kellner <ken@kenkellner.com>	2018-09-04 13:30:45 -0400
commit	e863f78a8742093408e1609d747cb3d355f53da5 (patch)
tree	9cb90e2f3334b0780d12ad0e21ac0246c33852b6
parent	83ca108f437a76098afe3fbdd026239269132ba5 (diff)