src/buffalo-trees-part2.Rmd


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352

---
title: Distribution of Urban Trees in Buffalo (Part 2)
date: 2018-09-04
output:
  html_document:
    css: style.css
    highlight: pygments
---

```{r,echo=FALSE}
options(tidyverse.quiet=TRUE)
library(tidyverse)
suppressMessages(library(ggplot2))
```

# Introduction

This is part two of an examination of the city of Buffalo's [tree inventory](https://data.buffalony.gov/Quality-of-Life/Tree-Inventory/n4ni-uuec).
Part one is [here](buffalo-trees-part1.html).
There is an enormous amount of information in the dataset and I'm only going to scratch the surface of what could be done with it.

In the previous post, I qualitatively examined the spatial distribution of tree density, richness, and economic value across the city.
I'll now take a more quantitative approach using Poisson point process models.

# Poisson Point Processes

A Poisson point process (PPP) is a statistical model that's appropriate for a series of events that occur at a random, but reasonably predictable, rate.
For example, if you counted the number of cars that drove by your house from 12:00 PM - 1:00 PM every day for a year, you'd likely find that the count each day varied, but that the counts fell into a predictable range.
The average number of cars per day is the PPP *rate* (or *intensity*), often represented by $\lambda$.
The larger the value of $\lambda$, the more cars you'd expect in that hour-long period; if you live in Manhattan the value of $\lambda$ will be much higher than if you live in rural Kansas.

The car example above is a PPP in one dimension (time).
You can also have PPPs in two (or more) dimensions.
Points in geographic space are often modeled with a PPP.
Instead of counting the number of events in a given time interval you count the number of points in a given spatial area.
In this case, the higher the value of $\lambda$, the more points that will be in the designated area.

For example, the following figure shows realizations of two PPPs in two-dimensional space, one with rate $\lambda=100$ and one with rate $\lambda=500$, five times higher.
The red rectangle represents the spatial area of interest.

```{r,echo=F,fig.height=3}
suppressMessages(library(spatstat))
par(mfrow=c(1,2),mar=c(1,2,1,1))
plot(rpoispp(100),main=expression(paste(lambda,'= 100')),pch=19)
rect(0.1,0.1,0.5,0.5, border='red',lwd=1.5)
plot(rpoispp(500),main=expression(paste(lambda,'= 500')),pch=19)
rect(0.1,0.1,0.5,0.5, border='red',lwd=1.5)
```

There are roughly five times as many points in the outlined area on the right figure relative to the left.
Note that the shape and size of the "area of interest" doesn't matter, as long as it's the same in both cases - no matter how you drew it, you'd expect five times as many points on the right.

## Inhomogeneous Poisson Point Processes

The examples above are all *homogeneous* point processes - that is, for a given model, the rate $\lambda$ is constant.
However in some cases we might expect $\lambda$ to vary.
For the cars example, you might expect that the day of the week could impact the number of cars going by - there might be fewer cars on weekend days.
This is an *inhomogeneous* (and arguably much more interesting) point process.

The figure below shows a homogeneous PPP on the left, and an inhomogeneous PPP (where the rate $\lambda$ increases moving from bottom to top) on the right.

```{r,echo=F,fig.height=3}
par(mfrow=c(1,2),mar=c(1,2,1,4))
plot(rpoispp(100),main='Homogeneous PPP', pch=19)
p_func <- function(x,y){y*400}
plot(rpoispp(p_func),main='Inhomogeneous PPP', pch=19)
```

# Back to the Buffalo Trees Dataset

```{r,echo=F}
#Cache the dataset if necessary
if (file.exists('../data/trees_raw.Rdata')){
  load('../data/trees_raw.Rdata')
} else {
  dir.create('../data')
  trees_raw = read.socrata(csv_url) %>% as_tibble()
  save(trees_raw,file='../data/trees_raw.Rdata')
}
```

In part 1 I qualitatively observed what appeared to be a very non-random (i.e., inhomogeneous) distribution of trees across the city of Buffalo.

```{r, echo=F}
trees_raw %>%
  filter(! Common.Name %in% c('STUMP', 'VACANT')) %>%
  ggplot(aes(x = Longitude, y = Latitude, color = Common.Name)) +
  geom_point(size=0.4) +
  coord_map() +
  theme(legend.position='none')
```

Using Poisson point process models, we can test that observation quantitatively - and also determine what factors might be driving the distribution.
For example, in part 1 I also (unsurprisingly) observed that tree density appeared to be higher inside city parks:

```{r, echo=F}
#Re-make density figure if it doesn't exist
if (!file.exists('../data/dens_overlay.rds')){
  dir.create('../data')
  tempR <- tempfile(fileext = ".R")
  library(knitr)
  invisible(purl("buffalo-trees-part1.Rmd", output=tempR))
  invisible(source(tempR))
  unlink(tempR)
}
dens_overlay <- readRDS('../data/dens_overlay.rds')
dens_overlay
```

In this analysis, I tested three hypotheses using Poisson point process models:

1. The distribution of trees in Buffalo is non-random (inhomogeneous)
2. The density of trees is higher in city parks
3. The density of trees in an area is positively related to mean income (i.e., richer areas of the city have more trees)

## Read in and Clean the Tree Dataset

My goal was to get UTM X-Y coordinates (in meters) for each tree in the dataset, to feed into the PPP model.
As in part 1, I started by reading in the raw data from the Socrata API using `RSocrata`.
I cleaned it up using `tidyverse` tools.

```{r}
library(RSocrata)
library(tidyverse)
buf_datasets = ls.socrata('https://data.buffalony.gov')
ind = which(buf_datasets$title == 'Tree Inventory')
csv_url = buf_datasets$distribution[[ind]]$downloadURL[1]
```

```{r, eval=F}
trees_raw = read.socrata(csv_url) %>% as_tibble()
```

As I'm focused on living trees, I removed information about stumps and vacant areas.
Since I'm not interested in individual tree species for this analysis, I also dropped all columns except for the spatial information.

```{r}
trees = trees_raw %>%
  filter(! Common.Name %in% c('STUMP', 'VACANT')) %>%
  select(Latitude, Longitude)
trees
```

Finally I used the `Latitude` and `Longitude` columns to convert the dataset to a simple features object using the `sf` library.
I then projected to UTM and add separate columns for the resulting X and Y coordinates.

```{r,message=F,warning=F}
library(sf)

trees_proj = trees %>%
  st_as_sf(coords = c("Longitude", "Latitude"),
          crs = "+proj=longlat +datum=WGS84") %>%
  st_transform(crs="+proj=utm +zone=17 +datum=WGS84") %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  select(X,Y,geometry)
```

# Fit Poisson Point Process Models

To fit PPPs, I used the `spatstat` R package which contains a variety of tools for spatial analysis.

The first step was to convert the project trees dataset to a `ppp` object suitable for input into `spatstat` functions.
The `ppp` function requires vectors of X and Y coordinates, as well as minimum and maximum X and Y coordinates for the area of interest.

```{r,message=F,warning=F}
library(spatstat)
trees_ppp = ppp(x = trees_proj$X, y = trees_proj$Y, 
              range(trees_proj$X), range(trees_proj$Y))
```

Plotting the `trees_ppp` object shows a familiar distribution of points:

```{r}
plot(trees_ppp,cex=0.3,pch=19)
```

## Homogeneous PPP

Model fitting is done using the `ppm` function.
I'll start by fitting a homogeneous PPP, where I assume the rate $\lambda$ is constant.
This can be done by fitting an intercept-only model, using the usual one-sided R formula interface (where `~1` represents the intercept):

```{r,message=F,warning=F}
mod_intercept = ppm(trees_ppp, ~1)
mod_intercept
```

The fitted model has an estimated intercept $\beta_0$ of `r round(mod_intercept$coef[1],2)` on the log scale.
To get the rate $\lambda$ on the natural scale, take $\exp\left(\beta_0\right)$:

```{r}
exp(mod_intercept$coef[1])
```

The exact value of the rate doesn't really have a practical use, as it represents the expected number of points in a spatial region of finite, but arbitrarily small size.

Note also that the printed model summary reminds us that it is a stationary (i.e., homogeneous) point process model.
Is the homogeneous PPP model appropriate for this dataset?
Visually, it appears that trees are non-randomly distributed.

I explicitly tested my first hypothesis (that the distribution is actually inhomogeneous) using a quadrat test.
First, I divided area of interest is divided into a grid of 5x5 evenly-sized rectangles (*quadrats*), and the number of trees in each quadrat counted:

```{r}
tree_quads = quadratcount(trees_ppp)
```

Then, I used a chi-square test to determine if the tree counts were unequal among the quadrats:

```{r}
quadrat.test(tree_quads)
```

I obtained a significant result, indicating that tree counts differed among quadrats.
This is evidence in favor of my first hypothesis that the distribution of trees in Buffalo is inhomogeneous.

## Inhomogeneous PPPs (Covariate Effects)

The next step was to add two covariates to my intercept-only PPP model: location relative to city parks (inside or outside), and local average income.
In both cases, I obtained covariates from shapefile data.

### *Park Boundary Data*

As in part 1, I obtained the park boundary shapefile from the Socrata API, imported it as a simple feature, and transformed it to UTM:

```{r}
get_shapefile <- function(url){
  dest_dir = tempdir()
  dest <- tempfile(tmpdir=dest_dir)
  download.file(url, dest, quiet=T)
  unzip(dest, exdir = dest_dir)
  shape_name = grep('.shp',list.files(dest_dir),value=T)
  setwd(dest_dir)
  sf::st_read(shape_name,quiet=TRUE) 
}

ind = which(buf_datasets$title == 'Parks')
shape_url = buf_datasets$distribution[26][[1]]$downloadURL[3]
park_bounds = get_shapefile(shape_url)

park_bounds = park_bounds %>%
  st_transform(crs="+proj=utm +zone=17 +datum=WGS84")
```

The next step was to convert it to the special raster image format required by `ppm`.
Since I needed to do this twice, I made a dedicated function.
The function first converts a simple feature to a raster using the `fasterize` library, then converts the raster to the format required by `ppm` using the `maptools` library.

```{r}
sf_to_im = function(sf_object, background=0, field=NULL){
  r = fasterize::raster(sf_object, res=5)
  r = fasterize::fasterize(sf_object, r, background=background, field=field)
  maptools::as.im.RasterLayer(r)
}
```

```{r}
park_raster = sf_to_im(park_bounds)
plot(park_raster)
```

The resulting raster takes on value 1 for cells inside a park, and 0 otherwise.

### *Income Data*

I obtained recent income data at the U.S. Census tract level from the American Community Survey (ACS) dataset, which I also used in a [post on recycling in Buffalo](buffalo-recycle-part2.html).
As before, I used the Census API via the `tidycensus` package to pull ACS income data for Buffalo in simple feature format.

```{r,message=F,warning=F,results='hide'}
library(tidycensus)
income_data = get_acs(geography = 'tract', variables = c(income='B07011_001'),
                   state='NY',county='Erie',geometry='TRUE') %>%
  st_transform(crs="+proj=utm +zone=17 +datum=WGS84") %>%
  st_crop(st_bbox(trees_proj)) %>%
  st_cast()
```

I converted the raw income data to a Z-score so that model parameter estimates would be comparable:

```{r}
income_data = income_data %>%
  mutate(estimateZ = scale(estimate))
```

Finally, as with the parks data, I converted this simple feature to a raster image.
Higher Z-scores correspond to higher mean incomes.

```{r}
income_raster = sf_to_im(income_data, NA, "estimateZ")
plot(income_raster)
```

### *Fit the Full Model*

I now had all the pieces to fit an inhomogeneous PPP model where the rate $\lambda$ was a function of location relative to parks and mean income level.
Again I used the `ppm` function while adding covariates to the model formula:

```{r,warning=F}
mod_full = ppm(trees_ppp, ~park + income, 
               covariates=list(park=park_raster, income=income_raster))
mod_full
```

The estimated slope $\beta_1$ for the effect of park was `r round(mod_full$coef['park'],2)`, and for income $\beta_2$ it was `r round(mod_full$coef['income'],2)`.
Both covariates had a statistically significant effect on rate, indicating that both contributed to the non-random distribution of trees in the city.

```{r,echo=F}
ef_sizes = round(exp(mod_full$coef),2)
```

To put things on a more understandable scale, the effect size of each covariate can be calculated as $\exp\left(\beta\right)$. 
Thus, being inside a park increased the rate $\lambda$ by a factor of `r ef_sizes['park']`, or `r (ef_sizes['park'] - 1) * 100`%.
At the same time, a 1-standard deviation increase in mean income (about $`r round(sd(income_data$estimate,na.rm=T))`) increased the rate by `r (ef_sizes['income'] - 1) * 100`%.

These relationships can also be shown graphically:

```{r}
#Generare predicted values and SEs
inc_vals = seq(min(income_data$estimate,na.rm=T),max(income_data$estimate,na.rm=T),10)
sc_vals = (inc_vals - mean(income_data$estimate,na.rm=T))/sd(income_data$estimate,na.rm=T)

X = rbind(cbind(1,1,sc_vals),cbind(1,0,sc_vals))
loglam = X %*% mod_full$coef
se = sqrt(diag(X %*% vcov(mod_full) %*% t(X)))
figdata = data.frame(`Park Status` = rep(c('Inside','Outside'),each=length(inc_vals)),
                     inc=inc_vals, lam=exp(loglam), upper=exp(loglam+1.96*se),
                     lower=exp(loglam-1.96*se),check.names=F)

#Plot predicted values and 95% confidence envelopes
ggplot(data=figdata) +
  geom_ribbon(aes(x=inc,ymin=lower,ymax=upper,group=`Park Status`), alpha=0.1) +
  geom_line(aes(x=inc,y=lam, color=`Park Status`)) +
  xlab('Mean income ($)') + ylab(expression(paste('Predicted rate (',lambda,')',sep='')))
```

### *Assess Model Fit*

The final step is to see if the inhomogeneous point process model with covariates fits the dataset better than the intercept-only model, using analysis of deviance.

```{r, message=F, warning=F}
anova(mod_intercept, mod_full, test='Chi')
```

Based on the significant test result, the model with covariates is indeed a better fit.

# Conclusion

I found evidence to suppport all three of my hypotheses.
First, the distribution of trees in Buffalo is non-random.
Second, the density of trees is higher inside city parks than outside.
Finally, the density of trees increases with mean income: richer neighborhoods have more trees, matching the findings of [at least one other U.S. study](http://journals.sagepub.com/doi/abs/10.1068/a41236).