From e863f78a8742093408e1609d747cb3d355f53da5 Mon Sep 17 00:00:00 2001 From: Ken Kellner Date: Tue, 4 Sep 2018 13:30:45 -0400 Subject: Add PPP analysis of Buffalo tree data --- src/buffalo-trees-part1.Rmd | 8 + src/buffalo-trees-part2.Rmd | 352 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 360 insertions(+) create mode 100644 src/buffalo-trees-part2.Rmd diff --git a/src/buffalo-trees-part1.Rmd b/src/buffalo-trees-part1.Rmd index 3adc57d..cf8699a 100644 --- a/src/buffalo-trees-part1.Rmd +++ b/src/buffalo-trees-part1.Rmd @@ -259,6 +259,14 @@ rich_plot + alpha=0.6) ``` +```{r,echo=F} +#Save density plot for use in another post +dens_overlay = dens_plot + + geom_sf(data=park_bounds %>% + filter(acres > 5), alpha=0.6) +saveRDS(dens_overlay, '../data/dens_overlay.rds') +``` + As predicted, areas of high tree species richness tend to be located inside parks. # Future Directions diff --git a/src/buffalo-trees-part2.Rmd b/src/buffalo-trees-part2.Rmd new file mode 100644 index 0000000..63289b5 --- /dev/null +++ b/src/buffalo-trees-part2.Rmd @@ -0,0 +1,352 @@ +--- +title: Distribution of Urban Trees in Buffalo (Part 2) +date: 2018-09-04 +output: + html_document: + css: style.css + highlight: pygments +--- + +```{r,echo=FALSE} +options(tidyverse.quiet=TRUE) +library(tidyverse) +suppressMessages(library(ggplot2)) +``` + +# Introduction + +This is part two of an examination of the city of Buffalo's [tree inventory](https://data.buffalony.gov/Quality-of-Life/Tree-Inventory/n4ni-uuec). +Part one is [here](buffalo-trees-part1.html). +There is an enormous amount of information in the dataset and I'm only going to scratch the surface of what could be done with it. + +In the previous post, I qualitatively examined the spatial distribution of tree density, richness, and economic value across the city. +I'll now take a more quantitative approach using Poisson point process models. + +# Poisson Point Processes + +A Poisson point process (PPP) is a statistical model that's appropriate for a series of events that occur at a random, but reasonably predictable, rate. +For example, if you counted the number of cars that drove by your house from 12:00 PM - 1:00 PM every day for a year, you'd likely find that the count each day varied, but that the counts fell into a predictable range. +The average number of cars per day is the PPP *rate* (or *intensity*), often represented by $\lambda$. +The larger the value of $\lambda$, the more cars you'd expect in that hour-long period; if you live in Manhattan the value of $\lambda$ will be much higher than if you live in rural Kansas. + +The car example above is a PPP in one dimension (time). +You can also have PPPs in two (or more) dimensions. +Points in geographic space are often modeled with a PPP. +Instead of counting the number of events in a given time interval you count the number of points in a given spatial area. +In this case, the higher the value of $\lambda$, the more points that will be in the designated area. + +For example, the following figure shows realizations of two PPPs in two-dimensional space, one with rate $\lambda=100$ and one with rate $\lambda=500$, five times higher. +The red rectangle represents the spatial area of interest. + +```{r,echo=F,fig.height=3} +suppressMessages(library(spatstat)) +par(mfrow=c(1,2),mar=c(1,2,1,1)) +plot(rpoispp(100),main=expression(paste(lambda,'= 100')),pch=19) +rect(0.1,0.1,0.5,0.5, border='red',lwd=1.5) +plot(rpoispp(500),main=expression(paste(lambda,'= 500')),pch=19) +rect(0.1,0.1,0.5,0.5, border='red',lwd=1.5) +``` + +There are roughly five times as many points in the outlined area on the right figure relative to the left. +Note that the shape and size of the "area of interest" doesn't matter, as long as it's the same in both cases - no matter how you drew it, you'd expect five times as many points on the right. + +## Inhomogeneous Poisson Point Processes + +The examples above are all *homogeneous* point processes - that is, for a given model, the rate $\lambda$ is constant. +However in some cases we might expect $\lambda$ to vary. +For the cars example, you might expect that the day of the week could impact the number of cars going by - there might be fewer cars on weekend days. +This is an *inhomogeneous* (and arguably much more interesting) point process. + +The figure below shows a homogeneous PPP on the left, and an inhomogeneous PPP (where the rate $\lambda$ increases moving from bottom to top) on the right. + +```{r,echo=F,fig.height=3} +par(mfrow=c(1,2),mar=c(1,2,1,4)) +plot(rpoispp(100),main='Homogeneous PPP', pch=19) +p_func <- function(x,y){y*400} +plot(rpoispp(p_func),main='Inhomogeneous PPP', pch=19) +``` + +# Back to the Buffalo Trees Dataset + +```{r,echo=F} +#Cache the dataset if necessary +if (file.exists('../data/trees_raw.Rdata')){ + load('../data/trees_raw.Rdata') +} else { + dir.create('../data') + trees_raw = read.socrata(csv_url) %>% as_tibble() + save(trees_raw,file='../data/trees_raw.Rdata') +} +``` + +In part 1 I qualitatively observed what appeared to be a very non-random (i.e., inhomogeneous) distribution of trees across the city of Buffalo. + +```{r, echo=F} +trees_raw %>% + filter(! Common.Name %in% c('STUMP', 'VACANT')) %>% + ggplot(aes(x = Longitude, y = Latitude, color = Common.Name)) + + geom_point(size=0.4) + + coord_map() + + theme(legend.position='none') +``` + +Using Poisson point process models, we can test that observation quantitatively - and also determine what factors might be driving the distribution. +For example, in part 1 I also (unsurprisingly) observed that tree density appeared to be higher inside city parks: + +```{r, echo=F} +#Re-make density figure if it doesn't exist +if (!file.exists('../data/dens_overlay.rds')){ + dir.create('../data') + tempR <- tempfile(fileext = ".R") + library(knitr) + invisible(purl("buffalo-trees-part1.Rmd", output=tempR)) + invisible(source(tempR)) + unlink(tempR) +} +dens_overlay <- readRDS('../data/dens_overlay.rds') +dens_overlay +``` + +In this analysis, I tested three hypotheses using Poisson point process models: + +1. The distribution of trees in Buffalo is non-random (inhomogeneous) +2. The density of trees is higher in city parks +3. The density of trees in an area is positively related to mean income (i.e., richer areas of the city have more trees) + +## Read in and Clean the Tree Dataset + +My goal was to get UTM X-Y coordinates (in meters) for each tree in the dataset, to feed into the PPP model. +As in part 1, I started by reading in the raw data from the Socrata API using `RSocrata`. +I cleaned it up using `tidyverse` tools. + +```{r} +library(RSocrata) +library(tidyverse) +buf_datasets = ls.socrata('https://data.buffalony.gov') +ind = which(buf_datasets$title == 'Tree Inventory') +csv_url = buf_datasets$distribution[[ind]]$downloadURL[1] +``` + +```{r, eval=F} +trees_raw = read.socrata(csv_url) %>% as_tibble() +``` + +As I'm focused on living trees, I removed information about stumps and vacant areas. +Since I'm not interested in individual tree species for this analysis, I also dropped all columns except for the spatial information. + +```{r} +trees = trees_raw %>% + filter(! Common.Name %in% c('STUMP', 'VACANT')) %>% + select(Latitude, Longitude) +trees +``` + +Finally I used the `Latitude` and `Longitude` columns to convert the dataset to a simple features object using the `sf` library. +I then projected to UTM and add separate columns for the resulting X and Y coordinates. + +```{r,message=F,warning=F} +library(sf) + +trees_proj = trees %>% + st_as_sf(coords = c("Longitude", "Latitude"), + crs = "+proj=longlat +datum=WGS84") %>% + st_transform(crs="+proj=utm +zone=17 +datum=WGS84") %>% + mutate(X = st_coordinates(.)[,1], + Y = st_coordinates(.)[,2]) %>% + select(X,Y,geometry) +``` + +# Fit Poisson Point Process Models + +To fit PPPs, I used the `spatstat` R package which contains a variety of tools for spatial analysis. + +The first step was to convert the project trees dataset to a `ppp` object suitable for input into `spatstat` functions. +The `ppp` function requires vectors of X and Y coordinates, as well as minimum and maximum X and Y coordinates for the area of interest. + +```{r,message=F,warning=F} +library(spatstat) +trees_ppp = ppp(x = trees_proj$X, y = trees_proj$Y, + range(trees_proj$X), range(trees_proj$Y)) +``` + +Plotting the `trees_ppp` object shows a familiar distribution of points: + +```{r} +plot(trees_ppp,cex=0.3,pch=19) +``` + +## Homogeneous PPP + +Model fitting is done using the `ppm` function. +I'll start by fitting a homogeneous PPP, where I assume the rate $\lambda$ is constant. +This can be done by fitting an intercept-only model, using the usual one-sided R formula interface (where `~1` represents the intercept): + +```{r,message=F,warning=F} +mod_intercept = ppm(trees_ppp, ~1) +mod_intercept +``` + +The fitted model has an estimated intercept $\beta_0$ of `r round(mod_intercept$coef[1],2)` on the log scale. +To get the rate $\lambda$ on the natural scale, take $\exp\left(\beta_0\right)$: + +```{r} +exp(mod_intercept$coef[1]) +``` + +The exact value of the rate doesn't really have a practical use, as it represents the expected number of points in a spatial region of finite, but arbitrarily small size. + +Note also that the printed model summary reminds us that it is a stationary (i.e., homogeneous) point process model. +Is the homogeneous PPP model appropriate for this dataset? +Visually, it appears that trees are non-randomly distributed. + +I explicitly tested my first hypothesis (that the distribution is actually inhomogeneous) using a quadrat test. +First, I divided area of interest is divided into a grid of 5x5 evenly-sized rectangles (*quadrats*), and the number of trees in each quadrat counted: + +```{r} +tree_quads = quadratcount(trees_ppp) +``` + +Then, I used a chi-square test to determine if the tree counts were unequal among the quadrats: + +```{r} +quadrat.test(tree_quads) +``` + +I obtained a significant result, indicating that tree counts differed among quadrats. +This is evidence in favor of my first hypothesis that the distribution of trees in Buffalo is inhomogeneous. + +## Inhomogeneous PPPs (Covariate Effects) + +The next step was to add two covariates to my intercept-only PPP model: location relative to city parks (inside or outside), and local average income. +In both cases, I obtained covariates from shapefile data. + +### *Park Boundary Data* + +As in part 1, I obtained the park boundary shapefile from the Socrata API, imported it as a simple feature, and transformed it to UTM: + +```{r} +get_shapefile <- function(url){ + dest_dir = tempdir() + dest <- tempfile(tmpdir=dest_dir) + download.file(url, dest, quiet=T) + unzip(dest, exdir = dest_dir) + shape_name = grep('.shp',list.files(dest_dir),value=T) + setwd(dest_dir) + sf::st_read(shape_name,quiet=TRUE) +} + +ind = which(buf_datasets$title == 'Parks') +shape_url = buf_datasets$distribution[26][[1]]$downloadURL[3] +park_bounds = get_shapefile(shape_url) + +park_bounds = park_bounds %>% + st_transform(crs="+proj=utm +zone=17 +datum=WGS84") +``` + +The next step was to convert it to the special raster image format required by `ppm`. +Since I needed to do this twice, I made a dedicated function. +The function first converts a simple feature to a raster using the `fasterize` library, then converts the raster to the format required by `ppm` using the `maptools` library. + +```{r} +sf_to_im = function(sf_object, background=0, field=NULL){ + r = fasterize::raster(sf_object, res=5) + r = fasterize::fasterize(sf_object, r, background=background, field=field) + maptools::as.im.RasterLayer(r) +} +``` + +```{r} +park_raster = sf_to_im(park_bounds) +plot(park_raster) +``` + +The resulting raster takes on value 1 for cells inside a park, and 0 otherwise. + +### *Income Data* + +I obtained recent income data at the U.S. Census tract level from the American Community Survey (ACS) dataset, which I also used in a [post on recycling in Buffalo](buffalo-recycle-part2.html). +As before, I used the Census API via the `tidycensus` package to pull ACS income data for Buffalo in simple feature format. + +```{r,message=F,warning=F,results='hide'} +library(tidycensus) +income_data = get_acs(geography = 'tract', variables = c(income='B07011_001'), + state='NY',county='Erie',geometry='TRUE') %>% + st_transform(crs="+proj=utm +zone=17 +datum=WGS84") %>% + st_crop(st_bbox(trees_proj)) %>% + st_cast() +``` + +I converted the raw income data to a Z-score so that model parameter estimates would be comparable: + +```{r} +income_data = income_data %>% + mutate(estimateZ = scale(estimate)) +``` + +Finally, as with the parks data, I converted this simple feature to a raster image. +Higher Z-scores correspond to higher mean incomes. + +```{r} +income_raster = sf_to_im(income_data, NA, "estimateZ") +plot(income_raster) +``` + +### *Fit the Full Model* + +I now had all the pieces to fit an inhomogeneous PPP model where the rate $\lambda$ was a function of location relative to parks and mean income level. +Again I used the `ppm` function while adding covariates to the model formula: + +```{r,warning=F} +mod_full = ppm(trees_ppp, ~park + income, + covariates=list(park=park_raster, income=income_raster)) +mod_full +``` + +The estimated slope $\beta_1$ for the effect of park was `r round(mod_full$coef['park'],2)`, and for income $\beta_2$ it was `r round(mod_full$coef['income'],2)`. +Both covariates had a statistically significant effect on rate, indicating that both contributed to the non-random distribution of trees in the city. + +```{r,echo=F} +ef_sizes = round(exp(mod_full$coef),2) +``` + +To put things on a more understandable scale, the effect size of each covariate can be calculated as $\exp\left(\beta\right)$. +Thus, being inside a park increased the rate $\lambda$ by a factor of `r ef_sizes['park']`, or `r (ef_sizes['park'] - 1) * 100`%. +At the same time, a 1-standard deviation increase in mean income (about $`r round(sd(income_data$estimate,na.rm=T))`) increased the rate by `r (ef_sizes['income'] - 1) * 100`%. + +These relationships can also be shown graphically: + +```{r} +#Generare predicted values and SEs +inc_vals = seq(min(income_data$estimate,na.rm=T),max(income_data$estimate,na.rm=T),10) +sc_vals = (inc_vals - mean(income_data$estimate,na.rm=T))/sd(income_data$estimate,na.rm=T) + +X = rbind(cbind(1,1,sc_vals),cbind(1,0,sc_vals)) +loglam = X %*% mod_full$coef +se = sqrt(diag(X %*% vcov(mod_full) %*% t(X))) +figdata = data.frame(`Park Status` = rep(c('Inside','Outside'),each=length(inc_vals)), + inc=inc_vals, lam=exp(loglam), upper=exp(loglam+1.96*se), + lower=exp(loglam-1.96*se),check.names=F) + +#Plot predicted values and 95% confidence envelopes +ggplot(data=figdata) + + geom_ribbon(aes(x=inc,ymin=lower,ymax=upper,group=`Park Status`), alpha=0.1) + + geom_line(aes(x=inc,y=lam, color=`Park Status`)) + + xlab('Mean income ($)') + ylab(expression(paste('Predicted rate (',lambda,')',sep=''))) +``` + +### *Assess Model Fit* + +The final step is to see if the inhomogeneous point process model with covariates fits the dataset better than the intercept-only model, using analysis of deviance. + +```{r, message=F, warning=F} +anova(mod_intercept, mod_full, test='Chi') +``` + +Based on the significant test result, the model with covariates is indeed a better fit. + +# Conclusion + +I found evidence to suppport all three of my hypotheses. +First, the distribution of trees in Buffalo is non-random. +Second, the density of trees is higher inside city parks than outside. +Finally, the density of trees increases with mean income: richer neighborhoods have more trees, matching the findings of [at least one other U.S. study](http://journals.sagepub.com/doi/abs/10.1068/a41236). -- cgit v1.2.3