vignettes/distsamp.Rmd


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363

---
title: Distance sampling analysis in unmarked
author: Richard Chandler, USGS Patuxent Wildlife Research Center
date: March 4, 2020
bibliography: unmarked.bib
csl: ecology.csl
output: 
  rmarkdown::html_vignette:
    fig_width: 5
    fig_height: 3.5
    number_sections: true
    toc: true
vignette: >
  %\VignetteIndexEntry{Distance sampling analysis}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}

---

```{r,echo=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Abstract

Distance sampling is a wildlife sampling technique used to
estimate population size or density. Describing how density varies
spatially is often of equal interest; however, conventional methods
of analysis do not allow for explicit modeling of both density and
detection probability. The function `distsamp` implements the
multinomial-Poisson mixture model of @royle_modeling_2004, which was developed to overcome this
limitation. This model requires that line- or point-transects are
spatially replicated and that distance data are recorded in discrete
intervals. The function \code{gdistsamp} extends this basic model,
by introducing the parameter $\phi$, the probability of
being available for detection [@chandlerEA_2011]. Furthermore,
this function allows
abundance to be modeled using the negative binomial distribution,
which may be useful for dealing with over-dispersion. This document
describes how to format data, fit models,
and manipulate results in package \package{unmarked}. It does not
cover the statistical theory and assumptions underlying distance
sampling [@buckland_distsamp_2001], which the user is expected
to be familiar with.

# Introduction

Spatial variation in density is common to virtually all wildlife
populations, and describing this variation is a central objective of
much research. To accurately estimate and model density, it is often
necessary to account for individuals present but not
detected. Distance from observer is a ubiquitous source of variation
in detection probability, and thus distance sampling has become a
commonly used survey methodology. Numerous options exist for analyzing
distance sampling data, but here the focus is on the model of @royle_modeling_2004, which assumes that multiple transects
have been surveyed and distance data are recorded in discrete
intervals. The details of the model formulation are as follows:

The latent transect-level abundance distribution is currently assumed
to be
$$
  \label{eq:1}
  N_{i} \sim \mathrm{Poisson}(\lambda)
  \quad  i=1,\dots,M\
$$
The detection process is modeled as
$$
  \label{eq:2}
  y_{ij} \sim \mathrm{Multinomial}(N_{i}, \pi_{ij})
  \quad  i=1,\dots,M\;j=1,\dots,J
$$
where $\pi_{ij}$ is the multinomial cell probability for transect $i$
in distance class $j$. These are computed by integrating a detection
function such as the half-normal (with scale parameter $\sigma$) over
each distance interval.

Parameters $\lambda$ and $\sigma$ can be vectors affected by
transect-specific covariates using the log link.

# Importing, formatting, and summarizing data

The first step is to import the data into R. The simplest option is to
use the `read.csv` function to import a .csv file that has been
formatted so that each row represents a transect, and columns describe
either the number of individuals detected in each distance interval or
transect-specific covariates. Alternatively, if data were not recorded
in discrete distance intervals, a .csv file could be imported that
contains a row for each individual detected and columns for the
distances and transect names. This could then be converted to
transect-level data using the function `formatDistData`. For
example,

```{r}
library(unmarked)
dists <- read.csv(system.file("csv", "distdata.csv", package="unmarked"),
                  stringsAsFactors=TRUE)
head(dists, 10)
levels(dists$transect) <- c(levels(dists$transect), "g")
levels(dists$transect)
yDat <- formatDistData(dists, distCol="distance",
	transectNameCol="transect", dist.breaks=c(0, 5, 10, 15, 20))
yDat
```

Here we have created an object called `yDat` that contains counts for
each transect (row) in each distance interval (columns). Note the
method used to include transect `"g"`, which was surveyed but where no
individuals were detected. It is important that all surveyed transects
are included in the analysis.

Suppose there also exists transect-specific covariate data.

```{r}
(covs <- data.frame(canopyHt = c(5, 8, 3, 2, 4, 7, 5),
	habitat = c('A','A','A','A','B','B','B'), row.names=letters[1:7]))
```

The function `unmarkedFrameDS` can now be used to organize these
data along with their metadata (study design (line- or
point-transect), distance class break points, transect lengths, and
units of measurement) into an object to be used as the `data`
argument in `distsamp`. By organizing the data this way, the user
does not need to repetitively specify these arguments during each call
to `distsamp`, thereby reducing the potential for errors and
facilitating data summary and manipulation.

```{r}
umf <- unmarkedFrameDS(y=as.matrix(yDat), siteCovs=covs, survey="line",
	dist.breaks=c(0, 5, 10, 15, 20), tlength=rep(100, 7),
	unitsIn="m")
```

Note that there is no `obsCovs` argument, meaning that
distance-interval-level covariates cannot be included in the
analysis. The call to `unmarkedFrameDS` indicates that the data
were collected on seven line transects, each 100 meters long, and
detections were tabulated into distance intervals defined by the
`dist.breaks` cutpoints. It is important that both transect
lengths and distance break points are provided in the same units
specified by `unitsIn`.

We can look at these data using a variety of methods.

```{r, fig.height=4, fig.width=4, fig.cap="Figure 1. Histogram of detection distances"}
summary(umf)
hist(umf, xlab="distance (m)", main="", cex.lab=0.8, cex.axis=0.8)
```

# Model fitting

Now that we have put our data into an object of class
`unmarkedFrameDS`, we are ready to fit some models with
`distsamp`. The first argument is a `formula` which
specifies the detection covariates followed by the density (or
abundance) covariates. The only other required argument is the
`data`, but several other optional arguments exist. By default,
the half-normal detection function is used to model density in animals
/ ha. The detection function can be selected using the `keyfun`
argument. The response can be changed from `"density"`, to `"abund"`
with the `output` argument. When modeling density, the output
units can be changed from `"ha"` to `"kmsq"` using the `unitsOut`
argument. `distsamp` also includes the arguments `starts`,
`method`, and `control`, which are common to all unmarked
fitting functions.

Below is a series of models that demonstrates `distsamp`'s
arguments and defaults.

```{r}
hn_Null <- distsamp(~1~1, umf)
hn_Null <- distsamp(~1~1, umf, keyfun="halfnorm", output="density",
	unitsOut="ha")
haz_Null <- distsamp(~1~1, umf, keyfun="hazard")
hn_Hab.Ht <- distsamp(~canopyHt ~habitat, umf)
```

The first two models are the same, a null half-normal detection
function with density returned in animals / ha (on the log-scale). The
third model uses the hazard-rate detection function, and the fourth
model includes covariates affecting the Poisson mean ($\lambda$) and
the half-normal scale parameter ($\sigma$).

Once a model has been fit, typing its name will display parameter
estimate information and AIC. A summary method shows extra details
including the scale on which parameters were estimated and convergence
results.

```{r}
haz_Null
```

# Manipulating results

Back-transforming estimates to the original scale and obtaining
standard errors via the delta method is easily accomplished:

```{r}
names(haz_Null)
backTransform(haz_Null, type="state")
backTransform(haz_Null, type="det")
backTransform(haz_Null, type="scale")
backTransform(linearComb(hn_Hab.Ht['det'], c(1, 5)))
```

The first back-transformation returns population density, since this
is the default state parameter modeled when `distsamp`'s
`output` argument is set to `"density"`. The second
back-transformation returns the hazard-rate shape parameter, and third
is the hazard-rate scale parameter. When covariates are present,
`backTransform` in conjunction with `linearComb` should be
used. Here, we requested the value of sigma when canopy height was 5
meters tall. Note that the intercept was included in the calculation
by setting the first value in the linear equation to 1.

Parameters that do not occur in the likelihood may also be of
interest. For example, the number of individuals occurring in the
sampled plots (local population size) is a fundamental parameter in
monitoring and conservation efforts. The following commands can be
used to derive this parameter from our model of density:

```{r}
site.level.density <- predict(hn_Hab.Ht, type="state")$Predicted
plotArea.inHectares <- 100 * 40 / 10000
site.level.abundance <- site.level.density * plotArea.inHectares
(N.hat <- sum(site.level.abundance))
```

To describe the uncertainty of `N.hat`, or any other derived parameter,
we can use a parametric bootstrap approach. First we define a function
to estimate `N.hat`, and then we apply this function to numerous models
fit to data simulated from our original model.

```{r}
getN.hat <- function(fit) {
    d <- predict(fit, type="state")$Predicted
    a <- d * (100 * 40 / 10000)
    N.hat <- c(N.hat = sum(a))
    return(N.hat)
    }
pb <- parboot(hn_Hab.Ht, statistic=getN.hat, nsim=25)
pb
```

Here, `t_B` is an approximation of the sampling distribution for
`N.hat`, conditioned on our fitted model. Confidence intervals
can be calculated from the quantiles of `t_B`. Note that in
practice `nsim` should be set to a much larger value and a
goodness-of-fit test should be performed before making inference from
a fitted model. Parametric bootstrapping can be used for the latter
by supplying a fit statistic such as `SSE` instead of
`getN.hat`. See `?parboot` and `vignette('unmarked')` for examples.

# Prediction and plotting

A `predict` method exits for all `unmarkedFit` objects,
which is useful when multiple covariate combinations exist. This
method also facilitates plotting. Suppose we wanted model predictions
from the covariate model along the range of covariate values
studied. First we need to make new `data.frame`s holding the
desired covariate combinations. Note that column names must match
those of the original data, and factor variables must contain the same
levels.

```{r}
head(habConstant <- data.frame(canopyHt = seq(2, 8, length=20),
	habitat=factor("A", levels=c("A", "B"))))
(htConstant <- data.frame(canopyHt = 5,
	habitat=factor(c("A", "B"))))
```

Now `predict` can be used to estimate density and $\sigma$ for
each row of our new `data.frame`s.

```{r}
(Elambda <- predict(hn_Hab.Ht, type="state", newdata=htConstant,
	appendData=TRUE))
head(Esigma <- predict(hn_Hab.Ht, type="det", newdata=habConstant,
	appendData=TRUE))
```

Once predictions have been made, plotting is
straight-forward. Figure 2a shows density as a
function of habitat type, and Figure 2b shows that
$\sigma$ is not related to canopy height.

```{r, fig.height=3, fig.width=6, fig.cap="Figure 2. Predicted covariate relationships"}
par(mfrow=c(1, 2))
with(Elambda, {
	x <- barplot(Predicted, names=habitat, xlab="Habitat",
		ylab="Density (animals / ha)", ylim=c(0, 25), cex.names=0.7,
        cex.lab=0.7, cex.axis=0.7)
	arrows(x, Predicted, x, Predicted+SE, code=3, angle=90, length=0.05)
	box()
	})
with(Esigma, {
	plot(canopyHt, Predicted, type="l", xlab="Canopy height",
		ylab=expression(paste("Half-normal scale parameter (", sigma, ")",
        sep="")), ylim=c(0, 20), cex=0.7, cex.lab=0.7,
    cex.axis=0.7)
	lines(canopyHt, Predicted+SE, lty=2)
	lines(canopyHt, Predicted-SE, lty=2)
	})
```

Plots of the detection function parameters can be less informative
than plots of the detection functions themselves. To do the latter, we
can plug predicted values of $\sigma$ at given covariate values into
the `gxhn` function. For instance, Figure 3a
shows to the half-normal function at a canopy height of 2m. This was
plotted by setting $\sigma$ to 10.8, the predicted value shown
above. The available detection functions are described on the
`detFuns` help page. Probability density functions such as
`dxhn` can be plotted with the distance histogram using the
`hist` method for `unmarkedFitDS` objects
(Figure 3b). This only works for models without
detection covariates; however, probability density functions at
specific covariate values can be added in a fashion similar to that
above (Figure 3b).

```{r, fig.width=6, fig.height=3, fig.cap="Figure 3. Detection and probability density functions"}
par(mfrow=c(1, 2))
plot(function(x) gxhn(x, sigma=10.8), 0, 20, xlab="Distance (m)",
	ylab="Detection prob. at 2m canopy ht.", cex.lab=0.7,
    cex.axis=0.7, las=1)
hist(hn_Null, xlab="Distance (m)", ylab="Probability density", main="",
    ylim=c(0, 0.1), cex.lab=0.7, cex.axis=0.7, las=1)
plot(function(x) dxhn(x, sigma=10.8), 0, 20, add=TRUE, col="blue")
plot(function(x) dxhn(x, sigma=9.9), 0, 20, add=TRUE, col="green")
legend('topright', c("Canopy ht. = 2m", "Null", "Canopy ht. = 8m"),
	col=c("blue", "black", "green"), lty=1, cex=0.4)
```

# Model extensions

A common criticism of distance sampling is that all individuals must
be available for detection. Similarly, the probability of detecting an
individual a distance of 0 must be 1. These assumptions often cannot
be met. For instance, when counting cues such as bird songs or whale
blows, the probability that an individual will produce a cue (and thus
be available for detection) is rarely 1 during the sampling
interval. Recently developed methods [@chandlerEA_2011]
allow for the estimation of the
probability of being available for detection $\phi$. To do so,
replicate distance sampling observations must be collected at each
transect. These replicates could be collected using repeated visits or
multiple observers working independently. Implementation of this model
in `unmarked` is accomplished using the `gdistsamp`
function. The function also provides the option to model abundance
using the negative binomial distribution. Formatting data and
specifying models is similar to methods described above and is more
fully outlined in the help pages.

# Conclusion

This document has emphasized methods tailored to distance sampling
analysis; however, the more general methods available in package
`unmarked` can also be applied to models fitted using
`distsamp` and `gdistsamp`.
For example, model-selection and model-averaging can be
accomplished using the `fitList` function and the `modSel`
and `predict` methods.

# References