diff options
Diffstat (limited to 'vignettes/colext.Rmd')
-rw-r--r-- | vignettes/colext.Rmd | 1091 |
1 files changed, 1091 insertions, 0 deletions
diff --git a/vignettes/colext.Rmd b/vignettes/colext.Rmd new file mode 100644 index 0000000..0bfc1a1 --- /dev/null +++ b/vignettes/colext.Rmd @@ -0,0 +1,1091 @@ +--- +title: Dynamic occupancy models in unmarked +author: +- name: Marc Kéry, Swiss Ornithological Institute +- name: Richard Chandler, University of Georgia +date: August 16, 2016 +bibliography: unmarked.bib +csl: ecology.csl +output: + rmarkdown::html_vignette: + fig_width: 5 + fig_height: 3.5 + number_sections: true + toc: true +vignette: > + %\VignetteIndexEntry{Dynamic occupancy models} + %\VignetteEngine{knitr::rmarkdown} + \usepackage[utf8]{inputenc} +--- + + + +# Abstract + +Dynamic occupancy models [@mackenzie_estimating_2003] allow inference about +the occurrence of "things" at collections of "sites" +and about how changes in occurrence are driven by colonization and +local extinction. These models also account for imperfect detection +probability. Depending on how "thing" and "site" are defined, +occupancy may have vastly different biological meanings, +including the presence of a disease in an individual (disease +incidence) of a species at a site (occurrence, distribution), or of an +individual in a territory. +Dynamic occupancy models in `unmarked` are fit using the +function `colext`. +All parameters can be modeled as functions of covariates, i.e., +first-year occupancy with covariates varying by site +(site-covariates), +colonization and survival with site- and yearly-site-covariates and +detection with site-, yearly-site- and sample-occasion-covariates. +We give two commented example analyses: one for a simulated data set +and another for a real data set on crossbills in the Swiss breeding +bird survey MHB. +We also give examples to show how predictions, along with standard +errors and confidence intervals, can be obtained. + +# Introduction + +Occurrence is a quantity of central importance in many branches of +ecology and related sciences. +The presence of a disease in an individual or of a species +at a site are two common types of occurrence studies. +The associated biological metrics are the incidence of the disease and +species occurrence or species distribution. +Thus, depending on how we define the thing we are looking for and the +sample unit, very different biological quantities can be analyzed +using statistical models for occupancy. + +If we denote presence of the "thing" as $y=1$ and its absence as +$y=0$, then it is natural to characterize all these metrics by the +probability that a randomly chosen sample unit ("site") is occupied, +i.e., has a "thing" present: $Pr(y=1) = \psi$. +We call this the occupancy probability, or occupancy for short, and +from now on will call the sample unit, +where the presence or absence of a "thing" is assessed, generically +a "site". + +Naturally, we would like to explore factors that affect the likelihood +that a site is occupied. +A binomial generalized linear model, or logistic regression, is the +customary statistical model for occurrence. +In this model, we treat occurrence $y$ as a binomial random variable +with trial size 1 and success probability $p$, or, equivalently, a +Bernoulli trial with $p$. +"Success" means occurrence, so $p$ is the occurrence probability. +It can be modeled as a linear or other function of covariates via a +suitable link function, e.g., the logit link. +This simple model is described in many places, including @McCullagh_1989, +Royle and Dorazio [-@royle_dorazio:2008, chapter 3], Kéry [-@Kery_2010, +chapter 17] and Kéry and Schaub [-@Kery_2011, chapter 3]. + +A generalization of this model accounts for changes in the occupancy +state of sites by introducing parameters for survival +(or alternatively, extinction) and colonization probability. +Thus, when we have observations of occurrence for more than a single +point in time, we can model the transition of the occupancy +state at site $i$ between successive times as another Bernoulli trial. +To model the fate of an occupied site, we denote the probability that +a site occupied at $t$ is again occupied at $t+1$ as $Pr(y_{i,t+1} = 1 +| y_{i,t} = 1 ) = \phi$. +This represents the survival probability of a site that is occupied. +Of course, we could also choose to express this component of occupancy +dynamics by the converse, extinction probability $\epsilon$ --- +the parameterization used in `unmarked`. +To model the fate of an unoccupied site, we denote as $Pr(y_{i,t+1} = +1 | y_{i,t} = 0 ) = \gamma$ the probability that an unoccupied site at +$t$ becomes occupied at $t+1$. +This is the colonization probability of an empty site. +Such a dynamic model of occurrence has become famous in the ecological literature under the name "metapopulation model" [@Hanski_1998]. + +However, when using ecological data collected in the field to fit such +models of occurrence, we face the usual challenge of imperfect +detection [e.g. @Kery_2008]. +For instance, a species can go unobserved at a surveyed site or an +occupied territory can appear unoccupied during a particular survey, +perhaps because both birds are away hunting. +Not accounting for detection error may seriously bias all parameter +estimators of a metapopulation model [@Moilanen_2002; @royle_dorazio:2008]. +To account for this additional stochastic component in the generation +of most ecological field data, the classical metapopulation model may +be generalized to include a submodel for the observation process, +which allows an occupied site to be recorded as unoccupied. +This model has been developed by @mackenzie_estimating_2003. It is +described as a hierarchical model by @Royle_2007, Royle +and Dorazio [-@royle_dorazio:2008, chapter 9] and Kéry and Schaub [-@Kery_2011, chapter 13]. +The model is usually called a multi-season, multi-year or a +dynamic site-occupancy model. +The former terms denote the fact that it is applied to multiple +"seasons" or years and the latter emphasizes that the model allows +for between-season occurrence dynamics. + +This vignette describes the use of the `unmarked` function +`colext` to fit dynamic occupancy models. Note that we will use +italics for the names of functions. +Static occupancy models, i.e., for a single season without changes in +the occupancy state [@mackenzie_estimating_2002], can be fit with `occu`, +for the model described by @mackenzie_estimating_2002 and @Tyre_2002, and with `occuRN`, for the heterogeneity occupancy model +described by @royle_estimating_2003. +In the next section (section 2), we give a more technical description +of the dynamic occupancy model. +In section 3, we provide R code for generating data under a basic +dynamic occupancy model and illustrate use of `colext` for fitting the +model. +In section 4, we use real data from the Swiss breeding bird survey MHB +[@schmid_etal:2004] to fit a few more elaborate models with +covariates for all parameters. +We also give examples illustrating how to compute predictions, with +standard errors and 95% confidence intervals, for the parameters. + +# Dynamic occupancy models + +To be able to estimate the parameters of the dynamic occupancy model +(probabilities of occurrence, survival and colonization) separately +from the parameters for the observation process (detection +probability), replicate observations are required from a period of +closure, +during which the occupancy state of a site must remain constant, i.e., +it is either occupied or unoccupied. +The modeled data $y_{ijt}$ are indicators for whether a species is +detected at site $i$ ($i = 1, 2, \ldots M$), during replicate survey +$j$ ($j = 1, 2, \ldots J$) in season $t$ ($t = 1, 2, \ldots T$). +That is, $y_{ijt}=1$ if at least one individual is detected and +$y_{ijt}=0$ if none is detected. + +The model makes the following assumptions: +* replicate surveys at a site during a single season are + independent (or else dependency must be modeled) +* occurrence state $z_{it}$ (see below) does not change over + replicate surveys at site $i$ during season $t$ +* there are no false-positive errors, i.e., a species can only be + overlooked where it occurs, but it cannot be detected where it does + not in fact occur (i.e., there are no false-positives) + +The complete model consists of one submodel to describe the ecological +process, or state, and another submodel for the observation process, +which is dependent on the result of the ecological process. +The ecological process describes the latent occurrence dynamics for +all sites in terms of parameters for the probability of initial +occurrence and site survival and colonization. +The observation process describes the probability of detecting a +presence (i.e., $y = 1$) at a site that is occupied and takes account +of false-negative observation errors. + +## Ecological or state process + +This initial state is denoted $z_{i1}$ and represents occurrence at +site $i$ during season 1. +For this, the model assumes a Bernoulli trial governed by the +occupancy probability in the first season $\psi_{i1}$: + +$$ +z_{i1} = Bernoulli(\psi_{i1}) +$$ + +We must distinguish the sample quantity "occurrence" at a site, $z$, +from the population quantity "occupancy probability", $\psi$. +The former is the realization of a Bernoulli random variable with +parameter $\psi$. +This distinction becomes important when we want to compute the number +of occupied sites among the sample of surveyed sites; +see @Royle_2007 and @Weir_2009 for this +distinction. + +For all later seasons ($t = 2, 3, \ldots T$), occurrence is a function +of occurrence at site $i$ at time $t-1$ and one of two parameters that +describe the colonization-extinction dynamics of the system. +These dynamic parameters are the probability of local survival +$\phi_{it}$, also called probability of persistence (= 1 minus the +probability of local extinction), +and the probability of colonization $\gamma_{it}$. + +$$ +z_{it} \sim Bernoulli(z_{i,t-1} \phi_{it} + (1-z_{i,t-1}) \gamma_{it}) +$$ + +Hence, if site $i$ is unoccupied at $t-1$ , $z_{i,t-1}=0$, and the +success probability of the Bernoulli is +$0*\phi_{it} + (1-0) * \gamma_{it}$, so the site is occupied +(=colonized) in season $t$ with probability $\gamma_{it}$ +. Conversely, if site $i$ is occupied at $t-1$ , $z_{i,t-1}=1$, and +the success probability of the Bernoulli is given by $1*\phi_{it} + +(1-1) * \gamma_{it}$, so the site is occupied in (=survives to) season +$t$ with probability $\phi_{it}$. + +Occupancy probability ($\psi_{it}$) and occurrence ($z_{it}$) at all +later times $t$ can be computed recursively from $\psi_{i1}$, +$z_{i1}$ , $\phi_{it}$ and $\gamma_{it}$. +Variances of these derived estimates can be obtained via the delta +method or the bootstrap. + +## Observation process + +To account for the observation error (specifically, false-negative +observations), the conventional Bernoulli detection process is +assumed, such that + +$$ +y_{ijt} \sim Bernoulli(z_{it} p_{ijt}) +$$ + +Here, $y_{ijt}$ is the detection probability at site $i$ during +survey $j$ and season $t$. Detection is conditional on occurrence, and +multiplying $p_{ijt}$ with $z_{it}$ ensures that occurrence can only +be detected where in fact a species occurs, i.e. where $z_{it}=1$. + +## Modeling of parameters + +The preceding, fully general model description allows for site-($i$) +dependence of all parameters. In addition to that, survival and +colonization probabilities may be season-($t$)dependent and detection +probability season-($t$) and survey-($j$) dependent. +All this complexity may be dropped, especially the dependence on +sites. On the other hand, all parameters that are indexed in some way +can be modeled, e.g., as functions of covariates that vary along the +dimension denoted by an index. We will fit linear functions (on the +logit link scale) of covariates into first-year occupancy, survival +and colonization and into detection probability. +That is, for probabilities of first-year occupancy, survival, +colonization and detection, respectively, we will fit models of the +form + $logit(\psi_{i1}) = \alpha + \beta x_i$, where $x_i$ may be forest + cover or elevation of site $i$ , + $logit(\phi_{it}) = \alpha + \beta x_{it}$, where $x_{it}$ may be + tree mast at site $i$ during season $t$, + $logit(\gamma_{it}) = \alpha + \beta x_{it}$, for a similarly + defined covariate $x_{it}$, or + $logit(p_{ijt}) = \alpha + \beta x_{ijt}$ , where $x_{ijt}$ is the + Julian date of the survey $j$ at site $i$ in season $t$. + +We note that for first-year occupancy, only covariates that vary among +sites ("site covariates") can be fitted, while for survival and +colonization, covariates that vary by site and by season ("yearly +site covariates") may be fitted as well. +For detection, covariates of three formats may be fitted: +"site-covariates", "yearly-site-covariates" and +"observation-covariates", as +they are called in `unmarked`. + +# Dynamic occupancy models for simulated data + +We first generate a simple, simulated data set +with specified, year-specific values for +the parameters as well as design specifications, i.e., number of +sites, years and surveys per year. +Then, we show how to fit a dynamic occupancy model with +year-dependence in the parameters for colonization, extinction and +detection probability. + +## Simulating, formatting, and summarizing data + +To simulate the data, we execute the following R code. +The actual values for these parameters for each year are drawn +randomly from a uniform distribution with +the specified bounds. + + +```r +M <- 250 # Number of sites +J <- 3 # num secondary sample periods +T <- 10 # num primary sample periods + +psi <- rep(NA, T) # Occupancy probability +muZ <- z <- array(dim = c(M, T)) # Expected and realized occurrence +y <- array(NA, dim = c(M, J, T)) # Detection histories + +set.seed(13973) +psi[1] <- 0.4 # Initial occupancy probability +p <- c(0.3,0.4,0.5,0.5,0.1,0.3,0.5,0.5,0.6,0.2) +phi <- runif(n=T-1, min=0.6, max=0.8) # Survival probability (1-epsilon) +gamma <- runif(n=T-1, min=0.1, max=0.2) # Colonization probability + +# Generate latent states of occurrence +# First year +z[,1] <- rbinom(M, 1, psi[1]) # Initial occupancy state +# Later years +for(i in 1:M){ # Loop over sites + for(k in 2:T){ # Loop over years + muZ[k] <- z[i, k-1]*phi[k-1] + (1-z[i, k-1])*gamma[k-1] + z[i,k] <- rbinom(1, 1, muZ[k]) + } +} + +# Generate detection/non-detection data +for(i in 1:M){ + for(k in 1:T){ + prob <- z[i,k] * p[k] + for(j in 1:J){ + y[i,j,k] <- rbinom(1, 1, prob) + } + } +} + +# Compute annual population occupancy +for (k in 2:T){ + psi[k] <- psi[k-1]*phi[k-1] + (1-psi[k-1])*gamma[k-1] + } +``` + +We have now generated a single realization from the stochastic system +thus defined. Figure 1 +illustrates the fundamental issue +of imperfect detection --- the actual proportion of sites occupied +differs greatly from the observed proportion of sites occupied, and +because $p$ varies among years, the observed data cannot be used as a +valid index of the parameter of interest $\psi_i$. + + + + +```r +plot(1:T, colMeans(z), type = "b", xlab = "Year", + ylab = "Proportion of sites occupied", + col = "black", xlim=c(0.5, 10.5), xaxp=c(1,10,9), + ylim = c(0,0.6), lwd = 2, lty = 1, + frame.plot = FALSE, las = 1, pch=16) + +psi.app <- colMeans(apply(y, c(1,3), max)) +lines(1:T, psi.app, type = "b", col = "blue", lty=3, lwd = 2) +legend(1, 0.6, c("truth", "observed"), + col=c("black", "blue"), lty=c(1,3), pch=c(16,1)) +``` + +![Figure 1. Summary of the multi-year occupancy data set generated.](colext-data-1.png) + +To analyze this data set with a dynamic occupancy model in +`unmarked`, we first load the package. + + +```r +library(unmarked) +``` + +Next, we reformat the detection/non-detection data from a 3-dimensional +array (as generated) into a 2-dimensional matrix with M rows. +That is, we put the annual tables of data (the slices of the former +3-D array) sideways to produce a "wide" layout of the data. + + +```r +yy <- matrix(y, M, J*T) +``` + +Next, we create a matrix indicating the year each site was surveyed. + + +```r +year <- matrix(c('01','02','03','04','05','06','07','08','09','10'), + nrow(yy), T, byrow=TRUE) +``` + +To organize the data in the format required by `colext`, we make +use of the function `unmarkedMultFrame`. The only required +arguments are `y`, the detection/non-detection data, and +`numPrimary`, the number of seasons. The three types of +covariates described earlier can also be supplied using the arguments +`siteCovs`, `yearlySiteCovs`, and `obsCovs`. In this case, +we only make use of the second type, which must have M rows and T +columns. + + +```r +simUMF <- unmarkedMultFrame( + y = yy, + yearlySiteCovs = list(year = year), + numPrimary=T) +summary(simUMF) +``` + +``` +## unmarkedFrame Object +## +## 250 sites +## Maximum number of observations per site: 30 +## Mean number of observations per site: 30 +## Number of primary survey periods: 10 +## Number of secondary survey periods: 3 +## Sites with at least one detection: 195 +## +## Tabulation of y observations: +## 0 1 +## 6430 1070 +## +## Yearly-site-level covariates: +## year +## 01 : 250 +## 02 : 250 +## 03 : 250 +## 04 : 250 +## 05 : 250 +## 06 : 250 +## (Other):1000 +``` + +## Model fitting + +We are ready to fit a few dynamic occupancy models. +We will fit a model with constant values for all parameters and +another with full time-dependence for colonization, extinction and +detection probability. We also time the calculations. + + +```r +# Model with all constant parameters +m0 <- colext(psiformula= ~1, gammaformula = ~ 1, epsilonformula = ~ 1, + pformula = ~ 1, data = simUMF, method="BFGS") +summary(m0) +``` + +``` +## +## Call: +## colext(psiformula = ~1, gammaformula = ~1, epsilonformula = ~1, +## pformula = ~1, data = simUMF, method = "BFGS") +## +## Initial (logit-scale): +## Estimate SE z P(>|z|) +## -0.813 0.158 -5.16 2.46e-07 +## +## Colonization (logit-scale): +## Estimate SE z P(>|z|) +## -1.77 0.0807 -22 2.75e-107 +## +## Extinction (logit-scale): +## Estimate SE z P(>|z|) +## -0.59 0.102 -5.79 7.04e-09 +## +## Detection (logit-scale): +## Estimate SE z P(>|z|) +## -0.0837 0.0562 -1.49 0.137 +## +## AIC: 4972.597 +## Number of sites: 250 +## optim convergence code: 0 +## optim iterations: 27 +## Bootstrap iterations: 0 +``` + +The computation time was only a few seconds. +Note that all parameters were estimated on the logit scale. To +back-transform to the original scale, we can simply use the +inverse-logit function, named `plogis` in R. + + +```r +plogis(-0.813) +``` + +``` +## [1] 0.3072516 +``` + +Alternatively, we can use `backTransform`, which +computes standard errors using the delta method. Confidence intervals +are also easily obtained using the function `confint`. +We first remind ourselves of the names of parameters, which can all be +used as arguments for these functions. + + +```r +names(m0) +``` + +``` +## [1] "psi" "col" "ext" "det" +``` + +```r +backTransform(m0, type="psi") +``` + +``` +## Backtransformed linear combination(s) of Initial estimate(s) +## +## Estimate SE LinComb (Intercept) +## 0.307 0.0335 -0.813 1 +## +## Transformation: logistic +``` + +```r +confint(backTransform(m0, type="psi")) +``` + +``` +## 0.025 0.975 +## 0.2457313 0.3765804 +``` + +Next, we fit the dynamic occupancy model with full year-dependence in +the parameters describing occupancy dynamics and also in detection. +This is the same model under which we generated the data set, so we +would expect accurate estimates. + +By default in R, a factor such as year in this analysis, is a +parameterized in terms of an intercept and effects representing +differences. This would mean that the parameter for the first year is +the intercept and the effects would denote the differences between +the parameter values in all other years, relative to the parameter +value in the first year, which serves as a reference level. +This treatment or effects parameterization is useful for testing for +differences. For simple presentation, a means parameterization is more +practical. It can be specified by adding a -1 to the formula for the +time-dependent parameters. + + +```r +m1 <- colext(psiformula = ~1, # First-year occupancy + gammaformula = ~ year-1, # Colonization + epsilonformula = ~ year-1, # Extinction + pformula = ~ year-1, # Detection + data = simUMF) +m1 +``` + +``` +## +## Call: +## colext(psiformula = ~1, gammaformula = ~year - 1, epsilonformula = ~year - +## 1, pformula = ~year - 1, data = simUMF) +## +## Initial: +## Estimate SE z P(>|z|) +## -0.273 0.302 -0.906 0.365 +## +## Colonization: +## Estimate SE z P(>|z|) +## year01 -2.08 0.951 -2.19 2.86e-02 +## year02 -2.18 0.365 -5.96 2.52e-09 +## year03 -1.98 0.274 -7.23 4.88e-13 +## year04 -2.32 0.678 -3.42 6.37e-04 +## year05 -1.89 0.478 -3.95 7.78e-05 +## year06 -1.76 0.294 -5.97 2.44e-09 +## year07 -1.55 0.230 -6.73 1.75e-11 +## year08 -1.43 0.228 -6.29 3.19e-10 +## year09 -2.35 0.470 -5.00 5.64e-07 +## +## Extinction: +## Estimate SE z P(>|z|) +## year01 -1.4209 0.418 -3.401 6.72e-04 +## year02 -0.4808 0.239 -2.009 4.45e-02 +## year03 -1.2606 0.366 -3.440 5.83e-04 +## year04 -0.0907 0.650 -0.139 8.89e-01 +## year05 -0.6456 0.599 -1.078 2.81e-01 +## year06 -0.9586 0.378 -2.539 1.11e-02 +## year07 -1.2279 0.365 -3.362 7.74e-04 +## year08 -1.1894 0.292 -4.076 4.58e-05 +## year09 -0.6292 0.635 -0.991 3.22e-01 +## +## Detection: +## Estimate SE z P(>|z|) +## year01 -1.0824 0.244 -4.434 9.26e-06 +## year02 -0.2232 0.148 -1.508 1.32e-01 +## year03 0.2951 0.154 1.918 5.52e-02 +## year04 0.0662 0.161 0.412 6.81e-01 +## year05 -2.0396 0.433 -4.706 2.52e-06 +## year06 -0.6982 0.232 -3.005 2.66e-03 +## year07 0.2413 0.165 1.466 1.43e-01 +## year08 0.0847 0.155 0.548 5.84e-01 +## year09 0.6052 0.140 4.338 1.44e-05 +## year10 -1.1699 0.306 -3.828 1.29e-04 +## +## AIC: 4779.172 +``` + +## Manipulating results: prediction and plotting + +Again, all estimates are shown on the logit-scale. Back-transforming +estimates when covariates, such as year, are present involves an +extra step. Specifically, we need to tell `unmarked` the values +of our covariate +at which we want an estimate. This can be done using +`backTransform` in combination with `linearComb`, although +it can be easier to use `predict`. `predict` allows the user +to supply a data.frame in which each row represents a combination of +covariate values of interest. Below, we create data.frames called +`nd` with each row representing a year. +Then we request yearly estimates of the probability of extinction, +colonization and detection, +and compare them to "truth", i.e., the values with which we +simulated the data set. Note that there are T-1 extinction and +colonization parameters in this case, so we do not need to include +year 10 in `nd`. + + +```r +nd <- data.frame(year=c('01','02','03','04','05','06','07','08','09')) +E.ext <- predict(m1, type='ext', newdata=nd) +E.col <- predict(m1, type='col', newdata=nd) +nd <- data.frame(year=c('01','02','03','04','05','06','07','08','09','10')) +E.det <- predict(m1, type='det', newdata=nd) +``` + +`predict` returns the predictions along with standard errors and +confidence intervals. These can be used to create plots. The +`with` function is used to simplify the process of requesting the +columns of `data.frame` returned by `predict`. + + +```r +op <- par(mfrow=c(3,1), mai=c(0.6, 0.6, 0.1, 0.1)) + +with(E.ext, { # Plot for extinction probability + plot(1:9, Predicted, pch=1, xaxt='n', xlab='Year', + ylab=expression(paste('Extinction probability ( ', epsilon, ' )')), + ylim=c(0,1), col=4) + axis(1, at=1:9, labels=nd$year[1:9]) + arrows(1:9, lower, 1:9, upper, code=3, angle=90, length=0.03, col=4) + points((1:9)-0.1, 1-phi, col=1, lwd = 1, pch=16) + legend(7, 1, c('Parameter', 'Estimate'), col=c(1,4), pch=c(16, 1), + cex=0.8) + }) + +with(E.col, { # Plot for colonization probability + plot(1:9, Predicted, pch=1, xaxt='n', xlab='Year', + ylab=expression(paste('Colonization probability ( ', gamma, ' )')), + ylim=c(0,1), col=4) + axis(1, at=1:9, labels=nd$year[1:9]) + arrows(1:9, lower, 1:9, upper, code=3, angle=90, length=0.03, col=4) + points((1:9)-0.1, gamma, col=1, lwd = 1, pch=16) + legend(7, 1, c('Parameter', 'Estimate'), col=c(1,4), pch=c(16, 1), + cex=0.8) + }) + +with(E.det, { # Plot for detection probability: note 10 years + plot(1:10, Predicted, pch=1, xaxt='n', xlab='Year', + ylab=expression(paste('Detection probability ( ', p, ' )')), + ylim=c(0,1), col=4) + axis(1, at=1:10, labels=nd$year) + arrows(1:10, lower, 1:10, upper, code=3, angle=90, length=0.03, col=4) + points((1:10)-0.1, p, col=1, lwd = 1, pch=16) + legend(7.5, 1, c('Parameter','Estimate'), col=c(1,4), pch=c(16, 1), + cex=0.8) + }) +``` + +![Figure 2. Yearly estimates of parameters](colext-est-1.png) + +```r +par(op) +``` + +Figure 2 shows that the 95% confidence intervals +include the true parameter values, and the point estimates are not too +far off. + +## Derived parameters + +Estimates of occupancy probability in years $T>1$ must be derived from the +estimates of first-year occupancy and the two parameters governing the +dynamics, extinction/survival and colonization. +`unmarked` does this automatically in two ways. First, the +population-level estimates of occupancy probability +$\psi_t = \psi_{t-1}\phi_{t-1} + (1-\phi_{t-1})\gamma$ are calculated +and stored in the slot named \emph{projected}. Slots can be accessed +using the `@` operator, e.g. `fm@projected`. +In some cases, interest may lie in making +inference about the proportion of the sampled sites that are occupied, +rather than the entire population of sites. These estimates are +contained in the `smoothed` slot of the fitted model. Thus, the +`projected` values are estimates of population parameters, and +the `smoothed` estimates are of the finite-sample +quantities. Discussions of the differences can be found in @Weir_2009. + +Bootstrap methods can be used to compute standard errors of derived +parameter estimates. Here we employ a non-parametric bootstrap to obtain +standard errors of the smoothed estimates of occupancy probability +during each year. + + +```r +m1 <- nonparboot(m1, B = 10) +cbind(psi=psi, smoothed=smoothed(m1)[2,], SE=m1@smoothed.mean.bsse[2,]) +``` + +``` +## psi smoothed SE +## 1 0.4000000 0.4320671 0.06783911 +## 2 0.3493746 0.4110124 0.03786402 +## 3 0.2977125 0.3139967 0.02780818 +## 4 0.3148447 0.3278179 0.04303542 +## 5 0.3192990 0.2316695 0.10858419 +## 6 0.2915934 0.2528485 0.04179036 +## 7 0.3114415 0.2928429 0.03113920 +## 8 0.3636580 0.3504885 0.04224678 +## 9 0.3654064 0.3936991 0.02103870 +## 10 0.3460641 0.3095786 0.06830698 +``` + +In practice, `B` should be much higher, possibly >1000 for complex +models . + +Another derived parameters of interest is turnover probability + +$$ +\tau_t = \frac{\gamma_{t-1}(1-\psi_{t-1})}{\gamma_{t-1}(1-\psi_{t-1}) + + \phi_{t-1}\psi_{t-1}} +$$ + +The following function returns these estimates. + + +```r +turnover <- function(fm) { + psi.hat <- plogis(coef(fm, type="psi")) + if(length(psi.hat) > 1) + stop("this function only works if psi is scalar") + T <- getData(fm)@numPrimary + tau.hat <- numeric(T-1) + gamma.hat <- plogis(coef(fm, type="col")) + phi.hat <- 1 - plogis(coef(fm, type="ext")) + if(length(gamma.hat) != T-1 | length(phi.hat) != T-1) + stop("this function only works if gamma and phi T-1 vectors") + for(t in 2:T) { + psi.hat[t] <- psi.hat[t-1]*phi.hat[t-1] + + (1-psi.hat[t-1])*gamma.hat[t-1] + tau.hat[t-1] <- gamma.hat[t-1]*(1-psi.hat[t-1]) / psi.hat[t] + } + return(tau.hat) + } +``` + +The bootstrap again offers a means of estimating variance. Here we +show how to generate 95\% confidence intervals for the turnover +estimates using the parametric bootstrap. + + +```r +pb <- parboot(m1, statistic=turnover, nsim=2) +turnCI <- cbind(pb@t0, + t(apply(pb@t.star, 2, quantile, probs=c(0.025, 0.975)))) +colnames(turnCI) <- c("tau", "lower", "upper") +turnCI +``` + +``` +## tau lower upper +## t*1 0.1532645 0.00536045 0.1974714 +## t*2 0.1911530 0.07881180 0.2119585 +## t*3 0.2537292 0.19777204 0.2785973 +## t*4 0.2604356 0.04063769 0.4197328 +## t*5 0.3989303 0.34078483 0.4720357 +## t*6 0.3758690 0.32703698 0.5370796 +## t*7 0.3537473 0.32696166 0.3564059 +## t*8 0.3174983 0.32925238 0.4139696 +## t*9 0.1704449 0.18946470 0.3186236 +``` + +Which bootstrap method is most appropriate for variance estimation? +For detailed distinctions between the +non-parametric and the parametric bootstrap, see @Davison_1997. +We note simply that the parametric bootstrap resamples from +the fitted model, and thus the +measures of uncertainty are purely +functions of the distributions assumed by the model. Non-parametric +bootstrap samples, in contrast, are obtained by resampling the +data, not the model, and thus are not necessarily affected by the +variance formulas of the model's distributions. + +## Goodness-of-fit + +In addition to estimating the variance of an estimate, the parametric +bootstrap can be used to assess goodness-of-fit. For this purpose, a +fit-statistic, i.e. one that compares +observed and expected values, is evaluated using the original fitted +model, and numerous other models fitted to simulated datasets. The +simulation yields an approximation of +the distribution of the fit-statistic, and a \emph{P}-value +can be computed as the proportion of simulated values greater than the +observed value. + +@Hosmer_1997 found that a $\chi^2$ statistic performed +reasonably well in assessing lack of fit for logistic regression +models. We know of no studies formally +evaluating the performance of various fit-statistics for dynamic +occupancy models, so this approach should be +considered experimental. Fit-statistics applied to aggregated +encounter histories offer an alternative approach [@MacKenzie_2004], but are difficult to implement when J*T is high and +missing values or continuous covariates are present. + + +```r +chisq <- function(fm) { + umf <- getData(fm) + y <- getY(umf) + sr <- fm@sitesRemoved + if(length(sr)>0) + y <- y[-sr,,drop=FALSE] + fv <- fitted(fm, na.rm=TRUE) + y[is.na(fv)] <- NA + sum((y-fv)^2/(fv*(1-fv))) + } + +set.seed(344) +pb.gof <- parboot(m0, statistic=chisq, nsim=100) + +plot(pb.gof, xlab=expression(chi^2), main="", col=gray(0.95), + xlim=c(7300, 7700)) +``` + +![Figure 3. Goodness-of-fit](colext-gof-1.png) + +Figure 3 indicates that, as expected, the constant +parameter model does not fit the data well. + +# Dynamic occupancy models for crossbill data from the Swiss MHB + +## The crossbill data set + +The crossbill data are included with the `unmarked` package. +The dataset contains the results of nine years of surveys (1999--2007) +for the European crossbill (*Loxia curvirostra*), +a pine-seed eating finch, in 267 1-km$^2$ sample quadrats in Switzerland. +Quadrats are surveyed annually as part of the Swiss breeding bird +survey MHB [@schmid_etal:2004]. +They are laid out as a grid over Switzerland and surveyed 2 or 3 times +every breeding season (mid-April to late June) +by experienced field ornithologists along a haphazard survey route of +length 1-9 km (average 5 km). +High-elevation sites are only surveyed twice per breeding season. + +## Importing, formatting, and summarizing data + +The data can be loaded into an open R workspace using the `data` command. + + +```r +data(crossbill) +colnames(crossbill) +``` + +``` +## [1] "id" "ele" "forest" "surveys" "det991" "det992" "det993" +## [8] "det001" "det002" "det003" "det011" "det012" "det013" "det021" +## [15] "det022" "det023" "det031" "det032" "det033" "det041" "det042" +## [22] "det043" "det051" "det052" "det053" "det061" "det062" "det063" +## [29] "det071" "det072" "det073" "date991" "date992" "date993" "date001" +## [36] "date002" "date003" "date011" "date012" "date013" "date021" "date022" +## [43] "date023" "date031" "date032" "date033" "date041" "date042" "date043" +## [50] "date051" "date052" "date053" "date061" "date062" "date063" "date071" +## [57] "date072" "date073" +``` + +We have three covariates that vary by site: median elevation of the +quadrat (`ele`, in metres), forest cover of the quadrat (`forest`, in +percent) and the number of surveys per season (i.e., 2 or 3, +surveys). +These are called site covariates, because they vary by sites only. +The 27 columns entitled `det991` - `det073` contain the crossbill +detection/nondetection data during all surveys over the 9 years. +They contain a 1 when at least one crossbill was recorded during a +survey and a 0 otherwise. +`NA`s indicate surveys that did not take place, either because a site is +high-elevation and has no third survey or because it failed to be +surveyed altogether in a year. +The final 27 columns entitled `date991` - `date073` give the Julian +date of each survey. +They represent a "survey-covariate" or "observation covariate". +We note that the paper by @Royle_2007 used a subset of this +data set. + +AIC-based model selection (see section 5.4) requires +that all models are fit to the same data. +`unmarked` removes missing data in a context specific way. For +missing `siteCovs`, the entire row of data must be removed. However, for +missing `yearlySiteCovs` or `obsCovs`, only the +corresponding observation +are removed. Thus, if `unmarked` removes different observations +from different models, the models cannot be compared using AIC. A way +around this is to remove the detection data corresponding to +missing covariates before fitting the models. +The crossbill data have missing dates and so we remove the associated +detection/non-detection data. + + + +```r +DATE <- as.matrix(crossbill[,32:58]) +y.cross <- as.matrix(crossbill[,5:31]) +y.cross[is.na(DATE) != is.na(y.cross)] <- NA +``` + +In addition, continuous covariates should be transformed in a way +that brings their values close to zero in order to improve +or even enable numerical convergence of the maximum-likelihood routine. +We do this "by hand" and note that we could also have used the R +function `scale`. We subtract the mean and divide by the standard +deviation. + + +```r +sd.DATE <- sd(c(DATE), na.rm=TRUE) +mean.DATE <- mean(DATE, na.rm=TRUE) +DATE <- (DATE - mean.DATE) / sd.DATE +``` + +Before we can fit occupancy models, we need to format this data set +appropriately. + + +```r +years <- as.character(1999:2007) +years <- matrix(years, nrow(crossbill), 9, byrow=TRUE) +umf <- unmarkedMultFrame(y=y.cross, + siteCovs=crossbill[,2:3], yearlySiteCovs=list(year=years), + obsCovs=list(date=DATE), + numPrimary=9) +``` + +## Model fitting + +We fit a series of models that represent different hypotheses about +the colonization-extinction dynamics of Swiss crossbills +at a spatial scale of 1 km$^2$. +We fit year effects on colonization and extinction in the means +parameterization, +but for detection probability, we choose an effects parameterization. +The latter is more useful for getting predictions in the presence of +other explanatory variables for that parameter. +For model `fm5` with more complex covariate relationships, we use as +starting values for the optimization routine +the solution from a "neighboring" model with slightly less +complexity, model `fm4`. +Wise choice of starting values can be decisive for success or failure +of maximum likelihood estimation. + + +```r +# A model with constant parameters +fm0 <- colext(~1, ~1, ~1, ~1, umf) + +# Like fm0, but with year-dependent detection +fm1 <- colext(~1, ~1, ~1, ~year, umf) + +# Like fm0, but with year-dependent colonization and extinction +fm2 <- colext(~1, ~year-1, ~year-1, ~1, umf) + +# A fully time-dependent model +fm3 <- colext(~1, ~year-1, ~year-1, ~year, umf) + +# Like fm3 with forest-dependence of 1st-year occupancy +fm4 <- colext(~forest, ~year-1, ~year-1, ~year, umf) + +# Like fm4 with date- and year-dependence of detection +fm5 <- colext(~forest, ~year-1, ~year-1, ~year + date + I(date^2), + umf, starts=c(coef(fm4), 0, 0)) + +# Same as fm5, but with detection in addition depending on forest cover +fm6 <- colext(~forest, ~year-1, ~year-1, ~year + date + I(date^2) + + forest, umf) +``` + +## Model selection + +We can compare models using the Akaike information criterion +($AIC$). +Note that `unmarked` yields $AIC$, not $AIC_c$ +because the latter would require the sample size, +which is not really known for +hierarchical models such as the dynamic occupancy model. + +Model selection and model-averaged prediction in `unmarked` +require that we create a list of models using `fitList`. +This function organizes models and conducts a series of tests to +ensure that the models were fit to the same data. + + +```r +models <- fitList('psi(.)gam(.)eps(.)p(.)' = fm0, + 'psi(.)gam(.)eps(.)p(Y)' = fm1, + 'psi(.)gam(Y)eps(Y)p(.)' = fm2, + 'psi(.)gam(Y)eps(Y)p(Y)' = fm3, + 'psi(F)gam(Y)eps(Y)p(Y)' = fm4, + 'psi(F)gam(Y)eps(Y)p(YD2)' = fm5, + 'psi(F)gam(Y)eps(Y)p(YD2F)' = fm6) +ms <- modSel(models) +ms +``` + +``` +## nPars AIC delta AICwt cumltvWt +## psi(F)gam(Y)eps(Y)p(YD2F) 30 4986.39 0.00 1.0e+00 1.00 +## psi(F)gam(Y)eps(Y)p(YD2) 29 5059.30 72.91 1.5e-16 1.00 +## psi(F)gam(Y)eps(Y)p(Y) 27 5095.38 108.99 2.2e-24 1.00 +## psi(.)gam(.)eps(.)p(Y) 12 5111.32 124.93 7.5e-28 1.00 +## psi(.)gam(Y)eps(Y)p(Y) 26 5127.63 141.24 2.1e-31 1.00 +## psi(.)gam(Y)eps(Y)p(.) 18 5170.54 184.15 1.0e-40 1.00 +## psi(.)gam(.)eps(.)p(.) 4 5193.50 207.11 1.1e-45 1.00 +``` + +One model has overwhelming support, so we can base inference on that +one alone. Before doing so, we point out how to extract coefficients +from a `fitList` object, and convert the results to a +`data.frame`, which could be exported from R. + + +```r +coef(ms) # Estimates only +SE(ms) # Standard errors only +toExport <- as(ms, "data.frame") # Everything +``` + +## Manipulating results: Prediction and plotting + +Fitted models can be used to predict expected outcomes when given new +data. For example, one could ask "how many crossbills would you +expect to find in a quadrat with 50% forest cover?" Prediction also +offers a way of +presenting the results of an analysis. We illustrate by plotting the +predictions of $\psi$ and $p$ over the range of covariate values studied. +Note that because we standardized date, we need to transform it back +to its original scale after obtaining predictions on the +standardized scale. + + +```r +op <- par(mfrow=c(1,2), mai=c(0.8,0.8,0.1,0.1)) + +nd <- data.frame(forest=seq(0, 100, length=50)) +E.psi <- predict(fm6, type="psi", newdata=nd, appendData=TRUE) + +with(E.psi, { + plot(forest, Predicted, ylim=c(0,1), type="l", + xlab="Percent cover of forest", + ylab=expression(hat(psi)), cex.lab=0.8, cex.axis=0.8) + lines(forest, Predicted+1.96*SE, col=gray(0.7)) + lines(forest, Predicted-1.96*SE, col=gray(0.7)) + }) + +nd <- data.frame(date=seq(-2, 2, length=50), + year=factor("2005", levels=c(unique(years))), + forest=50) +E.p <- predict(fm6, type="det", newdata=nd, appendData=TRUE) +E.p$dateOrig <- E.p$date*sd.DATE + mean.DATE + +with(E.p, { + plot(dateOrig, Predicted, ylim=c(0,1), type="l", + xlab="Julian date", ylab=expression( italic(p) ), + cex.lab=0.8, cex.axis=0.8) + lines(dateOrig, Predicted+1.96*SE, col=gray(0.7)) + lines(dateOrig, Predicted-1.96*SE, col=gray(0.7)) + }) +``` + +![Figure 4. Covariates](colext-pred-1.png) + +```r +par(op) +``` + +**Acknowledgments** + +Special thanks goes to Ian Fiske, the author of `colext` and the +original developer of `unmarked`. Andy Royle provided the +initial funding and support for the package. The questions of many +people on the users' list motivated the writing of this document. + +# References + +```{r, echo=FALSE} +options(rmarkdown.html_vignette.check_title = FALSE) +``` |