aboutsummaryrefslogtreecommitdiff
path: root/vignettes/colext.Rmd
diff options
context:
space:
mode:
Diffstat (limited to 'vignettes/colext.Rmd')
-rw-r--r--vignettes/colext.Rmd1091
1 files changed, 1091 insertions, 0 deletions
diff --git a/vignettes/colext.Rmd b/vignettes/colext.Rmd
new file mode 100644
index 0000000..0bfc1a1
--- /dev/null
+++ b/vignettes/colext.Rmd
@@ -0,0 +1,1091 @@
+---
+title: Dynamic occupancy models in unmarked
+author:
+- name: Marc Kéry, Swiss Ornithological Institute
+- name: Richard Chandler, University of Georgia
+date: August 16, 2016
+bibliography: unmarked.bib
+csl: ecology.csl
+output:
+ rmarkdown::html_vignette:
+ fig_width: 5
+ fig_height: 3.5
+ number_sections: true
+ toc: true
+vignette: >
+ %\VignetteIndexEntry{Dynamic occupancy models}
+ %\VignetteEngine{knitr::rmarkdown}
+ \usepackage[utf8]{inputenc}
+---
+
+
+
+# Abstract
+
+Dynamic occupancy models [@mackenzie_estimating_2003] allow inference about
+the occurrence of "things" at collections of "sites"
+and about how changes in occurrence are driven by colonization and
+local extinction. These models also account for imperfect detection
+probability. Depending on how "thing" and "site" are defined,
+occupancy may have vastly different biological meanings,
+including the presence of a disease in an individual (disease
+incidence) of a species at a site (occurrence, distribution), or of an
+individual in a territory.
+Dynamic occupancy models in `unmarked` are fit using the
+function `colext`.
+All parameters can be modeled as functions of covariates, i.e.,
+first-year occupancy with covariates varying by site
+(site-covariates),
+colonization and survival with site- and yearly-site-covariates and
+detection with site-, yearly-site- and sample-occasion-covariates.
+We give two commented example analyses: one for a simulated data set
+and another for a real data set on crossbills in the Swiss breeding
+bird survey MHB.
+We also give examples to show how predictions, along with standard
+errors and confidence intervals, can be obtained.
+
+# Introduction
+
+Occurrence is a quantity of central importance in many branches of
+ecology and related sciences.
+The presence of a disease in an individual or of a species
+at a site are two common types of occurrence studies.
+The associated biological metrics are the incidence of the disease and
+species occurrence or species distribution.
+Thus, depending on how we define the thing we are looking for and the
+sample unit, very different biological quantities can be analyzed
+using statistical models for occupancy.
+
+If we denote presence of the "thing" as $y=1$ and its absence as
+$y=0$, then it is natural to characterize all these metrics by the
+probability that a randomly chosen sample unit ("site") is occupied,
+i.e., has a "thing" present: $Pr(y=1) = \psi$.
+We call this the occupancy probability, or occupancy for short, and
+from now on will call the sample unit,
+where the presence or absence of a "thing" is assessed, generically
+a "site".
+
+Naturally, we would like to explore factors that affect the likelihood
+that a site is occupied.
+A binomial generalized linear model, or logistic regression, is the
+customary statistical model for occurrence.
+In this model, we treat occurrence $y$ as a binomial random variable
+with trial size 1 and success probability $p$, or, equivalently, a
+Bernoulli trial with $p$.
+"Success" means occurrence, so $p$ is the occurrence probability.
+It can be modeled as a linear or other function of covariates via a
+suitable link function, e.g., the logit link.
+This simple model is described in many places, including @McCullagh_1989,
+Royle and Dorazio [-@royle_dorazio:2008, chapter 3], Kéry [-@Kery_2010,
+chapter 17] and Kéry and Schaub [-@Kery_2011, chapter 3].
+
+A generalization of this model accounts for changes in the occupancy
+state of sites by introducing parameters for survival
+(or alternatively, extinction) and colonization probability.
+Thus, when we have observations of occurrence for more than a single
+point in time, we can model the transition of the occupancy
+state at site $i$ between successive times as another Bernoulli trial.
+To model the fate of an occupied site, we denote the probability that
+a site occupied at $t$ is again occupied at $t+1$ as $Pr(y_{i,t+1} = 1
+| y_{i,t} = 1 ) = \phi$.
+This represents the survival probability of a site that is occupied.
+Of course, we could also choose to express this component of occupancy
+dynamics by the converse, extinction probability $\epsilon$ ---
+the parameterization used in `unmarked`.
+To model the fate of an unoccupied site, we denote as $Pr(y_{i,t+1} =
+1 | y_{i,t} = 0 ) = \gamma$ the probability that an unoccupied site at
+$t$ becomes occupied at $t+1$.
+This is the colonization probability of an empty site.
+Such a dynamic model of occurrence has become famous in the ecological literature under the name "metapopulation model" [@Hanski_1998].
+
+However, when using ecological data collected in the field to fit such
+models of occurrence, we face the usual challenge of imperfect
+detection [e.g. @Kery_2008].
+For instance, a species can go unobserved at a surveyed site or an
+occupied territory can appear unoccupied during a particular survey,
+perhaps because both birds are away hunting.
+Not accounting for detection error may seriously bias all parameter
+estimators of a metapopulation model [@Moilanen_2002; @royle_dorazio:2008].
+To account for this additional stochastic component in the generation
+of most ecological field data, the classical metapopulation model may
+be generalized to include a submodel for the observation process,
+which allows an occupied site to be recorded as unoccupied.
+This model has been developed by @mackenzie_estimating_2003. It is
+described as a hierarchical model by @Royle_2007, Royle
+and Dorazio [-@royle_dorazio:2008, chapter 9] and Kéry and Schaub [-@Kery_2011, chapter 13].
+The model is usually called a multi-season, multi-year or a
+dynamic site-occupancy model.
+The former terms denote the fact that it is applied to multiple
+"seasons" or years and the latter emphasizes that the model allows
+for between-season occurrence dynamics.
+
+This vignette describes the use of the `unmarked` function
+`colext` to fit dynamic occupancy models. Note that we will use
+italics for the names of functions.
+Static occupancy models, i.e., for a single season without changes in
+the occupancy state [@mackenzie_estimating_2002], can be fit with `occu`,
+for the model described by @mackenzie_estimating_2002 and @Tyre_2002, and with `occuRN`, for the heterogeneity occupancy model
+described by @royle_estimating_2003.
+In the next section (section 2), we give a more technical description
+of the dynamic occupancy model.
+In section 3, we provide R code for generating data under a basic
+dynamic occupancy model and illustrate use of `colext` for fitting the
+model.
+In section 4, we use real data from the Swiss breeding bird survey MHB
+[@schmid_etal:2004] to fit a few more elaborate models with
+covariates for all parameters.
+We also give examples illustrating how to compute predictions, with
+standard errors and 95% confidence intervals, for the parameters.
+
+# Dynamic occupancy models
+
+To be able to estimate the parameters of the dynamic occupancy model
+(probabilities of occurrence, survival and colonization) separately
+from the parameters for the observation process (detection
+probability), replicate observations are required from a period of
+closure,
+during which the occupancy state of a site must remain constant, i.e.,
+it is either occupied or unoccupied.
+The modeled data $y_{ijt}$ are indicators for whether a species is
+detected at site $i$ ($i = 1, 2, \ldots M$), during replicate survey
+$j$ ($j = 1, 2, \ldots J$) in season $t$ ($t = 1, 2, \ldots T$).
+That is, $y_{ijt}=1$ if at least one individual is detected and
+$y_{ijt}=0$ if none is detected.
+
+The model makes the following assumptions:
+* replicate surveys at a site during a single season are
+ independent (or else dependency must be modeled)
+* occurrence state $z_{it}$ (see below) does not change over
+ replicate surveys at site $i$ during season $t$
+* there are no false-positive errors, i.e., a species can only be
+ overlooked where it occurs, but it cannot be detected where it does
+ not in fact occur (i.e., there are no false-positives)
+
+The complete model consists of one submodel to describe the ecological
+process, or state, and another submodel for the observation process,
+which is dependent on the result of the ecological process.
+The ecological process describes the latent occurrence dynamics for
+all sites in terms of parameters for the probability of initial
+occurrence and site survival and colonization.
+The observation process describes the probability of detecting a
+presence (i.e., $y = 1$) at a site that is occupied and takes account
+of false-negative observation errors.
+
+## Ecological or state process
+
+This initial state is denoted $z_{i1}$ and represents occurrence at
+site $i$ during season 1.
+For this, the model assumes a Bernoulli trial governed by the
+occupancy probability in the first season $\psi_{i1}$:
+
+$$
+z_{i1} = Bernoulli(\psi_{i1})
+$$
+
+We must distinguish the sample quantity "occurrence" at a site, $z$,
+from the population quantity "occupancy probability", $\psi$.
+The former is the realization of a Bernoulli random variable with
+parameter $\psi$.
+This distinction becomes important when we want to compute the number
+of occupied sites among the sample of surveyed sites;
+see @Royle_2007 and @Weir_2009 for this
+distinction.
+
+For all later seasons ($t = 2, 3, \ldots T$), occurrence is a function
+of occurrence at site $i$ at time $t-1$ and one of two parameters that
+describe the colonization-extinction dynamics of the system.
+These dynamic parameters are the probability of local survival
+$\phi_{it}$, also called probability of persistence (= 1 minus the
+probability of local extinction),
+and the probability of colonization $\gamma_{it}$.
+
+$$
+z_{it} \sim Bernoulli(z_{i,t-1} \phi_{it} + (1-z_{i,t-1}) \gamma_{it})
+$$
+
+Hence, if site $i$ is unoccupied at $t-1$ , $z_{i,t-1}=0$, and the
+success probability of the Bernoulli is
+$0*\phi_{it} + (1-0) * \gamma_{it}$, so the site is occupied
+(=colonized) in season $t$ with probability $\gamma_{it}$
+. Conversely, if site $i$ is occupied at $t-1$ , $z_{i,t-1}=1$, and
+the success probability of the Bernoulli is given by $1*\phi_{it} +
+(1-1) * \gamma_{it}$, so the site is occupied in (=survives to) season
+$t$ with probability $\phi_{it}$.
+
+Occupancy probability ($\psi_{it}$) and occurrence ($z_{it}$) at all
+later times $t$ can be computed recursively from $\psi_{i1}$,
+$z_{i1}$ , $\phi_{it}$ and $\gamma_{it}$.
+Variances of these derived estimates can be obtained via the delta
+method or the bootstrap.
+
+## Observation process
+
+To account for the observation error (specifically, false-negative
+observations), the conventional Bernoulli detection process is
+assumed, such that
+
+$$
+y_{ijt} \sim Bernoulli(z_{it} p_{ijt})
+$$
+
+Here, $y_{ijt}$ is the detection probability at site $i$ during
+survey $j$ and season $t$. Detection is conditional on occurrence, and
+multiplying $p_{ijt}$ with $z_{it}$ ensures that occurrence can only
+be detected where in fact a species occurs, i.e. where $z_{it}=1$.
+
+## Modeling of parameters
+
+The preceding, fully general model description allows for site-($i$)
+dependence of all parameters. In addition to that, survival and
+colonization probabilities may be season-($t$)dependent and detection
+probability season-($t$) and survey-($j$) dependent.
+All this complexity may be dropped, especially the dependence on
+sites. On the other hand, all parameters that are indexed in some way
+can be modeled, e.g., as functions of covariates that vary along the
+dimension denoted by an index. We will fit linear functions (on the
+logit link scale) of covariates into first-year occupancy, survival
+and colonization and into detection probability.
+That is, for probabilities of first-year occupancy, survival,
+colonization and detection, respectively, we will fit models of the
+form
+ $logit(\psi_{i1}) = \alpha + \beta x_i$, where $x_i$ may be forest
+ cover or elevation of site $i$ ,
+ $logit(\phi_{it}) = \alpha + \beta x_{it}$, where $x_{it}$ may be
+ tree mast at site $i$ during season $t$,
+ $logit(\gamma_{it}) = \alpha + \beta x_{it}$, for a similarly
+ defined covariate $x_{it}$, or
+ $logit(p_{ijt}) = \alpha + \beta x_{ijt}$ , where $x_{ijt}$ is the
+ Julian date of the survey $j$ at site $i$ in season $t$.
+
+We note that for first-year occupancy, only covariates that vary among
+sites ("site covariates") can be fitted, while for survival and
+colonization, covariates that vary by site and by season ("yearly
+site covariates") may be fitted as well.
+For detection, covariates of three formats may be fitted:
+"site-covariates", "yearly-site-covariates" and
+"observation-covariates", as
+they are called in `unmarked`.
+
+# Dynamic occupancy models for simulated data
+
+We first generate a simple, simulated data set
+with specified, year-specific values for
+the parameters as well as design specifications, i.e., number of
+sites, years and surveys per year.
+Then, we show how to fit a dynamic occupancy model with
+year-dependence in the parameters for colonization, extinction and
+detection probability.
+
+## Simulating, formatting, and summarizing data
+
+To simulate the data, we execute the following R code.
+The actual values for these parameters for each year are drawn
+randomly from a uniform distribution with
+the specified bounds.
+
+
+```r
+M <- 250 # Number of sites
+J <- 3 # num secondary sample periods
+T <- 10 # num primary sample periods
+
+psi <- rep(NA, T) # Occupancy probability
+muZ <- z <- array(dim = c(M, T)) # Expected and realized occurrence
+y <- array(NA, dim = c(M, J, T)) # Detection histories
+
+set.seed(13973)
+psi[1] <- 0.4 # Initial occupancy probability
+p <- c(0.3,0.4,0.5,0.5,0.1,0.3,0.5,0.5,0.6,0.2)
+phi <- runif(n=T-1, min=0.6, max=0.8) # Survival probability (1-epsilon)
+gamma <- runif(n=T-1, min=0.1, max=0.2) # Colonization probability
+
+# Generate latent states of occurrence
+# First year
+z[,1] <- rbinom(M, 1, psi[1]) # Initial occupancy state
+# Later years
+for(i in 1:M){ # Loop over sites
+ for(k in 2:T){ # Loop over years
+ muZ[k] <- z[i, k-1]*phi[k-1] + (1-z[i, k-1])*gamma[k-1]
+ z[i,k] <- rbinom(1, 1, muZ[k])
+ }
+}
+
+# Generate detection/non-detection data
+for(i in 1:M){
+ for(k in 1:T){
+ prob <- z[i,k] * p[k]
+ for(j in 1:J){
+ y[i,j,k] <- rbinom(1, 1, prob)
+ }
+ }
+}
+
+# Compute annual population occupancy
+for (k in 2:T){
+ psi[k] <- psi[k-1]*phi[k-1] + (1-psi[k-1])*gamma[k-1]
+ }
+```
+
+We have now generated a single realization from the stochastic system
+thus defined. Figure 1
+illustrates the fundamental issue
+of imperfect detection --- the actual proportion of sites occupied
+differs greatly from the observed proportion of sites occupied, and
+because $p$ varies among years, the observed data cannot be used as a
+valid index of the parameter of interest $\psi_i$.
+
+
+
+
+```r
+plot(1:T, colMeans(z), type = "b", xlab = "Year",
+ ylab = "Proportion of sites occupied",
+ col = "black", xlim=c(0.5, 10.5), xaxp=c(1,10,9),
+ ylim = c(0,0.6), lwd = 2, lty = 1,
+ frame.plot = FALSE, las = 1, pch=16)
+
+psi.app <- colMeans(apply(y, c(1,3), max))
+lines(1:T, psi.app, type = "b", col = "blue", lty=3, lwd = 2)
+legend(1, 0.6, c("truth", "observed"),
+ col=c("black", "blue"), lty=c(1,3), pch=c(16,1))
+```
+
+![Figure 1. Summary of the multi-year occupancy data set generated.](colext-data-1.png)
+
+To analyze this data set with a dynamic occupancy model in
+`unmarked`, we first load the package.
+
+
+```r
+library(unmarked)
+```
+
+Next, we reformat the detection/non-detection data from a 3-dimensional
+array (as generated) into a 2-dimensional matrix with M rows.
+That is, we put the annual tables of data (the slices of the former
+3-D array) sideways to produce a "wide" layout of the data.
+
+
+```r
+yy <- matrix(y, M, J*T)
+```
+
+Next, we create a matrix indicating the year each site was surveyed.
+
+
+```r
+year <- matrix(c('01','02','03','04','05','06','07','08','09','10'),
+ nrow(yy), T, byrow=TRUE)
+```
+
+To organize the data in the format required by `colext`, we make
+use of the function `unmarkedMultFrame`. The only required
+arguments are `y`, the detection/non-detection data, and
+`numPrimary`, the number of seasons. The three types of
+covariates described earlier can also be supplied using the arguments
+`siteCovs`, `yearlySiteCovs`, and `obsCovs`. In this case,
+we only make use of the second type, which must have M rows and T
+columns.
+
+
+```r
+simUMF <- unmarkedMultFrame(
+ y = yy,
+ yearlySiteCovs = list(year = year),
+ numPrimary=T)
+summary(simUMF)
+```
+
+```
+## unmarkedFrame Object
+##
+## 250 sites
+## Maximum number of observations per site: 30
+## Mean number of observations per site: 30
+## Number of primary survey periods: 10
+## Number of secondary survey periods: 3
+## Sites with at least one detection: 195
+##
+## Tabulation of y observations:
+## 0 1
+## 6430 1070
+##
+## Yearly-site-level covariates:
+## year
+## 01 : 250
+## 02 : 250
+## 03 : 250
+## 04 : 250
+## 05 : 250
+## 06 : 250
+## (Other):1000
+```
+
+## Model fitting
+
+We are ready to fit a few dynamic occupancy models.
+We will fit a model with constant values for all parameters and
+another with full time-dependence for colonization, extinction and
+detection probability. We also time the calculations.
+
+
+```r
+# Model with all constant parameters
+m0 <- colext(psiformula= ~1, gammaformula = ~ 1, epsilonformula = ~ 1,
+ pformula = ~ 1, data = simUMF, method="BFGS")
+summary(m0)
+```
+
+```
+##
+## Call:
+## colext(psiformula = ~1, gammaformula = ~1, epsilonformula = ~1,
+## pformula = ~1, data = simUMF, method = "BFGS")
+##
+## Initial (logit-scale):
+## Estimate SE z P(>|z|)
+## -0.813 0.158 -5.16 2.46e-07
+##
+## Colonization (logit-scale):
+## Estimate SE z P(>|z|)
+## -1.77 0.0807 -22 2.75e-107
+##
+## Extinction (logit-scale):
+## Estimate SE z P(>|z|)
+## -0.59 0.102 -5.79 7.04e-09
+##
+## Detection (logit-scale):
+## Estimate SE z P(>|z|)
+## -0.0837 0.0562 -1.49 0.137
+##
+## AIC: 4972.597
+## Number of sites: 250
+## optim convergence code: 0
+## optim iterations: 27
+## Bootstrap iterations: 0
+```
+
+The computation time was only a few seconds.
+Note that all parameters were estimated on the logit scale. To
+back-transform to the original scale, we can simply use the
+inverse-logit function, named `plogis` in R.
+
+
+```r
+plogis(-0.813)
+```
+
+```
+## [1] 0.3072516
+```
+
+Alternatively, we can use `backTransform`, which
+computes standard errors using the delta method. Confidence intervals
+are also easily obtained using the function `confint`.
+We first remind ourselves of the names of parameters, which can all be
+used as arguments for these functions.
+
+
+```r
+names(m0)
+```
+
+```
+## [1] "psi" "col" "ext" "det"
+```
+
+```r
+backTransform(m0, type="psi")
+```
+
+```
+## Backtransformed linear combination(s) of Initial estimate(s)
+##
+## Estimate SE LinComb (Intercept)
+## 0.307 0.0335 -0.813 1
+##
+## Transformation: logistic
+```
+
+```r
+confint(backTransform(m0, type="psi"))
+```
+
+```
+## 0.025 0.975
+## 0.2457313 0.3765804
+```
+
+Next, we fit the dynamic occupancy model with full year-dependence in
+the parameters describing occupancy dynamics and also in detection.
+This is the same model under which we generated the data set, so we
+would expect accurate estimates.
+
+By default in R, a factor such as year in this analysis, is a
+parameterized in terms of an intercept and effects representing
+differences. This would mean that the parameter for the first year is
+the intercept and the effects would denote the differences between
+the parameter values in all other years, relative to the parameter
+value in the first year, which serves as a reference level.
+This treatment or effects parameterization is useful for testing for
+differences. For simple presentation, a means parameterization is more
+practical. It can be specified by adding a -1 to the formula for the
+time-dependent parameters.
+
+
+```r
+m1 <- colext(psiformula = ~1, # First-year occupancy
+ gammaformula = ~ year-1, # Colonization
+ epsilonformula = ~ year-1, # Extinction
+ pformula = ~ year-1, # Detection
+ data = simUMF)
+m1
+```
+
+```
+##
+## Call:
+## colext(psiformula = ~1, gammaformula = ~year - 1, epsilonformula = ~year -
+## 1, pformula = ~year - 1, data = simUMF)
+##
+## Initial:
+## Estimate SE z P(>|z|)
+## -0.273 0.302 -0.906 0.365
+##
+## Colonization:
+## Estimate SE z P(>|z|)
+## year01 -2.08 0.951 -2.19 2.86e-02
+## year02 -2.18 0.365 -5.96 2.52e-09
+## year03 -1.98 0.274 -7.23 4.88e-13
+## year04 -2.32 0.678 -3.42 6.37e-04
+## year05 -1.89 0.478 -3.95 7.78e-05
+## year06 -1.76 0.294 -5.97 2.44e-09
+## year07 -1.55 0.230 -6.73 1.75e-11
+## year08 -1.43 0.228 -6.29 3.19e-10
+## year09 -2.35 0.470 -5.00 5.64e-07
+##
+## Extinction:
+## Estimate SE z P(>|z|)
+## year01 -1.4209 0.418 -3.401 6.72e-04
+## year02 -0.4808 0.239 -2.009 4.45e-02
+## year03 -1.2606 0.366 -3.440 5.83e-04
+## year04 -0.0907 0.650 -0.139 8.89e-01
+## year05 -0.6456 0.599 -1.078 2.81e-01
+## year06 -0.9586 0.378 -2.539 1.11e-02
+## year07 -1.2279 0.365 -3.362 7.74e-04
+## year08 -1.1894 0.292 -4.076 4.58e-05
+## year09 -0.6292 0.635 -0.991 3.22e-01
+##
+## Detection:
+## Estimate SE z P(>|z|)
+## year01 -1.0824 0.244 -4.434 9.26e-06
+## year02 -0.2232 0.148 -1.508 1.32e-01
+## year03 0.2951 0.154 1.918 5.52e-02
+## year04 0.0662 0.161 0.412 6.81e-01
+## year05 -2.0396 0.433 -4.706 2.52e-06
+## year06 -0.6982 0.232 -3.005 2.66e-03
+## year07 0.2413 0.165 1.466 1.43e-01
+## year08 0.0847 0.155 0.548 5.84e-01
+## year09 0.6052 0.140 4.338 1.44e-05
+## year10 -1.1699 0.306 -3.828 1.29e-04
+##
+## AIC: 4779.172
+```
+
+## Manipulating results: prediction and plotting
+
+Again, all estimates are shown on the logit-scale. Back-transforming
+estimates when covariates, such as year, are present involves an
+extra step. Specifically, we need to tell `unmarked` the values
+of our covariate
+at which we want an estimate. This can be done using
+`backTransform` in combination with `linearComb`, although
+it can be easier to use `predict`. `predict` allows the user
+to supply a data.frame in which each row represents a combination of
+covariate values of interest. Below, we create data.frames called
+`nd` with each row representing a year.
+Then we request yearly estimates of the probability of extinction,
+colonization and detection,
+and compare them to "truth", i.e., the values with which we
+simulated the data set. Note that there are T-1 extinction and
+colonization parameters in this case, so we do not need to include
+year 10 in `nd`.
+
+
+```r
+nd <- data.frame(year=c('01','02','03','04','05','06','07','08','09'))
+E.ext <- predict(m1, type='ext', newdata=nd)
+E.col <- predict(m1, type='col', newdata=nd)
+nd <- data.frame(year=c('01','02','03','04','05','06','07','08','09','10'))
+E.det <- predict(m1, type='det', newdata=nd)
+```
+
+`predict` returns the predictions along with standard errors and
+confidence intervals. These can be used to create plots. The
+`with` function is used to simplify the process of requesting the
+columns of `data.frame` returned by `predict`.
+
+
+```r
+op <- par(mfrow=c(3,1), mai=c(0.6, 0.6, 0.1, 0.1))
+
+with(E.ext, { # Plot for extinction probability
+ plot(1:9, Predicted, pch=1, xaxt='n', xlab='Year',
+ ylab=expression(paste('Extinction probability ( ', epsilon, ' )')),
+ ylim=c(0,1), col=4)
+ axis(1, at=1:9, labels=nd$year[1:9])
+ arrows(1:9, lower, 1:9, upper, code=3, angle=90, length=0.03, col=4)
+ points((1:9)-0.1, 1-phi, col=1, lwd = 1, pch=16)
+ legend(7, 1, c('Parameter', 'Estimate'), col=c(1,4), pch=c(16, 1),
+ cex=0.8)
+ })
+
+with(E.col, { # Plot for colonization probability
+ plot(1:9, Predicted, pch=1, xaxt='n', xlab='Year',
+ ylab=expression(paste('Colonization probability ( ', gamma, ' )')),
+ ylim=c(0,1), col=4)
+ axis(1, at=1:9, labels=nd$year[1:9])
+ arrows(1:9, lower, 1:9, upper, code=3, angle=90, length=0.03, col=4)
+ points((1:9)-0.1, gamma, col=1, lwd = 1, pch=16)
+ legend(7, 1, c('Parameter', 'Estimate'), col=c(1,4), pch=c(16, 1),
+ cex=0.8)
+ })
+
+with(E.det, { # Plot for detection probability: note 10 years
+ plot(1:10, Predicted, pch=1, xaxt='n', xlab='Year',
+ ylab=expression(paste('Detection probability ( ', p, ' )')),
+ ylim=c(0,1), col=4)
+ axis(1, at=1:10, labels=nd$year)
+ arrows(1:10, lower, 1:10, upper, code=3, angle=90, length=0.03, col=4)
+ points((1:10)-0.1, p, col=1, lwd = 1, pch=16)
+ legend(7.5, 1, c('Parameter','Estimate'), col=c(1,4), pch=c(16, 1),
+ cex=0.8)
+ })
+```
+
+![Figure 2. Yearly estimates of parameters](colext-est-1.png)
+
+```r
+par(op)
+```
+
+Figure 2 shows that the 95% confidence intervals
+include the true parameter values, and the point estimates are not too
+far off.
+
+## Derived parameters
+
+Estimates of occupancy probability in years $T>1$ must be derived from the
+estimates of first-year occupancy and the two parameters governing the
+dynamics, extinction/survival and colonization.
+`unmarked` does this automatically in two ways. First, the
+population-level estimates of occupancy probability
+$\psi_t = \psi_{t-1}\phi_{t-1} + (1-\phi_{t-1})\gamma$ are calculated
+and stored in the slot named \emph{projected}. Slots can be accessed
+using the `@` operator, e.g. `fm@projected`.
+In some cases, interest may lie in making
+inference about the proportion of the sampled sites that are occupied,
+rather than the entire population of sites. These estimates are
+contained in the `smoothed` slot of the fitted model. Thus, the
+`projected` values are estimates of population parameters, and
+the `smoothed` estimates are of the finite-sample
+quantities. Discussions of the differences can be found in @Weir_2009.
+
+Bootstrap methods can be used to compute standard errors of derived
+parameter estimates. Here we employ a non-parametric bootstrap to obtain
+standard errors of the smoothed estimates of occupancy probability
+during each year.
+
+
+```r
+m1 <- nonparboot(m1, B = 10)
+cbind(psi=psi, smoothed=smoothed(m1)[2,], SE=m1@smoothed.mean.bsse[2,])
+```
+
+```
+## psi smoothed SE
+## 1 0.4000000 0.4320671 0.06783911
+## 2 0.3493746 0.4110124 0.03786402
+## 3 0.2977125 0.3139967 0.02780818
+## 4 0.3148447 0.3278179 0.04303542
+## 5 0.3192990 0.2316695 0.10858419
+## 6 0.2915934 0.2528485 0.04179036
+## 7 0.3114415 0.2928429 0.03113920
+## 8 0.3636580 0.3504885 0.04224678
+## 9 0.3654064 0.3936991 0.02103870
+## 10 0.3460641 0.3095786 0.06830698
+```
+
+In practice, `B` should be much higher, possibly >1000 for complex
+models .
+
+Another derived parameters of interest is turnover probability
+
+$$
+\tau_t = \frac{\gamma_{t-1}(1-\psi_{t-1})}{\gamma_{t-1}(1-\psi_{t-1})
+ + \phi_{t-1}\psi_{t-1}}
+$$
+
+The following function returns these estimates.
+
+
+```r
+turnover <- function(fm) {
+ psi.hat <- plogis(coef(fm, type="psi"))
+ if(length(psi.hat) > 1)
+ stop("this function only works if psi is scalar")
+ T <- getData(fm)@numPrimary
+ tau.hat <- numeric(T-1)
+ gamma.hat <- plogis(coef(fm, type="col"))
+ phi.hat <- 1 - plogis(coef(fm, type="ext"))
+ if(length(gamma.hat) != T-1 | length(phi.hat) != T-1)
+ stop("this function only works if gamma and phi T-1 vectors")
+ for(t in 2:T) {
+ psi.hat[t] <- psi.hat[t-1]*phi.hat[t-1] +
+ (1-psi.hat[t-1])*gamma.hat[t-1]
+ tau.hat[t-1] <- gamma.hat[t-1]*(1-psi.hat[t-1]) / psi.hat[t]
+ }
+ return(tau.hat)
+ }
+```
+
+The bootstrap again offers a means of estimating variance. Here we
+show how to generate 95\% confidence intervals for the turnover
+estimates using the parametric bootstrap.
+
+
+```r
+pb <- parboot(m1, statistic=turnover, nsim=2)
+turnCI <- cbind(pb@t0,
+ t(apply(pb@t.star, 2, quantile, probs=c(0.025, 0.975))))
+colnames(turnCI) <- c("tau", "lower", "upper")
+turnCI
+```
+
+```
+## tau lower upper
+## t*1 0.1532645 0.00536045 0.1974714
+## t*2 0.1911530 0.07881180 0.2119585
+## t*3 0.2537292 0.19777204 0.2785973
+## t*4 0.2604356 0.04063769 0.4197328
+## t*5 0.3989303 0.34078483 0.4720357
+## t*6 0.3758690 0.32703698 0.5370796
+## t*7 0.3537473 0.32696166 0.3564059
+## t*8 0.3174983 0.32925238 0.4139696
+## t*9 0.1704449 0.18946470 0.3186236
+```
+
+Which bootstrap method is most appropriate for variance estimation?
+For detailed distinctions between the
+non-parametric and the parametric bootstrap, see @Davison_1997.
+We note simply that the parametric bootstrap resamples from
+the fitted model, and thus the
+measures of uncertainty are purely
+functions of the distributions assumed by the model. Non-parametric
+bootstrap samples, in contrast, are obtained by resampling the
+data, not the model, and thus are not necessarily affected by the
+variance formulas of the model's distributions.
+
+## Goodness-of-fit
+
+In addition to estimating the variance of an estimate, the parametric
+bootstrap can be used to assess goodness-of-fit. For this purpose, a
+fit-statistic, i.e. one that compares
+observed and expected values, is evaluated using the original fitted
+model, and numerous other models fitted to simulated datasets. The
+simulation yields an approximation of
+the distribution of the fit-statistic, and a \emph{P}-value
+can be computed as the proportion of simulated values greater than the
+observed value.
+
+@Hosmer_1997 found that a $\chi^2$ statistic performed
+reasonably well in assessing lack of fit for logistic regression
+models. We know of no studies formally
+evaluating the performance of various fit-statistics for dynamic
+occupancy models, so this approach should be
+considered experimental. Fit-statistics applied to aggregated
+encounter histories offer an alternative approach [@MacKenzie_2004], but are difficult to implement when J*T is high and
+missing values or continuous covariates are present.
+
+
+```r
+chisq <- function(fm) {
+ umf <- getData(fm)
+ y <- getY(umf)
+ sr <- fm@sitesRemoved
+ if(length(sr)>0)
+ y <- y[-sr,,drop=FALSE]
+ fv <- fitted(fm, na.rm=TRUE)
+ y[is.na(fv)] <- NA
+ sum((y-fv)^2/(fv*(1-fv)))
+ }
+
+set.seed(344)
+pb.gof <- parboot(m0, statistic=chisq, nsim=100)
+
+plot(pb.gof, xlab=expression(chi^2), main="", col=gray(0.95),
+ xlim=c(7300, 7700))
+```
+
+![Figure 3. Goodness-of-fit](colext-gof-1.png)
+
+Figure 3 indicates that, as expected, the constant
+parameter model does not fit the data well.
+
+# Dynamic occupancy models for crossbill data from the Swiss MHB
+
+## The crossbill data set
+
+The crossbill data are included with the `unmarked` package.
+The dataset contains the results of nine years of surveys (1999--2007)
+for the European crossbill (*Loxia curvirostra*),
+a pine-seed eating finch, in 267 1-km$^2$ sample quadrats in Switzerland.
+Quadrats are surveyed annually as part of the Swiss breeding bird
+survey MHB [@schmid_etal:2004].
+They are laid out as a grid over Switzerland and surveyed 2 or 3 times
+every breeding season (mid-April to late June)
+by experienced field ornithologists along a haphazard survey route of
+length 1-9 km (average 5 km).
+High-elevation sites are only surveyed twice per breeding season.
+
+## Importing, formatting, and summarizing data
+
+The data can be loaded into an open R workspace using the `data` command.
+
+
+```r
+data(crossbill)
+colnames(crossbill)
+```
+
+```
+## [1] "id" "ele" "forest" "surveys" "det991" "det992" "det993"
+## [8] "det001" "det002" "det003" "det011" "det012" "det013" "det021"
+## [15] "det022" "det023" "det031" "det032" "det033" "det041" "det042"
+## [22] "det043" "det051" "det052" "det053" "det061" "det062" "det063"
+## [29] "det071" "det072" "det073" "date991" "date992" "date993" "date001"
+## [36] "date002" "date003" "date011" "date012" "date013" "date021" "date022"
+## [43] "date023" "date031" "date032" "date033" "date041" "date042" "date043"
+## [50] "date051" "date052" "date053" "date061" "date062" "date063" "date071"
+## [57] "date072" "date073"
+```
+
+We have three covariates that vary by site: median elevation of the
+quadrat (`ele`, in metres), forest cover of the quadrat (`forest`, in
+percent) and the number of surveys per season (i.e., 2 or 3,
+surveys).
+These are called site covariates, because they vary by sites only.
+The 27 columns entitled `det991` - `det073` contain the crossbill
+detection/nondetection data during all surveys over the 9 years.
+They contain a 1 when at least one crossbill was recorded during a
+survey and a 0 otherwise.
+`NA`s indicate surveys that did not take place, either because a site is
+high-elevation and has no third survey or because it failed to be
+surveyed altogether in a year.
+The final 27 columns entitled `date991` - `date073` give the Julian
+date of each survey.
+They represent a "survey-covariate" or "observation covariate".
+We note that the paper by @Royle_2007 used a subset of this
+data set.
+
+AIC-based model selection (see section 5.4) requires
+that all models are fit to the same data.
+`unmarked` removes missing data in a context specific way. For
+missing `siteCovs`, the entire row of data must be removed. However, for
+missing `yearlySiteCovs` or `obsCovs`, only the
+corresponding observation
+are removed. Thus, if `unmarked` removes different observations
+from different models, the models cannot be compared using AIC. A way
+around this is to remove the detection data corresponding to
+missing covariates before fitting the models.
+The crossbill data have missing dates and so we remove the associated
+detection/non-detection data.
+
+
+
+```r
+DATE <- as.matrix(crossbill[,32:58])
+y.cross <- as.matrix(crossbill[,5:31])
+y.cross[is.na(DATE) != is.na(y.cross)] <- NA
+```
+
+In addition, continuous covariates should be transformed in a way
+that brings their values close to zero in order to improve
+or even enable numerical convergence of the maximum-likelihood routine.
+We do this "by hand" and note that we could also have used the R
+function `scale`. We subtract the mean and divide by the standard
+deviation.
+
+
+```r
+sd.DATE <- sd(c(DATE), na.rm=TRUE)
+mean.DATE <- mean(DATE, na.rm=TRUE)
+DATE <- (DATE - mean.DATE) / sd.DATE
+```
+
+Before we can fit occupancy models, we need to format this data set
+appropriately.
+
+
+```r
+years <- as.character(1999:2007)
+years <- matrix(years, nrow(crossbill), 9, byrow=TRUE)
+umf <- unmarkedMultFrame(y=y.cross,
+ siteCovs=crossbill[,2:3], yearlySiteCovs=list(year=years),
+ obsCovs=list(date=DATE),
+ numPrimary=9)
+```
+
+## Model fitting
+
+We fit a series of models that represent different hypotheses about
+the colonization-extinction dynamics of Swiss crossbills
+at a spatial scale of 1 km$^2$.
+We fit year effects on colonization and extinction in the means
+parameterization,
+but for detection probability, we choose an effects parameterization.
+The latter is more useful for getting predictions in the presence of
+other explanatory variables for that parameter.
+For model `fm5` with more complex covariate relationships, we use as
+starting values for the optimization routine
+the solution from a "neighboring" model with slightly less
+complexity, model `fm4`.
+Wise choice of starting values can be decisive for success or failure
+of maximum likelihood estimation.
+
+
+```r
+# A model with constant parameters
+fm0 <- colext(~1, ~1, ~1, ~1, umf)
+
+# Like fm0, but with year-dependent detection
+fm1 <- colext(~1, ~1, ~1, ~year, umf)
+
+# Like fm0, but with year-dependent colonization and extinction
+fm2 <- colext(~1, ~year-1, ~year-1, ~1, umf)
+
+# A fully time-dependent model
+fm3 <- colext(~1, ~year-1, ~year-1, ~year, umf)
+
+# Like fm3 with forest-dependence of 1st-year occupancy
+fm4 <- colext(~forest, ~year-1, ~year-1, ~year, umf)
+
+# Like fm4 with date- and year-dependence of detection
+fm5 <- colext(~forest, ~year-1, ~year-1, ~year + date + I(date^2),
+ umf, starts=c(coef(fm4), 0, 0))
+
+# Same as fm5, but with detection in addition depending on forest cover
+fm6 <- colext(~forest, ~year-1, ~year-1, ~year + date + I(date^2) +
+ forest, umf)
+```
+
+## Model selection
+
+We can compare models using the Akaike information criterion
+($AIC$).
+Note that `unmarked` yields $AIC$, not $AIC_c$
+because the latter would require the sample size,
+which is not really known for
+hierarchical models such as the dynamic occupancy model.
+
+Model selection and model-averaged prediction in `unmarked`
+require that we create a list of models using `fitList`.
+This function organizes models and conducts a series of tests to
+ensure that the models were fit to the same data.
+
+
+```r
+models <- fitList('psi(.)gam(.)eps(.)p(.)' = fm0,
+ 'psi(.)gam(.)eps(.)p(Y)' = fm1,
+ 'psi(.)gam(Y)eps(Y)p(.)' = fm2,
+ 'psi(.)gam(Y)eps(Y)p(Y)' = fm3,
+ 'psi(F)gam(Y)eps(Y)p(Y)' = fm4,
+ 'psi(F)gam(Y)eps(Y)p(YD2)' = fm5,
+ 'psi(F)gam(Y)eps(Y)p(YD2F)' = fm6)
+ms <- modSel(models)
+ms
+```
+
+```
+## nPars AIC delta AICwt cumltvWt
+## psi(F)gam(Y)eps(Y)p(YD2F) 30 4986.39 0.00 1.0e+00 1.00
+## psi(F)gam(Y)eps(Y)p(YD2) 29 5059.30 72.91 1.5e-16 1.00
+## psi(F)gam(Y)eps(Y)p(Y) 27 5095.38 108.99 2.2e-24 1.00
+## psi(.)gam(.)eps(.)p(Y) 12 5111.32 124.93 7.5e-28 1.00
+## psi(.)gam(Y)eps(Y)p(Y) 26 5127.63 141.24 2.1e-31 1.00
+## psi(.)gam(Y)eps(Y)p(.) 18 5170.54 184.15 1.0e-40 1.00
+## psi(.)gam(.)eps(.)p(.) 4 5193.50 207.11 1.1e-45 1.00
+```
+
+One model has overwhelming support, so we can base inference on that
+one alone. Before doing so, we point out how to extract coefficients
+from a `fitList` object, and convert the results to a
+`data.frame`, which could be exported from R.
+
+
+```r
+coef(ms) # Estimates only
+SE(ms) # Standard errors only
+toExport <- as(ms, "data.frame") # Everything
+```
+
+## Manipulating results: Prediction and plotting
+
+Fitted models can be used to predict expected outcomes when given new
+data. For example, one could ask "how many crossbills would you
+expect to find in a quadrat with 50% forest cover?" Prediction also
+offers a way of
+presenting the results of an analysis. We illustrate by plotting the
+predictions of $\psi$ and $p$ over the range of covariate values studied.
+Note that because we standardized date, we need to transform it back
+to its original scale after obtaining predictions on the
+standardized scale.
+
+
+```r
+op <- par(mfrow=c(1,2), mai=c(0.8,0.8,0.1,0.1))
+
+nd <- data.frame(forest=seq(0, 100, length=50))
+E.psi <- predict(fm6, type="psi", newdata=nd, appendData=TRUE)
+
+with(E.psi, {
+ plot(forest, Predicted, ylim=c(0,1), type="l",
+ xlab="Percent cover of forest",
+ ylab=expression(hat(psi)), cex.lab=0.8, cex.axis=0.8)
+ lines(forest, Predicted+1.96*SE, col=gray(0.7))
+ lines(forest, Predicted-1.96*SE, col=gray(0.7))
+ })
+
+nd <- data.frame(date=seq(-2, 2, length=50),
+ year=factor("2005", levels=c(unique(years))),
+ forest=50)
+E.p <- predict(fm6, type="det", newdata=nd, appendData=TRUE)
+E.p$dateOrig <- E.p$date*sd.DATE + mean.DATE
+
+with(E.p, {
+ plot(dateOrig, Predicted, ylim=c(0,1), type="l",
+ xlab="Julian date", ylab=expression( italic(p) ),
+ cex.lab=0.8, cex.axis=0.8)
+ lines(dateOrig, Predicted+1.96*SE, col=gray(0.7))
+ lines(dateOrig, Predicted-1.96*SE, col=gray(0.7))
+ })
+```
+
+![Figure 4. Covariates](colext-pred-1.png)
+
+```r
+par(op)
+```
+
+**Acknowledgments**
+
+Special thanks goes to Ian Fiske, the author of `colext` and the
+original developer of `unmarked`. Andy Royle provided the
+initial funding and support for the package. The questions of many
+people on the users' list motivated the writing of this document.
+
+# References
+
+```{r, echo=FALSE}
+options(rmarkdown.html_vignette.check_title = FALSE)
+```