vignettes/colext.Rmd


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091

---
title: Dynamic occupancy models in unmarked
author:
- name: Marc Kéry, Swiss Ornithological Institute
- name: Richard Chandler, University of Georgia
date: August 16, 2016
bibliography: unmarked.bib
csl: ecology.csl
output: 
  rmarkdown::html_vignette:
    fig_width: 5
    fig_height: 3.5
    number_sections: true
    toc: true
vignette: >
  %\VignetteIndexEntry{Dynamic occupancy models}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---


# Abstract

Dynamic occupancy models [@mackenzie_estimating_2003] allow inference about
the occurrence of "things" at collections of "sites"
and about how changes in occurrence are driven by colonization and
local extinction. These models also account for imperfect detection
probability. Depending on how "thing" and "site" are defined,
occupancy may have vastly different biological meanings,
including the presence of a disease in an individual (disease
incidence) of a species at a site (occurrence, distribution), or of an
individual in a territory.
Dynamic occupancy models in `unmarked` are fit using the
function `colext`.
All parameters can be modeled as functions of covariates, i.e.,
first-year occupancy with covariates varying by site
(site-covariates),
colonization and survival with site- and yearly-site-covariates and
detection with site-, yearly-site- and sample-occasion-covariates.
We give two commented example analyses: one for a simulated data set
and another for a real data set on crossbills in the Swiss breeding
bird survey MHB.
We also give examples to show how predictions, along with standard
errors and confidence intervals, can be obtained.

# Introduction

Occurrence is a quantity of central importance in many branches of
ecology and related sciences.
The presence of a disease in an individual or of a species
at a site are two common types of occurrence studies.
The associated biological metrics are the incidence of the disease and
species occurrence or species distribution.
Thus, depending on how we define the thing we are looking for and the
sample unit, very different biological quantities can be analyzed
using statistical models for occupancy.

If we denote presence of the "thing" as $y=1$ and its absence as
$y=0$, then it is natural to characterize all these metrics by the
probability that a randomly chosen sample unit ("site") is occupied,
i.e., has a "thing" present: $Pr(y=1) = \psi$.
We call this the occupancy probability, or occupancy for short, and
from now on will call the sample unit,
where the presence or absence of a "thing" is assessed, generically
a "site".

Naturally, we would like to explore factors that affect the likelihood
that a site is occupied.
A binomial generalized linear model, or logistic regression, is the
customary statistical model for occurrence.
In this model, we treat occurrence $y$ as a binomial random variable
with trial size 1 and success probability $p$, or, equivalently, a
Bernoulli trial with $p$.
"Success" means occurrence, so $p$ is the occurrence probability.
It can be modeled as a linear or other function of covariates via a
suitable link function, e.g., the logit link.
This simple model is described in many places, including @McCullagh_1989, 
Royle and Dorazio [-@royle_dorazio:2008, chapter 3], Kéry [-@Kery_2010,
chapter 17] and Kéry and Schaub [-@Kery_2011, chapter 3].

A generalization of this model accounts for changes in the occupancy
state of sites by introducing parameters for survival
(or alternatively, extinction) and colonization probability.
Thus, when we have observations of occurrence for more than a single
point in time, we can model the transition of the occupancy
state at site $i$ between successive times as another Bernoulli trial.
To model the fate of an occupied site, we denote the probability that
a site occupied at $t$ is again occupied at $t+1$ as $Pr(y_{i,t+1} = 1
| y_{i,t} = 1 ) = \phi$.
This represents the survival probability of a site that is occupied.
Of course, we could also choose to express this component of occupancy
dynamics by the converse, extinction probability $\epsilon$ ---
the parameterization used in `unmarked`.
To model the fate of an unoccupied site, we denote as $Pr(y_{i,t+1} =
1 | y_{i,t} = 0 ) = \gamma$ the probability that an unoccupied site at
$t$ becomes occupied at $t+1$.
This is the colonization probability of an empty site.
Such a dynamic model of occurrence has become famous in the ecological literature under the name "metapopulation model" [@Hanski_1998].

However, when using ecological data collected in the field to fit such
models of occurrence, we face the usual challenge of imperfect
detection [e.g. @Kery_2008].
For instance, a species can go unobserved at a surveyed site or an
occupied territory can appear unoccupied during a particular survey,
perhaps because both birds are away hunting.
Not accounting for detection error may seriously bias all parameter
estimators of a metapopulation model [@Moilanen_2002; @royle_dorazio:2008].
To account for this additional stochastic component in the generation
of most ecological field data, the classical metapopulation model may
be generalized to include a submodel for the observation process,
which allows an occupied site to be recorded as unoccupied.
This model has been developed by @mackenzie_estimating_2003. It is
described as a hierarchical model by @Royle_2007, Royle
and Dorazio [-@royle_dorazio:2008, chapter 9] and Kéry and Schaub [-@Kery_2011, chapter 13].
The model is usually called a multi-season, multi-year or a
dynamic site-occupancy model.
The former terms denote the fact that it is applied to multiple
"seasons" or years and the latter emphasizes that the model allows
for between-season occurrence dynamics.

This vignette describes the use of the `unmarked` function
`colext` to fit dynamic occupancy models. Note that we will use
italics for the names of functions.
Static occupancy models, i.e., for a single season without changes in
the occupancy state [@mackenzie_estimating_2002], can be fit with `occu`,
for the model described by @mackenzie_estimating_2002 and @Tyre_2002, and with `occuRN`, for the heterogeneity occupancy model
described by @royle_estimating_2003.
In the next section (section 2), we give a more technical description
of the dynamic occupancy model.
In section 3, we provide R code for generating data under a basic
dynamic occupancy model and illustrate use of `colext` for fitting the
model.
In section 4, we use real data from the Swiss breeding bird survey MHB
[@schmid_etal:2004] to fit a few more elaborate models with
covariates for all parameters.
We also give examples illustrating how to compute predictions, with
standard errors and 95% confidence intervals, for the parameters.

# Dynamic occupancy models

To be able to estimate the parameters of the dynamic occupancy model
(probabilities of occurrence, survival and colonization) separately
from the parameters for the observation process (detection
probability), replicate observations are required from a period of
closure,
during which the occupancy state of a site must remain constant, i.e.,
it is either occupied or unoccupied.
The modeled data $y_{ijt}$ are indicators for whether a species is
detected at site $i$ ($i = 1, 2, \ldots M$), during replicate survey
$j$ ($j = 1, 2, \ldots J$) in season $t$ ($t = 1, 2, \ldots T$).
That is, $y_{ijt}=1$ if at least one individual is detected and
$y_{ijt}=0$ if none is detected.

The model makes the following assumptions:
* replicate surveys at a site during a single season are
  independent (or else dependency must be modeled)
* occurrence state $z_{it}$ (see below) does not change over
  replicate surveys at site $i$ during season $t$
* there are no false-positive errors, i.e., a species can only be
  overlooked where it occurs, but it cannot be detected where it does
  not in fact occur (i.e., there are no false-positives)

The complete model consists of one submodel to describe the ecological
process, or state, and another submodel for the observation process,
which is dependent on the result of the ecological process.
The ecological process describes the latent occurrence dynamics for
all sites in terms of parameters for the probability of initial
occurrence and site survival and colonization.
The observation process describes the probability of detecting a
presence (i.e., $y = 1$) at a site that is occupied and takes account
of false-negative observation errors.

## Ecological or state process

This initial state is denoted $z_{i1}$ and represents occurrence at
site $i$ during season 1.
For this, the model assumes a Bernoulli trial governed by the
occupancy probability in the first season $\psi_{i1}$:

$$
z_{i1} = Bernoulli(\psi_{i1})
$$

We must distinguish the sample quantity "occurrence" at a site, $z$,
from the population quantity "occupancy probability", $\psi$.
The former is the realization of a Bernoulli random variable with
parameter $\psi$.
This distinction becomes important when we want to compute the number
of occupied sites among the sample of surveyed sites;
see @Royle_2007 and @Weir_2009 for this
distinction.

For all later seasons ($t = 2, 3, \ldots T$), occurrence is a function
of occurrence at site $i$ at time $t-1$ and one of two parameters that
describe the colonization-extinction dynamics of the system.
These dynamic parameters are the probability of local survival
$\phi_{it}$, also called probability of persistence (= 1 minus the
probability of local extinction),
and the probability of colonization $\gamma_{it}$.

$$
z_{it} \sim Bernoulli(z_{i,t-1} \phi_{it} + (1-z_{i,t-1}) \gamma_{it})
$$

Hence, if site $i$ is unoccupied at $t-1$ , $z_{i,t-1}=0$, and the
success probability of the Bernoulli is
$0*\phi_{it} + (1-0) * \gamma_{it}$, so the site is occupied
(=colonized) in season $t$ with probability $\gamma_{it}$
. Conversely, if site $i$ is occupied at $t-1$ , $z_{i,t-1}=1$, and
the success probability of the Bernoulli is given by $1*\phi_{it} +
(1-1) * \gamma_{it}$, so the site is occupied in (=survives to) season
$t$ with probability $\phi_{it}$.

Occupancy probability ($\psi_{it}$) and occurrence ($z_{it}$) at all
later times $t$ can be computed recursively from  $\psi_{i1}$,
$z_{i1}$ , $\phi_{it}$  and  $\gamma_{it}$.
Variances of these derived estimates can be obtained via the delta
method or the bootstrap.

## Observation process

To account for the observation error (specifically, false-negative
observations), the conventional Bernoulli detection process is
assumed, such that

$$
y_{ijt} \sim Bernoulli(z_{it} p_{ijt})
$$

Here, $y_{ijt}$ is the detection probability at site  $i$ during
survey $j$ and season $t$. Detection is conditional on occurrence, and
multiplying $p_{ijt}$ with $z_{it}$ ensures that occurrence can only
be detected where in fact a species occurs, i.e. where $z_{it}=1$.

## Modeling of parameters

The preceding, fully general model description allows for site-($i$)
dependence of all parameters. In addition to that, survival and
colonization probabilities may be season-($t$)dependent and detection
probability season-($t$) and survey-($j$) dependent.
All this complexity may be dropped, especially the dependence on
sites. On the other hand, all parameters that are indexed in some way
can be modeled, e.g., as functions of covariates that vary along the
dimension denoted by an index. We will fit linear functions (on the
logit link scale) of covariates into first-year occupancy, survival
and colonization and into detection probability.
That is, for probabilities of first-year occupancy, survival,
colonization and detection, respectively, we will fit models of the
form
 $logit(\psi_{i1}) = \alpha + \beta x_i$, where $x_i$  may be forest
 cover or elevation of site $i$ ,
 $logit(\phi_{it}) = \alpha + \beta x_{it}$, where $x_{it}$  may be
 tree mast at site $i$ during season $t$,
 $logit(\gamma_{it}) = \alpha + \beta x_{it}$, for a similarly
 defined covariate $x_{it}$, or
 $logit(p_{ijt}) = \alpha + \beta x_{ijt}$ , where $x_{ijt}$ is the
 Julian date of the survey $j$ at site $i$ in season $t$.

We note that for first-year occupancy, only covariates that vary among
sites ("site covariates") can be fitted, while for survival and
colonization, covariates that vary by site and by season ("yearly
site covariates") may be fitted as well.
For detection, covariates of three formats may be fitted:
"site-covariates", "yearly-site-covariates" and
"observation-covariates", as
they are called in `unmarked`.

# Dynamic occupancy models for simulated data

We first generate a simple, simulated data set
with specified, year-specific values for
the parameters as well as design specifications, i.e., number of
sites, years and surveys per year.
Then, we show how to fit a dynamic occupancy model with
year-dependence in the parameters for colonization, extinction and
detection probability.

## Simulating, formatting, and summarizing data

To simulate the data, we execute the following R code.
The actual values for these parameters for each year are drawn
randomly from a uniform distribution with
the specified bounds.


```r
M <- 250                                # Number of sites
J <- 3                                  # num secondary sample periods
T <- 10                                 # num primary sample periods

psi <- rep(NA, T)                       # Occupancy probability
muZ <- z <- array(dim = c(M, T))        # Expected and realized occurrence
y <- array(NA, dim = c(M, J, T))        # Detection histories

set.seed(13973)
psi[1] <- 0.4                           # Initial occupancy probability
p <- c(0.3,0.4,0.5,0.5,0.1,0.3,0.5,0.5,0.6,0.2)
phi <- runif(n=T-1, min=0.6, max=0.8)   # Survival probability (1-epsilon)
gamma <- runif(n=T-1, min=0.1, max=0.2) # Colonization probability

# Generate latent states of occurrence
# First year
z[,1] <- rbinom(M, 1, psi[1])           # Initial occupancy state
# Later years
for(i in 1:M){                          # Loop over sites
   for(k in 2:T){                        # Loop over years
      muZ[k] <- z[i, k-1]*phi[k-1] + (1-z[i, k-1])*gamma[k-1]
      z[i,k] <- rbinom(1, 1, muZ[k])
   }
}

# Generate detection/non-detection data
for(i in 1:M){
   for(k in 1:T){
      prob <- z[i,k] * p[k]
      for(j in 1:J){
         y[i,j,k] <- rbinom(1, 1, prob)
      }
   }
}

# Compute annual population occupancy
for (k in 2:T){
   psi[k] <- psi[k-1]*phi[k-1] + (1-psi[k-1])*gamma[k-1]
   }
```

We have now generated a single realization from the stochastic system
thus defined. Figure 1
illustrates the fundamental issue
of imperfect detection --- the actual proportion of sites occupied
differs greatly from the observed proportion of sites occupied, and
because $p$ varies  among years, the observed data cannot be used as a
valid index of the parameter of interest $\psi_i$.


```r
plot(1:T, colMeans(z), type = "b", xlab = "Year",
     ylab = "Proportion of sites occupied",
     col = "black", xlim=c(0.5, 10.5), xaxp=c(1,10,9),
     ylim = c(0,0.6), lwd = 2, lty = 1,
     frame.plot = FALSE, las = 1, pch=16)

psi.app <- colMeans(apply(y, c(1,3), max))
lines(1:T, psi.app, type = "b", col = "blue", lty=3, lwd = 2)
legend(1, 0.6, c("truth", "observed"),
       col=c("black", "blue"), lty=c(1,3), pch=c(16,1))
```

![Figure 1. Summary of the multi-year occupancy data set generated.](colext-data-1.png)

To analyze this data set with a dynamic occupancy model in
`unmarked`, we first load the package.


```r
library(unmarked)
```

Next, we reformat the detection/non-detection data from a 3-dimensional
array (as generated) into a 2-dimensional matrix with M rows.
That is, we put the annual tables of data (the slices of the former
3-D array) sideways to produce a "wide" layout of the data.


```r
yy <- matrix(y, M, J*T)
```

Next, we create a matrix indicating the year each site was surveyed.


```r
year <- matrix(c('01','02','03','04','05','06','07','08','09','10'),
               nrow(yy), T, byrow=TRUE)
```

To organize the data in the format required by `colext`, we make
use of the function `unmarkedMultFrame`. The only required
arguments are `y`, the detection/non-detection data, and
`numPrimary`, the number of seasons. The three types of
covariates described earlier can also be supplied using the arguments
`siteCovs`, `yearlySiteCovs`, and `obsCovs`. In this case,
we only make use of the second type, which must have M rows and T
columns.


```r
simUMF <- unmarkedMultFrame(
    y = yy,
    yearlySiteCovs = list(year = year),
    numPrimary=T)
summary(simUMF)
```

```
## unmarkedFrame Object
## 
## 250 sites
## Maximum number of observations per site: 30 
## Mean number of observations per site: 30 
## Number of primary survey periods: 10 
## Number of secondary survey periods: 3 
## Sites with at least one detection: 195 
## 
## Tabulation of y observations:
##    0    1 
## 6430 1070 
## 
## Yearly-site-level covariates:
##       year     
##  01     : 250  
##  02     : 250  
##  03     : 250  
##  04     : 250  
##  05     : 250  
##  06     : 250  
##  (Other):1000
```

## Model fitting

We are ready to fit a few dynamic occupancy models.
We will fit a model with constant values for all parameters and
another with full time-dependence for colonization, extinction and
detection probability. We also time the calculations.


```r
# Model with all constant parameters
m0 <- colext(psiformula= ~1, gammaformula = ~ 1, epsilonformula = ~ 1,
             pformula = ~ 1, data = simUMF, method="BFGS")
summary(m0)
```

```
## 
## Call:
## colext(psiformula = ~1, gammaformula = ~1, epsilonformula = ~1, 
##     pformula = ~1, data = simUMF, method = "BFGS")
## 
## Initial (logit-scale):
##  Estimate    SE     z  P(>|z|)
##    -0.813 0.158 -5.16 2.46e-07
## 
## Colonization (logit-scale):
##  Estimate     SE   z   P(>|z|)
##     -1.77 0.0807 -22 2.75e-107
## 
## Extinction (logit-scale):
##  Estimate    SE     z  P(>|z|)
##     -0.59 0.102 -5.79 7.04e-09
## 
## Detection (logit-scale):
##  Estimate     SE     z P(>|z|)
##   -0.0837 0.0562 -1.49   0.137
## 
## AIC: 4972.597 
## Number of sites: 250
## optim convergence code: 0
## optim iterations: 27 
## Bootstrap iterations: 0
```

The computation time was only a few seconds.
Note that all parameters were estimated on the logit scale. To
back-transform to the original scale, we can simply use the
inverse-logit function, named `plogis` in R.


```r
plogis(-0.813)
```

```
## [1] 0.3072516
```

Alternatively, we can use `backTransform`, which
computes standard errors using the delta method. Confidence intervals
are also easily obtained using the function `confint`.
We first remind ourselves of the names of parameters, which can all be
used as arguments for these functions.


```r
names(m0)
```

```
## [1] "psi" "col" "ext" "det"
```

```r
backTransform(m0, type="psi")
```

```
## Backtransformed linear combination(s) of Initial estimate(s)
## 
##  Estimate     SE LinComb (Intercept)
##     0.307 0.0335  -0.813           1
## 
## Transformation: logistic
```

```r
confint(backTransform(m0, type="psi"))
```

```
##      0.025     0.975
##  0.2457313 0.3765804
```

Next, we fit the dynamic occupancy model with full year-dependence in
the parameters describing occupancy dynamics and also in detection.
This is the same model under which we generated the data set, so we
would expect accurate estimates.

By default in R, a factor such as year in this analysis, is a
parameterized in terms of an intercept and effects representing
differences. This would mean that the parameter for the first year is
the intercept  and the effects would denote the differences between
the parameter  values in all other years, relative to the parameter
value in the first year, which serves as a reference level.
This treatment or effects parameterization is useful for testing for
differences. For simple presentation, a means parameterization is more
practical. It can be specified by adding a -1 to the formula for the
time-dependent parameters.


```r
m1 <- colext(psiformula = ~1,   # First-year occupancy
    gammaformula = ~ year-1,    # Colonization
    epsilonformula = ~ year-1,  # Extinction
    pformula = ~ year-1,        # Detection
    data = simUMF)
m1
```

```
## 
## Call:
## colext(psiformula = ~1, gammaformula = ~year - 1, epsilonformula = ~year - 
##     1, pformula = ~year - 1, data = simUMF)
## 
## Initial:
##  Estimate    SE      z P(>|z|)
##    -0.273 0.302 -0.906   0.365
## 
## Colonization:
##        Estimate    SE     z  P(>|z|)
## year01    -2.08 0.951 -2.19 2.86e-02
## year02    -2.18 0.365 -5.96 2.52e-09
## year03    -1.98 0.274 -7.23 4.88e-13
## year04    -2.32 0.678 -3.42 6.37e-04
## year05    -1.89 0.478 -3.95 7.78e-05
## year06    -1.76 0.294 -5.97 2.44e-09
## year07    -1.55 0.230 -6.73 1.75e-11
## year08    -1.43 0.228 -6.29 3.19e-10
## year09    -2.35 0.470 -5.00 5.64e-07
## 
## Extinction:
##        Estimate    SE      z  P(>|z|)
## year01  -1.4209 0.418 -3.401 6.72e-04
## year02  -0.4808 0.239 -2.009 4.45e-02
## year03  -1.2606 0.366 -3.440 5.83e-04
## year04  -0.0907 0.650 -0.139 8.89e-01
## year05  -0.6456 0.599 -1.078 2.81e-01
## year06  -0.9586 0.378 -2.539 1.11e-02
## year07  -1.2279 0.365 -3.362 7.74e-04
## year08  -1.1894 0.292 -4.076 4.58e-05
## year09  -0.6292 0.635 -0.991 3.22e-01
## 
## Detection:
##        Estimate    SE      z  P(>|z|)
## year01  -1.0824 0.244 -4.434 9.26e-06
## year02  -0.2232 0.148 -1.508 1.32e-01
## year03   0.2951 0.154  1.918 5.52e-02
## year04   0.0662 0.161  0.412 6.81e-01
## year05  -2.0396 0.433 -4.706 2.52e-06
## year06  -0.6982 0.232 -3.005 2.66e-03
## year07   0.2413 0.165  1.466 1.43e-01
## year08   0.0847 0.155  0.548 5.84e-01
## year09   0.6052 0.140  4.338 1.44e-05
## year10  -1.1699 0.306 -3.828 1.29e-04
## 
## AIC: 4779.172
```

## Manipulating results: prediction and plotting

Again, all estimates are shown on the logit-scale. Back-transforming
estimates when covariates, such as year, are present involves an
extra step. Specifically, we need to tell `unmarked` the values
of our covariate
at which we want an estimate. This can be done using
`backTransform` in combination with `linearComb`, although
it can be easier to use `predict`. `predict` allows the user
to supply a data.frame in which each row represents a combination of
covariate values of interest. Below, we create data.frames called
`nd` with each row representing a year.
Then we request yearly estimates of the probability of extinction,
colonization and detection,
and compare them to "truth", i.e., the values with which we
simulated the data set. Note that there are T-1 extinction and
colonization parameters in this case, so we do not need to include
year 10 in `nd`.


```r
nd <- data.frame(year=c('01','02','03','04','05','06','07','08','09'))
E.ext <- predict(m1, type='ext', newdata=nd)
E.col <- predict(m1, type='col', newdata=nd)
nd <- data.frame(year=c('01','02','03','04','05','06','07','08','09','10'))
E.det <- predict(m1, type='det', newdata=nd)
```

`predict` returns the predictions along with standard errors and
confidence intervals. These can be used to create plots. The
`with` function is used to simplify the process of requesting the
columns of `data.frame` returned by `predict`.


```r
op <- par(mfrow=c(3,1), mai=c(0.6, 0.6, 0.1, 0.1))

with(E.ext, {   # Plot for extinction probability
  plot(1:9, Predicted, pch=1, xaxt='n', xlab='Year',
    ylab=expression(paste('Extinction probability ( ', epsilon, ' )')),
    ylim=c(0,1), col=4)
  axis(1, at=1:9, labels=nd$year[1:9])
  arrows(1:9, lower, 1:9, upper, code=3, angle=90, length=0.03, col=4)
  points((1:9)-0.1, 1-phi, col=1, lwd = 1, pch=16)
  legend(7, 1, c('Parameter', 'Estimate'), col=c(1,4), pch=c(16, 1),
         cex=0.8)
  })

with(E.col, {	# Plot for colonization probability
  plot(1:9, Predicted, pch=1, xaxt='n', xlab='Year',
    ylab=expression(paste('Colonization probability ( ', gamma, ' )')),
    ylim=c(0,1), col=4)
  axis(1, at=1:9, labels=nd$year[1:9])
  arrows(1:9, lower, 1:9, upper, code=3, angle=90, length=0.03, col=4)
  points((1:9)-0.1, gamma, col=1, lwd = 1, pch=16)
  legend(7, 1, c('Parameter', 'Estimate'), col=c(1,4), pch=c(16, 1),
         cex=0.8)
  })

with(E.det, {   # Plot for detection probability: note 10 years
  plot(1:10, Predicted, pch=1, xaxt='n', xlab='Year',
    ylab=expression(paste('Detection probability ( ', p, ' )')),
    ylim=c(0,1), col=4)
  axis(1, at=1:10, labels=nd$year)
  arrows(1:10, lower, 1:10, upper, code=3, angle=90, length=0.03, col=4)
  points((1:10)-0.1, p, col=1, lwd = 1, pch=16)
  legend(7.5, 1, c('Parameter','Estimate'), col=c(1,4), pch=c(16, 1),
         cex=0.8)
  })
```

![Figure 2. Yearly estimates of parameters](colext-est-1.png)

```r
par(op)
```

Figure 2 shows that the 95% confidence intervals
include the true parameter values, and the point estimates are not too
far off.

## Derived parameters

Estimates of occupancy probability in years $T>1$ must be derived from the
estimates of first-year occupancy and the two parameters governing the
dynamics, extinction/survival and colonization.
`unmarked` does this automatically in two ways. First, the
population-level estimates of occupancy probability
$\psi_t = \psi_{t-1}\phi_{t-1} + (1-\phi_{t-1})\gamma$ are calculated
and stored in the slot named \emph{projected}. Slots can be accessed
using the `@` operator, e.g. `fm@projected`.
In some cases, interest may lie in making
inference about the proportion of the sampled sites that are occupied,
rather than the entire population of sites. These estimates are
contained in the `smoothed` slot of the fitted model. Thus, the
`projected` values are estimates of population parameters, and
the `smoothed` estimates are of the finite-sample
quantities. Discussions of the differences can be found in @Weir_2009.

Bootstrap methods can be used to compute standard errors of derived
parameter estimates. Here we employ a non-parametric bootstrap to obtain
standard errors of the smoothed estimates of occupancy probability
during each year.


```r
m1 <- nonparboot(m1, B = 10)
cbind(psi=psi, smoothed=smoothed(m1)[2,], SE=m1@smoothed.mean.bsse[2,])
```

```
##          psi  smoothed         SE
## 1  0.4000000 0.4320671 0.06783911
## 2  0.3493746 0.4110124 0.03786402
## 3  0.2977125 0.3139967 0.02780818
## 4  0.3148447 0.3278179 0.04303542
## 5  0.3192990 0.2316695 0.10858419
## 6  0.2915934 0.2528485 0.04179036
## 7  0.3114415 0.2928429 0.03113920
## 8  0.3636580 0.3504885 0.04224678
## 9  0.3654064 0.3936991 0.02103870
## 10 0.3460641 0.3095786 0.06830698
```

In practice, `B` should be much higher, possibly >1000 for complex
models .

Another derived parameters of interest is turnover probability

$$
\tau_t = \frac{\gamma_{t-1}(1-\psi_{t-1})}{\gamma_{t-1}(1-\psi_{t-1})
  + \phi_{t-1}\psi_{t-1}}
$$

The following function returns these estimates.


```r
turnover <- function(fm) {
    psi.hat <- plogis(coef(fm, type="psi"))
    if(length(psi.hat) > 1)
        stop("this function only works if psi is scalar")
    T <- getData(fm)@numPrimary
    tau.hat <- numeric(T-1)
    gamma.hat <- plogis(coef(fm, type="col"))
    phi.hat <- 1 - plogis(coef(fm, type="ext"))
    if(length(gamma.hat) != T-1 | length(phi.hat) != T-1)
        stop("this function only works if gamma and phi T-1 vectors")
    for(t in 2:T) {
        psi.hat[t] <- psi.hat[t-1]*phi.hat[t-1] +
            (1-psi.hat[t-1])*gamma.hat[t-1]
        tau.hat[t-1] <- gamma.hat[t-1]*(1-psi.hat[t-1]) / psi.hat[t]
        }
    return(tau.hat)
    }
```

The bootstrap again offers a means of estimating variance. Here we
show how to generate 95\% confidence intervals for the turnover
estimates using the parametric bootstrap.


```r
pb <- parboot(m1, statistic=turnover, nsim=2)
turnCI <- cbind(pb@t0,
  t(apply(pb@t.star, 2, quantile, probs=c(0.025, 0.975))))
colnames(turnCI) <- c("tau", "lower", "upper")
turnCI
```

```
##           tau      lower     upper
## t*1 0.1532645 0.00536045 0.1974714
## t*2 0.1911530 0.07881180 0.2119585
## t*3 0.2537292 0.19777204 0.2785973
## t*4 0.2604356 0.04063769 0.4197328
## t*5 0.3989303 0.34078483 0.4720357
## t*6 0.3758690 0.32703698 0.5370796
## t*7 0.3537473 0.32696166 0.3564059
## t*8 0.3174983 0.32925238 0.4139696
## t*9 0.1704449 0.18946470 0.3186236
```

Which bootstrap method is most appropriate for variance estimation?
For detailed distinctions between the
non-parametric and the parametric bootstrap, see @Davison_1997.
We note simply that the parametric bootstrap resamples from
the fitted model, and thus the
measures of uncertainty are purely
functions of the distributions assumed by the model. Non-parametric
bootstrap samples, in contrast, are obtained by resampling the
data, not the model, and thus are not necessarily affected by the
variance formulas of the model's distributions.

## Goodness-of-fit

In addition to estimating the variance of an estimate, the parametric
bootstrap can be used to assess goodness-of-fit. For this purpose, a
fit-statistic, i.e. one that compares
observed and expected values, is evaluated using the original fitted
model, and numerous other models fitted to simulated datasets. The
simulation yields an approximation of
the distribution of the fit-statistic, and a \emph{P}-value
can be computed as the proportion of simulated values greater than the
observed value.

@Hosmer_1997 found that a $\chi^2$ statistic performed
reasonably well in assessing lack of fit for logistic regression
models. We know of no studies formally
evaluating the performance of various fit-statistics for dynamic
occupancy models, so this approach should be
considered experimental. Fit-statistics applied to aggregated
encounter histories offer an alternative approach [@MacKenzie_2004], but are difficult to implement when J*T is high and
missing values or continuous covariates are present.


```r
chisq <- function(fm) {
    umf <- getData(fm)
    y <- getY(umf)
    sr <- fm@sitesRemoved
    if(length(sr)>0)
        y <- y[-sr,,drop=FALSE]
    fv <- fitted(fm, na.rm=TRUE)
    y[is.na(fv)] <- NA
    sum((y-fv)^2/(fv*(1-fv)))
    }

set.seed(344)
pb.gof <- parboot(m0, statistic=chisq, nsim=100)

plot(pb.gof, xlab=expression(chi^2), main="", col=gray(0.95),
     xlim=c(7300, 7700))
```

![Figure 3. Goodness-of-fit](colext-gof-1.png)

Figure 3 indicates that, as expected, the constant
parameter model does not fit the data well.

# Dynamic occupancy models for crossbill data from the Swiss MHB

## The crossbill data set

The crossbill data are included with the `unmarked` package.
The dataset contains the results of nine years of surveys (1999--2007)
for the European crossbill (*Loxia curvirostra*),
a pine-seed eating finch, in 267 1-km$^2$ sample quadrats in Switzerland.
Quadrats are surveyed annually as part of the Swiss breeding bird
survey MHB [@schmid_etal:2004].
They are laid out as a grid over Switzerland and surveyed 2 or 3 times
every breeding season (mid-April to late June)
by experienced field ornithologists along a haphazard survey route of
length 1-9 km (average 5 km).
High-elevation sites are only surveyed twice per breeding season.

## Importing, formatting, and summarizing data

The data can be loaded into an open R workspace using the `data` command.


```r
data(crossbill)
colnames(crossbill)
```

```
##  [1] "id"      "ele"     "forest"  "surveys" "det991"  "det992"  "det993" 
##  [8] "det001"  "det002"  "det003"  "det011"  "det012"  "det013"  "det021" 
## [15] "det022"  "det023"  "det031"  "det032"  "det033"  "det041"  "det042" 
## [22] "det043"  "det051"  "det052"  "det053"  "det061"  "det062"  "det063" 
## [29] "det071"  "det072"  "det073"  "date991" "date992" "date993" "date001"
## [36] "date002" "date003" "date011" "date012" "date013" "date021" "date022"
## [43] "date023" "date031" "date032" "date033" "date041" "date042" "date043"
## [50] "date051" "date052" "date053" "date061" "date062" "date063" "date071"
## [57] "date072" "date073"
```

We have three covariates that vary by site: median elevation of the
quadrat (`ele`, in metres), forest cover of the quadrat (`forest`, in
percent) and the number of surveys per season (i.e., 2 or 3,
surveys).
These are called site covariates, because they vary by sites only.
The 27 columns entitled `det991` - `det073` contain the crossbill
detection/nondetection data during all surveys over the 9 years.
They contain a 1 when at least one crossbill was recorded during a
survey and a 0 otherwise.
`NA`s indicate surveys that did not take place, either because a site is
high-elevation and has no third survey or because it failed to be
surveyed altogether in a year.
The final 27 columns entitled `date991` - `date073` give the Julian
date of each survey.
They represent a "survey-covariate" or "observation covariate".
We note that the paper by @Royle_2007 used a subset of this
data set.

AIC-based model selection (see section 5.4) requires
that all models are fit to the same data.
`unmarked` removes missing data in a context specific way. For
missing `siteCovs`, the entire row of data must be removed. However, for
missing `yearlySiteCovs` or `obsCovs`, only the
corresponding observation
are removed. Thus, if `unmarked` removes different observations
from different models, the models cannot be compared using AIC. A way
around this is to remove the detection data corresponding to
missing covariates before fitting the models.
The crossbill data have missing dates and so we remove the associated
detection/non-detection data.


```r
DATE <- as.matrix(crossbill[,32:58])
y.cross <- as.matrix(crossbill[,5:31])
y.cross[is.na(DATE) != is.na(y.cross)] <- NA
```

In addition, continuous covariates should be transformed in a way
that brings their values close to zero in order to improve
or even enable numerical convergence of the maximum-likelihood routine.
We do this "by hand" and note that we could also have used the R
function `scale`. We subtract the mean and divide by the standard
deviation.


```r
sd.DATE <- sd(c(DATE), na.rm=TRUE)
mean.DATE <- mean(DATE, na.rm=TRUE)
DATE <- (DATE - mean.DATE) / sd.DATE
```

Before we can fit occupancy models, we need to format this data set
appropriately.


```r
years <- as.character(1999:2007)
years <- matrix(years, nrow(crossbill), 9, byrow=TRUE)
umf <- unmarkedMultFrame(y=y.cross,
    siteCovs=crossbill[,2:3], yearlySiteCovs=list(year=years),
    obsCovs=list(date=DATE),
    numPrimary=9)
```

## Model fitting

We fit a series of models that represent different hypotheses about
the colonization-extinction dynamics of Swiss crossbills
at a spatial scale of 1 km$^2$.
We fit year effects on colonization and extinction in the means
parameterization,
but for detection probability, we choose an effects parameterization.
The latter is more useful for getting predictions in the presence of
other explanatory variables for that parameter.
For model `fm5` with more complex covariate relationships, we use as
starting values for the optimization routine
the solution from a "neighboring" model with slightly less
complexity, model `fm4`.
Wise choice of starting values can be decisive for success or failure
of maximum likelihood estimation.


```r
# A model with constant parameters
fm0 <- colext(~1, ~1, ~1, ~1, umf)

# Like fm0, but with year-dependent detection
fm1 <- colext(~1, ~1, ~1, ~year, umf)

# Like fm0, but with year-dependent colonization and extinction
fm2 <- colext(~1, ~year-1, ~year-1, ~1, umf)

# A fully time-dependent model
fm3 <- colext(~1, ~year-1, ~year-1, ~year, umf)

# Like fm3 with forest-dependence of 1st-year occupancy
fm4 <- colext(~forest, ~year-1, ~year-1, ~year, umf)

# Like fm4 with date- and year-dependence of detection
fm5 <- colext(~forest, ~year-1, ~year-1, ~year + date + I(date^2),
              umf, starts=c(coef(fm4), 0, 0))

# Same as fm5, but with detection in addition depending on forest cover
fm6 <- colext(~forest, ~year-1, ~year-1, ~year + date + I(date^2) +
              forest, umf)
```

## Model selection

We can compare models using the Akaike information criterion
($AIC$).
Note that `unmarked` yields $AIC$, not $AIC_c$
because the latter would require the sample size,
which is not really known for
hierarchical models such as the dynamic occupancy model.

Model selection and model-averaged prediction in `unmarked`
require that we create a list of models using `fitList`.
This function organizes models and conducts a series of tests to
ensure that the models were fit to the same data.


```r
models <- fitList('psi(.)gam(.)eps(.)p(.)'    = fm0,
                  'psi(.)gam(.)eps(.)p(Y)'    = fm1,
                  'psi(.)gam(Y)eps(Y)p(.)'    = fm2,
                  'psi(.)gam(Y)eps(Y)p(Y)'    = fm3,
                  'psi(F)gam(Y)eps(Y)p(Y)'    = fm4,
                  'psi(F)gam(Y)eps(Y)p(YD2)'  = fm5,
                  'psi(F)gam(Y)eps(Y)p(YD2F)' = fm6)
ms <- modSel(models)
ms
```

```
##                           nPars     AIC  delta   AICwt cumltvWt
## psi(F)gam(Y)eps(Y)p(YD2F)    30 4986.39   0.00 1.0e+00     1.00
## psi(F)gam(Y)eps(Y)p(YD2)     29 5059.30  72.91 1.5e-16     1.00
## psi(F)gam(Y)eps(Y)p(Y)       27 5095.38 108.99 2.2e-24     1.00
## psi(.)gam(.)eps(.)p(Y)       12 5111.32 124.93 7.5e-28     1.00
## psi(.)gam(Y)eps(Y)p(Y)       26 5127.63 141.24 2.1e-31     1.00
## psi(.)gam(Y)eps(Y)p(.)       18 5170.54 184.15 1.0e-40     1.00
## psi(.)gam(.)eps(.)p(.)        4 5193.50 207.11 1.1e-45     1.00
```

One model has overwhelming support, so we can base inference on that
one alone. Before doing so, we point out how to extract coefficients
from a `fitList` object, and convert the results to a
`data.frame`, which could be exported from R.


```r
coef(ms)                            # Estimates only
SE(ms)                              # Standard errors only
toExport <- as(ms, "data.frame")    # Everything
```

## Manipulating results: Prediction and plotting

Fitted models can be used to predict expected outcomes when given new
data. For example, one could ask "how many crossbills would you
expect to find in a quadrat with 50% forest cover?" Prediction also
offers a way of
presenting the results of an analysis. We illustrate by plotting the
predictions of $\psi$ and $p$ over the range of covariate values studied.
Note that because we standardized date, we need to transform it back
to its original scale after obtaining predictions on the
standardized scale.


```r
op <- par(mfrow=c(1,2), mai=c(0.8,0.8,0.1,0.1))

nd <- data.frame(forest=seq(0, 100, length=50))
E.psi <- predict(fm6, type="psi", newdata=nd, appendData=TRUE)

with(E.psi, {
    plot(forest, Predicted, ylim=c(0,1), type="l",
         xlab="Percent cover of forest",
         ylab=expression(hat(psi)), cex.lab=0.8, cex.axis=0.8)
    lines(forest, Predicted+1.96*SE, col=gray(0.7))
    lines(forest, Predicted-1.96*SE, col=gray(0.7))
    })

nd <- data.frame(date=seq(-2, 2, length=50),
                 year=factor("2005", levels=c(unique(years))),
                 forest=50)
E.p <- predict(fm6, type="det", newdata=nd, appendData=TRUE)
E.p$dateOrig <- E.p$date*sd.DATE + mean.DATE

with(E.p, {
    plot(dateOrig, Predicted, ylim=c(0,1), type="l",
         xlab="Julian date", ylab=expression( italic(p) ),
         cex.lab=0.8, cex.axis=0.8)
    lines(dateOrig, Predicted+1.96*SE, col=gray(0.7))
    lines(dateOrig, Predicted-1.96*SE, col=gray(0.7))
    })
```

![Figure 4. Covariates](colext-pred-1.png)

```r
par(op)
```

**Acknowledgments**

Special thanks goes to Ian Fiske, the author of `colext` and the
original developer of `unmarked`. Andy Royle provided the
initial funding and support for the package. The questions of many
people on the users' list motivated the writing of this document.

# References

```{r, echo=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```