vignettes/simulate.Rmd


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274

---
title: Simulating datasets
author: Ken Kellner
date: September 10, 2021
bibliography: unmarked.bib
csl: ecology.csl
output: 
  rmarkdown::html_vignette:
    fig_width: 5
    fig_height: 3.5
    number_sections: true
    toc: true
vignette: >
  %\VignetteIndexEntry{Simulating datasets}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction

Simulating datasets is a powerful and varied tool when conducting `unmarked` analyses.
Writing our own code to simulate a dataset based on a given model is an excellent learning tool, and can help us test if a given model is generating the expected results.
If we simulate a series of datasets based on a fitted model, and calculate a statistic from each of those fits, we can generate a distribution of the statistic - this is what the `parboot` function does.
This can be helpful, for example, when testing goodness of fit.
Finally, simulation can be a useful component of power analysis when a closed-form equation for power is not available.

`unmarked` provides two different ways of generating simulated datasets, depending on the stage we are at in the modeling process.

1. Generating simulated datasets from a fitted model we already have
2. Generating simulated datasets from scratch

For (1), we simply call the `simulate` method on our fitted model object and new dataset(s) are generated.
This is the approach used by `parboot`.
In this vignette we will focus on (2), a more flexible approach to simulation, also using the `simulate` method, that allows us to generate a dataset corresponding to any `unmarked` model from scratch.

# Components of a call to simulate

We will need to provide, at a minimum, four pieces of information to `simulate` in order to simulate a dataset from scratch in `unmarked`.

1. The name of the fitting function for the model we want to simulate from, as a character string
2. A list of formulas, one per submodel, containing the names of the covariates we want to include in each
3. A list of vectors of regression coefficients (intercepts and slopes), one per submodel, matching the formulas
4. A list of design components; for example, the number of sites and number of observations per site

A number of other arguments are available, e.g. for how to customize how the covariates are randomly generated or for distributions to use when simulating abundances.
We'll show those later.
The easiest way to demonstrate how to use `simulate` is to look at an example: we'll start with a simple one for occupancy.

# Simulating an occupancy dataset

Suppose we want to simulate an occupancy dataset in which site occupancy is affected by elevation.
The first piece of information needed is the name of model to use: the fitting function for occupancy is `occu`, so the first argument to `simulate` and the name of the model will be `"occu"`.

## Formulas

Second we must define the desired model structure as a list of formulas, one per submodel.
"Submodels" here are the hierarchical components of the model; for example, an occupancy model has a state (occupancy) submodel and an observation (detection) submodel.
These submodels are identified by short names: `state` and `det`.
We will use these short names repeatedly.
In order to identify which submodels are needed and what their short names are, we can simply fit any model of that type (e.g. from the example) and call `names(model)`.

```{r}
set.seed(123)
library(unmarked)
umf <- unmarkedFrameOccu(y=matrix(c(0,1,0,1,1,0,0,0,1), nrow=3))
mod <- occu(~1~1, umf)
names(mod)
```

Formulas are supplied as a named list.
The list has one element per submodel, and the names of the elements are the short names defined above.
Each list element is a formula, containing the desired number of covariates to use, and the names of these covariates.
Below we define our list of formulas, including an effect of elevation on occupancy (note we could name this whatever we want, here we call it `elev`).
We don't want any covariates on detection probability, so the formula defines the model as intercept only: `~1`.

```{r}
forms <- list(state=~elev, det=~1)
```

## Regression coefficients

Next we must tell `unmarked` what the values for the intercept and regression coefficients in each submodel should be.
Once again, this is a named list, one element for each submodel.
Each list element is a numeric vector.
The components of each numeric vector must also be named, matching the covariate names in our list of formulas.
Don't forget we also must specify a value for the intercept in each submodel (can be named `Intercept` or `intercept`).
If we are not sure exactly how to structure this list, just skip it for now: `unmarked` can generate a template for us to fill in later.

```{r}
coefs <- list(state=c(intercept=0, elev=-0.4), det=c(intercept=0))
```

We have a list with two elements, each a numeric vector.
Both contain intercept values, and the `state` vector also contains a value corresponding to the desired effect of our covariate `elev`.

## Study design information

Finally, we need to give `unmarked` information about the study design.
This is pretty simple: we just need a list containing values for `M`, the number of sites, and `J` the number of surveys per site.
For models with multiple primary periods, we'd also need a value of `T`, the number of primary periods.

```{r}
design <- list(M=300, J=8) # 300 sites, 8 occasions per site
```

## Put it all together

We're now ready to simulate a dataset.
To do this we use the `simulate` function, providing as arguments the name of the model `"occu"` and the three lists we constructed above.
Actually, first, let's not supply the `coefs` list, to show how `unmarked` will generate a template for us to use:

```{r, eval=FALSE}
simulate("occu", formulas=forms, design=design)
```

```{r, echo=FALSE}
try(simulate("occu", formulas=forms, design=design))
```

We can replicate this provided list structure and fill in our own numeric values.
Once we have our coefficients set up properly, add them to the function call:

```{r}
occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design)
head(occu_umf)
```

`unmarked` has generated a presence-absence dataset as well as values for covariate `elev`.
We can check that it worked as expected by fitting the corresponding model to the dataset, and making sure the estimated values are similar:

```{r}
(occu(~1 ~elev, occu_umf))
```

## Customizing the covariates

By default, a covariate will be continuous and come from a standard normal distribution (mean 0, SD 1).
However, we can control this using the `guide` argument.
For example, suppose we want elevation to come from a random normal, but with a mean of 2 and a standard deviation of 0.5.
We can provide a named list to the `guide` argument as follows:

```{r}
guide <- list(elev=list(dist=rnorm, mean=2, sd=0.5))
```

`guide` contains one element, called `elev`, which is also a list and contains three components:

1. The random distribution function to use, `rnorm`
2. The mean of the distribution
3. The SD of the distribution

```{r}
occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design, guide=guide)
head(occu_umf)
```

You can see the `elev` covariate now has values corresponding to the desired distribution.
Note that the elements of the list will depend on the arguments required by the random distribution function.
For example, to use a uniform distribution instead:

```{r}
guide <- list(elev=list(dist=runif, min=0, max=1)) 
occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design, guide=guide)
head(occu_umf)
```

It is also possible to define a categorical (factor) covariate.
We specify an entry in the `guide` list, but instead of a list, we supply a call to `factor` which defines the desired factor levels.
For example, suppose we want to add a new `landcover` covariate to our simulated model.
First, define the new formulas:

```{r}
forms2 <- list(state=~elev+landcover, det=~1)
```

And then the new guide, including the information about factor levels:

```{r}
guide <- list(landcover=factor(levels=c("forest","grass","urban")))
```

We'd also need an updated `coefs` since we have a new covariate.
Defining the `coefs` when you have factors in your model is a little trickier, since R names the effects as a combination of the factor name and the level name.
There is no coefficient for the reference level (`"forest"` in our example), but we need to provide coefficients for both `"grass"` and `"urban"`.
When combined with the factor name the complete coefficient names for these two will be `landcovergrass` and `landcoverurban`.
The easiest way to make sure we get these names right is to let `unmarked` generate a template `coefs` for you as shown above, and then fill it in.

```{r}
# forest is the reference level for landcover since it was listed first
coefs2 <- list(state=c(intercept=0, elev=-0.4, landcovergrass=0.2, 
                       landcoverurban=-0.7), det=c(intercept=0))
```

```{r}
head(simulate("occu", formulas=forms2, coefs=coefs2, design=design, guide=guide))
```

Our output dataset now includes a new categorical covariate.

## Models that require more information

More complex models might require more information for simulation.
Nearly any argument provided to either the fitting function for the model, or the corresponding `unmarkedFrame` constructor, can be provided as an optional argument to `simulate` to customize the simulation.
For example, we may want to specify that abundance should be simulated as a negative binomial, instead of a Poisson, for `pcount`.
This information is simply added as additional arguments to `simulate`.
For example, we can simulate a `pcount` dataset using the negative binomial (`"NB"`) distribution.
The negative binomial has an additional parameter to estimate (`alpha`) so we must also add an element to `coefs`.

```{r}
coefs$alpha <- c(alpha=0.5)
head(simulate("pcount", formulas=forms, coefs=coefs, design=design, mixture="NB"))
```

In the next section we will show a more detailed example involving these additional arguments.

## Simulating a more complex dataset: gdistremoval

The `gdistremoval` function fits the model of @Amundson_2014, which estimates abundance using a combination of distance sampling and removal sampling data.
When simulating a dataset based on this model, we have to provide several additional pieces of information related to the structure of the distance and removal sampling analyses.

To begin, we will define the list of formulas.
A `gdistremoval` model, when there is only one primary period, has three submodels: abundance (`"lambda"`), distance sampling (`"dist"`), and removal sampling (`"rem"`).
We will fit a model with an effect of elevation `elev` on abundance and an effect of wind `wind` on removal probability.

```{r}
forms <- list(lambda=~elev, dist=~1, rem=~wind)
```

Next we will define the corresponding coefficients.
We will set mean abundance at 5.
The intercept is on the log scale, thus the intercept for `lambda` will be `log(5)`.
The scale parameter for the detection function will be 50, and again it is on the log scale.
The intercept for the removal probability is on the logit scale, so we will set the intercept at -1 (equivalent to a mean removal probability of about 0.27).
Don't forget the covariate effects on `lambda` and removal.

```{r}
coefs <- list(lambda=c(intercept=log(5), elev=0.7), 
              dist=c(intercept=log(50)), rem=c(intercept=-1, wind=-0.3))
```

Our study will have 300 sites.
This model is unique in that we have to specify the number of two different types of observations: (1) the number of distance sampling bins (`Jdist`), and the number of removal intervals (`Jrem`).

```{r}
design <- list(M = 300, Jdist=4, Jrem=5)
```

Finally we are ready to simulate the dataset.
In addition to the name of the model, `forms`, `coefs` and `design`, we also need to provide some additional information.
We need to define the distance breaks for the distance sampling part of the model (there should be `Jdist+1` of these), and also the key function to use when simulating the detection process.

```{r}
umf <- simulate("gdistremoval", formulas=forms, coefs=coefs, design=design,
                dist.breaks=c(0,25,50,75,100), keyfun="halfnorm", unitsIn="m")
head(umf)
```

The result is a dataset containing a combination of distance, removal, and covariate data.
We can check to see if fitting a model to this dataset recovers our specified coefficient values:

```{r}
(fit <- gdistremoval(lambdaformula=~elev, removalformula=~wind, 
                    distanceformula=~1, data=umf))
```

Looks good.

# Conclusion

The `simulate` function provides a flexible tool for simulating data from any model in `unmarked`.
These datasets can be used for a variety of purposes, such as for teaching examples, testing models, or developing new tools that work with `unmarked`.
Additionally, simulating datasets is a key component of the power analysis workflow in `unmarked` - see the power analysis vignette for more examples.

# References