| author | Ken Kellner <ken@kenkellner.com> | 2021-09-10 16:23:34 -0400 |
|---|---|---|
| committer | Ken Kellner <ken@kenkellner.com> | 2021-09-10 16:23:34 -0400 |
| commit | dba95e93d75c84b369e52dfde347c4c23645d5b2 (patch) | |
| tree | 7e5a798de4705f7c51d0b3bdecb7ee16c0e81e1d /vignettes | |
| parent | b513c39d50f9fc3f14215b026d23d299b1051573 (diff) | |
Add simulate vignette
Diffstat (limited to 'vignettes')
-rw-r--r-- | vignettes/powerAnalysis.Rnw | 2 |
-rw-r--r-- | vignettes/simulate.Rnw | 313 |
-rw-r--r-- | vignettes/unmarked.bib | 16 |
3 files changed, 330 insertions, 1 deletion
diff --git a/vignettes/powerAnalysis.Rnw b/vignettes/powerAnalysis.Rnw
index f2c6816..5bec099 100644
--- a/vignettes/powerAnalysis.Rnw
+++ b/vignettes/powerAnalysis.Rnw
@@ -186,7 +186,7 @@ coefs <- list(state=c(intercept=0, elev=-0.4), det=c(intercept=0))
 @
 
 Finally, we need to give \code{unmarked} information about the study design.
-This is pretty simple: we just need a list containing values for \code{M}, the number of sites, and code{J} the number of surveys per site.
+This is pretty simple: we just need a list containing values for \code{M}, the number of sites, and \code{J} the number of surveys per site.
 For models with multiple primary periods, we'd also need a value of \code{T}, the number of primary periods.
 
 <<>>=
diff --git a/vignettes/simulate.Rnw b/vignettes/simulate.Rnw
new file mode 100644
index 0000000..ede6c76
--- /dev/null
+++ b/vignettes/simulate.Rnw
@@ -0,0 +1,313 @@
+<<echo=false>>=
+options(width=70)
+options(continue=" ")
+@
+
+\documentclass[a4paper]{article}
+\usepackage[OT1]{fontenc}
+\usepackage{Sweave}
+\usepackage{natbib}
+%\usepackage{fullpage}
+\usepackage[vmargin=1in,hmargin=1in]{geometry}
+\bibliographystyle{ecology}
+
+\usepackage{hyperref}
+\hypersetup{
+  colorlinks=true,
+  linkcolor=blue,
+  urlcolor=cyan,
+  citecolor=black
+}
+
+\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
+\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
+\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
+\fvset{listparameters={\setlength{\topsep}{0pt}}}
+\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
+
+%%\VignetteIndexEntry{Simulating datasets}
+
+\title{Simulating datasets}
+\author{Ken Kellner}
+\date{September 10, 2021}
+
+
+\begin{document}
+
+\newcommand{\code}[1]{\texttt{\small{#1}}}
+\newcommand{\package}[1]{\textsf{\small{#1}}}
+
+\maketitle
+
+\section{Outline}
+
+\begin{enumerate}
+  \item Introduction
+  \item Components of a call to \code{simulate}
+  \item Simulating an occupancy dataset
+  \item Simulating a more complex dataset: \code{gdistremoval}
+  \item Conclusion
+\end{enumerate}
+
+\section{Introduction}
+
+Simulating datasets is a powerful and versatile tool when conducting \code{unmarked} analyses.
+Writing our own code to simulate a dataset based on a given model is an excellent learning tool, and can help us test if a given model is generating the expected results.
+If we simulate a series of datasets based on a fitted model, and calculate a statistic from each of those fits, we can generate a distribution of the statistic; this is what the \code{parboot} function does.
+This can be helpful, for example, when testing goodness of fit.
+Finally, simulation can be a useful component of power analysis when a closed-form equation for power is not available.
+
+\code{unmarked} provides two different ways of generating simulated datasets, depending on the stage we are at in the modeling process:
+
+\begin{enumerate}
+  \item Generating simulated datasets from a fitted model we already have
+  \item Generating simulated datasets from scratch
+\end{enumerate}
+
+For (1), we simply call the \code{simulate} method on our fitted model object and new dataset(s) are generated.
+This is the approach used by \code{parboot}.
+In this vignette we will focus on (2), a more flexible approach, also using the \code{simulate} method, that allows us to generate a dataset corresponding to any \code{unmarked} model from scratch.
+
+\section{Components of a call to simulate}
+
+We need to provide, at a minimum, four pieces of information to \code{simulate} in order to simulate a dataset from scratch in \code{unmarked}.
+
+\begin{enumerate}
+  \item The name of the fitting function for the model we want to simulate from, as a character string
+  \item A list of formulas, one per submodel, containing the names of the covariates we want to include in each
+  \item A list of vectors of regression coefficients (intercepts and slopes), one per submodel, matching the formulas
+  \item A list of design components; for example, the number of sites and the number of observations per site
+\end{enumerate}
+
+A number of other arguments are available, e.g. to customize how the covariates are randomly generated or to specify the distributions to use when simulating abundances.
+We'll show those later.
+The easiest way to demonstrate how to use \code{simulate} is to look at an example: we'll start with a simple one for occupancy.
+
+\section{Simulating an occupancy dataset}
+
+Suppose we want to simulate an occupancy dataset in which site occupancy is affected by elevation.
+The first piece of information needed is the name of the model to use: the fitting function for occupancy is \code{occu}, so the first argument to \code{simulate}, the name of the model, will be \code{"occu"}.
+
+\subsection{Formulas}
+
+Second, we must define the desired model structure as a list of formulas, one per submodel.
+``Submodels'' here are the hierarchical components of the model; for example, an occupancy model has a state (occupancy) submodel and an observation (detection) submodel.
+These submodels are identified by short names: \code{state} and \code{det}.
+We will use these short names repeatedly.
+To identify which submodels are needed and what their short names are, we can simply fit any model of that type (e.g. from the example) and call \code{names(model)}.
+
+<<echo=FALSE>>=
+set.seed(123)
+@
+
+<<>>=
+library(unmarked)
+umf <- unmarkedFrameOccu(y=matrix(c(0,1,0,1,1,0,0,0,1), nrow=3))
+mod <- occu(~1~1, umf)
+names(mod)
+@
+
+Formulas are supplied as a named list.
+The list has one element per submodel, and the names of the elements are the short names defined above.
+Each list element is a formula containing the desired number of covariates to use, and the names of these covariates.
+Below we define our list of formulas, including an effect of elevation on occupancy (note that we could name this covariate whatever we want; here we call it \code{elev}).
+We don't want any covariates on detection probability, so the formula defines the model as intercept-only: \code{~1}.
+
+<<>>=
+forms <- list(state=~elev, det=~1)
+@
+
+\subsection{Regression coefficients}
+
+Next we must tell \code{unmarked} what the values for the intercept and regression coefficients in each submodel should be.
+Once again, this is a named list, with one element for each submodel.
+Each list element is a numeric vector.
+The components of each numeric vector must also be named, matching the covariate names in our list of formulas.
+Don't forget that we must also specify a value for the intercept in each submodel (which can be named \code{Intercept} or \code{intercept}).
+If we are not sure exactly how to structure this list, we can just skip it for now: \code{unmarked} will generate a template for us to fill in.
+
+<<>>=
+coefs <- list(state=c(intercept=0, elev=-0.4), det=c(intercept=0))
+@
+
+We have a list with two elements, each a numeric vector.
+Both contain intercept values, and the \code{state} vector also contains a value corresponding to the desired effect of our covariate \code{elev}.
+
+\subsection{Study design information}
+
+Finally, we need to give \code{unmarked} information about the study design.
+This is pretty simple: we just need a list containing values for \code{M}, the number of sites, and \code{J}, the number of surveys per site.
+For models with multiple primary periods, we'd also need a value for \code{T}, the number of primary periods.
+
+<<>>=
+design <- list(M=300, J=8) # 300 sites, 8 occasions per site
+@
+
+\subsection{Put it all together}
+
+We're now ready to simulate a dataset.
+To do this we use the \code{simulate} function, providing as arguments the name of the model, \code{"occu"}, and the three lists we constructed above.
+First, though, let's omit the \code{coefs} list, to show how \code{unmarked} will generate a template for us to use:
+
+<<eval=FALSE>>=
+simulate("occu", formulas=forms, design=design)
+@
+
+<<echo=FALSE>>=
+try(simulate("occu", formulas=forms, design=design))
+@
+
+We can copy this template structure and fill in our own numeric values.
+Once we have our \code{coefs} list set up properly, we add it to the function call:
+
+<<>>=
+occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design)
+head(occu_umf)
+@
+
+\code{unmarked} has generated a presence-absence dataset as well as values for the covariate \code{elev}.
+We can check that it worked as expected by fitting the corresponding model to the dataset and making sure the estimated values are similar:
+
+<<>>=
+(occu(~1~elev, occu_umf))
+@
+
+\subsection{Customizing the covariates}
+
+By default, a covariate will be continuous and drawn from a standard normal distribution (mean 0, SD 1).
+However, we can control this using the \code{guide} argument.
+For example, suppose we want elevation to come from a normal distribution with a mean of 2 and a standard deviation of 0.5.
+We can provide a named list to the \code{guide} argument as follows:
+
+<<>>=
+guide <- list(elev=list(dist=rnorm, mean=2, sd=0.5))
+@
+
+\code{guide} contains one element, called \code{elev}, which is also a list and contains three components:
+
+\begin{enumerate}
+  \item{The random distribution function to use, \code{rnorm}}
+  \item{The mean of the distribution}
+  \item{The SD of the distribution}
+\end{enumerate}
+
+<<>>=
+occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design, guide=guide)
+head(occu_umf)
+@
+
+You can see the \code{elev} covariate now has values corresponding to the desired distribution.
+Note that the elements of the list will depend on the arguments required by the random distribution function.
+For example, to use a uniform distribution instead:
+
+<<>>=
+guide <- list(elev=list(dist=runif, min=0, max=1))
+occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design, guide=guide)
+head(occu_umf)
+@
+
+It is also possible to define a categorical (factor) covariate.
+We specify an entry in the \code{guide} list, but instead of a list, we supply a call to \code{factor} which defines the desired factor levels.
+For example, suppose we want to add a new \code{landcover} covariate to our simulated model.
+First, define the new formulas:
+
+<<>>=
+forms2 <- list(state=~elev+landcover, det=~1)
+@
+
+And then the new guide, including the information about factor levels:
+
+<<>>=
+guide <- list(landcover=factor(levels=c("forest","grass")))
+@
+
+We'd also need an updated \code{coefs} since we have a new covariate.
+Defining the \code{coefs} when you have factors in your model is a little trickier, since R names the effects as a combination of the factor name and the level name.
+The easiest way to deal with this is to let \code{unmarked} generate a template \code{coefs} for you, as shown above, and then fill it in.
+
+<<>>=
+# forest is the reference level for landcover since it was listed first
+coefs2 <- list(state=c(intercept=0, elev=-0.4, landcovergrass=0.2), det=c(intercept=0))
+@
+
+<<>>=
+head(simulate("occu", formulas=forms2, coefs=coefs2, design=design, guide=guide))
+@
+
+Our output dataset now includes a new categorical covariate.
+
+\subsection{Models that require more information}
+
+More complex models might require more information for simulation.
+Nearly any argument provided to either the fitting function for the model, or the corresponding \code{unmarkedFrame} constructor, can be provided as an optional argument to \code{simulate} to customize the simulation.
+For example, we may want to specify that abundance should be simulated as a negative binomial, instead of a Poisson, for \code{pcount}.
+This information is simply added as additional arguments to \code{simulate}.
+To simulate a \code{pcount} dataset using the negative binomial (\code{"NB"}) distribution:
+
+<<>>=
+head(simulate("pcount", formulas=forms, coefs=coefs, design=design, mixture="NB"))
+@
+
+In the next section we will show a more detailed example involving these additional arguments.
+
+\section{Simulating a more complex dataset: \code{gdistremoval}}
+
+The \code{gdistremoval} function fits the model of \cite{Amundson_2014}, which estimates abundance using a combination of distance sampling and removal sampling data.
+When simulating a dataset based on this model, we have to provide several additional pieces of information related to the structure of the distance and removal sampling analyses.
+
+To begin, we will define the list of formulas.
+A \code{gdistremoval} model, when there is only one primary period, has three submodels: abundance (\code{"lambda"}), distance sampling (\code{"dist"}), and removal sampling (\code{"rem"}).
+We will fit a model with an effect of elevation (\code{elev}) on abundance and an effect of wind (\code{wind}) on removal probability.
+
+<<>>=
+forms <- list(lambda=~elev, dist=~1, rem=~wind)
+@
+
+Next we will define the corresponding coefficients.
+We will set mean abundance at 5.
+The intercept is on the log scale; thus the intercept for \code{lambda} will be \code{log(5)}.
+The scale parameter for the detection function will be 50, and again it is on the log scale.
+The intercept for the removal probability is on the logit scale, so we will set it at -1 (equivalent to a mean removal probability of about 0.27).
+Don't forget the covariate effects on \code{lambda} and removal.
+
+<<>>=
+coefs <- list(lambda=c(intercept=log(5), elev=0.7),
+              dist=c(intercept=log(50)), rem=c(intercept=-1, wind=-0.3))
+@
+
+Our study will have 300 sites.
+This model is unique in that we have to specify the numbers of two different types of observations: (1) the number of distance sampling bins (\code{Jdist}), and (2) the number of removal intervals (\code{Jrem}).
+
+<<>>=
+design <- list(M = 300, Jdist=4, Jrem=5)
+@
+
+Finally we are ready to simulate the dataset.
+In addition to the name of the model, \code{forms}, \code{coefs} and \code{design}, we also need to provide some additional information.
+We need to define the distance breaks for the distance sampling part of the model (there should be \code{Jdist+1} of these), and also the key function to use when simulating the detection process.
+
+<<>>=
+umf <- simulate("gdistremoval", formulas=forms, coefs=coefs, design=design,
+                dist.breaks=c(0,25,50,75,100), keyfun="halfnorm", unitsIn="m")
+head(umf)
+@
+
+The result is a dataset containing a combination of distance, removal, and covariate data.
+We can check to see if fitting a model to this dataset recovers our specified coefficient values:
+
+<<>>=
+(fit <- gdistremoval(lambdaformula=~elev, removalformula=~wind,
+                     distanceformula=~1, data=umf))
+@
+
+Looks good.
+
+\section{Conclusion}
+
+The \code{simulate} function provides a flexible tool for simulating data from any model in \code{unmarked}.
+These datasets can be used for a variety of purposes, such as teaching examples, testing models, or developing new tools that work with \code{unmarked}.
+Additionally, simulating datasets is a key component of the power analysis workflow in \code{unmarked}; see the power analysis vignette for more examples.
+
+\bibliography{unmarked}
+
+\end{document}
diff --git a/vignettes/unmarked.bib b/vignettes/unmarked.bib
index 996f397..672a401 100644
--- a/vignettes/unmarked.bib
+++ b/vignettes/unmarked.bib
@@ -331,3 +331,19 @@ url = {https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/j.2041-210X.2
 year = {2012}
 }
+@article{Amundson_2014,
+  author = {Amundson, Courtney L. and Royle, J. Andrew and Handel, Colleen M.},
+  title = "{A hierarchical model combining distance sampling and time removal to estimate detection probability during avian point counts}",
+  journal = {The Auk},
+  volume = {131},
+  number = {4},
+  pages = {476-494},
+  year = {2014},
+  month = {07},
+  issn = {1938-4254},
+  doi = {10.1642/AUK-14-11.1},
+  url = {https://doi.org/10.1642/AUK-14-11.1},
+  eprint = {https://academic.oup.com/auk/article-pdf/131/4/476/26883760/auk0476.pdf},
+}
+
+