| author | Ken Kellner <ken@kenkellner.com> | 2021-09-10 16:23:34 -0400 |
|---|---|---|
| committer | Ken Kellner <ken@kenkellner.com> | 2021-09-10 16:23:34 -0400 |
| commit | dba95e93d75c84b369e52dfde347c4c23645d5b2 (patch) | |
| tree | 7e5a798de4705f7c51d0b3bdecb7ee16c0e81e1d /vignettes | |
| parent | b513c39d50f9fc3f14215b026d23d299b1051573 (diff) | |
Add simulate vignette
Diffstat (limited to 'vignettes')
-rw-r--r-- | vignettes/powerAnalysis.Rnw | 2 |
-rw-r--r-- | vignettes/simulate.Rnw | 313 |
-rw-r--r-- | vignettes/unmarked.bib | 16 |
3 files changed, 330 insertions, 1 deletion
diff --git a/vignettes/powerAnalysis.Rnw b/vignettes/powerAnalysis.Rnw
index f2c6816..5bec099 100644
--- a/vignettes/powerAnalysis.Rnw
+++ b/vignettes/powerAnalysis.Rnw
@@ -186,7 +186,7 @@ coefs <- list(state=c(intercept=0, elev=-0.4), det=c(intercept=0))
 @
 
 Finally, we need to give \code{unmarked} information about the study design.
-This is pretty simple: we just need a list containing values for \code{M}, the number of sites, and code{J} the number of surveys per site.
+This is pretty simple: we just need a list containing values for \code{M}, the number of sites, and \code{J} the number of surveys per site.
 For models with multiple primary periods, we'd also need a value of \code{T}, the number of primary periods.
 
 <<>>=
diff --git a/vignettes/simulate.Rnw b/vignettes/simulate.Rnw
new file mode 100644
index 0000000..ede6c76
--- /dev/null
+++ b/vignettes/simulate.Rnw
@@ -0,0 +1,313 @@
+<<echo=false>>=
+options(width=70)
+options(continue=" ")
+@
+
+\documentclass[a4paper]{article}
+\usepackage[OT1]{fontenc}
+\usepackage{Sweave}
+\usepackage{natbib}
+%\usepackage{fullpage}
+\usepackage[vmargin=1in,hmargin=1in]{geometry}
+\bibliographystyle{ecology}
+
+\usepackage{hyperref}
+\hypersetup{
+  colorlinks=true,
+  linkcolor=blue,
+  urlcolor=cyan,
+  citecolor=black
+}
+
+\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
+\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
+\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
+\fvset{listparameters={\setlength{\topsep}{0pt}}}
+\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
+
+%%\VignetteIndexEntry{Simulating datasets}
+
+\title{Simulating datasets}
+\author{Ken Kellner}
+\date{September 10, 2021}
+
+
+\begin{document}
+
+\newcommand{\code}[1]{\texttt{\small{#1}}}
+\newcommand{\package}[1]{\textsf{\small{#1}}}
+
+\maketitle
+
+\section{Outline}
+
+\begin{enumerate}
+  \item Introduction
+  \item Components of a call to \code{simulate}
+  \item Simulating an occupancy dataset
+  \item Simulating a more complex dataset: \code{gdistremoval}
+  \item Conclusion
+\end{enumerate}
+
+\section{Introduction}
+
+Simulating datasets is a powerful and versatile tool when conducting \code{unmarked} analyses.
+Writing our own code to simulate a dataset based on a given model is an excellent learning tool, and can help us test if a given model is generating the expected results.
+If we simulate a series of datasets based on a fitted model, and calculate a statistic from each of those fits, we can generate a distribution of the statistic; this is what the \code{parboot} function does.
+This can be helpful, for example, when testing goodness of fit.
+Finally, simulation can be a useful component of power analysis when a closed-form equation for power is not available.
+
+\code{unmarked} provides two different ways of generating simulated datasets, depending on the stage we are at in the modeling process:
+
+\begin{enumerate}
+  \item Generating simulated datasets from a fitted model we already have
+  \item Generating simulated datasets from scratch
+\end{enumerate}
+
+For (1), we simply call the \code{simulate} method on our fitted model object and new dataset(s) are generated.
+This is the approach used by \code{parboot}.
+In this vignette we will focus on (2), a more flexible approach, also using the \code{simulate} method, that allows us to generate a dataset corresponding to any \code{unmarked} model from scratch.
+
+\section{Components of a call to simulate}
+
+We need to provide, at a minimum, four pieces of information to \code{simulate} in order to simulate a dataset from scratch in \code{unmarked}.
+
+\begin{enumerate}
+  \item The name of the fitting function for the model we want to simulate from, as a character string
+  \item A list of formulas, one per submodel, containing the names of the covariates we want to include in each
+  \item A list of vectors of regression coefficients (intercepts and slopes), one per submodel, matching the formulas
+  \item A list of design components; for example, the number of sites and the number of observations per site
+\end{enumerate}
+
+A number of other arguments are available, e.g. to customize how the covariates are randomly generated or to specify the distributions to use when simulating abundances.
+We'll show those later.
+The easiest way to demonstrate how to use \code{simulate} is to look at an example: we'll start with a simple one for occupancy.
+
+\section{Simulating an occupancy dataset}
+
+Suppose we want to simulate an occupancy dataset in which site occupancy is affected by elevation.
+The first piece of information needed is the name of the model to use: the fitting function for occupancy is \code{occu}, so the first argument to \code{simulate}, the name of the model, will be \code{"occu"}.
+
+\subsection{Formulas}
+
+Second, we must define the desired model structure as a list of formulas, one per submodel.
+``Submodels'' here are the hierarchical components of the model; for example, an occupancy model has a state (occupancy) submodel and an observation (detection) submodel.
+These submodels are identified by short names: \code{state} and \code{det}.
+We will use these short names repeatedly.
+To identify which submodels are needed and what their short names are, we can simply fit any model of that type (e.g. from the example) and call \code{names(model)}.
+
+<<echo=FALSE>>=
+set.seed(123)
+@
+
+<<>>=
+library(unmarked)
+umf <- unmarkedFrameOccu(y=matrix(c(0,1,0,1,1,0,0,0,1), nrow=3))
+mod <- occu(~1~1, umf)
+names(mod)
+@
+
+Formulas are supplied as a named list.
+The list has one element per submodel, and the names of the elements are the short names defined above.
+Each list element is a formula containing the desired number of covariates to use, and the names of these covariates.
+Below we define our list of formulas, including an effect of elevation on occupancy (note that we could name this covariate whatever we want; here we call it \code{elev}).
+We don't want any covariates on detection probability, so the formula defines the model as intercept-only: \code{~1}.
+
+<<>>=
+forms <- list(state=~elev, det=~1)
+@
+
+\subsection{Regression coefficients}
+
+Next we must tell \code{unmarked} what the values for the intercept and regression coefficients in each submodel should be.
+Once again, this is a named list, with one element for each submodel.
+Each list element is a numeric vector.
+The components of each numeric vector must also be named, matching the covariate names in our list of formulas.
+Don't forget that we must also specify a value for the intercept in each submodel (which can be named \code{Intercept} or \code{intercept}).
+If we are not sure exactly how to structure this list, we can just skip it for now: \code{unmarked} will generate a template for us to fill in.
+
+<<>>=
+coefs <- list(state=c(intercept=0, elev=-0.4), det=c(intercept=0))
+@
+
+We have a list with two elements, each a numeric vector.
+Both contain intercept values, and the \code{state} vector also contains a value corresponding to the desired effect of our covariate \code{elev}.
+
+\subsection{Study design information}
+
+Finally, we need to give \code{unmarked} information about the study design.
+This is pretty simple: we just need a list containing values for \code{M}, the number of sites, and \code{J}, the number of surveys per site.
+For models with multiple primary periods, we'd also need a value for \code{T}, the number of primary periods.
+
+<<>>=
+design <- list(M=300, J=8) # 300 sites, 8 occasions per site
+@
+
+\subsection{Put it all together}
+
+We're now ready to simulate a dataset.
+To do this we use the \code{simulate} function, providing as arguments the name of the model, \code{"occu"}, and the three lists we constructed above.
+First, though, let's omit the \code{coefs} list, to show how \code{unmarked} will generate a template for us to use:
+
+<<eval=FALSE>>=
+simulate("occu", formulas=forms, design=design)
+@
+
+<<echo=FALSE>>=
+try(simulate("occu", formulas=forms, design=design))
+@
+
+We can copy this template structure and fill in our own numeric values.
+Once we have our \code{coefs} list set up properly, we add it to the function call:
+
+<<>>=
+occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design)
+head(occu_umf)
+@
+
+\code{unmarked} has generated a presence-absence dataset as well as values for the covariate \code{elev}.
+We can check that it worked as expected by fitting the corresponding model to the dataset and making sure the estimated values are similar:
+
+<<>>=
+(occu(~1~elev, occu_umf))
+@
+
+\subsection{Customizing the covariates}
+
+By default, a covariate will be continuous and drawn from a standard normal distribution (mean 0, SD 1).
+However, we can control this using the \code{guide} argument.
+For example, suppose we want elevation to come from a normal distribution with a mean of 2 and a standard deviation of 0.5.
+We can provide a named list to the \code{guide} argument as follows:
+
+<<>>=
+guide <- list(elev=list(dist=rnorm, mean=2, sd=0.5))
+@
+
+\code{guide} contains one element, called \code{elev}, which is also a list and contains three components:
+
+\begin{enumerate}
+  \item{The random distribution function to use, \code{rnorm}}
+  \item{The mean of the distribution}
+  \item{The SD of the distribution}
+\end{enumerate}
+
+<<>>=
+occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design, guide=guide)
+head(occu_umf)
+@
+
+You can see the \code{elev} covariate now has values corresponding to the desired distribution.
+Note that the elements of the list will depend on the arguments required by the random distribution function.
+For example, to use a uniform distribution instead:
+
+<<>>=
+guide <- list(elev=list(dist=runif, min=0, max=1))
+occu_umf <- simulate("occu", formulas=forms, coefs=coefs, design=design, guide=guide)
+head(occu_umf)
+@
+
+It is also possible to define a categorical (factor) covariate.
+We specify an entry in the \code{guide} list, but instead of a list, we supply a call to \code{factor} which defines the desired factor levels.
+For example, suppose we want to add a new \code{landcover} covariate to our simulated model.
+First, define the new formulas:
+
+<<>>=
+forms2 <- list(state=~elev+landcover, det=~1)
+@
+
+And then the new guide, including the information about factor levels:
+
+<<>>=
+guide <- list(landcover=factor(levels=c("forest","grass")))
+@
+
+We'd also need an updated \code{coefs} since we have a new covariate.
+Defining the \code{coefs} when you have factors in your model is a little trickier, since R names the effects as a combination of the factor name and the level name.
+The easiest way to deal with this is to let \code{unmarked} generate a template \code{coefs} for you, as shown above, and then fill it in.
+
+<<>>=
+# forest is the reference level for landcover since it was listed first
+coefs2 <- list(state=c(intercept=0, elev=-0.4, landcovergrass=0.2), det=c(intercept=0))
+@
+
+<<>>=
+head(simulate("occu", formulas=forms2, coefs=coefs2, design=design, guide=guide))
+@
+
+Our output dataset now includes a new categorical covariate.
+
+\subsection{Models that require more information}
+
+More complex models might require more information for simulation.
+Nearly any argument provided to either the fitting function for the model, or the corresponding \code{unmarkedFrame} constructor, can be provided as an optional argument to \code{simulate} to customize the simulation.
+For example, we may want to specify that abundance should be simulated as a negative binomial, instead of a Poisson, for \code{pcount}.
+This information is simply added as additional arguments to \code{simulate}.
+To simulate a \code{pcount} dataset using the negative binomial (\code{"NB"}) distribution:
+
+<<>>=
+head(simulate("pcount", formulas=forms, coefs=coefs, design=design, mixture="NB"))
+@
+
+In the next section we will show a more detailed example involving these additional arguments.
+
+\section{Simulating a more complex dataset: \code{gdistremoval}}
+
+The \code{gdistremoval} function fits the model of \cite{Amundson_2014}, which estimates abundance using a combination of distance sampling and removal sampling data.
+When simulating a dataset based on this model, we have to provide several additional pieces of information related to the structure of the distance and removal sampling analyses.
+
+To begin, we will define the list of formulas.
+A \code{gdistremoval} model, when there is only one primary period, has three submodels: abundance (\code{"lambda"}), distance sampling (\code{"dist"}), and removal sampling (\code{"rem"}).
+We will fit a model with an effect of elevation (\code{elev}) on abundance and an effect of wind (\code{wind}) on removal probability.
+
+<<>>=
+forms <- list(lambda=~elev, dist=~1, rem=~wind)
+@
+
+Next we will define the corresponding coefficients.
+We will set mean abundance at 5.
+The intercept is on the log scale; thus the intercept for \code{lambda} will be \code{log(5)}.
+The scale parameter for the detection function will be 50, and again it is on the log scale.
+The intercept for the removal probability is on the logit scale, so we will set it at -1 (equivalent to a mean removal probability of about 0.27).
+Don't forget the covariate effects on \code{lambda} and removal.
+
+<<>>=
+coefs <- list(lambda=c(intercept=log(5), elev=0.7),
+              dist=c(intercept=log(50)), rem=c(intercept=-1, wind=-0.3))
+@
+
+Our study will have 300 sites.
+This model is unique in that we have to specify the numbers of two different types of observations: (1) the number of distance sampling bins (\code{Jdist}), and (2) the number of removal intervals (\code{Jrem}).
+
+<<>>=
+design <- list(M = 300, Jdist=4, Jrem=5)
+@
+
+Finally we are ready to simulate the dataset.
+In addition to the name of the model, \code{forms}, \code{coefs} and \code{design}, we also need to provide some additional information.
+We need to define the distance breaks for the distance sampling part of the model (there should be \code{Jdist+1} of these), and also the key function to use when simulating the detection process.
+
+<<>>=
+umf <- simulate("gdistremoval", formulas=forms, coefs=coefs, design=design,
+                dist.breaks=c(0,25,50,75,100), keyfun="halfnorm", unitsIn="m")
+head(umf)
+@
+
+The result is a dataset containing a combination of distance, removal, and covariate data.
+We can check to see if fitting a model to this dataset recovers our specified coefficient values:
+
+<<>>=
+(fit <- gdistremoval(lambdaformula=~elev, removalformula=~wind,
+                     distanceformula=~1, data=umf))
+@
+
+Looks good.
+
+\section{Conclusion}
+
+The \code{simulate} function provides a flexible tool for simulating data from any model in \code{unmarked}.
+These datasets can be used for a variety of purposes, such as teaching examples, testing models, or developing new tools that work with \code{unmarked}.
+Additionally, simulating datasets is a key component of the power analysis workflow in \code{unmarked}; see the power analysis vignette for more examples.
+
+\bibliography{unmarked}
+
+\end{document}
diff --git a/vignettes/unmarked.bib b/vignettes/unmarked.bib
index 996f397..672a401 100644
--- a/vignettes/unmarked.bib
+++ b/vignettes/unmarked.bib
@@ -331,3 +331,19 @@ url = {https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/j.2041-210X.2
 year = {2012}
 }
+@article{Amundson_2014,
+  author = {Amundson, Courtney L. and Royle, J. Andrew and Handel, Colleen M.},
+  title = "{A hierarchical model combining distance sampling and time removal to estimate detection probability during avian point counts}",
+  journal = {The Auk},
+  volume = {131},
+  number = {4},
+  pages = {476-494},
+  year = {2014},
+  month = {07},
+  issn = {1938-4254},
+  doi = {10.1642/AUK-14-11.1},
+  url = {https://doi.org/10.1642/AUK-14-11.1},
+  eprint = {https://academic.oup.com/auk/article-pdf/131/4/476/26883760/auk0476.pdf},
+}
+
+