1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
|
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/validation.R
\name{oneStepPredict}
\alias{oneStepPredict}
\title{Calculate one-step-ahead (OSA) residuals for a latent variable model.}
\usage{
oneStepPredict(
obj,
observation.name = NULL,
data.term.indicator = NULL,
method = c("oneStepGaussianOffMode", "fullGaussian", "oneStepGeneric",
"oneStepGaussian", "cdf"),
subset = NULL,
conditional = NULL,
discrete = NULL,
discreteSupport = NULL,
range = c(-Inf, Inf),
seed = 123,
parallel = FALSE,
trace = TRUE,
reverse = (method == "oneStepGaussianOffMode"),
splineApprox = TRUE,
...
)
}
\arguments{
\item{obj}{Output from \code{MakeADFun}.}
\item{observation.name}{Character naming the observation in the template.}
\item{data.term.indicator}{Character naming an indicator data variable in the template (not required by all methods - see details).}
\item{method}{Method to calculate OSA (see details).}
\item{subset}{Index vector of observations that will be added one by one during OSA. By default \code{1:length(observations)} (with \code{conditional} subtracted).}
\item{conditional}{Index vector of observations that are fixed during OSA. By default the empty set.}
\item{discrete}{Logical; Are observations discrete? (assumed FALSE by default).}
\item{discreteSupport}{Possible outcomes of discrete part of the distribution (\code{method="oneStepGeneric"} and \code{method="cdf"} only).}
\item{range}{Possible range of continuous part of the distribution (\code{method="oneStepGeneric"} only).}
\item{seed}{Randomization seed (discrete case only). If \code{NULL} the RNG seed is untouched by this routine (recommended for simulation studies).}
\item{parallel}{Run in parallel using the \code{parallel} package?}
\item{trace}{Logical; Trace progress? More options available for \code{method="oneStepGeneric"} - see details.}
\item{reverse}{Do calculations in opposite order to improve stability? (currently enabled by default for \code{oneStepGaussianOffMode} method only)}
\item{splineApprox}{Represent one-step conditional distribution by a spline to reduce number of density evaluations? (\code{method="oneStepGeneric"} only).}
\item{...}{Control parameters for OSA method}
}
\value{
\code{data.frame} with OSA \emph{standardized} residuals
in column \code{residual}. In addition, depending on the method, the output
includes selected characteristics of the predictive distribution (current row) given past observations (past rows), notably the \emph{conditional}
\describe{
\item{mean}{Expectation of the current observation}
\item{sd}{Standard deviation of the current observation}
\item{Fx}{CDF at current observation}
\item{px}{Density at current observation}
\item{nll}{Negative log density at current observation}
\item{nlcdf.lower}{Negative log of the lower CDF at current observation}
\item{nlcdf.upper}{Negative log of the upper CDF at current observation}
}
\emph{given past observations}.
If column \code{randomize} is present, it indicates that randomization has been applied for the row.
}
\description{
Calculate one-step-ahead (OSA) residuals for a latent variable
model. (\emph{Beta version; may change without notice})
}
\details{
Given a TMB latent variable model this function calculates OSA
standardized residuals that can be used for goodness-of-fit
assessment. The approach is based on a factorization of the joint
distribution of the \emph{observations} \eqn{X_1,...,X_n} into
successive conditional distributions.
Denote by
\deqn{F_n(x_n) = P(X_n \leq x_n | X_1 = x_1,...,X_{n-1}=x_{n-1} )}
the one-step-ahead CDF, and by
\deqn{p_n(x_n) = P(X_n = x_n | X_1 = x_1,...,X_{n-1}=x_{n-1} )}
the corresponding point probabilities (zero for continuous distributions).
In case of continuous observations the sequence
\deqn{\Phi^{-1}(F_1(X_1))\:,...,\:\Phi^{-1}(F_n(X_n))}
will be iid standard normal. These are referred to as the OSA residuals.
In case of discrete observations draw (unit) uniform variables
\eqn{U_1,...,U_n} and construct the randomized OSA residuals
\deqn{\Phi^{-1}(F_1(X_1)-U_1 p_1(X_1))\:,...,\:\Phi^{-1}(F_n(X_n)-U_n p_n(X_n))}
These are also iid standard normal.
}
\section{Choosing the method}{
The user must specify the method used to calculate the residuals - see detailed list of method descriptions below.
We note that all the methods are based on approximations. While the default 'oneStepGaussianoffMode' often represents a good compromise between accuracy and speed, it cannot be assumed to work well for all model classes.
As a rule of thumb, if in doubt whether a method is accurate enough, you should always compare with the 'oneStepGeneric' which is considered the most accurate of the available methods.
\describe{
\item{method="fullGaussian"}{
This method assumes that the joint distribution of data \emph{and}
random effects is Gaussian (or well approximated by a
Gaussian). It does not require any changes to the user
template. However, if used in conjunction with \code{subset}
and/or \code{conditional} a \code{data.term.indicator} is required
- see the next method.
}
\item{method="oneStepGeneric"}{
This method calculates the one-step conditional probability
density as a ratio of Laplace approximations. The approximation is
integrated (and re-normalized for improved accuracy) using 1D
numerical quadrature to obtain the one-step CDF evaluated at each
data point. The method works in the continuous case as well as the
discrete case (\code{discrete=TRUE}).
It requires a specification of a \code{data.term.indicator}
explained in the following. Suppose the template for the
observations given the random effects (\eqn{u}) looks like
\preformatted{
DATA_VECTOR(x);
...
nll -= dnorm(x(i), u(i), sd(i), true);
...
}
Then this template can be augmented with a
\code{data.term.indicator = "keep"} by changing the template to
\preformatted{
DATA_VECTOR(x);
DATA_VECTOR_INDICATOR(keep, x);
...
nll -= keep(i) * dnorm(x(i), u(i), sd(i), true);
...
}
The new data vector (\code{keep}) need not be passed from \R. It
automatically becomes a copy of \code{x} filled with ones.
Some extra parameters are essential for the method.
Pay special attention to the integration domain which must be set either via \code{range} (continuous case) or \code{discreteSupport} (discrete case). Both of these can be set simultanously to specify a mixed continuous/discrete distribution. For example, a non-negative distribution with a point mass at zero (e.g. the Tweedie distribution) should have \code{range=c(0,Inf)} and \code{discreteSupport=0}.
Several parameters control accuracy and appropriate settings are case specific. By default, a spline is fitted to the one-step density before integration (\code{splineApprox=TRUE}) to reduce the number of density evaluations. However, this setting may have negative impact on accuracy. The spline approximation can then either be disabled or improved by noting that \code{...} arguments are passed to \link{tmbprofile}: Pass e.g. \code{ystep=20, ytol=0.1}.
Finally, it may be useful to look at the one step predictive distributions on either log scale (\code{trace=2}) or natural scale (\code{trace=3}) to determine which alternative methods might be appropriate.
}
\item{method="oneStepGaussian"}{
This is a special case of the generic method where the one step
conditional distribution is approximated by a Gaussian (and can
therefore be handled more efficiently).
}
\item{method="oneStepGaussianOffMode"}{
This is an approximation of the "oneStepGaussian" method that
avoids locating the mode of the one-step conditional density.
}
\item{method="cdf"}{
The generic method can be slow due to the many function
evaluations used during the 1D integration (or summation in the
discrete case). The present method can speed up this process but
requires more changes to the user template. The above template
must be expanded with information about how to calculate the
negative log of the lower and upper CDF:
\preformatted{
DATA_VECTOR(x);
DATA_VECTOR_INDICATOR(keep, x);
...
nll -= keep(i) * dnorm(x(i), u(i), sd(i), true);
nll -= keep.cdf_lower(i) * log( pnorm(x(i), u(i), sd(i)) );
nll -= keep.cdf_upper(i) * log( 1.0 - pnorm(x(i), u(i), sd(i)) );
...
}
The specialized members \code{keep.cdf_lower} and
\code{keep.cdf_upper} automatically become copies of \code{x}
filled with zeros.
}
}
}
\examples{
######################## Gaussian case
runExample("simple")
osa.simple <- oneStepPredict(obj, observation.name = "x", method="fullGaussian")
qqnorm(osa.simple$residual); abline(0,1)
\dontrun{
######################## Poisson case (First 100 observations)
runExample("ar1xar1")
osa.ar1xar1 <- oneStepPredict(obj, "N", "keep", method="cdf", discrete=TRUE, subset=1:100)
qqnorm(osa.ar1xar1$residual); abline(0,1)
}
}
|