\documentclass[a4paper]{article}
\usepackage[OT1]{fontenc}
\usepackage{Sweave}
\usepackage{amssymb}
%\usepackage{hyperref}
\newcommand{\sT}{simTool}
\newcommand{\pv}{$p$-value }
\newcommand{\pvs}{$p$-values }
\newcommand{\mts}{MutossSimObject}
\newcommand{\tw}[1]{\texttt{#1}}
\begin{document}
% \VignetteIndexEntry{MuToss Simulation Tool Guide}
\title{\sT}
\author{Marsel Scheer}
\maketitle
\tableofcontents
<<echo=FALSE>>=
require(mutoss)
@
\section{Note}
This Sweave file is part of the $\mu$Toss package. Because some parts of
this document are computationally intensive, the extension of the Sweave file
is ".actuallyRnw".
\section{Introduction}
\label{introduction}
This document gives an introduction to the use of \sT.
We start with a very simple application, then raise the
degree of complexity in a few steps, and in the end reproduce
some of the results from Benjamini, Krieger and Yekutieli (2006).
Basically, a simulation needs the following ingredients:
\begin{enumerate}
\item a function that generates the data (for example \pvs)
\item some procedures that evaluate the generated data (for
example the Bonferroni correction)
\item statistics we want to calculate, for example the false
discovery proportion (FDP).
\end{enumerate}
If we speak, for example, of $1000$ replications, we mean that these 3 steps
were repeated $1000$ times. That means, after $1000$ replications, $1000$ data
sets have been generated and every procedure has been applied to each of these
$1000$ data sets. If we have specified, for example, $3$ procedures, then each
of these procedures gives us an output, that is, $3000$ "output objects". Every
specified statistic is then applied to these $3000$ "output objects".
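The following sketch illustrates this scheme (plain R pseudo-code with the
hypothetical names \tw{dataGenFun}, \tw{procedures} and \tw{statistics}; the
actual \tw{simulation} function additionally records parameter constellations,
and the statistics are in fact gathered afterwards by \tw{gatherStatistics}):
<<eval=FALSE>>=
## Conceptual sketch of the replication loop; the names dataGenFun,
## procedures and statistics are hypothetical placeholders, not the
## actual implementation of simulation().
for (r in 1:replications) {
  data <- dataGenFun()                   # 1. generate a data set
  for (proc in procedures) {
    out <- proc(data$procInput$pValues)  # 2. apply a procedure
    for (stat in statistics) {
      stat(data, out)                    # 3. calculate a statistic
    }
  }
}
@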
\subsection{Motivation}
\subsubsection{Example 1}
Suppose we want to compare some characteristics of the two procedures \tw{BH} and \tw{holm}.
Denote by $V_n(BH)$ and $V_n(holm)$ the number of rejected true hypotheses. For example,
we are interested in the distribution of $V_n(BH)$ and $V_n(holm)$ and also in $E[V_n(BH)]$
and $E[V_n(holm)]$. Let's have a look at the arguments of these procedures:
<<>>=
args(BH)
args(holm)
@
Both have the argument \tw{alpha}, which stands for the level
of the corresponding error measure.
\tw{BH} controls the false discovery rate (FDR) and \tw{holm}
controls the family-wise error rate (FWER). One question may be: how
must \tw{alpha} be set in \tw{holm} such that for fixed \tw{alpha} = 0.1 in
\tw{BH} we have $E[V_n(BH)] \approx E[V_n(holm)]$?
And perhaps we are also interested in how a dependence structure affects the
distribution of $V_n(BH)$ and $V_n(holm)$.
\subsubsection{Example 2}
Suppose you have developed a new procedure controlling some error measure and you
want to compare this new procedure with some other already established procedures.
Then a simulation study will consist of creating data and gathering statistics for
different sample sizes, dependence structures and parameter constellations.
\par
Basically it is always the same story: data is generated by a specific function,
which perhaps depends on some parameters, and we want to apply different procedures,
again depending on some parameters, to this generated data. Nearly everyone will say,
"don't bother me with the details of programming, just do the simulation
with the above information and give me a nice \tw{data.frame} which I can analyze."
And this is exactly the purpose of the \sT.
\subsection{Data generating function}
\label{DataGenFun}
For now we only want to use the procedures \texttt{BH} and \texttt{holm}. Again, let's
have a look at the arguments of these procedures:
<<>>=
args(BH)
args(holm)
@
Not much has to be specified. The parameter \texttt{alpha} is the level at which
the corresponding error rate should be controlled, and \tw{pValues} are the \pvs
of the hypotheses that should be tested. In general, if we say that we reject a
\pv, this means that we reject the corresponding hypothesis.
The following function generates data and will be used throughout the whole
document. The data generating function \textbf{must have} an entry \$procInput.
Everything in \$procInput will be used as input for the specified procedures.
In our situation we will generate a list with 2 entries: \$procInput
consists only of the generated \pvs, and \$groundTruth indicates which \pv
corresponds to a true or false hypothesis.
<<>>=
pValFromEquiCorrData <- function(sampleSize, rho, mu, pi0)
{
  nmbOfFalseHyp <- round(sampleSize * (1 - pi0))
  nmbOfTrueHyp <- sampleSize - nmbOfFalseHyp
  muVec <- c(rep(mu, nmbOfFalseHyp), rep(0, nmbOfTrueHyp))
  # equi-correlated normal observations with correlation rho
  Y <- sqrt(rho) * rnorm(1) + sqrt(1 - rho) * rnorm(sampleSize) + muVec
  return(list(
    procInput = list(pValues = 1 - pnorm(Y)),  # one-sided p-values
    groundTruth = muVec == 0                   # TRUE = true null hypothesis
  ))
}
@
\subsubsection{Illustration of the generated data}
We will now generate 1000 \pvs independently (\tw{rho} = 0);
\tw{sampleSize * (1-pi0)} = 700 of them correspond to false hypotheses and 300 to
true hypotheses.
<<>>=
set.seed(123)
data <- pValFromEquiCorrData(
  sampleSize = 1000,
  rho = 0,
  mu = 2,
  pi0 = 0.3)
@
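A quick sanity check of the composition of the generated data (recall that
\tw{groundTruth} is \tw{TRUE} for the true null hypotheses):
<<>>=
table(data$groundTruth)
@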
Let's visualize the different \pvs.
<<fig=TRUE>>=
local({
  pValues <- data$procInput$pValues
  groundTruth <- data$groundTruth
  plot(ecdf(pValues), do.points=FALSE, verticals=TRUE,
    main="ecdf's of generated p-values")
  lines(ecdf(pValues[groundTruth]), do.points=FALSE, verticals=TRUE, col=2)
  lines(ecdf(pValues[!groundTruth]), do.points=FALSE, verticals=TRUE, col=4)
  abline(0, 1, lty=2)
  legend("right", legend=c("all", "true", "false"), col=c(1, 2, 4), lty=1,
    title="p-values")
})
@
\subsection{A very simple example}
Let's start directly with the function call and then analyse what happened. Just for now we
implement a simple version of the data generating function from section \ref{DataGenFun} and
a simple version of \tw{BH}.
<<>>=
myGen <- function()
{
  pValFromEquiCorrData(sampleSize = 200, rho = 0, mu = 2, pi0 = 0.5)
}
BH.05 <- function(pValues)
{
  BH(pValues = pValues, alpha = 0.05, silent = TRUE)
}
@
<<>>=
set.seed(123)
sim <- simulation(replications = 10, list(funName = "myGen", fun = myGen),
  list(list(funName = "BH.simple", fun = BH.05)))
@
The following happened:
\begin{enumerate}
\item Call \tw{myGen}
\item Append generated data set to \tw{sim\$data}
\item Call \tw{BH.05} with the generated \pvs
\item Add to the results of \tw{BH.05} the parameter constellation used
by \tw{myGen} and \tw{BH.05} and also the position of the used
data in \tw{sim\$data}
\item Append this extended result of \tw{BH.05} to \tw{sim\$results}
\item Repeat this 9 more times
\end{enumerate}
Since \tw{replications = 10} in \tw{simulation} the object \tw{sim} consists of:
<<>>=
names(sim)
length(sim$data)
length(sim$results)
@
The structure of an object in \tw{sim\$data} coincides with the structure of the return
value of \tw{myGen}:
<<>>=
names(myGen())
names(sim$data[[1]])
@
Let's have a look at the additional information added by \tw{simulation} to the object returned
by \tw{BH.05}. First the entries of the pure return value of \tw{BH.05} and then of \tw{sim\$results}:
<<>>=
names(BH.05(runif(100)))
names(sim$results[[1]])
@
\tw{data.set.number} is the number of the used data set in \tw{sim\$data}. Since every generated
data set is used only once, we have
<<>>=
sapply(sim$results, function(x) x$data.set.number)
@
Thus it is possible
to reproduce any result directly. Let us reproduce the result of the 6th replication:
<<>>=
idx <- sim$results[[6]]$data.set.number
pValues <- sim$data[[idx]]$procInput$pValues
all(BH.05(pValues)$adjPValues == sim$results[[6]]$adjPValues)
@
Since our data generating function and procedure did not really have any parameters, there is
not much information in:
<<>>=
sim$results[[1]]$parameters
@
Note that \tw{BH.05} technically has the parameter \tw{pValues}, but since this parameter is
the "input channel" for the output of the data generating function, it is not really
regarded as a parameter that affects the behaviour of the procedure, like for example the
level $\alpha$ of the error measure.
\subsection{A simple example}
\label{a.simple.example}
Indeed, for the simple example above the \sT is not very helpful. Let us increase the
complexity of our simulation. As a first step, we want to increase the parameter \tw{rho}
of the data generating function gradually and investigate $E[V_n(BH)]$ and $E[V^2_n(BH)]$.
<<>>=
set.seed(123)
sim <- simulation(replications = 10,
  list(funName = "pVGen", fun = pValFromEquiCorrData,
    sampleSize = 100,
    rho = seq(0, 1, by = 0.2),
    mu = 2,
    pi0 = 0.5),
  list(list(funName = "BH", fun = BH,
    alpha = c(0.5, 0.25),
    silent = TRUE)))
@
The following happened:
\begin{enumerate}
\item Call \tw{pVGen} with \tw{rho = 0}
\item Append generated data set to \tw{sim\$data}
\item Call \tw{BH} with \tw{alpha = 0.5} and the generated \pvs
\item Add to the results the parameter constellation used
by \tw{pVGen} and \tw{BH} and also the position of the used
data in \tw{sim\$data}
\item Append this extended result to \tw{sim\$results}
\item Call \tw{BH} with \tw{alpha = 0.25} and the \textbf{same} \pvs as in 3.
\item Add to the results the parameter constellation used
by \tw{pVGen} and \tw{BH} and also the position of the used
data in \tw{sim\$data}
\item Append this extended result to \tw{sim\$results}
\item Repeat this 9 more times
\item Call \tw{pVGen} with \tw{rho = 0.2}
\item and so on
\end{enumerate}
So we have now a bunch of data sets and results.
<<>>=
length(sim$data)
length(sim$results)
@
60 data sets have been generated because we have \tw{rho = 0, 0.2, 0.4, 0.6, 0.8, 1} and for
each \tw{rho} we generated 10 data sets. We have applied \tw{BH} with \tw{alpha = 0.5} and
\tw{BH} with \tw{alpha = 0.25} to all 60 data sets, yielding 120 objects in \tw{results}.
We see from the \tw{data.set.number} that after generating a data set it is used twice in succession:
<<>>=
sapply(sim$results, function(x) x$data.set.number)
@
In order to determine, for the different parameter constellations, how many true hypotheses
have been rejected, we need a function that calculates this number. Such a statistic function
takes the generated data set and a result object as arguments.
<<>>=
NumberOfType1Error <- function(data, result) sum(data$groundTruth * result$rejected)
V2 <- function(data, result) NumberOfType1Error(data,result)^2
@
Let us calculate the number of rejected true hypotheses explicitly for the first result object.
<<>>=
idx <- sim$results[[1]]$data.set.number
data <- sim$data[[idx]]
NumberOfType1Error(data, sim$results[[1]])
@
This means for the following parameter constellation
<<>>=
sim$results[[1]]$parameters
@
we observed
<<echo=FALSE>>=
NumberOfType1Error(data, sim$results[[1]])
@
rejected true null hypotheses. In order to estimate $E[V_n(BH)]$ and $E[V^2_n(BH)]$
we would have to search \tw{sim\$results} for the other 9 results using the same
parameter constellation. To facilitate this task we provide the function \tw{gatherStatistics}.
<<>>=
result <- gatherStatistics(sim,
  list(V = NumberOfType1Error, V2 = V2),
  mean)
result
@
As you can see, every parameter constellation has its own row in \tw{\$statisticDF}.
By the way, we again see that \tw{pValues} is a parameter of \tw{BH}, but since it is contained in \tw{\$procInput} it is
not considered a "real parameter".\\
If we are interested in more than one summary of a statistic, we simply provide a list of "averaging functions":
<<>>=
result <- gatherStatistics(sim,
  list(V = NumberOfType1Error, V2 = V2),
  list(mean = mean, sd = function(vec) round(sd(vec), 1)))
result$statisticDF
@
Another possibility is to call \tw{gatherStatistics} without an "averaging function". Then every result
gets its own row in \tw{\$statisticDF}:
<<>>=
result <- gatherStatistics(sim,
  list(V = NumberOfType1Error, V2 = V2))
head(result$statisticDF)
tail(result$statisticDF)
@
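If preferred, such averages can also be computed directly from this raw
\tw{data.frame}, for example with \tw{aggregate}. A sketch, assuming
\tw{\$statisticDF} contains the columns \tw{V}, \tw{rho} and \tw{alpha}
(as the lattice formulas below suggest):
<<eval=FALSE>>=
## Cross-check of the mean of V per parameter constellation;
## assumes statisticDF has the columns V, rho and alpha.
aggregate(V ~ rho + alpha, data = result$statisticDF, FUN = mean)
@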
\subsection{Plotting a bit}
<<>>=
if(!is.element("sim.plot", ls()))
{
sim.plot <- simulation(replications = 1000,
list(funName = "pVGen",
fun = pValFromEquiCorrData,
sampleSize = 100,
rho = seq(0, 1, by = 0.2),
mu = 0,
pi0 = 1
),
list(
list(funName = "BH",
fun = function(pValues, alpha) BH(
pValues,
alpha,
silent=TRUE),
alpha = c(0.5, 0.75)
)
)
)
}
length(sim.plot$data)
length(sim.plot$results)
@
There are already different kinds of R functions that take \tw{data.frame}s and generate
histograms, boxplots and so on from them.
For some plot examples we will need the lattice package.
<<>>=
require(lattice)
@
First we calculate V for every \mts{} and then plot a histogram and a boxplot of V,
see Figure~\ref{fig:HistV1} (p.~\pageref{fig:HistV1}) and Figure~\ref{fig:BWV1} (p.~\pageref{fig:BWV1}).
<<>>=
result.all <- gatherStatistics(
sim.plot,
list(V = NumberOfType1Error)
)
@
<<label=figHistV1,include=FALSE, echo=FALSE>>=
print(
histogram(~V | rho,
data = subset(
result.all$statisticDF,
alpha=="0.5")))
@
\begin{figure}
\begin{center}
<<label=fig1,fig=TRUE,echo=TRUE>>=
<<figHistV1>>
@
\end{center}
\caption{Histogram of the number of rejected true hypotheses for \tw{alpha} = 0.5}
\label{fig:HistV1}
\end{figure}
<<label=figBWV1,include=FALSE, echo=FALSE>>=
print(
  bwplot(V ~ rho | alpha,
    data = result.all$statisticDF))
@
\begin{figure}
\begin{center}
<<label=fig2,fig=TRUE,echo=TRUE>>=
<<figBWV1>>
@
\end{center}
\caption{Boxplot of the number of rejected true hypotheses}
\label{fig:BWV1}
\end{figure}
Also after the "averaging process" we again get a \tw{data.frame} which can be used to generate
plots.
<<>>=
result <- gatherStatistics(
sim.plot,
list(V = NumberOfType1Error),
list(Ml = function(x)
mean(x) - 2 * sd(x)/sqrt(length(x)),
Mu = function(x)
mean(x) + 2 * sd(x)/sqrt(length(x)),
M = mean,
SD = sd)
)
subset(result$statisticDF, alpha=="0.5")[1:3,-1]
@
<<label=figXYPlot,include=FALSE, echo=FALSE>>=
print(
xyplot(
V.Ml + V.M + V.Mu ~ rho | alpha,
data = result$statisticDF,
type = "a",
auto.key=list(
space="right",
points=FALSE,
lines=TRUE)
)
)
@
\begin{figure}
\begin{center}
<<label=fig3,fig=TRUE,echo=TRUE>>=
<<figXYPlot>>
@
\end{center}
\caption{XYPlot of the mean number of rejected true hypotheses
with asymptotic 95\% confidence intervals (pointwise)}
\label{fig:XYPlotV1}
\end{figure}
\subsection{Memory considerations}
Up to now we have kept all information in memory. Sometimes it is useful not to discard any
generated information. For example,
if some objects in the results appear odd, they can be reproduced directly, and of course with all
information available we can investigate anything, like the distribution of a statistic we
did not consider before the simulation. The price for this liberty is in general an
extensive memory usage, which of course restricts the size of our simulation. On the other
hand, if we know exactly that we are only interested in $E[V_n]$, the mean number of
rejected true hypotheses, why should we keep the generated \pvs or the indices of the rejected
\pvs? Then it would be enough that an object in \tw{\$results} contains $V_n$, the number
of rejected true hypotheses. We will now show an example where only the needed information
is kept in memory. The following is basically the "simple example" from section \ref{a.simple.example}.
The main issue is to save memory, thus we will calculate $V_n$ right after applying the
procedure to the generated data set. If done so, we can also discard the generated data!
<<>>=
V.BH <- function(pValues, groundTruth, alpha)
{
  out <- BH(pValues, alpha, silent = TRUE)
  # number of rejected true null hypotheses
  V <- sum(out$rejected * groundTruth)
  return(list(V = V))
}
@
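As a quick standalone sketch (with made-up inputs, not part of the simulation),
\tw{V.BH} can be checked directly:
<<eval=FALSE>>=
## Sketch: 100 uniform p-values where, hypothetically, the first
## 50 hypotheses are the true nulls.
p <- runif(100)
gt <- c(rep(TRUE, 50), rep(FALSE, 50))
V.BH(p, gt, alpha = 0.25)
@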
As already mentioned, anything in \tw{\$procInput} will be used as input for the procedures.
Since our procedure now has a parameter \tw{groundTruth}, our data generating function has to
be adapted. We just move \tw{groundTruth} into \tw{\$procInput}.
<<>>=
pValFromEquiCorrData2 <- function(sampleSize, rho, mu, pi0)
{
  nmbOfFalseHyp <- round(sampleSize * (1 - pi0))
  nmbOfTrueHyp <- sampleSize - nmbOfFalseHyp
  muVec <- c(rep(mu, nmbOfFalseHyp), rep(0, nmbOfTrueHyp))
  Y <- sqrt(rho) * rnorm(1) + sqrt(1 - rho) * rnorm(sampleSize) + muVec
  return(list(
    procInput = list(pValues = 1 - pnorm(Y),
      groundTruth = muVec == 0)
  ))
}
@
Obviously, it is unnecessary
to store this generated data, because the number of Type 1 errors we are interested in
is calculated directly.
Setting \tw{discardProcInput = TRUE} in \tw{simulation} removes \tw{\$procInput} from the object
generated by the data generating function after \tw{\$procInput} has been used by all specified
procedures. In our case these will be the 2 procedures \tw{V.BH} with \tw{alpha = 0.5} and
\tw{alpha = 0.25}.
<<>>=
set.seed(123)
sim <- simulation(replications = 10,
  list(funName = "pVGen2", fun = pValFromEquiCorrData2,
    sampleSize = 100,
    rho = seq(0, 1, by = 0.2),
    mu = 2,
    pi0 = 0.5),
  list(list(funName = "V.BH", fun = V.BH, alpha = c(0.5, 0.25))),
  discardProcInput = TRUE)
result <- gatherStatistics(sim,
  list(V = function(data, results) results$V),
  mean)
result
@
Here is the corresponding table from section \ref{a.simple.example}:
<<>>=
set.seed(123)
sim.simple <- simulation(replications = 10,
  list(funName = "pVGen", fun = pValFromEquiCorrData,
    sampleSize = 100,
    rho = seq(0, 1, by = 0.2),
    mu = 2,
    pi0 = 0.5),
  list(list(funName = "BH", fun = BH, alpha = c(0.5, 0.25), silent = TRUE)))
result.simple <- gatherStatistics(sim.simple,
  list(V = NumberOfType1Error),
  mean)
print(result.simple)
@
The results are the same, but the memory usage differs considerably:
<<>>=
print(s1 <- object.size(sim))
print(s2 <- object.size(sim.simple))
unclass(s2/s1)
@
\section{Reproducing some results from BKY (2006)}
Now we will reproduce Figure 1 from the publication. For this we need to estimate the FDR.
This will be done by calculating the FDP for every object in \tw{\$results} and then taking
the empirical mean. In this simulation we know exactly what we want and thus retain only
the necessary information.
To be able to calculate the FDP right after applying the procedure, we need to know which
\pv belongs to a true or false hypothesis. Thus, like in the foregoing section,
\tw{\$groundTruth} will be an element of \tw{\$procInput}; we reuse the data generating
function \tw{pValFromEquiCorrData2} defined there.
We also need a function that calculates the FDP.
<<>>=
FDP <- function(pValues, groundTruth, proc = c("BH", "M-S-HLF"))
{
  proc <- match.arg(proc)  # select exactly one procedure
  if (proc == "BH") {
    out <- BH(pValues, alpha = 0.05, silent = TRUE)
  } else {
    out <- adaptiveSTS(pValues, alpha = 0.05, silent = TRUE)
  }
  R <- sum(out$rejected)                # total number of rejections
  if (R == 0) return(list(FDP = 0))
  V <- sum(out$rejected * groundTruth)  # rejected true null hypotheses
  return(list(FDP = V / R))
}
@
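Before starting the full simulation, here is a small sketch of \tw{FDP} applied
to a single data set generated by \tw{pValFromEquiCorrData2} (the chosen
parameter values are just for illustration):
<<eval=FALSE>>=
## Sketch: FDP of BH on one generated data set
set.seed(1)
d <- pValFromEquiCorrData2(sampleSize = 64, rho = 0, mu = 5, pi0 = 0.75)
FDP(d$procInput$pValues, d$procInput$groundTruth, proc = "BH")
@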
Now, the (partial) reproduction of the results can start!
<<>>=
set.seed(123)
date()
sim <- simulation(replications = 10000,
  list(funName = "pVGen2", fun = pValFromEquiCorrData2,
    sampleSize = c(16, 32, 64, 128, 256, 512),
    rho = c(0, 0.1, 0.5),
    mu = 5,
    pi0 = c(0.25, 0.75)),
  list(list(funName = "FDP", fun = FDP, proc = c("BH", "M-S-HLF"))),
  discardProcInput = TRUE)
date()
result <- gatherStatistics(sim,
  list(FDP = function(data, results) results$FDP),
  mean)
result
@
<<fig=TRUE>>=
print(xyplot(FDP.mean ~ sampleSize | pi0 * rho,
  data = result$statisticDF, group = proc,
  type = "a",
  auto.key = list(space = "right", points = FALSE, lines = TRUE)))
@
\end{document}