File: simToolManual.actuallyRnw

\documentclass[a4paper]{article}
\usepackage[OT1]{fontenc}
\usepackage{Sweave}
\usepackage{amssymb}
%\usepackage{hyperref}

\newcommand{\sT}{simTool}
\newcommand{\pv}{$p$-value }
\newcommand{\pvs}{$p$-values }
\newcommand{\mts}{MutossSimObject}
\newcommand{\tw}[1]{\texttt{#1}}

\begin{document}

% \VignetteIndexEntry{MuToss Simulation Tool Guide}

\title{\sT}
\author{Marsel Scheer}



\maketitle

\tableofcontents

<<echo=FALSE>>=
require(mutoss)
@

\section{Note}
This Sweave file is part of the $\mu$Toss package. Since some parts of
this document are computationally intensive, the extension of the Sweave file
is ".actuallyRnw".


\section{Introduction}
\label{introduction}
This document gives an introduction to the use of \sT.
We start with a very simple application, then raise the
degree of complexity in a few steps, and in the end reproduce
some of the results from Benjamini, Krieger, Yekutieli (2006).
Basically, a simulation needs the following ingredients:
\begin{enumerate}
\item a function that generates the data (for example \pvs)
\item some procedures that evaluate the generated data (for 
	example the Bonferroni correction)
\item statistics we want to calculate, for example the false 
discovery proportion (FDP).
\end{enumerate}
If we speak, for example, of $1000$ replications, we mean that these 3 steps
were repeated $1000$ times. That is, after $1000$ replications, $1000$ data sets
have been generated, and every procedure has been applied to each of these $1000$ data sets.
If we have specified, say, $3$ procedures, then each of them produces an output
for every data set, giving $3000$ "output objects". Every specified
statistic is then applied to each of these $3000$ "output objects".
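The bookkeeping just described can be sketched as a loop. This is a schematic illustration only, not the actual \sT{} implementation; \tw{generateData}, \tw{procedures} and \tw{statistics} are hypothetical placeholders for the three ingredients listed above.
<<eval=FALSE>>=
## schematic sketch -- NOT the actual simTool code
for (r in 1:1000) {                # 1000 replications
	data <- generateData()         # step 1: one data set per replication
	for (proc in procedures) {     # step 2: e.g. 3 procedures ...
		out <- proc(data)          # ... yield 1000 * 3 = 3000 output objects
		for (stat in statistics)   # step 3: every statistic is applied
			stat(data, out)        #         to each output object
	}
}
@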

\subsection{Motivation}
\subsubsection{Example 1}
Suppose we want to compare some characteristics of the two procedures \tw{BH} and \tw{holm}. 
Denote by $V_n(BH)$ and $V_n(holm)$ the number of true hypotheses rejected. For example,
we may be interested in the distribution of $V_n(BH)$ and $V_n(holm)$ and also in $E[V_n(BH)]$
and $E[V_n(holm)]$. Let us have a look at the arguments of these procedures:
<<>>=
args(BH)
args(holm)
@
Both have the argument \tw{alpha}, which stands for the level 
of the corresponding error measure. 
\tw{BH} controls the false discovery rate (FDR) and \tw{holm} 
controls the family-wise error rate (FWER). One question may be: how 
must \tw{alpha} be set in \tw{holm} such that for fixed \tw{alpha} = 0.1 in 
\tw{BH} we have $E[V_n(BH)] \approx E[V_n(holm)]$?
And perhaps we are also interested in how a dependence structure affects the 
distribution of $V_n(BH)$ and $V_n(holm)$.
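A crude Monte-Carlo sketch of the first question: under the global null hypothesis (all hypotheses true, uniform \pvs) every rejection is a type 1 error, so $V_n$ equals the number of rejections. Assuming that \tw{holm}, like \tw{BH}, accepts a \tw{silent} argument, one could scan a grid of levels (a sketch, not part of \sT{}):
<<eval=FALSE>>=
## sketch: estimate E[V_n] by Monte Carlo under the global null and pick
## the alpha for holm whose E[V_n] is closest to that of BH at alpha = 0.1
EV <- function(proc, alpha, m = 100, reps = 1000)
	mean(replicate(reps, sum(proc(runif(m), alpha, silent = TRUE)$rejected)))
target <- EV(BH, 0.1)
alphaGrid <- seq(0.05, 0.95, by = 0.05)
EV.holm <- sapply(alphaGrid, function(a) EV(holm, a))
alphaGrid[which.min(abs(EV.holm - target))]
@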

\subsubsection{Example 2}
Suppose you have developed a new procedure controlling some error measure and you
want to compare this new procedure with some other, already established procedures. 
Then a simulation study will consist of creating data and gathering statistics for 
different sample sizes, dependence structures and parameter constellations.

Basically, it is always the same story. Data is generated by a specific function, 
perhaps depending on some parameters, and we want to apply different procedures,
again depending on some parameters, to this generated data. Nearly everyone will say:
"don't bother me with the details of programming, just do the simulation 
with the above information and give me a nice \tw{data.frame} which I can analyze."
And this is exactly the purpose of \sT.


\subsection{Data generating function}
\label{DataGenFun}
For now we only want to use the procedures \texttt{BH} and \texttt{holm}. Again, let us 
have a look at the arguments of these procedures:
<<>>=
args(BH)
args(holm)
@
Not much has to be specified. The parameter \texttt{alpha} is the level at which 
the corresponding error rate should be controlled, and \tw{pValues} are the \pvs 
of the hypotheses that should be tested. In general, if we say that we reject a
\pv this means that we reject the corresponding hypothesis.
The following function generates data and will be used throughout the whole
document. The data generating function \textbf{must have} an entry \$procInput.
Everything in \$procInput will be used as input for the specified procedures.
In our situation we will generate a list with 2 entries: \$procInput
consists only of the generated \pvs, and \$groundTruth indicates which \pv 
corresponds to a true or false hypothesis.
<<>>=
pValFromEquiCorrData <- function(sampleSize, rho, mu, pi0)
{		
	nmbOfFalseHyp <- round(sampleSize * (1-pi0))
	nmbOfTrueHyp <- sampleSize - nmbOfFalseHyp
	muVec <- c(rep(mu, nmbOfFalseHyp), rep(0, nmbOfTrueHyp))
	Y <- sqrt(rho) * rnorm(1) + sqrt(1-rho) * rnorm(sampleSize) + muVec
	return(list(
		procInput=list(pValues = 1 - pnorm(Y)),
		groundTruth = muVec == 0
		)
	)
}
@


\subsubsection{Illustration of the generated data}
We will now generate 1000 \pvs independently (\tw{rho} = 0). 
\tw{sampleSize * (1-pi0)} = 700 of them correspond to false hypotheses and 300 to 
true hypotheses.
<<>>=
set.seed(123)
data <- pValFromEquiCorrData(
		sampleSize = 1000, 
		rho = 0, 
		mu = 2, 
		pi0 = 0.3)
@
Let us visualize the different \pvs. 
<<fig=TRUE>>=
local({
pValues = data$procInput$pValues
groundTruth = data$groundTruth
plot(ecdf(pValues), do.points=FALSE, verticals=TRUE, main="ecdf's of generated p-values")
lines(ecdf(pValues[groundTruth]), do.points=FALSE, verticals=TRUE, col=2)
lines(ecdf(pValues[!groundTruth]), do.points=FALSE, verticals=TRUE, col=4)
abline(0,1,lty=2)
legend("right", legend=c("all", "true", "false"), col=c(1,2,4), lty=1, title="p-values")
})
@

\subsection{A very simple example}
Let us start directly with the function call and then analyse what happened. Just for now we
implement a simplified version of the data generating function from section \ref{DataGenFun} and
a simplified version of \tw{BH}.
<<>>=
myGen <- function()
{
	pValFromEquiCorrData(sampleSize = 200, rho = 0, mu = 2, pi0=0.5)
}
BH.05 = function(pValues)
{
	BH(pValues=pValues, alpha = 0.05, silent=TRUE)
}
@

<<>>=
set.seed(123)
sim <- simulation(replications = 10, list(funName="myGen", fun=myGen), 
		list(list(funName="BH.simple", fun=BH.05)))
@
The following happened:
\begin{enumerate}
	\item Call \tw{myGen}
	\item Append generated data set to \tw{sim\$data}
	\item Call \tw{BH.05} with the generated \pvs
	\item Add to the results of \tw{BH.05} the parameter constellation used
		by \tw{myGen} and \tw{BH.05} and also the position of the used
		data in \tw{sim\$data}
	\item Append these extended results of \tw{BH.05} to \tw{sim\$results}
	\item repeat this 9 more times    
\end{enumerate}
Since \tw{replications = 10} in \tw{simulation} the object \tw{sim} consists of:
<<>>=
names(sim)
length(sim$data)
length(sim$results)
@
The structure of an object in \tw{sim\$data} coincides with the structure of the return
value of \tw{myGen}:
<<>>=
names(myGen())
names(sim$data[[1]])
@
Let us have a look at the additional information added by \tw{simulation} to the object returned
by \tw{BH.05}. First the entries of the pure return value of \tw{BH.05}, then those of \tw{sim\$results}:
<<>>=
names(BH.05(runif(100)))
names(sim$results[[1]])
@
\tw{data.set.number} is the index of the data set used in \tw{sim\$data}. Since every generated
data set is used only once, we have
<<>>=
sapply(sim$results, function(x) x$data.set.number)
@
 
Thus it is possible
to reproduce any result directly. Let us reproduce the results of the 6th replication:
<<>>=
idx <- sim$results[[6]]$data.set.number
pValues <- sim$data[[idx]]$procInput$pValues
all(BH.05(pValues)$adjPValues == sim$results[[6]]$adjPValues)
@
Since our data generating function and procedure did not really have any parameters, there is
not much information in:
<<>>=
sim$results[[1]]$parameters
@
Note that \tw{BH.05} technically has the parameter \tw{pValues}, but since this parameter is 
the "input channel" for the output of the data generating function, it is not 
regarded as a parameter that affects the behaviour of the procedure, like for example the 
level $\alpha$ of the error measure.  

\subsection{A simple example}
\label{a.simple.example}
Indeed, for the simple example above the simTool is not very helpful. Let us increase the 
complexity of our simulation. As a first step, we want to increase the parameter \tw{rho} 
of the data generating function gradually and investigate $E V_n(BH)$ and $E V^2_n(BH)$.
<<>>=
set.seed(123)
sim <- simulation(replications = 10, 
		list(funName="pVGen", fun=pValFromEquiCorrData, 
				sampleSize = 100, 
				rho = seq(0,1,by=0.2),
				mu = 2,
				pi0 = 0.5), 
		list(list(funName="BH",	fun=BH, 
			alpha = c(0.5, 0.25), 
			silent=TRUE)))	
@
The following happened:
\begin{enumerate}
	\item Call \tw{pVGen} with \tw{rho = 0}
	\item Append generated data set to \tw{sim\$data}
	\item Call \tw{BH} with \tw{alpha = 0.5} and the generated \pvs
	\item Add to the results the parameter constellation used
		by \tw{pVGen} and \tw{BH} and also the position of the used
		data in \tw{sim\$data}
	\item Append these extended results to \tw{sim\$results}
	\item Call \tw{BH} with \tw{alpha = 0.25} and the \textbf{same} \pvs as in 3.
	\item Add to the results the parameter constellation used
		by \tw{pVGen} and \tw{BH} and also the position of the used
		data in \tw{sim\$data}		
	\item Append these extended results to \tw{sim\$results}
	\item repeat this 9 more times
	\item Call \tw{pVGen} with \tw{rho = 0.2}
	\item and so on   
\end{enumerate}
So we now have a bunch of data sets and results.
<<>>=
length(sim$data)
length(sim$results)
@
60 data sets have been generated because we have \tw{rho = 0, 0.2, 0.4, 0.6, 0.8, 1} and for
each \tw{rho} we generated 10 data sets. We have applied \tw{BH} with \tw{alpha = 0.5} as well as
with \tw{alpha = 0.25} to all 60 data sets, yielding 120 objects in \tw{results}.
We see from the \tw{data.set.number} that after a data set is generated it is used twice in a row:
<<>>=
sapply(sim$results, function(x) x$data.set.number)
@

In order to gather how many true hypotheses have been rejected under the different
parameter constellations, we need a function that calculates this number.
<<>>=
NumberOfType1Error <- function(data, result) sum(data$groundTruth * result$rejected)
V2 <- function(data, result) NumberOfType1Error(data,result)^2
@
Let us calculate explicitly, for the first result object, the number of rejected true hypotheses.
<<>>=
idx = sim$results[[1]]$data.set.number
data = sim$data[[idx]]
NumberOfType1Error(data, sim$results[[1]])
@
This means for the following parameter constellation
<<>>=
sim$results[[1]]$parameters
@
we once observed 
<<echo=FALSE>>=
NumberOfType1Error(data, sim$results[[1]])	
@
rejected true null hypotheses. In order to estimate $E V_n(BH)$ and $E V^2_n(BH)$ 
we have to search \tw{sim\$results} for the other 9 results using the same 
parameter constellation. To facilitate this task we provide a function.
<<>>=
result <- gatherStatistics(sim, 
		list(V = NumberOfType1Error, V2 = V2),
		mean)
result
@
As you can see, every parameter constellation has its own row in \tw{\$statisticDF}. 
By the way, we again see that \tw{pValues} is a parameter of \tw{BH}, but since it is contained in \tw{\$procInput} it is
not considered a "real parameter".\\
If we are interested in more than one summary statistic, we simply provide a list of "average functions":
<<>>=
result <- gatherStatistics(sim, 
		list(V = NumberOfType1Error, V2 = V2), 
		list(mean = mean, sd = function(vec) round(sd(vec), 1)))
result$statisticDF
@
Another possibility is to call \tw{gatherStatistics} without any "average function". Then every result
gets its own row in \tw{\$statisticDF}: 
<<>>=
result <- gatherStatistics(sim, 
		list(V = NumberOfType1Error, V2 = V2))
head(result$statisticDF)
tail(result$statisticDF)
@


\subsection{Plotting a bit}
<<>>=
if(!is.element("sim.plot", ls()))
{
sim.plot <- simulation(replications = 1000,
	list(funName = "pVGen",
		fun = pValFromEquiCorrData,
		sampleSize = 100,
		rho = seq(0, 1, by = 0.2),
		mu = 0,
		pi0 = 1
	),
	list(
		list(funName = "BH",
			fun = function(pValues, alpha) BH(
				pValues, 
				alpha, 
				silent=TRUE),
			alpha = c(0.5, 0.75)
		)	
	)
)
}
length(sim.plot$data)
length(sim.plot$results)
@
There are already different kinds of R functions that take data.frames and generate 
histograms, boxplots and so on from them.
For some plot examples we will need the lattice package.
<<>>=
require(lattice)
@
First we calculate V for every object in \tw{\$results} and then plot a histogram and a boxplot of V, 
see Figure~\ref{fig:HistV1} (p.~\pageref{fig:HistV1}) and Figure~\ref{fig:BWV1} (p.~\pageref{fig:BWV1}).
<<>>=
result.all <- gatherStatistics(
		sim.plot,
		list(V = NumberOfType1Error)
)		
@
<<label=figHistV1,include=FALSE, echo=FALSE>>=
print(
	histogram(~V | rho, 
		data = subset(
			result.all$statisticDF, 
			alpha=="0.5")))
@
\begin{figure}
\begin{center}
<<label=fig1,fig=TRUE,echo=TRUE>>=
<<figHistV1>>
@
\end{center}
\caption{Histogram of the number of rejected true hypotheses for \tw{alpha} = 0.5}
\label{fig:HistV1}
\end{figure}
<<label=figBWV1,include=FALSE, echo=FALSE>>=
print(
	bwplot(V ~ rho | alpha, 
		data = result.all$statisticDF))
@
\begin{figure}
\begin{center}
<<label=fig2,fig=TRUE,echo=TRUE>>=
<<figBWV1>>
@
\end{center}
\caption{Boxplot of the number of rejected true hypotheses}
\label{fig:BWV1}
\end{figure}
Also after the "average process" we again get a data.frame which can be used to generate 
plots.
<<>>=
result <- gatherStatistics(
	sim.plot,
	list(V = NumberOfType1Error),
	list(Ml = function(x) 
			mean(x) - 2 * sd(x)/sqrt(length(x)),
		Mu = function(x) 
			mean(x) + 2 * sd(x)/sqrt(length(x)),
		M = mean, 
		SD = sd)
)
subset(result$statisticDF, alpha=="0.5")[1:3,-1]
@

<<label=figXYPlot,include=FALSE, echo=FALSE>>=
print(
	xyplot(
		V.Ml + V.M + V.Mu ~ rho | alpha, 
		data = result$statisticDF,
		type = "a",
		auto.key=list(
			space="right", 
			points=FALSE, 
			lines=TRUE)
	)
)
@
\begin{figure}
\begin{center}
<<label=fig3,fig=TRUE,echo=TRUE>>=
<<figXYPlot>>
@
\end{center}
\caption{XY plot of the mean number of rejected true hypotheses 
with asymptotic 95\% confidence intervals (pointwise)}
\label{fig:XYPlotV1}
\end{figure}


\subsection{Memory considerations}
Up to now we have kept all information in memory. Sometimes it is useful not to discard any
generated information. For example,
if some objects in the results appear odd, they can be reproduced directly, and with all
information available we can investigate anything, like a distribution or a statistic we
did not consider before the simulation. The price for this liberty is, in general, an
extensive memory usage, which of course restricts the size of our simulation. On the other
hand, if we know exactly that we are only interested in $E V_n$, the mean number of
rejected true hypotheses, why should we keep the generated \pvs or the indices of the rejected
\pvs? Thus it would be enough that an object in \tw{\$results} contains $V_n$, the number
of rejected true hypotheses. We will now show an example where only the information needed
is kept in memory. The following is basically the "simple example" from section \ref{a.simple.example}.
The main issue is to save memory, thus we calculate $V_n$ right after applying the
procedure to the generated data set. If done so, we can also discard the generated data!
<<>>=
V.BH <- function(pValues, groundTruth, alpha)
{
	out <- BH(pValues, alpha, silent=TRUE)
	V <- sum(out$rejected * groundTruth)
	return(list(V = V))
}
@
As already mentioned, anything in \tw{\$procInput} will be used as input for the procedures.
Since our procedure now has a parameter \tw{groundTruth}, our data generating function has to
be adapted. We just move \tw{groundTruth} into \tw{\$procInput}. 
<<>>=
pValFromEquiCorrData2 <- function(sampleSize, rho, mu, pi0)
{		
	nmbOfFalseHyp <- round(sampleSize * (1-pi0))
	nmbOfTrueHyp <- sampleSize - nmbOfFalseHyp
	muVec <- c(rep(mu, nmbOfFalseHyp), rep(0, nmbOfTrueHyp))
	Y <- sqrt(rho) * rnorm(1) + sqrt(1-rho) * rnorm(sampleSize) + muVec
	return(list(
		procInput=list(pValues = 1 - pnorm(Y),
				groundTruth = muVec == 0)
		)
	)
}
@
Obviously, it is unnecessary
to store this generated data, because the number of type 1 errors we are interested in 
will be calculated directly.
Setting \tw{discardProcInput = TRUE} in \tw{simulation} removes \tw{\$procInput} from the object
generated by the data generating function after \tw{\$procInput} has been used by all specified
procedures. In our case these will be the 2 procedures \tw{V.BH} with \tw{alpha = 0.5} and
\tw{alpha = 0.25}. 
<<>>=
set.seed(123)
sim <- simulation(replications = 10, 
		list(funName="pVGen2", fun=pValFromEquiCorrData2, 
				sampleSize = 100, 
				rho = seq(0,1,by=0.2),
				mu = 2,
				pi0 = 0.5), 
		list(list(funName="V.BH", fun=V.BH, alpha = c(0.5, 0.25))),
		discardProcInput=TRUE)
result <- gatherStatistics(sim, 
		list(V = function(data, results) results$V),
		mean)
result	
@
Here is the corresponding table from section \ref{a.simple.example}: 
<<>>=
set.seed(123)
sim.simple <- simulation(replications = 10, 
		list(funName="pVGen", fun=pValFromEquiCorrData, 
				sampleSize = 100, 
				rho = seq(0,1,by=0.2),
				mu = 2,
				pi0 = 0.5), 
		list(list(funName="BH", fun=BH, alpha = c(0.5, 0.25), silent=TRUE)))
result.simple <- gatherStatistics(sim.simple, 
		list(V = NumberOfType1Error),
		mean)
print(result.simple)	
@
The same results, but the memory usage differs a lot:
<<>>=
print(s1 <- object.size(sim))
print(s2 <- object.size(sim.simple))	
unclass(s2/s1)		
@


\section{Reproducing some results from BKY (2006)}
Now we will reproduce Figure 1 from the publication. For this we need to estimate the FDR. 
This will be done by calculating the FDP for every object in \tw{\$results} and then taking
the empirical mean. In this simulation we know exactly what we want and thus retain only
the necessary information.
To be able to calculate the FDP right after applying the procedure we need to know which 
\pv belongs to a true or false hypothesis. Thus, as in the foregoing section, 
\tw{\$groundTruth} will be an element of \tw{\$procInput}.
<<>>=
pValFromEquiCorrData2 <- function(sampleSize, rho, mu, pi0)
{		
	nmbOfFalseHyp <- round(sampleSize * (1-pi0))
	nmbOfTrueHyp <- sampleSize - nmbOfFalseHyp
	muVec <- c(rep(mu, nmbOfFalseHyp), rep(0, nmbOfTrueHyp))
	Y <- sqrt(rho) * rnorm(1) + sqrt(1-rho) * rnorm(sampleSize) + muVec
	return(list(
		procInput=list(pValues = 1 - pnorm(Y),
				groundTruth = muVec == 0)
		)
	)
}
@
We also need a function that calculates the FDP.
<<>>=
FDP <- function(pValues, groundTruth, proc=c("BH", "M-S-HLF"))
{
	proc <- match.arg(proc)  # collapse the default vector to a single choice
	if( proc == "BH" )
		out <- BH(pValues, alpha = 0.05, silent = TRUE)
	else
		out <- adaptiveSTS(pValues, alpha = 0.05, silent = TRUE)
	
	R <- sum(out$rejected)
	if ( R == 0 ) return(list(FDP = 0))
	
	V <- sum(out$rejected * groundTruth)
	return(list(FDP = V/R))		
}
@
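A quick sanity check of \tw{FDP} (a sketch): under the global null every rejection concerns a true hypothesis, so the returned FDP must be 1 whenever anything is rejected and 0 otherwise.
<<eval=FALSE>>=
## sanity check: all hypotheses true, hence FDP is either 0 or 1
set.seed(1)
FDP(runif(100), groundTruth = rep(TRUE, 100), proc = "BH")
@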
Now, the (partial) reproduction of the results can start!
<<>>=
set.seed(123)
date()
sim <- simulation(replications = 10000, 
		list(funName="pVGen2", fun=pValFromEquiCorrData2, 
				sampleSize = c(16, 32, 64, 128, 256, 512),
				rho = c(0,0.1,0.5),
				mu = 5,
				pi0 = c(0.25, 0.75)), 
		list(list(funName="FDP", fun=FDP, proc=c("BH", "M-S-HLF"))),
		discardProcInput=TRUE)
date()
result <- gatherStatistics(sim, 
		list(FDP = function(data, results) results$FDP),
		mean)
result
@
<<fig=TRUE>>=
print(xyplot(FDP.mean ~ sampleSize | pi0 * rho, data = result$statisticDF, group=proc,
		type = "a",
		auto.key=list(space="right", points=FALSE, lines=TRUE)
	)
)		
@

\end{document}