\documentclass[a4paper]{article}
\usepackage[OT1]{fontenc}
\usepackage{Sweave}
\usepackage{amssymb}
%\usepackage{hyperref}
\newcommand{\sT}{simTool}
\newcommand{\pv}{$p$-value }
\newcommand{\pvs}{$p$-values }
\newcommand{\mts}{MutossSimObject}
\newcommand{\tw}[1]{\texttt{#1}}
\begin{document}
% \VignetteIndexEntry{MuToss Simulation Tool Guide}
\title{\sT}
\author{Marsel Scheer}
\maketitle
\tableofcontents
<<echo=FALSE>>=
require(mutoss)
@
\section{Note}
This Sweave file is part of the $\mu$Toss package. Because some parts of
this document are computationally intensive, the extension of the Sweave file
is ".actuallyRnw".
\section{Introduction}
\label{introduction}
This document gives an introduction to the use of \sT.
We start with a very simple application, then raise the
degree of complexity in a few steps, and in the end reproduce
some of the results from Benjamini, Krieger and Yekutieli (2006).
Basically, a simulation needs the following ingredients:
\begin{enumerate}
\item a function that generates the data (for example \pvs)
\item some procedures that evaluate the generated data (for
example the Bonferroni correction)
\item statistics we want to calculate, for example the false
discovery proportion (FDP).
\end{enumerate}
If we speak, for example, of $1000$ replications, we mean that these 3 steps
were repeated $1000$ times. That means, after $1000$ replications, $1000$ data
sets have been generated and every procedure has been applied to each of these
$1000$ data sets. If we have specified, for example, $3$ procedures, then each
of these procedures gives us an output, that is, $3000$ "output objects". Every
specified statistic is then applied to these $3000$ "output objects".
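The following sketch illustrates this scheme (plain R pseudo-code with the
hypothetical names \tw{dataGenFun}, \tw{procedures} and \tw{statistics}; the
actual \tw{simulation} function additionally records parameter constellations,
and the statistics are in fact gathered afterwards by \tw{gatherStatistics}):
<<eval=FALSE>>=
## Conceptual sketch of the replication loop; the names dataGenFun,
## procedures and statistics are hypothetical placeholders, not the
## actual implementation of simulation().
for (r in 1:replications) {
  data <- dataGenFun()                   # 1. generate a data set
  for (proc in procedures) {
    out <- proc(data$procInput$pValues)  # 2. apply a procedure
    for (stat in statistics) {
      stat(data, out)                    # 3. calculate a statistic
    }
  }
}
@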
\subsection{Motivation}
\subsubsection{Example 1}
Suppose we want to compare some characteristics of the two procedures \tw{BH} and \tw{holm}.
Denote by $V_n(BH)$ and $V_n(holm)$ the number of rejected true hypotheses. For example,
we are interested in the distribution of $V_n(BH)$ and $V_n(holm)$ and also in $E[V_n(BH)]$
and $E[V_n(holm)]$. Let's have a look at the arguments of these procedures:
<<>>=
args(BH)
args(holm)
@
Both have the argument \tw{alpha}, which stands for the level
of the corresponding error measure.
\tw{BH} controls the false discovery rate (FDR) and \tw{holm}
controls the family-wise error rate (FWER). One question may be: how
must \tw{alpha} be set in \tw{holm} such that for fixed \tw{alpha} = 0.1 in
\tw{BH} we have $E[V_n(BH)] \approx E[V_n(holm)]$?
And perhaps we are also interested in how a dependence structure affects the
distribution of $V_n(BH)$ and $V_n(holm)$.
\subsubsection{Example 2}
Suppose you have developed a new procedure controlling some error measure and you
want to compare this new procedure with some other already established procedures.
Then a simulation study will consist of creating data and gathering statistics for
different sample sizes, dependence structures and parameter constellations.
\par
Basically it is always the same story: data is generated by a specific function,
which perhaps depends on some parameters, and we want to apply different procedures,
again depending on some parameters, to this generated data. Nearly everyone will say,
"don't bother me with the details of programming, just do the simulation
with the above information and give me a nice \tw{data.frame} which I can analyze."
And this is exactly the purpose of the \sT.
\subsection{Data generating function}
\label{DataGenFun}
For now we only want to use the procedures \texttt{BH} and \texttt{holm}. Again, let's
have a look at the arguments of these procedures:
<<>>=
args(BH)
args(holm)
@
Not much has to be specified. The parameter \texttt{alpha} is the level at which
the corresponding error rate should be controlled, and \tw{pValues} are the \pvs
of the hypotheses that should be tested. In general, if we say that we reject a
\pv, this means that we reject the corresponding hypothesis.
The following function generates data and will be used throughout the whole
document. The data generating function \textbf{must have} an entry \$procInput.
Everything in \$procInput will be used as input for the specified procedures.
In our situation we will generate a list with 2 entries: \$procInput
consists only of the generated \pvs, and \$groundTruth indicates which \pv
corresponds to a true or false hypothesis.
<<>>=
pValFromEquiCorrData <- function(sampleSize, rho, mu, pi0)
{
  nmbOfFalseHyp <- round(sampleSize * (1 - pi0))
  nmbOfTrueHyp <- sampleSize - nmbOfFalseHyp
  muVec <- c(rep(mu, nmbOfFalseHyp), rep(0, nmbOfTrueHyp))
  # equi-correlated normal observations with correlation rho
  Y <- sqrt(rho) * rnorm(1) + sqrt(1 - rho) * rnorm(sampleSize) + muVec
  return(list(
    procInput = list(pValues = 1 - pnorm(Y)),  # one-sided p-values
    groundTruth = muVec == 0                   # TRUE = true null hypothesis
  ))
}
@
\subsubsection{Illustration of the generated data}
We will now generate 1000 \pvs independently (\tw{rho} = 0);
\tw{sampleSize * (1-pi0)} = 700 of them correspond to false hypotheses and 300 to
true hypotheses.
<<>>=
set.seed(123)
data <- pValFromEquiCorrData(
  sampleSize = 1000,
  rho = 0,
  mu = 2,
  pi0 = 0.3)
@
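A quick sanity check of the composition of the generated data (recall that
\tw{groundTruth} is \tw{TRUE} for the true null hypotheses):
<<>>=
table(data$groundTruth)
@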
Let's visualize the different \pvs.
<<fig=TRUE>>=
local({
  pValues <- data$procInput$pValues
  groundTruth <- data$groundTruth
  plot(ecdf(pValues), do.points=FALSE, verticals=TRUE,
    main="ecdf's of generated p-values")
  lines(ecdf(pValues[groundTruth]), do.points=FALSE, verticals=TRUE, col=2)
  lines(ecdf(pValues[!groundTruth]), do.points=FALSE, verticals=TRUE, col=4)
  abline(0, 1, lty=2)
  legend("right", legend=c("all", "true", "false"), col=c(1, 2, 4), lty=1,
    title="p-values")
})
@
\subsection{A very simple example}
Let's start directly with the function call and then analyse what happened. Just for now we
implement a simple version of the data generating function from section \ref{DataGenFun} and
a simple version of \tw{BH}.
<<>>=
myGen <- function()
{
  pValFromEquiCorrData(sampleSize = 200, rho = 0, mu = 2, pi0 = 0.5)
}
BH.05 <- function(pValues)
{
  BH(pValues = pValues, alpha = 0.05, silent = TRUE)
}
@
<<>>=
set.seed(123)
sim <- simulation(replications = 10, list(funName = "myGen", fun = myGen),
  list(list(funName = "BH.simple", fun = BH.05)))
@
The following happened:
\begin{enumerate}
\item Call \tw{myGen}
\item Append generated data set to \tw{sim\$data}
\item Call \tw{BH.05} with the generated \pvs
\item Add to the results of \tw{BH.05} the parameter constellation used
by \tw{myGen} and \tw{BH.05} and also the position of the used
data in \tw{sim\$data}
\item Append this extended result of \tw{BH.05} to \tw{sim\$results}
\item Repeat this 9 more times
\end{enumerate}
Since \tw{replications = 10} in \tw{simulation} the object \tw{sim} consists of:
<<>>=
names(sim)
length(sim$data)
length(sim$results)
@
The structure of an object in \tw{sim\$data} coincides with the structure of the return
value of \tw{myGen}:
<<>>=
names(myGen())
names(sim$data[[1]])
@
Let's have a look at the additional information added by \tw{simulation} to the object returned
by \tw{BH.05}. First the entries of the pure return value of \tw{BH.05} and then of \tw{sim\$results}:
<<>>=
names(BH.05(runif(100)))
names(sim$results[[1]])
@
\tw{data.set.number} is the number of the used data set in \tw{sim\$data}. Since every generated
data set is used only once, we have
<<>>=
sapply(sim$results, function(x) x$data.set.number)
@
Thus it is possible
to reproduce any result directly. Let us reproduce the result of the 6th replication:
<<>>=
idx <- sim$results[[6]]$data.set.number
pValues <- sim$data[[idx]]$procInput$pValues
all(BH.05(pValues)$adjPValues == sim$results[[6]]$adjPValues)
@
Since our data generating function and procedure did not really have any parameters, there is
not much information in:
<<>>=
sim$results[[1]]$parameters
@
Note that \tw{BH.05} technically has the parameter \tw{pValues}, but since this parameter is
the "input channel" for the output of the data generating function, it is not really
regarded as a parameter that affects the behaviour of the procedure, like for example the
level $\alpha$ of the error measure.
\subsection{A simple example}
\label{a.simple.example}
Indeed, for the simple example above the \sT is not very helpful. Let us increase the
complexity of our simulation. As a first step, we want to increase the parameter \tw{rho}
of the data generating function gradually and investigate $E[V_n(BH)]$ and $E[V^2_n(BH)]$.
<<>>=
set.seed(123)
sim <- simulation(replications = 10,
  list(funName = "pVGen", fun = pValFromEquiCorrData,
    sampleSize = 100,
    rho = seq(0, 1, by = 0.2),
    mu = 2,
    pi0 = 0.5),
  list(list(funName = "BH", fun = BH,
    alpha = c(0.5, 0.25),
    silent = TRUE)))
@
The following happened:
\begin{enumerate}
\item Call \tw{pVGen} with \tw{rho = 0}
\item Append generated data set to \tw{sim\$data}
\item Call \tw{BH} with \tw{alpha = 0.5} and the generated \pvs
\item Add to the results the parameter constellation used
by \tw{pVGen} and \tw{BH} and also the position of the used
data in \tw{sim\$data}
\item Append this extended result to \tw{sim\$results}
\item Call \tw{BH} with \tw{alpha = 0.25} and the \textbf{same} \pvs as in 3.
\item Add to the results the parameter constellation used
by \tw{pVGen} and \tw{BH} and also the position of the used
data in \tw{sim\$data}
\item Append this extended result to \tw{sim\$results}
\item Repeat this 9 more times
\item Call \tw{pVGen} with \tw{rho = 0.2}
\item and so on
\end{enumerate}
So we have now a bunch of data sets and results.
<<>>=
length(sim$data)
length(sim$results)
@
60 data sets have been generated because we have \tw{rho = 0, 0.2, 0.4, 0.6, 0.8, 1} and for
each \tw{rho} we generated 10 data sets. We have applied \tw{BH} with \tw{alpha = 0.5} and
\tw{BH} with \tw{alpha = 0.25} to all 60 data sets, yielding 120 objects in \tw{results}.
We see from the \tw{data.set.number} that after generating a data set it is used twice in succession:
<<>>=
sapply(sim$results, function(x) x$data.set.number)
@
In order to determine, for the different parameter constellations, how many true hypotheses
have been rejected, we need a function that calculates this number. Such a statistic function
takes the generated data set and a result object as arguments.
<<>>=
NumberOfType1Error <- function(data, result) sum(data$groundTruth * result$rejected)
V2 <- function(data, result) NumberOfType1Error(data,result)^2
@
Let us calculate the number of rejected true hypotheses explicitly for the first result object.
<<>>=
idx <- sim$results[[1]]$data.set.number
data <- sim$data[[idx]]
NumberOfType1Error(data, sim$results[[1]])
@
This means for the following parameter constellation
<<>>=
sim$results[[1]]$parameters
@
we observed
<<echo=FALSE>>=
NumberOfType1Error(data, sim$results[[1]])
@
rejected true null hypotheses. In order to estimate $E[V_n(BH)]$ and $E[V^2_n(BH)]$
we would have to search \tw{sim\$results} for the other 9 results using the same
parameter constellation. To facilitate this task we provide the function \tw{gatherStatistics}.
<<>>=
result <- gatherStatistics(sim,
  list(V = NumberOfType1Error, V2 = V2),
  mean)
result
@
As you can see, every parameter constellation has its own row in \tw{\$statisticDF}.
By the way, we again see that \tw{pValues} is a parameter of \tw{BH}, but since it is contained in \tw{\$procInput} it is
not considered a "real parameter".\\
If we are interested in more than one summary of a statistic, we simply provide a list of "averaging functions":
<<>>=
result <- gatherStatistics(sim,
  list(V = NumberOfType1Error, V2 = V2),
  list(mean = mean, sd = function(vec) round(sd(vec), 1)))
result$statisticDF
@
Another possibility is to call \tw{gatherStatistics} without an "averaging function". Then every result
gets its own row in \tw{\$statisticDF}:
<<>>=
result <- gatherStatistics(sim,
  list(V = NumberOfType1Error, V2 = V2))
head(result$statisticDF)
tail(result$statisticDF)
@
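If preferred, such averages can also be computed directly from this raw
\tw{data.frame}, for example with \tw{aggregate}. A sketch, assuming
\tw{\$statisticDF} contains the columns \tw{V}, \tw{rho} and \tw{alpha}
(as the lattice formulas below suggest):
<<eval=FALSE>>=
## Cross-check of the mean of V per parameter constellation;
## assumes statisticDF has the columns V, rho and alpha.
aggregate(V ~ rho + alpha, data = result$statisticDF, FUN = mean)
@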
\subsection{Plotting a bit}
<<>>=
if(!is.element("sim.plot", ls()))
{
sim.plot <- simulation(replications = 1000,
list(funName = "pVGen",
fun = pValFromEquiCorrData,
sampleSize = 100,
rho = seq(0, 1, by = 0.2),
mu = 0,
pi0 = 1
),
list(
list(funName = "BH",
fun = function(pValues, alpha) BH(
pValues,
alpha,
silent=TRUE),
alpha = c(0.5, 0.75)
)
)
)
}
length(sim.plot$data)
length(sim.plot$results)
@
There are already different kinds of R functions that take \tw{data.frame}s and generate
histograms, boxplots and so on from them.
For some plot examples we will need the lattice package.
<<>>=
require(lattice)
@
First we calculate V for every \mts{} and then plot a histogram and a boxplot of V,
see Figure~\ref{fig:HistV1} (p.~\pageref{fig:HistV1}) and Figure~\ref{fig:BWV1} (p.~\pageref{fig:BWV1}).
<<>>=
result.all <- gatherStatistics(
sim.plot,
list(V = NumberOfType1Error)
)
@
<<label=figHistV1,include=FALSE, echo=FALSE>>=
print(
histogram(~V | rho,
data = subset(
result.all$statisticDF,
alpha=="0.5")))
@
\begin{figure}
\begin{center}
<<label=fig1,fig=TRUE,echo=TRUE>>=
<<figHistV1>>
@
\end{center}
\caption{Histogram of the number of rejected true hypotheses for \tw{alpha} = 0.5}
\label{fig:HistV1}
\end{figure}
<<label=figBWV1,include=FALSE, echo=FALSE>>=
print(
  bwplot(V ~ rho | alpha,
    data = result.all$statisticDF))
@
\begin{figure}
\begin{center}
<<label=fig2,fig=TRUE,echo=TRUE>>=
<<figBWV1>>
@
\end{center}
\caption{Boxplot of the number of rejected true hypotheses}
\label{fig:BWV1}
\end{figure}
Also after the "averaging process" we again get a \tw{data.frame} which can be used to generate
plots.
<<>>=
result <- gatherStatistics(
sim.plot,
list(V = NumberOfType1Error),
list(Ml = function(x)
mean(x) - 2 * sd(x)/sqrt(length(x)),
Mu = function(x)
mean(x) + 2 * sd(x)/sqrt(length(x)),
M = mean,
SD = sd)
)
subset(result$statisticDF, alpha=="0.5")[1:3,-1]
@
<<label=figXYPlot,include=FALSE, echo=FALSE>>=
print(
xyplot(
V.Ml + V.M + V.Mu ~ rho | alpha,
data = result$statisticDF,
type = "a",
auto.key=list(
space="right",
points=FALSE,
lines=TRUE)
)
)
@
\begin{figure}
\begin{center}
<<label=fig3,fig=TRUE,echo=TRUE>>=
<<figXYPlot>>
@
\end{center}
\caption{XYPlot of the mean number of rejected true hypotheses
with asymptotic 95\% confidence intervals (pointwise)}
\label{fig:XYPlotV1}
\end{figure}
\subsection{Memory considerations}
Up to now we have kept all information in memory. Sometimes it is useful not to discard any
generated information. For example,
if some objects in the results appear odd, they can be reproduced directly, and of course with all
information available we can investigate anything, like the distribution of a statistic we
did not consider before the simulation. The price for this liberty is in general an
extensive memory usage, which of course restricts the size of our simulation. On the other
hand, if we know exactly that we are only interested in $E[V_n]$, the mean number of
rejected true hypotheses, why should we keep the generated \pvs or the indices of the rejected
\pvs? Then it would be enough that an object in \tw{\$results} contains $V_n$, the number
of rejected true hypotheses. We will now show an example where only the needed information
is kept in memory. The following is basically the "simple example" from section \ref{a.simple.example}.
The main issue is to save memory, thus we will calculate $V_n$ right after applying the
procedure to the generated data set. If done so, we can also discard the generated data!
<<>>=
V.BH <- function(pValues, groundTruth, alpha)
{
  out <- BH(pValues, alpha, silent = TRUE)
  # number of rejected true null hypotheses
  V <- sum(out$rejected * groundTruth)
  return(list(V = V))
}
@
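As a quick standalone sketch (with made-up inputs, not part of the simulation),
\tw{V.BH} can be checked directly:
<<eval=FALSE>>=
## Sketch: 100 uniform p-values where, hypothetically, the first
## 50 hypotheses are the true nulls.
p <- runif(100)
gt <- c(rep(TRUE, 50), rep(FALSE, 50))
V.BH(p, gt, alpha = 0.25)
@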
As already mentioned, anything in \tw{\$procInput} will be used as input for the procedures.
Since our procedure now has a parameter \tw{groundTruth}, our data generating function has to
be adapted. We just move \tw{groundTruth} into \tw{\$procInput}.
<<>>=
pValFromEquiCorrData2 <- function(sampleSize, rho, mu, pi0)
{
  nmbOfFalseHyp <- round(sampleSize * (1 - pi0))
  nmbOfTrueHyp <- sampleSize - nmbOfFalseHyp
  muVec <- c(rep(mu, nmbOfFalseHyp), rep(0, nmbOfTrueHyp))
  Y <- sqrt(rho) * rnorm(1) + sqrt(1 - rho) * rnorm(sampleSize) + muVec
  return(list(
    procInput = list(pValues = 1 - pnorm(Y),
      groundTruth = muVec == 0)
  ))
}
@
Obviously, it is unnecessary
to store this generated data, because the number of Type 1 errors we are interested in
is calculated directly.
Setting \tw{discardProcInput = TRUE} in \tw{simulation} removes \tw{\$procInput} from the object
generated by the data generating function after \tw{\$procInput} has been used by all specified
procedures. In our case these will be the 2 procedures \tw{V.BH} with \tw{alpha = 0.5} and
\tw{alpha = 0.25}.
<<>>=
set.seed(123)
sim <- simulation(replications = 10,
  list(funName = "pVGen2", fun = pValFromEquiCorrData2,
    sampleSize = 100,
    rho = seq(0, 1, by = 0.2),
    mu = 2,
    pi0 = 0.5),
  list(list(funName = "V.BH", fun = V.BH, alpha = c(0.5, 0.25))),
  discardProcInput = TRUE)
result <- gatherStatistics(sim,
  list(V = function(data, results) results$V),
  mean)
result
@
Here is the corresponding table from section \ref{a.simple.example}:
<<>>=
set.seed(123)
sim.simple <- simulation(replications = 10,
  list(funName = "pVGen", fun = pValFromEquiCorrData,
    sampleSize = 100,
    rho = seq(0, 1, by = 0.2),
    mu = 2,
    pi0 = 0.5),
  list(list(funName = "BH", fun = BH, alpha = c(0.5, 0.25), silent = TRUE)))
result.simple <- gatherStatistics(sim.simple,
  list(V = NumberOfType1Error),
  mean)
print(result.simple)
@
The results are the same, but the memory usage differs considerably:
<<>>=
print(s1 <- object.size(sim))
print(s2 <- object.size(sim.simple))
unclass(s2/s1)
@
\section{Reproducing some results from BKY (2006)}
Now we will reproduce Figure 1 from the publication. For this we need to estimate the FDR.
This will be done by calculating the FDP for every object in \tw{\$results} and then taking
the empirical mean. In this simulation we know exactly what we want and thus retain only
the necessary information.
To be able to calculate the FDP right after applying the procedure, we need to know which
\pv belongs to a true or false hypothesis. Thus, like in the foregoing section,
\tw{\$groundTruth} will be an element of \tw{\$procInput}; we reuse the data generating
function \tw{pValFromEquiCorrData2} defined there.
We also need a function that calculates the FDP.
<<>>=
FDP <- function(pValues, groundTruth, proc = c("BH", "M-S-HLF"))
{
  proc <- match.arg(proc)  # select exactly one procedure
  if (proc == "BH") {
    out <- BH(pValues, alpha = 0.05, silent = TRUE)
  } else {
    out <- adaptiveSTS(pValues, alpha = 0.05, silent = TRUE)
  }
  R <- sum(out$rejected)                # total number of rejections
  if (R == 0) return(list(FDP = 0))
  V <- sum(out$rejected * groundTruth)  # rejected true null hypotheses
  return(list(FDP = V / R))
}
@
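Before starting the full simulation, here is a small sketch of \tw{FDP} applied
to a single data set generated by \tw{pValFromEquiCorrData2} (the chosen
parameter values are just for illustration):
<<eval=FALSE>>=
## Sketch: FDP of BH on one generated data set
set.seed(1)
d <- pValFromEquiCorrData2(sampleSize = 64, rho = 0, mu = 5, pi0 = 0.75)
FDP(d$procInput$pValues, d$procInput$groundTruth, proc = "BH")
@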
Now, the (partial) reproduction of the results can start!
<<>>=
set.seed(123)
date()
sim <- simulation(replications = 10000,
  list(funName = "pVGen2", fun = pValFromEquiCorrData2,
    sampleSize = c(16, 32, 64, 128, 256, 512),
    rho = c(0, 0.1, 0.5),
    mu = 5,
    pi0 = c(0.25, 0.75)),
  list(list(funName = "FDP", fun = FDP, proc = c("BH", "M-S-HLF"))),
  discardProcInput = TRUE)
date()
result <- gatherStatistics(sim,
  list(FDP = function(data, results) results$FDP),
  mean)
result
@
<<fig=TRUE>>=
print(xyplot(FDP.mean ~ sampleSize | pi0 * rho,
  data = result$statisticDF, group = proc,
  type = "a",
  auto.key = list(space = "right", points = FALSE, lines = TRUE)))
@
\end{document}