1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358
|
---
title: "*HTSFilter* package: Quick-start guide"
author:
- name: Andrea Rau
affiliation: INRAE
email: andrea.rau@inrae.fr
- name: Mélina Gallopin
- name: Gilles Celeux
- name: Florence Jaffrézic
package: HTSFilter
date: "`r Sys.Date()`"
bibliography: vignette_refs.bib
output:
BiocStyle::html_document:
toc_float: true
BiocStyle::pdf_document: default
vignette: >
%\VignetteIndexEntry{HTSFilter}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, fig.align = "center", echo=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```
This vignette illustrates the use of the *HTSFilter* package to filter
replicated data from transcriptome sequencing experiments (e.g., RNA sequencing
data) for a variety of different data classes: `matrix`,
`data.frame`, the S3 classes associated with the *edgeR* package (`DGEExact`
and `DGELRT`), and the S4 class associated with the *DESeq2* package
(`DESeqDataSet`).
# Introduction
High-throughput sequencing (HTS) data, such as RNA-sequencing (RNA-seq) data,
are increasingly used to conduct differential analyses, in which statistical
tests are performed for each biological feature (e.g., a gene, transcript, exon)
in order to identify those whose expression levels show systematic covariation
with a particular condition, such as a treatment or phenotype of interest.
For the remainder of this vignette, we will focus on gene-level differential
analyses, although these methods may also be applied to differential analyses
of (count-based measures of) transcript- or exon-level expression.
Because hypothesis tests are performed for gene-by-gene differential analyses,
the obtained $p$-values must be adjusted to correct for multiple testing.
However, procedures to adjust $p$-values to control the number of detected
false positives often lead to a loss of power to detect truly differentially
expressed (DE) genes due to the large number of hypothesis tests performed.
To reduce the impact of such procedures, independent data filters are often
used to identify and remove genes that appear to generate an uninformative
signal [@Bourgon2010]; this in turn moderates the correction needed to adjust
for multiple testing. For independent filtering methods for microarray data,
see for example the
[*genefilter*](https://bioconductor.org/packages/release/bioc/html/genefilter.html)
Bioconductor package [@Gentleman].
The *HTSFilter* package implements a novel data-based filtering procedure
based on the calculation of a similarity index among biological replicates
for read counts arising from replicated transcriptome sequencing (RNA-seq)
data. This technique provides an intuitive data-driven way to filter
high-throughput transcriptome sequencing data and to effectively remove genes
with low, constant expression levels without incorrectly removing those that
would otherwise have been identified as DE. The two fundamental assumptions of
the filter implemented in the *HTSFilter* package are as follows:
* Biological replicates are present for each experimental condition, and
* Data can be appropriately normalized (scaled) to correct for systematic
inter-sample biases.
Assuming these conditions hold, *HTSFilter* implements a method to identify a
filtering threshold that maximizes the *filtering similarity* among replicates,
that is, one where most genes tend to either have normalized counts less than
or equal to the cutoff in all samples (i.e., filtered genes) or greater than
the cutoff in all samples (i.e., non-filtered genes). This filtering similarity
is defined using the global Jaccard index, that is, the average Jaccard index
calculated between pairs of replicates within each experimental condition;
see @Rau2012a for more details.
For more information about between-sample normalization strategies, see
@Dillies2012; in particular, strategies for normalizing data with differences
in library size and composition may be found in @Anders2010 and @Robinson2010,
and strategies for normalizing data exhibiting sample-specific biases due to
GC content may be found in @Risso2011 and @Hansen2012. Within the *HTSFilter*
package, the Trimmed Means of M-values (TMM) [@Robinson2010] and DESeq
[@Anders2010] normalization strategies may be used prior to calculating an
appropriate data-based filter. If an alternative normalization strategy is
needed or desired, the normalization may be applied prior to filtering the
data with `normalization="none" `in the `HTSFilter` function;
see Section \@ref(edaseqnorm) for an example.
The *HTSFilter* package is able to accommodate unnormalized or normalized
replicated count data in the form of a `matrix` or `data.frame` (in which each
row corresponds to a biological feature and each column to a biological
sample), one of the S3 classes associated with the *edgeR* package (`DGEList`,
`DGEExact`, `DGEGLM`, and `DGELRT`), or `DESeqDataSet` (the S4 class associated
with the *DESeq2* package), as illustrated in the following sections.
Finally, we note that the filtering method implemented in the *HTSFilter*
package is designed to filter transcriptome sequencing, and not microarray,
data; in particular, the proposed filter is effective for data with features
that take on values over a large order of magnitude and with a subset of
features exhibiting small levels of expression across samples (see, for example,
Figure \@ref(fig:loadingPackages)). In this vignette, we illustrate its use on
count-based measures of gene expression, although its use is not strictly
limited to discrete data.
# Input data
For the purposes of this vignette, we make use of data from a study of
sex-specific expression of liver cells in human and the *DESeq2* and *edgeR*
packages for differential analysis. @Sultan2008 obtained a high-throughput
sequencing data (using a 1G Illumina Genome Analyzer sequencing machine) from a
human embryonic kidney and a B cell line, with two biological replicates each.
The raw read counts and phenotype tables were obtained from the ReCount online
resource [@Frazee2011].
To begin, we load the *HTSFilter* package, attach the gene-level count
data contained in `sultan`, and take a quick look at a histrogram of gene counts:
```{r loadingPackages, message=FALSE, fig.cap="Histogram of log transformed counts from the Sultan data. This illustrates the large number of genes with very small counts as well as the large heterogeneity in counts observed."}
library(HTSFilter)
library(edgeR)
library(DESeq2)
data("sultan")
hist(log(exprs(sultan)+1), col="grey", breaks=25, main="",
xlab="Log(counts+1)")
pData(sultan)
dim(sultan)
```
The unfiltered data contain 9010 genes in four samples (two replicates per
condition).
# Use of *HTSFilter* with varying data types
## `matrix` and `data.frame` classes
To filter high-throughput sequencing data in the form of a `matrix` or
`data.frame`, we first access the expression data, contained in `exprs(sultan)`,
and create a vector identifying the condition labels for each of the samples
via the `pData` *Biobase* function. We then filter the data using the
`HTSFilter` function, specifying that the number of tested thresholds be only 25
(`s.len=25`) rather than the default value of 100 to reduce computation time
for this example. Note that as it is unspecified, the default normalization
method is used for filtering the data, namely the Trimmed Mean of M-values
(TMM) method of @Robinson2010. To use the DESeq normalization method
[@Anders2010] `normalization="DESeq"` may be specified.
```{r matrix, fig.cap="Global Jaccard index for the Sultan data. The index is calculated for a variety of threshold values after TMM normalization, with a loess curve (blue line) superposed and data-based threshold values (red cross and red dotted line) equal to 11.764."}
mat <- exprs(sultan)
conds <- as.character(pData(sultan)$cell.line)
## Only 25 tested thresholds to reduce computation time
filter <- HTSFilter(mat, conds, s.min=1, s.max=200, s.len=25)
mat <- filter$filteredData
dim(mat)
dim(filter$removedData)
```
For this example, we find a data-based threshold equal to 11.764; genes with
normalized values less than this threshold in all samples are filtered from
subsequent analyses. The proposed filter thus removes 4015 genes from further
analyses, leaving 4995 genes.
We note that an important part of the filter proposed in the *HTSFilter*
package is a check of the behavior of the global similarity index calculated
over a range of threshold values, and in particular, to verify that a
reasonable maximum value is reached for the global similarity index over the
range of tested threshold values (see Figure \@ref(fig:matrix)); the maximum
possible value for the global Jaccard index is nearly 1. To illustrate the
importance of this check, we attempt to re-apply the proposed filter to the
previously filtered data (in practice, of course, this would be nonsensical):
```{r refilter, fig.cap="HTSFilter on pre-filtered Sultan data. (left) The global Jaccard index on the pre-filtered data is calculated for a variety of threshold values after TMM normalization, with a loess curve (blue line) superposed and data-based threshold values (red cross and red dotted line) equal to 1.64. (right) Histogram of the re-filtered data after applying HTSFilter."}
par(mfrow = c(1,2), mar = c(4,4,2,2))
filter.2 <- HTSFilter(mat, conds, s.len=25)
dim(filter.2$removedData)
hist(log(filter.2$filteredData+1), col="grey", breaks=25, main="",
xlab="Log(counts+1)")
```
In the lefthand panel of Figure \@ref(fig:refilter), we note a plateau of large
global Jaccard index values for thresholds less than 2, with a decrease
thereafter; this corresponds to filtering no genes, unsurprising given that
genes with low, constant levels of expression have already been filtered from
the analysis (see the righthand panel of Figure \@ref(fig:refilter)).
## *edgeR* package pipeline
We next illustrate the use of *HTSFilter* within the *edgeR* pipeline
for differential analysis (S3 classes `DGEList`, `DGEExact`, `DGEGLM`, or
`DGELRT`). For the purposes of this vignette, we will consider the S3 classes
`DGEExact` and `DGELRT`. The former is the class containing the results of the
differential expression analysis between two groups of count libraries
(resulting from a call to the function `exactTest` in *edgeR*); the latter is
the class containing the results of a generalized linear model (GLM)-based
differential analysis (resulting from a call to the function `glmLRT` in
*edgeR*). Although the filter may be applied earlier in the *edgeR*
pipeline (i.e., to objects of class `DGEList` or `DGEGLM`), we do not
recommend doing so, as parameter estimation makes use of counts adjusted
using a quantile-to-quantile method (pseudo-counts).
### S3 class `DGEExact`
We first coerce the data into the appropriate class with the function
`DGEList`, where the `group` variable is set to contain a vector of
condition labels for each of the samples. Next, after calculating normalizing
factors to scale library sizes (`calcNormFactors`), we estimate common and
tagwise dispersion parameters using `estimateDisp` (using the quantile
conditional likelihood for each gene) and obtain differential analysis
results using `exactTest`. Finally, we apply the filter using the `HTSFilter`
function, again specifying that the number of tested thresholds be only 25
(`s.len=25`) rather than the default value of 100. Note that as it is
unspecified, the default normalization method is used for filtering the data,
namely the Trimmed Mean of M-values (TMM) method [@Robinson2010]; alternative
normalization, including `"pseudo.counts"` for the quantile-to-quantile
adjusted counts used for parameter estimation, may also be specified.
We suppress the plot of the global Jaccard index below using
`plot = FALSE`, as it is identical to that shown in Figure \@ref(fig:matrix).
```{r dge}
dge <- DGEList(counts=exprs(sultan), group=conds)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)
et <- exactTest(dge)
et <- HTSFilter(et, DGEList=dge, s.len=25, plot=FALSE)$filteredData
dim(et)
class(et)
topTags(et)
```
Note that the filtered data are of the class `DGEExact`, allowing for a call
to the `topTags` function.
```{r dgetoptags}
topTags(et)
```
### S3 class `DGELRT`
We follow the same steps as the previous example, where the `estimateDisp`
function is now used to obtain per-gene dispersion parameter estimates using
the adjusted profile loglikelihood, the `glmFit` function is used to fit a
negative binomial generalized log-linear model to the read counts for each
gene, and the `glmLRT` function is used to conduct likelihood ratio tests for
one or more coefficients in the GLM. The output of `glmLRT` is an S3 object of
class `DGELRT` and contains the GLM differential analysis results.
As before, we apply the filter using the `HTSFilter` function, again
suppressing the plot of the global Jaccard index using \Rcode{plot = FALSE},
as it is identical to that shown in Figure \@ref(fig:matrix).
```{r dge2}
design <- model.matrix(~conds)
dge <- DGEList(counts=exprs(sultan), group=conds)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, design)
fit <- glmFit(dge,design)
lrt <- glmLRT(fit,coef=2)
lrt <- HTSFilter(lrt, DGEGLM=fit, s.len=25, plot=FALSE)$filteredData
dim(lrt)
class(lrt)
```
Note that the filtered data are of the class `DGEList`, allowing for a call
to the `topTags` function.
```{r dge2toptags}
topTags(lrt)
```
## *DESeq2* package pipeline: S4 class `DESeqDataSet`}
The *HTSFilter* package allows for a straightforward integration within the
*DESeq2* analysis pipeline, most notably allowing for $p$-values to be
adjusted only for those genes passing the filter. Note that *DESeq2* now
implements an independent filtering procedure by default in the `results`
function; this filter is a potential alternative filtering technique and does
not need to be used in addition to the one included in *HTSFilter*. In fact,
each filter is targeting the same weakly expressed genes to be filtered from
the analysis. As such, if the user wishes to make use of *HTSFilter* within
the *DESeq2* pipeline, the argument `independentFiltering=FALSE` should be
used when calling the `results` function in *DESeq2*.
To illustrate the application of a filter for high-throughput sequencing data
in the form of a `DESeqDataSet` (the class used within the *DESeq2* pipeline
for differential analysis), we coerce `sultan` into an object of the class
`DESeqDataSet` using the function `DESeqDataSetFromMatrix`. Once again,
we specify that the number of tested thresholds be only 25 (`s.len=25`)
rather than the default value of 100 to reduce computation time.
or objects in the form of a `DESeqDataSet`, the default normalization
strategy is `"DESeq"`, although alternative normalization strategies may also
be used. Note that below we replace the spaces in condition names with "."
prior to creating a `"DESeqDataSet"` object.
```{r cds2}
conds <- gsub(" ", ".", conds)
dds <- DESeqDataSetFromMatrix(countData = exprs(sultan),
colData = data.frame(cell.line = factor(conds)),
design = ~ cell.line)
dds <- DESeq(dds)
filter <- HTSFilter(dds, s.len=25, plot=FALSE)$filteredData
class(filter)
dim(filter)
res <- results(filter, independentFiltering=FALSE)
head(res)
```
The filtered data remain an object of class `DESeqDataSet`, and subsequent
functions from *DESeq2* (such as the results summary function \Rcode{results})
may be called directly upon it.
# Alternative normalization using *EDAseq* {#edaseqnorm}
As a final example, we illustrate the use of the *HTSFilter* package with an
alternative normalization strategy, namely the full quantile normalization
method in the *EDAseq* package; such a step may be useful when the TMM or
DESeq normalization methods are not appropriate for a given dataset. Once
again, we create a new object of the appropriate class with the function
`newSeqExpressionSet` and normalize data using the `betweenLaneNormalization`
function (with `which="full"`) in *EDAseq*.
```{r ses2, eval=FALSE}
library(EDASeq)
ses <- newSeqExpressionSet(exprs(sultan),
phenoData=pData(sultan))
ses.norm <- betweenLaneNormalization(ses, which="full")
```
Subsequently, `HTSFilter` is applied to the normalized data (again using
`s.len=25`), and the normalization method is set to `norm="none"`. We may then
make use of the `on` vector in the results, which identifies filtered and
unfiltered genes (respectively) with 0 and 1, to identify rows in the original
data matrix to be retained.
```{r ses3, eval=FALSE}
filter <- HTSFilter(counts(ses.norm), conds, s.len=25, norm="none",
plot=FALSE)
head(filter$on)
table(filter$on)
```
# Session Info
```{r sessionInfo}
sessionInfo()
```
# References
|