File: tximeta.Rmd

package info (click to toggle)
r-bioc-tximeta 1.24.0%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 1,256 kB
  • sloc: makefile: 2
file content (850 lines) | stat: -rw-r--r-- 32,631 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
---
title: "Tximeta: transcript quantification import with automatic metadata"
author: "Michael I. Love, Charlotte Soneson, Peter F. Hickey, Rob Patro"
date: "`r format(Sys.time(), '%m/%d/%Y')`"
output: 
  rmarkdown::html_document:
    highlight: tango
    toc: true
    toc_float: true
abstract: >
  Tximeta performs numerous annotation and metadata gathering tasks on
  behalf of users during the import of transcript quantifications from
  *Salmon*, *alevin*, or *piscem-infer* into R/Bioconductor. Metadata 
  and transcript ranges are added automatically, facilitating genomic 
  analyses and assisting in computational reproducibility.
bibliography: library.bib
vignette: |
  %\VignetteIndexEntry{Transcript quantification import with automatic metadata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

The `tximeta` package [@Love2020] extends the `tximport` package
[@Soneson2015] for import of transcript-level quantification data into
R/Bioconductor. It automatically adds annotation metadata when the
RNA-seq data has been quantified with *Salmon* [@Patro2017] or 
[piscem-infer](https://piscem-infer.readthedocs.io/en/latest/), or the
scRNA-seq data quantified with *alevin* [@Srivastava2019]. To our
knowledge, `tximeta` is the only package for RNA-seq data import that
can automatically identify and attach transcriptome metadata based on
the unique sequence of the reference transcripts.
For more details on these packages -- including the motivation for
`tximeta` and description of similar work -- consult the
**References** below.

**Note:** `tximeta` requires that the **entire output** of
*Salmon* / *piscem-infer* / *alevin* is present and unmodified in
order to identify the provenance of the reference transcripts. In
general, it's a good idea to not modify or re-arrange the output
directory of bioinformatic software as other downstream software rely
on and assume a consistent directory structure. For sharing multiple
samples, one can use, for example, `tar -czf` to bundle up a set of
Salmon output directories, or to bundle one alevin output
directory. For tips on using `tximeta` with other quantifiers see the
[other quantifiers](#other_quantifiers) section below.

```{r echo=FALSE}
knitr::include_graphics("images/diagram.png")
```

# Tximeta import starts with sample table

The first step using `tximeta` is to read in the sample table, which
will become the *column data*, `colData`, of the final object,
a *SummarizedExperiment*. The sample table should contain all the
information we need to identify the *Salmon* quantification
directories. 
For *alevin* quantification, one should point to the `quants_mat.gz`
file that contains the counts for all of the cells (also, in order to
`tximeta` to work with *alevin* quantification, it requires that
*alevin* was run using gene IDs in the `tgMap` step and not gene
symbols).

Here we will use a *Salmon* quantification file in the
*tximportData* package to demonstrate the usage of `tximeta`. We do
not have a sample table, so we construct one in R. It is recommended
to keep a sample table as a CSV or TSV file while working on an
RNA-seq project with multiple samples.

```{r}
dir <- system.file("extdata/salmon_dm", package="tximportData")
files <- file.path(dir, "SRR1197474", "quant.sf") 
file.exists(files)
coldata <- data.frame(files, names="SRR1197474", condition="A", stringsAsFactors=FALSE)
coldata
```

`tximeta` expects at least two columns in `coldata`: 

1. `files` - a pointer to the `quant.sf` files
2. `names` - the unique names that should be used to identify samples

# Running tximeta

Normally, we would just run `tximeta` like so:

```{r eval=FALSE}
library(tximeta)
se <- tximeta(coldata)
```

However, to avoid downloading remote GTF files during this vignette,
we will point to a GTF file saved locally (in the *tximportData*
package). We link the transcriptome of the *Salmon* index to its
locally saved GTF. The standard recommended usage of `tximeta` would
be the code chunk above, or to specify a remote GTF source, not a
local one. **This following code is therefore not recommended for a
typically workflow, but is particular to the vignette code.**

```{r}
indexDir <- file.path(dir, "Dm.BDGP6.22.98_salmon-0.14.1")
fastaFTP <- c("ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.22.cdna.all.fa.gz",
              "ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.22.ncrna.fa.gz")
gtfPath <- file.path(dir,"Drosophila_melanogaster.BDGP6.22.98.gtf.gz")
suppressPackageStartupMessages(library(tximeta))
makeLinkedTxome(indexDir=indexDir,
                source="LocalEnsembl",
                organism="Drosophila melanogaster",
                release="98",
                genome="BDGP6.22",
                fasta=fastaFTP,
                gtf=gtfPath,
                write=FALSE)
```

```{r message=FALSE}
library(tximeta)
```

```{r}
se <- tximeta(coldata)
```

# What happened? 

`tximeta` recognized the computed *digest* of the transcriptome that
the files were quantified against, it accessed the GTF file of the
transcriptome source, found and attached the transcript ranges, and
added the appropriate transcriptome and genome metadata.
A *digest* is a small string of alphanumeric characters that uniquely
identifies the collection of sequences that were used for
quantification (it is the application of a hash function). We
sometimes also call this value a "checksum" (in the tximeta paper).

A remote GTF is only
downloaded once, and a local or remote GTF is only parsed to build a
*TxDb* or *EnsDb* once: if `tximeta` recognizes that it has seen this *Salmon*
index before, it will use a cached version of the metadata and
transcript ranges.

Note the warning above that 5 of the transcripts are missing from the
GTF file and so are dropped from the final output. This is a problem
coming from the annotation source, and not easily avoided by
`tximeta`. 

# TxDb, EnsDb, and AnnotationHub

`tximeta` makes use of Bioconductor packages for storing
transcript databases as *TxDb* or *EnsDb* objects, which both are 
connected by default to `sqlite` backends.
For GENCODE and RefSeq GTF files, `tximeta` uses the *txdbmaker*
package [@granges] to parse the GTF and build a *TxDb*. For Ensembl
GTF files, `tximeta` will first attempt to obtain the correct *EnsDb*
object using *AnnotationHub*. The *ensembldb* package [@ensembldb]
contains classes and methods for extracting relevant data from Ensembl
files. If the *EnsDb* has already been made available on
AnnotationHub, `tximeta` will download the database directly, which
saves the user time parsing the GTF into a database (to avoid this,
set `useHub=FALSE`). If the relevant *EnsDb* is not available on
AnnotationHub, `tximeta` will build an *EnsDb* using *ensembldb* after
downloading the GTF file. Again, the download/construction of a
transcript database occurs only once, and upon subsequent usage of
*tximeta* functions, the cached version will be used.

# Pre-computed digests

We plan to support a wide variety of sources and organisms for
transcriptomes with pre-computed digests, though for now the
software focuses on predominantly human and mouse transcriptomes 

The following digests are supported in this version of `tximeta`:

```{r echo=FALSE}
dir2 <- system.file("extdata", package="tximeta")
tab <- read.csv(file.path(dir2, "hashtable.csv"),
                stringsAsFactors=FALSE)
release.range <- function(tab, source, organism) {
  tab.idx <- tab$organism == organism & tab$source == source
  rels <- tab$release[tab.idx]
  if (organism == "Mus musculus" & source == "GENCODE") {
    paste0("M", range(as.numeric(sub("M","",rels))))
  } else if (source == "RefSeq") {
    paste0("p", range(as.numeric(sub(".*p","",rels))))
  } else {
    range(as.numeric(rels))
  }
}
dat <- data.frame(
  source=rep(c("GENCODE","Ensembl","RefSeq"),c(2,3,2)),
  organism=c("Homo sapiens","Mus musculus",
             "Drosophila melanogaster")[c(1:2,1:3,1:2)]
)
rng <- t(sapply(seq_len(nrow(dat)), function(i)
  release.range(tab, dat[i,1], dat[i,2])))
dat$releases <- paste0(rng[,1], "-", rng[,2])
knitr::kable(dat)
```

For Ensembl transcriptomes, we support the combined protein coding
(cDNA) and non-coding (ncRNA) sequences, as well as the protein coding
alone (although the former approach combining coding and non-coding
transcripts is recommended for more accurate quantification).

`tximeta` also has functions to support *linked transcriptomes*,
where one or more sources for transcript sequences have been combined
or filtered. See the **Linked transcriptome** section below for a
demonstration. (The *makeLinkedTxome* function was used above to avoid
downloading the GTF during the vignette building process.)

# SummarizedExperiment output

We have our coldata from before. Note that we've removed `files`.

```{r}
suppressPackageStartupMessages(library(SummarizedExperiment))
colData(se)
```

Here we show the three matrices that were imported. 

```{r}
assayNames(se)
```

If there were inferential replicates (Gibbs samples or
bootstrap samples), these would be imported as additional assays named
`"infRep1"`, `"infRep2"`, ...

`tximeta` has imported the correct ranges for the transcripts:

```{r}
rowRanges(se)
```

We have appropriate genome information, which prevents us from making 
bioinformatic mistakes:

```{r}
seqinfo(se)
```

# Retrieve the transcript database

The `se` object has associated metadata that allows `tximeta` to link
to locally stored cached databases and other Bioconductor objects. In
further sections, we will show examples functions that leverage this
databases for adding exon information, summarize transcript-level data
to the gene level, or add identifiers. However, first we mention that
the user can easily access the cached database with the following
helper function. In this case, `tximeta` has an associated *EnsDb*
object that we can retrieve and use in our R session:

```{r}
edb <- retrieveDb(se)
class(edb)
```

The database returned by `retrieveDb` is either a *TxDb* in the case
of GENCODE or RefSeq GTF annotation file, or an *EnsDb* in the case of
an Ensembl GTF annotation file. For further use of these two database
objects, consult the *GenomicFeatures* vignettes and the *ensembldb*
vignettes, respectively (both Bioconductor packages).

# Add exons per transcript

Because the SummarizedExperiment maintains all the metadata of its
creation, it also keeps a pointer to the necessary database for
pulling out additional information, as demonstrated in the following
sections. 

If necessary, the *tximeta* package can pull down the remote source to
build a TxDb, but given that we've already built a TxDb once, it
simply loads the cached version. In order to remove the cached TxDb
and regenerate, one can remove the relevant entry from the `tximeta`
file cache that resides at the location given by `getTximetaBFC()`.

The `se` object created by `tximeta`, has the start, end, and strand
information for each transcript. Here, we swap out the transcript
*GRanges* for exons-by-transcript *GRangesList* (it is a list of
*GRanges*, where each element of the list gives the exons for a
particular transcript).

```{r}
se.exons <- addExons(se)
rowRanges(se.exons)[[1]]
```

As with the transcript ranges, the exon ranges will be generated once
and cached locally. As it takes a non-negligible amount of time to
generate the exon-by-transcript *GRangesList*, this local caching
offers substantial time savings for repeated usage of `addExons` with
the same transcriptome.

We have implemented `addExons` to work only on the transcript-level
*SummarizedExperiment* object. We provide some motivation for this
choice in `?addExons`. Briefly, if it is desired to know the exons
associated with a particular gene, we feel that it makes more sense to
pull out the relevant set of exons-by-transcript for the transcripts
for this gene, rather than losing the hierarchical structure (exons to
transcripts to genes) that would occur with a *GRangesList* of exons
grouped per gene.

# Easy summarization to gene-level

Likewise, the *tximeta* package can make use of the cached TxDb
database for the purpose of summarizing transcript-level
quantifications and bias corrections to the gene-level. After
summarization, the `rowRanges` reflect the start and end position of
the gene, which in Bioconductor are defined by the leftmost and
rightmost genomic coordinates of all the transcripts. As with the
transcript and exons, the gene ranges are cached locally for repeated
usage. The transcript IDs are stored as a *CharacterList* column
`tx_ids`.

```{r}
gse <- summarizeToGene(se)
rowRanges(gse)
```

## Assign ranges by abundance

We also offer a new type of range assignment, based on the most
abundant isoform rather than the leftmost to rightmost coordinate. See
the `assignRanges` argument of `?summarizeToGene`. Using the most
abundant isoform arguably will reflect more accurate genomic distances
than the default option.

```{r eval=FALSE}
# unevaluated code chunk
gse <- summarizeToGene(se, assignRanges="abundant")
```

For more explanation about why this may be a better choice, see the
following tutorial chapter:

<https://tidyomics.github.io/tidy-ranges-tutorial/gene-ranges-in-tximeta.html>

In the below diagram, the pink feature is the set of all exons
belonging to any isoform of the gene, such that the TSS is on the
right side of this minus strand feature. However, the blue feature is
the most abundant isoform (the brown features are the next most
abundant isoforms). The pink feature is therefore not a good
representation for the locus.

```{r echo=FALSE}
knitr::include_graphics("images/assignRanges-abundant.png")
```

# Add different identifiers

We would like to add support to easily map transcript or gene
identifiers from one annotation to another. This is just a prototype
function, but we show how we can easily add alternate IDs given that we
know the organism and the source of the transcriptome. (This function
currently only works for GENCODE and Ensembl gene or transcript IDs
but could be extended to work for arbitrary sources.)

```{r}
library(org.Dm.eg.db)
gse <- addIds(gse, "REFSEQ", gene=TRUE)
mcols(gse)
```

# Differential expression analysis

The following code chunk demonstrates how to build a *DESeqDataSet*
and begin a differential expression analysis. 

```{r}
suppressPackageStartupMessages(library(DESeq2))
# here there is a single sample so we use ~1.
# expect a warning that there is only a single sample...
suppressWarnings({dds <- DESeqDataSet(gse, ~1)})
# ... see DESeq2 vignette
```

The *Swish* method in the *fishpond* package directly works with the
*SummarizedExperiment* output from *tximeta*, and can perform
differential analysis on transcript expression taking into account
inferential replicates, e.g. bootstrap or Gibbs samples, which are
imported and arranged by `tximeta` if these were generated during
quantification.

```{r}
library(fishpond)
y <- se
# y <- scaleInfReps(y)
# y <- labelKeep(y)
# y <- swish(y, x="condition")
# ... see Swish vignette in fishpond package
```

We have a convenient wrapper function that will build a *DGEList*
object for use with *edgeR*.

```{r}
suppressPackageStartupMessages(library(edgeR))
y <- makeDGEList(gse)
# ... see edgeR User's Guide for further steps
```

The following code chunk demonstrates the code inside of the above
wrapper function, and produces the same output.

```{r}
cts <- assays(gse)[["counts"]]
normMat <- assays(gse)[["length"]]
normMat <- normMat / exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y <- scaleOffset(y, t(t(log(normMat)) + o))
# ... see edgeR User's Guide for further steps
```

The following code chunk demonstrates how one could use the *Swish*
method in the fishpond Bioconductor package. Here we use the
transcript-level object `se`. This dataset only has a single sample
and no inferential replicates, but the analysis would begin with such
code. See the Swish vignette in the fishpond package for a complete
example: 

```{r eval=FALSE}
y <- se # rename the object to 'y'
library(fishpond)
# if inferential replicates existed in the data,
# analysis would begin with:
#
# y <- scaleInfReps(y)
# ... see Swish vignette in the fishpond package
```

For *limma* with *voom* transformation we recommend, as in the
*tximport* vignette to generate counts-from-abundance instead of
providing an offset for average transcript length.

```{r}
gse <- summarizeToGene(se, countsFromAbundance="lengthScaledTPM")
library(limma)
y <- DGEList(assays(gse)[["counts"]])
# see limma User's Guide for further steps
```

Above we generated counts-from-abundance when calling
`summarizeToGene`. The counts-from-abundance status is then stored in
the metadata:

```{r}
metadata(gse)$countsFromAbundance 
```

# Additional metadata

The following information is attached to the *SummarizedExperiment* by
`tximeta`: 

```{r}
names(metadata(se))
str(metadata(se)[["quantInfo"]])
str(metadata(se)[["txomeInfo"]])
str(metadata(se)[["tximetaInfo"]])
str(metadata(se)[["txdbInfo"]])
```

# Errors connecting to a database

`tximeta` makes use of *BiocFileCache* to store transcript and other
databases, so saving the relevant databases in a centralized location
used by other Bioconductor packages as well. It is possible that an
error can occur in connecting to these databases, either if the files
were accidentally removed from the file system, or if there was an
error generating or writing the database to the cache location. In
each of these cases, it is easy to remove the entry in the
*BiocFileCache* so that `tximeta` will know to regenerate the
transcript database or any other missing database.

If you have used the default cache location, then you can obtain
access to your BiocFileCache with:

```{r}
library(BiocFileCache)
bfc <- BiocFileCache()
```

Otherwise, you can recall your particular `tximeta` cache location
with `getTximetaBFC()`.

You can then inspect the entries in your BiocFileCache using `bfcinfo`
and remove the entry associated with the missing database with `bfcremove`. 
See the BiocFileCache vignette for more details on finding and
removing entries from a BiocFileCache.

Note that there may be many entries in the BiocFileCache location,
including `.sqlite` database files and serialized `.rds` files. You
should only remove the entry associated with the missing database,
e.g. if R gave an error when trying to connect to the TxDb associated
with GENCODE v99 human transcripts, you should look for the `rid` of
the entry associated with the human v99 GTF from GENCODE.

# What if digest isn't known?

`tximeta` automatically imports relevant metadata when the
transcriptome matches a known source -- *known* in the sense that it
is in the set of pre-computed hashed digests in `tximeta` (GENCODE,
Ensembl, and RefSeq for human and mouse). `tximeta` also facilitates the
linking of transcriptomes used in building the *Salmon* index with
relevant public sources, in the case that these are not part of this
pre-computed set known to `tximeta`.
The linking of the transcriptome source with the quantification files is
important in the case that the transcript sequence no longer matches a
known source (uniquely combined or filtered FASTA files), or if the
source is not known to `tximeta`. Combinations of coding and
non-coding human, mouse, and fruit fly *Ensembl* transcripts should be
automatically recognized by `tximeta` and does not require making a
*linkedTxome*. As the package is further developed, we plan to roll
out support for all common transcriptomes, from all sources.

**Note:** if you are using Salmon in alignment mode, then there is no
Salmon index, and without the Salmon index, there is no digest. We
don't have a perfect solution for this yet, but you can still
summarize transcript counts to gene with a `tx2gene` table that you
construct manually (see `tximport` vignette for example code). 
Just specify the arguments,
`skipMeta=TRUE, txOut=FALSE, tx2gene=tx2gene`, 
when calling `tximeta` and it will perform summarization to gene level
as in `tximport`.

We now demonstrate how to make a *linkedTxome* and how to share and
load a *linkedTxome*.
We point to a *Salmon* quantification file which was quantified
against a transcriptome that included the coding and non-coding
*Drosophila melanogaster* transcripts, as well as an artificial
transcript of 960 bp (for demonstration purposes only).

```{r}
file <- file.path(dir, "SRR1197474.plus", "quant.sf")
file.exists(file)
coldata <- data.frame(files=file, names="SRR1197474", sample="1",
                      stringsAsFactors=FALSE)
```

Trying to import the files gives a message that `tximeta` couldn't find
a matching transcriptome, so it returns an non-ranged
*SummarizedExperiment*. 

```{r}
se <- tximeta(coldata)
```

# Linked transcriptomes

If the transcriptome used to generate the *Salmon* index does not
match any transcriptomes from known sources (e.g. from combining or
filtering known transcriptome files), there is not much that can be
done to automatically populate the metadata during quantification
import. However, we can facilitate the following two cases:

1) the transcriptome was created locally and has been linked to its
public source(s) 
2) the transcriptome was produced by another group, and
they have produced and shared a file that links the transcriptome to
public source(s)

`tximeta` offers functionality to assist reproducible analysis in both
of these cases.

To make this quantification reproducible, we make a `linkedTxome`
which records key information about the sources of the transcript
FASTA files, and the location of the relevant GTF file. It also
records the digest of the transcriptome that was computed by
*Salmon* during the `index` step.

**Source:** when creating the `linkedTxome` one must specify the 
`source` of the transcriptome. See `?linkedTxome` for a note on 
the implications of this text string. For canonical GENCODE or
Ensembl transcriptomes, one can use `"GENCODE"` or `"Ensembl"`,
but for modified or otherwise any transcriptomes defined by a local
database, it is recommended to use a different string,
`"LocalGENCODE"` or `"LocalEnsembl", which will avoid *tximeta*
pulling canonical GENCODE or Ensembl resources from AnnotationHub.

**Multiple GTF/GFF files:** `linkedTxome` and `tximeta` do not
currently support multiple GTF/GFF files, which is a more complicated
case than multiple FASTA, which is supported. Currently, we recommend
that users should add or combine GTF/GFF files themselves to create
a single GTF/GFF file that contains all features used in
quantification, and then upload such a file to *Zenodo*, which can
then be linked as shown below. Feel free to contact the developers on
the Bioconductor support site or GitHub Issue page for further details
or feature requests.

**Stringtie:** A special note for building on top of
Stringtie-generated transcripts: it is a good idea to change gene
identifiers, to _not_ include a period `.`, as the period will later
be used to separate transcript versions from gene identifiers. This
can be done before building the Salmon index, by changing periods in
the gene identifier to an underscore. See 
[this GitHub issue](https://github.com/thelovelab/tximeta/issues/68) for details.



By default, `linkedTxome` will write out a JSON file which can be
shared with others, linking the digest of the index with the other
metadata, including FASTA and GTF sources. By default, it will write
out to a file with the same name as the `indexDir`, but with a `.json`
extension added. This can be prevented with `write=FALSE`, and the
file location can be changed with `jsonFile`.

First we specify the path where the *Salmon* index is located. 

Typically you would not use `system.file` and `file.path` to locate
this directory, but simply define `indexDir` to be the path of the
*Salmon* directory on your machine. Here we use `system.file` and
`file.path` because we have included parts of a *Salmon* index
directory in the *tximeta* package itself for demonstration of
functionality in this vignette.

```{r}
indexDir <- file.path(dir, "Dm.BDGP6.22.98.plus_salmon-0.14.1")
```

Now we provide the location of the FASTA files and the GTF file for
this transcriptome. 

**Note:** the basename for the GTF file is used as a unique identifier
for the cached versions of the *TxDb* and the transcript ranges, which
are stored on the user's behalf via *BiocFileCache*. This is not an
issue, as GENCODE, Ensembl, and RefSeq all provide GTF files which are
uniquely identified by their filename,
e.g. `Drosophila_melanogaster.BDGP6.22.98.gtf.gz`.

The recommended usage of `tximeta` would be to specify a remote GTF
source, as seen in the commented-out line below: 

```{r}
fastaFTP <- c("ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.22.cdna.all.fa.gz",
              "ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.22.ncrna.fa.gz",
              "extra_transcript.fa.gz")
#gtfFTP <- "ftp://path/to/custom/Drosophila_melanogaster.BDGP6.22.98.plus.gtf.gz"
```

Instead of the above commented-out FTP location for the GTF file, we
specify a location within an R package. This step is just to avoid
downloading from a remote FTP during vignette building. This use of
`file.path` to point to a file in an R package is specific to this
vignette and should not be used in a typical workflow. The following
GTF file is a modified version of the release 98 from Ensembl, which
includes description of a one transcript, one exon artificial gene
which was inserted into the transcriptome (for demonstration purposes
only). 

```{r}
gtfPath <- file.path(dir,"Drosophila_melanogaster.BDGP6.22.98.plus.gtf.gz")
```

Finally, we create a *linkedTxome*.  In this vignette, we point to a
temporary directory for the JSON file, but a more typical workflow
would write the JSON file to the same location as the *Salmon* index
by not specifying `jsonFile`.

`makeLinkedTxome` performs two operation: (1) it creates a new entry in
an internal table that links the transcriptome used in the *Salmon*
index to its sources, and (2) it creates a JSON file such that this
*linkedTxome* can be shared.

```{r}
tmp <- tempdir()
jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
makeLinkedTxome(indexDir=indexDir,
                source="LocalEnsembl", organism="Drosophila melanogaster",
                release="98", genome="BDGP6.22",
                fasta=fastaFTP, gtf=gtfPath,
                jsonFile=jsonFile)
```

After running `makeLinkedTxome`, the connection between this *Salmon*
index (and its digest) with the sources is saved for persistent
usage. Note that because we added a single transcript of 960bp to the
FASTA file used for quantification, `tximeta` could tell that this was
not quantified against release 98 of the Ensembl transcripts for
*Drosophila melanogaster*. Only when the correct set of transcripts
were specified does `tximeta` recognize and import the correct
metadata.

With use of `tximeta` and a *linkedTxome*, the software
figures out if the remote GTF has been accessed and compiled into a
*TxDb* before, and on future calls, it will simply load the
pre-computed metadata and transcript ranges.

Note the warning that 5 of the transcripts are missing from the GTF
file and so are dropped from the final output. This is a problem
coming from the annotation source, and not easily avoided by
`tximeta`. 

```{r}
se <- tximeta(coldata)
```

We can see that the appropriate metadata and transcript ranges are
attached.

```{r}
rowRanges(se)
seqinfo(se)
```

# Clear *linkedTxomes*

The following code removes the entire table with information about the
*linkedTxomes*. This is just for demonstration, so that we can show
how to load a JSON file below.

**Note:** Running this code will clear any information about
*linkedTxomes*. Don't run this unless you really want to clear this
table!

```{r}
library(BiocFileCache)
if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)
bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)
```

# Loading *linkedTxome* JSON files

If a collaborator or the Suppmentary Files for a publication shares a
`linkedTxome` JSON file, we can likewise use `tximeta` to
automatically assemble the relevant metadata and transcript
ranges. This implies that the other person has used `tximeta` with the
function `makeLinkedTxome` demonstrated above, pointing to their
*Salmon* index and to the FASTA and GTF source(s).

We point to the JSON file and use `loadLinkedTxome` and then the
relevant metadata is saved for persistent usage. In this case, we
saved the JSON file in a temporary directory.

```{r}
jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
loadLinkedTxome(jsonFile)
```

Again, using `tximeta` figures out whether it needs to access the
remote GTF or not, and assembles the appropriate object on the user's
behalf.

```{r}
se <- tximeta(coldata)
```

# Clear *linkedTxomes* again

Finally, we clear the *linkedTxomes* table again so that the above
examples will work. This is just for the vignette code and not part of
a typical workflow.

**Note:** Running this code will clear any information about
*linkedTxomes*. Don't run this unless you really want to clear this
table!

```{r}
if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)
bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)
```

# Other quantifiers

`tximeta` can import the output from any quantifiers that are
supported by `tximport`, and if these are not *Salmon*, *alevin*, or
*Sailfish* output, it will simply return a non-ranged
*SummarizedExperiment* by default.

An alternative solution is to wrap other quantifiers in workflows
that include metadata information JSON files along with each
quantification file. One can place these files in
`aux_info/meta_info.json` or any relative location specified by
`customMetaInfo`, for example `customMetaInfo="meta_info.json"`. This
JSON file is located relative to the quantification file and should
contain a tag `index_seq_hash` with an associated value of the SHA-256
hash of the reference transcripts. For computing the hash value of the
reference transcripts, see the
[FastaDigest](https://github.com/COMBINE-lab/FastaDigest) python
package. The hash value used by *Salmon* is the SHA-256 hash value
of the reference sequences stripped of the header lines, and
concatenated together with the empty string (so only cDNA sequences
combined without any new line characters). *FastaDigest* can be
installed with `pip install fasta_digest`.

# Automated analysis with ARMOR

This vignette described the use of `tximeta` to import quantification
data into R/Bioconductor with automatic detection and addition of
metadata. The *SummarizedExperiment* produced by `tximeta` can then be
provided to downstream statistical analysis packages as described
above. The *tximeta* package does not contain any functionality for
automated differential analysis. 

The [ARMOR](https://github.com/csoneson/ARMOR) workflow does automate 
a variety of differential analyses, and make use of `tximeta` for
creation of a *SummarizedExperiment* with attached annotation
metadata. ARMOR stands for 
``An Automated Reproducible MOdular Workflow for Preprocessing
and Differential Analysis of RNA-seq Data'' and is described in more
detail in the article by @Orjuelag2019.

# Default BiocFileCahce Caching Location Update

*tximeta* makes use of the default BiocFileCache location, unless
otherwise specified by the user. As of BiocFileCache version > 1.15.1,
the default caching location used by BiocFileCache has changed. In
order to continue to use the same cache (without re-downloading
files), please follow the steps in the BiocFileCache vignette, under
the heading **Default Caching Location Update**.

# Acknowledgments

The development of *tximeta* has benefited from suggestions from these
and other individuals in the community:

* Vincent Carey
* Lori Shepherd
* Martin Morgan
* Koen Van den Berge
* Johannes Rainer
* James Ashmore
* Ben Johnson
* Tim Triche
* Kristoffer Vitting-Seerup

# Session info

```{r}
library(devtools)
session_info()
```

# References