File: tutorials-lca.md

package info (click to toggle)
sourmash 4.9.4-4
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 54,688 kB
  • sloc: python: 59,380; ansic: 332; makefile: 277; sh: 6
file content (203 lines) | stat: -rw-r--r-- 6,627 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
# Using sourmash LCA to do taxonomic classification

The `sourmash lca` sub-commands do k-mer classification using an
"lowest common ancestor" approach.  See "Some discussion" below for
links and details.

This tutorial should run without modification on Linux or Mac OS X,
under [Miniconda](https://docs.conda.io/en/latest/miniconda.html).

You'll need about 5 GB of free disk space to download the database,
and about 5 GB of RAM to search it.  The tutorial should take about
20 minutes total to run.

Note, we have successfully tested it on
[binder.pangeo.io](https://binder.pangeo.io/v2/gh/binder-examples/r-conda/master?urlpath=urlpath%3Drstudio)
if you want to give it a try!

## Install miniconda

If you don't have the `conda` command installed, you'll need to install
miniconda for Python 3.x.

On Linux, this should work:
```
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
echo export PATH="$HOME/miniconda3/bin:$PATH" >> ~/.bash_profile
source ~/.bash_profile
```
otherwise, follow
[the miniconda install](https://docs.conda.io/en/latest/miniconda.html).

## Enable [bioconda](https://bioconda.github.io/)

```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```

## Install sourmash

To install sourmash, create a new environment named `smash` and install sourmash:

```
conda create -y -n smash sourmash
```

and then activate:
```
conda activate smash
```

You should now be able to use the `sourmash` command:

```
sourmash info
```

## Download some files

Next, download a genbank LCA database for k=31:

```
curl -L -o genbank-k31.lca.json.gz https://osf.io/4f8n3/download
```

Download a random genome from genbank:
```
curl -L -o some-genome.fa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/178/875/GCF_000178875.2_ASM17887v2/GCF_000178875.2_ASM17887v2_genomic.fna.gz
```

Create a signature for this genome:
```
sourmash sketch dna -p scaled=1000,k=31 --name-from-first some-genome.fa.gz
```

Now, classify the signature with sourmash `lca classify`,
```
sourmash lca classify --db genbank-k31.lca.json.gz \
    --query some-genome.fa.gz.sig
```
and this will give you a taxonomic identification of your genome bin,
classified using all of the genbank microbial genomes:

```
loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
outputting classifications to stdout
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
... classifying NC_016901.1 Shewanella baltica OS678, complete genome (file 1 of"NC_016901.1 Shewanella baltica OS678, complete genome",found,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,Shewanellaceae,Shewanella,Shewanella baltica,
classified 1 signatures total
```

You can also summarize the taxonomic distribution of the content with
`lca summarize`:
```
sourmash lca summarize --db genbank-k31.lca.json.gz \
    --query some-genome.fa.gz.sig
```

which will show you:
```
loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
loaded 1 signatures from 1 files total.
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales
99.6%   529   Bacteria;Proteobacteria;Gammaproteobacteria
99.6%   529   Bacteria;Proteobacteria
99.6%   529   Bacteria
45.4%   241   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica
```

To apply this to your own genome(s), replace `some-genome.fa.gz` above
with your own filename(s).

You can also specify multiple databases and multiple query signatures
on the command line; separate them with `--db` or `--query`.

### Building your own LCA database

(This is an abbreviated version of [this blog post](http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-try-again.html), updated to use the `sourmash lca` commands.)

Download some pre-calculated signatures:

```
curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz
tar xzf delmont-subsample-sigs.tar.gz
```

Next, grab the associated taxonomy spreadsheet

```
curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv
```

Build a sourmash LCA database named `delmont.lca.json`:

```
sourmash lca index -f tara-delmont-SuppTable3.csv delmont.lca.json delmont-subsample-sigs/*.sig
```

### Using the LCA database to classify signatures

We can now use `delmont.lca.json` to classify signatures with k-mers
according to the database we just created.  (Note, the database is
completely self-contained at this point.)

Let's classify a single signature:
```
sourmash lca classify --db delmont.lca.json \
    --query delmont-subsample-sigs/TARA_RED_MAG_00003.fa.gz.sig
```

and you should see:

```
loaded 1 databases for LCA use.
ksize=31 scaled=10000
outputting classifications to stdout
ID,status,superkingdom,phylum,class,order,family,genus,species
TARA_RED_MAG_00003,found,Bacteria,Proteobacteria,Gammaproteobacteria,,,,
classified 1 signatures total
```

You can classify a bunch of signatures and also specify an output
location for the CSV:

```
sourmash lca classify --db delmont.lca.json \
    --query delmont-subsample-sigs/*.sig \
    -o out.csv
```

The `lca classify` command supports multiple databases as well as
multiple queries; e.g. `sourmash lca classify --db delmont.lca.json
other.lca.json` will classify based on the combination of taxonomies
in the two databases.

## Some discussion

Sourmash LCA is using k-mers to do taxonomic classification, using the
"lowest common ancestor" approach (pioneered by
[Kraken](http://ccb.jhu.edu/software/kraken/MANUAL.html), and described
[here](http://ivory.idyll.org/blog/2017-something-about-kmers.html)),
to identify each k-mer.  From this it can either find a consensus
taxonomy between all the k-mers (`sourmash classify`) or it can summarize
the mixture of k-mers present in one or more signatures (`sourmash summarize`).

The `sourmash lca index` command can be used to prepare custom taxonomy
databases; sourmash will happily ingest any taxonomy, whether or not
it matches NCBI. See
[the spreadsheet from Delmont et al., 2017](https://github.com/ctb/2017-sourmash-lca/blob/master/tara-delmont-SuppTable3.csv)
for an example format.

[Return to index][3]

[0]:http://ivory.idyll.org/blog/2016-sourmash-sbt-more.html
[1]:databases.md
[2]:https://www.ncbi.nlm.nih.gov/pubmed/233877
[3]:index.md