File: tutorials-lca.md

package info (click to toggle)
sourmash 4.9.4-4
links: PTS, VCS
area: main
in suites: sid
size: 54,688 kB
sloc: python: 59,380; ansic: 332; makefile: 277; sh: 6
file content (203 lines) | stat: -rw-r--r-- 6,627 bytes
parent folder | download | duplicates (2)
# Using sourmash LCA to do taxonomic classification

The `sourmash lca` sub-commands do k-mer classification using an
"lowest common ancestor" approach.  See "Some discussion" below for
links and details.

This tutorial should run without modification on Linux or Mac OS X,
under [Miniconda](https://docs.conda.io/en/latest/miniconda.html).

You'll need about 5 GB of free disk space to download the database,
and about 5 GB of RAM to search it.  The tutorial should take about
20 minutes total to run.

Note, we have successfully tested it on
[binder.pangeo.io](https://binder.pangeo.io/v2/gh/binder-examples/r-conda/master?urlpath=urlpath%3Drstudio)
if you want to give it a try!

## Install miniconda

If you don't have the `conda` command installed, you'll need to install
miniconda for Python 3.x.

On Linux, this should work:
```
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
echo export PATH="$HOME/miniconda3/bin:$PATH" >> ~/.bash_profile
source ~/.bash_profile
```
otherwise, follow
[the miniconda install](https://docs.conda.io/en/latest/miniconda.html).

## Enable [bioconda](https://bioconda.github.io/)

```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```

## Install sourmash

To install sourmash, create a new environment named `smash` and install sourmash:

```
conda create -y -n smash sourmash
```

and then activate:
```
conda activate smash
```

You should now be able to use the `sourmash` command:

```
sourmash info
```

## Download some files

Next, download a genbank LCA database for k=31:

```
curl -L -o genbank-k31.lca.json.gz https://osf.io/4f8n3/download
```

Download a random genome from genbank:
```
curl -L -o some-genome.fa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/178/875/GCF_000178875.2_ASM17887v2/GCF_000178875.2_ASM17887v2_genomic.fna.gz
```

Create a signature for this genome:
```
sourmash sketch dna -p scaled=1000,k=31 --name-from-first some-genome.fa.gz
```

Now, classify the signature with sourmash `lca classify`,
```
sourmash lca classify --db genbank-k31.lca.json.gz \
    --query some-genome.fa.gz.sig
```
and this will give you a taxonomic identification of your genome bin,
classified using all of the genbank microbial genomes:

```
loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
outputting classifications to stdout
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
... classifying NC_016901.1 Shewanella baltica OS678, complete genome (file 1 of"NC_016901.1 Shewanella baltica OS678, complete genome",found,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,Shewanellaceae,Shewanella,Shewanella baltica,
classified 1 signatures total
```

You can also summarize the taxonomic distribution of the content with
`lca summarize`:
```
sourmash lca summarize --db genbank-k31.lca.json.gz \
    --query some-genome.fa.gz.sig
```

which will show you:
```
loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
loaded 1 signatures from 1 files total.
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales
99.6%   529   Bacteria;Proteobacteria;Gammaproteobacteria
99.6%   529   Bacteria;Proteobacteria
99.6%   529   Bacteria
45.4%   241   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica
```

To apply this to your own genome(s), replace `some-genome.fa.gz` above
with your own filename(s).

You can also specify multiple databases and multiple query signatures
on the command line; separate them with `--db` or `--query`.

### Building your own LCA database

(This is an abbreviated version of [this blog post](http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-try-again.html), updated to use the `sourmash lca` commands.)

Download some pre-calculated signatures:

```
curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz
tar xzf delmont-subsample-sigs.tar.gz
```

Next, grab the associated taxonomy spreadsheet

```
curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv
```

Build a sourmash LCA database named `delmont.lca.json`:

```
sourmash lca index -f tara-delmont-SuppTable3.csv delmont.lca.json delmont-subsample-sigs/*.sig
```

### Using the LCA database to classify signatures

We can now use `delmont.lca.json` to classify signatures with k-mers
according to the database we just created.  (Note, the database is
completely self-contained at this point.)

Let's classify a single signature:
```
sourmash lca classify --db delmont.lca.json \
    --query delmont-subsample-sigs/TARA_RED_MAG_00003.fa.gz.sig
```

and you should see:

```
loaded 1 databases for LCA use.
ksize=31 scaled=10000
outputting classifications to stdout
ID,status,superkingdom,phylum,class,order,family,genus,species
TARA_RED_MAG_00003,found,Bacteria,Proteobacteria,Gammaproteobacteria,,,,
classified 1 signatures total
```

You can classify a bunch of signatures and also specify an output
location for the CSV:

```
sourmash lca classify --db delmont.lca.json \
    --query delmont-subsample-sigs/*.sig \
    -o out.csv
```

The `lca classify` command supports multiple databases as well as
multiple queries; e.g. `sourmash lca classify --db delmont.lca.json
other.lca.json` will classify based on the combination of taxonomies
in the two databases.

## Some discussion

Sourmash LCA is using k-mers to do taxonomic classification, using the
"lowest common ancestor" approach (pioneered by
[Kraken](http://ccb.jhu.edu/software/kraken/MANUAL.html), and described
[here](http://ivory.idyll.org/blog/2017-something-about-kmers.html)),
to identify each k-mer.  From this it can either find a consensus
taxonomy between all the k-mers (`sourmash classify`) or it can summarize
the mixture of k-mers present in one or more signatures (`sourmash summarize`).

The `sourmash lca index` command can be used to prepare custom taxonomy
databases; sourmash will happily ingest any taxonomy, whether or not
it matches NCBI. See
[the spreadsheet from Delmont et al., 2017](https://github.com/ctb/2017-sourmash-lca/blob/master/tara-delmont-SuppTable3.csv)
for an example format.

[Return to index][3]

[0]:http://ivory.idyll.org/blog/2016-sourmash-sbt-more.html
[1]:databases.md
[2]:https://www.ncbi.nlm.nih.gov/pubmed/233877
[3]:index.md