Supporting Data
===============

Mash: fast genome and metagenome distance estimation using MinHash
------------------------------------------------------------------

`RefSeqSketches.msh.gz <http://gembox.cbcb.umd.edu/mash/RefSeqSketches.msh.gz>`_: Mash sketch database (k=16, s=400) for RefSeq release 70 (48MB)

`RefSeqSketchesDefaults.msh.gz <https://gembox.cbcb.umd.edu/mash/RefSeqSketchesDefaults.msh.gz>`_: Mash sketch database (k=21, s=1000) for RefSeq release 70 (255MB)

`Escherichia.tar.gz <http://gembox.cbcb.umd.edu/mash/Escherichia.tar.gz>`_: Names and accessions for 500 selected Escherichia genomes, pairwise ANI, and pairwise Jaccard indexes for various k-mer and sketch sizes (24MB)

`mash-1.0.tar.gz <http://gembox.cbcb.umd.edu/mash/mash-1.0.tar.gz>`_: Mash version 1.0 codebase (93KB)

`SRR2671867.BaAmes.poretools.fastq.gz <http://gembox.cbcb.umd.edu/mash/SRR2671867.BaAmes.poretools.fastq.gz>`_: Nanopore 1D + 2D sequences generated by poretools (157MB)

`SRR2671868.Bc10987.poretools.fastq.gz <http://gembox.cbcb.umd.edu/mash/SRR2671868.Bc10987.poretools.fastq.gz>`_: Nanopore 1D + 2D sequences generated by poretools (250MB)

Mash Screen: High-throughput sequence containment estimation for genome discovery
---------------------------------------------------------------------------------

Custom scripts and intermediate data:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`MashScreen_supp.tar.gz <https://obj.umiacs.umd.edu/mash/screen/MashScreen_supp.tar.gz>`_

Data files:
~~~~~~~~~~~

Mash Sketch databases for RefSeq release 88:
 * `RefSeq88n.msh.gz <https://obj.umiacs.umd.edu/mash/screen/RefSeq88n.msh.gz>`_: Genomes (k=21, s=1000), 1.2GB uncompressed
 * `RefSeq88p.msh.gz <https://obj.umiacs.umd.edu/mash/screen/RefSeq88p.msh.gz>`_: Proteomes (k=9, s=1000), 1.1GB uncompressed
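
Because the sketch databases are distributed gzip-compressed and exceed 1 GB once
expanded, it can help to decompress them in a streaming fashion. The following is a
minimal sketch, assuming the archive has already been downloaded; the function name
``gunzip_file`` and the chunk size are illustrative choices, not part of Mash.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a gzip file to dst_path in fixed-size chunks.

    Hypothetical helper: copies chunk by chunk so the whole database
    never has to be held in memory at once.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

For example, ``gunzip_file("RefSeq88n.msh.gz", "RefSeq88n.msh")`` would write the
expanded sketch file next to the downloaded archive.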

`art.fastq.gz <https://obj.umiacs.umd.edu/mash/screen/art.fastq.gz>`_: Simulated reads for Shakya experiment

Figure 5:
 * `fig5.html <https://obj.umiacs.umd.edu/mash/screen/fig5/fig5.html>`_: Interactive version
 * `fig5.tsv <https://obj.umiacs.umd.edu/mash/screen/fig5/fig5.tsv>`_: Source data

Screen of SRA metagenomes vs. RefSeq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 * `sra_meta_nucl_95idy.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_nucl_95idy.tsv.gz>`_ (2.3GB uncompressed)
 * `sra_meta_nucl_80idy_3x.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_nucl_80idy_3x.tsv.gz>`_ (6.7GB uncompressed)
 * `sra_meta_prot_95idy.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_prot_95idy.tsv.gz>`_ (2.1GB uncompressed)
 * `sra_meta_prot_80idy_3x.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_prot_80idy_3x.tsv.gz>`_ (8.3GB uncompressed)

These files have a line for each RefSeq genome listing all metagenomic SRA runs
(as of August 2018) with Mash Containment Scores above the specified threshold.
They are provided for two screen modes:

* ``nucl``: Genomic RefSeq sequences
* ``prot``: Proteomic RefSeq sequences (combined amino acid sequences per organism). **NOTE:** The protein tables above are not p-value filtered, so large runs (> ~50Gb) may have spurious hits. They also do not contain plasmids. Updates coming soon!

...and at two thresholds:

* ``95idy``: 95% Mash Containment Score, any coverage. Useful for finding runs containing a specific genome.
* ``80idy_3x``: 80% Mash Containment Score, at least 3x median k-mer multiplicity.
  Useful for finding related, but novel, sequences.
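
Each table file name combines one screen mode with one threshold. As an illustration
only, the mapping can be sketched as a small helper; ``table_url`` is a hypothetical
name, and the base URL is simply the common prefix of the four links listed above.

```python
# Common prefix of the four table links listed in this section.
BASE_URL = "https://obj.umiacs.umd.edu/mash/screen/tables/"
MODES = ("nucl", "prot")
THRESHOLDS = ("95idy", "80idy_3x")


def table_url(mode, threshold):
    """Return the download URL for a given screen mode and threshold.

    Hypothetical helper; it only reproduces the naming pattern of the
    published files and does not check that the file exists.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if threshold not in THRESHOLDS:
        raise ValueError(f"unknown threshold: {threshold!r}")
    return f"{BASE_URL}sra_meta_{mode}_{threshold}.tsv.gz"
```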

The files are tab separated, with each line beginning with a RefSeq assembly accession, followed by SRA accessions, for example:

::
  
  GCF_000001215.4	SRR3401361	SRR3540373
  GCF_000001405.36	SRR5127794	ERR1539652	SRR413753	ERR206081
  GCF_000001405.38	SRR5127794	ERR1539652	ERR1711677	SRR413753	ERR206081

We also provide simple scripts for searching these files: `search.tar <https://obj.umiacs.umd.edu/mash/screen/search.tar>`_
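
Independently of those scripts, the format is simple enough to read directly. The
following is a minimal sketch, not the contents of ``search.tar``; all function names
are hypothetical, and it assumes a table has been downloaded locally and that every
line follows the tab-separated layout shown above.

```python
import gzip


def open_table(path):
    """Open a downloaded .tsv.gz table as text, decompressing on the fly."""
    return gzip.open(path, "rt")


def parse_table(lines):
    """Yield (refseq_accession, [sra_runs]) for each non-empty line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0]:
            yield fields[0], fields[1:]


def runs_for_genome(lines, accession):
    """Return the SRA runs listed for one RefSeq assembly accession."""
    for acc, runs in parse_table(lines):
        if acc == accession:
            return runs
    return []


def genomes_for_run(lines, run):
    """Return every RefSeq accession whose line lists the given SRA run."""
    return [acc for acc, runs in parse_table(lines) if run in runs]
```

A lookup could then be written as
``with open_table("sra_meta_nucl_95idy.tsv.gz") as table: runs = runs_for_genome(table, "GCF_000001215.4")``.
Both lookups scan the file line by line, so the multi-gigabyte tables are never loaded
into memory at once.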

Public data sources
~~~~~~~~~~~~~~~~~~~

The BLAST ``nr`` database was downloaded from ``ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*``.

HMP data were downloaded from ``ftp://public-ftp.ihmpdcc.org/``: reads from the ``Illumina/`` directory
and coding sequences from the ``HMGI/`` directory. Within these directories, sample SRS015937 resides in
``tongue_dorsum/`` and SRS020263 in ``right_retroauricular_crease/``.

SRA runs were downloaded with the `SRA Toolkit <https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/>`_.

RefSeq genomes were downloaded from the ``genomes/refseq/`` directory of ``ftp.ncbi.nlm.nih.gov``.

Public data products
~~~~~~~~~~~~~~~~~~~~

Quebec Polyomavirus has been submitted to GenBank as BK010702.