Supporting Data
===============
Mash: fast genome and metagenome distance estimation using MinHash
------------------------------------------------------------------
`RefSeqSketches.msh.gz <http://gembox.cbcb.umd.edu/mash/RefSeqSketches.msh.gz>`_: Mash sketch database (k=16, s=400) for RefSeq release 70 (48MB)
`RefSeqSketchesDefaults.msh.gz <https://gembox.cbcb.umd.edu/mash/RefSeqSketchesDefaults.msh.gz>`_: Mash sketch database (k=21, s=1000) for RefSeq release 70 (255MB)
`Escherichia.tar.gz <http://gembox.cbcb.umd.edu/mash/Escherichia.tar.gz>`_: Names and accessions for 500 selected Escherichia genomes, pairwise ANI, and pairwise Jaccard indexes for various k-mer and sketch sizes (24MB)
`mash-1.0.tar.gz <http://gembox.cbcb.umd.edu/mash/mash-1.0.tar.gz>`_: Mash version 1.0 codebase (93KB)
`SRR2671867.BaAmes.poretools.fastq.gz <http://gembox.cbcb.umd.edu/mash/SRR2671867.BaAmes.poretools.fastq.gz>`_: Nanopore 1D + 2D sequences generated by poretools (157MB)
`SRR2671868.Bc10987.poretools.fastq.gz <http://gembox.cbcb.umd.edu/mash/SRR2671868.Bc10987.poretools.fastq.gz>`_: Nanopore 1D + 2D sequences generated by poretools (250MB)
Mash Screen: High-throughput sequence containment estimation for genome discovery
---------------------------------------------------------------------------------
Custom scripts and intermediate data:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`MashScreen_supp.tar.gz <https://obj.umiacs.umd.edu/mash/screen/MashScreen_supp.tar.gz>`_
Data files:
~~~~~~~~~~~
Mash Sketch databases for RefSeq release 88:
* `RefSeq88n.msh.gz <https://obj.umiacs.umd.edu/mash/screen/RefSeq88n.msh.gz>`_: Genomes (k=21, s=1000), 1.2Gb uncompressed
* `RefSeq88p.msh.gz <https://obj.umiacs.umd.edu/mash/screen/RefSeq88p.msh.gz>`_: Proteomes (k=9, s=1000), 1.1Gb uncompressed
`art.fastq.gz <https://obj.umiacs.umd.edu/mash/screen/art.fastq.gz>`_: Simulated reads for Shakya experiment
Figure 5:
* `fig5.html <https://obj.umiacs.umd.edu/mash/screen/fig5/fig5.html>`_: Interactive version
* `fig5.tsv <https://obj.umiacs.umd.edu/mash/screen/fig5/fig5.tsv>`_: Source data
Screen of SRA metagenomes vs. RefSeq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* `sra_meta_nucl_95idy.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_nucl_95idy.tsv.gz>`_ (2.3Gb uncompressed)
* `sra_meta_nucl_80idy_3x.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_nucl_80idy_3x.tsv.gz>`_ (6.7Gb uncompressed)
* `sra_meta_prot_95idy.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_prot_95idy.tsv.gz>`_ (2.1Gb uncompressed)
* `sra_meta_prot_80idy_3x.tsv.gz <https://obj.umiacs.umd.edu/mash/screen/tables/sra_meta_prot_80idy_3x.tsv.gz>`_ (8.3Gb uncompressed)
These files have a line for each RefSeq genome listing all metagenomic SRA runs
(as of August 2018) with Mash Containment Scores above the specified threshold.
They are provided for two screen modes:
* ``nucl``: Genomic RefSeq sequences
* ``prot``: Proteomic RefSeq sequences (combined amino acid sequences per organism). **NOTE:** Protein tables above are not p-value filtered and thus large (> ~50Gb) runs may have spurious hits. They also do not contain plasmids. Updates coming soon!
...and at two thresholds:
* ``95idy``: 95% Mash Containment Score, any coverage. Useful for finding runs containing a specific genome.
* ``80idy_3x``: 80% Mash Containment Score, at least 3x median k-mer multiplicity. Useful for finding related, but novel, sequences.
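As an illustration of how the two thresholds differ, the following minimal Python sketch applies them to ``mash screen`` output. It assumes the tab-separated column order identity, shared-hashes, median-multiplicity, p-value, query-ID; the names ``ScreenHit``, ``parse_screen`` and ``passes`` are hypothetical, and treating the thresholds as inclusive (``>=``) is an assumption. It is not the pipeline used to build the tables above.

```python
from typing import Iterable, Iterator, NamedTuple


class ScreenHit(NamedTuple):
    """One line of `mash screen` output (assumed column order)."""
    identity: float
    shared_hashes: str
    median_multiplicity: int
    p_value: float
    query_id: str


def parse_screen(lines: Iterable[str]) -> Iterator[ScreenHit]:
    """Parse tab-separated screen lines, skipping blank or short ones."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        yield ScreenHit(
            identity=float(fields[0]),
            shared_hashes=fields[1],
            median_multiplicity=int(fields[2]),
            p_value=float(fields[3]),
            query_id=fields[4],
        )


def passes(hit: ScreenHit, mode: str) -> bool:
    """Apply one of the two thresholds (inclusive bounds are an assumption)."""
    if mode == "95idy":
        # 95% containment score, any coverage
        return hit.identity >= 0.95
    if mode == "80idy_3x":
        # 80% containment score and at least 3x median k-mer multiplicity
        return hit.identity >= 0.80 and hit.median_multiplicity >= 3
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, a hit with high identity but a median multiplicity of 1 passes ``95idy`` and fails ``80idy_3x``, while a more divergent hit covered at 3x or more does the reverse.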
The files are tab separated, with each line beginning with a RefSeq assembly accession, followed by SRA accessions, for example:
::

    GCF_000001215.4 SRR3401361 SRR3540373
    GCF_000001405.36 SRR5127794 ERR1539652 SRR413753 ERR206081
    GCF_000001405.38 SRR5127794 ERR1539652 ERR1711677 SRR413753 ERR206081

We also provide simple scripts for searching these files: `search.tar <https://obj.umiacs.umd.edu/mash/screen/search.tar>`_
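For readers who prefer to query the tables directly, here is a minimal Python sketch that is independent of the scripts in ``search.tar``. It assumes only the format described above: one tab-separated line per RefSeq assembly accession, followed by SRA run accessions. The function names ``open_text``, ``runs_for_genome`` and ``genomes_for_run`` are hypothetical. Files are streamed line by line, since the tables are several gigabytes uncompressed.

```python
import gzip
from typing import List


def open_text(path: str):
    """Open a plain or gzip-compressed table in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def runs_for_genome(path: str, accession: str) -> List[str]:
    """Return the SRA runs listed for one RefSeq assembly accession."""
    with open_text(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == accession:
                # Assumes each accession appears on a single line
                return fields[1:]
    return []


def genomes_for_run(path: str, run: str) -> List[str]:
    """Return every RefSeq accession whose line lists the given SRA run."""
    matches = []
    with open_text(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if run in fields[1:]:
                matches.append(fields[0])
    return matches
```

For example, ``runs_for_genome("sra_meta_nucl_95idy.tsv.gz", "GCF_000001215.4")`` would scan a downloaded table for that assembly, and ``genomes_for_run`` performs the reverse lookup for a single run.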
Public data sources
~~~~~~~~~~~~~~~~~~~
The BLAST ``nr`` database was downloaded from ``ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*``.
HMP data were downloaded from ``ftp://public-ftp.ihmpdcc.org/``: reads from the ``Ilumina/`` directory
and coding sequences from the ``HMGI/`` directory. Within these directories, sample SRS015937 resides in
``tongue_dorsum/`` and sample SRS020263 in ``right_retroauricular_crease/``.
SRA runs were downloaded with the `SRA Toolkit <https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/>`_.
RefSeq genomes were downloaded from the ``genomes/refseq/`` directory of ``ftp.ncbi.nlm.nih.gov``.
Public data products
~~~~~~~~~~~~~~~~~~~~
Quebec Polyomavirus has been submitted to GenBank under accession BK010702.