---
title: 'sourmash v4: A multitool to quickly search, compare, and analyze genomic
  and metagenomic data sets'
tags:
  - FracMinHash
  - MinHash
  - k-mers
  - Python
  - Rust
authors:
  - name: Luiz Irber
    orcid: 0000-0003-4371-9659
    equal-contrib: true
    affiliation: 1
  - name: N. Tessa Pierce-Ward
    orcid: 0000-0002-2942-5331
    equal-contrib: true
    affiliation: 1
  - name: Mohamed Abuelanin
    orcid: 0000-0002-3419-4785
    affiliation: 1
  - name: Harriet Alexander
    orcid: 0000-0003-1308-8008
    affiliation: 2
  - name: Abhishek Anant
    orcid: 0000-0002-5751-2010
    affiliation: 9
  - name: Keya Barve
    orcid: 0000-0003-3241-2117
    affiliation: 1
  - name: Colton Baumler
    orcid: 0000-0002-5926-7792
    affiliation: 1
  - name: Olga Botvinnik
    orcid: 0000-0003-4412-7970
    affiliation: 3
  - name: Phillip Brooks
    orcid: 0000-0003-3987-244X
    affiliation: 1
  - name: Daniel Dsouza
    orcid: 0000-0001-7843-8596
    affiliation: 9
  - name: Laurent Gautier
    orcid: 0000-0003-0638-3391
    affiliation: 9
  - name: Mahmudur Rahman Hera
    orcid: 0000-0002-5992-9012
    affiliation: 4
  - name: Hannah Eve Houts
    orcid: 0000-0002-7954-4793
    affiliation: 1
  - name: Lisa K. Johnson
    orcid: 0000-0002-3600-7218
    affiliation: 1
  - name: Fabian Klötzl
    orcid: 0000-0002-6930-0592
    affiliation: 5
  - name: David Koslicki
    orcid: 0000-0002-0640-954X
    affiliation: 4
  - name: Marisa Lim
    orcid: 0000-0003-2097-8818
    affiliation: 1
  - name: Ricky Lim
    orcid: 0000-0003-1313-7076
    affiliation: 9
  - name: Bradley Nelson
    orcid: 0009-0001-1553-932X
    affiliation: 9
  - name: Ivan Ogasawara
    orcid: 0000-0001-5049-4289
    affiliation: 9
  - name: Taylor Reiter
    orcid: 0000-0002-7388-421X
    affiliation: 1
  - name: Camille Scott
    orcid: 0000-0001-8822-8779
    affiliation: 1
  - name: Andreas Sjödin
    orcid: 0000-0001-5350-4219
    affiliation: 6
  - name: Daniel Standage
    orcid: 0000-0003-0342-8531
    affiliation: 7
  - name: S. Joshua Swamidass
    orcid: 0000-0003-2191-0778
    affiliation: 8
  - name: Connor Tiffany
    orcid: 0000-0001-8188-7720
    affiliation: 9
  - name: Pranathi Vemuri
    orcid: 0000-0002-5748-9594
    affiliation: 3
  - name: Erik Young
    orcid: 0000-0002-9195-9801
    affiliation: 1
  - name: C. Titus Brown
    orcid: 0000-0001-6001-2677
    corresponding: true
    affiliation: 1
affiliations:
  - name: University of California Davis, Davis, CA, United States of America
    index: 1
  - name: Woods Hole Oceanographic Institution, Woods Hole, MA, United States of America
    index: 2
  - name: Chan-Zuckerberg Biohub, San Francisco, CA, United States of America
    index: 3
  - name: Pennsylvania State University, University Park, PA, United States of America
    index: 4
  - name: Max Planck Institute for Evolutionary Biology, Plön, Germany
    index: 5
  - name: Swedish Defence Research Agency (FOI), Stockholm, Sweden
    index: 6
  - name: National Bioforensic Analysis Center, Fort Detrick, MD, United States of America
    index: 7
  - name: Washington University in St Louis, St Louis, MO, United States of America
    index: 8
  - name: No affiliation
    index: 9
date: 31 Jan 2024
bibliography: paper.bib
---
# Summary

sourmash is a command-line tool and Python library for sketching collections
of DNA, RNA, and amino acid k-mers for biological sequence search, comparison,
and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and
accurate sequence comparisons between datasets of different sizes [@gather],
including taxonomic profiling [@portik2022evaluation], functional profiling
[@hera2023fast], and petabase-scale sequence search [@branchwater]. From
release 4.x, sourmash is built on top of Rust and provides an experimental
Rust interface.

FracMinHash sketching is a lossy compression approach that represents data
sets using a "fractional" sketch containing approximately $1/S$ of the
original k-mers, where $S$ is the scaling factor chosen when sketching. Like
other sequence sketching techniques (e.g. MinHash [@Ondov:2015]), FracMinHash
provides a lightweight way to store representations of large DNA or RNA
sequence collections for comparison and search. Sketches can be used to
identify samples, find similar samples, identify data sets with shared
sequences, and build phylogenetic trees. FracMinHash sketching supports
estimation of overlap, bidirectional containment, and Jaccard similarity
between data sets and is accurate even for data sets of very different sizes.
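
The idea behind fractional sketching can be illustrated with a minimal,
pure-Python sketch. This is not sourmash's implementation: the function names
(`hash_kmer`, `frac_minhash`, `jaccard`, `containment`), the use of BLAKE2b as
a stand-in 64-bit hash, and the small `ksize` and `scaled` defaults are all
assumptions made for illustration. Each canonical k-mer is hashed, and only
hashes below a threshold of roughly $1/S$ of the hash space are kept.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hash_kmer(kmer):
    """Hash the canonical form of a DNA k-mer to a 64-bit integer.

    BLAKE2b is an illustrative stand-in, not the hash sourmash uses.
    """
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    canonical = min(kmer, revcomp)  # same value for both strands
    digest = hashlib.blake2b(canonical.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, ksize=21, scaled=10):
    """Keep every k-mer hash that falls below MAX_HASH / scaled."""
    threshold = MAX_HASH // scaled
    kmers = (seq[i:i + ksize] for i in range(len(seq) - ksize + 1))
    return {h for h in map(hash_kmer, kmers) if h < threshold}


def jaccard(a, b):
    """Jaccard similarity of two sketches (sets of hashes)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of sketch a that is also present in sketch b."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical data: a random "genome" and a fragment taken from it.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(20_000))
fragment = genome[:5_000]

genome_sketch = frac_minhash(genome)
fragment_sketch = frac_minhash(fragment)

print("sketch sizes:", len(genome_sketch), len(fragment_sketch))
print("Jaccard:", jaccard(fragment_sketch, genome_sketch))
print("containment:", containment(fragment_sketch, genome_sketch))
```

Because the same threshold is applied to every data set, the fragment's sketch
is a subset of the genome's sketch, so the containment of the fragment in the
genome stays at 1.0 while the Jaccard similarity drops with the size
difference. This is the property that makes comparisons between data sets of
very different sizes meaningful.
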

Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded
to support new database types and many more command-line functions.
In particular, sourmash now has robust support for both Jaccard similarity
and containment calculations, which enables analysis and comparison of data
sets of different sizes, including large metagenomic samples. As of v4.4,
sourmash can convert these Jaccard and containment values to estimated
Average Nucleotide Identity (ANI) values, which can provide improved
biological context to sketch comparisons [@hera2023deriving].
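
As a rough illustration of how such a conversion can work, the sketch below
computes a point estimate only, under the assumption of a simple model in
which each base mutates independently, so that a k-mer survives unchanged with
probability $\mathrm{ANI}^k$. The function names are hypothetical, and the
Jaccard variant additionally assumes the two data sets are of similar size.
sourmash's own estimator may differ in detail.

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from k-mer containment: ANI ~ C ** (1 / k).

    Assumes independent point mutations (illustrative model only).
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)


def jaccard_to_ani(jaccard, ksize):
    """Point estimate of ANI from Jaccard similarity.

    Assumes both sets have the same size, in which case
    containment equals 2J / (1 + J).
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    return containment_to_ani(2.0 * jaccard / (1.0 + jaccard), ksize)


# Example: a containment value observed at k=31 (hypothetical number).
print(containment_to_ani(0.5, ksize=31))
```

Because the estimate is the $k$-th root of the containment, even moderate
containment values at large $k$ correspond to high nucleotide identity, which
is why ANI can be easier to interpret biologically than the raw value.
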

# Statement of Need

Large collections of genomes, transcriptomes, and raw sequencing data sets are
readily available in biology, and the field needs lightweight computational
methods for searching and summarizing the content of both public and private
collections. sourmash provides a flexible set of programmatic tools
for this purpose, together with a robust and well-tested command-line
interface. It has been used in over 350 publications (based on citations of
@Brown:2016 and @Pierce:2019) and it continues to expand in functionality.

# Acknowledgements

This work was funded in part by the Gordon and Betty Moore Foundation’s
Data-Driven Discovery Initiative [GBMF4551 to CTB]. It is also funded in
part by the National Science Foundation [#2018522 to CTB] and PIG-PARADIGM
(Preventing Infection in the Gut of developing Piglets – and thus Antimicrobial
Resistance – by disentAngling the interface of DIet, the host and the
Gastrointestinal Microbiome) from the Novo Nordisk Foundation to CTB.

Notice: This manuscript has been authored by BNBI under Contract
No. HSHQDC-15-C-00064 with the DHS. The US Government retains and the
publisher, by accepting the article for publication, acknowledges that the USG
retains a non-exclusive, paid-up, irrevocable, world-wide license to publish
or reproduce the published form of this manuscript, or allow others to do
so, for USG purposes. Views and conclusions contained herein are those of
the authors and should not be interpreted to represent policies, expressed
or implied, of the DHS.

# References