---
title: 'sourmash v4: A multitool to quickly search, compare, and analyze genomic
and metagenomic data sets'
tags:
  - FracMinHash
  - MinHash
  - k-mers
  - Python
  - Rust
authors:
 - name: Luiz Irber
   orcid: 0000-0003-4371-9659
   equal-contrib: true
   affiliation: 1
 - name: N. Tessa Pierce-Ward
   orcid: 0000-0002-2942-5331
   equal-contrib: true
   affiliation: 1
 - name: Mohamed Abuelanin
   orcid: 0000-0002-3419-4785
   affiliation: 1
 - name: Harriet Alexander
   orcid: 0000-0003-1308-8008
   affiliation: 2
 - name: Abhishek Anant
   orcid: 0000-0002-5751-2010
   affiliation: 9
 - name: Keya Barve
   orcid: 0000-0003-3241-2117
   affiliation: 1
 - name: Colton Baumler
   orcid: 0000-0002-5926-7792
   affiliation: 1
 - name: Olga Botvinnik
   orcid: 0000-0003-4412-7970
   affiliation: 3
 - name: Phillip Brooks
   orcid: 0000-0003-3987-244X
   affiliation: 1
 - name: Daniel Dsouza
   orcid: 0000-0001-7843-8596
   affiliation: 9
 - name: Laurent Gautier
   orcid: 0000-0003-0638-3391
   affiliation: 9
 - name: Mahmudur Rahman Hera
   orcid: 0000-0002-5992-9012
   affiliation: 4
 - name: Hannah Eve Houts
   orcid: 0000-0002-7954-4793
   affiliation: 1
 - name: Lisa K. Johnson
   orcid: 0000-0002-3600-7218
   affiliation: 1
 - name: Fabian Klötzl
   orcid: 0000-0002-6930-0592
   affiliation: 5
 - name: David Koslicki
   orcid: 0000-0002-0640-954X
   affiliation: 4
 - name: Marisa Lim
   orcid: 0000-0003-2097-8818
   affiliation: 1
 - name: Ricky Lim
   orcid: 0000-0003-1313-7076
   affiliation: 9
 - name: Bradley Nelson
   orcid: 0009-0001-1553-932X
   affiliation: 9
 - name: Ivan Ogasawara
   orcid: 0000-0001-5049-4289
   affiliation: 9
 - name: Taylor Reiter
   orcid: 0000-0002-7388-421X
   affiliation: 1
 - name: Camille Scott
   orcid: 0000-0001-8822-8779
   affiliation: 1
 - name: Andreas Sjödin
   orcid: 0000-0001-5350-4219
   affiliation: 6
 - name: Daniel Standage
   orcid: 0000-0003-0342-8531
   affiliation: 7
 - name: S. Joshua Swamidass
   orcid: 0000-0003-2191-0778
   affiliation: 8
 - name: Connor Tiffany
   orcid: 0000-0001-8188-7720
   affiliation: 9
 - name: Pranathi Vemuri
   orcid: 0000-0002-5748-9594
   affiliation: 3
 - name: Erik Young
   orcid: 0000-0002-9195-9801
   affiliation: 1
 - name: C. Titus Brown
   orcid: 0000-0001-6001-2677
   corresponding: true
   affiliation: 1
affiliations:
 - name: University of California Davis, Davis, CA, United States of America
   index: 1
 - name: Woods Hole Oceanographic Institution, Woods Hole, MA, United States of America
   index: 2
 - name: Chan-Zuckerberg Biohub, San Francisco, CA, United States of America
   index: 3
 - name: Pennsylvania State University, University Park, PA, United States of America
   index: 4
 - name: Max Planck Institute for Evolutionary Biology, Plön, Germany
   index: 5
 - name: Swedish Defence Research Agency (FOI), Stockholm, Sweden
   index: 6
 - name: National Bioforensic Analysis Center, Fort Detrick, MD, United States of America
   index: 7
 - name: Washington University in St Louis, St Louis, MO, United States of America
   index: 8
 - name: No affiliation
   index: 9

date: 31 Jan 2024
bibliography: paper.bib
---

# Summary

sourmash is a command line tool and Python library for sketching collections
of DNA, RNA, and amino acid k-mers for biological sequence search, comparison,
and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and
accurate sequence comparisons between datasets of different sizes [@gather],
including taxonomic profiling [@portik2022evaluation], functional profiling
[@hera2023fast], and petabase-scale sequence search [@branchwater]. From
release 4.x, sourmash is built on top of Rust and provides an experimental
Rust interface.

FracMinHash sketching is a lossy compression approach that represents data
sets using a "fractional" sketch containing approximately $1/S$ of the original
distinct k-mers, where $S$ is a user-chosen scaling factor. Like
other sequence sketching techniques (e.g., MinHash [@Ondov:2015]), FracMinHash
provides a lightweight way to store representations of large DNA or RNA
sequence collections for comparison and search. Sketches can be used to
identify samples, find similar samples, identify data sets with shared
sequences, and build phylogenetic trees. FracMinHash sketching supports
estimation of overlap, bidirectional containment, and Jaccard similarity
between data sets and is accurate even for data sets of very different sizes.
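
The idea can be illustrated with a minimal sketch in plain Python. This is not sourmash's implementation: the function names (`frac_minhash`, `hash_kmer`, `jaccard`, `containment`) are hypothetical, BLAKE2b from the standard library stands in for the hash function, and the canonicalization of DNA k-mers against their reverse complement is omitted. The sketch keeps only those k-mer hashes that fall below $2^{64}/S$, so on average a fraction $1/S$ of the distinct k-mers is retained.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative stand-in)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep the hashes below MAX_HASH / scaled, roughly 1/scaled of them."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity of two sketches."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of sketch a that is also present in sketch b."""
    return len(a & b) / len(a) if a else 0.0


if __name__ == "__main__":
    genome = "ACGTTGCATGTCGCATGATGCATGAGAGCTACGTTGCA" * 5
    fragment = genome[:60]
    big = frac_minhash(genome, k=11, scaled=2)
    small = frac_minhash(fragment, k=11, scaled=2)
    print("jaccard:", jaccard(small, big))
    print("containment of fragment in genome:", containment(small, big))
```

Because the same threshold is applied to every data set, two sketches built with the same `k` and `scaled` values can be compared directly, and the containment of a small data set in a much larger one remains meaningful even when the Jaccard similarity is low.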

Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded
to support new database types and many more command line functions.
In particular, sourmash now has robust support for both Jaccard similarity
and containment calculations, which enables analysis and comparison of data
sets of different sizes, including large metagenomic samples. As of v4.4,
sourmash can convert these measures to estimated Average Nucleotide Identity
(ANI) values, which can provide improved biological context to sketch comparisons
[@hera2023deriving].
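
As a rough illustration of how a containment value relates to ANI, the sketch below uses a simple point estimate, $\mathrm{ANI} \approx C^{1/k}$. It assumes a model in which each base mutates independently, so that a k-mer survives unchanged with probability $\mathrm{ANI}^k$. The function name `containment_to_ani` is hypothetical, and the estimator and confidence intervals actually used by sourmash are those described in the cited work, which may differ in detail.

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from a containment value in [0, 1].

    Assumes independent per-base mutations, so a k-mer is shared with
    probability ANI ** ksize; inverting gives containment ** (1 / ksize).
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


if __name__ == "__main__":
    for c in (0.1, 0.5, 0.9):
        print(f"containment={c:.1f}  estimated ANI (k=31): "
              f"{containment_to_ani(c, 31):.4f}")
```

Under this model even a modest containment corresponds to a high ANI at typical k-mer sizes, which is one reason ANI can be easier to interpret biologically than the raw containment value.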

# Statement of Need

Large collections of genomes, transcriptomes, and raw sequencing data sets are
readily available in biology, and the field needs lightweight computational
methods for searching and summarizing the content of both public and private
collections. sourmash provides a flexible set of programmatic tools
for this purpose, together with a robust and well-tested command-line
interface. It has been used in over 350 publications (based on citations of
@Brown:2016 and @Pierce:2019) and it continues to expand in functionality.

# Acknowledgements

This work was funded in part by the Gordon and Betty Moore Foundation’s
Data-Driven Discovery Initiative [GBMF4551 to CTB]. It is also funded in
part by the National Science Foundation [#2018522 to CTB] and PIG-PARADIGM
(Preventing Infection in the Gut of developing Piglets – and thus Antimicrobial
Resistance – by disentAngling the interface of DIet, the host and the
Gastrointestinal Microbiome) from the Novo Nordisk Foundation to CTB.

Notice: This manuscript has been authored by BNBI under Contract
No. HSHQDC-15-C-00064 with the DHS. The US Government retains and the
publisher, by accepting the article for publication, acknowledges that the USG
retains a non-exclusive, paid-up, irrevocable, world-wide license to publish
or reproduce the published form of this manuscript, or allow others to do
so, for USG purposes. Views and conclusions contained herein are those of
the authors and should not be interpreted to represent policies, expressed
or implied, of the DHS.

# References