File: samtools-ampliconstats.1

package info (click to toggle)
samtools 1.21-1
  • links: PTS, VCS
  • area: main
  • in suites: trixie
  • size: 21,804 kB
  • sloc: ansic: 33,931; perl: 7,702; sh: 452; makefile: 298; java: 158
file content (428 lines) | stat: -rw-r--r-- 15,233 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
'\" t
.TH samtools-ampliconstats 1 "12 September 2024" "samtools-1.21" "Bioinformatics tools"
.SH NAME
samtools ampliconstats \- produces statistics from amplicon sequencing alignment file
.\"
.\" Copyright (C) 2020-2021 Genome Research Ltd.
.\"
.\" Author: James Bonfield <jkb@sanger.ac.uk>
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.
.\" For code blocks and examples (cf groff's Ultrix-specific man macros)
.de EX

.  in +\\$1
.  nf
.  ft CR
..
.de EE
.  ft
.  fi
.  in

..
.
.SH SYNOPSIS
.PP
samtools ampliconstats
.RI [ options ]
.IR primers.bed
.IR in.sam | in.bam | in.cram ...

.SH DESCRIPTION
.PP
samtools ampliconstats collects statistics from one or more input
alignment files and produces tables in text format.  The output can be
visualized graphically using plot-ampliconstats.

The alignment files should have previously been clipped of primer
sequence, for example by "samtools ampliconclip" and the sites of
these primers should be specified as a bed file in the arguments.
Each amplicon must be present in the bed file with one or more LEFT
primers (direction "+") followed by one or more RIGHT primers.  For
example:

.EX 2
MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -
.EE

Ampliconstats will identify which read belongs to which amplicon.  For
purposes of computing coverage statistics for amplicons with multiple
primer choices, only the innermost primer locations are used.

A summary of output sections is listed below, followed by more
detailed descriptions.

.PD 0
.TP 12
.B SS
Amplicon and file counts.  Always comes first
.TP
.B AMPLICON
Amplicon primer locations
.TP
.B FSS
File specific: summary stats
.TP
.B FRPERC
File specific: read percentage distribution between amplicons
.TP
.B FDEPTH
File specific: average read depth per amplicon
.TP
.B FVDEPTH
File specific: average read depth per amplicon, full length only
.TP
.B FREADS
File specific: numbers of reads per amplicon
.TP
.B FPCOV
File specific: percent coverage per amplicon
.TP
.B FTCOORD
File specific: template start,end coordinate frequencies per amplicon
.TP
.B FAMP
File specific: amplicon correct / double / treble length counts
.TP
.B FDP_ALL
File specific: template depth per reference base, all templates
.TP
.B FDP_VALID
File specific: template depth per reference base, valid templates only
.TP
.B CSS
Combined  summary stats
.TP
.B CRPERC
Combined: read percentage distribution between amplicons
.TP
.B CDEPTH
Combined: average read depth per amplicon
.TP
.B CVDEPTH
Combined: average read depth per amplicon, full length only
.TP
.B CREADS
Combined: numbers of reads per amplicon
.TP
.B CPCOV
Combined: percent coverage per amplicon
.TP
.B CTCOORD
Combined: template coordinates per amplicon
.TP
.B CAMP
Combined: amplicon correct / double / treble length counts
.TP
.B CDP_ALL
Combined: template depth per reference base, all templates
.TP
.B CDP_VALID
Combined: template depth per reference base, valid templates only
.PD
.PP
File specific sections start with both the section key and the
filename basename (minus directory and .sam, .bam or .cram suffix).

Note that the file specific sections are interleaved, ordered first by
file and secondly by the file specific stats.  To collate them
together, use "grep" to pull out all data of a specific type.

The combined sections (C*) follow the same format as the file specific
sections, with a different key.  For simplicity of parsing they also
have a filename column which is filled out with "COMBINED".  These
rows contain stats aggregated across all input files.

.SH SS / AMPLICON

This section is once per file and includes summary information to be
utilised for scaling of plots, for example the total number of
amplicons and files present, tool version number, and command line
arguments.  The second column is the filename or "COMBINED".  This is
followed by the reference name (unless single-ref mode is enabled),
and the summary statistic name and value.

The AMPLICON section is a reformatting of the input BED file.  Each
line consists of the reference name (unless single-ref mode is
enable), the amplicon number and the \fIstart\fR-\fIend\fR coordinates
of the left and right primers.  Where multiple primers are available
these are comma separated, for example \fB10-30,15-40\fR in the left
primer column indicates two primers have been multiplex together
covering genome coordinates 10-30 inclusive and 14-40 inclusively.


.SH CSS SECTION

This section consists of summary counts for the entire set of input
files.   These may be useful for automatic scaling of plots.

.TS
lb l .
Number of amplicons	Total number of amplicons listed in primer.bed
Number of files	Total number of SAM, BAM or CRAM files
End of summary	Always the last item.  Marker for end of CSS block.
.TE


.SH FSS SECTION

This lists summary statistics specific to an individual input file.
The values reported are:

.TS
lb l .
raw total sequences	Total number of sequences found in the file
filtered sequences	Number of sequences filtered with -F option
failed primer match	Number of sequences that did not correspond to
	a known primer location
matching sequences	Number of sequences allocated to an amplicon
.TE

.SH FREADS / CREADS SECTION

For each amplicon, this simply reports the count of reads that have
been assigned to it.  A read is assigned to an amplicon if the start
and/or end of the read is within a specified number of bases of the
primer sites listed in the bed file.  This distance is controlled via
the -m option.

.SH FRPERC / CRPERC SECTION

For each amplicon, this lists what percentage of reads were assigned
to this amplicon out of the total number of assigned reads.  This may
be used to diagnose how uniform this distribution is.

Note this is a pure read count and has no relation to amplicon size.

.SH FDEPTH / CDEPTH / FVDEPTH / CVDEPTH SECTION

Using the reads assigned to each amplicon and their start / end
locations on that reference, computed using the POS and CIGAR fields,
we compute the total number of bases aligned to this amplicon and
corresponding the average depth.  The VDEPTH variants are filtered to
only include templates with end-to-end coverage across the amplicon.
These can be considered to be "valid" or "usable" templates and give
an indication of the minimum depth for the amplicon rather than the
average depth.

To compute the depth the length of the amplicon is computed using the
innermost set of primers, if multiple choices are listed in the bed
file.

.SH FPCOV / CPCOV SECTION

Similar to the FDEPTH section, this is a binary status of covered or
not covered per position in each amplicon.  This is then expressed as
a percentage by dividing by the amplicon length, which is computed
using the innermost set of primers covering this amplicon.

The minimum depth necessary to constitute a position as being
"covered" is specifiable using the -d option.


.SH FTCOORD / CTCOORD / FAMP / CAMP SECTION

It is possible for an amplicon to be produced using incorrect primers,
giving rise to extra-long amplicons (typically double or treble
length).

The FTCOORD field holds a distribution of observed template
coordinates from the input data.  Each row consists of the file name,
the amplicon number in question, and tab separated tuples of start,
end, frequency and status (0 for OK, 1 for skipping amplicon, 2 for
unknown location).  Each template is only counted for one amplicon, so
if the read-pairs span amplicons the count will show up in the
left-most amplicon covered.

Th COORD data may indicate which primers are being utilised if there
are alternates available for a given amplicon.

For COORD lines amplicon number 0 holds the frequency data for data
that reads that have not been assigned to any amplicon.  That is, they
may lie within an amplicon, but they do not start or end at a known
primer location.  It is not recorded for BED files containing multiple
references.

The FAMP / CAMP section is a simple count per amplicon of the number
of templates coming from this amplicon.  Templates are counted once
per amplicon, but and like the FTCOORD field if a read-pair spans
amplicons it is only counted in the left-most amplicon.  Each line
consists of the file name, amplicon number and 3 counts for the number
of templates with both ends within this amplicon, the number of
templates with the rightmost end in another amplicon, and the number
of templates where the other end has failed to be assigned to an
amplicon.

Note FAMP / CAMP amplicon number 0 is the summation of data for all
amplicons (1 onwards).

.SH FDP_ALL / CDP_ALL / FDP_VALID / CDP_VALID section

These are for depth plots per base rather than per amplicon.  They
distinguish between all reads in all templates, and only reads in
templates considered to be "valid".  Such templates have both reads
(if paired) matching known primer locations from he same amplicon and
have full length coverage across the entire amplicon.

This FDP_VALID can be considered to be the minimum template depth
across the amplicon.

The difference between the VALID and ALL plots represents additional
data that for some reason may not be suitable for producing a
consensus.  For example an amplicon that skips a primer, pairing
10_LEFT with 12_RIGHT, will have coverage for the first half of
amplicon 10 and the last half of amplicon 12.  Counting the number of
reads or bases alone in the amplicon does not reveal the potential for
non-uniformity of coverage.

The lines start with the type keyword, file / sample name, reference
name (unless single-ref mode is enabled), followed by a variable
number of tab separated tuples consisting of \fIdepth,length\fR.  The
length field is a basic form of run-length encoding where all depth
values within a specified fraction of each other (e.g. >=
(1-fract)*midpoint and <= (1+fract)*midpoint) are combined into a
single run.  This fraction is controlled via the \fB-D\fR option.

.SH OPTIONS
.TP 8
.BI "-f, --required-flag " INT|STR
Only output alignments with all bits set in
.I INT
present in the FLAG field.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0],
or in string form by specifying a comma-separated list of keywords as
listed by the "samtools flags" subcommand.

.TP
.BI "-F, --filter-flag " INT|STR
Do not output alignments with any bits set in
.I INT
present in the FLAG field.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0],
or in string form by specifying a comma-separated list of keywords as
listed by the "samtools flags" subcommand.

.TP
.BI "-a, --max-amplicons " INT
Specify the maximum number of amplicons permitted.

.TP
.BI "-b, --tcoord-bin " INT
Bin the template start,end positions into multiples of \fINT\fR prior
to counting their frequency and reporting in the FTCOORD / CTCOORD
lines.  This may be useful for technologies with higher errors rates
where the alignment ends will vary slightly.
Defaults to 1, which is equivalent to no binning.

.TP
.BI "-c, --tcoord-min-count " INT
In the FTCOORD and CTCOORD lines, only record template start,end
coordinate combination if they occur at least \fIINT\fR times.

.TP
.BI "-d, --min-depth " INT
Specifies the minimum base depth to consider a reference position to
be covered, for purposes of the FRPERC and CRPERC sections.

.TP
.BI "-D, --depth-bin " FRACTION
Controls the merging of neighbouring similar depths for the FDP_ALL
and FDP_VALID plots.  The default FRACTION is 0.01, meaning depths
within +/- 1% of a mid point will be aggregated together as a run of
the same value.  This merging is useful to reduce the file size.  Use
\fB-D 0\fR to record every depth.

.TP
.BI "-l, --max-amplicon-length " INT
Specifies the maximum length of any individual amplicon.

.TP
.BI "-m, --pos-margin " INT
Reads are compared against the primer start and end locations
specified in the BED file.  An aligned sequence should start precisely
at these locations, but sequencing errors may cause the primer
clipping to be a few bases out or for the alignment to add a few extra
bases of soft clip.  This option specifies the margin of error
permitted when matching a read to an amplicon number.

.TP
.B "-o " FILE
Output stats to FILE.  The default is to write to stdout.

.TP
.B "-s, --use-sample-name"
Instead of using the basename component of the input path names, use
the SM field from the first @RG header line.

.TP
.B "-S, --single-ref"
Force the output format to match the older single-reference style
used in Samtools 1.12 and earlier.  This removes the reference names
from the SS, AMPLICON, DP_ALL and DP_VALID sections.  It cannot be
enabled if the input BED file has more than one reference present.
Note that plot-ampliconstats can process both output styles.

.TP
.BI "-t, --tlen-adjust " INT
Adjust the TLEN field by +/- \fIINT\fR to compensate for primer clipping.
This defaults to zero, but if the primers have been clipped and the
TLEN field has not been updated using samtools fixmate then the
template length will be wrong by the sum of the forward and reverse
primer lengths.

This adjustment does not have to be precise as the --pos-margin field
permits some leeway.  Hence if required, it should be set to
approximately double the average primer length.

.TP
.BI "-@ " INT
Number of BAM/CRAM (de)compression threads to use in addition to main thread [0].

.SH EXAMPLE

To run ampliconstats on a directory full of CRAM files and then
produce a series of PNG images named "mydata*.png":

.EX 2
samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
plot-ampliconstats -size 1200,900 mydata astats
.EE

.SH AUTHOR
.PP
Written by James Bonfield from the Sanger Institute.

.SH SEE ALSO
.IR samtools (1),
.IR samtools-ampliconclip (1)
.IR samtools-stats (1),
.IR samtools-flags (1)
.PP
Samtools website: <http://www.htslib.org/>