File: PbbarcodeFunctionalSpecification.rst

package info (click to toggle)
pbbarcode 0.8.0-4
  • links: PTS, VCS
  • area: main
  • in suites: stretch
  • size: 280 kB
  • ctags: 124
  • sloc: python: 798; makefile: 186; ansic: 48; sh: 16
file content (409 lines) | stat: -rw-r--r-- 19,596 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
=========
pbbarcode
=========

---------------------------------------------------------
annotate PacBio sequencing reads with barcode information
---------------------------------------------------------

:Date: December 2015
:Manual section: 1


DESCRIPTION
===========
The **pbbarcode** package provides
utilities for annotating individual ZMWs directly from a bas.h5 file,
emitting fast[a|q] files for each barcode, labeling alignments stored
in a cmp.h5 file, and calling consensus on small amplicons (requires
**pbdagcon**\ (1))

At the moment, Barcodes can be scored in two different ways:
``symmetric`` and ``paired``. Symmetric mode supports barcode designs
with two identical barcodes on both sides of a SMRTbell, e.g., for
barcodes (A, B), molecules are labeled as A--A or B--B. The ``paired``
mode supports designs with two distinct barcodes on each side of the
molecule, but neither barcode appears without its mate. The minimum
example is given with the following barcodes: (ALeft, ARight, BLeft,
BRight), where the following barcode sets are checked: ALeft--ARight,
BLeft--BRight.

It is important to highlight that a barcode FASTA file specifies a
list of available barcodes to evaluate. Depending on the scoring mode,
the barcodes are grouped together in different ways. For instance, in
the ``symmetric`` case, the number of possible barcode outcomes are
simply the number of barcodes that are supplied to the routine in the
FASTA file (see below for usage) plus an additional ``NULL`` barcode
indicating that no barcode could be evaluated (denoted by:
'--'). Labels like this (A--A) are used in the final outputs. In the
``paired`` mode, the number of possible barcode outcomes are half the
number of the sequences in the FASTA file plus the ``NULL``
barcode. The ``NULL`` barcode indicates that no attempt was made to
score the molecule or it was filtered out by the user's criteria. The
majority of cases when a molecule is not scored are related to not
observing any adapters. If a user has executed a "hot-start" run, the
user can try the '--scoreFirst' parameter to attempt to label the
first adapter's barcode. This increases the yield of the labeleing
procedure at the expense of some probably false positives. 

The software is implemented as a standard python package. Barcodes are
labeled according to the following high-level logic. For each
molecule, all adapters are found. For each adapter, we align (using
standard Smith-Watterman alignment) each barcode and its reverse
complement to flanking sequence of the adapter. If two complete
flanking sequences are available, we divide by 2, else 1 if only one
flanking sequence was available (average score at adapter). This
allows the scores across adapters to be on the same scale (chimera
detection). Depending on the ``mode``, we then determine which
barcode(s) are maximally scoring. We store the two maximally scoring
barcodes, the sum of their alignment scores across the adapters. The
average barcode score then can be given approximately by:
total-score/number-of-adapters. At the moment, the alignment
parameters are fixed at:


.. table:: SW Match Parameters
+----------+----------+
|type      |score     |
|          |          |
+----------+----------+
|insertion |-1        |
|          |          |
+----------+----------+
|deletion  |-1        |
|          |          |
+----------+----------+
|mismatch  |-2        |
|          |          |
+----------+----------+
|match     |2         |
|          |          |
+----------+----------+

Input and output
````````````````

labelZmws
---------
  usage: pbbarcode labelZmws [-h] [--outDir OUTDIR] [--outFofn OUTFOFN]
                                [--adapterSidePad ADAPTERSIDEPAD]
                                [--insertSidePad INSERTSIDEPAD]
                                [--scoreMode {symmetric,paired}]
                                [--maxAdapters MAXADAPTERS] [--scoreFirst]
                                [--startTimeCutoff STARTTIMECUTOFF]
                                [--nZmws NZMWS] [--nProcs NPROCS]
                                [--saveExtendedInfo]
                                barcode.fasta input.fofn

  Creates a barcode.h5 file from base h5 files.

  positional arguments:
    barcode.fasta         Input barcode fasta file
    input.fofn            Input base fofn

  optional arguments:
    -h, --help            show this help message and exit
    --outDir OUTDIR       Where to write the newly created barcode.h5 files.
                          (default: /home/UNIXHOME/jbullard/projects/software/bioinformatics/tools/pbbarcode/doc)
    --outFofn OUTFOFN     Write to outFofn (default: barcode.fofn)
    --adapterSidePad ADAPTERSIDEPAD
                          Pad with adapterSidePad bases (default: 4)
    --insertSidePad INSERTSIDEPAD
                          Pad with insertSidePad bases (default: 4)
    --scoreMode {symmetric,paired}
                          The mode in which the barcodes should be scored.
                          (default: symmetric)
    --maxAdapters MAXADAPTERS
                          Only score the first maxAdapters (default: 20)
    --scoreFirst          Whether to try to score the leftmost barcode in a
                          trace. (default: False)
    --startTimeCutoff STARTTIMECUTOFF
                          Reads must start before this value in order to be
                          included when scoreFirst is set. (default: 10.0)
    --nZmws NZMWS         Use the first n ZMWs for testing (default: -1)
    --nProcs NPROCS       How many processes to use (default: 8)
    --saveExtendedInfo    Whether to save extended information tothe barcode.h5
                          files; this information is useful for debugging and
                          chimera detection (default: False)

The ``labelZmws`` command takes an input.fofn representing a set of
bas.h5 files to operate on. Additionally, it takes a barcode.fasta
file. Depending on ``scoreMode``, the FASTA file will be processed in
different ways. Specifically, in ``paired`` mode, each two consecutive
barcodes in the file are considered a set.

The parameters, ``adapterSidePad`` and ``insertSidePad`` represents
how many bases should be considered on each side of the putative
barcode. These parameters are constrained such that:
``|adapterSidePad| + |insertSidePad| + |barcode| < 65``.

Users have the option to specify a different output location
for the various outputs. Specifically, for each bas.h5 file in
input.fofn, a bc.h5 (barcode hdf5) file is generated. These files are
listed in the file ``outFofn`` which is typically just called
``barcode.fofn``. See below for a description of the barcode hdf5
file.


labelAlignments
---------------
  usage: pbbarcode labelAlignments [-h]
                                      [--minAvgBarcodeScore MINAVGBARCODESCORE]
                                      [--minNumBarcodes MINNUMBARCODES]
                                      [--minScoreRatio MINSCORERATIO]
                                      barcode.fofn aligned_reads.cmp.h5

  Adds information about barcode alignments to a cmp.h5 file from a previous
  call to "labelZmws".

  positional arguments:
    barcode.fofn          input barcode fofn file
    aligned_reads.cmp.h5  cmp.h5 file to add barcode labels

  optional arguments:
    -h, --help            show this help message and exit
    --minAvgBarcodeScore MINAVGBARCODESCORE
                          ZMW Filter: exclude ZMW if average barcode score is
                          less than this value (default: 0.0)
    --minNumBarcodes MINNUMBARCODES
                          ZMW Filter: exclude ZMW if number of barcodes observed
                          is less than this value (default: 1)
    --minScoreRatio MINSCORERATIO
                          ZMW Filter: exclude ZMWs whose best score divided by
                          the 2nd best score is less than this ratio (default:
                          1.0)
                          

The ``labelAlignments`` command takes as input a barcode.fofn computed
from a call to ``labelZMWs`` and a cmp.h5 file where the barcode
information is written to. See below for a description of the cmp.h5
file additions.  



emitFastqs
----------
  usage: pbbarcode emitFastqs [-h] [--outDir output.dir] [--subreads]
                                 [--unlabeledZmws] [--trim TRIM] [--fasta]
                                 [--minMaxInsertLength MINMAXINSERTLENGTH]
                                 [--hqStartTime HQSTARTTIME]
                                 [--minReadScore MINREADSCORE]
                                 [--minAvgBarcodeScore MINAVGBARCODESCORE]
                                 [--minNumBarcodes MINNUMBARCODES]
                                 [--minScoreRatio MINSCORERATIO]
                                 input.fofn barcode.fofn

  Takes a bas.h5 fofn and a barcode.h5 fofn and produces a fast[a|q] file for
  each barcode.

  positional arguments:
    input.fofn            input base or CCS fofn file
    barcode.fofn          input barcode.h5 fofn file

  optional arguments:
    -h, --help            show this help message and exit
    --outDir output.dir   output directory to write fastq files (default: /home/
                          UNIXHOME/jbullard/projects/software/bioinformatics/too
                          ls/pbbarcode/doc)
    --subreads            whether to produce fastq files for the subreads;the
                          default is to use the CCS reads. This option
                          onlyapplies when input.fofn has both consensus and raw
                          reads,otherwise the read type from input.fofn will be
                          returned. (default: False)
    --unlabeledZmws       whether to emit a fastq file for the unlabeled ZMWs.
                          These are the ZMWs where no adapters are found
                          typically (default: False)
    --trim TRIM           trim off barcodes and any excess constant sequence
                          (default: 20)
    --fasta               whether the files produced should be FASTA files
                          asopposed to FASTQ (default: False)
    --minMaxInsertLength MINMAXINSERTLENGTH
                          ZMW Filter: exclude ZMW if the longest subreadis less
                          than this amount (default: 0)
    --hqStartTime HQSTARTTIME
                          ZMW Filter: exclude ZMW if start time of HQ
                          regiongreater than this value (seconds) (default: inf)
    --minReadScore MINREADSCORE
                          ZMW Filter: exclude ZMW if readScore is less thanthis
                          value (default: 0)
    --minAvgBarcodeScore MINAVGBARCODESCORE
                          ZMW Filter: exclude ZMW if average barcode score is
                          less than this value (default: 0.0)
    --minNumBarcodes MINNUMBARCODES
                          ZMW Filter: exclude ZMW if number of barcodes observed
                          is less than this value (default: 1)
    --minScoreRatio MINSCORERATIO
                          ZMW Filter: exclude ZMWs whose best score divided by
                          the 2nd best score is less than this ratio (default:
                          1.0)
                          

The ``emitFastqs`` command takes as input both an input.fofn for the
bas.h5 files as well as a barcode.fofn from a call to labelZmws. The
optional parameter ``outDir`` dictates where the files will be
written. For each detected barcode, a fast[a|q] file will be emitted
with all of the reads for that barcode. The ``trim`` parameter
dictates how much of the read should be trimmed off. The default
parameter for ``trim`` is the length of the barcode (which is stored
in the barcode hdf5 files). At the moment, all barcodes in the barcode
FASTA file must be the same length, therefore only a constant trim
value is supported. In practice, one can aggressively trim in order to
ensure that extra bases aren't left on the ends of reads. Finally, the
``subreads`` parameter dictates whether subreads or CCS reads should
be returned with the default being the appropriate reads according to
the input file type, either CCS or subreads. This parameter is only
inspected if the input.fofn contains both CCS and subread data, if the
input.fofn contains only subread or CCS data then that is returned
irrespective of the state of the the ``subreads`` parameter and a
warning is issued.

consensus
---------
  usage: pbbarcode consensus [-h] [--subsample SUBSAMPLE] [--nZmws NZMWS]
                                [--outDir OUTDIR] [--keepTmpDir]
                                [--ccsFofn CCSFOFN] [--nProcs NPROCS]
                                [--noQuiver]
                                [--minMaxInsertLength MINMAXINSERTLENGTH]
                                [--hqStartTime HQSTARTTIME]
                                [--minReadScore MINREADSCORE]
                                [--minAvgBarcodeScore MINAVGBARCODESCORE]
                                [--minNumBarcodes MINNUMBARCODES]
                                [--minScoreRatio MINSCORERATIO]
                                [--barcode BARCODE [BARCODE ...]]
                                input.fofn barcode.fofn

  Compute consensus sequences for each barcode.

  positional arguments:
    input.fofn            input bas.h5 fofn file
    barcode.fofn          input bc.h5 fofn file

  optional arguments:
    -h, --help            show this help message and exit
    --subsample SUBSAMPLE
                          Subsample ZMWs (default: 1)
    --nZmws NZMWS         Take n ZMWs (default: -1)
    --outDir OUTDIR       Use this directory to output results (default: .)
    --keepTmpDir
    --ccsFofn CCSFOFN     Obtain CCS data from ccsFofn instead of input.fofn
                          (default: )
    --nProcs NPROCS       Use nProcs to execute. (default: 16)
    --noQuiver
    --minMaxInsertLength MINMAXINSERTLENGTH
                          ZMW Filter: exclude ZMW if the longest subreadis less
                          than this amount (default: 0)
    --hqStartTime HQSTARTTIME
                          ZMW Filter: exclude ZMW if start time of HQ
                          regiongreater than this value (seconds) (default: inf)
    --minReadScore MINREADSCORE
                          ZMW Filter: exclude ZMW if readScore is less thanthis
                          value (default: 0)
    --minAvgBarcodeScore MINAVGBARCODESCORE
                          ZMW Filter: exclude ZMW if average barcode score is
                          less than this value (default: 0.0)
    --minNumBarcodes MINNUMBARCODES
                          ZMW Filter: exclude ZMW if number of barcodes observed
                          is less than this value (default: 1)
    --minScoreRatio MINSCORERATIO
                          ZMW Filter: exclude ZMWs whose best score divided by
                          the 2nd best score is less than this ratio (default:
                          1.0)
    --barcode BARCODE [BARCODE ...]
                          Use this to extract consensus for just one barcode.
                          (default: None)

The ``emitFastqs`` command takes as input both an input.fofn for the
bas.h5 files as well as a barcode.fofn from a call to labelZmws. The
results are a FASTA file with an entry for each barcode containing the
consensus amplicon sequence. This mode utilizes ``Quiver`` and
``pbdagcon`` to compute consensus.   

In cases where the amplicon is fewer than 2.5k bases, using CCS data
is quite helpful. The ``--ccsFofn`` allows one to pass directly the
ccs files. In many cases, both the CCS and raw basecalls are in the
same file so you can check by passing the same parameter to input.fofn
as to ccsFofn. 

Dependencies
````````````

The pbbarcode package depends on a standard pbcore installation
(https://github.com/PacificBiosciences/pbcore). If one wishes to use
the ``consensus`` tool, ``pbdagcon`` needs to be installed
(https://github.com/PacificBiosciences/pbdagcon).


Barcode HDF5 File
`````````````````

The barcode hdf5 file, ``bc.h5``, represents a simple data store for
barcode calls and their scores for each ZMW. Generally, a user need
not interact with barcode hdf5 files, but can use the results stored in
either the resulting cmp.h5 file or fast[a|q] files. The barcode hdf5
file contains the following structure:

/BarcodeCalls/best - (nZMWs, 6)[32-bit integer] dataset with the
following columns: 

    ``holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2,barcodeScore2``

Additionally, the ``best`` dataset has the following attributes:

+-----------+------------------------------------------------------------------------+
|movieName  |m120408_042614_richard_c100309392550000001523011508061222_s1_p0         |
|           |                                                                        |
+-----------+------------------------------------------------------------------------+
|columnNames|holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2,             |
|           |barcodeScore2                                                           |
+-----------+------------------------------------------------------------------------+
|scoreMode  |[symmetric|paired]                                                      |
|           |                                                                        |
+-----------+------------------------------------------------------------------------+
|barcodes   |'bc_1', 'bc_2', ...., 'bc_N'                                            |
|           |                                                                        |
+-----------+------------------------------------------------------------------------+

The two barcodeIdx1 and barcodeIdx2 columns are indices into
``barcodes`` attribute. The ``scoreMode`` is scoring mode used to
align the barcodes. The ``barcodes`` attribute correspond to the
barcode.fasta sequence names. 

Additionally, in some circumstances, it is useful to retain the entire
history of the scoring, i.e., each barcode scored to each adapter
across all ZMWs. In oder to retain this information, one must call:

    ``pbbarcode labelZmws --saveExtendedInfo ...``

In this mode, the resultant HDF5 file will have an additional dataset
under the BarcodeCalls group, named: ``all``. This dataset has the
following format:

/BarcodeCalls/all - (nbarcodes * nadapters[zmw_i], 4) \forall i in 1 ... nZMWs 

    ```holeNumber, adapterIdx, barcodeIdx, score```

The ``adapterIdx`` is the index of the adapter along the molecule,
i.e., adapterIdx 1 is the first adapter scored.

Additions to the compare HDF5 (cmp.h5) File
```````````````````````````````````````````

In addition to the barcode hdf5 file, a call to ``labelAlignments``
will annotate a cmp.h5 file. This annotation is stored in ways
consistent with the cmp.h5 file format. Specifically, a new group: 

| /BarcodeInfo/
|   ID   (nBarcodeLabels + 1, 1)[32-bit integer] 
|   Name (nBarcodeLabels + 1, 1)[variable length string]

In addition to the /BarcodeInfo/ group, the key dataset which assigns
alignments to barcodes is located at:

/AlnInfo/Barcode (nAlignments, 3)[32-bit integer] with the following
columns:

     ``index,count,bestIndex,bestScore,secondBestIndex,secondBestScore``

Here index refers to the index into the ``Name`` vector, score
corresponds to the sum of the scores for the barcodes, and finally,
count refers to the number of adapters found in the molecule.