pbcore.io.dataset
=============================
The Python DataSet XML API is designed to be a lightweight interface for
creating, opening, manipulating and writing DataSet XML files. It provides both
a native Python API and console entry points for use in manual dataset curation
or as a resource for P_Module developers.

The API and console entry points are designed around the set operations one
might perform on the various types of data held by a DataSet XML: merge,
split, write, etc. While various types of DataSets can be found in XML files,
the API (and, to an extent, the console entry point, dataset.py) has DataSet
as its base type, with various subtypes extending or replacing functionality
as needed.
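
As a quick orientation, the sketch below runs one open/merge/split/write
cycle through the Python API (file names are placeholders; the calls mirror
the examples later in this document)::

    from pbcore.io import SubreadSet

    # open two existing DataSet XML files:
    ss1 = SubreadSet('movie1.subreadset.xml')
    ss2 = SubreadSet('movie2.subreadset.xml')

    # merge two compatible DataSets with the + operator:
    merged = ss1 + ss2

    # split into roughly even chunks:
    chunks = merged.split(chunks=2)

    # write each chunk back out as a new DataSet XML:
    for i, chunk in enumerate(chunks):
        chunk.write('chunk{:d}.subreadset.xml'.format(i))
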
Console Entry Point Usage
=============================
The following entry points are available through the main script: dataset.py::

    usage: dataset.py [-h] [-v] [--debug]
                      {create,filter,merge,split,validate,loadstats,consolidate}
                      ...

    Run dataset.py by specifying a command.

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit
      --debug               Turn on debug level logging

    DataSet sub-commands:
      {create,filter,merge,split,validate,loadstats,consolidate}
                            Type {command} -h for a command's options

Create::

    usage: dataset.py create [-h] [--type DSTYPE] [--novalidate] [--relative]
                             outfile infile [infile ...]

    Create an XML file from a fofn or bam

    positional arguments:
      outfile        The XML to create
      infile         The fofn or BAM file(s) to make into an XML

    optional arguments:
      -h, --help     show this help message and exit
      --type DSTYPE  The type of XML to create
      --novalidate   Don't validate the resulting XML, don't touch paths
      --relative     Make the included paths relative instead of absolute (not
                     compatible with --novalidate)

Filter::

    usage: dataset.py filter [-h] infile outfile filters [filters ...]

    Add filters to an XML file. Suggested fields: ['bcf', 'bcq', 'bcr',
    'length', 'pos', 'qend', 'qname', 'qstart', 'readstart', 'rname', 'rq',
    'tend', 'tstart', 'zm']. More expensive fields: ['accuracy', 'bc',
    'movie', 'qs']

    positional arguments:
      infile      The xml file to filter
      outfile     The resulting xml file
      filters     The values and thresholds to filter (e.g. 'rq>0.85')

    optional arguments:
      -h, --help  show this help message and exit

Merge::

    usage: dataset.py merge [-h] outfile infiles [infiles ...]

    Combine XML (and BAM) files

    positional arguments:
      outfile     The resulting XML file
      infiles     The XML files to merge

    optional arguments:
      -h, --help  show this help message and exit

Validate::

    usage: dataset.py validate [-h] infile

    Validate ResourceId files (XML validation only available in testing)

    positional arguments:
      infile      The XML file to validate

    optional arguments:
      -h, --help  show this help message and exit

Load PipeStats::

    usage: dataset.py loadstats [-h] [--outfile OUTFILE] infile statsfile

    Load an sts.xml file into a DataSet XML file

    positional arguments:
      infile             The XML file to modify
      statsfile          The .sts.xml file to load

    optional arguments:
      -h, --help         show this help message and exit
      --outfile OUTFILE  The XML file to output

Split::

    usage: dataset.py split [-h] [--contigs] [--chunks CHUNKS] [--subdatasets]
                            [--outdir OUTDIR]
                            infile [outfiles ...]

    Split the dataset

    positional arguments:
      infile           The xml file to split
      outfiles         The resulting xml files

    optional arguments:
      -h, --help       show this help message and exit
      --contigs        Split on contigs
      --chunks CHUNKS  Split contigs into <chunks> total windows
      --subdatasets    Split on subdatasets
      --outdir OUTDIR  Specify an output directory

Consolidate::

    usage: dataset.py consolidate [-h] [--numFiles NUMFILES] [--noTmp]
                                  infile datafile xmlfile

    Consolidate the XML files

    positional arguments:
      infile               The XML file to consolidate
      datafile             The resulting data file
      xmlfile              The resulting XML file

    optional arguments:
      -h, --help           show this help message and exit
      --numFiles NUMFILES  The number of data files to produce (1)
      --noTmp              Don't copy to a tmp location to ensure local disk
                           use

Usage Examples
=============================
Filter Reads (CLI version)
--------------------------
In this scenario we have one or more bam files worth of subreads, aligned or
otherwise, that we want to filter and put in a single bam file. This is
possible using the CLI with the following steps, starting with a DataSet XML
file::

    # usage: dataset.py filter <in_fn.xml> <out_fn.xml> <filters>
    dataset.py filter in_fn.subreadset.xml filtered_fn.subreadset.xml 'rq>0.85'

    # usage: dataset.py consolidate <in_fn.xml> <out_data_fn.bam> <out_fn.xml>
    dataset.py consolidate filtered_fn.subreadset.xml consolidate.subreads.bam out_fn.subreadset.xml

The filtered DataSet and the consolidated DataSet should be read-for-read
equivalent when used with SMRT Analysis software.
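
To spot-check that equivalence through the API, one can compare filtered
record counts directly (a minimal sketch; the file names are the placeholders
from the example above)::

    from pbcore.io import SubreadSet

    filtered = SubreadSet('filtered_fn.subreadset.xml')
    consolidated = SubreadSet('out_fn.subreadset.xml')

    # the records generator honors filters, so both DataSets should
    # yield the same number of reads:
    n_filtered = sum(1 for _ in filtered.records)
    n_consolidated = sum(1 for _ in consolidated.records)
    assert n_filtered == n_consolidated
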
Filter Reads (API version)
--------------------------
The API version of filtering allows for more advanced filtering criteria::

    from pbcore.io import SubreadSet

    ss = SubreadSet('in_fn.subreadset.xml')
    ss.filters.addRequirement(rname=[('=', 'E.faecalis.2'),
                                     ('=', 'E.faecalis.2')],
                              tStart=[('<', '99'),
                                      ('<', '299')],
                              tEnd=[('>', '0'),
                                    ('>', '200')])

Produces the following conditions for a read to be considered passing::

    (rname = E.faecalis.2 AND tstart < 99 AND tend > 0)
    OR
    (rname = E.faecalis.2 AND tstart < 299 AND tend > 200)

You can add sets of filters by providing equal-length lists of requirements
for each filter.

Additional requirements added singly will be added to all filters::

    ss.filters.addRequirement(rq=[('>', '0.85')])

    (rname = E.faecalis.2 AND tstart < 99 AND tend > 0 AND rq > 0.85)
    OR
    (rname = E.faecalis.2 AND tstart < 299 AND tend > 200 AND rq > 0.85)

Additional requirements added with a plurality of options will duplicate the
previous requirements for each option::

    ss.filters.addRequirement(length=[('>', 500), ('>', 1000)])

    (rname = E.faecalis.2 AND tstart < 99 AND tend > 0 AND rq > 0.85 AND length > 500)
    OR
    (rname = E.faecalis.2 AND tstart < 299 AND tend > 200 AND rq > 0.85 AND length > 500)
    OR
    (rname = E.faecalis.2 AND tstart < 99 AND tend > 0 AND rq > 0.85 AND length > 1000)
    OR
    (rname = E.faecalis.2 AND tstart < 299 AND tend > 200 AND rq > 0.85 AND length > 1000)

Of course you can always wipe the filters and start over::

    ss.filters = None

Consolidation is more similar to the CLI version::

    ss.consolidate('cons.bam')
    ss.write('cons.xml')

Resequencing Pipeline (CLI version)
-----------------------------------
In this scenario, we have two movies worth of subreads in two SubreadSets that
we want to align to a reference, merge together, split into DataSet chunks by
contig, then send through quiver on a chunkwise basis (in parallel).

1. Align each movie to the reference, producing a dataset with one bam file
   for each execution::

       pbalign movie1.subreadset.xml referenceset.xml movie1.alignmentset.xml
       pbalign movie2.subreadset.xml referenceset.xml movie2.alignmentset.xml

2. Merge the files into a FOFN-like dataset (bams aren't touched)::

       # dataset.py merge <out_fn> <in_fn> [<in_fn> <in_fn> ...]
       dataset.py merge merged.alignmentset.xml movie1.alignmentset.xml movie2.alignmentset.xml

3. Split the dataset into chunks by contig (rname) (bams aren't touched). Note
   that supplying output files splits the dataset into that many output files
   (up to the number of contigs), with multiple contigs per file. Not
   supplying output files splits the dataset into one output file per contig,
   named automatically. Specifying a number of chunks instead will produce
   that many files, with contig or even sub-contig (reference window)
   splitting::

       dataset.py split --contigs --chunks 8 merged.alignmentset.xml

4. Quiver then consumes these chunks::

       variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk1consensus.fasta --algorithm quiver chunk1contigs.alignmentset.xml
       variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk2consensus.fasta --algorithm quiver chunk2contigs.alignmentset.xml

The chunking works by duplicating the original merged dataset (no bam
duplication) and adding filters to each duplicate such that only reads
belonging to the appropriate contigs are emitted. The contigs are distributed
amongst the output files in such a way that the total number of records per
chunk is about even.
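
The exact distribution strategy is internal to the library, but the effect
resembles a greedy largest-first bin packing of contigs by record count. A
minimal, self-contained sketch of that idea (the contig names and record
counts are made-up data, and this is an illustration, not the library's
algorithm)::

    import heapq

    # hypothetical (contig, record count) pairs:
    contig_counts = [('chr1', 900), ('chr2', 500), ('chr3', 450), ('chr4', 300)]
    n_chunks = 2

    # repeatedly assign the largest remaining contig to the lightest chunk:
    heap = [(0, i, []) for i in range(n_chunks)]  # (total, chunk id, contigs)
    heapq.heapify(heap)
    for name, count in sorted(contig_counts, key=lambda c: -c[1]):
        total, i, members = heapq.heappop(heap)
        members.append(name)
        heapq.heappush(heap, (total + count, i, members))

    for total, i, members in sorted(heap, key=lambda c: c[1]):
        print('chunk {}: {} records from {}'.format(i, total, members))
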
Tangential Information
~~~~~~~~~~~~~~~~~~~~~~
DataSet.refNames (which returns a list of reference names available in the
dataset) is also subject to the filtering imposed during the split. Therefore
you won't be running through superfluous (and apparently unsampled) contigs to
get the reads in this chunk. The DataSet.records generator is also subject to
filtering, but not as efficiently as readsInRange. If you do not have a
reference window, readsInReference() is also an option.
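
A short sketch of those accessors on one chunk (the file name is a
placeholder from the pipeline above)::

    from pbcore.io import AlignmentSet

    aln = AlignmentSet('chunk1contigs.alignmentset.xml')

    # refNames reflects the chunk's filters, so only this chunk's
    # contigs appear; readsInReference honors the filters as well:
    for rname in aln.refNames:
        n = sum(1 for _ in aln.readsInReference(rname))
        print('{}: {} reads'.format(rname, n))
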
As the bam files are never touched, each dataset contains all the information
necessary to access all reads for all contigs. Doing so on these filtered
datasets would require disabling the filters first::

    dset.disableFilters()

Or removing the specific filter giving you problems::

    dset.filters.removeRequirement('rname')

Resequencing Pipeline (API version)
-----------------------------------
In this scenario, we have two movies worth of subreads in two SubreadSets that
we want to align to a reference, merge together, split into DataSet chunks by
contig, then send through quiver on a chunkwise basis (in parallel). We want to
do this using the API, rather than the CLI.

1. Align each movie to the reference, producing a dataset with one bam file
   for each execution

   .. code-block:: python

      # CLI (or see pbalign API):
      # pbalign movie1.subreadset.xml referenceset.xml movie1.alignmentset.xml
      # pbalign movie2.subreadset.xml referenceset.xml movie2.alignmentset.xml

2. Merge the files into a FOFN-like dataset (bams aren't touched)

   .. code-block:: python

      from functools import reduce  # needed on Python 3
      from pbcore.io import AlignmentSet

      # API, filename_list is dummy data:
      filename_list = ['movie1.alignmentset.xml', 'movie2.alignmentset.xml']

      # open:
      dsets = [AlignmentSet(fn) for fn in filename_list]
      # merge with + operator:
      dset = reduce(lambda x, y: x + y, dsets)

      # OR:
      dset = AlignmentSet(*filename_list)

3. Split the dataset into chunks by contigs (or subcontig windows)

   .. code-block:: python

      # split:
      dsets = dset.split(contigs=True, chunks=8)

4. Quiver then consumes these chunks

   .. code-block:: python

      # write out if you need to (or pass directly to quiver API):
      outfilename_list = ['chunk1contigs.alignmentset.xml',
                          'chunk2contigs.alignmentset.xml']
      # write with the 'write' method:
      for ds, nm in zip(dsets, outfilename_list):
          ds.write(nm)

      # CLI (or see quiver API):
      # variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk1consensus.fasta --algorithm quiver chunk1contigs.alignmentset.xml
      # variantCaller.py --alignmentSetRefWindows --referenceFileName referenceset.xml --outputFilename chunk2consensus.fasta --algorithm quiver chunk2contigs.alignmentset.xml

      # Inside quiver (still using the Python dataset API):
      import itertools

      aln = AlignmentSet(fname)
      # get this set's windows:
      refWindows = aln.refWindows
      # gather the reads for these windows using readsInRange;
      # chain.from_iterable flattens the per-window generators:
      reads = list(itertools.chain.from_iterable(
          aln.readsInRange(rId, start, end)
          for rId, start, end in refWindows))

The chunking works by duplicating the original merged dataset (no bam
duplication) and adding filters to each duplicate such that only reads
belonging to the appropriate contigs/windows are emitted. The contigs are
distributed amongst the output files in such a way that the total number of
records per chunk is about even.

API overview
=============================

DataSets can be created using the appropriate constructor (SubreadSet), or
with the common constructor (DataSet) and later cast to a specific type
(copy(asType="SubreadSet")). The DataSet constructor acts as a factory
function (an artifact of early API design). The factory behavior is defined
in the DataSet metaclass.
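
A short sketch of both construction routes (the file name is a placeholder;
note that, because of the factory behavior, the common constructor may
already return the specific subtype)::

    from pbcore.io import DataSet, SubreadSet

    # direct construction with the specific type:
    ss = SubreadSet('movie1.subreadset.xml')

    # or via the common constructor, casting afterwards:
    ds = DataSet('movie1.subreadset.xml')
    ss2 = ds.copy(asType="SubreadSet")
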
.. inheritance-diagram:: pbcore.io.dataset.DataSetIO
   :parts: 1

.. automodule:: pbcore.io.dataset.DataSetIO
   :members:
   :special-members: __call__, __init__, __metaclass__, __repr__, __add__,
                     __eq__, __deepcopy__
   :show-inheritance:

The operations possible between DataSets of the same and different types are
limited; see the DataSet XML documentation for details.
DataSet XML files have a few major components: XML metadata,
ExternalResources, Filters, DataSet Metadata, etc. These are represented in
different ways internally, depending on their complexity. DataSet metadata
especially contains a large number of different potential elements, many of
which are accessible in the API as nested attributes. To preserve the API's
ability to grant access to any DataSet Metadata available now and in the
future, as well as to maintain the performance of dataset reading and writing,
each DataSet stores its metadata in what approximates a tree structure, with
various helper classes and functions manipulating this tree. The structure of
this tree and the currently implemented helper classes are available in the
DataSetMembers module.
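
A brief illustration of the nested-attribute access this affords (the file
name is a placeholder)::

    from pbcore.io import SubreadSet

    ss = SubreadSet('movie1.subreadset.xml')

    # simple leaves of the metadata tree are exposed as nested attributes
    # via the helper classes in DataSetMembers:
    print(ss.metadata.numRecords)
    print(ss.metadata.totalLength)
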
.. inheritance-diagram:: pbcore.io.dataset.DataSetMembers
   :parts: 1

.. automodule:: pbcore.io.dataset.DataSetMembers
   :members:
   :special-members: __getitem__, __iter__, __repr__
   :show-inheritance:
   :undoc-members: