Writing Custom Sequence Parsers
===============================

screed is built to be adaptable to new kinds of sequence file formats.
screed ships with parsers for the FASTA and FASTQ sequence file
types. If you need screed to work with a new format, all you need to
do is write a new parser.

Field Roles
-----------

Each field in a screed database is assigned a role. A role describes
what kind of information is stored in its field. Right now there are
only four roles in a screed database: the text role, the sliceable
role, the indexed key role and the primary key role. All roles are
defined in the file screed/DBConstants.py.

The text role (DBConstants._STANDARD_TEXT) is the role most fields in
a database will have. This role tells screed that the associated field
is storing standard textual data. Nothing special.

The sliceable role (DBConstants._SLICEABLE_TEXT) is a role that can be
assigned to long sequence fields. screed's default FASTA parser
defines the 'sequence' field with the sliceable role. When screed
retrieves a field that has the sliceable role, it builds a special
data structure that supports slicing into the text.

The indexed key role (DBConstants._INDEXED_TEXT_KEY) is associated
with exactly one of the fields in a screed database. In screed's FASTA
and FASTQ parsers, this role is fulfilled by the 'name' field. This
field is required because it is the field screed tells sqlite to index
when creating the database and it is the field used for name look-ups
when querying a screed database.

The primary key role (DBConstants._PRIMARY_KEY_ROLE) is a role
automatically associated with the 'id' field in each database. This
field is always created with each screed database and always holds
this role. You as a user of screed won't need to worry about this one.

General Parsing Function Format
-------------------------------

create_db is the function central to the creation of screed
databases. This function accepts a file path, a tuple of field names
and roles, and an iterator function. The file path describes where the
screed database should go, the tuple contains the names of the fields
and their associated roles, and the iterator function yields records in
a dictionary format.

This sub-section describes general steps for preparing and using
screed with a custom sequence parser. Though they don't have to be,
future sequence parsers should be located in the seqparse.py file for
convenience.  These steps will be described in the context of working
from the Python shell.

First import the create_db function::

    >>> from screed import create_db

The create_db function handles the formatting of screed databases and
provides a simple interface for storing sequence data.

Next, the database fields and roles must be specified. The fields tell
screed the names and order of the data fields inside each record. For instance,
let's say our new sequence format has the fields 'name', 'bar', and 'baz', all
text. The roles come from screed/DBConstants.py, so that module is imported
first. The tuple will be::

    >>> from screed import DBConstants
    >>> fields = (('name', DBConstants._INDEXED_TEXT_KEY),
    ...           ('bar', DBConstants._STANDARD_TEXT),
    ...           ('baz', DBConstants._STANDARD_TEXT))

Notice that 'name' is given the indexed key role and 'bar' and 'baz' are
given text roles. If, for instance, you know 'baz' fields can be very long
and you want to be able to retrieve slices of them, you could specify
fields as::

    >>> fields = (('name', DBConstants._INDEXED_TEXT_KEY),
    ...           ('bar', DBConstants._STANDARD_TEXT),
    ...           ('baz', DBConstants._SLICEABLE_TEXT))

All screed databases come with an 'id' field, which numbers the records
sequentially: 0 for the first record, 1 for the second, and so on. The
names and number of the other fields are arbitrary with one restriction:
one and only one of the fields must fulfill the indexed key role.
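
The restriction above can be checked before a database is built. The
following is only a sketch: the helper name ``check_fields`` is hypothetical
and not part of screed, and the role values are stand-ins so the example runs
on its own. In real code the roles would come from screed/DBConstants.py.

```python
# Stand-in role values (assumptions for illustration only); real code
# would use the constants defined in screed/DBConstants.py instead.
INDEXED_TEXT_KEY = 'indexed-key'
STANDARD_TEXT = 'text'
SLICEABLE_TEXT = 'sliceable'


def check_fields(fields, key_role=INDEXED_TEXT_KEY):
    """Return the name of the single indexed-key field.

    Raises ValueError unless exactly one field carries key_role.
    This helper is hypothetical; screed does not provide it.
    """
    keyed = [name for name, role in fields if role == key_role]
    if len(keyed) != 1:
        raise ValueError(
            'expected exactly one indexed key field, found %d' % len(keyed))
    return keyed[0]


fields = (('name', INDEXED_TEXT_KEY),
          ('bar', STANDARD_TEXT),
          ('baz', SLICEABLE_TEXT))
key_field = check_fields(fields)
```

Running a check like this before calling create_db reports a badly formed
field tuple early, before any records are parsed.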

Next, you need to set up an iterator function that will return records in
a dictionary format. For examples of how to write one, have a look at the
'fastq_iter', 'fasta_iter', and 'hava_iter' functions in the screed/fastq.py,
screed/fasta.py, and screed/hava.py files, respectively.
If you don't know what an iterator function is, the documentation on the
Python website gives a good description:
http://docs.python.org/library/stdtypes.html#iterator-types.
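
As a rough illustration of what such an iterator function could look like,
here is a minimal sketch for a made-up format. The format itself (one record
per line, with 'name', 'bar' and 'baz' separated by tabs) is an assumption
chosen to match the example fields above, and the handle is assumed to yield
text lines. It is not how the built-in parsers are implemented.

```python
import io


def myiter(handle):
    """Yield one dictionary per record from a hypothetical format.

    Assumed layout (not a real screed format): one record per line,
    with the 'name', 'bar' and 'baz' values separated by tabs.
    """
    for line in handle:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        name, bar, baz = line.split('\t')
        yield {'name': name, 'bar': bar, 'baz': baz}


# An in-memory stand-in for an open sequence file.
sample = io.StringIO('rec1\tfoo\tACGT\nrec2\tqux\tGGCC\n')
records = list(myiter(sample))
```

Because ``myiter`` is a generator, it reads the file lazily and produces one
record at a time, so a large sequence file never has to fit in memory.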

Once the iterator function is written, it needs to be instantiated. In the
context of the built-in parsing functions, this means opening a file and
passing the file handle to the iterator function::

    >>> seqfile = open('path_to_seq_file', 'rb')
    >>> iter_instance = myiter(seqfile)

Assuming that your iterator function is called 'myiter', this sets up an
instance of it ready to use with create_db.

Now the screed database is created with one command::

    >>> create_db('path_to_screed_db', fields, iter_instance)

This saves the screed database at 'path_to_screed_db'. If instead you
want the screed database created in the same directory and with a
similar file name as the sequence file, it's OK to do this::

    >>> create_db('path_to_seq_file', fields, iter_instance)

create_db will just append '_screed' to the end of the file name and make
a screed database at that file path so the original file won't be
overwritten.

When you're done the sequence file should be closed::

    >>> seqfile.close()

Using the Built-in Sequence Iterator Functions
----------------------------------------------

This section shows how to use the 'fastq_iter' and 'fasta_iter' functions
for returning records from a sequence file.

These functions both take a file handle as the only argument and then yield
a dictionary for each record in the file, containing the names of fields and
their associated data. These functions are primarily used in conjunction with
the create_db() function, but they can be useful by themselves.

First, import the necessary module and open a text file containing sequences.
For this example, the 'fastq_iter' function will be used::

    >>> import screed.fastq
    >>> seqfile = open('path_to_seqfile', 'rb')

Now 'fastq_iter' can be instantiated and iterated over::

    >>> fq_instance = screed.fastq.fastq_iter(seqfile)
    >>> for record in fq_instance:
    ...     print(record.name)

That will print the name of every sequence in the file. If instead you want
to accumulate the sequences, iterate over a fresh instance (an iterator can
only be consumed once)::

    >>> sequences = []
    >>> for record in fq_instance:
    ...     sequences.append(record.sequence)

These iterators are the core of screed's sequence modularity. If there is
a new sequence format you want screed to work with, all it needs is its
own iterator.

Error Checking in Parsing Methods
---------------------------------

The existing FASTA/FASTQ parsing functions contain some error
checking, such as making sure the file can be opened and checking
that correct data is being read. Though screed doesn't enforce this, it
is strongly recommended to include error checking code in your parser.
To remain non-specific to one sequence file type or another, the
underlying screed library can't contain error checking code of this
kind. If errors are not detected by the parsing function, they will be
silently included in the database being built and could cause
problems much later when trying to read from the database.
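
One way to add such checks is sketched below, for a made-up tab-separated
format with the fields 'name', 'bar' and 'baz'. The function name
``checked_iter``, the format, and the specific checks are all assumptions for
illustration. The built-in parsers perform their own, different checks.

```python
import io


def checked_iter(handle):
    """Yield record dictionaries, raising ValueError on malformed lines.

    Assumed layout (not a real screed format): one record per line,
    with the 'name', 'bar' and 'baz' values separated by tabs.
    """
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip('\n')
        if not line:
            continue  # tolerate blank lines
        parts = line.split('\t')
        if len(parts) != 3:
            raise ValueError(
                'line %d: expected 3 tab-separated fields, got %d'
                % (lineno, len(parts)))
        name, bar, baz = parts
        if not name:
            # 'name' would hold the indexed key role, so it must be present.
            raise ValueError('line %d: empty record name' % lineno)
        yield {'name': name, 'bar': bar, 'baz': baz}


good = list(checked_iter(io.StringIO('rec1\tfoo\tACGT\n')))

try:
    list(checked_iter(io.StringIO('rec1\tfoo\n')))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Raising an exception that includes the line number stops the database build
at the first bad record and points directly at the offending input.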