File: chapter_blast.tex

package info (click to toggle)
python-biopython 1.68%2Bdfsg-3~bpo8%2B1
links: PTS, VCS
area: main
in suites: jessie-backports
size: 46,856 kB
sloc: python: 160,306; xml: 93,216; ansic: 9,118; sql: 1,208; makefile: 155; sh: 63
file content (663 lines) | stat: -rw-r--r-- 31,851 bytes
parent folder | download | duplicates (2)
\chapter{BLAST}
\label{chapter:blast}
Hey, everybody loves BLAST right? I mean, geez, how can it get any easier to do comparisons between one of your sequences and every other sequence in the known world? But, of course, this section isn't about how cool BLAST is, since we already know that. It is about the problem with BLAST -- it can be really difficult to deal with the volume of data generated by large runs, and to automate BLAST runs in general.

Fortunately, the Biopython folks know this only too well, so they've developed lots of tools for dealing with BLAST and making things much easier. This section details how to use these tools and do useful things with them.

Dealing with BLAST can be split up into two steps, both of which can be done from within Biopython.
Firstly, running BLAST for your query sequence(s), and getting some output.
Secondly, parsing the BLAST output in Python for further analysis.

Your first introduction to running BLAST was probably via the NCBI web-service.
In fact, there are lots of ways you can run BLAST, which can be categorised in several ways.
The most important distinction is running BLAST locally (on your own machine),
and running BLAST remotely (on another machine, typically the NCBI servers).
We're going to start this chapter by invoking the NCBI online BLAST service
from within a Python script.

\emph{NOTE}: The following Chapter~\ref{chapter:searchio} describes
\verb|Bio.SearchIO|, an \emph{experimental} module in Biopython. We
intend this to ultimately replace the older \verb|Bio.Blast| module, as it
provides a more general framework handling other related sequence
searching tools as well. However, until that is declared stable, for
production code please continue to use the  \verb|Bio.Blast| module
for dealing with NCBI BLAST.

\section{Running BLAST over the Internet}
\label{sec:running-www-blast}

We use the function \verb|qblast()| in the \verb|Bio.Blast.NCBIWWW| module
to call the online version of BLAST.  This has three non-optional arguments:
\begin{itemize}
\item The first argument is the blast program to use for the search, as a
lower case string. The options and descriptions of the programs are
available at \url{http://www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml}.
Currently \verb|qblast| only works with blastn, blastp, blastx, tblast
and tblastx.
\item The second argument specifies the databases to search against. Again,
the options for this are available on the NCBI web pages at
\url{http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml}.
\item The third argument is a string containing your query sequence.  This
can either be the sequence itself, the sequence in fasta format,
or an identifier like a GI number.
\end{itemize}

The \verb|qblast| function also take a number of other option arguments
which are basically analogous to the different parameters you can set
on the BLAST web page.  We'll just highlight a few of them here:

\begin{itemize}
\item The argument \verb|url_base| sets the base URL for running BLAST over the
internet. By default it connects to the NCBI, but one can use this to connect
to an instance of NCBI BLAST running in the cloud. Please refer to the documentation
for the \verb|qblast| function for further details.
\item The \verb|qblast| function can return the BLAST results in various
formats, which you can choose with the optional \verb|format_type| keyword:
\verb|"HTML"|, \verb|"Text"|, \verb|"ASN.1"|, or \verb|"XML"|.
The default is \verb|"XML"|, as that is the format expected by the parser,
described in section~\ref{sec:parsing-blast} below.
\item The argument \verb|expect| sets the expectation or e-value threshold.
\end{itemize}

For more about the optional BLAST arguments, we refer you to the NCBI's own
documentation, or that built into Biopython:

\begin{verbatim}
>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)
...
\end{verbatim}

Note that the default settings on the NCBI BLAST website are not quite
the same as the defaults on QBLAST. If you get different results, you'll
need to check the parameters (e.g., the expectation value threshold and
the gap values).

For example, if you have a nucleotide sequence you want to search against
the nucleotide database (nt) using BLASTN, and you know the GI number of your
query sequence, you can use:

\begin{verbatim}
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
\end{verbatim}

Alternatively, if we have our query sequence already in a FASTA formatted
file, we just need to open the file and read in this record as a string,
and use that as the query argument:

\begin{verbatim}
>>> from Bio.Blast import NCBIWWW
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
\end{verbatim}

We could also have read in the FASTA file as a \verb|SeqRecord| and then
supplied just the sequence itself:

\begin{verbatim}
>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
\end{verbatim}

Supplying just the sequence means that BLAST will assign an identifier
for your sequence automatically.  You might prefer to use the
\verb|SeqRecord| object's format method to make a FASTA string
(which will include the existing identifier):

\begin{verbatim}
>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))
\end{verbatim}

This approach makes more sense if you have your sequence(s) in a
non-FASTA file format which you can extract using \verb|Bio.SeqIO|
(see Chapter~\ref{chapter:Bio.SeqIO}).

Whatever arguments you give the \verb|qblast()| function, you should
get back your results in a handle object (by default in XML format).
The next step would be to parse the XML output into Python objects
representing the search results (Section~\ref{sec:parsing-blast}),
but you might want to save a local copy of the output file first.
I find this especially useful when debugging my code that extracts
info from the BLAST results (because re-running the online search
is slow and wastes the NCBI computer time).

\label{sec:saving-blast-output}

We need to be a bit careful since we can use \verb|result_handle.read()| to
read the BLAST output only once -- calling \verb|result_handle.read()| again
returns an empty string.

\begin{verbatim}
>>> save_file = open("my_blast.xml", "w")
>>> save_file.write(result_handle.read())
>>> save_file.close()
>>> result_handle.close()
\end{verbatim}

After doing this, the results are in the file \verb|my_blast.xml| and the
original handle has had all its data extracted (so we closed it). However,
the \verb|parse| function of the BLAST parser (described
in~\ref{sec:parsing-blast}) takes a file-handle-like object, so
we can just open the saved file for input:

\begin{verbatim}
>>> result_handle = open("my_blast.xml")
\end{verbatim}

Now that we've got the BLAST results back into a handle again, we are ready
to do something with them, so this leads us right into the parsing section
(see Section~\ref{sec:parsing-blast} below). You may want to jump ahead to
that now \ldots.

\section{Running BLAST locally}
\label{sec:running-local-blast}

\subsection{Introduction}

Running BLAST locally (as opposed to over the internet, see
Section~\ref{sec:running-www-blast}) has at least major two advantages:
\begin{itemize}
\item Local BLAST may be faster than BLAST over the internet;
\item Local BLAST allows you to make your own database to search for sequences against.
\end{itemize}
Dealing with proprietary or unpublished sequence data can be another reason to run BLAST
locally.  You may not be allowed to redistribute the sequences, so submitting them to the
NCBI as a BLAST query would not be an option.

Unfortunately, there are some major drawbacks too -- installing all the bits and getting
it setup right takes some effort:
\begin{itemize}
\item Local BLAST requires command line tools to be installed.
\item Local BLAST requires (large) BLAST databases to be setup (and potentially kept up to date).
\end{itemize}

\noindent
To further confuse matters there are several different BLAST packages available,
and there are also other tools which can produce imitation BLAST output files, such as BLAT.

\subsection{Standalone NCBI BLAST+}

The ``new''
\href{http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download}
{NCBI BLAST+} suite was released in 2009. This replaces the old NCBI ``legacy'' BLAST
package (see below).

This section will show briefly how to use these tools from within Python. If you have
already read or tried the alignment tool examples in Section~\ref{sec:alignment-tools}
this should all seem quite straightforward. First, we construct a command line string
(as you would type in at the command line prompt if running standalone BLAST by hand).
Then we can execute this command from within Python.

For example, taking a FASTA file of gene nucleotide sequences, you might want to
run a BLASTX (translation) search against the non-redundant (NR) protein database.
Assuming you (or your systems administrator) has downloaded and installed the NR
database, you might run:

\begin{verbatim}
blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5
\end{verbatim}

This should run BLASTX against the NR database, using an expectation cut-off value
of $0.001$ and produce XML output to the specified file (which we can then parse).
On my computer this takes about six minutes - a good reason to save the output
to a file so you can repeat any analysis as needed.

From within Biopython we can use the NCBI BLASTX wrapper from the
\verb|Bio.Blast.Applications| module to build the command line string,
and run it:

\begin{verbatim}
>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> help(NcbiblastxCommandline)
...
>>> blastx_cline = NcbiblastxCommandline(query="opuntia.fasta", db="nr", evalue=0.001,
...                                      outfmt=5, out="opuntia.xml")
>>> blastx_cline
NcbiblastxCommandline(cmd='blastx', out='opuntia.xml', outfmt=5, query='opuntia.fasta',
db='nr', evalue=0.001)
>>> print(blastx_cline)
blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
>>> stdout, stderr = blastx_cline()
\end{verbatim}

In this example there shouldn't be any output from BLASTX to the terminal,
so stdout and stderr should be empty. You may want to check the output file
\verb|opuntia.xml| has been created.

As you may recall from earlier examples in the tutorial, the \verb|opuntia.fasta|
contains seven sequences, so the BLAST XML output should contain multiple results.
Therefore use \verb|Bio.Blast.NCBIXML.parse()| to parse it as described below in
Section~\ref{sec:parsing-blast}.

\subsection{Other versions of BLAST}

NCBI BLAST+ (written in C++) was first released in 2009 as a replacement for
the original NCBI ``legacy'' BLAST (written in C) which is no longer being updated.
There were a lot of changes -- the old version had a single core command line
tool \verb|blastall| which covered multiple different BLAST search types (which
are now separate commands in BLAST+), and all the command line options
were renamed.
Biopython's wrappers for the NCBI ``legacy'' BLAST tools have been deprecated
and will be removed in a future release.
To try to avoid confusion, we do not cover calling these old tools from Biopython
in this tutorial.

You may also come across \href{http://blast.wustl.edu/}{Washington University BLAST}
(WU-BLAST), and its successor, \href{http://blast.advbiocomp.com}{Advanced Biocomputing
BLAST} (AB-BLAST, released in 2009, not free/open source). These packages include
the command line tools \verb|wu-blastall| and \verb|ab-blastall|, which mimicked
\verb|blastall| from the NCBI ``legacy'' BLAST suite.
Biopython does not currently provide wrappers for calling these tools, but should be able
to parse any NCBI compatible output from them.

\section{Parsing BLAST output}
\label{sec:parsing-blast}

As mentioned above, BLAST can generate output in various formats, such as
XML, HTML, and plain text. Originally, Biopython had parsers for BLAST
plain text and HTML output, as these were the only output formats offered
at the time. Unfortunately, the BLAST output in these formats kept changing,
each time breaking the Biopython parsers. Our HTML BLAST parser has been
removed, but the plain text BLAST parser is still available (see
Section~\ref{sec:parsing-blast-deprecated}). Use it at your own risk,
it may or may not work, depending on which BLAST version you're using.

As keeping up with changes in BLAST
became a hopeless endeavor, especially with users running different BLAST
versions, we now recommend to parse the output in XML format, which can be
generated by recent versions of BLAST. Not only is the XML output more stable
than the plain text and HTML output, it is also much easier to parse
automatically, making Biopython a whole lot more stable.

You can get BLAST output in XML format in various ways. For the parser, it
doesn't matter how the output was generated, as long as it is in the XML format.
\begin{itemize}
\item You can use Biopython to run BLAST over the internet, as described in
section~\ref{sec:running-www-blast}.
\item You can use Biopython to run BLAST locally, as described in
section~\ref{sec:running-local-blast}.
\item You can do the BLAST search yourself on the NCBI site through your
web browser, and then save the results. You need to choose XML as the format
in which to receive the results, and save the final BLAST page you get
(you know, the one with all of the interesting results!) to a file.
\item You can also run BLAST locally without using Biopython, and save
the output in a file. Again, you need to choose XML as the format in which
to receive the results.
\end{itemize}
The important point is that you do not have to use Biopython
scripts to fetch the data in order to be able to parse it.
Doing things in one of these ways, you then need to get a handle
to the results. In Python, a handle is just a nice general way of
describing input to any info source so that the info can be retrieved
using \verb|read()| and \verb|readline()| functions
(see Section~{sec:appendix-handles}).

If you followed the code above for interacting with BLAST through a
script, then you already have \verb|result_handle|, the handle to the
BLAST results.  For example, using a GI number to do an online search:

\begin{verbatim}
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
\end{verbatim}

If instead you ran BLAST some other way, and have the
BLAST output (in XML format) in the file \verb|my_blast.xml|, all you
need to do is to open the file for reading:

\begin{verbatim}
>>> result_handle = open("my_blast.xml")
\end{verbatim}

Now that we've got a handle, we are ready to parse the output. The
code to parse it is really quite small.  If you expect a single
BLAST result (i.e., you used a single query):

\begin{verbatim}
>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)
\end{verbatim}

\noindent or, if you have lots of results (i.e., multiple query sequences):

\begin{verbatim}
>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
\end{verbatim}

Just like \verb|Bio.SeqIO| and \verb|Bio.AlignIO|
(see Chapters~\ref{chapter:Bio.SeqIO} and~\ref{chapter:Bio.AlignIO}),
we have a pair of input functions, \verb|read| and \verb|parse|, where
\verb|read| is for when you have exactly one object, and \verb|parse|
is an iterator for when you can have lots of objects -- but instead of
getting \verb|SeqRecord| or \verb|MultipleSeqAlignment| objects, we
get BLAST record objects.

To be able to handle the situation where the BLAST file may be huge,
containing thousands of results, \verb|NCBIXML.parse()| returns an
iterator. In plain English, an iterator allows you to step through
the BLAST output, retrieving BLAST records one by one for each BLAST
search result:

\begin{verbatim}
>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
# No further records
\end{verbatim}

Or, you can use a \verb|for|-loop:
\begin{verbatim}
>>> for blast_record in blast_records:
...     # Do something with blast_record
\end{verbatim}

Note though that you can step through the BLAST records only once.
Usually, from each BLAST record you would save the information that
you are interested in. If you want to save all returned BLAST records,
you can convert the iterator into a list:
\begin{verbatim}
>>> blast_records = list(blast_records)
\end{verbatim}
Now you can access each BLAST record in the list with an index as usual.
If your BLAST file is huge though, you may run into memory problems trying to
save them all in a list.

Usually, you'll be running one BLAST search at a time. Then, all you need
to do is to pick up the first (and only) BLAST record in \verb|blast_records|:
\begin{verbatim}
>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = next(blast_records)
\end{verbatim}
\noindent or more elegantly:
\begin{verbatim}
>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)
\end{verbatim}

I guess by now you're wondering what is in a BLAST record.

\section{The BLAST record class}

A BLAST Record contains everything you might ever want to extract from the
BLAST output. Right now we'll just show
an example of how to get some info out of the BLAST report, but if you
want something in particular that is not described here, look at the
info on the record class in detail, and take a gander into the code or
automatically generated documentation -- the docstrings have lots of
good info about what is stored in each piece of information.

To continue with our example, let's just print out some summary info
about all hits in our blast report greater than a particular
threshold. The following code does this:

\begin{verbatim}
>>> E_VALUE_THRESH = 0.04

>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print('****Alignment****')
...             print('sequence:', alignment.title)
...             print('length:', alignment.length)
...             print('e value:', hsp.expect)
...             print(hsp.query[0:75] + '...')
...             print(hsp.match[0:75] + '...')
...             print(hsp.sbjct[0:75] + '...')
\end{verbatim}

This will print out summary reports like the following:

\begin{verbatim}
****Alignment****
sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein WCOR413-like protein
alpha form mRNA, complete cds
length: 783
e value: 0.034
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc...
||||||||| | ||||||||||| || ||||  || || |||||||| |||||| |  | |||||||| ||| ||...
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...
\end{verbatim}

Basically, you can do anything you want to with the info in the BLAST
report once you have parsed it. This will, of course, depend on what
you want to use it for, but hopefully this helps you get started on
doing what you need to do!

An important consideration for extracting information from a BLAST report is the type of objects that the information is stored in. In Biopython, the parsers return \verb|Record| objects, either \verb|Blast| or \verb|PSIBlast| depending on what you are parsing. These objects are defined in \verb|Bio.Blast.Record| and are quite complete.

Here are my attempts at UML class diagrams for the \verb|Blast| and \verb|PSIBlast| record classes. If you are good at UML and see mistakes/improvements that can be made, please let me know. The Blast class diagram is shown in Figure~\ref{fig:blastrecord}.

\begin{htmlonly}
\label{fig:blastrecord}
\imgsrc[width=650, height=750]{images/BlastRecord.png}
\end{htmlonly}

\begin{latexonly}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{images/BlastRecord.png}
\caption{Class diagram for the Blast Record class representing all of the info in a BLAST report}
\label{fig:blastrecord}
\end{figure}
\end{latexonly}

The PSIBlast record object is similar, but has support for the rounds that are used in the iteration steps of PSIBlast. The class diagram for PSIBlast is shown in Figure~\ref{fig:psiblastrecord}.

\begin{htmlonly}
\label{fig:psiblastrecord}
\imgsrc[width=650, height=750]{images/PSIBlastRecord.png}
\end{htmlonly}

\begin{latexonly}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{images/PSIBlastRecord.png}
\caption{Class diagram for the PSIBlast Record class.}
\label{fig:psiblastrecord}
\end{figure}
\end{latexonly}

\section{Deprecated BLAST parsers}
\label{sec:parsing-blast-deprecated}

Older versions of Biopython had parsers for BLAST output in plain text or HTML
format. Over the years, we discovered that it is very hard to maintain these
parsers in working order. Basically, any small change to the BLAST output in
newly released BLAST versions tends to cause the plain text and HTML parsers
to break. We therefore recommend parsing BLAST output in XML format, as
described in section~\ref{sec:parsing-blast}.

Depending on which BLAST versions or programs you're using, our plain text BLAST parser may or may not work. Use it at your own risk!

\subsection{Parsing plain-text BLAST output}

The plain text BLAST parser is located in \verb|Bio.Blast.NCBIStandalone|.

As with the XML parser, we need to have a handle object that we can pass to the parser. The handle must implement the \verb|readline()| method and do this properly. The common ways to get such a handle are to either use the provided \verb|blastall| or \verb|blastpgp| functions to run the local blast, or to run a local blast via the command line, and then do something like the following:

\begin{verbatim}
>>> result_handle = open("my_file_of_blast_output.txt")
\end{verbatim}

Well, now that we've got a handle (which we'll call \verb|result_handle|),
we are ready to parse it. This can be done with the following code:

\begin{verbatim}
>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)
\end{verbatim}

This will parse the BLAST report into a Blast Record class (either a Blast or a PSIBlast record, depending on what you are parsing) so that you can extract the information from it. In our case, let's just print out a quick summary of all of the alignments greater than some threshold value.

\begin{verbatim}
>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print('****Alignment****')
...             print('sequence:', alignment.title)
...             print('length:', alignment.length)
...             print('e value:', hsp.expect)
...             print(hsp.query[0:75] + '...')
...             print(hsp.match[0:75] + '...')
...             print(hsp.sbjct[0:75] + '...')
\end{verbatim}

If you also read the section~\ref{sec:parsing-blast} on parsing BLAST XML output, you'll notice that the above code is identical to what is found in that section. Once you parse something into a record class you can deal with it independent of the format of the original BLAST info you were parsing. Pretty snazzy!

Sure, parsing one record is great, but I've got a BLAST file with tons of records -- how can I parse them all? Well, fear not, the answer lies in the very next section.

\subsection{Parsing a plain-text BLAST file full of BLAST runs}

We can do this using the blast iterator. To set up an iterator, we first set up a parser, to parse our blast reports in Blast Record objects:

\begin{verbatim}
>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
\end{verbatim}

Then we will assume we have a handle to a bunch of blast records, which we'll call \verb|result_handle|. Getting a handle is described in full detail above in the blast parsing sections.

Now that we've got a parser and a handle, we are ready to set up the iterator with the following command:

\begin{verbatim}
>>> blast_iterator = NCBIStandalone.Iterator(result_handle, blast_parser)
\end{verbatim}

The second option, the parser, is optional. If we don't supply a parser, then the iterator will just return the raw BLAST reports one at a time.

Now that we've got an iterator, we start retrieving blast records (generated by our parser) using \verb|next()|:

\begin{verbatim}
>>> blast_record = next(blast_iterator)
\end{verbatim}

Each call to next will return a new record that we can deal with. Now we can iterate through these records and generate our old favorite, a nice little blast report:

\begin{verbatim}
>>> for blast_record in blast_iterator:
...     E_VALUE_THRESH = 0.04
...     for alignment in blast_record.alignments:
...         for hsp in alignment.hsps:
...             if hsp.expect < E_VALUE_THRESH:
...                 print('****Alignment****')
...                 print('sequence:', alignment.title)
...                 print('length:', alignment.length)
...                 print('e value:', hsp.expect)
...                 if len(hsp.query) > 75:
...                     dots = '...'
...                 else:
...                     dots = ''
...                 print(hsp.query[0:75] + dots)
...                 print(hsp.match[0:75] + dots)
...                 print(hsp.sbjct[0:75] + dots)
\end{verbatim}

%Notice that \verb|next(b_iterator)| will return \verb|None| when it runs out of records to parse, so it is easy to iterate through the entire file with a while loop that checks for the existence of a record.

The iterator allows you to deal with huge blast records without any memory problems, since things are read in one at a time. I have parsed tremendously huge files without any problems using this.

%TODO - Remove this or update it for BLAST XML parser
\subsection{Finding a bad record somewhere in a huge plain-text BLAST file}

One really ugly problem that happens to me is that I'll be parsing a huge blast file for a while, and the parser will bomb out with a ValueError. This is a serious problem, since you can't tell if the ValueError is due to a parser problem, or a problem with the BLAST. To make it even worse, you have no idea where the parse failed, so you can't just ignore the error, since this could be ignoring an important data point.

We used to have to make a little script to get around this problem, but the \verb|Bio.Blast| module now includes a \verb|BlastErrorParser| which really helps make this easier. The \verb|BlastErrorParser| works very similar to the regular \verb|BlastParser|, but it adds an extra layer of work by catching ValueErrors that are generated by the parser, and attempting to diagnose the errors.

Let's take a look at using this parser -- first we define the file we are going to parse and the file to write the problem reports to:

\begin{verbatim}
>>> import os
>>> blast_file = os.path.join(os.getcwd(), "blast_out", "big_blast.out")
>>> error_file = os.path.join(os.getcwd(), "blast_out", "big_blast.problems")
\end{verbatim}

Now we want to get a \verb|BlastErrorParser|:

\begin{verbatim}
>>> from Bio.Blast import NCBIStandalone
>>> error_handle = open(error_file, "w")
>>> blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)
\end{verbatim}

Notice that the parser take an optional argument of a handle. If a handle is passed, then the parser will write any blast records which generate a ValueError to this handle. Otherwise, these records will not be recorded.

Now we can use the \verb|BlastErrorParser| just like a regular blast parser. Specifically, we might want to make an iterator that goes through our blast records one at a time and parses them with the error parser:

\begin{verbatim}
>>> result_handle = open(blast_file)
>>> iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)
\end{verbatim}

We can read these records one a time, but now we can catch and deal with errors that are due to problems with Blast (and not with the parser itself):

\begin{verbatim}
>>> try:
...     next_record = next(iterator)
... except NCBIStandalone.LowQualityBlastError as info:
...     print("LowQualityBlastError detected in id %s" % info[1])
\end{verbatim}

The \verb|next()| functionality is normally called indirectly via a \verb|for|-loop.
Right now the \verb|BlastErrorParser| can generate the following errors:

\begin{itemize}
  \item \verb|ValueError| -- This is the same error generated by the regular BlastParser, and is due to the parser not being able to parse a specific file. This is normally either due to a bug in the parser, or some kind of discrepancy between the version of BLAST you are using and the versions the parser is able to handle.

  \item \verb|LowQualityBlastError| -- When BLASTing a sequence that is of really bad quality (for example, a short sequence that is basically a stretch of one nucleotide), it seems that Blast ends up masking out the entire sequence and ending up with nothing to parse. In this case it will produce a truncated report that causes the parser to generate a ValueError. \verb|LowQualityBlastError| is reported in these cases. This error returns an info item with the following information:
  \begin{itemize}
    \item \verb|item[0]| -- The error message
    \item \verb|item[1]| -- The id of the input record that caused the error. This is really useful if you want to record all of the records that are causing problems.
  \end{itemize}
\end{itemize}

As mentioned, with each error generated, the BlastErrorParser will write the offending record to the specified \verb|error_handle|. You can then go ahead and look and these and deal with them as you see fit. Either you will be able to debug the parser with a single blast report, or will find out problems in your blast runs. Either way, it will definitely be a useful experience!

Hopefully the \verb|BlastErrorParser| will make it much easier to debug and deal with large Blast files.

\section{Dealing with PSI-BLAST}

You can run the standalone version of PSI-BLAST (the legacy NCBI command line
tool \verb|blastpgp|, or its replacement \verb|psiblast|) using the wrappers
in \verb|Bio.Blast.Applications| module.

At the time of writing, the NCBI do not appear to support tools running a
PSI-BLAST search via the internet.

Note that the \verb|Bio.Blast.NCBIXML| parser can read the XML output from
current versions of PSI-BLAST, but information like which sequences in each
iteration is new or reused isn't present in the XML file.
If you care about this information you may have more joy with the plain text
output and the \verb|PSIBlastParser| in \verb|Bio.Blast.NCBIStandalone|.

\section{Dealing with RPS-BLAST}

You can run the standalone version of RPS-BLAST (either the legacy NCBI
command line tool \verb|rpsblast|, or its replacement with the same name)
using the wrappers in \verb|Bio.Blast.Applications| module.

At the time of writing, the NCBI do not appear to support tools running an
RPS-BLAST search via the internet.

You can use the \verb|Bio.Blast.NCBIXML| parser to read the XML output from
current versions of RPS-BLAST.