1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407
|
.. _blast-usage:
*****************
Controlling BLAST
*****************
.. authors, Gavin Huttley, Tom Elliott, Jeremy Widmann
Preliminaries
-------------
In order to run BLAST locally (from a program running on your computer) you will need to do three things:
- Download the BLAST "executables" from NCBI
- Make sure these programs are available on your ``PATH``
- Construct and format a database to search against
- Tested on version 2.2.13
NCBI has recently changed the BLAST programs, and as yet PyCogent does not support the new versions. The "legacy" versions are available from `here <http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download>`_ (login as guest).
Detailed installation instructions are beyond the scope of this example, but are available at `NCBI's website <http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/unix_setup.html>`_ .
After downloading the programs and setting up your ``PATH``, you should test BLAST by doing this from the command line:
::
$ blastall --help
Which should give this:
::
blastall 2.2.22 arguments:
-p Program Name [String]
-d Database [String]
default = nr...
The file ``refseqs.fasta`` contains some short sequences for use in the following examples.
.. doctest::
>>> from cogent import LoadSeqs, DNA
>>> seqs = LoadSeqs('data/refseqs.fasta',
... moltype=DNA, aligned=False)
>>> for seq in seqs.Seqs:
... print seq
...
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
TGCAGCTTGAGCCACAGGAGAGAGAGAGCTTC
TGCAGCTTGAGCCACAGGAGAGAGCCTTC
TGCAGCTTGAGCCACAGGAGAGAGAGAGCTTC
ACCGATGAGATATTAGCACAGGGGAATTAGAACCA
TGTCGAGAGTGAGATGAGATGAGAACA
ACGTATTTTAATTTGGCATGGT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
CCAGAGCGAGTGAGATAGACACCCAC
These sequences can be formatted as a database by running the ``formatdb`` program contained in the NCBI download (this assumes that ``formatdb`` is on your ``PATH``)
.. doctest::
>>> from cogent.app.formatdb import build_blast_db_from_fasta_path
>>> result = build_blast_db_from_fasta_path('data/refseqs.fasta')
>>> result[0]
'data/refseqs.fasta'...
The function ``build_blast_db_from_fasta_path()`` returns a tuple containing the path to the database, and a list of paths for all the new files written by ``formatdb``.
The final requirement is to know the path to the ``data`` directory that comes with your BLAST download. This directory contains substitution matrices and other files.
BLAST with text output
----------------------
In this example, we load a DNA sequence from a file in the data directory and BLAST against our formatted database. The parameters dictionary contains two flagged arguments: -p for the program to use, and -m for the type of output we want. '-m':'9' requests "tabular with comment lines." See ``blastall --help`` for more details.
Also, the application controller is set up to require a path to the data directory even though we don't use a substitution matrix for DNA. Here we can just pass any string.
.. doctest::
>>> from cogent import LoadSeqs, DNA
>>> from cogent.app.blast import blast_seqs, Blastall
>>> seqs = LoadSeqs('data/inseqs.fasta', moltype=DNA, aligned=False)
>>> seq = seqs.getSeq('s2_like_seq')
>>> seq
DnaSequence(TGCAGCT... 28)
>>> params={'-p':'blastn','-m':'9'}
>>> result = blast_seqs([seq],
... Blastall,
... blast_db = 'data/refseqs.fasta',
... blast_mat_root = 'xxx',
... params = params)
>>> data = result['StdOut'].read()
>>> print data.split('\n')[:1]
['# BLASTN 2.2...
We save the results for further processing
.. doctest::
>>> outfile = open('data/blast_test.txt','w')
>>> outfile.write(data)
>>> outfile.close()
The simplest way to organize the results is to use a parser. The BLAST parsers operate on a file object.
.. doctest::
>>> from cogent.parse.blast import MinimalBlastParser9
>>> blastfile = open('data/blast_test.txt', 'r')
>>> blast_results = MinimalBlastParser9(blastfile)
>>> type(blast_results)
<type 'generator'>
>>> for result in blast_results:
... print result
...
({'QUERY': '1', 'FIELDS': 'Query id...
The results include one item for each query sequence. Each result consists of a tuple whose first item is a dictionary of metadata. The second item is a list of hits. For example, you might do this
.. doctest::
>>> from cogent.parse.blast import MinimalBlastParser9
>>> blastfile = open('data/blast_test.txt', 'r')
>>> blast_results = MinimalBlastParser9(blastfile)
>>> for result in blast_results:
... meta_data, hit_list = result
... fields = meta_data['FIELDS'].split(',')
... for key, value in zip(fields, hit_list[0]):
... print key.strip().ljust(20), value
Query id 1
Subject id s2
% identity 89.66
alignment length 29
mismatches 2
gap openings 1
q. start 1
q. end 28
s. start 1
s. end 29
e-value 6e-05
bit score 26.3
BLAST with XML output
---------------------
In this example, we load a DNA sequence from a file in the data directory and BLAST against our formatted database as above.
NCBI recommends that you use XML as the output for BLAST. (They reserve the right to change the format for other output types). XML is the output when we pass '-m':'7'.
.. doctest::
>>> from cogent import LoadSeqs, DNA
>>> from cogent.app.blast import blast_seqs, Blastall
>>> seqs = LoadSeqs('data/inseqs.fasta', moltype=DNA, aligned=False)
>>> seq = seqs.getSeq('s2_like_seq')
>>> params={'-p':'blastn','-m':'7'}
>>> result = blast_seqs([seq],
... Blastall,
... blast_db = 'data/refseqs.fasta',
... blast_mat_root = 'xxx',
... params = params)
>>> data = result['StdOut'].read()
>>> outfile = open('data/blast_test.xml','w')
>>> outfile.write(data)
>>> outfile.close()
One nice thing about this format is that it includes the alignment. The organization of the results from this parser is slightly different. Each result is still a tuple, but the first item of the tuple is metadata about the BLAST settings (``meta_meta_data``). The keys for the fields in the output are contained as the first element in the list that is the second item in the result tuple.
.. doctest::
>>> from cogent.parse.blast_xml import MinimalBlastParser7
>>> blastfile = open('data/blast_test.xml', 'r')
>>> blast_results = MinimalBlastParser7(blastfile)
>>> for result in blast_results:
... meta_meta_data, hit_list = result
... key_list = hit_list[0]
... for key, value in zip(key_list, hit_list[1]):
... if 'ALIGN' in key:
... continue
... print key.ljust(20), value
QUERY ID 1
SUBJECT_ID lcl|s2
HIT_DEF No definition line found
HIT_ACCESSION s2
HIT_LENGTH 29
PERCENT_IDENTITY 26
MISMATCHES 0
GAP_OPENINGS 1
QUERY_START 1
QUERY_END 28
SUBJECT_START 1
SUBJECT_END 29
E_VALUE 6.00825e-05
BIT_SCORE 26.2635
SCORE 13
POSITIVE 26
>>> from cogent.parse.blast_xml import MinimalBlastParser7
>>> blastfile = open('data/blast_test.xml', 'r')
>>> blast_results = MinimalBlastParser7(blastfile)
>>> for result in blast_results:
... meta_meta_data, hit_list = result
... key_list = hit_list[0]
... for key in ('QUERY_ALIGN','MIDLINE_ALIGN','SUBJECT_ALIGN'):
... i = key_list.index(key)
... print hit_list[1][i][:40]
TGCAGCTTGAG-CACAGGTTAGAGCCTTC
||||||||||| |||||| |||||||||
TGCAGCTTGAGCCACAGGAGAGAGCCTTC
.. doctest::
:hide:
>>> from cogent.util.misc import remove_files
>>> remove_files(['data/blast_test.txt', 'data/blast_test.xml'],
... error_on_missing=False)
BLAST with protein sequences
----------------------------
In this example, we load a protein sequence from a file in the data directory and BLAST against a new protein database we will create. Since we want to BLAST protein sequences instead of DNA, we will have to construct a new BLAST database.
The file ``refseqs_protein.fasta`` contains some short sequences for use in the following examples.
.. doctest::
>>> from cogent.app.formatdb import build_blast_db_from_fasta_path
>>> result = build_blast_db_from_fasta_path('data/refseqs_protein.fasta', is_protein=True)
>>> result[0]
'data/refseqs_protein.fasta'...
Notice that we set the parameter ``is_protein`` to ``True`` since our database consists of protein sequences this time. This was not necessary in the previous example, because ``is_protein`` is set to ``False`` by default.
Now that we have built our protein BLAST database, we can load our sequence and BLAST against this database.
.. doctest::
>>> from cogent import LoadSeqs, PROTEIN
>>> from cogent.app.blast import blast_seqs, Blastall
>>> seqs = LoadSeqs('data/inseqs_protein.fasta', moltype=PROTEIN, aligned=False)
>>> seq = seqs.getSeq('1091044_fragment')
>>> seq
ProteinSequence(IPLDFDK... 26)
Notice we need to use '-p':'blastp' in the parameters dictionary, since ``blastp`` is used for protein.
.. doctest::
>>> params={'-p':'blastp','-m':'9'}
>>> result = blast_seqs([seq],
... Blastall,
... blast_db = 'data/refseqs_protein.fasta',
... params = params)
>>> data = result['StdOut'].read()
>>> print data.split('\n')[:1]
['# BLASTP 2.2...
We save the results for further processing
.. doctest::
>>> outfile = open('data/blast_protein_test.txt','w')
>>> outfile.write(data)
>>> outfile.close()
Now we will explore some of the convenience methods of the ``BlastResult`` object.
.. doctest::
>>> from cogent.parse.blast import BlastResult
>>> blast_results = BlastResult(open('data/blast_protein_test.txt','r'))
Suppose we want to filter our results based on various criteria. In many cases you may want to only keep the top '3' matches with the longest 'ALIGNMENT LENGTH' for the query sequence to the target.
.. doctest::
>>> best_hits = dict(blast_results.bestHitsByQuery(field='ALIGNMENT LENGTH', n=3))
>>> query_1_best_hits = best_hits['1']
>>> for hit in query_1_best_hits:
... for key, value in hit.items():
... print key.ljust(20), value
... print
...
MISMATCHES 0
ALIGNMENT LENGTH 26
Q. END 26
BIT SCORE 56.2
% IDENTITY 100.00
Q. START 1
S. START 30
S. END 55
GAP OPENINGS 0
QUERY ID 1
E-VALUE 5e-12
SUBJECT ID 1091044
<BLANKLINE>
MISMATCHES 10
ALIGNMENT LENGTH 27
Q. END 25
BIT SCORE 33.5
% IDENTITY 55.56
Q. START 1
S. START 32
S. END 58
GAP OPENINGS 1
QUERY ID 1
E-VALUE 3e-05
SUBJECT ID 5326864
<BLANKLINE>
MISMATCHES 16
ALIGNMENT LENGTH 24
Q. END 25
BIT SCORE 22.3
% IDENTITY 33.33
Q. START 2
S. START 19
S. END 42
GAP OPENINGS 0
QUERY ID 1
E-VALUE 0.077
SUBJECT ID 14286173
<BLANKLINE>
The fist of the top 3 hits for alignment length has 0 MISMATCHES and a % IDENTITY of 100.00. The next 2 hits have many MISMATCHES and a much lower % IDENTITY. Lets filter the results again, but by E-VALUE this time:
.. doctest::
>>> best_hits = dict(blast_results.bestHitsByQuery(field='E-VALUE', n=3))
>>> query_1_best_hits = best_hits['1']
>>> for hit in query_1_best_hits:
... for key, value in hit.items():
... print key.ljust(20), value
... print
...
MISMATCHES 0
ALIGNMENT LENGTH 26
Q. END 26
BIT SCORE 56.2
% IDENTITY 100.00
Q. START 1
S. START 30
S. END 55
GAP OPENINGS 0
QUERY ID 1
E-VALUE 5e-12
SUBJECT ID 1091044
<BLANKLINE>
MISMATCHES 10
ALIGNMENT LENGTH 27
Q. END 25
BIT SCORE 33.5
% IDENTITY 55.56
Q. START 1
S. START 32
S. END 58
GAP OPENINGS 1
QUERY ID 1
E-VALUE 3e-05
SUBJECT ID 5326864
<BLANKLINE>
MISMATCHES 6
ALIGNMENT LENGTH 18
Q. END 26
BIT SCORE 30.4
% IDENTITY 66.67
Q. START 9
S. START 31
S. END 48
GAP OPENINGS 0
QUERY ID 1
E-VALUE 3e-04
SUBJECT ID 15964668
<BLANKLINE>
You can filter the BLAST results by any of the fields you like. You can also use the ``BlastResult`` object to do a quick assessment of your BLAST results looking only at the fields you like:
.. doctest::
>>> fields = ['SUBJECT ID', 'BIT SCORE', 'E-VALUE']
>>> for query, results in blast_results.items():
... print ''.join([f.ljust(20) for f in fields])
... for result in results[-1]:
... print ''.join(map(str,[result[field].ljust(20) for field in fields]))
SUBJECT ID BIT SCORE E-VALUE
1091044 56.2 5e-12
5326864 33.5 3e-05
15964668 30.4 3e-04
17229033 29.6 5e-04
21112072 28.1 0.001
4704732 25.8 0.007
13541117 24.6 0.016
15826629 24.3 0.020
14286173 22.3 0.077
6323138 21.9 0.10
18313548 20.8 0.22
21674812 20.0 0.38
14600438 20.0 0.38
4996210 18.5 1.1
15605963 17.3 2.5
15615431 16.5 4.2
.. doctest::
:hide:
>>> from cogent.util.misc import remove_files
>>> remove_files(['data/blast_protein_test.txt'],
... error_on_missing=False)
|