"""BLAST+7 format (:mod:`skbio.io.format.blast7`) ============================================== .. currentmodule:: skbio.io.format.blast7 The BLAST+7 format (``blast+7``) stores the results of a BLAST [1]_ database search. This format is produced by both BLAST+ output format 7 and legacy BLAST output format 9. The results are stored in a simple tabular format with headers. Values are separated by the tab character. An example BLAST+7-formatted file comparing two nucleotide sequences, taken from [2]_ (tab characters represented by ````): .. code-block:: none # BLASTN 2.2.18+ # Query: gi|1786181|gb|AE000111.1|AE000111 # Subject: ecoli # Fields: query acc., subject acc., evalue, q. start, q. end, s. st\ art, s. end # 5 hits found AE000111AE0001110.0110596110596 AE000111AE0001748e-305565567169286821 AE000111AE0003941e-2755875671135219 AE000111AE0004256e-265587567185528468 AE000111AE0001713e-245587567122142130 Format Support ============== **Has Sniffer: Yes** +------+------+---------------------------------------------------------------+ |Reader|Writer| Object Class | +======+======+===============================================================+ |Yes |No |:mod:`pandas.DataFrame` | +------+------+---------------------------------------------------------------+ Format Specification ==================== There are two BLAST+7 file formats supported by scikit-bio: BLAST+ output format 7 (``-outfmt 7``) and legacy BLAST output format 9 (``-m 9``). Both file formats are structurally similar, with minor differences. Example BLAST+ output format 7 file:: # BLASTP 2.2.31+ # Query: query1 # Subject: subject2 # Fields: q. start, q. end, s. start, s. end, identical, mismatches, sbjct\ frame, query acc.ver, subject acc.ver # 2 hits found 1 8 3 10 8 0 1 query1 subject2 2 5 2 15 8 0 2 query1 subject2 .. note:: Database searches without hits may occur in BLAST+ output format 7 files. scikit-bio ignores these "empty" records: .. code-block:: none # BLASTP 2.2.31+ # Query: query1 # Subject: subject1 # 0 hits found Example legacy BLAST output format 9 file: .. code-block:: none # BLASTN 2.2.3 [May-13-2002] # Database: other_vertebrate # Query: AF178033 # Fields: Query id,Subject id,% identity,alignment length,mismatches,gap openings,q.\ start,q. end,s. start,s. end,e-value,bit score AF178033 EMORG:AF178033 100.00 811 0 0 1 811 1 811 0.0 1566.6 AF178033 EMORG:AF031394 99.63 811 3 0 1 811 99 909 0.0 1542.8 .. note:: scikit-bio requires fields to be consistent within a file. BLAST Column Types ------------------ The following column types are output by BLAST and supported by scikit-bio. For more information on these column types, see :mod:`skbio.io.format.blast6`. +-------------------+----------------------+ |Field Name |DataFrame Column Name | +===================+======================+ |query id |qseqid | +-------------------+----------------------+ |query gi |qgi | +-------------------+----------------------+ |query acc. |qacc | +-------------------+----------------------+ |query acc.ver |qaccver | +-------------------+----------------------+ |query length |qlen | +-------------------+----------------------+ |subject id |sseqid | +-------------------+----------------------+ |subject ids |sallseqid | +-------------------+----------------------+ |subject gi |sgi | +-------------------+----------------------+ |subject gis |sallgi | +-------------------+----------------------+ |subject acc. |sacc | +-------------------+----------------------+ |subject acc.ver |saccver | +-------------------+----------------------+ |subject accs |sallacc | +-------------------+----------------------+ |subject length |slen | +-------------------+----------------------+ |q\\. start |qstart | +-------------------+----------------------+ |q\\. end |qend | +-------------------+----------------------+ |s\\. start |sstart | +-------------------+----------------------+ |s\\. end |send | +-------------------+----------------------+ |query seq |qseq | +-------------------+----------------------+ |subject seq |sseq | +-------------------+----------------------+ |evalue |evalue | +-------------------+----------------------+ |bit score |bitscore | +-------------------+----------------------+ |score |score | +-------------------+----------------------+ |alignment length |length | +-------------------+----------------------+ |% identity |pident | +-------------------+----------------------+ |identical |nident | +-------------------+----------------------+ |mismatches |mismatch | +-------------------+----------------------+ |positives |positive | +-------------------+----------------------+ |gap opens |gapopen | +-------------------+----------------------+ |gaps |gaps | +-------------------+----------------------+ |% positives |ppos | +-------------------+----------------------+ |query/sbjct frames |frames | +-------------------+----------------------+ |query frame |qframe | +-------------------+----------------------+ |sbjct frame |sframe | +-------------------+----------------------+ |BTOP |btop | +-------------------+----------------------+ |subject tax ids |staxids | +-------------------+----------------------+ |subject sci names |sscinames | +-------------------+----------------------+ |subject com names |scomnames | +-------------------+----------------------+ |subject blast names|sblastnames | +-------------------+----------------------+ |subject super |sskingdoms | |kingdoms | | +-------------------+----------------------+ |subject title |stitle | +-------------------+----------------------+ |subject strand |sstrand | +-------------------+----------------------+ |subject titles |salltitles | +-------------------+----------------------+ |% query coverage |qcovs | |per subject | | +-------------------+----------------------+ |% query coverage |qcovhsp | |per hsp | | +-------------------+----------------------+ Examples -------- Suppose we have a BLAST+7 file: >>> from io import StringIO >>> import skbio.io >>> import pandas as pd >>> fs = '\\n'.join([ ... '# BLASTN 2.2.18+', ... '# Query: gi|1786181|gb|AE000111.1|AE000111', ... '# Database: ecoli', ... '# Fields: query acc., subject acc., evalue, q. start, q. end, s. st\ art, s. end', ... '# 5 hits found', ... 'AE000111\\tAE000111\\t0.0\\t1\\t10596\\t1\\t10596', ... 'AE000111\\tAE000174\\t8e-30\\t5565\\t5671\\t6928\\t6821', ... 'AE000111\\tAE000171\\t3e-24\\t5587\\t5671\\t2214\\t2130', ... 'AE000111\\tAE000425\\t6e-26\\t5587\\t5671\\t8552\\t8468' ... ]) >>> fh = StringIO(fs) Read the file into a ``pd.DataFrame``: >>> df = skbio.io.read(fh, into=pd.DataFrame) >>> df # doctest: +NORMALIZE_WHITESPACE qacc sacc evalue qstart qend sstart send 0 AE000111 AE000111 0.000000e+00 1.0 10596.0 1.0 10596.0 1 AE000111 AE000174 8.000000e-30 5565.0 5671.0 6928.0 6821.0 2 AE000111 AE000171 3.000000e-24 5587.0 5671.0 2214.0 2130.0 3 AE000111 AE000425 6.000000e-26 5587.0 5671.0 8552.0 8468.0 Suppose we have a legacy BLAST 9 file: >>> from io import StringIO >>> import skbio.io >>> import pandas as pd >>> fs = '\\n'.join([ ... '# BLASTN 2.2.3 [May-13-2002]', ... '# Database: other_vertebrate', ... '# Query: AF178033', ... '# Fields: ', ... 'Query id,Subject id,% identity,alignment length,mismatches,gap openin\ gs,q. start,q. end,s. start,s. end,e-value,bit score', ... 'AF178033\\tEMORG:AF178033\\t100.00\\t811\\t0\\t0\\t1\\t811\\t1\\t81\ 1\\t0.0\\t1566.6', ... 'AF178033\\tEMORG:AF178032\\t94.57\\t811\\t44\\t0\\t1\\t811\\t1\\t81\ 1\\t0.0\\t1217.7', ... 'AF178033\\tEMORG:AF178031\\t94.82\\t811\\t42\\t0\\t1\\t811\\t1\\t81\ 1\\t0.0\\t1233.5' ... ]) >>> fh = StringIO(fs) Read the file into a ``pd.DataFrame``: >>> df = skbio.io.read(fh, into=pd.DataFrame) >>> df[['qseqid', 'sseqid', 'pident']] # doctest: +NORMALIZE_WHITESPACE qseqid sseqid pident 0 AF178033 EMORG:AF178033 100.00 1 AF178033 EMORG:AF178032 94.57 2 AF178033 EMORG:AF178031 94.82 References ---------- .. [1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. .. [2] http://www.ncbi.nlm.nih.gov/books/NBK279682/ """ # noqa: D205, D415 # ---------------------------------------------------------------------------- # Copyright (c) 2013--, scikit-bio development team. # # Distributed under the terms of the Modified BSD License. # # The full license is in the file LICENSE.txt, distributed with this software. # ---------------------------------------------------------------------------- import pandas as pd from skbio.io import create_format, BLAST7FormatError from skbio.io.format._blast import _parse_blast_data blast7 = create_format("blast+7") column_converter = { "query id": "qseqid", "query gi": "qgi", "query acc.": "qacc", "query acc.ver": "qaccver", "query length": "qlen", "subject id": "sseqid", "subject ids": "sallseqid", "subject gi": "sgi", "subject gis": "sallgi", "subject acc.": "sacc", "subject acc.ver": "saccver", "subject accs.": "sallacc", "subject length": "slen", "q. start": "qstart", "q. end": "qend", "s. start": "sstart", "s. end": "send", "query seq": "qseq", "subject seq": "sseq", "evalue": "evalue", "bit score": "bitscore", "score": "score", "alignment length": "length", "% identity": "pident", "identical": "nident", "mismatches": "mismatch", "positives": "positive", "gap opens": "gapopen", "gaps": "gaps", "% positives": "ppos", "query/sbjct frames": "frames", "query frame": "qframe", "sbjct frame": "sframe", "BTOP": "btop", "subject tax ids": "staxids", "subject sci names": "sscinames", "subject com names": "scomnames", "subject blast names": "sblastnames", "subject super kingdoms": "sskingdoms", "subject title": "stitle", "subject titles": "salltitles", "subject strand": "sstrand", "% query coverage per subject": "qcovs", "% query coverage per hsp": "qcovhsp", "Query id": "qseqid", "Subject id": "sseqid", "gap openings": "gapopen", "e-value": "evalue", } @blast7.sniffer() def _blast7_sniffer(fh): # Smells a BLAST+7 file if the following conditions are present # -First line contains "BLAST" # -Second line contains "Query" or "Database" # -Third line starts with "Subject" or "Query" or "Database" lines = [line for _, line in zip(range(3), fh)] if len(lines) < 3: return False, {} if not lines[0].startswith("# BLAST"): return False, {} if not (lines[1].startswith("# Query:") or lines[1].startswith("# Database:")): return False, {} if not ( lines[2].startswith("# Subject:") or lines[2].startswith("# Query:") or lines[2].startswith("# Database:") ): return False, {} return True, {} @blast7.reader(pd.DataFrame, monkey_patch=False) def _blast7_to_data_frame(fh): line_num = 0 columns = None skiprows = [] for line in fh: if line == "# Fields: \n": # Identifies Legacy BLAST 9 data line = next(fh) line_num += 1 if columns is None: columns = _parse_fields(line, legacy=True) skiprows.append(line_num) else: next_columns = _parse_fields(line, legacy=True) if columns != next_columns: raise BLAST7FormatError( "Fields %r do not equal fields %r" % (columns, next_columns) ) skiprows.append(line_num) elif line.startswith("# Fields: "): # Identifies BLAST+7 data if columns is None: columns = _parse_fields(line) else: # Affirms data types do not differ throught file next_columns = _parse_fields(line) if columns != next_columns: raise BLAST7FormatError( "Fields %r do not equal fields %r" % (columns, next_columns) ) line_num += 1 if columns is None: # Affirms file contains BLAST data raise BLAST7FormatError("File contains no BLAST data.") fh.seek(0) return _parse_blast_data( fh, columns, BLAST7FormatError, "Number of fields (%r) does not equal number" " of data columns (%r).", comment="#", skiprows=skiprows, ) def _parse_fields(line, legacy=False): r"""Remove '\n' from fields line and returns fields as a list (columns).""" line = line.rstrip("\n") if legacy: fields = line.split(",") else: line = line.split("# Fields: ")[1] fields = line.split(", ") columns = [] for field in fields: if field not in column_converter: raise BLAST7FormatError( "Unrecognized field (%r)." " Supported fields: %r" % (field, set(column_converter.keys())) ) columns.append(column_converter[field]) return columns