1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357
|
"""
PHYLIP multiple sequence alignment format (:mod:`skbio.io.format.phylip`)
=========================================================================
.. currentmodule:: skbio.io.format.phylip
The PHYLIP file format stores a multiple sequence alignment. The format was
originally defined and used in Joe Felsenstein's PHYLIP package [1]_, and has
since been supported by several other bioinformatics tools (e.g., RAxML [2]_).
See [3]_ for the original format description, and [4]_ and [5]_ for additional
descriptions.
An example PHYLIP-formatted file taken from [3]_::
5 42
Turkey AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
Chimp AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
Gorilla AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
.. note:: Original copyright notice for the above PHYLIP file:
*(c) Copyright 1986-2008 by The University of Washington. Written by Joseph
Felsenstein. Permission is granted to copy this document provided that no
fee is charged for it and that this copyright notice is not removed.*
Format Support
--------------
**Has Sniffer: Yes**
+------+------+---------------------------------------------------------------+
|Reader|Writer| Object Class |
+======+======+===============================================================+
|Yes |Yes |:mod:`skbio.alignment.TabularMSA` |
+------+------+---------------------------------------------------------------+
Format Specification
--------------------
PHYLIP format is a plain text format containing exactly two sections: a header
describing the dimensions of the alignment, followed by the multiple sequence
alignment itself.
The format described here is "strict" PHYLIP, as described in [4]_. Strict
PHYLIP requires that each sequence identifier is exactly 10 characters long
(padded with spaces as necessary). Other bioinformatics tools (e.g., RAxML) may
relax this rule to allow for longer sequence identifiers. See the
**Alignment Section** below for more details.
The format described here is "sequential" format. The original PHYLIP format
specification [3]_ describes both sequential and interleaved formats.
.. note:: scikit-bio currently supports reading and writing strict, sequential
PHYLIP-formatted files. Relaxed and/or interleaved PHYLIP formats are not
supported.
Header Section
^^^^^^^^^^^^^^
The header consists of a single line describing the dimensions of the
alignment. It **must** be the first line in the file. The header consists of
optional spaces, followed by two positive integers (``n`` and ``m``) separated
by one or more spaces. The first integer (``n``) specifies the number of
sequences (i.e., the number of rows) in the alignment. The second integer
(``m``) specifies the length of the sequences (i.e., the number of columns) in
the alignment. The smallest supported alignment dimensions are 1x1.
.. note:: scikit-bio will write the PHYLIP format header *without* preceding
spaces, and with only a single space between ``n`` and ``m``.
PHYLIP format *does not* support blank line(s) between the header and the
alignment.
Alignment Section
^^^^^^^^^^^^^^^^^
The alignment section immediately follows the header. It consists of ``n``
lines (rows), one for each sequence in the alignment. Each row consists of a
sequence identifier (ID) and characters in the sequence, in fixed width format.
The sequence ID can be up to 10 characters long. IDs less than 10 characters
must have spaces appended to them to reach the 10 character fixed width. Within
an ID, all characters except newlines are supported, including spaces,
underscores, and numbers.
.. note:: When reading a PHYLIP-formatted file into an
``skbio.alignment.TabularMSA`` object, sequence identifiers/labels are
stored as ``TabularMSA`` index labels (``index`` property).
When writing an ``skbio.alignment.TabularMSA`` object as a PHYLIP-formatted
file, ``TabularMSA`` index labels will be converted to strings and written
as sequence identifiers/labels.
scikit-bio supports the empty string (``''``) as a valid sequence ID. An
empty ID will be padded with 10 spaces when writing.
Sequence characters immediately follow the sequence ID. They *must* start at
the 11th character in the line, as the first 10 characters are reserved for the
sequence ID. While PHYLIP format does not explicitly restrict the set of
supported characters that may be used to represent a sequence, the original
format description [3]_ specifies the IUPAC nucleic acid lexicon for DNA or RNA
sequences, and the IUPAC protein lexicon for protein sequences. The original
PHYLIP specification uses ``-`` as a gap character, though older versions also
supported ``.``. The sequence characters may contain optional spaces (e.g., to
improve readability), and both upper and lower case characters are supported.
.. note:: scikit-bio will read/write a PHYLIP-formatted file as long as the
alignment's sequence characters are valid for the type of in-memory sequence
object being read into or written from. This differs from the PHYLIP
specification, which states that a PHYLIP-formatted file can only contain
valid IUPAC characters. See the ``constructor`` format parameter below for
details.
Since scikit-bio supports both ``-`` and ``.`` as gap characters (e.g., in
``DNA``, ``RNA``, and ``Protein`` sequence objects), both are supported when
reading/writing a PHYLIP-formatted file.
When writing a PHYLIP-formatted file, scikit-bio will split up each sequence
into chunks that are 10 characters long. Each chunk will be separated by a
single space. The sequence will always appear on a single line (sequential
format). It will *not* be wrapped across multiple lines. Sequences are
chunked in this manner for improved readability, and because most example
PHYLIP files are chunked in a similar way (e.g., see the example file
above). Note that this chunking is not required when reading
PHYLIP-formatted files, nor by the PHYLIP format specification itself.
Format Parameters
-----------------
The only supported format parameter is ``constructor``, which specifies the
type of in-memory sequence object to read each aligned sequence into. This must
be a subclass of ``GrammaredSequence`` (e.g., ``DNA``, ``RNA``, ``Protein``)
and is a required format parameter. For example, if you know that the PHYLIP
file you're reading contains DNA sequences, you would pass ``constructor=DNA``
to the reader call.
Examples
--------
Let's create a ``TabularMSA`` with three DNA sequences:
>>> from skbio import TabularMSA, DNA
>>> seqs = [DNA('ACCGTTGTA-GTAGCT', metadata={'id':'seq1'}),
... DNA('A--GTCGAA-GTACCT', metadata={'id':'sequence-2'}),
... DNA('AGAGTTGAAGGTATCT', metadata={'id':'3'})]
>>> msa = TabularMSA(seqs, minter='id')
>>> msa
TabularMSA[DNA]
----------------------
Stats:
sequence count: 3
position count: 16
----------------------
ACCGTTGTA-GTAGCT
A--GTCGAA-GTACCT
AGAGTTGAAGGTATCT
>>> msa.index
Index(['seq1', 'sequence-2', '3'], dtype='object')
Now let's write the ``TabularMSA`` to file in PHYLIP format and take a look at
the output:
>>> from io import StringIO
>>> fh = StringIO()
>>> print(msa.write(fh, format='phylip').getvalue())
3 16
seq1 ACCGTTGTA- GTAGCT
sequence-2A--GTCGAA- GTACCT
3 AGAGTTGAAG GTATCT
<BLANKLINE>
>>> fh.close()
Notice that the 16-character sequences were split into two chunks, and that
each sequence appears on a single line (sequential format). Also note that each
sequence ID is padded with spaces to 10 characters in order to produce a fixed
width column.
If the index labels in a ``TabularMSA`` surpass the 10-character limit, an
error will be raised when writing:
>>> msa.index = ['seq1', 'long-sequence-2', 'seq3']
>>> fh = StringIO()
>>> msa.write(fh, format='phylip')
Traceback (most recent call last):
...
skbio.io._exception.PhylipFormatError: ``TabularMSA`` can only be written in \
PHYLIP format if all sequence index labels have 10 or fewer characters. Found \
sequence with index label 'long-sequence-2' that exceeds this limit. Use \
``TabularMSA.reassign_index`` to assign shorter index labels.
>>> fh.close()
One way to work around this is to assign shorter index labels. The recommended
way to do this is via ``TabularMSA.reassign_index``. For example, to reassign
default integer index labels:
>>> msa.reassign_index()
>>> msa.index
RangeIndex(start=0, stop=3, step=1)
We can now write the ``TabularMSA`` in PHYLIP format:
>>> fh = StringIO()
>>> print(msa.write(fh, format='phylip').getvalue())
3 16
0 ACCGTTGTA- GTAGCT
1 A--GTCGAA- GTACCT
2 AGAGTTGAAG GTATCT
<BLANKLINE>
>>> fh.close()
References
----------
.. [1] http://evolution.genetics.washington.edu/phylip.html
.. [2] RAxML Version 8: A tool for Phylogenetic Analysis and
Post-Analysis of Large Phylogenies". In Bioinformatics, 2014
.. [3] http://evolution.genetics.washington.edu/phylip/doc/sequence.html
.. [4] http://www.phylo.org/tools/obsolete/phylip.html
.. [5] http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format
"""
# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------
from skbio.alignment import TabularMSA
from skbio.io import create_format, PhylipFormatError
from skbio.util._misc import chunk_str
phylip = create_format('phylip')
@phylip.sniffer()
def _phylip_sniffer(fh):
# Strategy:
# Read the header and a single sequence; verify that the sequence length
# matches the header information. Do not verify that the total number of
# lines matches the header information, since that would require reading
# the whole file.
try:
header = next(_line_generator(fh))
_, seq_len = _validate_header(header)
line = next(_line_generator(fh))
_validate_line(line, seq_len)
except (StopIteration, PhylipFormatError):
return False, {}
return True, {}
@phylip.reader(TabularMSA)
def _phylip_to_tabular_msa(fh, constructor=None):
if constructor is None:
raise ValueError("Must provide `constructor`.")
seqs = []
index = []
for seq, ID in _parse_phylip_raw(fh):
seqs.append(constructor(seq))
index.append(ID)
return TabularMSA(seqs, index=index)
@phylip.writer(TabularMSA)
def _tabular_msa_to_phylip(obj, fh):
sequence_count = obj.shape.sequence
if sequence_count < 1:
raise PhylipFormatError(
"TabularMSA can only be written in PHYLIP format if there is at "
"least one sequence in the alignment.")
sequence_length = obj.shape.position
if sequence_length < 1:
raise PhylipFormatError(
"TabularMSA can only be written in PHYLIP format if there is at "
"least one position in the alignment.")
chunk_size = 10
labels = [str(label) for label in obj.index]
for label in labels:
if len(label) > chunk_size:
raise PhylipFormatError(
"``TabularMSA`` can only be written in PHYLIP format if all "
"sequence index labels have %d or fewer characters. Found "
"sequence with index label '%s' that exceeds this limit. Use "
"``TabularMSA.reassign_index`` to assign shorter index labels."
% (chunk_size, label))
fh.write('{0:d} {1:d}\n'.format(sequence_count, sequence_length))
fmt = '{0:%d}{1}\n' % chunk_size
for label, seq in zip(labels, obj):
chunked_seq = chunk_str(str(seq), chunk_size, ' ')
fh.write(fmt.format(label, chunked_seq))
def _validate_header(header):
header_vals = header.split()
try:
n_seqs, seq_len = [int(x) for x in header_vals]
if n_seqs < 1 or seq_len < 1:
raise PhylipFormatError(
'The number of sequences and the length must be positive.')
except ValueError:
raise PhylipFormatError(
'Found non-header line when attempting to read the 1st record '
'(header line should have two space-separated integers): '
'"%s"' % header)
return n_seqs, seq_len
def _validate_line(line, seq_len):
if not line:
raise PhylipFormatError("Empty lines are not allowed.")
ID = line[:10].strip()
seq = line[10:].replace(' ', '')
if len(seq) != seq_len:
raise PhylipFormatError(
"The length of sequence %s is not %s as specified in the header."
% (ID, seq_len))
return (seq, ID)
def _parse_phylip_raw(fh):
"""Raw parser for PHYLIP files.
Returns a list of raw (seq, id) values. It is the responsibility of the
caller to construct the correct in-memory object to hold the data.
"""
# Note: this returns the full data instead of yielding each sequence,
# because the header specifies the number of sequences, so the file cannot
# be validated until it's read completely.
# File should have a single header on the first line.
try:
header = next(_line_generator(fh))
except StopIteration:
raise PhylipFormatError("This file is empty.")
n_seqs, seq_len = _validate_header(header)
# All following lines should be ID+sequence. No blank lines are allowed.
data = []
for line in _line_generator(fh):
data.append(_validate_line(line, seq_len))
if len(data) != n_seqs:
raise PhylipFormatError(
"The number of sequences is not %s " % n_seqs +
"as specified in the header.")
return data
def _line_generator(fh):
"""Just remove linebreak characters and yield lines.
"""
for line in fh:
yield line.rstrip('\n')
|