| 12
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
 127
 128
 129
 130
 131
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 
 | """PHYLIP multiple sequence alignment format (:mod:`skbio.io.format.phylip`)
=========================================================================
.. currentmodule:: skbio.io.format.phylip
The PHYLIP file format stores a multiple sequence alignment. The format was
originally defined and used in Joe Felsenstein's PHYLIP package [1]_, and has
since been supported by several other bioinformatics tools (e.g., RAxML [2]_).
See [3]_ for the original format description, and [4]_ and [5]_ for additional
descriptions.
An example PHYLIP-formatted file taken from [3]_::
          5    42
    Turkey    AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
    Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
    H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
    Chimp     AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
    Gorilla   AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
.. note:: Original copyright notice for the above PHYLIP file:
   *(c) Copyright 1986-2008 by The University of Washington. Written by Joseph
   Felsenstein. Permission is granted to copy this document provided that no
   fee is charged for it and that this copyright notice is not removed.*
Format Support
--------------
**Has Sniffer: Yes**
+------+------+---------------------------------------------------------------+
|Reader|Writer|                          Object Class                         |
+======+======+===============================================================+
|Yes   |Yes   |:mod:`skbio.alignment.TabularMSA`                              |
+------+------+---------------------------------------------------------------+
Format Specification
--------------------
PHYLIP format is a plain text format containing exactly two sections: a header
describing the dimensions of the alignment, followed by the multiple sequence
alignment itself.
The format described here is "strict" PHYLIP, as described in [4]_. Strict
PHYLIP requires that each sequence identifier is exactly 10 characters long
(padded with spaces as necessary). Other bioinformatics tools (e.g., RAxML) may
relax this rule to allow for longer sequence identifiers. See the
**Alignment Section** below for more details.
The format described here is "sequential" format. The original PHYLIP format
specification [3]_ describes both sequential and interleaved formats.
.. note:: scikit-bio currently supports reading and writing strict, sequential
   PHYLIP-formatted files. Relaxed and/or interleaved PHYLIP formats are not
   supported.
Header Section
^^^^^^^^^^^^^^
The header consists of a single line describing the dimensions of the
alignment. It **must** be the first line in the file. The header consists of
optional spaces, followed by two positive integers (``n`` and ``m``) separated
by one or more spaces. The first integer (``n``) specifies the number of
sequences (i.e., the number of rows) in the alignment. The second integer
(``m``) specifies the length of the sequences (i.e., the number of columns) in
the alignment. The smallest supported alignment dimensions are 1x1.
.. note:: scikit-bio will write the PHYLIP format header *without* preceding
   spaces, and with only a single space between ``n`` and ``m``.
   PHYLIP format *does not* support blank line(s) between the header and the
   alignment.
Alignment Section
^^^^^^^^^^^^^^^^^
The alignment section immediately follows the header. It consists of ``n``
lines (rows), one for each sequence in the alignment. Each row consists of a
sequence identifier (ID) and characters in the sequence, in fixed width format.
The sequence ID can be up to 10 characters long. IDs less than 10 characters
must have spaces appended to them to reach the 10 character fixed width. Within
an ID, all characters except newlines are supported, including spaces,
underscores, and numbers.
.. note:: When reading a PHYLIP-formatted file into an
   ``skbio.alignment.TabularMSA`` object, sequence identifiers/labels are
   stored as ``TabularMSA`` index labels (``index`` property).
   When writing an ``skbio.alignment.TabularMSA`` object as a PHYLIP-formatted
   file, ``TabularMSA`` index labels will be converted to strings and written
   as sequence identifiers/labels.
   scikit-bio supports the empty string (``''``) as a valid sequence ID. An
   empty ID will be padded with 10 spaces when writing.
Sequence characters immediately follow the sequence ID. They *must* start at
the 11th character in the line, as the first 10 characters are reserved for the
sequence ID. While PHYLIP format does not explicitly restrict the set of
supported characters that may be used to represent a sequence, the original
format description [3]_ specifies the IUPAC nucleic acid lexicon for DNA or RNA
sequences, and the IUPAC protein lexicon for protein sequences. The original
PHYLIP specification uses ``-`` as a gap character, though older versions also
supported ``.``. The sequence characters may contain optional spaces (e.g., to
improve readability), and both upper and lower case characters are supported.
.. note:: scikit-bio will read/write a PHYLIP-formatted file as long as the
   alignment's sequence characters are valid for the type of in-memory sequence
   object being read into or written from. This differs from the PHYLIP
   specification, which states that a PHYLIP-formatted file can only contain
   valid IUPAC characters. See the ``constructor`` format parameter below for
   details.
   Since scikit-bio supports both ``-`` and ``.`` as gap characters (e.g., in
   ``DNA``, ``RNA``, and ``Protein`` sequence objects), both are supported when
   reading/writing a PHYLIP-formatted file.
   When writing a PHYLIP-formatted file, scikit-bio will split up each sequence
   into chunks that are 10 characters long. Each chunk will be separated by a
   single space. The sequence will always appear on a single line (sequential
   format). It will *not* be wrapped across multiple lines. Sequences are
   chunked in this manner for improved readability, and because most example
   PHYLIP files are chunked in a similar way (e.g., see the example file
   above). Note that this chunking is not required when reading
   PHYLIP-formatted files, nor by the PHYLIP format specification itself.
Format Parameters
-----------------
The only supported format parameter is ``constructor``, which specifies the
type of in-memory sequence object to read each aligned sequence into. This must
be a subclass of ``GrammaredSequence`` (e.g., ``DNA``, ``RNA``, ``Protein``)
and is a required format parameter. For example, if you know that the PHYLIP
file you're reading contains DNA sequences, you would pass ``constructor=DNA``
to the reader call.
Examples
--------
Let's create a ``TabularMSA`` with three DNA sequences:
>>> from skbio import TabularMSA, DNA
>>> seqs = [DNA('ACCGTTGTA-GTAGCT', metadata={'id':'seq1'}),
...         DNA('A--GTCGAA-GTACCT', metadata={'id':'sequence-2'}),
...         DNA('AGAGTTGAAGGTATCT', metadata={'id':'3'})]
>>> msa = TabularMSA(seqs, minter='id')
>>> msa
TabularMSA[DNA]
----------------------
Stats:
    sequence count: 3
    position count: 16
----------------------
ACCGTTGTA-GTAGCT
A--GTCGAA-GTACCT
AGAGTTGAAGGTATCT
>>> msa.index
Index(['seq1', 'sequence-2', '3'], dtype='object')
Now let's write the ``TabularMSA`` to file in PHYLIP format and take a look at
the output:
>>> from io import StringIO
>>> fh = StringIO()
>>> print(msa.write(fh, format='phylip').getvalue())
3 16
seq1      ACCGTTGTA- GTAGCT
sequence-2A--GTCGAA- GTACCT
3         AGAGTTGAAG GTATCT
<BLANKLINE>
>>> fh.close()
Notice that the 16-character sequences were split into two chunks, and that
each sequence appears on a single line (sequential format). Also note that each
sequence ID is padded with spaces to 10 characters in order to produce a fixed
width column.
If the index labels in a ``TabularMSA`` surpass the 10-character limit, an
error will be raised when writing:
>>> msa.index = ['seq1', 'long-sequence-2', 'seq3']
>>> fh = StringIO()
>>> msa.write(fh, format='phylip')
Traceback (most recent call last):
    ...
skbio.io._exception.PhylipFormatError: ``TabularMSA`` can only be written in \
PHYLIP format if all sequence index labels have 10 or fewer characters. Found \
sequence with index label 'long-sequence-2' that exceeds this limit. Use \
``TabularMSA.reassign_index`` to assign shorter index labels.
>>> fh.close()
One way to work around this is to assign shorter index labels. The recommended
way to do this is via ``TabularMSA.reassign_index``. For example, to reassign
default integer index labels:
>>> msa.reassign_index()
>>> msa.index
RangeIndex(start=0, stop=3, step=1)
We can now write the ``TabularMSA`` in PHYLIP format:
>>> fh = StringIO()
>>> print(msa.write(fh, format='phylip').getvalue())
3 16
0         ACCGTTGTA- GTAGCT
1         A--GTCGAA- GTACCT
2         AGAGTTGAAG GTATCT
<BLANKLINE>
>>> fh.close()
References
----------
.. [1] http://evolution.genetics.washington.edu/phylip.html
.. [2] RAxML Version 8: A tool for Phylogenetic Analysis and
   Post-Analysis of Large Phylogenies". In Bioinformatics, 2014
.. [3] http://evolution.genetics.washington.edu/phylip/doc/sequence.html
.. [4] http://www.phylo.org/tools/obsolete/phylip.html
.. [5] http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format
"""  # noqa: D205, D415
# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE.txt, distributed with this software.
# ----------------------------------------------------------------------------
from skbio.alignment import TabularMSA
from skbio.io import create_format, PhylipFormatError
from skbio.util._misc import chunk_str
phylip = create_format("phylip")
@phylip.sniffer()
def _phylip_sniffer(fh):
    # Strategy:
    #   Read the header and a single sequence; verify that the sequence length
    #   matches the header information.  Do not verify that the total number of
    #   lines matches the header information, since that would require reading
    #   the whole file.
    try:
        header = next(_line_generator(fh))
        _, seq_len = _validate_header(header)
        line = next(_line_generator(fh))
        _validate_line(line, seq_len)
    except (StopIteration, PhylipFormatError):
        return False, {}
    return True, {}
@phylip.reader(TabularMSA)
def _phylip_to_tabular_msa(fh, constructor=None):
    if constructor is None:
        raise ValueError("Must provide `constructor`.")
    seqs = []
    index = []
    for seq, ID in _parse_phylip_raw(fh):
        seqs.append(constructor(seq))
        index.append(ID)
    return TabularMSA(seqs, index=index)
@phylip.writer(TabularMSA)
def _tabular_msa_to_phylip(obj, fh):
    sequence_count = obj.shape.sequence
    if sequence_count < 1:
        raise PhylipFormatError(
            "TabularMSA can only be written in PHYLIP format if there is at "
            "least one sequence in the alignment."
        )
    sequence_length = obj.shape.position
    if sequence_length < 1:
        raise PhylipFormatError(
            "TabularMSA can only be written in PHYLIP format if there is at "
            "least one position in the alignment."
        )
    chunk_size = 10
    labels = [str(label) for label in obj.index]
    for label in labels:
        if len(label) > chunk_size:
            raise PhylipFormatError(
                "``TabularMSA`` can only be written in PHYLIP format if all "
                "sequence index labels have %d or fewer characters. Found "
                "sequence with index label '%s' that exceeds this limit. Use "
                "``TabularMSA.reassign_index`` to assign shorter index labels."
                % (chunk_size, label)
            )
    fh.write("{0:d} {1:d}\n".format(sequence_count, sequence_length))
    fmt = "{0:%d}{1}\n" % chunk_size
    for label, seq in zip(labels, obj):
        chunked_seq = chunk_str(str(seq), chunk_size, " ")
        fh.write(fmt.format(label, chunked_seq))
def _validate_header(header):
    header_vals = header.split()
    try:
        n_seqs, seq_len = [int(x) for x in header_vals]
        if n_seqs < 1 or seq_len < 1:
            raise PhylipFormatError(
                "The number of sequences and the length must be positive."
            )
    except ValueError:
        raise PhylipFormatError(
            "Found non-header line when attempting to read the 1st record "
            "(header line should have two space-separated integers): "
            '"%s"' % header
        )
    return n_seqs, seq_len
def _validate_line(line, seq_len):
    if not line:
        raise PhylipFormatError("Empty lines are not allowed.")
    ID = line[:10].strip()
    seq = line[10:].replace(" ", "")
    if len(seq) != seq_len:
        raise PhylipFormatError(
            "The length of sequence %s is not %s as specified in the header."
            % (ID, seq_len)
        )
    return (seq, ID)
def _parse_phylip_raw(fh):
    """Raw parser for PHYLIP files.
    Returns a list of raw (seq, id) values.  It is the responsibility of the
    caller to construct the correct in-memory object to hold the data.
    """
    # Note: this returns the full data instead of yielding each sequence,
    # because the header specifies the number of sequences, so the file cannot
    # be validated until it's read completely.
    # File should have a single header on the first line.
    try:
        header = next(_line_generator(fh))
    except StopIteration:
        raise PhylipFormatError("This file is empty.")
    n_seqs, seq_len = _validate_header(header)
    # All following lines should be ID+sequence. No blank lines are allowed.
    data = []
    for line in _line_generator(fh):
        data.append(_validate_line(line, seq_len))
    if len(data) != n_seqs:
        raise PhylipFormatError(
            "The number of sequences is not %s " % n_seqs
            + "as specified in the header."
        )
    return data
def _line_generator(fh):
    """Just remove linebreak characters and yield lines."""
    for line in fh:
        yield line.rstrip("\n")
 |