| 12
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
 127
 128
 129
 130
 131
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 
 | """
Stockholm format (:mod:`skbio.io.format.stockholm`)
===================================================
.. currentmodule:: skbio.io.format.stockholm
The Stockholm format is a multiple sequence alignment format (MSA) that
optionally supports storing arbitrary alignment features (metadata). Features
can be placed into four different categories: GF, GS, GR, and GC (described in
more detail below).
An example Stockholm file, taken from [1]_:
.. code-block:: none
    # STOCKHOLM 1.0
    #=GF ID    UPSK
    #=GF SE    Predicted; Infernal
    #=GF SS    Published; PMID 9223489
    #=GF RN    [1]
    #=GF RM    9223489
    #=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
    #=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent \
RNA
    #=GF RT    polymerase.
    #=GF RA    Deiman BA, Kortlever RM, Pleij CW;
    #=GF RL    J Virol 1997;71:5990-5996.
    AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
    M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
    J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
    M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
    #=GC SS_cons                   .AAA....<<<<aaa....>>>>
    //
Format Support
--------------
**Has Sniffer: Yes**
**State: Experimental as of 0.4.2.**
+------+------+---------------------------------------------------------------+
|Reader|Writer|                          Object Class                         |
+======+======+===============================================================+
|Yes   |Yes   |:mod:`skbio.alignment.TabularMSA`                              |
+------+------+---------------------------------------------------------------+
Format Specification
--------------------
The Stockholm format consists of a header, a multiple sequence alignment,
associated metadata (features), and a footer.
Header
^^^^^^
The first line of a Stockholm file must be the following header:
.. code-block:: none
   # STOCKHOLM 1.0
Multiple Sequence Alignment
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sequence lines consist of a sequence name, followed by whitespace, followed by
the aligned sequence. For example::
    seq1 ACG-T-GGT
    seq2 ACCGTTCG-
Sequence names (``seq1``, ``seq2``) are stored in the ``TabularMSA``
``index``.
.. note:: scikit-bio currently supports reading Stockholm files where each
   sequence is contained on a single line. Interleaved/wrap-around Stockholm
   files are not supported. When writing, each sequence will be placed on its
   own line.
.. warning:: Sequence names must be unique in the Stockholm file. Likewise,
   when writing from a ``TabularMSA``, ``index`` must be unique.
Metadata
^^^^^^^^
Stockholm files support storing arbitrary metadata (features) about the MSA.
All metadata described in the following sections are optional and may appear in
any order. Metadata "mark-up" lines begin with either ``#=GF``, ``#=GS``,
``#=GR``, or ``#=GC``, and each line describes a single feature of the
alignment.
.. note:: Stockholm format supports generic features. [1]_ and [2]_ provide a
   list of common features output by Pfam/Rfam. scikit-bio does not
   require that these features are present. These features are processed in the
   same way as any arbitrary feature would be, as a simple key-value pair of
   strings. When writing, feature names, feature data, and sequence names are
   converted to type ``str``.
.. note:: When writing a Stockholm file, scikit-bio will place the metadata in
   the format's recommended order:
   - GF: Above the alignment
   - GS: Above the alignment (after GF)
   - GR: Below corresponding sequence
   - GC: Below the alignment
GF metadata
+++++++++++
Data relating to the multiple sequence alignment as a whole, such as authors or
number of sequences in the alignment. Starts with ``#=GF`` followed by a
feature name and data relating to the feature. Typically comes first in a
Stockholm file.
For example (taken from [2]_):
.. code-block:: none
    #=GF DE CBS domain
Where ``DE`` is the feature name and ``CBS Domain`` is the feature data.
GF metadata is stored in the ``TabularMSA`` ``metadata`` dictionary.
.. note:: When reading, duplicate GF feature names will have their values
   concatenated in the order they appear in the file. Concatenation will
   also add a space between lines if one isn't already there in order to avoid
   joining words together. When writing, each GF feature will be placed on its
   own line, regardless of length.
.. note:: Trees labelled with ``NH``/``TN`` are handled differently than other
   GF features. When reading a Stockholm file with these features, the reader
   follows the rules described in [2]_. Trees split over multiple lines will
   have their values concatenated. Unlike other GF features, trees will never
   have a space added when they are concatenated.
   A single tree without an identifier will be stored as::
       metadata = {
           'NH': 'tree in NHX format'
       }
   A single tree with an identifier will be stored as::
       metadata = {
           'NH': {
               'tree-id': 'tree in NHX format'
           }
       }
   Multiple trees (which must have identifiers) will be stored as::
       metadata = {
           'NH': {
               'tree-id-1': 'tree in NHX format',
               'tree-id-2': 'tree in NHX format'
           }
       }
.. note:: References labelled with ``RN``/``RM``/``RT``/``RA``/``RL``/``RC``
   are handled differently than other GF features. When reading a Stockholm
   file with these features, the reader populates a list of dictionaries,
   where each dictionary represents a single reference. The list contains
   references in the order they appear in the file, regardless of the value
   provided for ``RN``. If a reference does not include all possible reference
   tags (e.g. ``RC`` is missing), the dictionary will only contain the
   reference tags present for that reference. When writing, the writer adds a
   reference number (``RN``) line before writing each reference, for example:
   .. code-block:: none
      #=GF RN [1]
      #=GF RA Kestrel Gorlick
      ...
      #=GF RN [2]
      ...
   References will be stored as::
       metadata = {
           'RN': [{
               'RM': 'reference medline',
               'RT': 'reference title',
               'RA': 'reference author',
               'RL': 'reference location',
               'RC': 'reference comment'
           }, {
               'RM': 'reference medline',
               ...
           }]
       }
GS metadata
+++++++++++
Data relating to a specific sequence in the multiple sequence alignment.
Starts with ``#=GS`` followed by the sequence name followed by a feature name
and data relating to the feature. Typically comes after GF metadata in a
Stockholm file.
For example (taken from [2]_):
.. code-block:: none
    #=GS O83071/259-312 AC O83071
Where ``O83071/259-312`` is the sequence name, ``AC`` is the feature name, and
``083071`` is the feature data.
GS metadata is stored in the sequence-specific ``metadata`` dictionary.
.. note:: When reading, duplicate GS feature names will have their values
   concatenated in the order they appear in the file. Concatenation will
   also add a space between lines if one isn't already there in order to avoid
   joining words together. When writing, each GS feature will be placed on its
   own line, regardless of length.
GR metadata
+++++++++++
Data relating to the columns of a specific sequence in a multiple sequence
alignment. Starts with ``#=GR`` followed by the sequence name followed by a
feature name and data relating to the feature, one character per column.
Typically comes after the sequence line it relates to.
For example (taken from [2]_):
.. code-block:: none
    #=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
Where ``O31698/18-71`` is the sequence name, ``SS`` is the feature name, and
``CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH`` is the feature data.
GR metadata is stored in sequence-specific ``positional_metadata``.
.. note:: Duplicate GR feature names attributed to a single sequence are
   disallowed.
GC metadata
+++++++++++
Data relating to the columns of the multiple sequence alignment as a whole.
Starts with ``#=GC`` followed by a feature name and data relating to the
feature, one character per column. Typically comes at the end of the multiple
sequence alignment.
For example (taken from [2]_):
.. code-block:: none
    #=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
Where ``SS_cons`` is the feature name and
``CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH`` is the feature data.
GC metadata is stored in ``TabularMSA`` ``positional_metadata``.
.. note:: Duplicate GC feature names are disallowed.
Footer
^^^^^^
The final line of a Stockholm file must be the following footer::
    //
.. note:: scikit-bio currently supports reading a Stockholm file containing a
   single MSA. If the file contains more than one MSA, only the first MSA will
   be read into a ``TabularMSA``.
Format Parameters
-----------------
The only supported format parameter is ``constructor``, which specifies the
type of in-memory sequence object to read each aligned sequence into. This must
be a subclass of ``GrammaredSequence`` (e.g., ``DNA``, ``RNA``, ``Protein``)
and is a required format parameter. For example, if you know that the Stockholm
file you're reading contains DNA sequences, you would pass ``constructor=DNA``
to the reader call.
Examples
--------
Suppose we have a Stockholm file containing an MSA of protein sequences
(modified from [2]_):
>>> import skbio.io
>>> from io import StringIO
>>> from skbio import Protein, TabularMSA
>>> fs = '\\n'.join([
...         '# STOCKHOLM 1.0',
...         '#=GF CC CBS domains are small intracellular modules mostly'
...         ' found',
...         '#=GF CC in 2 or four copies within a protein.',
...         '#=GS O83071/192-246 AC O83071',
...         '#=GS O31698/88-139 OS Bacillus subtilis',
...         'O83071/192-246          MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV',
...         '#=GR O83071/192-246 SA  999887756453524252..55152525....36463',
...         'O83071/259-312          MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA',
...         'O31698/18-71            MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI',
...         'O31698/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFV',
...         'O31699/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFV',
...         '#=GR O31699/88-139 AS   ________________*____________________',
...         '#=GR O31699/88-139 IN   ____________1______________2_________',
...         '#=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEE',
...         '//'
...     ])
>>> fh = StringIO(fs)
>>> msa = TabularMSA.read(fh, constructor=Protein)
>>> msa # doctest: +NORMALIZE_WHITESPACE
TabularMSA[Protein]
----------------------------------------------------------------------
Metadata:
    'CC': 'CBS domains are small intracellular modules mostly found in
           2 or four copies within a protein.'
Positional metadata:
    'SS_cons': <dtype: object>
Stats:
    sequence count: 5
    position count: 37
----------------------------------------------------------------------
MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV
MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA
MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI
EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
The sequence names are stored in the ``index``:
>>> msa.index
Index(['O83071/192-246', 'O83071/259-312', 'O31698/18-71', 'O31698/88-139',
       'O31699/88-139'],
      dtype='object')
The ``TabularMSA`` has GF metadata stored in its ``metadata`` dictionary:
>>> msa.metadata
OrderedDict([('CC', 'CBS domains are small intracellular modules mostly found \
in 2 or four copies within a protein.')])
GC metadata is stored in the ``TabularMSA`` ``positional_metadata``:
>>> msa.positional_metadata  # doctest: +ELLIPSIS
   SS_cons
0        C
1        C
2        C
3        C
4        C
5        H
6        H
7        H
8        H
9        H
...
GS metadata is stored in the sequence-specific ``metadata`` dictionary:
>>> msa[0].metadata
OrderedDict([('AC', 'O83071')])
GR metadata is stored in sequence-specific ``positional_metadata``:
>>> msa[4].positional_metadata  # doctest: +ELLIPSIS
   AS IN
0   _  _
1   _  _
2   _  _
3   _  _
4   _  _
5   _  _
6   _  _
7   _  _
8   _  _
9   _  _
...
Let's write this ``TabularMSA`` in Stockholm format:
>>> fh = StringIO()
>>> _ = msa.write(fh, format='stockholm')
>>> print(fh.getvalue())
# STOCKHOLM 1.0
#=GF CC CBS domains are small intracellular modules mostly found in 2 or four \
copies within a protein.
#=GS O83071/192-246 AC O83071
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246         MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV
#=GR O83071/192-246 SA 999887756453524252..55152525....36463
O83071/259-312         MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA
O31698/18-71           MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI
O31698/88-139          EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
O31699/88-139          EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
#=GR O31699/88-139 AS  ________________*____________________
#=GR O31699/88-139 IN  ____________1______________2_________
#=GC SS_cons           CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEE
//
<BLANKLINE>
>>> fh.close()
References
==========
.. [1] https://en.wikipedia.org/wiki/Stockholm_format
.. [2] http://sonnhammer.sbc.su.se/Stockholm.html
"""
# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE.txt, distributed with this software.
# ----------------------------------------------------------------------------
from collections import OrderedDict
from skbio.alignment import TabularMSA
from skbio.sequence._grammared_sequence import GrammaredSequence
from skbio.io import create_format, StockholmFormatError
stockholm = create_format('stockholm')
_REFERENCE_TAGS = frozenset({'RM', 'RT', 'RA', 'RL', 'RC'})
@stockholm.sniffer()
def _stockholm_sniffer(fh):
    # Smells a Stockholm file if the following conditions are met:
    # - File isn't empty
    # - File contains correct header
    try:
        line = next(fh)
    except StopIteration:
        return False, {}
    if _is_header(line):
        return True, {}
    return False, {}
@stockholm.reader(TabularMSA)
def _stockholm_to_tabular_msa(fh, constructor=None):
    # Checks that user has passed required constructor parameter
    if constructor is None:
        raise ValueError("Must provide `constructor` parameter indicating the "
                         "type of sequences in the alignment. `constructor` "
                         "must be a subclass of `GrammaredSequence` "
                         "(e.g., `DNA`, `RNA`, `Protein`).")
    # Checks that contructor parameter is supported
    elif not issubclass(constructor, GrammaredSequence):
        raise TypeError("`constructor` must be a subclass of "
                        "`GrammaredSequence`.")
    # Checks that the file isn't empty
    try:
        line = next(fh)
    except StopIteration:
        raise StockholmFormatError("File is empty.")
    # Checks that the file follows basic format (includes the required header)
    if not _is_header(line):
        raise StockholmFormatError("File missing required Stockholm header "
                                   "line.")
    msa_data = _MSAData()
    for line in fh:
        if line.isspace():
            continue
        line = line.rstrip('\n')
        if _is_sequence_line(line):
            seq_name, seq_data = _parse_sequence_line(line)
            msa_data.add_sequence(seq_name, seq_data)
        elif line.startswith("#=GF"):
            feature_name, feature_data = _parse_gf_line(line)
            msa_data.add_gf_metadata(feature_name, feature_data)
        elif line.startswith("#=GS"):
            seq_name, feature_name, feature_data = _parse_gs_line(line)
            msa_data.add_gs_metadata(seq_name, feature_name, feature_data)
        elif line.startswith("#=GR"):
            seq_name, feature_name, feature_data = _parse_gr_line(line)
            msa_data.add_gr_metadata(seq_name, feature_name, feature_data)
        elif line.startswith('#=GC'):
            feature_name, feature_data = _parse_gc_line(line)
            msa_data.add_gc_metadata(feature_name, feature_data)
        elif _is_footer(line):
            break
        else:
            raise StockholmFormatError("Unrecognized line: %r" % line)
    if not _is_footer(line):
        raise StockholmFormatError('Final line does not conform to Stockholm '
                                   'format. Must contain only "//".')
    return msa_data.build_tabular_msa(constructor)
# For storing intermediate data used to construct a Sequence object.
class _MSAData:
    def __init__(self):
        self._seqs = {}
        self._seq_order = []
        self._metadata = OrderedDict()
        self._positional_metadata = OrderedDict()
    def add_sequence(self, seq_name, seq_data):
        if seq_name not in self._seqs:
            self._seqs[seq_name] = _SeqData(seq_name)
        self._seqs[seq_name].seq = seq_data
        self._seq_order.append(seq_name)
    def add_gf_metadata(self, feature_name, feature_data):
        # Handles first instance of labelled tree
        if feature_name == 'TN' and 'NH' not in self._metadata:
            self._metadata['NH'] = OrderedDict()
            self._metadata['NH'][feature_data] = ''
        # Handles second instance of labelled tree
        elif feature_name == 'TN' and 'NH' in self._metadata:
            if feature_data in self._metadata['NH']:
                raise StockholmFormatError("Tree name %r used multiple times "
                                           "in file." % feature_data)
            self._metadata['NH'][feature_data] = ''
        # Handles extra line(s) of an already created tree
        elif feature_name == 'NH' and feature_name in self._metadata:
            trees = self._metadata[feature_name]
            if isinstance(trees, OrderedDict):
                tree_id = next(reversed(trees))
                self._metadata[feature_name][tree_id] = (trees[tree_id] +
                                                         feature_data)
            else:
                self._metadata[feature_name] = (self._metadata[feature_name] +
                                                feature_data)
        elif feature_name == 'RN':
            if feature_name not in self._metadata:
                self._metadata[feature_name] = [OrderedDict()]
            else:
                self._metadata[feature_name].append(OrderedDict())
        elif feature_name in _REFERENCE_TAGS:
            if 'RN' not in self._metadata:
                raise StockholmFormatError("Expected 'RN' tag to precede "
                                           "'%s' tag." % feature_name)
            reference_dict = self._metadata['RN'][-1]
            if feature_name not in reference_dict:
                reference_dict[feature_name] = feature_data
            else:
                padding = _get_padding(reference_dict[feature_name])
                reference_dict[feature_name] += padding + feature_data
        elif feature_name in self._metadata:
            padding = _get_padding(self._metadata[feature_name][-1])
            self._metadata[feature_name] = (self._metadata[feature_name] +
                                            padding + feature_data)
        else:
            self._metadata[feature_name] = feature_data
    def add_gc_metadata(self, feature_name, feature_data):
        if feature_name in self._positional_metadata:
            _raise_duplicate_error("Found duplicate GC label %r."
                                   % feature_name)
        self._positional_metadata[feature_name] = feature_data
    def add_gs_metadata(self, seq_name, feature_name, feature_data):
        if seq_name not in self._seqs:
            self._seqs[seq_name] = _SeqData(seq_name)
        self._seqs[seq_name].add_metadata_feature(feature_name, feature_data)
    def add_gr_metadata(self, seq_name, feature_name, feature_data):
        if seq_name not in self._seqs:
            self._seqs[seq_name] = _SeqData(seq_name)
        self._seqs[seq_name].add_positional_metadata_feature(feature_name,
                                                             feature_data)
    def build_tabular_msa(self, constructor):
        if len(self._seqs) != len(self._seq_order):
            invalid_seq_names = set(self._seqs) - set(self._seq_order)
            raise StockholmFormatError('Found GS or GR metadata for '
                                       'nonexistent sequence(s): %r'
                                       % invalid_seq_names)
        seqs = []
        for seq_name in self._seq_order:
            seqs.append(self._seqs[seq_name].build_sequence(constructor))
        positional_metadata = self._positional_metadata
        if not positional_metadata:
            positional_metadata = None
        metadata = self._metadata
        if not metadata:
            metadata = None
        # Constructs TabularMSA
        return TabularMSA(seqs, metadata=metadata,
                          positional_metadata=positional_metadata,
                          index=self._seq_order)
class _SeqData:
    def __init__(self, name):
        self.name = name
        self._seq = None
        self.metadata = None
        self.positional_metadata = None
    @property
    def seq(self):
        return self._seq
    @seq.setter
    def seq(self, seq):
        if self._seq is None:
            self._seq = seq
        else:
            _raise_duplicate_error("Found duplicate sequence name: %r"
                                   % self.name)
    def add_metadata_feature(self, feature_name, feature_data):
        if self.metadata is None:
            self.metadata = OrderedDict()
        if feature_name in self.metadata:
            padding = _get_padding(self.metadata[feature_name][-1])
            self.metadata[feature_name] += padding + feature_data
        else:
            self.metadata[feature_name] = feature_data
    def add_positional_metadata_feature(self, feature_name, feature_data):
        if self.positional_metadata is None:
            self.positional_metadata = OrderedDict()
        if feature_name in self.positional_metadata:
            _raise_duplicate_error("Found duplicate GR label %r associated "
                                   "with sequence name %r"
                                   % (feature_name, self.name))
        else:
            self.positional_metadata[feature_name] = feature_data
    def build_sequence(self, constructor):
        return constructor(self.seq, metadata=self.metadata,
                           positional_metadata=(self.positional_metadata))
def _parse_gf_line(line):
    line = line.split(None, 2)
    _check_for_malformed_line(line, 3)
    return line[1:]
def _parse_gs_line(line):
    line = line.split(None, 3)
    _check_for_malformed_line(line, 4)
    return line[1:]
def _parse_gr_line(line):
    line = line.split(None, 3)
    _check_for_malformed_line(line, 4)
    seq_name = line[1]
    feature_name = line[2]
    feature_data = list(line[3])
    return seq_name, feature_name, feature_data
def _parse_gc_line(line):
    line = line.split(None, 2)
    _check_for_malformed_line(line, 3)
    feature_name = line[1]
    feature_data = list(line[2])
    return feature_name, feature_data
def _parse_sequence_line(line):
    line = line.split(None, 1)
    _check_for_malformed_line(line, 2)
    return line
def _is_header(line):
    return line == '# STOCKHOLM 1.0\n'
def _is_footer(line):
    return line.rstrip() == '//'
def _is_sequence_line(line):
    return not (line.startswith("#") or _is_footer(line))
def _raise_duplicate_error(message):
    raise StockholmFormatError(message+' Note: If the file being used is in '
                                       'Stockholm interleaved format, this '
                                       'is not supported by the reader.')
def _check_for_malformed_line(line, expected_len):
    if len(line) != expected_len:
        raise StockholmFormatError('Line contains %d item(s). It must '
                                   'contain exactly %d item(s).'
                                   % (len(line), expected_len))
@stockholm.writer(TabularMSA)
def _tabular_msa_to_stockholm(obj, fh):
    if not obj.index.is_unique:
        raise StockholmFormatError("The TabularMSA's index labels must be"
                                   " unique.")
    # Writes header
    fh.write("# STOCKHOLM 1.0\n")
    # Writes GF data to file
    if obj.has_metadata():
        for gf_feature, gf_feature_data in obj.metadata.items():
            if gf_feature == 'NH' and isinstance(gf_feature_data, dict):
                for tree_id, tree in gf_feature_data.items():
                    fh.write("#=GF TN %s\n" % tree_id)
                    fh.write("#=GF NH %s\n" % tree)
            elif gf_feature == 'RN':
                if not isinstance(gf_feature_data, list):
                    raise StockholmFormatError(
                        "Expected 'RN' to contain a list of reference "
                        "dictionaries, got %r." % gf_feature_data)
                for ref_num, dictionary in enumerate(gf_feature_data, start=1):
                    if not isinstance(dictionary, dict):
                        raise StockholmFormatError(
                            "Expected reference information to be stored as a "
                            "dictionary, found reference %d stored as %r." %
                            (ref_num, type(dictionary).__name__))
                    fh.write("#=GF RN [%d]\n" % ref_num)
                    for feature in dictionary:
                        if feature not in _REFERENCE_TAGS:
                            formatted_reference_tags = ', '.join(
                                [tag for tag in _REFERENCE_TAGS])
                            raise StockholmFormatError(
                                "Invalid reference tag %r found in reference "
                                "dictionary %d. Valid reference tags are: %s."
                                % (feature, ref_num, formatted_reference_tags))
                        fh.write("#=GF %s %s\n" % (feature,
                                                   dictionary[feature]))
            else:
                fh.write("#=GF %s %s\n" % (gf_feature, gf_feature_data))
    unpadded_data = []
    # Writes GS data to file, retrieves GR data, and retrieves sequence data
    for seq, seq_name in zip(obj, obj.index):
        seq_name = str(seq_name)
        if seq.has_metadata():
            for gs_feature, gs_feature_data in seq.metadata.items():
                fh.write("#=GS %s %s %s\n" % (seq_name, gs_feature,
                                              gs_feature_data))
        unpadded_data.append((seq_name, str(seq)))
        if seq.has_positional_metadata():
            df = _format_positional_metadata(seq.positional_metadata,
                                             'Sequence-specific positional '
                                             'metadata (GR)')
            for gr_feature in df.columns:
                gr_feature_data = ''.join(df[gr_feature])
                gr_string = "#=GR %s %s" % (seq_name, gr_feature)
                unpadded_data.append((gr_string, gr_feature_data))
    # Retrieves GC data
    if obj.has_positional_metadata():
        df = _format_positional_metadata(obj.positional_metadata,
                                         'Multiple sequence alignment '
                                         'positional metadata (GC)')
        for gc_feature in df.columns:
            gc_feature_data = ''.join(df[gc_feature])
            gc_string = "#=GC %s" % gc_feature
            unpadded_data.append((gc_string, gc_feature_data))
    # Writes GR, GC, and raw data to file with padding
    _write_padded_data(unpadded_data, fh)
    # Writes footer
    fh.write("//\n")
def _write_padded_data(data, fh):
    max_data_len = 0
    for label, _ in data:
        if len(label) > max_data_len:
            max_data_len = len(label)
    fmt = '{0:%d} {1}\n' % max_data_len
    for label, value in data:
        fh.write(fmt.format(label, value))
def _format_positional_metadata(df, data_type):
    # Asserts positional metadata feature names are unique
    if not df.columns.is_unique:
        num_repeated_columns = len(df.columns) - len(set(df.columns))
        raise StockholmFormatError('%s feature names must be unique. '
                                   'Found %d duplicate names.'
                                   % (data_type, num_repeated_columns))
    str_df = df.astype(str)
    # Asserts positional metadata dataframe items are one character long
    for column in str_df.columns:
        if (str_df[column].str.len() != 1).any():
            raise StockholmFormatError("%s must contain a single character for"
                                       " each position's value. Found value(s)"
                                       " in column %s of incorrect length."
                                       % (data_type, column))
    return str_df
def _get_padding(item):
    return '' if item[-1].isspace() else ' '
 |