1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
|
r"""
Alignments (:mod:`skbio.alignment`)
===================================
.. currentmodule:: skbio.alignment
This module provides functionality for computing and manipulating sequence
alignments. DNA, RNA, and protein sequences can be aligned, as well as
sequences with custom alphabets.
Data Structures
---------------
.. autosummary::
:toctree: generated/
TabularMSA
Optimized (i.e., production-ready) Alignment Algorithms
-------------------------------------------------------
.. autosummary::
:toctree: generated/
StripedSmithWaterman
AlignmentStructure
local_pairwise_align_ssw
Slow (i.e., educational-purposes only) Alignment Algorithms
-----------------------------------------------------------
.. autosummary::
:toctree: generated/
global_pairwise_align_nucleotide
global_pairwise_align_protein
global_pairwise_align
local_pairwise_align_nucleotide
local_pairwise_align_protein
local_pairwise_align
General functionality
---------------------
.. autosummary::
:toctree: generated/
make_identity_substitution_matrix
Data Structure Examples
-----------------------
Load two DNA sequences that have been previously aligned into a ``TabularMSA``
object, using sequence IDs as the MSA's index:
>>> from skbio import TabularMSA, DNA
>>> seqs = [DNA("ACC--G-GGTA..", metadata={'id': "seq1"}),
... DNA("TCC--G-GGCA..", metadata={'id': "seq2"})]
>>> msa = TabularMSA(seqs, minter='id')
>>> msa
TabularMSA[DNA]
----------------------
Stats:
sequence count: 2
position count: 13
----------------------
ACC--G-GGTA..
TCC--G-GGCA..
>>> msa.index
Index(['seq1', 'seq2'], dtype='object')
Alignment Algorithm Examples
----------------------------
Optimized Alignment Algorithm Examples
--------------------------------------
Using the convenient ``local_pairwise_align_ssw`` function:
>>> from skbio.alignment import local_pairwise_align_ssw
>>> alignment, score, start_end_positions = local_pairwise_align_ssw(
... DNA("ACTAAGGCTCTCTACCCCTCTCAGAGA"),
... DNA("ACTAAGGCTCCTAACCCCCTTTTCTCAGA")
... )
>>> alignment
TabularMSA[DNA]
------------------------------
Stats:
sequence count: 2
position count: 30
------------------------------
ACTAAGGCTCTCT-ACCCC----TCTCAGA
ACTAAGGCTC-CTAACCCCCTTTTCTCAGA
>>> score
27
>>> start_end_positions
[(0, 24), (0, 28)]
Using the ``StripedSmithWaterman`` object:
>>> from skbio.alignment import StripedSmithWaterman
>>> query = StripedSmithWaterman("ACTAAGGCTCTCTACCCCTCTCAGAGA")
>>> alignment = query("AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA")
>>> print(alignment)
ACTAAGGCTC...
ACTAAGGCTC...
Score: 49
Length: 28
Using the ``StripedSmithWaterman`` object for multiple targets in an efficient
way and finding the aligned sequence representations:
>>> from skbio.alignment import StripedSmithWaterman
>>> alignments = []
>>> target_sequences = [
... "GCTAACTAGGCTCCCTTCTACCCCTCTCAGAGA",
... "GCCCAGTAGCTTCCCAATATGAGAGCATCAATTGTAGATCGGGCC",
... "TCTATAAGATTCCGCATGCGTTACTTATAAGATGTCTCAACGG",
... "TAGAGATTAATTGCCACTGCCAAAATTCTG"
... ]
>>> query_sequence = "ACTAAGGCTCTCTACCCCTCTCAGAGA"
>>> query = StripedSmithWaterman(query_sequence)
>>> for target_sequence in target_sequences:
... alignment = query(target_sequence)
... alignments.append(alignment)
...
>>> print(alignments[0])
ACTAAGGCTC...
ACT-AGGCTC...
Score: 38
Length: 30
>>> print(alignments[0].aligned_query_sequence)
ACTAAGGCTC---TCTACCCCTCTCAGAGA
>>> print(alignments[0].aligned_target_sequence)
ACT-AGGCTCCCTTCTACCCCTCTCAGAGA
Slow Alignment Algorithm Examples
---------------------------------
scikit-bio also provides pure-Python implementations of Smith-Waterman and
Needleman-Wunsch alignment. These are much slower than the methods described
above, but serve as useful educational examples as they're simpler to
experiment with. Functions are provided for local and global alignment of
protein and nucleotide sequences. The ``global*`` and ``local*`` functions
differ in the underlying algorithm that is applied (``global*`` uses Needleman-
Wunsch while ``local*`` uses Smith-Waterman), and ``*protein`` and
``*nucleotide`` differ in their default scoring of matches, mismatches, and
gaps.
Here we locally align a pair of protein sequences using gap open penalty
of 11 and a gap extend penalty of 1 (in other words, it is much more
costly to open a new gap than extend an existing one).
>>> from skbio import Protein
>>> from skbio.alignment import local_pairwise_align_protein
>>> s1 = Protein("HEAGAWGHEE")
>>> s2 = Protein("PAWHEAE")
>>> alignment, score, start_end_positions = local_pairwise_align_protein(
... s1, s2, 11, 1)
This returns an ``skbio.TabularMSA`` object, the alignment score, and start/end
positions of each aligned sequence:
>>> alignment
TabularMSA[Protein]
---------------------
Stats:
sequence count: 2
position count: 5
---------------------
AWGHE
AW-HE
>>> score
25.0
>>> start_end_positions
[(4, 8), (1, 4)]
Similarly, we can perform global alignment of nucleotide sequences:
>>> from skbio import DNA
>>> from skbio.alignment import global_pairwise_align_nucleotide
>>> s1 = DNA("GCGTGCCTAAGGTATGCAAG")
>>> s2 = DNA("ACGTGCCTAGGTACGCAAG")
>>> alignment, score, start_end_positions = global_pairwise_align_nucleotide(
... s1, s2)
>>> alignment
TabularMSA[DNA]
----------------------
Stats:
sequence count: 2
position count: 20
----------------------
GCGTGCCTAAGGTATGCAAG
ACGTGCCTA-GGTACGCAAG
"""
# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE.txt, distributed with this software.
# ----------------------------------------------------------------------------
from ._tabular_msa import TabularMSA
from ._pairwise import (
local_pairwise_align_nucleotide, local_pairwise_align_protein,
local_pairwise_align, global_pairwise_align_nucleotide,
global_pairwise_align_protein, global_pairwise_align,
make_identity_substitution_matrix, local_pairwise_align_ssw
)
from skbio.alignment._ssw_wrapper import (
StripedSmithWaterman, AlignmentStructure)
__all__ = ['TabularMSA', 'StripedSmithWaterman', 'AlignmentStructure',
'local_pairwise_align_ssw', 'global_pairwise_align',
'global_pairwise_align_nucleotide', 'global_pairwise_align_protein',
'local_pairwise_align', 'local_pairwise_align_nucleotide',
'local_pairwise_align_protein', 'make_identity_substitution_matrix']
|