File: chapter_pairwise2.rst

package info (click to toggle)
python-biopython 1.85%2Bdfsg-4
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 126,372 kB
  • sloc: xml: 1,047,995; python: 332,722; ansic: 16,944; sql: 1,208; makefile: 140; sh: 81
file content (209 lines) | stat: -rw-r--r-- 8,253 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
.. _`chapter:pairwise2`:

Pairwise alignments using pairwise2
===================================

**Please note that Bio.pairwise2 was deprecated in Release 1.80.** As an
alternative, please consider using ``Bio.Align.PairwiseAligner``
(described in Chapter :ref:`chapter:pairwise`).

``Bio.pairwise2`` contains essentially the same algorithms as ``water``
(local) and ``needle`` (global) from the
`EMBOSS <http://emboss.sourceforge.net/>`__ suite (see above) and should
return the same results. The ``pairwise2`` module has undergone some
optimization regarding speed and memory consumption recently (Biopython
versions >1.67) so that for short sequences (global alignments: ~2000
residues, local alignments ~600 residues) it’s faster (or equally fast)
to use ``pairwise2`` than calling EMBOSS’ ``water`` or ``needle`` via
the command line tools.

Suppose you want to do a global pairwise alignment between the same two
hemoglobin sequences from above (``HBA_HUMAN``, ``HBB_HUMAN``) stored in
``alpha.faa`` and ``beta.faa``:

.. doctest examples

.. code:: pycon

   >>> from Bio import pairwise2
   >>> from Bio import SeqIO
   >>> seq1 = SeqIO.read("alpha.faa", "fasta")
   >>> seq2 = SeqIO.read("beta.faa", "fasta")
   >>> alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)

As you see, we call the alignment function with ``align.globalxx``. The
tricky part are the last two letters of the function name (here:
``xx``), which are used for decoding the scores and penalties for
matches (and mismatches) and gaps. The first letter decodes the match
score, e.g. ``x`` means that a match counts 1 while mismatches have no
costs. With ``m`` general values for either matches or mismatches can be
defined (for full details see :py:mod:`Bio.pairwise2`). The
second letter decodes the cost for gaps; ``x`` means no gap costs at
all, with ``s`` different penalties for opening and extending a gap can
be assigned. So, ``globalxx`` means that only matches between both
sequences are counted.

Our variable ``alignments`` now contains a list of alignments (at least
one) which have the same optimal score for the given conditions. In our
example this are 80 different alignments with the score 72
(``Bio.pairwise2`` will return up to 1000 alignments). Have a look at
one of these alignments:

.. cont-doctest

.. code:: pycon

   >>> len(alignments)
   80
   >>> print(alignments[0])  # doctest:+ELLIPSIS
   Alignment(seqA='MV-LSPADKTNV---K-A--A-WGKVGAHAG...YR-', seqB='MVHL-----T--PEEKSAVTALWGKV----...Y-H', score=72.0, start=0, end=217)

Each alignment is a named tuple consisting of the two aligned sequences,
the score, the start and the end positions of the alignment (in global
alignments the start is always 0 and the end the length of the
alignment). ``Bio.pairwise2`` has a function ``format_alignment`` for a
nicer printout:

.. cont-doctest

.. code:: pycon

   >>> print(pairwise2.format_alignment(*alignments[0]))  # doctest:+ELLIPSIS
   MV-LSPADKTNV---K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--F...YR-
   || |     |     | |  | ||||        |  |  |||  |  |      |    |   |...|  
   MVHL-----T--PEEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF...Y-H
     Score=72
   <BLANKLINE>

Since Biopython 1.77 the required parameters can be supplied with
keywords. The last example can now also be written as:

.. cont-doctest

.. code:: pycon

   >>> alignments = pairwise2.align.globalxx(sequenceA=seq1.seq, sequenceB=seq2.seq)

Better alignments are usually obtained by penalizing gaps: higher costs
for opening a gap and lower costs for extending an existing gap. For
amino acid sequences match scores are usually encoded in matrices like
``PAM`` or ``BLOSUM``. Thus, a more meaningful alignment for our example
can be obtained by using the BLOSUM62 matrix, together with a gap open
penalty of 10 and a gap extension penalty of 0.5 (using ``globalds``):

.. code:: pycon

   >>> from Bio import pairwise2
   >>> from Bio import SeqIO
   >>> from Bio.Align import substitution_matrices
   >>> blosum62 = substitution_matrices.load("BLOSUM62")
   >>> seq1 = SeqIO.read("alpha.faa", "fasta")
   >>> seq2 = SeqIO.read("beta.faa", "fasta")
   >>> alignments = pairwise2.align.globalds(seq1.seq, seq2.seq, blosum62, -10, -0.5)
   >>> len(alignments)
   2
   >>> print(pairwise2.format_alignment(*alignments[0]))
   MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR
   || |.|..|..|.|.|||| ......|............|.......||.
   MVHLTPEEKSAVTALWGKV-NVDEVGGEALGRLLVVYPWTQRFF...KYH
     Score=292.5

This alignment has the same score that we obtained earlier with EMBOSS
needle using the same sequences and the same parameters.

Local alignments are called similarly with the function
``align.localXX``, where again XX stands for a two letter code for the
match and gap functions:

.. doctest

.. code:: pycon

   >>> from Bio import pairwise2
   >>> from Bio.Align import substitution_matrices
   >>> blosum62 = substitution_matrices.load("BLOSUM62")
   >>> alignments = pairwise2.align.localds("LSPADKTNVKAA", "PEEKSAV", blosum62, -10, -1)
   >>> print(pairwise2.format_alignment(*alignments[0]))
   3 PADKTNV
     |..|..|
   1 PEEKSAV
     Score=16
   <BLANKLINE>

In recent Biopython versions, ``format_alignment`` will only print the
aligned part of a local alignment (together with the start positions in
1-based notation, as shown in the above example). If you are also
interested in the non- aligned parts of the sequences, use the
keyword-parameter ``full_sequences=True``:

.. doctest

.. code:: pycon

   >>> from Bio import pairwise2
   >>> from Bio.Align import substitution_matrices
   >>> blosum62 = substitution_matrices.load("BLOSUM62")
   >>> alignments = pairwise2.align.localds("LSPADKTNVKAA", "PEEKSAV", blosum62, -10, -1)
   >>> print(pairwise2.format_alignment(*alignments[0], full_sequences=True))
   LSPADKTNVKAA
     |..|..|   
   --PEEKSAV---
     Score=16
   <BLANKLINE>

Note that local alignments must, as defined by Smith & Waterman, have a
positive score (>0). Thus, ``pairwise2`` may return no alignments if no
score >0 has been obtained. Also, ``pairwise2`` will not report
alignments which are the result of the addition of zero-scoring
extensions on either site. In the next example, the pairs
serine/aspartic acid (S/D) and lysine/asparagine (K/N) both have a match
score of 0. As you see, the aligned part has not been extended:

.. doctest

.. code:: pycon

   >>> from Bio import pairwise2
   >>> from Bio.Align import substitution_matrices
   >>> blosum62 = substitution_matrices.load("BLOSUM62")
   >>> alignments = pairwise2.align.localds("LSSPADKTNVKKAA", "DDPEEKSAVNN", blosum62, -10, -1)
   >>> print(pairwise2.format_alignment(*alignments[0]))
   4 PADKTNV
     |..|..|
   3 PEEKSAV
     Score=16
   <BLANKLINE>

Instead of supplying a complete match/mismatch matrix, the match code
``m`` allows for easy defining general match/mismatch values. The next
example uses match/mismatch scores of 5/-4 and gap penalties
(open/extend) of 2/0.5 using ``localms``:

.. cont-doctest

.. code:: pycon

   >>> alignments = pairwise2.align.localms("AGAACT", "GAC", 5, -4, -2, -0.5)
   >>> print(pairwise2.format_alignment(*alignments[0]))
   2 GAAC
     | ||
   1 G-AC
     Score=13
   <BLANKLINE>

One useful keyword argument of the ``Bio.pairwise2.align`` functions is
``score_only``. When set to ``True`` it will only return the score of
the best alignment(s), but in a significantly shorter time. It will also
allow the alignment of longer sequences before a memory error is raised.
Another useful keyword argument is ``one_alignment_only=True`` which
will also result in some speed gain.

Unfortunately, ``Bio.pairwise2`` does not work with Biopython’s multiple
sequence alignment objects (yet). However, the module has some
interesting advanced features: you can define your own match and gap
functions (interested in testing affine logarithmic gap costs?), gap
penalties and end gaps penalties can be different for both sequences,
sequences can be supplied as lists (useful if you have residues that are
encoded by more than one character), etc. These features are hard (if at
all) to realize with other alignment tools. For more details see the
module's API documentation :py:mod:`Bio.pairwise2`.