1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138
|
*******************
Building alignments
*******************
.. authors, Gavin Huttley, Kristian Rother, Patrick Yannul
Using the cogent aligners
=========================
Running a pairwise Needleman-Wunsch-Alignment
---------------------------------------------
.. doctest::
>>> from cogent.align.algorithm import nw_align
>>> seq1 = 'AKSAMITNY'
>>> seq2 = 'AKHSAMMIT'
>>> print nw_align(seq1,seq2)
('AK-SAM-ITNY', 'AKHSAMMIT--')
Running a progressive aligner
-----------------------------
We import useful functions and then load the sequences to be aligned.
.. doctest::
>>> from cogent import LoadSeqs, LoadTree, DNA
>>> seqs = LoadSeqs('data/test2.fasta', aligned=False, moltype=DNA)
For nucleotides
^^^^^^^^^^^^^^^
We load a canned nucleotide substitution model and the progressive aligner ``TreeAlign`` function.
.. doctest::
>>> from cogent.evolve.models import HKY85
>>> from cogent.align.progressive import TreeAlign
We first align without providing a guide tree. The ``TreeAlign`` algorithm builds pairwise alignments and estimates the substitution model parameters and pairwise distances. The distances are used to build a neighbour joining tree and the median value of substitution model parameters are provided to the substitution model for the progressive alignment step.
.. doctest::
>>> aln, tree = TreeAlign(HKY85(), seqs)
Param Estimate Summary Stats: kappa
==============================
Statistic Value
------------------------------
Count 10
Sum 1e+06
Median 4.256
Mean 1e+05
StandardDeviation 3.162e+05
Variance 1e+11
------------------------------
>>> aln
5 x 60 text alignment: NineBande[-C-----GCCA...], Mouse[GCAGTGAGCCA...], DogFaced[GCAAGGAGCCA...], ...
We then align using a guide tree (pre-estimated) and specifying the ratio of transitions to transversions (kappa).
.. doctest::
>>> tree = LoadTree(treestring='(((NineBande:0.0128202449453,Mouse:0.184732725695):0.0289459522137,DogFaced:0.0456427810916):0.0271363715538,Human:0.0341320714654,HowlerMon:0.0188456837006)root;')
>>> params={'kappa': 4.0}
>>> aln, tree = TreeAlign(HKY85(), seqs, tree=tree, param_vals=params)
>>> aln
5 x 60 text alignment: NineBande[-C-----GCCA...], Mouse[GCAGTGAGCCA...], DogFaced[GCAAGGAGCCA...], ...
For codons
^^^^^^^^^^
We load a canned codon substitution model and use a pre-defined tree and parameter estimates.
.. doctest::
>>> from cogent.evolve.models import MG94HKY
>>> tree = LoadTree(treestring='((NineBande:0.0575781680031,Mouse:0.594704139406):0.078919659556,DogFaced:0.142151930069,(HowlerMon:0.0619991555435,Human:0.10343006422):0.0792423439112)')
>>> params={'kappa': 4.0, 'omega': 1.3}
>>> aln, tree = TreeAlign(MG94HKY(), seqs, tree=tree, param_vals=params)
>>> aln
5 x 60 text alignment: NineBande[------CGCCA...], Mouse[GCAGTGAGCCA...], DogFaced[GCAAGGAGCCA...], ...
Building alignments with 3rd-party apps such as muscle or clustalw
==================================================================
See :ref:`alignment-controllers`.
Converting gaps from aa-seq alignment to nuc seq alignment
==========================================================
We load some unaligned DNA sequences and show their translation.
.. doctest::
>>> from cogent import LoadSeqs, DNA, PROTEIN
>>> seqs = [('hum', 'AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCCCAGGTCAACCTCACT'),
... ('mus', 'AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCCCAAGTCAACCTCACG'),
... ('rat', 'CTGAACAAGCAGCCACTTTCAAACAAGAAA')]
>>> unaligned_DNA = LoadSeqs(data=seqs, moltype = DNA, aligned = False)
>>> print unaligned_DNA.toFasta()
>hum
AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCCCAGGTCAACCTCACT
>mus
AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCCCAAGTCAACCTCACG
>rat
CTGAACAAGCAGCCACTTTCAAACAAGAAA
>>> print unaligned_DNA.getTranslation()
>hum
KQIQESSENGSLAARQERQAQVNLT
>mus
KQIQESGESGSLAARQERQAQVNLT
>rat
LNKQPLSNKK
<BLANKLINE>
We load an alignment of these protein sequences.
.. doctest::
>>> aligned_aa_seqs = [('hum', 'KQIQESSENGSLAARQERQAQVNLT'),
... ('mus', 'KQIQESGESGSLAARQERQAQVNLT'),
... ('rat', 'LNKQ------PLS---------NKK')]
>>> aligned_aa = LoadSeqs(data = aligned_aa_seqs, moltype = PROTEIN)
We then obtain an alignment of the DNA sequences from the alignment of their translation.
.. doctest::
>>> aligned_DNA = aligned_aa.replaceSeqs(unaligned_DNA)
>>> print aligned_DNA
>hum
AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCCCAGGTCAACCTCACT
>mus
AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCCCAAGTCAACCTCACG
>rat
CTGAACAAGCAG------------------CCACTTTCA---------------------------AACAAGAAA
<BLANKLINE>
|