File: DNA_and_RNA_sequences.rst

package info (click to toggle)
python-cogent 1.4.1-1.2
  • links: PTS, VCS
  • area: non-free
  • in suites: squeeze
  • size: 13,260 kB
  • ctags: 20,087
  • sloc: python: 116,163; ansic: 732; makefile: 74; sh: 9
file content (263 lines) | stat: -rw-r--r-- 7,456 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
DNA and RNA sequences
---------------------

.. authors, Gavin Huttley, Kristian Rother, Patrick Yannul, Tom Elliott

DNA sequence objects
^^^^^^^^^^^^^^^^^^^^

Creating a DNA sequence from a string
"""""""""""""""""""""""""""""""""""""

All sequence and alignment objects have a molecular type, or ``MolType`` which provides key properties for validating sequence characters. Here we use the ``DNA`` ``MolType`` to create a DNA sequence.

.. doctest::

    >>> from cogent import DNA
    >>> my_seq = DNA.makeSequence("AGTACACTGGT")
    >>> my_seq
    DnaSequence(AGTACAC... 11)
    >>> print my_seq
    AGTACACTGGT
    >>> str(my_seq)
    'AGTACACTGGT'

Converting to FASTA format
""""""""""""""""""""""""""

.. doctest::

    >>> from cogent import DNA
    >>> my_seq = DNA.makeSequence('AGTACACTGGT')
    >>> print my_seq.toFasta()
    >0
    AGTACACTGGT

Creating a named sequence
"""""""""""""""""""""""""

.. doctest::

    >>> from cogent import DNA
    >>> my_seq = DNA.makeSequence('AGTACACTGGT','my_gene')
    >>> my_seq
    DnaSequence(AGTACAC... 11)
    >>> type(my_seq)
    <class 'cogent.core.sequence.DnaSequence'>

Setting or changing the name of a sequence
""""""""""""""""""""""""""""""""""""""""""

.. doctest::

    >>> from cogent import DNA
    >>> my_seq = DNA.makeSequence('AGTACACTGGT')
    >>> my_seq.Name = 'my_gene'
    >>> print my_seq.toFasta()
    >my_gene
    AGTACACTGGT

Inverting a DNA sequence
""""""""""""""""""""""""

.. doctest::

    >>> from cogent import DNA
    >>> my_seq = DNA.makeSequence("AGTACACTGGT")
    >>> print my_seq.complement()
    TCATGTGACCA
    >>> print my_seq.reversecomplement()
    ACCAGTGTACT

The ``rc`` method name is easier to type

.. doctest::

    >>> print my_seq.rc()
    ACCAGTGTACT

.. _translation:

Converting a DnaSequence object to protein
""""""""""""""""""""""""""""""""""""""""""

.. doctest::

    >>> from cogent import DNA
    >>> my_seq = DNA.makeSequence("AGTACACTGGT",'X')
    >>> pep = my_seq.getTranslation()
    >>> type(pep)
    <class 'cogent.core.sequence.ProteinSequence'>
    >>> print pep.toFasta()
    >X
    STL

Converting a DNA sequence to RNA
""""""""""""""""""""""""""""""""

.. doctest::

    >>> print my_seq.toRna()
    AGUACACUGGU

Testing complementarity
"""""""""""""""""""""""

.. doctest::

    >>> from cogent import DNA
    >>> a = DNA.makeSequence("AGTACACTGGT")
    >>> a.canPair(a.complement())
    False
    >>> a.canPair(a.reversecomplement())
    True

Joining two DNA sequences
"""""""""""""""""""""""""

.. doctest::

    >>> from cogent import DNA
    >>> my_seq = DNA.makeSequence("AGTACACTGGT")
    >>> extra_seq = DNA.makeSequence("CTGAC")
    >>> long_seq = my_seq + extra_seq
    >>> long_seq
    DnaSequence(AGTACAC... 16)
    >>> str(long_seq)
    'AGTACACTGGTCTGAC'

Slicing DNA sequences
"""""""""""""""""""""

.. doctest::

    >>> my_seq[1:6]
    DnaSequence(GTACA)

Getting 3rd positions from codons
"""""""""""""""""""""""""""""""""

We'll do this by specifying the position indices of interest, creating a sequence ``Feature`` and using that to extract the positions.

.. doctest::

    >>> from cogent import DNA
    >>> seq = DNA.makeSequence('ATGATGATGATG')

Creating the position indices, note that we start at the 2nd index (the 'first' codon's 3rd position) indicate each position as a *span* (``i -- i+1``).

.. doctest::

    >>> indices = [(i, i+1) for i in range(len(seq))[2::3]]

Create the sequence feature and use it to slice the sequence.

.. doctest::

    >>> pos3 = seq.addFeature('pos3', 'pos3', indices)
    >>> pos3 = pos3.getSlice()
    >>> assert str(pos3) == 'GGGG'

Getting 1st and 2nd positions from codons
"""""""""""""""""""""""""""""""""""""""""

The only difference here to above is that our spans cover 2 positions.

.. doctest::

    >>> from cogent import DNA
    >>> seq = DNA.makeSequence('ATGATGATGATG')
    >>> indices = [(i, i+2) for i in range(len(seq))[::3]]
    >>> pos12 = seq.addFeature('pos12', 'pos12', indices)
    >>> pos12 = pos12.getSlice()
    >>> assert str(pos12) == 'ATATATAT'

Creating a general sequence object
""""""""""""""""""""""""""""""""""

.. note:: the import statement

.. doctest::

    >>> from cogent.core.sequence import Sequence
    >>> seq = Sequence('ABCDEF','Name')
    >>> print seq.toFasta()
    >Name
    ABCDEF
    >>> print type(seq)
    <class 'cogent.core.sequence.Sequence'>

Loading sequences from a file
"""""""""""""""""""""""""""""

For loading collections of unaligned or aligned sequences see :ref:`load-seqs`.

Loading FASTA sequences from an open file or list of lines
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

To load sequences from a fasta file directly, you can use the ``MinimalFastaParser``.

.. note:: This returns the sequences as strings.

.. doctest::

    >>> from cogent.parse.fasta import MinimalFastaParser
    >>> f=open('data/long_testseqs.fasta')
    >>> seqs = [(name, seq) for name, seq in MinimalFastaParser(f)]
    >>> print seqs
    [('Human', 'TGTGGCACAAATAC...

Handling overloaded FASTA sequence labels
+++++++++++++++++++++++++++++++++++++++++

The FASTA label field is frequently overloaded, with different information fields present in the field and separated by some delimiter. This can be flexibly addressed using the ``LabelParser``. By creating a custom label parser, we can decided which part we use as the sequence name. We show how convert a field into something specific.

.. doctest::
    
    >>> from cogent.parse.fasta import LabelParser
    >>> def latin_to_common(latin):
    ...     return {'Homo sapiens': 'human',
    ...             'Pan troglodtyes': 'chimp'}[latin]
    >>> label_parser = LabelParser("%(species)s",
    ...             [[1, "species", latin_to_common]], split_with=':')
    >>> for label in ">abcd:Homo sapiens:misc", ">abcd:Pan troglodtyes:misc":
    ...     label = label_parser(label)
    ...     print label, type(label)
    human <class 'cogent.parse.fasta.RichLabel'>
    chimp <class 'cogent.parse.fasta.RichLabel'>

The ``RichLabel`` objects have an ``Info`` object as an attribute, allowing specific reference to all the specified label fields.

.. doctest::
    
    >>> from cogent.parse.fasta import MinimalFastaParser, LabelParser
    >>> fasta_data = ['>gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens]',
    ...  'MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGAKKLPGPAVNL',
    ... 'SEIQNIKSELKYVPKAEQ',
    ... '>gi|10047092|ref|NP_037391.1| neuronal protein [Homo sapiens]',
    ... 'MANRGPSYGLSREVQEKIEQKYDADLENKLVDWIILQCAEDIEHPPPGRAHFQKWLMDGTVLCKLINSLY',
    ... 'PPGQEPIPKISESKMAFKQMEQISQFLKAAETYGVRTTDIFQTVDLWEGKDMAAVQRTLMALGSVAVTKD']
    ... 
    >>> label_to_name = LabelParser("%(ref)s",
    ...                              [[1,"gi", str],
    ...                               [3, "ref", str],
    ...                               [4, "description", str]],
    ...                               split_with="|")
    ... 
    >>> for name, seq in MinimalFastaParser(fasta_data, label_to_name=label_to_name):
    ...     print name
    ...     print name.Info.gi
    ...     print name.Info.description
    NP_055147.1
    10047090
     small muscle protein, X-linked [Homo sapiens]
    NP_037391.1
    10047092
     neuronal protein [Homo sapiens]

Loading DNA sequences from a GenBank file
+++++++++++++++++++++++++++++++++++++++++

.. todo:: get sample data for this

*To be written.*