1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379
|
********************************
Analysis of sequence composition
********************************
.. sectionauthor:: Jesse Zaneveld
PyCogent provides several tools for analyzing the composition of DNA, RNA, or
protein sequences.
Loading your sequence
=====================
Let us say that we wish to study the sequence composition of the *Y. pseudotuberculosis* PB1 DNA Polymerase III beta subunit.
First we input the sequence as a string.
.. doctest::
>>> y_pseudo_seq = \
... """ atgaaatttatcattgaacgtgagcatctgctaaaaccactgcaacaggtcagtagcccg
... ctgggtggacgccctacgttgcctattttgggtaacttgttgctgcaagtcacggaaggc
... tctttgcggctgaccggtaccgacttggagatggagatggtggcttgtgttgccttgtct
... cagtcccatgagccgggtgctaccacagtacccgcacggaagttttttgatatctggcgt
... ggtttacccgaaggggcggaaattacggtagcgttggatggtgatcgcctgctagtgcgc
... tctggtcgcagccgtttctcgctgtctaccttgcctgcgattgacttccctaatctggat
... gactggcagagtgaggttgaattcactttaccgcaggctacgttaaagcgtctgattgag
... tccactcagttttcgatggcccatcaggatgtccgttattatttgaacggcatgctgttt
... gagaccgaaggcgaagagttacgtactgtggcgaccgatgggcatcgcttggctgtatgc
... tcaatgcctattggccagacgttaccctcacattcggtgatcgtgccgcgtaaaggtgtg
... atggagctggttcggttgctggatggtggtgatacccccttgcggctgcaaattggcagt
... aataatattcgtgctcatgtgggcgattttattttcacatctaagctggttgatggccgt
... ttcccggattatcgccgcgtattgccgaagaatcctgataaaatgctggaagccggttgc
... gatttactgaaacaggcattttcgcgtgcggcaattctgtcaaatgagaagttccgtggt
... gttcggctctatgtcagccacaatcaactcaaaatcactgctaataatcctgaacaggaa
... gaagcagaagagatcctcgatgttagctacgaggggacagaaatggagatcggtttcaac
... gtcagctatgtgcttgatgtgctaaatgcactgaagtgcgaagatgtgcgcctgttattg
... actgactctgtatccagtgtgcagattgaagacagcgccagccaagctgcagcctatgtc
... gtcatgccaatgcgtttgtag"""
To check that our results are reasonable, we can also load a small example string.
.. doctest::
>>> example_seq = "GCGTTT"
In order to calculate compositional statistics, we need to import one of the ``Usage`` objects from ``cogent.core.usage``, create an object from our string, and normalize the counts contained in the string into frequencies. ``Usage`` objects include ``BaseUsage``, ``PositionalBaseUsage``, ``CodonUsage``, and ``AminoAcidUsage``.
Let us start with the ``BaseUsage`` object. The first few steps will be the same for the other Usage objects, however (as we will see below).
GC content
==========
Total GC content
-----------------
GC content is one commonly used compositional statistic. To calculate the total GC content of our gene, we will need to initiate and normalize a ``BaseUsage`` object.
.. doctest::
>>> from cogent.core.usage import BaseUsage
>>> example_bu = BaseUsage(example_seq)
>>> # Print raw counts
>>> print example_bu.content("GC")
3.0
>>> example_bu.normalize()
>>> print example_bu.content("GC")
0.5
We can now visually verify that the reported GC contents are correct, and use the same technique on our full sequence.
.. doctest::
>>> y_pseudo_bu = BaseUsage(y_pseudo_seq)
>>> # Print raw counts
>>> y_pseudo_bu.content("GC")
555.0
>>> y_pseudo_bu.normalize()
>>> print y_pseudo_bu.content("GC")
0.50408719346
Positional GC content of Codons
-------------------------------
When analyzing protein coding genes, it is often useful to subdivide the GC content by codon position. In particular, the 3rd codon position ``CodonUsage`` objects allow us to calculate the GC content at each codon position.
First, let us calculate the GC content for the codons in the example sequence as follows.
.. doctest::
>>> # Import CodonUsage object
>>> from cogent.core.usage import CodonUsage
>>> # Initiate & normalize CodonUsage object
>>> example_seq_cu = CodonUsage(example_seq)
>>> example_seq_cu.normalize()
>>> GC,P1,P2,P3 = example_seq_cu.positionalGC()
Here, GC is the overall GC content for the sequence, while P1, P2, and P3 are the GC content at the first, second, and third codon positions, respectively.
Printing the results for the example gives the following results.
.. doctest::
>>> print "GC:", GC
GC: 0.5
>>> print "P1:", P1
P1: 0.5
>>> print "P2:", P2
P2: 0.5
>>> print "P3:", P3
P3: 0.5
We can then do the same for our biological sequence.
.. doctest::
>>> y_pseudo_cu = CodonUsage(y_pseudo_seq)
>>> y_pseudo_cu.normalize()
>>> y_pseudo_GC = y_pseudo_cu.positionalGC()
>>> print y_pseudo_GC
[0.51874999999999993, 0.5843749999999999, 0.4750000000000001, 0.49687499999999996]
These results could then be fed into downstream analyses.
One important note is that ``CodonUsage`` objects calculate the GC content of codons within nucleotide sequences, rather than the full GC content. Therefore, ``BaseUsage`` rather than ``CodonUsage`` objects should be used for calculating the GC content of non-coding sequences.
Total Base Usage
================
A more detailed view of composition incorporates the relative counts or frequencies of all bases. We can calculate total base usage as follows.
.. doctest::
>>> from cogent.core.usage import BaseUsage
>>> example_bu = BaseUsage(example_seq)
>>> # Print raw counts
>>> for k in example_bu.RequiredKeys:
... print k, example_bu[k]
A 0.0
C 1.0
U 3.0
G 2.0
>>> example_bu.normalize()
>>> for k in example_bu.RequiredKeys:
... print k, example_bu[k]
A 0.0
C 0.166666666667
U 0.5
G 0.333333333333
Dinucleotide Content
====================
The ``DinucUsage`` object allows us to calculate Dinucleotide usage for our sequence.
Dinucleotide usage can be calculated using overlapping, non-overlapping, or '3-1' dinucleotides.
Given the sequence "AATTAAGCC", each method will count dinucleotide usage differently. Overlapping dinucleotide usage will count "AA", "AT", "TT", "TA", "AA", "AG", "GC", "CC". Non-overlapping dinucleotide usage will count "AA", "TT", "AA", "GC" 3-1 dinucleotide usage will count "TT", "AC".
Calculating the GC content at the third and first codon positions ("3-1" usage) is useful for some applications, such as gene transfer detection, because changes at these positions tend to produce the most conservative amino acid substitutions, and thus are thought to better reflect mutational (rather than selective) pressure.
Overlapping dinucleotide content
--------------------------------
To calculate overlapping dinucleotide usage for our *Y. pseudotuberculosis* PB1 sequence.
.. doctest::
>>> from cogent.core.usage import DinucUsage
>>> du = DinucUsage(y_pseudo_seq, Overlapping=True)
>>> du.normalize()
We can inspect individual dinucleotide usages and confirm that the results add to 100% as follows
.. doctest::
>>> total = 0.0
>>> for k in du.RequiredKeys:
... print k, du[k]
... total += du[k]
UU 0.0757855822551
UC 0.0517560073937
UA 0.043438077634
UG 0.103512014787
CU 0.0619223659889
CC 0.0517560073937
CA 0.0517560073937
CG 0.0573012939002
AU 0.0674676524954
AC 0.043438077634
AA 0.0573012939002
AG 0.054528650647
GU 0.0711645101664
GC 0.0794824399261
GA 0.0674676524954
GG 0.0619223659889
>>> print "Total:",total
Total: 1.0
Non-overlapping Dinucleotide Content
------------------------------------
To calculate non-overlapping dinucleotide usage we simply change the ``Overlapping`` parameter to ``False`` when initiating the ``DinucUsage`` object.
.. doctest::
>>> from cogent.core.usage import DinucUsage
>>> du_no = DinucUsage(y_pseudo_seq, Overlapping=False)
>>> du_no.normalize()
>>> total = 0
>>> for k in du_no.RequiredKeys:
... print k, du_no[k]
... total += du_no[k]
UU 0.0733082706767
UC 0.0507518796992
UA 0.0375939849624
UG 0.105263157895
CU 0.0733082706767
CC 0.046992481203
CA 0.0394736842105
CG 0.0601503759398
AU 0.0751879699248
AC 0.046992481203
AA 0.062030075188
AG 0.0545112781955
GU 0.0601503759398
GC 0.0845864661654
GA 0.0676691729323
GG 0.062030075188
>>> print "Total:",total
Total: 1.0
'3-1' Dinucleotide Content
--------------------------
To calculate dinucleotide usage considering only adjacent first and third codon positions, we set the Overlapping parameter to '3-1' when constructing our ``DinucUsage`` object
.. doctest::
>>> from cogent.core.usage import DinucUsage
>>> du_3_1 = DinucUsage(y_pseudo_seq, Overlapping='3-1')
>>> du_3_1.normalize()
>>> total = 0
>>> for k in du_3_1.RequiredKeys:
... print k, du_3_1[k]
... total += du_3_1[k]
UU 0.0720221606648
UC 0.0664819944598
UA 0.0360110803324
UG 0.0914127423823
CU 0.0387811634349
CC 0.0415512465374
CA 0.0554016620499
CG 0.0554016620499
AU 0.0498614958449
AC 0.0470914127424
AA 0.0664819944598
AG 0.0747922437673
GU 0.0886426592798
GC 0.0886426592798
GA 0.0609418282548
GG 0.0664819944598
>>> print "Total:",total
Total: 1.0
Comparing dinucleotide usages
-----------------------------
Above, we noted that there are several ways to calculate dinucleotide usages on a single sequence, and that the choice of methods changes the reported frequencies somewhat. How could we quantify the effect this choice make on the result?
One way to test this is to calculate the Euclidean distance between the resulting frequencies. We can do this using the dinucleotide usage's
.. doctest::
>>> du_vs_du_3_1_dist = du.distance(du_3_1)
As required of a true distance, the results are independent of the direction of the calculation.
.. doctest::
>>> du_3_1_vs_du_dist = du_3_1.distance(du)
>>> print du_3_1_vs_du_dist == du_vs_du_3_1_dist
True
Caution regarding unnormalized distances
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that in this case we have already called ``du.normalize()`` on each ``DinucUsage`` object. You MUST call ``du.normalize()`` before calculating distances. Otherwise the distance calculated will be for the dinucleotide counts, rather than frequencies. Distances of counts can be non-zero even for sequences with identical dinucleotide usage, if those sequences are of different lengths.
k-words
-------
*To be written.*
Codon usage analyses
====================
In addition to allowing a more detailed examination of GC content in coding sequences, ``CodonUsage`` objects (as the name implies) let us examine the codon usage of our sequence.
.. doctest::
>>> from cogent.core.usage import CodonUsage
>>> y_pseudo_cu = CodonUsage(y_pseudo_seq)
>>> # Print raw counts
>>> for k in y_pseudo_cu.RequiredKeys:
... print k, y_pseudo_cu[k]
UUU 8.0
UUC 4.0
UUA 5.0
UUG 14.0
UCU 4.0
UCC 3.0
UCA 5.0
UCG 3.0
UAU 8.0...
Note that before normalization the ``CodonUsage`` object holds raw counts of results. However, for most purposes, we will want frequencies, so we normalize the counts.
.. doctest::
>>> y_pseudo_cu.normalize()
>>> # Print normalized frequencies
>>> for k in y_pseudo_cu.RequiredKeys:
... print k, y_pseudo_cu[k]
UUU 0.0225988700565
UUC 0.0112994350282
UUA 0.0141242937853
UUG 0.0395480225989
UCU 0.0112994350282
UCC 0.00847457627119
UCA 0.0141242937853
UCG 0.00847457627119
UAU 0.0225988700565...
Relative Synonymous Codon Usage
-------------------------------
The RSCU or relative synonymous codon usage metric divides the frequency of each codon by the total frequency of all codons encoding the same amino acid.
.. doctest::
>>> y_pseudo_cu.normalize()
>>> y_pseudo_rscu = y_pseudo_cu.rscu()
>>> # Print rscu frequencies
>>> for k in y_pseudo_rscu.keys():
... print k, y_pseudo_rscu[k]
ACC 0.263157894737
GUC 0.238095238095
ACA 0.210526315789
ACG 0.263157894737
AAC 0.4
CCU 0.315789473684
UGG 1.0
AUC 0.266666666667
GUA 0.190476190476...
PR2 bias
--------
*To be written*
Fingerprint analysis
--------------------
*To be written*
Amino Acid Usage
================
*To be written.*
Profiles
========
*To be written.*
Visualisation
=============
*To be written.*
|