1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024
|
.. _`chapter:seq_objects`:
Sequence objects
================
Biological sequences are arguably the central object in Bioinformatics,
and in this chapter we’ll introduce the Biopython mechanism for dealing
with sequences, the ``Seq`` object.
Chapter :ref:`chapter:seq_annot` will introduce the
related ``SeqRecord`` object, which combines the sequence information
with any annotation, used again in
Chapter :ref:`chapter:seqio` for Sequence Input/Output.
Sequences are essentially strings of letters like ``AGTACACTGGT``, which
seems very natural since this is the most common way that sequences are
seen in biological file formats.
The most important difference between ``Seq`` objects and standard
Python strings is they have different methods. Although the ``Seq``
object supports many of the same methods as a plain string, its
``translate()`` method differs by doing biological translation, and
there are also additional biologically relevant methods like
``reverse_complement()``.
Sequences act like strings
--------------------------
In most ways, we can deal with Seq objects as if they were normal Python
strings, for example getting the length, or iterating over the elements:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> my_seq = Seq("GATCG")
>>> for index, letter in enumerate(my_seq):
... print("%i %s" % (index, letter))
...
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5
You can access elements of the sequence in the same way as for strings
(but remember, Python counts from zero!):
.. cont-doctest
.. code:: pycon
>>> print(my_seq[0]) # first letter
G
>>> print(my_seq[2]) # third letter
T
>>> print(my_seq[-1]) # last letter
G
The ``Seq`` object has a ``.count()`` method, just like a string. Note
that this means that like a Python string, this gives a
*non-overlapping* count:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2
For some biological uses, you may actually want an overlapping count
(i.e. :math:`3` in this trivial example). When searching for single
letters, this makes no difference:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * (my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
While you could use the above snippet of code to calculate a GC%, note
that the ``Bio.SeqUtils`` module has several GC functions already built.
For example:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> from Bio.SeqUtils import gc_fraction
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
>>> gc_fraction(my_seq)
0.46875
Note that using the ``Bio.SeqUtils.gc_fraction()`` function should
automatically cope with mixed case sequences and the ambiguous
nucleotide S which means G or C.
Also note that just like a normal Python string, the ``Seq`` object is
in some ways “read-only”. If you need to edit your sequence, for example
simulating a point mutation, look at the
Section :ref:`sec:mutable-seq` below which talks about the
``MutableSeq`` object.
Slicing a sequence
------------------
A more complicated example, let’s get a slice of the sequence:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
>>> my_seq[4:12]
Seq('GATGGGCC')
Note that ‘Seq‘ objects follow the usual indexing conventions for Python
strings, with the first element of the sequence numbered 0. When you do
a slice the first item is included (i.e. 4 in this case) and the last is
excluded (12 in this case).
Also like a Python string, you can do slices with a start, stop and
*stride* (the step size, which defaults to one). For example, we can get
the first, second and third codon positions of this DNA sequence:
.. cont-doctest
.. code:: pycon
>>> my_seq[0::3]
Seq('GCTGTAGTAAG')
>>> my_seq[1::3]
Seq('AGGCATGCATC')
>>> my_seq[2::3]
Seq('TAGCTAAGAC')
Another stride trick you might have seen with a Python string is the use
of a -1 stride to reverse the string. You can do this with a ``Seq``
object too:
.. cont-doctest
.. code:: pycon
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG')
.. _`sec:seq-to-string`:
Turning Seq objects into strings
--------------------------------
If you really do just need a plain string, for example to write to a
file, or insert into a database, then this is very easy to get:
.. cont-doctest
.. code:: pycon
>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Since calling ``str()`` on a ``Seq`` object returns the full sequence as
a string, you often don’t actually have to do this conversion
explicitly. Python does this automatically in the print function:
.. cont-doctest
.. code:: pycon
>>> print(my_seq)
GATCGATGGGCCTATATAGGATCGAAAATCGC
You can also use the ``Seq`` object directly with a ``%s`` placeholder
when using the Python string formatting or interpolation operator
(``%``):
.. cont-doctest
.. code:: pycon
>>> fasta_format_string = ">Name\n%s\n" % my_seq
>>> print(fasta_format_string)
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC
<BLANKLINE>
This line of code constructs a simple FASTA format record (without
worrying about line wrapping).
Section :ref:`sec:SeqRecord-format` describes a
neat way to get a FASTA formatted string from a ``SeqRecord`` object,
while the more general topic of reading and writing FASTA format
sequence files is covered in
Chapter :ref:`chapter:seqio`.
Concatenating or adding sequences
---------------------------------
Two ``Seq`` objects can be concatenated by adding them:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> seq1 = Seq("ACGT")
>>> seq2 = Seq("AACCGG")
>>> seq1 + seq2
Seq('ACGTAACCGG')
Biopython does not check the sequence contents and will not raise an
exception if for example you concatenate a protein sequence and a DNA
sequence (which is likely a mistake):
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK")
>>> dna_seq = Seq("ACGT")
>>> protein_seq + dna_seq
Seq('EVRNAKACGT')
You may often have many sequences to add together, which can be done
with a for loop like this:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> list_of_seqs = [Seq("ACGT"), Seq("AACC"), Seq("GGTT")]
>>> concatenated = Seq("")
>>> for s in list_of_seqs:
... concatenated += s
...
>>> concatenated
Seq('ACGTAACCGGTT')
Like Python strings, Biopython ``Seq`` also has a ``.join`` method:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
>>> spacer = Seq("N" * 10)
>>> spacer.join(contigs)
Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA')
Changing case
-------------
Python strings have very useful ``upper`` and ``lower`` methods for
changing the case. For example,
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> dna_seq = Seq("acgtACGT")
>>> dna_seq
Seq('acgtACGT')
>>> dna_seq.upper()
Seq('ACGTACGT')
>>> dna_seq.lower()
Seq('acgtacgt')
These are useful for doing case insensitive matching:
.. cont-doctest
.. code:: pycon
>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True
.. _`sec:seq-reverse-complement`:
Nucleotide sequences and (reverse) complements
----------------------------------------------
For nucleotide sequences, you can easily obtain the complement or
reverse complement of a ``Seq`` object using its built-in methods:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG')
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC')
As mentioned earlier, an easy way to just reverse a ``Seq`` object (or a
Python string) is slice it with -1 step:
.. cont-doctest
.. code:: pycon
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG')
If you do accidentally end up trying to do something weird like taking
the (reverse) complement of a protein sequence, the results are
biologically meaningless:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK")
>>> protein_seq.complement()
Seq('EBYNTM')
Here the letter “E” is not a valid IUPAC ambiguity code for nucleotides,
so was not complemented. However, “V” means “A”, “C” or “G” and has
complement “B“, and so on.
The example in
Section :ref:`sec:SeqIO-reverse-complement`
combines the ``Seq`` object’s reverse complement method with
``Bio.SeqIO`` for sequence input/output.
Transcription
-------------
Before talking about transcription, I want to try to clarify the strand
issue. Consider the following (made up) stretch of double stranded DNA
which encodes a short peptide:
.. math::
\begin{gathered}
\text{DNA coding strand (aka Crick strand, strand } +1 \text{)} \\
\text{5'} \qquad \texttt{ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG} \qquad \text{3'} \\
\texttt{|||||||||||||||||||||||||||||||||||||||} \\
\text{3'} \qquad \texttt{TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC} \qquad \text{5'} \\
\text{DNA template strand (aka Watson strand, strand } -1 \text{)}
\end{gathered}
Transcription of this DNA sequence produces the following RNA sequence:
.. math::
\begin{gathered}
\text{5'} \qquad \texttt{AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG} \qquad \text{3'} \\
\text{Single-stranded messenger RNA}
\end{gathered}
The actual biological transcription process works from the template
strand, doing a reverse complement (TCAG :math:`\rightarrow` CUGA) to
give the mRNA. However, in Biopython and bioinformatics in general, we
typically work directly with the coding strand because this means we can
get the mRNA sequence just by switching T :math:`\rightarrow` U.
Now let’s actually get down to doing a transcription in Biopython.
First, let’s create ``Seq`` objects for the coding and template DNA
strands:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT')
These should match the figure above - remember by convention nucleotide
sequences are normally read from the 5’ to 3’ direction, while in the
figure the template strand is shown reversed.
Now let’s transcribe the coding strand into the corresponding mRNA,
using the ``Seq`` object’s built-in ``transcribe`` method:
.. cont-doctest
.. code:: pycon
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
As you can see, all this does is to replace T by U.
If you do want to do a true biological transcription starting with the
template strand, then this becomes a two-step process:
.. cont-doctest
.. code:: pycon
>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
The ``Seq`` object also includes a back-transcription method for going
from the mRNA to the coding strand of the DNA. Again, this is a simple U
:math:`\rightarrow` T substitution:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
*Note:* The ``Seq`` object’s ``transcribe`` and ``back_transcribe``
methods were added in Biopython 1.49. For older releases you would have
to use the ``Bio.Seq`` module’s functions instead, see
Section :ref:`sec:seq-module-functions`.
.. _`sec:translation`:
Translation
-----------
Sticking with the same example discussed in the transcription section
above, now let’s translate this mRNA into the corresponding protein
sequence - again taking advantage of one of the ``Seq`` object’s
biological methods:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*')
You can also translate directly from the coding strand DNA sequence:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*')
You should notice in the above protein sequences that in addition to the
end stop character, there is an internal stop as well. This was a
deliberate choice of example, as it gives an excuse to talk about some
optional arguments, including different translation tables (Genetic
Codes).
The translation tables available in Biopython are based on those `from
the NCBI <https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi>`__
(see the next section of this tutorial). By default, translation will
use the *standard* genetic code (NCBI table id 1). Suppose we are
dealing with a mitochondrial sequence. We need to tell the translation
function to use the relevant genetic code instead:
.. cont-doctest
.. code:: pycon
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*')
You can also specify the table using the NCBI table number which is
shorter, and often included in the feature annotation of GenBank files:
.. cont-doctest
.. code:: pycon
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*')
Now, you may want to translate the nucleotides up to the first in frame
stop codon, and then stop (as happens in nature):
.. cont-doctest
.. code:: pycon
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*')
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR')
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*')
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR')
Notice that when you use the ``to_stop`` argument, the stop codon itself
is not translated - and the stop symbol is not included at the end of
your protein sequence.
You can even specify the stop symbol if you don’t like the default
asterisk:
.. cont-doctest
.. code:: pycon
>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@')
Now, suppose you have a complete coding sequence CDS, which is to say a
nucleotide sequence (e.g. mRNA – after any splicing) which is a whole
number of codons (i.e. the length is a multiple of three), commences
with a start codon, ends with a stop codon, and has no internal in-frame
stop codons. In general, given a complete CDS, the default translate
method will do what you want (perhaps with the ``to_stop`` option).
However, what if your sequence uses a non-standard start codon? This
happens a lot in bacteria – for example the gene yaaX in *E. coli*
K12:
.. code:: pycon
>>> from Bio.Seq import Seq
>>> gene = Seq(
... "GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
... "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
... "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
... "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
... "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA"
... )
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
ProteinAlpabet())
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')
In the bacterial genetic code ``GTG`` is a valid start codon, and while
it does *normally* encode Valine, if used as a start codon it should be
translated as methionine. This happens if you tell Biopython your
sequence is a complete CDS:
.. code:: pycon
>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')
In addition to telling Biopython to translate an alternative start codon
as methionine, using this option also makes sure your sequence really is
a valid CDS (you’ll get an exception if not).
The example in Section :ref:`sec:SeqIO-translate`
combines the ``Seq`` object’s translate method with ``Bio.SeqIO`` for
sequence input/output.
Translation Tables
------------------
In the previous sections we talked about the ``Seq`` object translation
method (and mentioned the equivalent function in the ``Bio.Seq`` module
– see Section :ref:`sec:seq-module-functions`). Internally these
use codon table objects derived from the NCBI information at
ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on
https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more
readable layout.
As before, let’s just focus on two choices: the Standard translation
table, and the translation table for Vertebrate Mitochondrial DNA.
.. doctest
.. code:: pycon
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
Alternatively, these tables are labeled with ID numbers 1 and 2,
respectively:
.. cont-doctest
.. code:: pycon
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
You can compare the actual tables visually by printing them:
.. code:: pycon
>>> print(standard_table)
Table 1 Standard, SGC0
| T | C | A | G |
--+---------+---------+---------+---------+--
T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+--
C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L(s)| CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+--
A | ATT I | ACT T | AAT N | AGT S | T
A | ATC I | ACC T | AAC N | AGC S | C
A | ATA I | ACA T | AAA K | AGA R | A
A | ATG M(s)| ACG T | AAG K | AGG R | G
--+---------+---------+---------+---------+--
G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V | GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+--
and:
.. code:: pycon
>>> print(mito_table)
Table 2 Vertebrate Mitochondrial, SGC1
| T | C | A | G |
--+---------+---------+---------+---------+--
T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA W | A
T | TTG L | TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+--
C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L | CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T | AAT N | AGT S | T
A | ATC I(s)| ACC T | AAC N | AGC S | C
A | ATA M(s)| ACA T | AAA K | AGA Stop| A
A | ATG M(s)| ACG T | AAG K | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V(s)| GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+--
You may find these following properties useful – for example if you are
trying to do your own gene finding:
.. cont-doctest
.. code:: pycon
>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
.. _`sec:seq-comparison`:
Comparing Seq objects
---------------------
Sequence comparison is actually a very complicated topic, and there is
no easy way to decide if two sequences are equal. The basic problem is
the meaning of the letters in a sequence are context dependent - the
letter “A” could be part of a DNA, RNA or protein sequence. Biopython
can track the molecule type, so comparing two ``Seq`` objects could mean
considering this too.
Should a DNA fragment “ACG” and an RNA fragment “ACG” be equal? What
about the peptide “ACG”? Or the Python string “ACG”? In everyday use,
your sequences will generally all be the same type of (all DNA, all RNA,
or all protein). Well, as of Biopython 1.65, sequence comparison only
looks at the sequence and compares like the Python string.
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> seq1 = Seq("ACGT")
>>> "ACGT" == seq1
True
>>> seq1 == "ACGT"
True
As an extension to this, using sequence objects as keys in a Python
dictionary is equivalent to using the sequence as a plain string for the
key. See also Section :ref:`sec:seq-to-string`.
Sequences with unknown sequence contents
----------------------------------------
In some cases, the length of a sequence may be known but not the actual
letters constituting it. For example, GenBank and EMBL files may
represent a genomic DNA sequence only by its config information, without
specifying the sequence contents explicitly. Such sequences can be
represented by creating a ``Seq`` object with the argument ``None``,
followed by the sequence length:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> unknown_seq = Seq(None, 10)
The ``Seq`` object thus created has a well-defined length. Any attempt
to access the sequence contents, however, will raise an
``UndefinedSequenceError``:
.. cont-doctest
.. code:: pycon
>>> unknown_seq
Seq(None, length=10)
>>> len(unknown_seq)
10
>>> print(unknown_seq)
Traceback (most recent call last):
...
Bio.Seq.UndefinedSequenceError: Sequence content is undefined
>>>
.. _`sec:partial-seq`:
Sequences with partially defined sequence contents
--------------------------------------------------
Sometimes the sequence contents is defined for parts of the sequence
only, and undefined elsewhere. For example, the following excerpt of a
MAF (Multiple Alignment Format) file shows an alignment of human, chimp,
macaque, mouse, rat, dog, and opossum genome sequences:
.. code:: text
s hg38.chr7 117512683 36 + 159345973 TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT
s panTro4.chr7 119000876 36 + 161824586 TTGAAAACCTGAATGTGAGAGTCACTCAAGGATAGT
s rheMac3.chr3 156330991 36 + 198365852 CTGAAATCCTGAATGTGAGAGTCAATCAAGGATGGT
s mm10.chr6 18207101 36 + 149736546 CTGAAAACCTAAGTAGGAGAATCAACTAAGGATAAT
s rn5.chr4 42326848 36 + 248343840 CTGAAAACCTAAGTAGGAGAGACAGTTAAAGATAAT
s canFam3.chr14 56325207 36 + 60966679 TTGAAAAACTGATTATTAGAGTCAATTAAGGATAGT
s monDom5.chr8 173163865 36 + 312544902 TTAAGAAACTGGAAATGAGGGTTGAATGACAAACTT
In each row, the first number indicates the starting position (in
zero-based coordinates) of the aligned sequence on the chromosome,
followed by the size of the aligned sequence, the strand, the size of
the full chromosome, and the aligned sequence.
A ``Seq`` object representing such a partially defined sequence can be
created using a dictionary for the ``data`` argument, where the keys are
the starting coordinates of the known sequence segments, and the values
are the corresponding sequence contents. For example, for the first
sequence we would use
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> seq = Seq({117512683: "TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT"}, length=159345973)
Extracting a subsequence from a partially define sequence may return a
fully defined sequence, an undefined sequence, or a partially defined
sequence, depending on the coordinates:
.. cont-doctest
.. code:: pycon
>>> seq[1000:1020]
Seq(None, length=20)
>>> seq[117512690:117512700]
Seq('CCTGAATGTG')
>>> seq[117512670:117512690]
Seq({13: 'TTGAAAA'}, length=20)
>>> seq[117512700:]
Seq({0: 'AGAGTCAGTCAAGGATAGT'}, length=41833273)
Partially defined sequences can also be created by appending sequences,
if at least one of the sequences is partially or fully undefined:
.. cont-doctest
.. code:: pycon
>>> seq = Seq("ACGT")
>>> undefined_seq = Seq(None, length=10)
>>> seq + undefined_seq + seq
Seq({0: 'ACGT', 14: 'ACGT'}, length=18)
.. _`sec:mutable-seq`:
MutableSeq objects
------------------
Just like the normal Python string, the ``Seq`` object is “read only”,
or in Python terminology, immutable. Apart from wanting the ``Seq``
object to act like a string, this is also a useful default since in many
biological applications you want to ensure you are not changing your
sequence data:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
Observe what happens if you try to edit the sequence:
.. cont-doctest
.. code:: pycon
>>> my_seq[5] = "G"
Traceback (most recent call last):
...
TypeError: 'Seq' object does not support item assignment
However, you can convert it into a mutable sequence (a ``MutableSeq``
object) and do pretty much anything you want with it:
.. cont-doctest
.. code:: pycon
>>> from Bio.Seq import MutableSeq
>>> mutable_seq = MutableSeq(my_seq)
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')
Alternatively, you can create a ``MutableSeq`` object directly from a
string:
.. doctest
.. code:: pycon
>>> from Bio.Seq import MutableSeq
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
Either way will give you a sequence object which can be changed:
.. cont-doctest
.. code:: pycon
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')
>>> mutable_seq[5] = "C"
>>> mutable_seq
MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA')
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA')
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG')
Note that the ``MutableSeq`` object’s ``reverse()`` method, like the
``reverse()`` method of a Python list, reverses the sequence in place.
An important technical difference between mutable and immutable objects
in Python means that you can’t use a ``MutableSeq`` object as a
dictionary key, but you can use a Python string or a ``Seq`` object in
this way.
Once you have finished editing your a ``MutableSeq`` object, it’s easy
to get back to a read-only ``Seq`` object should you need to:
.. cont-doctest
.. code:: pycon
>>> from Bio.Seq import Seq
>>> new_seq = Seq(mutable_seq)
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG')
You can also get a string from a ``MutableSeq`` object just like from a
``Seq`` object (Section :ref:`sec:seq-to-string`).
Finding subsequences
--------------------
Sequence objects have ``find``, ``rfind``, ``index``, and ``rindex``
methods that perform the same function as the corresponding methods
on plain string objects. The only difference is that the subsequence
can be a string (``str``), ``bytes``, ``bytearray``, ``Seq``, or
``MutableSeq`` object:
.. doctest
.. code:: pycon
>>> from Bio.Seq import Seq, MutableSeq
>>> seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
>>> seq.index("ATGGGCCGC")
9
>>> seq.index(b"ATGGGCCGC")
9
>>> seq.index(bytearray(b"ATGGGCCGC"))
9
>>> seq.index(Seq("ATGGGCCGC"))
9
>>> seq.index(MutableSeq("ATGGGCCGC"))
9
A ``ValueError`` is raised if the subsequence is not found:
.. cont-doctest
.. code:: pycon
>>> seq.index("ACTG") # doctest:+ELLIPSIS
Traceback (most recent call last):
...
ValueError: ...
while the ``find`` method returns -1 if the subsequence is not found:
.. cont-doctest
.. code:: pycon
>>> seq.find("ACTG")
-1
The methods ``rfind`` and ``rindex`` search for the subsequence starting
from the right hand side of the sequence:
.. cont-doctest
.. code:: pycon
>>> seq.find("CC")
1
>>> seq.rfind("CC")
29
Use the ``search`` method to search for multiple subsequences at the same
time. This method returns an iterator:
.. cont-doctest
.. code:: pycon
>>> for index, sub in seq.search(["CC", "GGG", "CC"]):
... print(index, sub)
...
1 CC
11 GGG
14 CC
23 GGG
28 CC
29 CC
The ``search`` method also takes plain strings, ``bytes``, ``bytearray``,
``Seq``, and ``MutableSeq`` objects as subsequences; identical subsequences
are reported only once, as in the example above.
.. _`sec:seq-module-functions`:
Working with strings directly
-----------------------------
To close this chapter, for those you who *really* don’t want to use the
sequence objects (or who prefer a functional programming style to an
object orientated one), there are module level functions in ``Bio.Seq``
will accept plain Python strings, ``Seq`` objects or ``MutableSeq``
objects:
.. doctest
.. code:: pycon
>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'
You are, however, encouraged to work with ``Seq`` objects by default.
|