1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223
|
\chapter{Going 3D: The PDB module}
Bio.PDB is a Biopython module that focuses on working with crystal structures of biological macromolecules. Among other things, Bio.PDB includes a PDBParser class that produces a Structure object, which can be used to access the atomic data in the file in a convenient manner. There is limited support for parsing the information contained in the PDB header.
%Note the \verb|Bio.PDB| module requires Numerical Python (numpy) to be installed.
\section{Reading and writing crystal structure files}
\subsection{Reading a PDB file}
First we create a \texttt{PDBParser} object:
\begin{verbatim}
>>> from Bio.PDB.PDBParser import PDBParser
>>> p = PDBParser(PERMISSIVE=1)
\end{verbatim}
The {\tt PERMISSIVE} flag indicates that a number of common problems (see \ref{problem structures}) associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present a {\tt PDBConstructionException} will be generated if any problems are detected during the parse operation.
The Structure object is then produced by letting the \texttt{PDBParser} object parse a PDB file (the PDB file in this case is called 'pdb1fat.ent', '1fat' is a user defined name for the structure):
\begin{verbatim}
>>> structure_id = "1fat"
>>> filename = "pdb1fat.ent"
>>> s = p.get_structure(structure_id, filename)
\end{verbatim}
You can extract the header and trailer (simple lists of strings) of the PDB
file from the PDBParser object with the {\tt get\_header} and {\tt get\_trailer}
methods. Note however that many PDB files contain headers with
incomplete or erroneous information. Many of the errors have been
fixed in the equivalent mmCIF files. \emph{Hence, if you are interested
in the header information, it is a good idea to extract information
from mmCIF files using the} \texttt{\emph{MMCIF2Dict}} \emph{tool
described below, instead of parsing the PDB header. }
Now that is clarified, let's return to parsing the PDB header. The
structure object has an attribute called \texttt{header} which is
a Python dictionary that maps header records to their values.
Example:
\begin{verbatim}
>>> resolution = structure.header['resolution']
>>> keywords = structure.header['keywords']
\end{verbatim}
The available keys are \verb+name+, \verb+head+, \verb+deposition_date+, \verb+release_date+, \verb+structure_method+, \verb+resolution+, \verb+structure_reference+ (which maps to a list of references), \verb+journal_reference+, \verb+author+, and \verb+compound+ (which maps to a dictionary with various information about the crystallized compound).
The dictionary can also be created without creating a \texttt{Structure}
object, ie. directly from the PDB file:
\begin{verbatim}
>>> file = open(filename, 'r')
>>> header_dict = parse_pdb_header(file)
>>> file.close()
\end{verbatim}
\subsection{Reading an mmCIF file}
Similarly to the case the case of PDB files, first create an \texttt{MMCIFParser} object:
\begin{verbatim}
>>> from Bio.PDB.MMCIFParser import MMCIFParser
>>> parser = MMCIFParser()
\end{verbatim}
Then use this parser to create a structure object from the mmCIF file:
\begin{verbatim}
>>> structure = parser.get_structure('1fat', '1fat.cif')
\end{verbatim}
To have some more low level access to an mmCIF file, you can use the \verb+MMCIF2Dict+ class to create a Python dictionary that maps all mmCIF
tags in an mmCIF file to their values. If there are multiple values
(like in the case of tag \verb+_atom_site.Cartn_y+, which holds
the $y$ coordinates of all atoms), the tag is mapped to a list of values.
The dictionary is created from the mmCIF file as follows:
\begin{verbatim}
>>> from Bio.PDB.MMCIF2Dict import MMCIF2Dict
>>> mmcif_dict = MMCIF2Dict('1FAT.cif')
\end{verbatim}
Example: get the solvent content from an mmCIF file:
\begin{verbatim}
>>> sc = mmcif_dict['_exptl_crystal.density_percent_sol']
\end{verbatim}
Example: get the list of the $y$ coordinates of all atoms
\begin{verbatim}
>>> y_list = mmcif_dict['_atom_site.Cartn_y']
\end{verbatim}
\subsection{Reading files in the MMTF format}
You can use the direct MMTFParser to read a structure from a file:
%want to use doctest ../Tests lib:mmtf
%but will trigger PDBConstructionWarning
\begin{verbatim}
>>> from Bio.PDB.mmtf import MMTFParser
>>> structure = MMTFParser.get_structure("PDB/4CUP.mmtf")
\end{verbatim}
Or you can use the same class to get a structure by it's PDB ID:
%want online doctest here
\begin{verbatim}
>>> structure = MMTFParser.get_structure_from_url("4CUP")
\end{verbatim}
This gives you a Structure object as if read from a PDB or mmCIF file.
You can also have access to the underlying data using the external
MMTF library which Biopython is using internally:
%want online doctest here
\begin{verbatim}
>>> from mmtf import fetch
>>> decoded_data = fetch("4CUP")
\end{verbatim}
For example you can access just the X-coordinate.
\begin{verbatim}
>>> print(decoded_data.x_coord_list)
...
\end{verbatim}
\subsection{Reading files in the PDB XML format}
That's not yet supported, but we are definitely planning to support that
in the future (it's not a lot of work). Contact the Biopython developers
(\mailto{biopython-dev@biopython.org}) if you need this).
\subsection{Writing PDB files}
Use the PDBIO class for this. It's easy to write out specific parts
of a structure too, of course.
Example: saving a structure
\begin{verbatim}
>>> io = PDBIO()
>>> io.set_structure(s)
>>> io.save('out.pdb')
\end{verbatim}
If you want to write out a part of the structure, make use of the
\texttt{Select} class (also in \texttt{PDBIO}). Select has four methods:
\begin{itemize}
\item \verb+accept_model(model)+
\item \verb+accept_chain(chain)+
\item \verb+accept_residue(residue)+
\item \verb+accept_atom(atom)+
\end{itemize}
By default, every method returns 1 (which means the model/\-chain/\-residue/\-atom
is included in the output). By subclassing \texttt{Select} and returning
0 when appropriate you can exclude models, chains, etc. from the output.
Cumbersome maybe, but very powerful. The following code only writes
out glycine residues:
\begin{verbatim}
>>> class GlySelect(Select):
... def accept_residue(self, residue):
... if residue.get_name()=='GLY':
... return True
... else:
... return False
...
>>> io = PDBIO()
>>> io.set_structure(s)
>>> io.save('gly_only.pdb', GlySelect())
\end{verbatim}
If this is all too complicated for you, the \texttt{Dice} module contains
a handy \texttt{extract} function that writes out all residues in
a chain between a start and end residue.
\section{Structure representation}
The overall layout of a \texttt{Structure} object follows the so-called SMCRA
(Structure/Model/Chain/Residue/Atom) architecture:
\begin{itemize}
\item A structure consists of models
\item A model consists of chains
\item A chain consists of residues
\item A residue consists of atoms
\end{itemize}
This is the way many structural biologists/bioinformaticians think
about structure, and provides a simple but efficient way to deal with
structure. Additional stuff is essentially added when needed. A UML
diagram of the \texttt{Structure} object (forget about the \texttt{Disordered}
classes for now) is shown in Fig. \ref{fig:smcra}. Such a data structure is not
necessarily best suited for the representation of the macromolecular content of
a structure, but it is absolutely necessary for a good interpretation of the
data present in a file that describes the structure (typically a PDB or MMCIF
file). If this hierarchy cannot represent the contents of a structure file, it
is fairly certain that the file contains an error or at least does not describe
the structure unambiguously. If a SMCRA data structure cannot be generated,
there is reason to suspect a problem. Parsing a PDB file can thus be used to
detect likely problems. We will give several examples of this in section
\ref{problem structures}.
\begin{figure}[htbp]
\begin{htmlonly}
\imgsrc[width=650, height=750]{images/smcra.png}
\end{htmlonly}
\begin{latexonly}
\centering
\includegraphics[width=0.8\textwidth]{images/smcra.png}
\end{latexonly}
\caption{UML diagram of SMCRA architecture of the \texttt{Structure} class used to represent a macromolecular structure.
Full lines with diamonds denote aggregation, full lines with
arrows denote referencing, full lines with triangles denote inheritance
and dashed lines with triangles denote interface realization.}
\label{fig:smcra}
\end{figure}
Structure, Model, Chain and Residue are all subclasses of the Entity base class.
The Atom class only (partly) implements the Entity interface (because an Atom
does not have children).
For each Entity subclass, you can extract a child by using a unique id for that
child as a key (e.g. you can extract an Atom object from a Residue object by
using an atom name string as a key, you can extract a Chain object from a Model
object by using its chain identifier as a key).
Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue
classes, which are both subclasses of the DisorderedEntityWrapper base class.
They hide the complexity associated with disorder and behave exactly as Atom
and Residue objects.
In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be
extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively)
by using an id as a key.
\begin{verbatim}
>>> child_entity = parent_entity[child_id]
\end{verbatim}
You can also get a list of all child Entities of a parent Entity object. Note
that this list is sorted in a specific way (e.g. according to chain identifier
for Chain objects in a Model object).
\begin{verbatim}
>>> child_list = parent_entity.get_list()
\end{verbatim}
You can also get the parent from a child:
\begin{verbatim}
>>> parent_entity = child_entity.get_parent()
\end{verbatim}
At all levels of the SMCRA hierarchy, you can also extract a \emph{full id}.
The full id is a tuple containing all id's starting from the top object (Structure)
down to the current object. A full id for a Residue object e.g. is something
like:
\begin{verbatim}
>>> full_id = residue.get_full_id()
>>> print(full_id)
("1abc", 0, "A", ("", 10, "A"))
\end{verbatim}
This corresponds to:
\begin{itemize}
\item The Structure with id \char`\"{}1abc\char`\"{}
\item The Model with id 0
\item The Chain with id \char`\"{}A\char`\"{}
\item The Residue with id (\char`\"{} \char`\"{}, 10, \char`\"{}A\char`\"{}).
\end{itemize}
The Residue id indicates that the residue is not a hetero-residue (nor a water)
because it has a blank hetero field, that its sequence identifier is 10 and
that its insertion code is \char`\"{}A\char`\"{}.
To get the entity's id, use the \verb+get_id+ method:
\begin{verbatim}
>>> entity.get_id()
\end{verbatim}
You can check if the entity has a child with a given id by using the \verb+has_id+ method:
\begin{verbatim}
>>> entity.has_id(entity_id)
\end{verbatim}
The length of an entity is equal to its number of children:
\begin{verbatim}
>>> nr_children = len(entity)
\end{verbatim}
It is possible to delete, rename, add, etc. child entities from a parent entity,
but this does not include any sanity checks (e.g. it is possible to add two
residues with the same id to one chain). This really should be done via a nice
Decorator class that includes integrity checking, but you can take a look at
the code (Entity.py) if you want to use the raw interface.
\subsection{Structure}
The Structure object is at the top of the hierarchy. Its id is a user given
string. The Structure contains a number of Model children. Most crystal structures
(but not all) contain a single model, while NMR structures typically consist
of several models. Disorder in crystal structures of large parts of molecules
can also result in several models.
\subsection{Model}
The id of the Model object is an integer, which is derived from the position
of the model in the parsed file (they are automatically numbered starting from
0).
Crystal structures generally have only one model (with id 0), while NMR files usually have several models. Whereas many PDB parsers assume that there is only one model, the \verb+Structure+ class in \verb+Bio.PDB+ is designed such that it can easily handle PDB files with more than one model.
As an example, to get the first model from a Structure object, use
\begin{verbatim}
>>> first_model = structure[0]
\end{verbatim}
The Model object stores a list of Chain children.
\subsection{Chain}
The id of a Chain object is derived from the chain identifier in the PDB/mmCIF
file, and is a single character (typically a letter). Each Chain in a Model object has a unique id. As an example, to get the Chain object with identifier ``A'' from a Model object, use
\begin{verbatim}
>>> chain_A = model["A"]
\end{verbatim}
The Chain object stores a list of Residue children.
\subsection{Residue}
A residue id is a tuple with three elements:
\begin{itemize}
\item The \textbf{hetero-field} (hetfield): this is
\begin{itemize}
\item \verb+'W'+ in the case of a water molecule;
\item \verb+'H_'+ followed by the residue name for other hetero residues (e.g. \verb+'H_GLC'+ in the case of a glucose molecule);
\item blank for standard amino and nucleic acids.
\end{itemize}
This scheme is adopted for reasons described in section \ref{hetero problems}.
\item The \textbf{sequence identifier} (resseq), an integer describing the position of the residue in the chain (e.g., 100);
\item The \textbf{insertion code} (icode); a string, e.g. 'A'. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81
residue) could e.g. have sequence identifiers and insertion codes
as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering
scheme stays in tune with that of the wild type structure.
\end{itemize}
The id of the above glucose residue would thus be \texttt{('H\_GLC',
100, 'A')}. If the hetero-flag and insertion code are blank, the sequence
identifier alone can be used:
\begin{verbatim}
# Full id
>>> residue=chain[(' ', 100, ' ')]
# Shortcut id
>>> residue=chain[100]
\end{verbatim}
The reason for the hetero-flag is that many, many PDB files use the
same sequence identifier for an amino acid and a hetero-residue or
a water, which would create obvious problems if the hetero-flag was
not used.
Unsurprisingly, a Residue object stores a set of Atom children. It also contains a string that specifies the residue name (e.g. ``ASN'')
and the segment identifier of the residue (well known to X-PLOR users, but not
used in the construction of the SMCRA data structure).
Let's look at some examples. Asn 10 with a blank insertion code would have residue
id {\tt (' ', 10, ' ')}. Water 10 would have residue id {\tt ('W', 10, ' ')}.
A glucose molecule (a hetero residue with residue name GLC) with sequence identifier
10 would have residue id {\tt ('H\_GLC', 10, ' ')}. In this way, the three
residues (with the same insertion code and sequence identifier) can be part
of the same chain because their residue id's are distinct.
In most cases, the hetflag and insertion code fields will be blank, e.g. {\tt (' ', 10, ' ')}.
In these cases, the sequence identifier can be used as a shortcut for the full
id:
\begin{verbatim}
# use full id
>>> res10 = chain[(' ', 10, ' ')]
# use shortcut
>>> res10 = chain[10]
\end{verbatim}
Each Residue object in a Chain object should have a unique id. However, disordered
residues are dealt with in a special way, as described in section \ref{point mutations}.
A Residue object has a number of additional methods:
\begin{verbatim}
>>> residue.get_resname() # returns the residue name, e.g. "ASN"
>>> residue.is_disordered() # returns 1 if the residue has disordered atoms
>>> residue.get_segid() # returns the SEGID, e.g. "CHN1"
>>> residue.has_id(name) # test if a residue has a certain atom
\end{verbatim}
You can use \texttt{is\_aa(residue)} to test if a Residue object is an amino acid.
\subsection{Atom}
The Atom object stores the data associated with an atom, and has no children.
The id of an atom is its atom name (e.g. ``OG'' for the side chain oxygen
of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms, as described in section \ref{disordered atoms}.
The atom id is simply the atom name (eg. \texttt{'CA'}). In practice,
the atom name is created by stripping all spaces from the atom name
in the PDB file.
However, in PDB files, a space can be part of an atom name. Often,
calcium atoms are called \texttt{'CA..'} in order to distinguish them
from C$\alpha$ atoms (which are called \texttt{'.CA.'}). In cases
were stripping the spaces would create problems (ie. two atoms called
\texttt{'CA'} in the same residue) the spaces are kept.
In a PDB file, an atom name consists of 4 chars, typically with leading and
trailing spaces. Often these spaces can be removed for ease of use (e.g. an
amino acid C\( \alpha \) atom is labeled ``.CA.'' in a PDB file, where
the dots represent spaces). To generate an atom name (and thus an atom id) the
spaces are removed, unless this would result in a name collision in a Residue
(i.e. two Atom objects with the same atom name and id). In the latter case,
the atom name including spaces is tried. This situation can e.g. happen when
one residue contains atoms with names ``.CA.'' and ``CA..'', although
this is not very likely.
The atomic data stored includes the atom name, the atomic coordinates (including
standard deviation if present), the B factor (including anisotropic B factors
and standard deviation if present), the altloc specifier and the full atom name
including spaces. Less used items like the atom element number or the atomic
charge sometimes specified in a PDB file are not stored.
To manipulate the atomic coordinates, use the \texttt{transform} method of
the \texttt{Atom} object. Use the \texttt{set\_coord} method to specify the
atomic coordinates directly.
An Atom object has the following additional methods:
\begin{verbatim}
>>> a.get_name() # atom name (spaces stripped, e.g. "CA")
>>> a.get_id() # id (equals atom name)
>>> a.get_coord() # atomic coordinates
>>> a.get_vector() # atomic coordinates as Vector object
>>> a.get_bfactor() # isotropic B factor
>>> a.get_occupancy() # occupancy
>>> a.get_altloc() # alternative location specifier
>>> a.get_sigatm() # standard deviation of atomic parameters
>>> a.get_siguij() # standard deviation of anisotropic B factor
>>> a.get_anisou() # anisotropic B factor
>>> a.get_fullname() # atom name (with spaces, e.g. ".CA.")
\end{verbatim}
To represent the atom coordinates, siguij, anisotropic B factor and sigatm Numpy
arrays are used.
The \texttt{get\_vector} method returns a \texttt{Vector} object representation of the coordinates of the \texttt{Atom} object, allowing you to do vector operations on atomic coordinates. \texttt{Vector} implements the full set of 3D vector operations, matrix multiplication (left and right) and some advanced rotation-related operations as well.
As an example of the capabilities of Bio.PDB's \texttt{Vector} module,
suppose that you would like to find the position of a Gly residue's C$\beta$
atom, if it had one. Rotating the N atom of
the Gly residue along the C$\alpha$-C bond over -120 degrees roughly
puts it in the position of a virtual C$\beta$ atom. Here's how to
do it, making use of the \texttt{rotaxis} method (which can be used
to construct a rotation around a certain axis) of the \texttt{Vector}
module:
\begin{verbatim}
# get atom coordinates as vectors
>>> n = residue['N'].get_vector()
>>> c = residue['C'].get_vector()
>>> ca = residue['CA'].get_vector()
# center at origin
>>> n = n - ca
>>> c = c - ca
# find rotation matrix that rotates n
# -120 degrees along the ca-c vector
>>> rot = rotaxis(-pi * 120.0/180.0, c)
# apply rotation to ca-n vector
>>> cb_at_origin = n.left_multiply(rot)
# put on top of ca atom
>>> cb = cb_at_origin+ca
\end{verbatim}
This example shows that it's possible to do some quite nontrivial
vector operations on atomic data, which can be quite useful. In addition
to all the usual vector operations (cross (use \texttt{{*}{*}}), and
dot (use \texttt{{*}}) product, angle, norm, etc.) and the above mentioned
\texttt{rotaxis} function, the \texttt{Vector} module also has methods
to rotate (\texttt{rotmat}) or reflect (\texttt{refmat}) one vector
on top of another.
\subsection{Extracting a specific \texttt{Atom/\-Residue/\-Chain/\-Model}
from a Structure}
These are some examples:
\begin{verbatim}
>>> model = structure[0]
>>> chain = model['A']
>>> residue = chain[100]
>>> atom = residue['CA']
\end{verbatim}
Note that you can use a shortcut:
\begin{verbatim}
>>> atom = structure[0]['A'][100]['CA']
\end{verbatim}
\section{Disorder}
Bio.PDB can handle both disordered atoms and point mutations (i.e. a
Gly and an Ala residue in the same position).
\subsection{General approach\label{disorder problems}}
Disorder should be dealt with from two points of view: the atom and the residue
points of view. In general, we have tried to encapsulate all the complexity that
arises from disorder. If you just want to loop over all C$\alpha$ atoms,
you do not care that some residues have a disordered side chain. On the other
hand it should also be possible to represent disorder completely in the data
structure. Therefore, disordered atoms or residues are stored in special objects
that behave as if there is no disorder. This is done by only representing a
subset of the disordered atoms or residues. Which subset is picked (e.g. which
of the two disordered OG side chain atom positions of a Ser residue is used)
can be specified by the user.
\subsection{Disordered atoms\label{disordered atoms}}
Disordered atoms are represented by ordinary \texttt{Atom} objects, but
all \texttt{Atom} objects that represent the same physical atom are stored
in a \texttt{Disordered\-Atom} object (see Fig. \ref{fig:smcra}).
Each \texttt{Atom} object in a \texttt{Disordered\-Atom} object can
be uniquely indexed using its altloc specifier. The \texttt{Disordered\-Atom}
object forwards all uncaught method calls to the selected Atom object,
by default the one that represents the atom with the highest
occupancy. The user can of course change the selected \texttt{Atom}
object, making use of its altloc specifier. In this way atom disorder
is represented correctly without much additional complexity. In other
words, if you are not interested in atom disorder, you will not be
bothered by it.
Each disordered atom has a characteristic altloc identifier. You can
specify that a \texttt{Disordered\-Atom} object should behave like
the \texttt{Atom} object associated with a specific altloc identifier:
\begin{verbatim}
>>> atom.disordered_select('A') # select altloc A atom
>>> print(atom.get_altloc())
"A"
>>> atom.disordered_select('B') # select altloc B atom
>>> print(atom.get_altloc())
"B"
\end{verbatim}
\subsection{Disordered residues}
\subsubsection*{Common case}
The most common case is a residue that contains one or more disordered atoms.
This is evidently solved by using DisorderedAtom objects to represent the disordered
atoms, and storing the DisorderedAtom object in a Residue object just like ordinary
Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in
fact the atom with the highest occupancy) by forwarding all uncaught method
calls to one of the Atom objects (the selected Atom object) it contains.
\subsubsection*{Point mutations\label{point mutations}}
A special case arises when disorder is due to a point mutation, i.e. when two
or more point mutants of a polypeptide are present in the crystal. An example
of this can be found in PDB structure 1EN2.
Since these residues belong to a different residue type (e.g. let's
say Ser 60 and Cys 60) they should not be stored in a single \texttt{Residue}
object as in the common case. In this case, each residue is represented
by one \texttt{Residue} object, and both \texttt{Residue} objects
are stored in a single \texttt{Disordered\-Residue} object (see Fig.
\ref{fig:smcra}).
The \texttt{Dis\-ordered\-Residue} object forwards all un\-caught methods to
the selected \texttt{Residue} object (by default the last \texttt{Residue}
object added), and thus behaves like an ordinary residue. Each
\texttt{Residue} object in a \texttt{Disordered\-Residue} object can be
uniquely identified by its residue name. In the above example, residue Ser 60
would have id ``SER'' in the \texttt{Disordered\-Residue} object, while
residue Cys 60 would have id ``CYS''. The user can select the active
\texttt{Residue} object in a \texttt{Disordered\-Residue} object via this id.
Example: suppose that a chain has a point mutation at position 10,
consisting of a Ser and a Cys residue. Make sure that residue 10 of
this chain behaves as the Cys residue.
\begin{verbatim}
>>> residue = chain[10]
>>> residue.disordered_select('CYS')
\end{verbatim}
In addition, you can get a list of all \texttt{Atom} objects (ie.
all \texttt{DisorderedAtom} objects are 'unpacked' to their individual
\texttt{Atom} objects) using the \texttt{get\_unpacked\_list} method
of a \texttt{(Disordered)\-Residue} object.
\section{Hetero residues}
\subsection{Associated problems\label{hetero problems}}
A common problem with hetero residues is that several hetero and non-hetero
residues present in the same chain share the same sequence identifier (and insertion
code). Therefore, to generate a unique id for each hetero residue, waters and
other hetero residues are treated in a different way.
Remember that Residue object have the tuple (hetfield, resseq, icode) as id.
The hetfield is blank (`` '') for amino and nucleic acids, and a string
for waters and other hetero residues. The content of the hetfield is explained
below.
\subsection{Water residues}
The hetfield string of a water residue consists of the letter ``W''. So
a typical residue id for a water is (``W'', 1, `` '').
\subsection{Other hetero residues}
The hetfield string for other hetero residues starts with ``H\_'' followed
by the residue name. A glucose molecule e.g. with residue name ``GLC''
would have hetfield ``H\_GLC''. Its residue id could e.g. be (``H\_GLC'',
1, `` '').
\section{Navigating through a Structure object}
\subsubsection*{Parse a PDB file, and extract some Model, Chain, Residue and Atom objects}
\begin{verbatim}
>>> from Bio.PDB.PDBParser import PDBParser
>>> parser = PDBParser()
>>> structure = parser.get_structure("test", "1fat.pdb")
>>> model = structure[0]
>>> chain = model["A"]
>>> residue = chain[1]
>>> atom = residue["CA"]
\end{verbatim}
\subsubsection*{Iterating through all atoms of a structure}
\begin{verbatim}
>>> p = PDBParser()
>>> structure = p.get_structure('X', 'pdb1fat.ent')
>>> for model in structure:
... for chain in model:
... for residue in chain:
... for atom in residue:
... print(atom)
...
\end{verbatim}
There is a shortcut if you want to iterate over all atoms in a structure:
\begin{verbatim}
>>> atoms = structure.get_atoms()
>>> for atom in atoms:
... print(atom)
...
\end{verbatim}
Similarly, to iterate over all atoms in a chain, use
\begin{verbatim}
>>> atoms = chain.get_atoms()
>>> for atom in atoms:
... print(atom)
...
\end{verbatim}
\subsubsection*{Iterating over all residues of a model}
or if you want to iterate over all residues in a model:
\begin{verbatim}
>>> residues = model.get_residues()
>>> for residue in residues:
... print(residue)
...
\end{verbatim}
You can also use the \verb+Selection.unfold_entities+ function to get all residues from a structure:
\begin{verbatim}
>>> res_list = Selection.unfold_entities(structure, 'R')
\end{verbatim}
or to get all atoms from a chain:
\begin{verbatim}
>>> atom_list = Selection.unfold_entities(chain, 'A')
\end{verbatim}
Obviously, \verb+A=atom, R=residue, C=chain, M=model, S=structure+.
You can use this to go up in the hierarchy, e.g. to get a list of
(unique) \verb+Residue+ or \verb+Chain+ parents from a list of
\verb+Atoms+:
\begin{verbatim}
>>> residue_list = Selection.unfold_entities(atom_list, 'R')
>>> chain_list = Selection.unfold_entities(atom_list, 'C')
\end{verbatim}
For more info, see the API documentation.
\subsubsection*{Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10)}
\begin{verbatim}
>>> residue_id = ("H_GLC", 10, " ")
>>> residue = chain[residue_id]
\end{verbatim}
\subsubsection*{Print all hetero residues in chain}
\begin{verbatim}
>>> for residue in chain.get_list():
... residue_id = residue.get_id()
... hetfield = residue_id[0]
... if hetfield[0]=="H":
... print(residue_id)
...
\end{verbatim}
\subsubsection*{Print out the coordinates of all CA atoms in a structure with B factor greater than 50}
\begin{verbatim}
>>> for model in structure.get_list():
... for chain in model.get_list():
... for residue in chain.get_list():
... if residue.has_id("CA"):
... ca = residue["CA"]
... if ca.get_bfactor() > 50.0:
... print(ca.get_coord())
...
\end{verbatim}
\subsubsection*{Print out all the residues that contain disordered atoms}
\begin{verbatim}
>>> for model in structure.get_list():
... for chain in model.get_list():
... for residue in chain.get_list():
... if residue.is_disordered():
... resseq = residue.get_id()[1]
... resname = residue.get_resname()
... model_id = model.get_id()
... chain_id = chain.get_id()
... print(model_id, chain_id, resname, resseq)
...
\end{verbatim}
\subsubsection*{Loop over all disordered atoms, and select all atoms with altloc A (if present)}
This will make sure that the SMCRA data structure will behave as if only the
atoms with altloc A are present.
\begin{verbatim}
>>> for model in structure.get_list():
... for chain in model.get_list():
... for residue in chain.get_list():
... if residue.is_disordered():
... for atom in residue.get_list():
... if atom.is_disordered():
... if atom.disordered_has_id("A"):
... atom.disordered_select("A")
...
\end{verbatim}
\subsubsection*{Extracting polypeptides from a \texttt{Structure} object\label{subsubsec:extracting_polypeptides}}
To extract polypeptides from a structure, construct a list of \texttt{Polypeptide} objects from a \texttt{Structure} object using \texttt{PolypeptideBuilder} as follows:
\begin{verbatim}
>>> model_nr = 1
>>> polypeptide_list = build_peptides(structure, model_nr)
>>> for polypeptide in polypeptide_list:
... print(polypeptide)
...
\end{verbatim}
A Polypeptide object is simply a UserList of Residue objects, and is always created from a single Model (in this case model 1).
You can use the resulting \texttt{Polypeptide} object to get the sequence as a \texttt{Seq} object or to get a list of C$\alpha$ atoms as well. Polypeptides can be built using a C-N or a C$\alpha$-C$\alpha$ distance criterion.
Example:
\begin{verbatim}
# Using C-N
>>> ppb=PPBuilder()
>>> for pp in ppb.build_peptides(structure):
... print(pp.get_sequence())
...
# Using CA-CA
>>> ppb=CaPPBuilder()
>>> for pp in ppb.build_peptides(structure):
... print(pp.get_sequence())
...
\end{verbatim}
Note that in the above case only model 0 of the structure is considered
by \texttt{PolypeptideBuilder}. However, it is possible to use \texttt{PolypeptideBuilder}
to build \texttt{Polypeptide} objects from \texttt{Model} and \texttt{Chain}
objects as well.
\subsubsection*{Obtaining the sequence of a structure}
The first thing to do is to extract all polypeptides from the structure
(as above). The sequence of each polypeptide can then easily
be obtained from the \texttt{Polypeptide} objects. The sequence is
represented as a Biopython \texttt{Seq} object, and its alphabet is
defined by a \texttt{ProteinAlphabet} object.
Example:
\begin{verbatim}
>>> seq = polypeptide.get_sequence()
>>> print(seq)
Seq('SNVVE...', <class Bio.Alphabet.ProteinAlphabet>)
\end{verbatim}
\section{Analyzing structures}
\subsection{Measuring distances}
The minus operator for atoms has been overloaded to return the distance between two atoms.
\begin{verbatim}
# Get some atoms
>>> ca1 = residue1['CA']
>>> ca2 = residue2['CA']
# Simply subtract the atoms to get their distance
>>> distance = ca1-ca2
\end{verbatim}
\subsection{Measuring angles}
Use the vector representation of the atomic coordinates, and
the \texttt{calc\_angle} function from the \texttt{Vector} module:
\begin{verbatim}
>>> vector1 = atom1.get_vector()
>>> vector2 = atom2.get_vector()
>>> vector3 = atom3.get_vector()
>>> angle = calc_angle(vector1, vector2, vector3)
\end{verbatim}
\subsection{Measuring torsion angles}
Use the vector representation of the atomic coordinates, and
the \texttt{calc\_dihedral} function from the \texttt{Vector} module:
\begin{verbatim}
>>> vector1 = atom1.get_vector()
>>> vector2 = atom2.get_vector()
>>> vector3 = atom3.get_vector()
>>> vector4 = atom4.get_vector()
>>> angle = calc_dihedral(vector1, vector2, vector3, vector4)
\end{verbatim}
\subsection{Determining atom-atom contacts}
Use \texttt{NeighborSearch} to perform neighbor lookup.
The neighbor lookup is done using a KD tree module written in C (see \texttt{Bio.KDTree}), making it very fast.
It also includes a fast method to find all point pairs within a certain distance of each other.
\subsection{Superimposing two structures}
Use a \texttt{Superimposer} object to superimpose two coordinate sets.
This object calculates the rotation and translation matrix that rotates
two lists of atoms on top of each other in such a way that their RMSD
is minimized. Of course, the two lists need to contain the same number
of atoms. The \texttt{Superimposer} object can also apply the rotation/translation
to a list of atoms. The rotation and translation are stored as a tuple
in the \texttt{rotran} attribute of the \texttt{Superimposer} object
(note that the rotation is right multiplying!). The RMSD is stored
in the \texttt{rmsd} attribute.
The algorithm used by \texttt{Superimposer} comes from \cite[Golub \& Van Loan]{golub1989} and makes use of singular value decomposition (this is implemented in the general \texttt{Bio.SVDSuperimposer} module).
Example:
\begin{verbatim}
>>> sup = Superimposer()
# Specify the atom lists
# 'fixed' and 'moving' are lists of Atom objects
# The moving atoms will be put on the fixed atoms
>>> sup.set_atoms(fixed, moving)
# Print rotation/translation/rmsd
>>> print(sup.rotran)
>>> print(sup.rms)
# Apply rotation/translation to the moving atoms
>>> sup.apply(moving)
\end{verbatim}
To superimpose two structures based on their active sites, use the active site atoms to calculate the rotation/translation matrices (as above), and apply these to the whole molecule.
\subsection{Mapping the residues of two related structures onto each other}
First, create an alignment file in FASTA format, then use the \texttt{StructureAlignment}
class. This class can also be used for alignments with more than two
structures.
\subsection{Calculating the Half Sphere Exposure}
Half Sphere Exposure (HSE) is a new, 2D measure of solvent exposure
\cite{hamelryck2005}.
Basically, it counts the number of C$\alpha$ atoms around a residue
in the direction of its side chain, and in the opposite direction
(within a radius of $13 \AA$). Despite its simplicity, it outperforms
many other measures of solvent exposure.
HSE comes in two flavors: HSE$\alpha$ and HSE$\beta$. The former
only uses the C$\alpha$ atom positions, while the latter uses the
C$\alpha$ and C$\beta$ atom positions. The HSE measure is calculated
by the \texttt{HSExposure} class, which can also calculate the contact
number. The latter class has methods which return dictionaries that
map a \texttt{Residue} object to its corresponding HSE$\alpha$, HSE$\beta$
and contact number values.
Example:
\begin{verbatim}
>>> model = structure[0]
>>> hse = HSExposure()
# Calculate HSEalpha
>>> exp_ca = hse.calc_hs_exposure(model, option='CA3')
# Calculate HSEbeta
>>> exp_cb=hse.calc_hs_exposure(model, option='CB')
# Calculate classical coordination number
>>> exp_fs = hse.calc_fs_exposure(model)
# Print HSEalpha for a residue
>>> print(exp_ca[some_residue])
\end{verbatim}
\subsection{Determining the secondary structure}
For this functionality, you need to install DSSP (and obtain a license
for it --- free for academic use, see \url{http://www.cmbi.kun.nl/gv/dssp/}).
Then use the \texttt{DSSP} class, which maps \texttt{Residue} objects
to their secondary structure (and accessible surface area). The DSSP
codes are listed in Table \ref{cap:DSSP-codes}. Note that DSSP (the
program, and thus by consequence the class) cannot handle multiple
models!
\begin{table}
\begin{tabular}{|c|c|}
\hline
Code&
Secondary structure \\
\hline
\hline
H&
$\alpha$-helix \\
\hline
B&
Isolated $\beta$-bridge residue \\
\hline
E&
Strand \\
\hline
G&
3-10 helix \\
\hline
I&
$\Pi$-helix \\
\hline
T&
Turn\\
\hline
S&
Bend \\
\hline
-&
Other\\
\hline
\end{tabular}
\caption{\label{cap:DSSP-codes}DSSP codes in Bio.PDB.}
\end{table}
The \texttt{DSSP} class can also be used to calculate the accessible surface area of a residue. But see also section \ref{subsec:residue_depth}.
\subsection{Calculating the residue depth\label{subsec:residue_depth}}
Residue depth is the average distance of a residue's atoms from the
solvent accessible surface. It's a fairly new and very powerful parameterization
of solvent accessibility. For this functionality, you need to install
Michel Sanner's MSMS program (\url{http://www.scripps.edu/pub/olson-web/people/sanner/html/msms_home.html}).
Then use the \texttt{ResidueDepth} class. This class behaves as a
dictionary which maps \texttt{Residue} objects to corresponding (residue
depth, C$\alpha$ depth) tuples. The C$\alpha$ depth is the distance
of a residue's C$\alpha$ atom to the solvent accessible surface.
Example:
\begin{verbatim}
>>> model = structure[0]
>>> rd = ResidueDepth(model, pdb_file)
>>> residue_depth, ca_depth=rd[some_residue]
\end{verbatim}
You can also get access to the molecular surface itself (via the \texttt{get\_surface}
function), in the form of a Numeric Python array with the surface points.
\section{Common problems in PDB files}
It is well known that many PDB files contain semantic errors (not the
structures themselves, but their representation in PDB files).
Bio.PDB tries to handle this in two ways. The PDBParser
object can behave in two ways: a restrictive way and a permissive
way, which is the default.
Example:
\begin{verbatim}
# Permissive parser
>>> parser = PDBParser(PERMISSIVE=1)
>>> parser = PDBParser() # The same (default)
# Strict parser
>>> strict_parser = PDBParser(PERMISSIVE=0)
\end{verbatim}
In the permissive state (DEFAULT), PDB files that obviously contain
errors are ``corrected'' (i.e. some residues or atoms are left out).
These errors include:
\begin{itemize}
\item Multiple residues with the same identifier
\item Multiple atoms with the same identifier (taking into account the altloc
identifier)
\end{itemize}
These errors indicate real problems in the PDB file (for details see
\cite[Hamelryck and Manderick, 2003]{hamelryck2003a}). In the restrictive state, PDB files with errors cause an exception to occur. This is useful to find errors in PDB files.
Some errors however are automatically corrected. Normally each disordered
atom should have a non-blank altloc identifier. However, there are
many structures that do not follow this convention, and have a blank
and a non-blank identifier for two disordered positions of the same
atom. This is automatically interpreted in the right way.
Sometimes a structure contains a list of residues belonging to chain
A, followed by residues belonging to chain B, and again followed by
residues belonging to chain A, i.e. the chains are 'broken'. This
is also correctly interpreted.
\subsection{Examples\label{problem structures}}
The PDBParser/Structure class was tested on about 800 structures (each belonging
to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5
seconds per structure. Parsing the structure of the large ribosomal subunit
(1FKK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.
Three exceptions were generated in cases where an unambiguous data structure
could not be built. In all three cases, the likely cause is an error in the
PDB file that should be corrected. Generating an exception in these cases
is much better than running the chance of incorrectly describing
the structure in a data structure.
\subsubsection{Duplicate residues}
One structure contains two amino acid residues in one chain with the same sequence
identifier (resseq 3) and icode. Upon inspection it was found that this chain
contains the residues Thr A3, \ldots, Gly A202, Leu A3, Glu A204. Clearly,
Leu A3 should be Leu A203. A couple of similar situations exist for structure
1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu
B65 should be Glu B66).
\subsubsection{Duplicate atoms}
Structure 1EJG contains a Ser/Pro point mutation in chain A at position 22.
In turn, Ser 22 contains some disordered atoms. As expected, all atoms belonging
to Ser 22 have a non-blank altloc specifier (B or C). All atoms of Pro 22 have
altloc A, except the N atom which has a blank altloc. This generates an exception,
because all atoms belonging to two residues at a point mutation should have
non-blank altloc. It turns out that this atom is probably shared by Ser and
Pro 22, as Ser 22 misses the N atom. Again, this points to a problem in the
file: the N atom should be present in both the Ser and the Pro residue, in both
cases associated with a suitable altloc identifier.
\subsection{Automatic correction}
Some errors are quite common and can be easily corrected without much risk of
making a wrong interpretation. These cases are listed below.
\subsubsection{A blank altloc for a disordered atom}
Normally each disordered atom should have a non-blank altloc identifier. However,
there are many structures that do not follow this convention, and have a blank
and a non-blank identifier for two disordered positions of the same atom. This
is automatically interpreted in the right way.
\subsubsection{Broken chains}
Sometimes a structure contains a list of residues belonging to chain A, followed
by residues belonging to chain B, and again followed by residues belonging to
chain A, i.e. the chains are ``broken''. This is correctly interpreted.
\subsection{Fatal errors}
Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing
and risking a mistake, an exception is generated, and the user is expected to
correct the PDB file. These cases are listed below.
\subsubsection{Duplicate residues}
All residues in a chain should have a unique id. This id is generated based
on:
\begin{itemize}
\item The sequence identifier (resseq).
\item The insertion code (icode).
\item The hetfield string (``W'' for waters and ``H\_'' followed by the
residue name for other hetero residues)
\item The residue names of the residues in the case of point mutations (to store the
Residue objects in a DisorderedResidue object).
\end{itemize}
If this does not lead to a unique id something is quite likely wrong, and an
exception is generated.
\subsubsection{Duplicate atoms}
All atoms in a residue should have a unique id. This id is generated based on:
\begin{itemize}
\item The atom name (without spaces, or with spaces if a problem arises).
\item The altloc specifier.
\end{itemize}
If this does not lead to a unique id something is quite likely wrong, and an
exception is generated.
\section{Accessing the Protein Data Bank}
\subsection{Downloading structures from the Protein Data Bank}
Structures can be downloaded from the PDB (Protein Data Bank)
by using the \texttt{retrieve\_pdb\_file} method on a \texttt{PDBList} object.
The argument for this method is the PDB identifier of the structure.
\begin{verbatim}
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file('1FAT')
\end{verbatim}
The \texttt{PDBList} class can also be used as a command-line tool:
\begin{verbatim}
python PDBList.py 1fat
\end{verbatim}
The downloaded file will be called \texttt{pdb1fat.ent} and stored
in the current working directory. Note that the \texttt{retrieve\_pdb\_file}
method also has an optional argument \texttt{pdir} that specifies
a specific directory in which to store the downloaded PDB files.
The \texttt{retrieve\_pdb\_file} method also has some options to specify
the compression format used for the download, and the program used
for local decompression (default \texttt{.Z} format and \texttt{gunzip}).
In addition, the PDB ftp site can be specified upon creation of the
\texttt{PDBList} object. By default, the server of the Worldwide Protein Data Bank (\url{ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/})
is used. See the API documentation for more details. Thanks again
to Kristian Rother for donating this module.
\subsection{Downloading the entire PDB}
The following commands will store all PDB files in the \texttt{/data/pdb}
directory:
\begin{verbatim}
python PDBList.py all /data/pdb
python PDBList.py all /data/pdb -d
\end{verbatim}
\noindent The API method for this is called \texttt{download\_entire\_pdb}.
Adding the \texttt{-d} option will store all files in the same directory.
Otherwise, they are sorted into PDB-style subdirectories according
to their PDB ID's. Depending on the traffic, a complete download will
take 2-4 days.
\subsection{Keeping a local copy of the PDB up to date}
This can also be done using the \texttt{PDBList} object. One simply
creates a \texttt{PDBList} object (specifying the directory where
the local copy of the PDB is present) and calls the \texttt{update\_pdb}
method:
\begin{verbatim}
>>> pl = PDBList(pdb='/data/pdb')
>>> pl.update_pdb()
\end{verbatim}
One can of course make a weekly \texttt{cronjob} out of this to keep
the local copy automatically up-to-date. The PDB ftp site can also
be specified (see API documentation).
\texttt{PDBList} has some additional methods that can be of use. The
\texttt{get\_all\_obsolete} method can be used to get a list of all
obsolete PDB entries. The \texttt{changed\_this\_week} method can
be used to obtain the entries that were added, modified or obsoleted
during the current week. For more info on the possibilities of \texttt{PDBList},
see the API documentation.
\section{General questions}
\subsection{How well tested is Bio.PDB?}
Pretty well, actually. Bio.PDB has been extensively tested on nearly
5500 structures from the PDB - all structures seemed to be parsed
correctly. More details can be found in the Bio.PDB Bioinformatics
article. Bio.PDB has been used/is being used in many research projects
as a reliable tool. In fact, I'm using Bio.PDB almost daily for research
purposes and continue working on improving it and adding new features.
\subsection{How fast is it?}
The \texttt{PDBParser} performance was tested on about 800 structures
(each belonging to a unique SCOP superfamily). This takes about 20
minutes, or on average 1.5 seconds per structure. Parsing the structure
of the large ribosomal subunit (1FKK), which contains about 64000
atoms, takes 10 seconds on a 1000 MHz PC. In short: it's more than
fast enough for many applications.
\subsection{Is there support for molecular graphics?}
Not directly, mostly since there are quite a few Python based/Python
aware solutions already, that can potentially be used with Bio.PDB.
My choice is Pymol, BTW (I've used this successfully with Bio.PDB,
and there will probably be specific PyMol modules in Bio.PDB soon/some
day). Python based/aware molecular graphics solutions include:
\begin{itemize}
\item PyMol: \url{http://pymol.sourceforge.net/}
\item Chimera: \url{http://www.cgl.ucsf.edu/chimera/}
\item PMV: \url{http://www.scripps.edu/~sanner/python/}
\item Coot: \url{http://www.ysbl.york.ac.uk/~emsley/coot/}
\item CCP4mg: \url{http://www.ysbl.york.ac.uk/~lizp/molgraphics.html}
\item mmLib: \url{http://pymmlib.sourceforge.net/}
\item VMD: \url{http://www.ks.uiuc.edu/Research/vmd/}
\item MMTK: \url{http://starship.python.net/crew/hinsen/MMTK/}
\end{itemize}
\subsection{Who's using Bio.PDB?}
Bio.PDB was used in the construction of DISEMBL, a web server that
predicts disordered regions in proteins (\url{http://dis.embl.de/}),
and COLUMBA, a website that provides annotated protein structures
(\url{http://www.columba-db.de/}). Bio.PDB has also been used to
perform a large scale search for active sites similarities between
protein structures in the PDB \cite[Hamelryck, 2003]{hamelryck2003b}, and to develop a new algorithm
that identifies linear secondary structure elements \cite[Majumdar \textit{et al.}, 2005]{majumdar2005}.
Judging from requests for features and information, Bio.PDB is also
used by several LPCs (Large Pharmaceutical Companies :-).
% Perhaps this section should be updated for other Biopython modules, and
% moved to the Biopython FAQ above.
|