\chapter{Advanced}
\label{chapter:advanced}
\section{Parser Design}
Many of the older Biopython parsers were built around an event-oriented
design that includes Scanner and Consumer objects.
Scanners take input from a data source and analyze it line by line,
sending off an event whenever it recognizes some information in the
data. For example, if the data includes information about an organism
name, the scanner may generate an \verb|organism_name| event whenever it
encounters a line containing the name.
Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
\verb|organism_name| event, and then processes it in whatever manner is
necessary for the current application.
This is a very flexible framework, which is advantageous if you want to
be able to parse a file format into more than one representation. For
example, the \verb|Bio.GenBank| module uses this to construct either
\verb|SeqRecord| objects or file-format-specific record objects.
More recently, many of the parsers added for \verb|Bio.SeqIO| and
\verb|Bio.AlignIO| take a much simpler approach, but only generate a
single object representation (\verb|SeqRecord| and
\verb|MultipleSeqAlignment| objects respectively). In some cases the
\verb|Bio.SeqIO| parsers actually wrap
another Biopython parser. For example, the \verb|Bio.SwissProt| parser
produces SwissProt-format-specific record objects, which are then converted
into \verb|SeqRecord| objects.
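The Scanner/Consumer split can be illustrated with a minimal sketch. It is not Biopython's implementation: the class names \verb|ToyScanner|, \verb|DictConsumer| and \verb|CountingConsumer|, and the toy \verb|ORGANISM| line prefix, are assumptions made purely for illustration. The scanner recognizes a line and fires an \verb|organism_name| event, and each consumer decides what to build from it.

```python
import io


class DictConsumer:
    """Hypothetical consumer that stores event data in a plain dictionary."""

    def __init__(self):
        self.record = {}

    def organism_name(self, text):
        self.record["organism"] = text


class CountingConsumer:
    """Hypothetical consumer that only counts the events it receives."""

    def __init__(self):
        self.events = 0

    def organism_name(self, text):
        self.events += 1


class ToyScanner:
    """Hypothetical scanner: reads line by line and fires events."""

    def feed(self, handle, consumer):
        for line in handle:
            # Assumed toy format: the organism line starts with "ORGANISM".
            if line.startswith("ORGANISM"):
                consumer.organism_name(line[len("ORGANISM"):].strip())


data = "LOCUS       X00001\nORGANISM    Homo sapiens\n"

# The same scanner drives two different representations of the data.
dict_consumer = DictConsumer()
ToyScanner().feed(io.StringIO(data), dict_consumer)

counting_consumer = CountingConsumer()
ToyScanner().feed(io.StringIO(data), counting_consumer)
```

Because the scanner knows nothing about what the consumer builds, supporting a new representation only requires writing a new consumer.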
\section{Substitution Matrices}
\subsection{SubsMat}
This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data. Additionally, you may select a matrix from MatrixInfo.py, a collection of established substitution matrices. The \verb+SeqMat+ class derives from a dictionary:
\begin{verbatim}
class SeqMat(dict)
\end{verbatim}
The dictionary is of the form \verb|{(i1,j1):n1, (i1,j2):n2,...,(ik,jk):nk}| where i, j are alphabet letters, and n is a value.
\begin{enumerate}
\item Attributes
\begin{enumerate}
\item \verb|self.alphabet|: a class as defined in Bio.Alphabet
\item \verb|self.ab_list|: a sorted list of the alphabet's letters, needed mainly for internal purposes.
\end{enumerate}
\item Methods
\begin{enumerate}
\item
\begin{verbatim}
__init__(self,data=None,alphabet=None, mat_name='', build_later=0):
\end{verbatim}
\begin{enumerate}
\item \verb|data|: can be either a dictionary, or another SeqMat instance.
\item \verb|alphabet|: a Bio.Alphabet instance. If not provided, construct an alphabet from data.
\item \verb|mat_name|: matrix name, such as "BLOSUM62" or "PAM250"
\item \verb|build_later|: default false. If true, the user may supply only an alphabet and an empty dictionary, with the intention of building the matrix later. This skips the sanity check of alphabet size vs. matrix size.
\end{enumerate}
\item
\begin{verbatim}
entropy(self,obs_freq_mat)
\end{verbatim}
\begin{enumerate}
\item \verb|obs_freq_mat|: an observed frequency matrix. Returns the matrix's entropy, based on the frequency in \verb|obs_freq_mat|. The matrix instance should be LO or SUBS.
\end{enumerate}
\item
\begin{verbatim}
sum(self)
\end{verbatim}
Calculates the sum of values for each letter in the matrix's alphabet, and returns it as a dictionary of the form \verb|{i1: s1, i2: s2,...,in:sn}|, where:
\begin{itemize}
\item i: an alphabet letter;
\item s: sum of all values in a half-matrix for that letter;
\item n: number of letters in alphabet.
\end{itemize}
\item
\begin{verbatim}
print_mat(self,f,format="%4d",bottomformat="%4s",alphabet=None)
\end{verbatim}
Prints the matrix to file handle \verb|f|. \verb|format| is the format field for the matrix values; \verb|bottomformat| is the format field for the bottom row, which contains the matrix letters. Example output for a 3-letter alphabet matrix:
\begin{verbatim}
A  23
B  12  34
C   7  22  27
    A   B   C
\end{verbatim}
The \verb|alphabet| optional argument is a string of all characters in the alphabet. If supplied, the order of letters along the axes is taken from the string, rather than by alphabetical order.
\end{enumerate}
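The triangular layout described for \verb|print_mat| can be approximated with a short sketch. This is not the \verb|SeqMat| code: the function name \verb|format_half_matrix| is hypothetical, and the single leading space on the bottom row is an assumption chosen so that the letters line up under their columns. It only shows how the \verb|%4d| and \verb|%4s| format fields shape each row of a half matrix whose keys are in alphabetical order.

```python
def format_half_matrix(half, letters, fmt="%4d", bottomfmt="%4s"):
    """Return the lower triangle of a half matrix as text (illustrative only)."""
    lines = []
    for row, i in enumerate(letters):
        # Row i holds the values for pairs (j, i) with j up to and including i.
        cells = "".join(fmt % half[(j, i)] for j in letters[: row + 1])
        lines.append(i + cells)
    # Bottom row: one leading space (for the row label), then the letters.
    lines.append(" " + "".join(bottomfmt % j for j in letters))
    return "\n".join(lines)


half = {
    ("A", "A"): 23,
    ("A", "B"): 12, ("B", "B"): 34,
    ("A", "C"): 7, ("B", "C"): 22, ("C", "C"): 27,
}
table = format_half_matrix(half, "ABC")
print(table)
```

Passing a different \verb|letters| string changes the order of the axes, which mirrors the role of the optional \verb|alphabet| argument.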
\item Usage
The following section is laid out in the order in which most people generate a log-odds matrix. Interim matrices can be generated and
investigated along the way, although most people only need the final log-odds matrix.
\begin{enumerate}
\item Generating an Accepted Replacement Matrix
Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the counted number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:
\begin{verbatim}
('A','C'): 10, ('C','A'): 12
\end{verbatim}
As the order doesn't matter, the user can instead provide a single entry:
\begin{verbatim}
('A','C'): 22
\end{verbatim}
A SeqMat instance may be initialized with either a full matrix (the first method of counting: 10, 12) or a half matrix (the latter method: 22). A full protein
alphabet matrix has 20x20 = 400 entries. A half matrix of that alphabet has 20x20/2 + 20/2 = 210 entries, because the same-letter entries
on the matrix diagonal are kept in full. Given an alphabet size of N:
\begin{enumerate}
\item Full matrix size: N*N
\item Half matrix size: N(N+1)/2
\end{enumerate}
The SeqMat constructor automatically generates a half matrix if a full matrix is passed. If a half matrix is passed, the letters in each key should be provided in alphabetical order: \verb|('A','C')| and not \verb|('C','A')|.
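The relationship between full and half matrices can be sketched in plain Python. The helper names \verb|to_half_matrix| and \verb|half_matrix_size| are hypothetical and are not part of SubsMat; the sketch assumes that the two orderings of a pair are simply summed under the alphabetically ordered key, as in the 10 + 12 = 22 example above.

```python
def to_half_matrix(full):
    """Fold a full replacement-count dictionary into a half matrix."""
    half = {}
    for (i, j), value in full.items():
        # Store each pair under its alphabetically ordered key.
        key = (i, j) if i <= j else (j, i)
        half[key] = half.get(key, 0) + value
    return half


def half_matrix_size(n):
    """Number of entries in a half matrix over an alphabet of n letters."""
    return n * (n + 1) // 2


full = {("A", "C"): 10, ("C", "A"): 12, ("A", "A"): 5}
half = to_half_matrix(full)
```

Here \verb|half| holds 22 under \verb|('A','C')| and leaves the diagonal entry untouched, and \verb|half_matrix_size(20)| gives the 210 entries quoted for a protein alphabet.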
At this point, if all you wish to do is generate a log-odds matrix, please go to the section titled Example of Use. The following text describes the nitty-gritty of internal functions, to be used by people who wish to investigate their nucleotide/amino-acid frequency data more thoroughly.
\item Generating the observed frequency matrix (OFM)
Use:
\begin{verbatim}
OFM = SubsMat._build_obs_freq_mat(ARM)
\end{verbatim}
The OFM is generated from the ARM, only instead of replacement counts, it contains replacement frequencies.
\item Generating an expected frequency matrix (EFM)
Use:
\begin{verbatim}
EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
\end{verbatim}
\begin{enumerate}
\item \verb|exp_freq_table|: should be a FreqTable instance. See section~\ref{sec:freq_table} for detailed information on FreqTable. Briefly, the expected frequency table has the frequencies of appearance for each member of the alphabet. It is
implemented as a dictionary with the alphabet letters as keys, and each letter's frequency as a value. Values sum to 1.
\end{enumerate}
The expected frequency table can (and generally should) be generated from the observed frequency matrix. So in most cases you will generate \verb|exp_freq_table| using:
\begin{verbatim}
>>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
>>> EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)
\end{verbatim}
However, you can supply your own \verb|exp_freq_table| if you wish.
\item Generating a substitution frequency matrix (SFM)
Use:
\begin{verbatim}
SFM = SubsMat._build_subs_mat(OFM,EFM)
\end{verbatim}
Accepts an OFM and an EFM, and returns a matrix in which each value is the observed frequency divided by the corresponding expected frequency.
\item Generating a log-odds matrix (LOM)
Use:
\begin{verbatim}
LOM=SubsMat._build_log_odds_mat(SFM[,logbase=10,factor=10.0,round_digit=1])
\end{verbatim}
\begin{enumerate}
\item Accepts an SFM.
\item \verb|logbase|: base of the logarithm used to generate the log-odds values.
\item \verb|factor|: factor used to multiply the log-odds values. Each entry is generated by log(SFM[key])*factor and, if required, rounded to \verb|round_digit| places after the decimal point.
\end{enumerate}
\end{enumerate}
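The chain from ARM to log-odds matrix described in the steps above can be sketched in plain Python. This is an illustration under stated assumptions and not the SubsMat implementation: all function names are hypothetical, the input is assumed to be a half matrix with no zero counts, each off-diagonal observation is assumed to contribute half of its frequency to each of its two letters, and the expected frequency of a mixed pair is taken as twice the product of the letter frequencies.

```python
import math


def obs_freq_mat(arm):
    """Convert replacement counts into replacement frequencies (OFM)."""
    total = sum(arm.values())
    return {pair: count / total for pair, count in arm.items()}


def exp_freq_table_from_obs(ofm):
    """Derive per-letter expected frequencies from the OFM."""
    table = {}
    for (i, j), freq in ofm.items():
        if i == j:
            table[i] = table.get(i, 0.0) + freq
        else:
            # Assumption: a mixed pair counts half for each letter.
            table[i] = table.get(i, 0.0) + freq / 2
            table[j] = table.get(j, 0.0) + freq / 2
    return table


def exp_freq_mat(ofm, table):
    """Expected pair frequencies (EFM) for the keys of a half matrix."""
    return {
        (i, j): table[i] * table[j] * (1 if i == j else 2)
        for (i, j) in ofm
    }


def log_odds_mat(ofm, efm, logbase=10, factor=10.0, round_digit=1):
    """Log-odds values: factor * log(observed / expected), rounded."""
    return {
        pair: round(factor * math.log(ofm[pair] / efm[pair], logbase), round_digit)
        for pair in ofm
    }


arm = {("A", "A"): 40, ("A", "C"): 20, ("C", "C"): 40}
ofm = obs_freq_mat(arm)
table = exp_freq_table_from_obs(ofm)
efm = exp_freq_mat(ofm, table)
lom = log_odds_mat(ofm, efm)
```

With these counts each letter has an expected frequency of 0.5, so the same-letter pairs score 2.0 and the A/C pair scores -4.0 at the default base and factor.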
\item Example of use
As most people would want to generate a log-odds matrix, with minimum hassle, SubsMat provides one function which does it all:
\begin{verbatim}
make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=10,
factor=10.0,round_digit=0):
\end{verbatim}
\begin{enumerate}
\item \verb|acc_rep_mat|: user-provided accepted replacement matrix.
\item \verb|exp_freq_table|: expected frequency table. Used if provided; otherwise it is generated from \verb|acc_rep_mat|.
\item \verb|logbase|: base of the logarithm for the log-odds matrix. Default is base 10.
\item \verb|factor|: factor by which the log-odds values are multiplied. Default is 10.0.
\item \verb|round_digit|: number of digits after the decimal point to which the result is rounded. Default is zero.
\end{enumerate}
\end{enumerate}
\subsection{FreqTable}
\label{sec:freq_table}
\begin{verbatim}
FreqTable.FreqTable(UserDict.UserDict)
\end{verbatim}
\begin{enumerate}
\item Attributes:
\begin{enumerate}
\item \verb|alphabet|: A Bio.Alphabet instance.
\item \verb|data|: frequency dictionary
\item \verb|count|: count dictionary (in case counts are provided).
\end{enumerate}
\item Functions:
\begin{enumerate}
\item \verb|read_count(f)|: read a count file from stream f. Then convert to frequencies.
\item \verb|read_freq(f)|: read a frequency data file from stream f. Of course, we then don't have the counts, but it is usually the letter frequencies which are interesting.
\end{enumerate}
\item Example of use:
The expected counts of the residues in the database are stored in a whitespace-delimited file, in the following format (example given for a 3-letter alphabet):
\begin{verbatim}
A 35
B 65
C 100
\end{verbatim}
This file can be read using the \verb|FreqTable.read_count(file_handle)| function.
An equivalent frequency file:
\begin{verbatim}
A 0.175
B 0.325
C 0.5
\end{verbatim}
Alternatively, the residue frequencies or counts can be passed as a dictionary.
Example of a count dictionary (3-letter alphabet):
\begin{verbatim}
{'A': 35, 'B': 65, 'C': 100}
\end{verbatim}
This means that the expected counts give a frequency of 0.5 for 'C',
0.325 for 'B' and 0.175 for 'A'
(out of a total of 200, the sum of the counts for A, B and C).
A frequency dictionary for the same data would be:
\begin{verbatim}
{'A': 0.175, 'B': 0.325, 'C': 0.5}
\end{verbatim}
The frequencies sum to 1.
When passing a dictionary as an argument, you should indicate whether it is a count or a frequency dictionary. Therefore the FreqTable class constructor requires two arguments: the dictionary itself, and FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies, respectively.
When expected counts are read, \verb|read_count| already generates the frequencies.
Any one of the following may be used to generate the frequency table (\verb|ftab|):
\begin{verbatim}
>>> from Bio.SubsMat import FreqTable
>>> ftab = FreqTable.FreqTable(my_frequency_dictionary, FreqTable.FREQ)
>>> ftab = FreqTable.FreqTable(my_count_dictionary, FreqTable.COUNT)
>>> ftab = FreqTable.read_count(open('myCountFile'))
>>> ftab = FreqTable.read_freq(open('myFrequencyFile'))
\end{verbatim}
\end{enumerate}
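The conversion from a whitespace-delimited count file to a frequency dictionary can be sketched without the library. The helper names \verb|parse_counts| and \verb|counts_to_freqs| are hypothetical and only mimic what is described above for reading counts; they are not the FreqTable code, and an in-memory stream stands in for the count file.

```python
import io


def parse_counts(handle):
    """Read 'letter count' lines from a stream into a dictionary."""
    counts = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        letter, count = fields
        counts[letter] = int(count)
    return counts


def counts_to_freqs(counts):
    """Normalize counts so that the resulting frequencies sum to 1."""
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


# Stand-in for the 3-letter count file shown above.
count_file = io.StringIO("A 35\nB 65\nC 100\n")
counts = parse_counts(count_file)
freqs = counts_to_freqs(counts)
```

For the example file this gives 0.175 for A, 0.325 for B and 0.5 for C, matching the equivalent frequency file.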