1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590
|
% Documentation describing writing Tests for Biopython modules
\documentclass{article}
\usepackage{url}
\usepackage{fullpage}
\usepackage{hevea}
\usepackage{graphicx}
% Make links between references
\usepackage{hyperref}
\newif\ifpdf
\ifx\pdfoutput\undefined
\pdffalse
\else
\pdfoutput=1
\pdftrue
\fi
\ifpdf
\hypersetup{colorlinks=true, hyperindex=true, citecolor=red, urlcolor=blue}
\fi
\begin{document}
\title{Sequence motif analysis using Biopython}
\author{Bartek Wilczynski (bartek@mimuw.edu.pl)}
\maketitle
\tableofcontents
\section{Short introduction}
\label{sec:intro}
This short tutorial describes some of the functionality of the
\verb|Bio.Motif| package included in Biopython distribution. It is intended
for people who are involved in analysis of sequence motif, so I'll
assume that you are familiar with basic notions of motif analysis. In
case something is unclear, please look into Section \ref{sec:links}
for some relevant links.
It should be also noted, that \verb|Bio.Motif| is based on two
Biopython modules, \verb|Bio.AlignAce| and \verb|Bio.MEME| and is
meant to replace them. It provides (almost) all of the functionality
of these modules and unifies the basic motif object implementation.
Speaking of other libraries, if you are reading this you might be
interested in the TAMO \cite{tamo}
(\url{http://fraenkel.mit.edu/TAMO/}), which is another python library
designed to deal with sequence motifs. It supports more \emph{de-novo}
motif finders, but it is not a part of biopython (so requires a bit of
work) and has some restrictions on commercial use.
\section{Motif objects}
\label{sec:object}
Since we are interested in motif analysis, we need to take a look at
\verb|Motif| objects in the first place. For that we need to import
the Motif library:
\begin{verbatim}
from Bio import Motif
\end{verbatim}
and we can start creating our first motif objects. Let's create a DNA motif:
\begin{verbatim}
from Bio.Alphabet import IUPAC
m=Motif.Motif(alphabet=IUPAC.unambiguous_dna)
\end{verbatim}
This is for now just an empty container, so let's add some sequences to our newly created motif:
\begin{verbatim}
from Bio.Seq import Seq
m.add_instance(Seq("TATAA", m.alphabet))
m.add_instance(Seq("TATTA", m.alphabet))
m.add_instance(Seq("TATAA", m.alphabet))
m.add_instance(Seq("TATAA", m.alphabet))
\end{verbatim}
Now we have a full \verb|Motif| instance, so we can try to get some
basic information about it. Let's start with length and consensus
sequence:
\begin{verbatim}
>>> m.length
5
>>> m.consensus
Seq('TATAA', IUPACUnambiguousDNA())
\end{verbatim}
In case of DNA motifs, we can also get a reverse complement of a motif:
\begin{verbatim}
>>> m.reverse_complement()
<Bio.Motif.Motif.Motif object at 0xb39890>
>>> m.reverse_complement().consensus
Seq('TTATA', IUPACUnambiguousDNA())
>>> m.reverse_complement().instances
[Seq('TTATA', IUPACUnambiguousDNA()),
Seq('TAATA', IUPACUnambiguousDNA()),
Seq('TTATA', IUPACUnambiguousDNA()),
Seq('TTATA', IUPACUnambiguousDNA())]
\end{verbatim}
We can also calculate the information content of a motif with a simple call:
\begin{verbatim}
>>> m.ic()
5.2735010263278932
\end{verbatim}
This gives us a number of bits of information provided by the motif,
which tells us how much we are different from background.
The most common representation of a motif is a PWM (Position Weight
Matrix). It summarizes the probabilities of finding any symbol (in
this case nucleotide) in any position of a motif. It can be computed by calling the \verb|.pwm()| method:
\begin{verbatim}
>>> m.pwm()
[{'A': 0.05, 'C': 0.05, 'T': 0.85, 'G': 0.05},
{'A': 0.85, 'C': 0.05, 'T': 0.05, 'G': 0.05},
{'A': 0.05, 'C': 0.05, 'T': 0.85, 'G': 0.05},
{'A': 0.65, 'C': 0.05, 'T': 0.25, 'G': 0.05},
{'A': 0.85, 'C': 0.05, 'T': 0.05, 'G': 0.05}]
\end{verbatim}
The probabilities in the motif's PWM are based on the counts in the
instances, but we can see, that even though there were no Gs and no Cs
in the instances, we still have non-zero probabilities assigned to
them. These come from pseudo-counts which are, roughly speaking, a
commonly used way to acknowledge the incompleteness of our knowledge
and avoid technical problems with calculating logarithms of $0$.
We can control the way that pseudo-counts are added with two
properties of Motif objects \verb|.background| is the probability
distribution over all symbols in the alphabet that we assume is found
in background (usually based on the GC content of the respective
genome). It is by default set to a uniform distribution upon creation of a motif:
\begin{verbatim}
>>> m.background
{'A': 0.25, 'C': 0.25, 'T': 0.25, 'G': 0.25}
\end{verbatim}
The other parameter is \verb|.beta|, which states the amount of
pseudo-counts we should add to the PWM. By default it is set to $1.0$,
\begin{verbatim}
>>> m.beta
1.0
\end{verbatim}
so that the total input of pseudo-counts is equal to that of one instance.
Using the background distribution and pwm with pseudo-counts added,
it's easy to compute the log-odds ratios, telling us what are the log
odds of a particular symbol to be coming from a motif against the
background. We can use the \verb|.log_odds()| method:
\begin{verbatim}
>>> m.log_odds()
[{'A': -2.3219280948873622,
'C': -2.3219280948873622,
'T': 1.7655347463629771,
'G': -2.3219280948873622},
{'A': 1.7655347463629771,
'C': -2.3219280948873622,
'T': -2.3219280948873622,
'G': -2.3219280948873622},
{'A': -2.3219280948873622,
'C': -2.3219280948873622,
'T': 1.7655347463629771,
'G': -2.3219280948873622},
{'A': 1.3785116232537298,
'C': -2.3219280948873622,
'T': 0.0,
'G': -2.3219280948873622},
{'A': 1.7655347463629771,
'C': -2.3219280948873622,
'T': -2.3219280948873622,
'G': -2.3219280948873622}
]
\end{verbatim}
Here we can see positive values for symbols more frequent in the motif
than in the background and negative for symbols more frequent in the
background. $0.0$ means that it's equally likely to see a symbol in
background and in the motif (e.g. 'T' in the second-last position).
\subsection{Reading and writing}
\label{sec:io}
Creating motifs from instances by hand is useful but boring, so it's
useful to have some I/O functions for reading and writing
motifs. There are no really well established standards for storing
motifs, but there's a couple of formats which are more used than
others. The most important distinction is whether the motif
representation is based on instances or on some version of PWM matrix.
On of the most popular motif databases JASPAR
(\url{http://jaspar.genereg.net}) stores motifs in both formats, so
let's look at how we can import JASPAR motifs from instances:
\begin{verbatim}
arnt = Motif.read(open("Arnt.sites"), "jaspar-sites")
\end{verbatim}
and from a count matrix:
\begin{verbatim}
srf = Motif.read(open("SRF.pfm"), "jaspar-pfm")
\end{verbatim}
The \verb|arnt| and \verb|srf| motifs can both do the same things for
us, but they use different internal representations of the motif. We
can tell that by inspecting the \verb|has_counts| and
\verb|has_instances| properties:
\begin{verbatim}
>>> arnt.has_instances
True
>>> srf.has_instances
False
>>> srf.has_counts
True
>>> srf.counts
{'A': [2, 9, 0, 1, 32, 3, 46, 1, 43, 15, 2, 2],
'C': [1, 33, 45, 45, 1, 1, 0, 0, 0, 1, 0, 1],
'G': [39, 2, 1, 0, 0, 0, 0, 0, 0, 0, 44, 43],
'T': [4, 2, 0, 0, 13, 42, 0, 45, 3, 30, 0, 0]}
\end{verbatim}
There are conversion functions, which can help us convert between
different representations:
\begin{verbatim}
>>> arnt.make_counts_from_instances()
{'A': [8, 38, 0, 0, 0, 0],
'C': [32, 0, 40, 0, 0, 0],
'G': [0, 2, 0, 40, 0, 40],
'T': [0, 0, 0, 0, 40, 0]}
>>> srf.make_instances_from_counts()
[Seq('GGGAAAAAAAGG', IUPACUnambiguousDNA()),
Seq('GGCCAAATAAGG', IUPACUnambiguousDNA()),
Seq('GACCAAATAAGG', IUPACUnambiguousDNA()),
....
\end{verbatim}
The important thing to remember here is that the method
\verb|make_instances_from_counts()| creates fake instances, because
usually there are very many possible sets of instances which give rise
to the same pwm, and if we have only the count matrix, we cannot
reconstruct the original one. This does not make any difference if we
are using the PWM as the representation of the motif, but one should
be careful with exporting instances from count-based motifs.
Speaking of exporting, let's look at export functions. We can export to fasta:
\begin{verbatim}
>>> print(m.format("fasta"))
> instance 0
TATAA
> instance 1
TATTA
> instance 2
TATAA
> instance 3
TATAA
\end{verbatim}
or to TRANSFAC-like matrix format (used by some motif processing software)
\begin{verbatim}
>>> print(m.format("transfac"))
XX
TY Motif
ID
BF undef
P0 G A T C
01 0 0 4 0
02 0 4 0 0
03 0 0 4 0
04 0 3 1 0
05 0 4 0 0
XX
\end{verbatim}
Finally, if we have internet access, we can create a weblogo using a nice service at \url{http://weblogo.berkeley.edu} by Crooks et al. \cite{crooks2004}:
\begin{verbatim}
>>> arnt.weblogo("Arnt.png")
\end{verbatim}
We should get our logo saved as a png in the specified file.
\subsection{Searching for instances}
\label{sec:search}
The most frequent use for a motif is to find its instances in some
sequence. For the sake of this section, we will use an artificial sequence like this:
\begin{verbatim}
test_seq=Seq("TATGATGTAGTATAATATAATTATAA",m.alphabet)
\end{verbatim}
The simplest way to find instances, is to look for exact matches of
the true instances of the motif:
\begin{verbatim}
>>> for pos, seq in m.search_instances(test_seq):
... print("%i %s" % (pos, seq))
...
10 TATAA
15 TATAA
21 TATAA
\end{verbatim}
We can do the same with the reverse complement (to find instances on the complementary strand):
\begin{verbatim}
>>> for pos, seq in m.reverse_complement().search_instances(test_seq):
... print("%i %s" % (pos, seq))
...
12 TAATA
20 TTATA
\end{verbatim}
It's just as easy to look for positions, giving rise to high log-odds scores against our motif:
\begin{verbatim}
>>> for pos, score in m.search(test_seq,threshold=5.0):
... print("%i %f" % (pos, score))
...
10 8.44065060871
-12 7.06213898545
15 8.44065060871
-20 8.44065060871
21 8.44065060871
\end{verbatim}
You may notice the threshold parameter, here set arbitrarily to
$5.0$. This is in $log_2$, so we are now looking only for words, which
are 32 times more likely to occur under the motif model than in the
background. The default threshold is $0.0$, which selects everything
that looks more like the motif, than the background.
If you want to use a less arbitrary way of selecting thresholds, you
can explore the \verb|Motif.score_distribution| class implementing an
distribution of scores for a given motif. Since the space for a score
distribution grows exponentially with motif.length, we are using an
approximation with a given precision to keep computation cost manageable:
\begin{verbatim}
>>> sd = Motif.score_distribution(m,precision=10**4)
\end{verbatim}
The \verb+sd+ object can be used to determine a number of different thresholds.
We can specify the requested false-positive rate (probability of ``finding'' a motif instance in background generated sequence):
\begin{verbatim}
>>> sd.threshold_fpr(0.01)
4.3535838726139886
\end{verbatim}
or the false-negative rate (probability of ``not finding'' an instance generated from the motif):
\begin{verbatim}
>>> sd.threshold_fnr(0.1)
0.26651713652234044
\end{verbatim}
or a threshold (approximately) satisfying some relation between fpr
and fnr ($\frac{fnr}{fpr}$, as proposed by Rahmann \cite{Rahmann2003}):
\begin{verbatim}
>>> sd.threshold_balanced(1000)
8.4406506087056368
\end{verbatim}
or a threshold satisfying (roughly) the equality between the
false-positive rate and the $-log$ of the information content (as used
in patser software by Hertz and Stormo).
For example, in case of our motif, you can get the threshold giving
you exactly the same results (for this sequence) as searching for
instances with balanced threshold with rate of $1000$.
\begin{verbatim}
>>> threshold = sd.threshold_balanced(1000)
>>> for pos, score in m.search(test_seq, threshold=threshold):
... print("%s %s" % (pos, score))
...
10 8.44065060871
15 8.44065060871
-20 8.44065060871
21 8.44065060871
\end{verbatim}
\subsection{Comparing motifs}
\label{sec:comp}
Once we have more than one motif, we might want to compare them. For
that, we have currently three different methods of \verb|Bio.Motif|
objects.
Before we start comparing motifs, I should point out that motif
boundaries are usually quite arbitrary. This means, that we often need
to compare motifs of different lengths, so comparison needs to involve
some kind of alignment. This means, that we have to take into account two things:
\begin{itemize}
\item alignment of motifs
\item some function to compare aligned motifs
\end{itemize}
In \verb|Bio.Motif| we have 3 different functions for motif
comparison, which are based on the same idea behind motif alignment,
but use different functions to compare aligned motifs. Briefly
speaking, we are using ungapped alignment of PWMs and substitute the
missing columns at the beginning and end of the matrices with
background distribution. All three comparison functions are written in
such a way, that they can be interpreted as distance measures, however
only one (\verb|dist_dpq|) satisfies the triangle inequality. All of
them return the minimal distance and the corresponding offset between
motifs.
To show how these functions work, let us first load another motif,
which is similar to our test motif \verb|m|:
\begin{verbatim}
>>> ubx=Motif.read(open("Ubx.pfm"), "jaspar-pfm")
<Bio.Motif.Motif.Motif object at 0xc29b90>
>>> ubx.consensus
Seq('TAAT', IUPACUnambiguousDNA())
\end{verbatim}
The first function we'll use to compare these motifs is based on
Pearson correlation. Since we want it to resemble a distance
measure. we actually take $1-r$, where r is the Pearson correlation
coefficient (PCC):
\begin{verbatim}
>>> m.dist_pearson(ubx)
(0.41740393308237722, 2)
\end{verbatim}
This means, that the best PCC between motif \verb|m| and \verb|Ubx| is obtained with the following alignment:
\begin{verbatim}
bbTAAT
TATAAb
\end{verbatim}
where \verb|b| stands for background distribution. The PCC itself is
roughly $1-0.42=0.58$. If we try the reverse complement of the Ubx motif:
\begin{verbatim}
>>> m.dist_pearson(ubx.reverse_complement())
(0.25784180151584823, 1)
\end{verbatim}
We can see that the PCC is better (almost $0.75$), and the alignment is also different:
\begin{verbatim}
bATTA
TATAA
\end{verbatim}
There are two other functions \verb|dist_dpq|, which is a true metric based on the Kullback-Leibler divergence and proposed by Endres and Sendelin \cite{Endres2003}
\begin{verbatim}
>>> m.dist_dpq(ubx.reverse_complement())
(0.49292358382899853, 1)
\end{verbatim}
In case you need yet another way of comparing motifs, you can use the
\verb|dist_product| method, which is based on the product of
probabilities which can be interpreted as the probability of
independently generating the same instance by both motifs.
\begin{verbatim}
>>> m.dist_product(ubx.reverse_complement())
(0.16224587301064275, 1)
\end{verbatim}
\section{\emph{De novo} motif finding}
\label{sec:find}
Currently, biopython has only limited support for \emph{de novo} motif
finding. Namely, we support running and parsing of AlignAce \cite{Hughes2000} and
MEME\cite{Bailey1994}. Since the number of motif finding tools is growing rapidly,
contributions of new parsers are welcome.
\subsection{MEME}
\label{sec:meme}
Let's assume, you have run MEME on sequences of your choice with your
favorite parameters and saved the output in the file
\verb|meme.out|. You can retrieve the motifs reported by MEME by
running the following piece of code:
\begin{verbatim}
>>> motifsM = list(Motif.parse(open("meme.out"), "MEME"))
>>> motifsM
[<Bio.Motif.MEMEMotif.MEMEMotif object at 0xc356b0>]
\end{verbatim}
Besides the most wanted list of motifs, the result object contains more useful information, accessible through properties with self-explanatory names:
\begin{itemize}
\item \verb|.alphabet|
\item \verb|.datafile|
\item \verb|.sequence_names|
\item \verb|.version|
\item \verb|.command|
\end{itemize}
The motifs returned by MEMEParser can be treated exactly like regular
Motif objects (with instances), they also provide some extra
functionality, by adding additional information about the instances.
\begin{verbatim}
>>> motifsM[0].consensus
Seq('CTCAATCGTA', IUPACUnambiguousDNA())
>>> motifsM[0].instances[0].pvalue
8.71e-07
>>> motifsM[0].instances[0].sequence_name
'SEQ10;'
>>> motifsM[0].instances[0].start
3
>>> motifsM[0].instances[0].strand
'+'
\end{verbatim}
\subsection{AlignAce}
\label{sec:alignace}
We can do very similar things with AlignACE program. Assume, you have
your output in the file \verb|alignace.out|. You can parse your output
with the following code:
\begin{verbatim}
>>> motifsA = list(Motif.parse(open("alignace.out"), "AlignAce"))
\end{verbatim}
Again, your motifs behave as they should:
\begin{verbatim}
>>> motifsA[0].consensus
Seq('TCTACGATTGAG', IUPACUnambiguousDNA())
\end{verbatim}
In fact you can even see, that AlignAce found a very similar motif as
MEME, it is just a longer version of a reverse complement of MEME
motif:
\begin{verbatim}
>>> motifsM[0].reverse_complement().consensus
Seq('TACGATTGAG', IUPACUnambiguousDNA())
\end{verbatim}
If you have AlignAce installed on the same machine, you can also run
it directly from Biopython. Short example of how this can be done is
shown below (other parameters can be specified as keyword parameters):
\begin{verbatim}
>>> command = "/opt/bin/AlignACE"
>>> input_file = "test.fa"
>>> result = Motif.AlignAce(input_file, cmd=command, gcback=0.6, numcols=10)
>>> result
(<Bio.Application.ApplicationResult instance at 0xf49d78>,
<Bio.File.UndoHandle instance at 0xf49d00>,
<Bio.File.UndoHandle instance at 0xf49dc8>)
\end{verbatim}
Since AlignAce prints all its output to standard output, you can get
to your motifs by parsing the second member of the result:
\begin{verbatim}
motifs = list(Motif.parse(result[1], "AlignAce"))
\end{verbatim}
\section{Useful links }
\label{sec:links}
Wikipedia definitions:
\begin{itemize}
\item \url{http://en.wikipedia.org/wiki/Sequence_motif}
\item \url{http://en.wikipedia.org/wiki/Position_weight_matrix}
\item \url{http://en.wikipedia.org/wiki/Consensus_sequence}
\end{itemize}
Motif finding and comparison methods:
\begin{itemize}
\item AlignAce and CompareAce\url{http://www.psc.edu/general/software/packages/alignace/}
\item MEME and MAST \url{http://meme.sdsc.edu/meme/}
\item BioProspector \url{http://bioprospector.stanford.edu/}
\item MDScan \url{http://ai.stanford.edu/~xsliu/MDscan/}
\item Weeder
\item STAMP \url{http://www.benoslab.pitt.edu/stamp/}
\item WebMotifs \url{http://fraenkel.mit.edu/webmotifs/}
\item Comparison of different programs \url{http://bio.cs.washington.edu/assessment/}
\end{itemize}
\begin{thebibliography}{10}
\bibitem{tamo}
Gordon DB, Nekludova L, McCallum S, Fraenkel E: \textbf{TAMO: a flexible,
object-oriented framework for analyzing transcriptional regulation using
DNA-sequence motifs.} \emph{Bioinformatics} 2005, \textbf{21}(14):3164--5.
\bibitem{crooks2004}
Crooks GE, Hon G, Chandonia JM, steven E~Brenner: \textbf{WebLogo: A Sequence
Logo Generator}. \emph{Genome Research} 2004, \textbf{14}:1188--1190.
\bibitem{Rahmann2003}
Rahmann S, Mueller T, Vingron M: \textbf{On the power of profiles for
transcription factor binding site detection.} \emph{Stat Appl Genet Mol Biol}
2003, \textbf{2}:Article7.
\bibitem{Endres2003}
Endres D, Schindelin J: \textbf{A new metric for probability distributions}.
\emph{IEEE transactions on Information Theory} 2003,
\textbf{49}(7):1858--1860.
\bibitem{Bailey1994}
Bailey TL, Elkan C: \textbf{{Fitting a mixture model by expectation
maximization to discover motifs in biopolymers}}. \emph{Proc Int Conf Intell
Syst Mol Biol} 1994, \textbf{2}:28--36.
\bibitem{Hughes2000}
Hughes JD, Estep PW, Tavazoie S, Church GM: \textbf{{Computational
identification of cis-regulatory elements associated with groups of
functionally related genes in Saccharomyces cerevisiae}}. \emph{J Mol Biol}
2000, \textbf{296}(5):1205--1214.
\end{thebibliography}
\end{document}
|