1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
|
\chapter{Bio.phenotype: analyse phenotypic data}
\label{chapter:phenotype}
This chapter gives an overview of the functionalities of the
\verb|Bio.phenotype| package included in Biopython. The scope of this
package is the analysis of phenotypic data, which means parsing and
analysing growth measurements of cell cultures.
In its current state the package is focused on the analysis of
high-throughput phenotypic experiments produced by the
\href{https://en.wikipedia.org/wiki/Phenotype_microarray}{Phenotype Microarray technology},
but future developments may include other platforms and formats.
\section{Phenotype Microarrays}
\label{sec:phenotypemicroarrays}
The \href{https://en.wikipedia.org/wiki/Phenotype_microarray}{Phenotype Microarray}
is a technology that measures the metabolism of bacterial
and eukaryotic cells on roughly 2000 chemicals, divided in twenty 96-well
plates.
The technology measures the reduction of a tetrazolium dye by
NADH, whose production by the cell is used as a proxy for cell metabolism;
color development due to the reduction of this dye is typically measured
once every 15 minutes.
When cells are grown in a media that sustains cell metabolism, the
recorded phenotypic data resembles a sigmoid growth curve, from which a
series of growth parameters can be retrieved.
\subsection{Parsing Phenotype Microarray data}
The \verb|Bio.phenotype| package can parse two different formats of
Phenotype Microarray data: the
\href{https://en.wikipedia.org/wiki/Comma-separated_values}{CSV}
(comma separated values) files produced by the machine's proprietary
software and \href{https://en.wikipedia.org/wiki/JSON}{JSON}
files produced by analysis software, like
\href{https://www.dsmz.de/research/microorganisms/projects/analysis-of-omnilog-phenotype-microarray-data.html}{opm}
or \href{http://combogenomics.github.io/DuctApe/}{DuctApe}.
The parser will return one or a generator of PlateRecord objects, depending
on whether the read or parse method is being used.
You can test the parse function by using the \href{http://biopython.org/SRC/biopython/Doc/examples/Plates.csv}{\texttt{Plates.csv}} file provided with the Biopython source code.
%doctest examples lib:numpy
\begin{verbatim}
>>> from Bio import phenotype
>>> for record in phenotype.parse("Plates.csv", "pm-csv"):
... print("%s %i" % (record.id, len(record)))
...
PM01 96
PM01 96
PM09 96
PM09 96
\end{verbatim}
The parser returns a series of PlateRecord objects, each one containing a series of WellRecord objects
(holding each well's experimental data) arranged in 8 rows and 12 columns; each row is indicated by
a uppercase character from A to H, while columns are indicated by a two digit number, from 01 to 12.
There are several ways to access WellRecord objects from a PlateRecord objects:
\begin{description}
\item[Well identifier]
If you know the well identifier (row + column identifiers) you can access the desired well directly.
\begin{verbatim}
>>> record['A02']
\end{verbatim}
\item[Well plate coordinates]
The same well can be retrieved by using the row and columns numbers (0-based index).
%doctest examples lib:numpy
\begin{verbatim}
>>> from Bio import phenotype
>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
>>> print(record[0, 1].id)
A02
\end{verbatim}
\item[Row or column coordinates]
A series of WellRecord objects contiguous to each other in the plate can be retrieved in bulk by
using the python list slicing syntax on PlateRecord objects; rows and columns are numbered with
a 0-based index.
%cont-doctest
\begin{verbatim}
>>> print(record[0])
Plate ID: PM09
Well: 12
Rows: 1
Columns: 12
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['A12']')
>>> print(record[:, 0])
Plate ID: PM09
Well: 8
Rows: 8
Columns: 1
PlateRecord('WellRecord['A01'], WellRecord['B01'], WellRecord['C01'], ..., WellRecord['H01']')
>>> print(record[:3, :3])
Plate ID: PM09
Well: 9
Rows: 3
Columns: 3
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['C03']')
\end{verbatim}
\end{description}
\subsection{Manipulating Phenotype Microarray data}
\subsubsection{Accessing raw data}
The raw data extracted from the PM files is comprised of a series of tuples for each well,
containing the time (in hours) and the colorimetric measure (in arbitrary units).
Usually the instrument collects data every fifteen minutes, but that can vary between
experiments. The raw data can be accessed by iterating on a WellRecord object;
in the example below only the first ten time points are shown.
%doctest examples lib:numpy
\begin{verbatim}
>>> from Bio import phenotype
>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
>>> well = record['A02']
\end{verbatim}
%rest of code snippet output is truncated
\begin{verbatim}
>>> for time, signal in well:
... print(time, signal)
...
(0.0, 12.0)
(0.25, 18.0)
(0.5, 27.0)
(0.75, 35.0)
(1.0, 37.0)
(1.25, 41.0)
(1.5, 44.0)
(1.75, 44.0)
(2.0, 44.0)
(2.25, 44.0)
[...]
\end{verbatim}
This method, while providing a way to access the raw data, doesn't allow a direct
comparison between different WellRecord objects, which may have measurements at
different time points.
\subsubsection{Accessing interpolated data}
To make it easier to compare different experiments and in general to allow a more intuitive handling
of the phenotypic data, the module allows to define a custom slicing of the time points that are present
in the WellRecord object. Colorimetric data for time points that have not been directly measured are
derived through a linear interpolation of the available data, otherwise a NaN is returned.
This method only works in the time interval where actual data is available.
Time intervals can be defined with the same syntax as list
indexing; the default time interval is therefore one hour.
%cont-doctest
\begin{verbatim}
>>> well[:10]
[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]
\end{verbatim}
Different time intervals can be used, for instance five minutes:
%Rounding makes this tricky as a doctest...
\begin{verbatim}
>>> well[63:64:0.083]
[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]
>>> well[9.55]
44.0
>>> well[63.33:73.33]
[113.31999999999999,
117.0,
120.31999999999999,
128.0,
129.63999999999999,
132.95999999999998,
136.95999999999998,
140.0,
142.0,
nan]
\end{verbatim}
\subsubsection{Control well subtraction}
Many Phenotype Microarray plates contain a control well (usually A01), that is a well where the media shouldn't support
any growth; the low signal produced by this well can be subtracted from the other wells.
The PlateRecord objects have a dedicated function for that, which returns another PlateRecord object
with the corrected data.
%cont-doctest
\begin{verbatim}
>>> corrected = record.subtract_control(control='A01')
>>> record['A01'][63]
336.0
>>> corrected['A01'][63]
0.0
\end{verbatim}
\subsubsection{Parameters extraction}
Those wells where metabolic activity is observed show a sigmoid behavior for the colorimetric data.
To allow an easier way to compare different experiments a sigmoid curve can be fitted onto the data,
so that a series of summary parameters can be extracted and used for comparisons.
The parameters that can be extracted from the curve are:
\begin{itemize}
\item Minimum (\textbf{min}) and maximum (\textbf{max}) signal;
\item Average height (\textbf{average\_height});
\item Area under the curve (\textbf{area});
\item Curve plateau point (\textbf{plateau});
\item Curve slope during exponential metabolic activity (\textbf{slope});
\item Curve lag time (\textbf{lag}).
\end{itemize}
All the parameters (except \textbf{min}, \textbf{max} and \textbf{average\_height}) require the
\href{http://www.scipy.org/}{scipy library} to be installed.
The fit function uses three sigmoid functions:
\begin{description}
\item[Gompertz] $Ae^{-e^{(\frac{\mu_{m}e}{A}(\lambda - t) + 1)}} + y0$
\item[Logistic] $\frac{A}{1+e^{(\frac{4\mu_{m}}{A}(\lambda - t) + 2)}} + y_{0}$
\item[Richards] $A(1 + ve^{1 + v} + e^{\frac{\mu_{m}}{A}(1 + v)(1 + \frac{1}{v})(\lambda - t)})^{-\frac{1}{v}} + y0$
\end{description}
Where:
\begin{itemize}
\item[\textbf{A}] corresponds to the \textbf{plateau}
\item[\textbf{$\mu_{m}$}] corresponds to the \textbf{slope}
\item[\textbf{$\lambda$}] corresponds to the \textbf{lag}
\end{itemize}
These functions have been derived from \href{http://www.ncbi.nlm.nih.gov/pubmed/16348228}{this publication}.
The fit method by default tries first to fit the gompertz function: if it fails it will then try to fit
the logistic and then the richards function. The user can also specify one of the three functions to be applied.
%TODO: enable with doctest examples
\begin{verbatim}
>>> from Bio import phenotype
>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
>>> well = record['A02']
>>> well.fit()
>>> print("Function fitted: %s" % well.model)
Function fitted: gompertz
>>> for param in ["area", "average_height", "lag", "max", "min",
... "plateau", "slope"]:
... print("%s\t%.2f" % (param, getattr(well, param)))
...
area 4414.38
average_height 61.58
lag 48.60
max 143.00
min 12.00
plateau 120.02
slope 4.99
\end{verbatim}
\subsection{Writing Phenotype Microarray data}
PlateRecord objects can be written to file in the form of
\href{https://en.wikipedia.org/wiki/JSON}{JSON}
files, a format compatible with other software packages such as
\href{https://www.dsmz.de/research/microorganisms/projects/analysis-of-omnilog-phenotype-microarray-data.html}{opm}
or \href{http://combogenomics.github.io/DuctApe/}{DuctApe}.
\begin{verbatim}
>>> phenotype.write(record, "out.json", "pm-json")
1
\end{verbatim}
|