..
restindex
page-title: Pyem
crumb: Pyem
link-title: Pyem
encoding: utf-8
output-encoding: None
tags: python,pyem,Expectation Maximization,EM,online EM,recursive EM
file: basic_example1.py
file: basic_example2.py
file: basic_example3.py
file: example1.png
file: Bic_example.png
/restindex
.. Last Change: Mon Jul 02 09:00 PM 2007 J
===================================================
PyEM, a python package for Gaussian mixture models
===================================================
.. contents:: Table of contents
PyEM is a package which lets you create Gaussian Mixture Models
(diagonal and full covariance matrices supported), sample them,
and estimate them from data using the Expectation Maximization algorithm.
It can also draw confidence ellipsoids for multivariate models, and
compute the Bayesian Information Criterion to assess the number of
clusters in the data. In the near future, I hope to add so-called
online EM (ie recursive EM) and a variational Bayes implementation.
PyEM is implemented in python, and uses the excellent numpy and scipy
packages. Numpy is a python package which gives python fast
multi-dimensional array capabilities (a la matlab and the like); scipy
builds on numpy to provide common scientific tools for signal processing,
linear algebra, statistics, etc.
Installation
============
.. _scipy: http://www.scipy.org
Pyem depends on several packages to work:
- numpy
- matplotlib (if you wish to use the plotting facilities of pyem)
Those packages are likely to be already installed in a typical numpy/scipy environment.
Since September 2006, pyem is included in the sandbox of `scipy`_. The sandbox
contains packages pending inclusion in the main scipy; that means pyem is not
installed by default, and that you need to install scipy from sources.
For the most up-to-date version of pyem, you need to check out scipy from
subversion, which contains the development branch of scipy.
To install pyem, you just need to edit (or create if it does not exist)
the file Lib/sandbox/enabled_packages.txt in the scipy sources, and
add one line with the name of the package (eg pyem).
Afterwards, install scipy normally, as explained
`here <http://www.scipy.org/Installing_SciPy>`_.
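In that case, the file would simply contain one enabled package name per
line, for instance::

    pyem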
You can (and should) also test the pyem installation using the following:
.. raw:: html
{+mycoloring}
from scipy.sandbox import pyem
pyem.test()
{-mycoloring}
Basic usage
============
Once you are inside a python interpreter, you can import pyem
using the following command:
.. raw:: html
{+mycoloring}
from scipy.sandbox import pyem
{-mycoloring}
Creating, sampling and plotting a mixture
-----------------------------------------
Importing pyem gives access to 3 classes: GM (for Gaussian Mixture), GMM
(for Gaussian Mixture Model) and EM (for Expectation Maximization). The first
class, GM, can be used to create an artificial Gaussian mixture, sample it,
or plot it. The following example shows how to create
a 2-dimensional Gaussian mixture with 3 components, sample it and plot
its confidence ellipsoids with matplotlib:
.. raw:: html
{mycolorize;input/softwares/pyem/basic_example1.py}
.. raw:: latex
\input{basic_example1.tex}
which plots this figure:
.. image:: example1.png
:width: 500
:height: 400
There are basically two ways to create a GM instance: either an empty one (ie
without means, weights and covariances) by giving hyper-parameters (dimension,
number of clusters, type of covariance matrices) at instantiation, or a fully
specified one by giving all parameters to the class method GM.fromvalues. The
first method is mostly useful as a container for learning. There is also a
class method, GM.generate_params, to create random (but valid!) parameters for
a Gaussian mixture.
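For instance, a minimal sketch of both approaches could look like the
following (the argument order of the GM constructor, the 'diag' mode flag and
the return convention of GM.generate_params are assumptions; check the GM
docstrings for the exact interface):
.. raw:: html
{+mycoloring}
import numpy as N
from scipy.sandbox import pyem

# Empty container: dimension 2, 3 clusters, diagonal covariances
# (argument order and the 'diag' flag are assumptions)
gm = pyem.GM(2, 3, 'diag')

# Fully specified mixture from explicit weights, means and variances
w  = N.array([0.3, 0.3, 0.4])                    # weights, sum to 1
mu = N.array([[0., 0.], [3., 3.], [-3., 2.]])    # one mean per row
va = N.array([[1., 1.], [0.5, 2.], [2., 0.5]])   # diagonal variances
gm2 = pyem.GM.fromvalues(w, mu, va)

# Random (but valid) parameters for a given dimension and number of clusters
# (return convention assumed)
w, mu, va = pyem.GM.generate_params(2, 3)
{-mycoloring}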
Basic estimation of mixture parameters from data
------------------------------------------------
If you want to learn a Gaussian mixture from data with EM, you need two
classes from pyem: GMM and EM. You first create a GMM object from a GM
instance; then you give the GMM object as a parameter to the EM class to
compute iterations of EM; once EM has finished the computation, the GM
instance held by the GMM contains the estimated parameters.
.. raw:: html
{mycolorize;input/softwares/pyem/basic_example2.py}
.. raw:: latex
\input{basic_example2.tex}
The GMM class does all the hard work for learning: it can compute the sufficient
statistics given the current state of the model, and update its parameters from
those sufficient statistics; the EM class uses a GMM instance to compute several
iterations. The idea is that you can implement a different mixture model and
use the same EM class if you want (there are several optimized models for GMM,
which are subclasses of GMM).
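As a rough sketch of that workflow (the train method of EM and its arguments
are assumptions here; the included basic_example2.py is the reference):
.. raw:: html
{+mycoloring}
import numpy as N
from scipy.sandbox import pyem

data = N.random.randn(1000, 2)            # (n, d) array of samples, illustrative

gm  = pyem.GM(data.shape[1], 3, 'diag')   # empty container, 3 clusters
gmm = pyem.GMM(gm)                        # wraps the model for learning
em  = pyem.EM()
em.train(data, gmm)                       # run EM iterations (signature assumed)
# gm now holds the estimated weights, means and covariances
{-mycoloring}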
Advanced topics
===============
Bayesian Information Criterion and automatic clustering
-------------------------------------------------------
The GMM class is also able to compute the Bayesian Information Criterion
(BIC), which can be used to assess the number of clusters in the data.
It was first suggested by Schwarz (see bibliography), who gave a Bayesian
argument for adopting it. The BIC is derived from an approximation
of the integrated likelihood of the model, based on regularity assumptions;
concretely, it penalizes the maximized log-likelihood of a model by a term
proportional to the number of free parameters times the logarithm of the
number of samples.
The following code generates an artificial mixture of 4 clusters, runs
EM with models of 1 to 6 clusters, and prints which number of clusters
is the most likely according to the BIC:
.. raw:: html
{mycolorize;input/softwares/pyem/basic_example3.py}
.. raw:: latex
\input{basic_example3.tex}
which plots this figure:
.. image:: Bic_example.png
:width: 500
:height: 400
The above example also shows that you can control the plotting
parameters by using returned handles from plot methods. This can be
useful for complex drawing.
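In outline, such a BIC-based selection loop might look like this (the bic
method name, the train interface and the larger-is-better convention are
assumptions based on the description above; basic_example3.py is the
reference implementation):
.. raw:: html
{+mycoloring}
import numpy as N
from scipy.sandbox import pyem

data = N.random.randn(1000, 2)        # replace with your own (n, d) samples

bics = []
for k in range(1, 7):                 # candidate numbers of clusters
    gm  = pyem.GM(data.shape[1], k, 'diag')
    gmm = pyem.GMM(gm)
    pyem.EM().train(data, gmm)        # signature assumed, see above
    bics.append(gmm.bic(data))        # BIC of the fitted model (name assumed)

# Assuming larger BIC means a better model (Schwarz's formulation);
# use argmin instead if the implementation follows the opposite convention.
print "most likely number of clusters:", int(N.argmax(bics)) + 1
{-mycoloring}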
Examples
=========
Using EM for pdf estimation
---------------------------
The following example uses the Old Faithful dataset and is available in the
example directory. It models the joint distribution (d(t), w(t+1)), where d(t)
is the duration time, and w(t+1) the waiting time for the next eruption. It
selects the best model using the BIC.
.. raw:: latex
\input{pdfestimation.tex}
.. figure:: pdfestimation.png
:width: 500
:height: 400
Isodensity curves for the Old Faithful data modeled by a 1, 2, 3 and 4
component model (top to bottom, left to right).
Using EM for clustering
-----------------------
TODO (this is fundamentally the same as pdf estimation, though)
Using PyEM for supervised learning
----------------------------------
The following example shows how to do classification using discriminant
analysis on the famous iris dataset. The idea is to model each class
distribution P(data|class) by a mixture of Gaussians, and to classify the data
with Maximum A Posteriori (ie take the class which maximizes P(class | data)).
Not surprisingly, the errors lie on the border between the two classes which
are not linearly separable.
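Schematically, this amounts to fitting one mixture per class and doing MAP
classification on top. In the sketch below, the gm.w, gm.mu, gm.va attribute
names and the train interface are assumptions carried over from the sketches
above; the bundled example is the reference:
.. raw:: html
{+mycoloring}
import numpy as N
from scipy.sandbox import pyem

def fit_class_mixture(data, k=2):
    """Fit one diagonal Gaussian mixture to the samples of one class."""
    gm  = pyem.GM(data.shape[1], k, 'diag')
    gmm = pyem.GMM(gm)
    pyem.EM().train(data, gmm)        # signature assumed
    return gm

def log_mixture_pdf(gm, x):
    """Log density of a diagonal mixture at points x, computed directly
    from the (assumed) gm.w, gm.mu, gm.va attributes."""
    x = N.atleast_2d(x)
    dens = N.zeros(x.shape[0])
    for w, mu, va in zip(gm.w, gm.mu, gm.va):
        q = N.sum((x - mu) ** 2 / va, axis=1)
        dens += w * N.exp(-0.5 * q) / N.sqrt(N.prod(2 * N.pi * va))
    return N.log(dens)

def classify(x, class_models, priors):
    """Maximum A Posteriori: argmax over classes of log P(x|c) + log P(c)."""
    scores = [log_mixture_pdf(gm, x) + N.log(p)
              for gm, p in zip(class_models, priors)]
    return N.argmax(N.array(scores), axis=0)
{-mycoloring}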
.. raw:: latex
\input{discriminant_analysis.tex}
.. figure:: dclass.png
:width: 600
:height: 400
Learning data points on the left, with isodensity curves as estimated by
each mixture. On the right, misclassified points are in red and correctly
classified points are in green.
Note on performance
====================
Pyem is implemented in python (100% of the code has a python implementation),
but thanks to Moore's Law, it is reasonably fast, so it can be used for
problems other than toy problems. On my computer (linux on a dual Xeon 3.2 GHz,
2 GB RAM), running 10 iterations of the EM algorithm on 100 000 samples of
dimension 15, for a diagonal model with 30 components, takes around 1 minute
and 15 seconds: this makes the implementation usable for moderately complex
problems such as speaker recognition using MFCC. If this is too slow, there is
a C implementation of the Gaussian densities which can be enabled by commenting
out one line in pyem/gmm_em.py, which should give a speed-up of a factor of 2
at least; this has not been tested much, though, so beware.
Also, increasing the number of components and/or the dimension is much more
expensive than increasing the number of samples: a 5-dimensional model with 100
components will be much slower to estimate from 500 samples than a 5-dimensional
model with 10 components from 5000 samples. This is because loops over
dimensions/components are in python, whereas loops over samples are in C
(through numpy). I don't think there is an easy fix to this problem.
Full covariances will be slow, because in this case you cannot avoid nested
loops in python without an insane amount of memory. A C implementation may be
added later, but this is not my top priority; most of the time, you should
avoid full covariance models anyway.
TODO
====
I believe the current API is simple and powerful enough, except
maybe for plotting (if you think otherwise, I would be happy to hear
your suggestions). Now, I am considering adding some more functionality
to the toolbox:
- add simple methods for regularization of covariance matrices (easy)
- add a Bayes prior (using a variational Bayes approximation) for overfitting and
  model selection problems (not trivial, but doable)
- improve online EM
Other things which are doable but which I don't intend to implement are:
- add other models (mixtures of multinomials: easy, simple HMM: easy, others?)
- add a Bayes prior using MCMC (hard; use PyMCMC for sampling?)
Bibliography
============
TODO.