..
    Copyright (C) 2014 Jaakko Luttinen

    This file is licensed under the MIT License. See LICENSE for a text of the
    license.

.. testsetup::

    # This is the PCA model from the previous section
    import numpy as np
    np.random.seed(1)
    from bayespy.nodes import GaussianARD, Gamma, Dot
    D = 3
    X = GaussianARD(0, 1,
                    shape=(D,),
                    plates=(1,100),
                    name='X')
    alpha = Gamma(1e-3, 1e-3,
                  plates=(D,),
                  name='alpha')
    C = GaussianARD(0, alpha,
                    shape=(D,),
                    plates=(10,1),
                    name='C')
    F = Dot(C, X)
    tau = Gamma(1e-3, 1e-3, name='tau')
    Y = GaussianARD(F, tau)

Performing inference
====================

Approximation of the posterior distribution can be divided into several steps:

- Observe some nodes

- Choose the inference engine

- Initialize the posterior approximation

- Run the inference algorithm

In order to illustrate these steps, we'll be using the PCA model constructed in
the previous section.
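
As a quick preview, the whole workflow looks roughly as follows. Every step is
explained in detail in the rest of this section, so this is only a sketch:

.. code-block:: python

    # Toy data, generated in the same way as in the next subsection
    c = np.random.randn(10, 2)
    x = np.random.randn(2, 100)
    data = np.dot(c, x) + 0.1*np.random.randn(10, 100)

    # 1) Observe some nodes
    Y.observe(data)

    # 2) Choose the inference engine
    from bayespy.inference import VB
    Q = VB(Y, C, X, alpha, tau)

    # 3) Initialize the posterior approximation
    X.initialize_from_parameters(np.random.randn(1, 100, D), 10)

    # 4) Run the inference algorithm
    Q.update(repeat=1000)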

Observing nodes
---------------

First, let us generate some toy data:

>>> c = np.random.randn(10, 2)
>>> x = np.random.randn(2, 100)
>>> data = np.dot(c, x) + 0.1*np.random.randn(10, 100)

The data is given to the model by simply calling the ``observe`` method of a
stochastic node:

>>> Y.observe(data)

It is important that the shape of the ``data`` array matches the plates and
shape of the node ``Y``. For instance, if ``Y`` were a :class:`Wishart` node for
:math:`3\times 3` matrices with plates ``(5,1,10)``, the full shape of ``Y``
would be ``(5,1,10,3,3)``. The ``data`` array should have exactly this shape,
that is, no broadcasting rules are applied.
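
As a hypothetical sketch of such a case (not part of our PCA model), assuming
the :class:`Wishart` node is constructed from the degrees of freedom and a
scale matrix:

.. code-block:: python

    from bayespy.nodes import Wishart

    # A Wishart node for 3x3 matrices with plates (5, 1, 10)
    V = Wishart(3, np.identity(3), plates=(5, 1, 10), name='V')

    # The observed array must have exactly the full shape (5, 1, 10, 3, 3);
    # here identity matrices are tiled just to get an array of that shape.
    V.observe(np.tile(np.identity(3), (5, 1, 10, 1, 1)))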

Missing values
++++++++++++++

It is possible to mark missing values by providing a mask which is a boolean
array:

>>> Y.observe(data, mask=[[True], [False], [False], [True], [True],
...                       [False], [True], [True], [True], [False]])

``True`` means that the value is observed and ``False`` means that the value is
missing. The shape of the mask above is ``(10,1)``, which broadcasts to the
plates of ``Y``, that is, ``(10,100)``. Thus, this mask means that the second,
third, sixth and tenth rows of the :math:`10\times 100` data matrix are missing.

The mask is applied to the *plates*, not to the data array directly. This means
that a random variable cannot be observed partially: each repetition defined by
the plates is either fully observed or fully missing. It is often possible to
circumvent this seemingly tight restriction by adding an observable child node
which factorizes more.
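
A hypothetical sketch of that workaround: if individual elements of a
:math:`10\times 100` Gaussian array should be observed or missing separately,
the elements can be modelled as scalars over the plates instead of, say, plated
vector variables:

.. code-block:: python

    from bayespy.nodes import GaussianARD

    # Scalar elements with plates (10, 100): each element is its own
    # repetition, so the mask can mark single elements as missing.
    Z = GaussianARD(0, 1, shape=(), plates=(10, 100), name='Z')
    Z.observe(np.random.randn(10, 100),
              mask=np.random.rand(10, 100) < 0.8)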

The shape of the mask is broadcast to the plates using standard NumPy
broadcasting rules. So, if the variable has plates ``(5,1,10)``, the mask could
have a shape ``()``, ``(1,)``, ``(1,1)``, ``(1,1,1)``, ``(10,)``, ``(1,10)``,
``(1,1,10)``, ``(5,1,1)`` or ``(5,1,10)``. In order to speed up the inference,
missing values are automatically integrated out if they are not needed as
latent variables by any child nodes. This leads to faster convergence and more
accurate approximations.

Choosing the inference method
-----------------------------

Inference methods can be found in the :mod:`bayespy.inference` package.
Currently, only variational Bayesian approximation is implemented
(:class:`bayespy.inference.VB`). The inference engine is constructed by giving
the stochastic nodes of the model:

>>> from bayespy.inference import VB
>>> Q = VB(Y, C, X, alpha, tau)

There is no need to give any deterministic nodes. Currently, the inference
engine does not automatically search for stochastic parents and children, thus
it is important that all stochastic nodes of the model are given. This should
be made more robust in future versions.

A node of the model can be obtained by using its name as a key:

>>> Q['X']
<bayespy.inference.vmp.nodes.gaussian.GaussianARD object at 0x...>

Note that the returned object is the very same node object:

>>> Q['X'] is X
True

Thus, one may use the object ``X`` directly when it is available. However, if
the model and the inference engine are constructed in another function or
module, the node object may not be directly available and this feature becomes
useful.
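
A hypothetical sketch of such a situation, where the model construction is
wrapped in a helper function ``build_pca`` (the function name and signature are
just for illustration):

.. code-block:: python

    from bayespy.nodes import GaussianARD, Gamma, Dot
    from bayespy.inference import VB

    def build_pca(M, N, D):
        """Construct the PCA model and return its VB engine."""
        X = GaussianARD(0, 1, shape=(D,), plates=(1, N), name='X')
        alpha = Gamma(1e-3, 1e-3, plates=(D,), name='alpha')
        C = GaussianARD(0, alpha, shape=(D,), plates=(M, 1), name='C')
        tau = Gamma(1e-3, 1e-3, name='tau')
        Y = GaussianARD(Dot(C, X), tau, name='Y')
        return VB(Y, C, X, alpha, tau)

    Q2 = build_pca(10, 100, 3)

    # The node objects are not in scope here, but they can be fetched by name:
    X2 = Q2['X']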

Initializing the posterior approximation
----------------------------------------

The inference engines give some default initialization to the stochastic nodes.
However, the inference algorithms can be sensitive to the initialization, thus
it is sometimes necessary to have better control over it. For VB, the following
initialization methods are available:

- ``initialize_from_prior``: Use the current states of the parent nodes to
  update the node. This is the default initialization.

- ``initialize_from_parameters``: Use the given parameter values for the
  distribution.

- ``initialize_from_value``: Use the given value for the variable.

- ``initialize_from_random``: Draw a random value for the variable. The random
  sample is drawn from the current state of the node's distribution.

Note that ``initialize_from_value`` and ``initialize_from_random`` initialize
the distribution with a value of the variable instead of parameters of the
distribution. Thus, after the initialization the distribution is actually a
delta distribution with a peak at that value. This state has neither proper
natural parameter values nor a normalization, thus the VB lower bound terms of
the node are ``np.nan`` for this initial state.
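
For example, a sketch of value-based and random initializations for the nodes
of our PCA model (remember that these leave the nodes in such a
delta-distribution state until they are updated):

.. code-block:: python

    # Initialize tau with a fixed value (a delta distribution at 10)
    tau.initialize_from_value(10)

    # Draw a random value for C from its current distribution
    C.initialize_from_random()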

These initialization methods can also be used to perform slightly more complex
initializations. For instance, a Gaussian distribution could be initialized
with a random mean and variance 0.1. In our PCA model, this can be obtained by

>>> X.initialize_from_parameters(np.random.randn(1, 100, D), 10)

Note that the shape of the random mean is the sum of the plates ``(1, 100)`` and
the variable shape ``(D,)``. In addition, :class:`GaussianARD` uses precision
instead of variance as its second parameter, thus the value 10 corresponds to a
variance of :math:`\frac{1}{10}`. This random initialization is important in
our PCA model because the default initialization gives both ``C`` and ``X`` zero
mean. If one of these variables has zero mean when the other one is updated, the
updated variable gets zero mean too. This would lead to an update algorithm in
which both means remain zero and effectively no latent space is found. Thus, it
is important to give a non-zero random initialization to ``X`` if ``C`` is
updated before ``X`` the first time. It is typical that at least some nodes need
to be initialized with some randomness.

By default, nodes are initialized with the method ``initialize_from_prior``.
The method is not very time consuming, but if for any reason you want to avoid
even that default initialization computation, you can provide
``initialize=False`` when creating the stochastic node. In that case, however,
the node does not have a proper state, which leads to errors in VB learning
unless the distribution is initialized using one of the methods above.
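
A sketch of skipping the default initialization, assuming the node is then
given a proper state explicitly before running VB:

.. code-block:: python

    from bayespy.nodes import Gamma

    # Skip the default initialize_from_prior computation at construction time
    tau2 = Gamma(1e-3, 1e-3, name='tau2', initialize=False)

    # The node has no proper state yet, so initialize it explicitly
    tau2.initialize_from_value(1.0)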

Running the inference algorithm
-------------------------------

The approximation methods are based on iterative algorithms, which can be run
using the ``update`` method. By default, it takes one iteration step, updating
all nodes once:

>>> Q.update()
Iteration 1: loglike=-9.305259e+02 (... seconds)

The printed ``loglike`` value is the VB lower bound. The nodes are updated in
the same order in which they were given when creating ``Q``. If you want to
change the order or update only some of the nodes, you can give as arguments the
nodes you want to update, and they are updated in the given order:

>>> Q.update(C, X)
Iteration 2: loglike=-8.818976e+02 (... seconds)

It is also possible to give the same node several times:

>>> Q.update(C, X, C, tau)
Iteration 3: loglike=-8.071222e+02 (... seconds)

Note that each call to ``update`` is counted as one iteration step, even though
not all variables are necessarily updated. Instead of performing one iteration
step at a time, the ``repeat`` keyword argument can be used to perform several
iteration steps:

>>> Q.update(repeat=10)
Iteration 4: loglike=-7.167588e+02 (... seconds)
Iteration 5: loglike=-6.827873e+02 (... seconds)
Iteration 6: loglike=-6.259477e+02 (... seconds)
Iteration 7: loglike=-4.725400e+02 (... seconds)
Iteration 8: loglike=-3.270816e+02 (... seconds)
Iteration 9: loglike=-2.208865e+02 (... seconds)
Iteration 10: loglike=-1.658761e+02 (... seconds)
Iteration 11: loglike=-1.469468e+02 (... seconds)
Iteration 12: loglike=-1.420311e+02 (... seconds)
Iteration 13: loglike=-1.405139e+02 (... seconds)

The VB algorithm stops automatically if it converges, that is, if the relative
change in the lower bound falls below a threshold:

>>> Q.update(repeat=1000)
Iteration 14: loglike=-1.396481e+02 (... seconds)
...
Iteration 488: loglike=-1.224106e+02 (... seconds)
Converged at iteration 488.

Here the algorithm stopped before taking 1000 iteration steps because it
converged. The relative tolerance can be adjusted by providing the ``tol``
keyword argument to the ``update`` method:

>>> Q.update(repeat=10000, tol=1e-6)
Iteration 489: loglike=-1.224094e+02 (... seconds)
...
Iteration 847: loglike=-1.222506e+02 (... seconds)
Converged at iteration 847.

Making the tolerance smaller may improve the result, but it may also
significantly increase the number of iteration steps until convergence.

Instead of using the ``update`` method of the inference engine ``VB``, it is
possible to call the ``update`` methods of the nodes directly, as in

>>> C.update()

or

>>> Q['C'].update()

However, this is not recommended, because the ``update`` method of the inference
engine ``VB`` is a wrapper which, in addition to calling the nodes' ``update``
methods, checks for convergence and performs a few other useful minor tasks.
But if for any reason these direct update methods are needed, they can be used.

.. _sec-parameter-expansion:

Parameter expansion
+++++++++++++++++++

Sometimes the VB algorithm converges very slowly. This may happen when the
variables are strongly coupled in the true posterior but factorized in the
approximate posterior. The coupling leads to zigzagging of the variational
parameters, which progresses slowly. One solution to this problem is to use
parameter expansion. The idea is to add an auxiliary variable which
parameterizes the posterior approximations of several variables. Optimizing
this auxiliary variable then optimizes those posterior approximations jointly,
leading to faster convergence.

The parameter expansion is model specific. Currently in BayesPy, only
state-space models have built-in parameter expansions available. These
state-space models contain a variable which is a dot product of two variables
(plus some noise):

.. math::

   y = \mathbf{c}^T\mathbf{x} + \mathrm{noise}

The parameter expansion can be motivated by noticing that we can add an
auxiliary variable which rotates the variables :math:`\mathbf{c}` and
:math:`\mathbf{x}` so that the dot product is unaffected:

.. math::

   y &= \mathbf{c}^T\mathbf{x} + \mathrm{noise}
   \\
   &= \mathbf{c}^T \mathbf{R} \mathbf{R}^{-1}\mathbf{x} + \mathrm{noise}
   \\
   &= (\mathbf{R}^T\mathbf{c})^T(\mathbf{R}^{-1}\mathbf{x}) + \mathrm{noise}

Now, applying this rotation to the posterior approximations
:math:`q(\mathbf{c})` and :math:`q(\mathbf{x})`, and optimizing the VB lower
bound with respect to the rotation leads to a parameterized joint optimization
of :math:`\mathbf{c}` and :math:`\mathbf{x}`.

The available parameter expansion methods are in the module
``transformations``:

>>> from bayespy.inference.vmp import transformations

First, create the rotation transformations for the two variables:

>>> rotX = transformations.RotateGaussianARD(X)
>>> rotC = transformations.RotateGaussianARD(C, alpha)

.. currentmodule:: bayespy.inference.vmp.transformations

Here, the rotation for ``C`` is also given the ARD parameters ``alpha`` so that
they are updated simultaneously. In addition to :class:`RotateGaussianARD`,
there are a few other built-in rotations, for instance, :class:`RotateGaussian`
and :class:`RotateGaussianMarkovChain`. It is extremely important that the model
satisfies the assumptions made by the rotation class, and the user is mostly
responsible for checking this. The optimizer for the rotations is constructed
by giving the two rotations and the dimensionality of the rotated space:

.. currentmodule:: bayespy.nodes

>>> R = transformations.RotationOptimizer(rotC, rotX, D)

Now, calling the ``rotate`` method finds the optimal rotation and updates the
relevant nodes (``X``, ``C`` and ``alpha``) accordingly:

>>> R.rotate()

Let us see how our iteration would have gone if we had used this parameter
expansion. First, let us re-initialize our nodes and the VB algorithm:

>>> alpha.initialize_from_prior()
>>> C.initialize_from_prior()
>>> X.initialize_from_parameters(np.random.randn(1, 100, D), 10)
>>> tau.initialize_from_prior()
>>> Q = VB(Y, C, X, alpha, tau)

Then, the rotation is set to run after each iteration step:

>>> Q.callback = R.rotate

Now the iteration converges to the relative tolerance :math:`10^{-6}` much
faster:

>>> Q.update(repeat=1000, tol=1e-6)
Iteration 1: loglike=-9.363...e+02 (... seconds)
...
Iteration 18: loglike=-1.221354e+02 (... seconds)
Converged at iteration 18.

The convergence took 18 iterations with rotations, compared to 488 or 847
iterations without the parameter expansion. In addition, the lower bound is
slightly improved. The numbers of iteration steps can be compared in this case
because the cost per iteration step is approximately the same with and without
the parameter expansion. Sometimes the parameter expansion has the drawback
that it converges to a bad local optimum. Usually, this can be solved by
updating the nodes near the observations a few times before starting to update
the hyperparameters and to use the parameter expansion, as sketched below. In
any case, the parameter expansion is practically necessary when using
state-space models in order to converge to a proper solution in a reasonable
time.
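
A sketch of that strategy for our PCA model: update the nodes closest to the
observations a few times first, and attach the rotation callback only after
that (this reuses the rotation optimizer ``R`` constructed above):

.. code-block:: python

    Q = VB(Y, C, X, alpha, tau)

    # A few updates of C, X and tau only, without the hyperparameter alpha
    # and without the parameter expansion
    Q.update(C, X, tau, repeat=5)

    # Then enable the parameter expansion and update all nodes
    Q.callback = R.rotate
    Q.update(repeat=1000, tol=1e-6)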