File: advanced.rst

..
   Copyright (C) 2014-2015 Jaakko Luttinen

   This file is licensed under the MIT License. See LICENSE for a text of the
   license.


.. testsetup::

    import numpy as np
    np.random.seed(1)
    # This is the PCA model from the previous sections
    from bayespy.nodes import GaussianARD, Gamma, Dot
    D = 3
    X = GaussianARD(0, 1,
                    shape=(D,),
                    plates=(1,100),
                    name='X')
    alpha = Gamma(1e-3, 1e-3,
                  plates=(D,),
                  name='alpha')
    C = GaussianARD(0, alpha,
                    shape=(D,),
                    plates=(10,1),
                    name='C')
    F = Dot(C, X)
    tau = Gamma(1e-3, 1e-3, name='tau')
    Y = GaussianARD(F, tau, name='Y')
    c = np.random.randn(10, 2)
    x = np.random.randn(2, 100)
    data = np.dot(c, x) + 0.1*np.random.randn(10, 100)
    Y.observe(data)
    from bayespy.inference import VB
    import bayespy.plot as bpplt
    Q = VB(Y, C, X, alpha, tau)
    X.initialize_from_parameters(np.random.randn(1, 100, D), 10)
    from bayespy.inference.vmp import transformations
    rotX = transformations.RotateGaussianARD(X)
    rotC = transformations.RotateGaussianARD(C, alpha)
    R = transformations.RotationOptimizer(rotC, rotX, D)
    Q = VB(Y, C, X, alpha, tau)
    Q.callback = R.rotate
    Q.update(repeat=1000, tol=1e-6, verbose=False)
    import warnings
    warnings.simplefilter('error', UserWarning)

Advanced topics
===============

This section gives brief instructions on how to use some advanced methods in
BayesPy.  These methods include Riemannian conjugate gradient methods, pattern
search, deterministic annealing, collapsed variational inference and stochastic
variational inference.  In order to use these methods properly, the user should
understand them to some extent.  They are also considered experimental, so you
may encounter bugs or unimplemented features.  In any case, these methods can
easily provide large performance improvements over the standard VB-EM
algorithm.


Gradient-based optimization
---------------------------

Variational Bayesian learning basically means that the parameters of the
approximate posterior distributions are optimized to maximize the lower bound
of the marginal log likelihood :cite:`Honkela:2010`.  This optimization can be
performed with gradient-based methods.  In order to improve these methods, it
is recommended to take the information geometry into account by using the
Riemannian (a.k.a. natural) gradient.  In fact, the standard VB-EM algorithm is
equivalent to a gradient ascent method which uses the Riemannian gradient and
step length 1.  Thus, it is natural to try to improve on this method by using
non-linear conjugate gradient methods instead of plain gradient ascent.  These
optimization methods are especially useful when the VB-EM update equations are
not available and one has to use a fixed-form approximation.  However, the
Riemannian conjugate gradient method may improve performance even when the
VB-EM update equations are available.
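
To sketch the idea, denote the variational parameters generically by
:math:`\xi` and the lower bound by :math:`\mathcal{L}` (illustrative notation,
not used elsewhere in this documentation).  A gradient-based update then takes
the form

.. math::

   \xi_{t+1} = \xi_t + \Delta_t \tilde{\nabla} \mathcal{L}(\xi_t),

where :math:`\tilde{\nabla} \mathcal{L}` is the (Riemannian or Euclidean)
gradient of the lower bound and :math:`\Delta_t` is the step length.  With the
Riemannian gradient and :math:`\Delta_t = 1` this recovers the standard VB-EM
update, whereas the conjugate gradient methods modify the search direction and
choose the step length by a line search.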


.. currentmodule:: bayespy.inference

The optimization algorithm in :func:`VB.optimize` has a simple interface.
Instead of using the default Riemannian geometry, one can use the Euclidean
geometry by passing :code:`riemannian=False`.  It is also possible to choose
the optimization method: either gradient ascent (:code:`method='gradient'`) or
a conjugate gradient method (only :code:`method='fletcher-reeves'` is
implemented at the moment).  For instance, we could optimize the nodes ``C``
and ``X`` jointly using Euclidean gradient ascent as:

>>> Q = VB(Y, C, X, alpha, tau)
>>> Q.optimize(C, X, riemannian=False, method='gradient', maxiter=5)
Iteration ...

Note that this is a very inefficient way of updating those nodes (bad geometry
and no conjugate gradients).  Thus, one should understand the idea behind these
optimization methods; otherwise one may end up doing something extremely
inefficient.  Most likely this method is useful in combination with the
advanced tricks in the following sections.


.. note::

   The Euclidean gradient has not yet been implemented for all nodes.  The
   Euclidean gradient is required not only by the optimization in the Euclidean
   geometry but also by the conjugate gradient methods in the Riemannian
   geometry.  Thus, the Riemannian conjugate gradient may not yet work for all
   models.


It is possible to construct custom optimization algorithms with the tools
provided by :class:`VB`.  For instance, :func:`VB.get_parameters` and
:func:`VB.set_parameters` can be used to handle the parameters of nodes, and
:func:`VB.get_gradients` can be used to compute the gradients of nodes.  The
parameter and gradient objects are not numerical arrays but more complex nested
lists that are not meant to be accessed directly by the user.  Thus, for simple
arithmetic with the parameter and gradient objects, use the functions
:func:`VB.add` and :func:`VB.dot`.  Finally, :func:`VB.compute_lowerbound` and
:func:`VB.has_converged` can be used to monitor the lower bound.
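
For illustration, a minimal custom gradient-ascent loop built from these tools
might look roughly like the following.  This is only a sketch: the exact call
signatures used below are assumptions based on the descriptions above and
should be checked from the :class:`VB` API documentation.

.. code-block:: python

    # A rough sketch of a custom gradient-ascent loop using the VB utility
    # methods described above.  The call signatures are assumed, not
    # verified; see the VB API documentation for the exact interfaces.
    step = 0.5
    L_prev = Q.compute_lowerbound()
    for i in range(10):
        p = Q.get_parameters(C, X)   # current parameters of the nodes C and X
        g = Q.get_gradients(C, X)    # gradients with respect to those parameters
        # Take a step p + step * g and write the result back to the nodes
        Q.set_parameters(Q.add(p, g, scale=step), C, X)
        L = Q.compute_lowerbound()   # monitor the lower bound
        if abs(L - L_prev) < 1e-6:   # simple convergence check
            break
        L_prev = L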


Collapsed inference
-------------------

The optimization method can be used efficiently in such a way that some of the
variables are collapsed, that is, marginalized out :cite:`Hensman:2012`.  The
collapsed variables must be conditionally independent given the observations
and all other variables.  Typically, one also wants the marginalized variables
to contain many parameters and the optimized variables only few.  For instance,
in our PCA example, we could optimize as follows:

>>> Q.optimize(C, tau, maxiter=10, collapsed=[X, alpha])
Iteration ...

The collapsed variables are given as a list.  This optimization basically does
the following: it first computes the gradients for ``C`` and ``tau`` and takes
an update step using the desired optimization method.  Then, it updates the
collapsed variables by using the standard VB-EM update equations.  These two
steps are alternated.  Effectively, this corresponds to collapsing the
variables ``X`` and ``alpha`` in a particular way.  The point of this method is
that the number of parameters in the optimization is reduced significantly and
the collapsed variables are updated optimally.  For more details, see
:cite:`Hensman:2012`.
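
Conceptually, one call to the collapsed optimization above behaves roughly like
the following hand-written alternation (only an illustrative sketch, not the
actual implementation):

.. code-block:: python

    # Illustrative sketch of the alternating structure of collapsed
    # optimization: a gradient-based step for the optimized nodes followed
    # by standard VB-EM updates for the collapsed nodes.
    for i in range(10):
        Q.optimize(C, tau, maxiter=1)   # one gradient-based step for C and tau
        Q.update(X, alpha)              # VB-EM updates for the collapsed nodes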

It is possible to use this method even when the collapsed variables are not
conditionally independent given the observations and all other variables.
However, in that case, the method no longer corresponds to collapsing the
variables but simply to running VB-EM updates after the gradient-based updates.
The method does not check for conditional independence, so the user is free to
do this.

.. note::

   Although the Riemannian conjugate gradient method has not yet been
   implemented for all nodes, it may be possible to collapse those nodes and
   optimize the other nodes for which the Euclidean gradient is already
   implemented.


Pattern search
--------------

The pattern search method estimates the direction in which the approximate
posterior distributions are moving and performs a line search in that direction
:cite:`Honkela:2003`.  The search direction is based on the difference in the
VB parameters over successive updates (or over several updates).  The idea is
that the VB-EM algorithm may be slow because it zigzags, and this can be
mitigated by moving in the direction in which VB-EM is slowly progressing.

BayesPy offers a simple built-in pattern search method
:func:`VB.pattern_search`.  The method updates the nodes twice, measures the
difference in the parameters and performs a line search with a small number of
function evaluations:

>>> Q.pattern_search(C, X)
Iteration ...

Similarly to the collapsed optimization, it is possible to collapse some of the
variables in the pattern search.  The same rules of conditional independence
apply as above.  The collapsed variables are given as a list:

>>> Q.pattern_search(C, tau, collapsed=[X, alpha])
Iteration ...

The maximum number of iterations can also be set with the ``maxiter`` keyword
argument.  It is not always obvious whether a pattern search will improve the
rate of convergence, but if the convergence seems slow because of zigzagging,
it may be worth a try.  Note that the computational cost of a pattern search is
quite high, so it is not recommended to perform one after every VB-EM update
but only every now and then, for instance after every 10 iterations, as
sketched below.  In addition, it is possible to write a more customized VB
learning algorithm which uses pattern searches by combining the methods of
:class:`VB` discussed above.
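
For example, a simple learning loop that interleaves ordinary VB-EM updates
with occasional pattern searches could look roughly like this (a sketch using
the methods described above):

.. code-block:: python

    # Run blocks of 10 VB-EM iterations and perform a (collapsed) pattern
    # search after each block, stopping once the lower bound has converged.
    for i in range(20):
        Q.update(repeat=10)                             # ordinary VB-EM updates
        Q.pattern_search(C, tau, collapsed=[X, alpha])  # occasional pattern search
        if Q.has_converged():
            break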


Deterministic annealing
-----------------------

The standard VB-EM algorithm converges to a local optimum, which can often be
inferior to the global optimum and to many other local optima.  Deterministic
annealing aims at finding a better local optimum, hopefully even the global
optimum :cite:`Katahira:2008`.  It does this by increasing the weight on the
entropy of the posterior approximation in the VB lower bound.  Effectively, the
annealed lower bound becomes closer to a uniform function instead of the
original multimodal lower bound.  The extra weight on the entropy is then
slowly removed, which makes the optimization much more robust to the
initialization.

In BayesPy, the annealing can be set with :func:`VB.set_annealing`.  The given
annealing term should be in the range :math:`(0,1]`, but this is not validated
in case the user wants to do something experimental.  If the annealing is set
to 1, the original VB lower bound is recovered.  An annealing value of 0 would
correspond to an improper uniform distribution and thus leads to errors.  The
entropy term is weighted by the inverse of the annealing term.  An alternative
view is that the model probability density functions are raised to the power of
the annealing term.
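
In equations, if :math:`T \in (0, 1]` denotes the annealing term,
:math:`\mathbf{Y}` the observations and :math:`\mathbf{Z}` all the unknown
variables (generic notation for this illustration), the annealed objective
described above can be written as

.. math::

   \mathcal{L}_T(q) = \mathrm{E}_q \left[ \log p(\mathbf{Y}, \mathbf{Z}) \right]
   + \frac{1}{T} \mathrm{H}[q],

where :math:`\mathrm{H}[q]` is the entropy of the posterior approximation.
Setting :math:`T = 1` recovers the standard lower bound, and multiplying the
objective by :math:`T` gives the equivalent view in which the model densities
are raised to the power :math:`T`.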

Typically, the annealing is used so that the annealing value is small at the
beginning and increased every time the VB algorithm converges, until the value
1 is reached.  After the annealing value is increased, the algorithm continues
from where it had just converged.  The annealing can be used, for instance, as
follows:

>>> beta = 0.1
>>> while beta < 1.0:
...     beta = min(beta*1.5, 1.0)
...     Q.set_annealing(beta)
...     Q.update(repeat=100, tol=1e-4)
Iteration ...

Here, the ``tol`` keyword argument is used to adjust the convergence threshold.
In this case, it is a bit larger than the default, so the algorithm does not
need to converge perfectly; a rougher convergence is sufficient before moving
on to the next annealing value.


Stochastic variational inference
--------------------------------

In stochastic variational inference :cite:`Hoffman:2013`, the idea is to use
mini-batches of large datasets to compute noisy gradients and to learn the VB
distributions by stochastic gradient ascent.  In order for it to be useful, the
model must be such that its variables can be divided into "intermediate" and
"global" variables.  The number of intermediate variables grows with the amount
of data, whereas the number of global variables remains fixed.  The global
variables are the ones learnt in the stochastic optimization.

By denoting the data as :math:`Y=[Y_1, \ldots, Y_N]`, the intermediate variables
as :math:`Z=[Z_1, \ldots, Z_N]` and the global variables as :math:`\theta`, the
model needs to have the following structure:

.. math::

   p(Y, Z, \theta) &= p(\theta) \prod^N_{n=1} p(Y_n|Z_n,\theta) p(Z_n|\theta)

The algorithm iterates three steps: 1) a random mini-batch of the data is
selected, 2) the corresponding intermediate variables are updated by using the
normal VB update equations, and 3) the global variables are updated with a
(stochastic) gradient-ascent step as if there were as many replications of the
mini-batch as are needed to recover the original dataset size.

The learning rate for the gradient ascent must satisfy:

.. math::

   \sum^\infty_{i=1} \alpha_i = \infty \qquad \text{and} \qquad
   \sum^\infty_{i=1} \alpha^2_i < \infty,

where :math:`i` is the iteration number.  An example of a valid learning rate
is :math:`\alpha_i = (\delta + i)^{-\gamma}`, where :math:`\delta \geq 0` is a
delay and :math:`\gamma \in (0.5, 1]` is a forgetting rate.
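
As a concrete example, the learning rate used in the optimization loop at the
end of this section corresponds to :math:`\delta = 2` and :math:`\gamma = 0.7`:

.. code-block:: python

    # Learning-rate schedule alpha_i = (delta + i) ** (-gamma) with delay
    # delta >= 0 and forgetting rate gamma in (0.5, 1].  These values match
    # the stochastic optimization example below.
    delta = 2.0
    gamma = 0.7

    def learning_rate(i):
        return (delta + i) ** (-gamma)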

Stochastic variational inference is relatively easy to use in BayesPy.  The
idea is that the user creates a model for the size of a mini-batch and
specifies a multiplier for those plate axes that are replicated.  For the PCA
example, the mini-batch model can be constructed as follows.  We decide to use
``X`` as an intermediate variable and the other variables as global variables.
The global variables ``alpha``, ``C`` and ``tau`` are constructed exactly as
before.  The intermediate variable ``X`` is constructed as:

>>> X = GaussianARD(0, 1,
...                 shape=(D,),
...                 plates=(1,5),
...                 plates_multiplier=(1,20),
...                 name='X')

Note that the plates are ``(1,5)``, whereas they are ``(1,100)`` in the full
model.  Thus, we need to provide the plates multiplier ``(1,20)`` to define how
the plates are replicated to obtain the full dataset.  These multipliers do not
need to be integers; in this case, the latter plate axis is multiplied by
:math:`100/5=20`.  The remaining variables are defined as before:

>>> F = Dot(C, X)
>>> Y = GaussianARD(F, tau, name='Y')

Note that the plates of ``F`` and ``Y`` also correspond to the size of the
mini-batch, and they deduce the plate multipliers from their parents, so we do
not need to specify the multipliers explicitly here (although it is fine to do
so).

Let us construct the inference engine for the new mini-batch model:

>>> Q = VB(Y, C, X, alpha, tau)

Use random initialization for ``C`` to break the symmetry in ``C`` and ``X``:

>>> C.initialize_from_random()

Then, stochastic variational inference algorithm could look as follows:

>>> Q.ignore_bound_checks = True
>>> for n in range(200):
...     subset = np.random.choice(100, 5)
...     Y.observe(data[:,subset])
...     Q.update(X)
...     learning_rate = (n + 2.0) ** (-0.7)
...     Q.gradient_step(C, alpha, tau, scale=learning_rate)
Iteration ...

First, we ignore the lower bound checks because the bound is stochastic in this
setting and may occasionally decrease.  Then, the loop consists of three parts:
1) draw a random mini-batch of the data (5 samples out of 100), 2) update the
intermediate variable ``X``, and 3) update the global variables with a
gradient-ascent step using a valid learning rate.


Black-box variational inference
-------------------------------

NOT YET IMPLEMENTED.