File: fit_maths.tex

% fit_maths.tex
%
% The documentation in this file is part of Pyxplot
% <http://www.pyxplot.org.uk>
%
% Copyright (C) 2006-2012 Dominic Ford <coders@pyxplot.org.uk>
%               2008-2012 Ross Church
%
% $Id: fit_maths.tex 1261 2012-07-11 21:38:05Z dcf21 $
%
% Pyxplot is free software; you can redistribute it and/or modify it under the
% terms of the GNU General Public License as published by the Free Software
% Foundation; either version 2 of the License, or (at your option) any later
% version.
%
% You should have received a copy of the GNU General Public License along with
% Pyxplot; if not, write to the Free Software Foundation, Inc., 51 Franklin
% Street, Fifth Floor, Boston, MA  02110-1301, USA

% ----------------------------------------------------------------------------

% LaTeX source for the Pyxplot Users' Guide

\chapter{The {\tt fit} command: mathematical details}
\chaptermark{Details of the {\tt fit} command}
\label{ch:fit_maths}

In this section, the mathematical details of the workings of the \indcmdt{fit}
are described. This may be of interest in diagnosing its limitations, and also
in understanding the various quantities that it outputs after a fit is found.
This discussion must necessarily be a rather brief treatment of a large
subject; for a fuller account, the reader is referred to D.S.\ Sivia's {\it
Data Analysis: A Bayesian Tutorial}.

\section{Notation}
\label{sec:bayes_notation}

I shall assume that we have some function $f()$, which takes $n_\mathrm{x}$
parameters, $x_0$...$x_{n_\mathrm{x}-1}$, the set of which may collectively be
written as the vector $\mathbf{x}$. We are supplied a datafile, containing a
number $n_\mathrm{d}$ of datapoints, each consisting of a set of values for
each of the $n_\mathrm{x}$ parameters, and one for the value which we are
seeking to make $f(\mathbf{x})$ match. I shall call the set of parameter values for the
$i$th datapoint $\mathbf{x}_i$, and the corresponding value which we are trying
to match $f_i$. The \datafile\ may contain error estimates for the values $f_i$,
which I shall denote $\sigma_i$. If these are not supplied, then I shall
consider these quantities to be unknown, and equal to some constant
$\sigma_\mathrm{data}$.

Finally, I assume that there are $n_\mathrm{u}$ coefficients within the
function $f()$ that we are able to vary, corresponding to those variable names
listed after the {\tt via} statement in the {\tt fit} command. I shall
call these coefficients $u_0$...$u_{n_\mathrm{u}-1}$, and refer to them
collectively as $\mathbf{u}$.
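
As a concrete, purely illustrative instance of this notation (a hypothetical
example, not one taken from elsewhere in this manual): fitting the straight
line $f(x) = u_0 + u_1 x$ to a three-column \datafile\ whose columns contain
values of $x$, $f$ and $\sigma$ gives $n_\mathrm{x}=1$ and $n_\mathrm{u}=2$;
each row $i$ of the file supplies one triplet $(\mathbf{x}_i, f_i, \sigma_i)$,
and $u_0$ and $u_1$ are the two coefficients which would be listed after
{\tt via}.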

I model the values $f_i$ in the supplied \datafile\ as being noisy
Gaussian-distributed observations of the true function $f()$, and within this
framework, seek to find that vector of values $\mathbf{u}$ which is most
probable, given these observations. The probability of any given $\mathbf{u}$
is written
$\mathrm{P}\left( \mathbf{u} | \left\{ \mathbf{x}_i, f_i, \sigma_i \right\} \right)$.

\section{The probability density function}
\label{sec:bayes_pdf}

Bayes' Theorem states that:

\begin{equation}
\mathrm{P}\left( \mathbf{u} | \left\{ \mathbf{x}_i, f_i, \sigma_i \right\} \right) =
\frac{
\mathrm{P}\left( \left\{f_i \right\} | \mathbf{u}, \left\{ \mathbf{x}_i, \sigma_i \right\} \right)
\mathrm{P}\left( \mathbf{u} | \left\{ \mathbf{x}_i, \sigma_i \right\} \right)
}{
\mathrm{P}\left( \left\{f_i \right\} | \left\{ \mathbf{x}_i, \sigma_i \right\} \right)
}
\end{equation}

Since we are only seeking to maximise the quantity on the left, and the
denominator, termed the Bayesian \textit{evidence}, is independent of
$\mathbf{u}$, we can neglect it and replace the equality sign with a
proportionality sign. Furthermore, if we assume a uniform prior, that is, we
assume that we have no prior knowledge to bias us towards certain more favoured
values of $\mathbf{u}$, then the prior $\mathrm{P}\left( \mathbf{u} | \left\{
\mathbf{x}_i, \sigma_i \right\} \right)$ is also a constant which can be
neglected. We conclude that maximising $\mathrm{P}\left(
\mathbf{u} | \left\{ \mathbf{x}_i, f_i, \sigma_i \right\} \right)$ is
equivalent to maximising $\mathrm{P}\left( \left\{f_i \right\} | \mathbf{u},
\left\{ \mathbf{x}_i, \sigma_i \right\} \right)$.

Since we are assuming $f_i$ to be Gaussian-distributed observations of the true
function $f()$, this latter probability can be written as a product of
$n_\mathrm{d}$ Gaussian distributions, where $f_\mathbf{u}(\mathbf{x}_i)$ denotes
the value returned by the function at $\mathbf{x}_i$ when its free coefficients
are set to $\mathbf{u}$:

\begin{equation}
\mathrm{P}\left( \left\{f_i \right\} | \mathbf{u}, \left\{ \mathbf{x}_i, \sigma_i \right\} \right)
=
\prod_{i=0}^{n_\mathrm{d}-1} \frac{1}{\sigma_i\sqrt{2\pi}} \exp \left(
\frac{
-\left[f_i - f_\mathbf{u}(\mathbf{x}_i)\right]^2
}{
2 \sigma_i^2
} \right)
\end{equation}

The product in this equation can be converted into a more computationally
workable sum by taking the logarithm of both sides. Since logarithms are
monotonically increasing functions, maximising a probability is equivalent to
maximising its logarithm. We may write the logarithm $L$ of $\mathrm{P}\left(
\mathbf{u} | \left\{ \mathbf{x}_i, f_i, \sigma_i \right\} \right)$ as:

\begin{equation}
L = \sum_{i=0}^{n_\mathrm{d}-1}
\left( \frac{
-\left[f_i - f_\mathbf{u}(\mathbf{x}_i)\right]^2
}{
2 \sigma_i^2
} \right) + k
\end{equation}

\noindent where $k$ is some constant which does not affect the maximisation
process. Up to that constant, $L$ is minus half of the familiar weighted sum of
square residuals, so maximising it is equivalent to the usual least-squares
minimisation. It is this quantity that we numerically maximise to find our
best-fitting set of parameters, which I shall refer to from here on as
$\mathbf{u}^0$.
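
Pyxplot performs this maximisation internally within its {\tt fit} command,
which is implemented in C. Purely as an illustration of the quantity being
maximised -- not a description of Pyxplot's own code -- the following sketch
uses Python with NumPy and SciPy, a made-up model
$f(x;\mathbf{u})=u_0 e^{-u_1 x}$ and made-up data; every name in it is
hypothetical.

\begin{verbatim}
# Illustration only: maximise L for the made-up model
# f(x; u) = u0 * exp(-u1 * x), fitted to made-up data.
import numpy as np
from scipy.optimize import minimize

# hypothetical datapoints (x_i, f_i) and error estimates sigma_i
x     = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f     = np.array([2.1, 1.4, 0.9, 0.6, 0.4])
sigma = np.array([0.1, 0.1, 0.1, 0.1, 0.1])

def model(u, x):
    return u[0] * np.exp(-u[1] * x)

def neg_L(u):
    # minus the log-probability L (dropping the constant k);
    # minimising this function maximises L
    return np.sum((f - model(u, x))**2 / (2.0 * sigma**2))

result = minimize(neg_L, x0=np.array([1.0, 1.0]))
u_best = result.x        # the best-fitting coefficient vector u^0
\end{verbatim}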

\section{Estimating the error in $\mathbf{u}^0$}

To estimate the error in the best-fitting parameter values that we find, we
assume $\mathrm{P}\left( \mathbf{u} | \left\{ \mathbf{x}_i, f_i, \sigma_i
\right\} \right)$ to be approximated by an $n_\mathrm{u}$-dimensional Gaussian
distribution around $\mathbf{u}^0$. Taking a Taylor expansion of
$L(\mathbf{u})$ about $\mathbf{u}^0$, we can write:

\begin{eqnarray}
L(\mathbf{u}) & = & L(\mathbf{u}^0) +
    \underbrace{
      \sum_{i=0}^{n_\mathrm{u}-1} \left( u_i - u^0_i \right)
      \left.\frac{\partial L}{\partial u_i}\right|_{\mathbf{u}^0}
    }_{\textrm{Zero at $\mathbf{u}^0$ by definition}} + \label{eqa:L_taylor_expand}\\
& & \sum_{i=0}^{n_\mathrm{u}-1} \sum_{j=0}^{n_\mathrm{u}-1} \frac{\left( u_i - u^0_i \right) \left( u_j - u^0_j \right)}{2}
    \left.\frac{\partial^2 L}{\partial u_i \partial u_j}\right|_{\mathbf{u}^0} +
    \mathcal{O}\left( \mathbf{u} - \mathbf{u}^0\right)^3 \nonumber
\end{eqnarray}

Since the logarithm of a Gaussian distribution is a parabola, the quadratic
terms in the above expansion encode the Gaussian component of the probability
distribution $\mathrm{P}\left( \mathbf{u} | \left\{ \mathbf{x}_i, f_i, \sigma_i
\right\} \right)$ about $\mathbf{u}^0$.\footnote{The use of this is called
\textit{Gauss' Method}. Higher order terms in the expansion represent any
non-Gaussianity in the probability distribution, which we neglect. See MacKay,
D.J.C., \textit{Information Theory, Inference and Learning Algorithms}, CUP
(2003).} We may write the sum of these terms, which we denote $Q$, in matrix
form:

\begin{equation}
Q = \frac{1}{2} \left(\mathbf{u} - \mathbf{u}^0\right)^\mathbf{T} \mathbf{A} \left(\mathbf{u} - \mathbf{u}^0\right)
\label{eqn:Q_vector}
\end{equation}

\noindent where the superscript $^\mathbf{T}$ represents the transpose of the
vector displacement from $\mathbf{u}^0$, and $\mathbf{A}$ is the Hessian matrix
of $L$, given by:

\begin{equation}
A_{ij} = \nabla\nabla L = \left.\frac{\partial^2 L}{\partial u_i \partial u_j}\right|_{\mathbf{u}^0}
\end{equation}
\index{Hessian matrix}

This is the Hessian matrix which is output by the {\tt fit} command. In
general, an $n_\mathrm{u}$-dimensional Gaussian distribution such as that given
by Equation~(\ref{eqa:L_taylor_expand}) yields elliptical contours of
equi-probability in parameter space, whose principal axes need not be aligned
with our chosen coordinate axes -- the variables
$u_0 \ldots u_{n_\mathrm{u}-1}$. The eigenvectors $\mathbf{e}_i$ of $\mathbf{A}$
are the principal axes of these ellipses, and the corresponding eigenvalues
$\lambda_i$ equal $-1/\sigma_i^2$, where $\sigma_i$ is the standard deviation
of the probability density function along the direction of these axes.
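
To make the role of the eigen-decomposition concrete, the following sketch
continues the illustrative Python example above (again, hypothetical names
only, and not Pyxplot's own code): it estimates $\mathbf{A}$ at $\mathbf{u}^0$
by central finite differences and reads off the principal axes of the error
ellipses together with the standard deviation along each.

\begin{verbatim}
# Illustration only, continuing the sketch above: estimate the
# Hessian matrix A of L at u^0 by central finite differences.
import numpy as np

def hessian(L, u, h=1e-4):
    n = len(u)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            upp = u.copy(); upp[i] += h; upp[j] += h
            upm = u.copy(); upm[i] += h; upm[j] -= h
            ump = u.copy(); ump[i] -= h; ump[j] += h
            umm = u.copy(); umm[i] -= h; umm[j] -= h
            A[i, j] = (L(upp) - L(upm) - L(ump) + L(umm)) / (4 * h * h)
    return A

L_of_u = lambda u: -neg_L(u)    # L itself, up to the constant k
A = hessian(L_of_u, u_best)

# The eigenvectors e_i are the principal axes of the error
# ellipses; since A is negative definite at a maximum, the
# standard deviation along each axis is 1/sqrt(-lambda_i).
lam, e = np.linalg.eigh(A)
axis_sigma = 1.0 / np.sqrt(-lam)
\end{verbatim}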

This can be visualised by imagining that we diagonalise $\mathbf{A}$, and
expand Equation~(\ref{eqn:Q_vector}) in our diagonal basis. The resulting
expression for $Q$ is a sum of square terms; the cross terms vanish in this
basis by definition. The equations of the equi-probability contours become the
equations of ellipses:

\begin{equation}
Q = \frac{1}{2} \sum_{i=0}^{n_\mathrm{u}-1} A_{ii} \left(u_i - u^0_i\right)^2 = k
\end{equation}

\noindent where $k$ is some constant. By comparison with the equation for the
logarithm of a Gaussian distribution, we can associate $A_{ii}$ with
$-1/\sigma_i^2$ in our eigenvector basis.

The problem of evaluating the standard deviations of our variables $u_i$ is
more complicated, however, as we are attempting to evaluate the width of these
elliptical equi-probability contours in directions which are, in general, not
aligned with their principal axes. To achieve this, we first convert our
Hessian matrix into a covariance matrix.

\section{The covariance matrix}
\index{covariance matrix}

The terms of the covariance matrix $V_{ij}$ are defined by:

\begin{equation}
V_{ij} = \left< \left(u_i - u^0_i\right) \left(u_j - u^0_j\right) \right>
\label{eqn:def_covar}
\end{equation}

\noindent Its leading diagonal terms may be recognised as equalling the
variances of each of our $n_\mathrm{u}$ variables; its cross terms measure the
correlation between the variables. If a component $V_{ij} > 0$, it implies that
higher estimates of the coefficient $u_i$ make higher estimates of $u_j$ more
favourable also; if $V_{ij} < 0$, the converse is true.

It is a standard statistical result that $\mathbf{V} = (-\mathbf{A})^{-1}$. In
the remainder of this section we prove this; readers who are willing to accept
the result may skip straight to Section~\ref{sec:correlation_matrix}.

Using $\Delta u_i$ to denote $\left(u_i - u^0_i\right)$, we may proceed by
rewriting Equation~(\ref{eqn:def_covar}) as:

\begin{eqnarray}
V_{ij} & = & \idotsint_{u_i=-\infty}^{\infty}
\Delta u_i \Delta u_j
\mathrm{P}\left(
\mathbf{u} | \left\{ \mathbf{x}_i, f_i, \sigma_i \right\} \right)
\,\mathrm{d}^{n_\mathrm{u}}\mathbf{u} \\
 & = & \frac{
\idotsint_{u_i=-\infty}^{\infty} \Delta u_i \Delta u_j \exp(Q) \,\mathrm{d}^{n_\mathrm{u}}\mathbf{u}
}{
\idotsint_{u_i=-\infty}^{\infty} \exp(Q) \,\mathrm{d}^{n_\mathrm{u}}\mathbf{u}
}
\nonumber
\end{eqnarray}

The normalisation factor in the denominator of this expression, which we denote
as $Z$, the \textit{partition function}, may be evaluated by
$n_\mathrm{u}$-dimensional Gaussian integration, and is a standard result:

\begin{eqnarray}
Z & = & \idotsint_{u_i=-\infty}^{\infty} \exp\left(\frac{1}{2} \Delta \mathbf{u}^\mathbf{T} \mathbf{A} \Delta \mathbf{u} \right) \,\mathrm{d}^{n_\mathrm{u}}\mathbf{u} \\
& = & \frac{(2\pi)^{n_\mathrm{u}/2}}{\sqrt{\mathrm{Det}(\mathbf{-A})}} \nonumber
\end{eqnarray}

Differentiating $\log_e(Z)$ with respect to any given component of the Hessian
matrix $A_{ij}$ yields:

\begin{equation}
2 \frac{\partial}{\partial A_{ij}} \left[ \log_e(Z) \right] = \frac{1}{Z}
\idotsint_{u_i=-\infty}^{\infty} \Delta u_i \Delta u_j \exp(Q) \,\mathrm{d}^{n_\mathrm{u}}\mathbf{u}
\end{equation}

\noindent which we may identify as equalling $V_{ij}$:

\begin{eqnarray}
\label{eqa:v_zrelate}
V_{ij} & = & 2 \frac{\partial}{\partial A_{ij}} \left[ \log_e(Z) \right] \\
& = & 2 \frac{\partial}{\partial A_{ij}} \left[ \log_e\left((2\pi)^{n_\mathrm{u}/2}\right) - \frac{1}{2}\log_e\left(\mathrm{Det}(\mathbf{-A})\right) \right] \nonumber \\
& = & -\frac{\partial}{\partial A_{ij}} \left[ \log_e\left(\mathrm{Det}(\mathbf{-A})\right) \right] \nonumber
\end{eqnarray}

\noindent This expression may be simplified by recalling that the determinant
of a matrix is equal to the scalar product of any of its rows with its
cofactors. Since $\partial(-A)_{ij}/\partial A_{ij} = -1$, this yields the
result:

\begin{equation}
\frac{\partial}{\partial A_{ij}} \left[\mathrm{Det}(\mathbf{-A})\right] = -a_{ij}
\end{equation}

\noindent where $a_{ij}$ is the cofactor of the $(i,j)$th element of
$-\mathbf{A}$. Substituting this into Equation~(\ref{eqa:v_zrelate}) yields:

\begin{equation}
V_{ij} = \frac{a_{ij}}{\mathrm{Det}(\mathbf{-A})}
\end{equation}

Recalling that the adjugate $(-\mathbf{A})^\dagger$ of the matrix $-\mathbf{A}$
is the transpose of its matrix of cofactors, and that $\mathbf{A}$, and hence
$-\mathbf{A}$, is symmetric, we may write:

\begin{equation}
\mathbf{V} = \frac{(-\mathbf{A})^\dagger}{\mathrm{Det}(\mathbf{-A})} \equiv (-\mathbf{A})^{-1}
\end{equation}

\noindent which proves the result stated earlier.
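
In the illustrative Python sketch built up above, this result reduces the
error analysis to a single matrix inversion (again, hypothetical names only,
and not Pyxplot's own code):

\begin{verbatim}
# Illustration only, continuing the sketch above: the covariance
# matrix is the inverse of minus the Hessian, and the one-sigma
# uncertainty on each fitted coefficient is the square root of
# the corresponding diagonal entry.
import numpy as np

V       = np.linalg.inv(-A)
u_sigma = np.sqrt(np.diag(V))
\end{verbatim}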

\section{The correlation matrix}
\label{sec:correlation_matrix}
\index{correlation matrix}

Having evaluated the covariance matrix, we may straightforwardly find the
standard deviations in each of our variables, by taking the square roots of the
terms along its leading diagonal. For datafiles where the user does not specify
the standard deviations $\sigma_i$ of each value $f_i$, however, the task is not
quite complete, as the Hessian matrix depends critically upon these uncertainties,
even if they are assumed the same for all of our $f_i$. This point is returned
to in Section~\ref{sec:finding_sigmai}.

The correlation matrix $\mathbf{C}$, whose terms are given by:

\begin{equation}
C_{ij} = \frac{V_{ij}}{\sigma_i\sigma_j}
\end{equation}

\noindent where $\sigma_i = \sqrt{V_{ii}}$ here denotes the standard deviation
of the coefficient $u_i$, may be considered a more user-friendly version of the
covariance matrix for inspecting the correlation between parameters. The
leading diagonal terms are all equal to unity by construction. The cross terms
lie in the range $-1 \leq C_{ij} \leq 1$; the upper limit of this range
represents perfect correlation between parameters, and the lower limit perfect
anti-correlation.
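
Continuing the illustrative Python sketch above (hypothetical names only),
the correlation matrix follows from the covariance matrix in one line:

\begin{verbatim}
# Illustration only: divide each covariance by the product of the
# two standard deviations; the diagonal entries are exactly one.
import numpy as np

C = V / np.outer(u_sigma, u_sigma)
\end{verbatim}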

\section{Finding $\sigma_i$}
\label{sec:finding_sigmai}

Throughout the preceding sections, the uncertainties in the supplied target
values $f_i$ have been denoted $\sigma_i$ (see
Section~\ref{sec:bayes_notation}). The user has the option of supplying these
in the source datafile, in which case the analysis of the previous sections is
complete; both best-estimate parameter values and their uncertainties
can be calculated. The user may also, however, leave the uncertainties in $f_i$
unstated, in which case, as described in Section~\ref{sec:bayes_notation}, we
assume all of the data values to have a common uncertainty
$\sigma_\mathrm{data}$, which is an unknown.

In this case, where $\sigma_i = \sigma_\mathrm{data} \,\forall\, i$, the best
fitting parameter values are independent of $\sigma_\mathrm{data}$, but the
same is not true of the uncertainties in these values, as the terms of the
Hessian matrix do depend upon $\sigma_\mathrm{data}$. We must therefore
undertake a further calculation to find the most probable value of
$\sigma_\mathrm{data}$, given the data. This is achieved by maximising
$\mathrm{P}\left( \sigma_\mathrm{data} | \left\{ \mathbf{x}_i, f_i \right\}
\right)$. Returning once again to Bayes' Theorem, we can write:

\begin{equation}
\mathrm{P}\left( \sigma_\mathrm{data} | \left\{ \mathbf{x}_i, f_i \right\} \right)
= \frac{
\mathrm{P}\left( \left\{ f_i \right\} | \sigma_\mathrm{data}, \left\{ \mathbf{x}_i \right\} \right)
\mathrm{P}\left( \sigma_\mathrm{data} | \left\{ \mathbf{x}_i \right\} \right)
}{
\mathrm{P}\left( \left\{ f_i \right\} | \left\{ \mathbf{x}_i \right\} \right)
}
\end{equation}

As before, we neglect the denominator, which has no effect upon the
maximisation problem, and assume a uniform prior $\mathrm{P}\left(
\sigma_\mathrm{data} | \left\{ \mathbf{x}_i \right\} \right)$. This reduces the
problem to the maximisation of $\mathrm{P}\left( \left\{ f_i \right\} |
\sigma_\mathrm{data}, \left\{ \mathbf{x}_i \right\} \right)$, which we may
write as a marginalised probability distribution over $\mathbf{u}$:

\begin{eqnarray}
\label{eqa:p_f_given_sigma}
\mathrm{P}\left( \left\{ f_i \right\} | \sigma_\mathrm{data}, \left\{ \mathbf{x}_i \right\} \right) =
\idotsint_{-\infty}^{\infty}
&
\mathrm{P}\left( \left\{ f_i \right\} | \sigma_\mathrm{data}, \left\{ \mathbf{x}_i \right\}, \mathbf{u} \right)
\times & \\ &
\mathrm{P}\left( \mathbf{u} | \sigma_\mathrm{data}, \left\{ \mathbf{x}_i \right\} \right)
\,\mathrm{d}^{n_\mathrm{u}}\mathbf{u}
& \nonumber
\end{eqnarray}

Assuming a uniform prior for $\mathbf{u}$, we may neglect the latter term in
the integral, but even with this assumption, the integral is not generally
tractable, as $\mathrm{P}\left( \left\{ f_i \right\} | \sigma_\mathrm{data},
\left\{ \mathbf{x}_i \right\}, \mathbf{u} \right)$ may well
be multimodal in form. However, if we neglect such possibilities, and assume
this probability distribution to be approximately Gaussian \textit{globally},
we can make use of the standard result for an $n_\mathrm{u}$-dimensional
Gaussian integral:

\begin{equation}
\idotsint_{-\infty}^{\infty}
\exp \left(
\frac{1}{2}\mathbf{u}^\mathbf{T} \mathbf{A} \mathbf{u}
\right) \,\mathrm{d}^{n_\mathrm{u}}\mathbf{u}
=
\frac{
(2\pi)^{n_\mathrm{u}/2}
}{
\sqrt{\mathrm{Det}\left(-\mathbf{A}\right)}
}
\end{equation}

\noindent We may thus approximate Equation~(\ref{eqa:p_f_given_sigma}) as:

\begin{eqnarray}
\mathrm{P}\left( \left\{ f_i \right\} | \sigma_\mathrm{data}, \left\{ \mathbf{x}_i \right\} \right)
& \approx &
\mathrm{P}\left( \left\{ f_i \right\} | \sigma_\mathrm{data}, \left\{ \mathbf{x}_i \right\}, \mathbf{u}^0 \right)
\times \\
& &
\mathrm{P}\left( \mathbf{u}^0 | \sigma_\mathrm{data}, \left\{ \mathbf{x}_i, f_i \right\} \right)
\frac{
(2\pi)^{n_\mathrm{u}/2}
}{
\sqrt{\mathrm{Det}\left(-\mathbf{A}\right)}
}
\nonumber
\end{eqnarray}

As in Section~\ref{sec:bayes_pdf}, it is numerically easier to maximise this
quantity via its logarithm, which we denote $L_2$, and can write as:

\begin{eqnarray}
L_2 & = &
\sum_{i=0}^{n_\mathrm{d}-1}
\left(
\frac{
-\left[f_i - f_{\mathbf{u}^0}(\mathbf{x}_i)\right]^2
}{
2\sigma_\mathrm{data}^2
}
- \log_e \left(\sigma_\mathrm{data}\sqrt{2\pi} \right)
\right) +
\\ & & \nonumber
\log_e \left(
\frac{
(2\pi)^{n_\mathrm{u}/2}
}{
\sqrt{\mathrm{Det}\left(-\mathbf{A}\right)}
}
\right)
\end{eqnarray}

This quantity is maximised numerically, a process simplified by the fact that
$\mathbf{u}^0$ is independent of $\sigma_\mathrm{data}$.
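
As a check on the behaviour of this maximisation -- offered here as an aside,
and not as a description of the procedure the {\tt fit} command follows -- its
position can be located in closed form under the assumptions already made.
Because every term of the Hessian matrix scales as $1/\sigma_\mathrm{data}^2$
for the Gaussian likelihood above, $\mathrm{Det}(-\mathbf{A}) \propto
\sigma_\mathrm{data}^{-2n_\mathrm{u}}$. Writing $S =
\sum_{i=0}^{n_\mathrm{d}-1}\left[f_i - f_{\mathbf{u}^0}(\mathbf{x}_i)\right]^2$,
the $\sigma_\mathrm{data}$-dependent part of $L_2$ is

\[
L_2 = -\frac{S}{2\sigma_\mathrm{data}^2}
      - \left(n_\mathrm{d} - n_\mathrm{u}\right)\log_e \sigma_\mathrm{data}
      + \mathrm{constant},
\]

\noindent and setting $\partial L_2 / \partial \sigma_\mathrm{data} = 0$ gives

\[
\sigma_\mathrm{data}^2 = \frac{S}{n_\mathrm{d} - n_\mathrm{u}},
\]

\noindent the familiar estimate of the scatter of the data about the
best-fitting model with $n_\mathrm{u}$ degrees of freedom removed, valid
provided $n_\mathrm{d} > n_\mathrm{u}$.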