File: dist_tutorial.qbk

[/ def names end in distrib to avoid clashes]
[def __binomial_distrib [link math_toolkit.dist.dist_ref.dists.binomial_dist Binomial Distribution]]
[def __chi_squared_distrib [link math_toolkit.dist.dist_ref.dists.chi_squared_dist Chi Squared Distribution]]
[def __normal_distrib [link math_toolkit.dist.dist_ref.dists.normal_dist Normal Distribution]]
[def __F_distrib [link math_toolkit.dist.dist_ref.dists.f_dist Fisher F Distribution]]
[def __students_t_distrib [link math_toolkit.dist.dist_ref.dists.students_t_dist Students t Distribution]]

[def __handbook [@http://www.itl.nist.gov/div898/handbook/ 
NIST/SEMATECH e-Handbook of Statistical Methods.]]

[section:stat_tut Statistical Distributions Tutorial]
This library is centred around statistical distributions.  This tutorial
gives an overview of what they are and how they can be used, and 
provides a few worked examples of applying the library to statistical tests.

[section:overview Overview]

[h4 Headers and Namespaces]

All the code in this library is inside namespace boost::math.

In order to use a distribution /my_distribution/ you will need to include
either the header <boost/math/distributions/my_distribution.hpp> or
the "include everything" header: <boost/math/distributions.hpp>.

For example, to use the Students-t distribution include either
<boost/math/distributions/students_t.hpp> or
<boost/math/distributions.hpp>.
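
Putting that together, a minimal sketch of a complete program (nothing is assumed
here beyond the header above and the `students_t` typedef described below) looks like:

   #include <boost/math/distributions/students_t.hpp>

   using boost::math::students_t;   // double-precision typedef, see below.

   int main()
   {
      students_t dist(4);   // Students-t distribution with 4 degrees of freedom.
   }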

[h4 Distributions are Objects]

Each kind of distribution in this library is a class type.

[link math_toolkit.policy Policies] provide fine-grained control 
of the behaviour of these classes, allowing the user to customise
behaviour such as how errors are handled, or how the quantiles
of discrete distributions behave.

[tip If you are familiar with statistics libraries using functions,
and 'Distributions as Objects' seems alien, see
[link math_toolkit.dist.stat_tut.weg.nag_library the comparison to 
other statistics libraries.]  
] [/tip]

Making distributions class types does two things:

* It encapsulates the kind of distribution in the C++ type system; 
so, for example, Students-t distributions are always a different C++ type from 
Chi-Squared distributions.
* The distribution objects store any parameters associated with the 
distribution: for example, the Students-t distribution has a
['degrees of freedom] parameter that controls the shape of the distribution.
This ['degrees of freedom] parameter has to be provided
to the Students-t object when it is constructed.

Although the distribution classes in this library are templates, there
are typedefs on type /double/ that mostly take the usual name of the
distribution
(except where there is a clash with a function of the same name: beta and gamma,
in which case using the default template arguments - `RealType = double` - 
is nearly as convenient).
Probably 95% of uses are covered by these typedefs:

   using namespace boost::math;
   
   // Construct a students_t distribution with 4 degrees of freedom:
   students_t d1(4);
   
   // Construct a double-precision beta distribution 
   // with parameters a = 10, b = 20
   beta_distribution<> d2(10, 20); // Note: _distribution<> suffix !
   
If you need to use the distributions with a type other than `double`,
then you can instantiate the template directly: the names of the 
templates are the same as the `double` typedef but with `_distribution`
appended, for example: __students_t_distrib or __binomial_distrib:

   // Construct a students_t distribution, of float type,
   // with 4 degrees of freedom:
   students_t_distribution<float> d3(4);
   
   // Construct a binomial distribution, of long double type,
   // with probability of success 0.3
   // and 20 trials in total:
   binomial_distribution<long double> d4(20, 0.3);
   
The parameters passed to the distributions can be accessed via getter member
functions:

   d1.degrees_of_freedom();  // returns 4.0 
   
This is all well and good, but not very useful so far.  What we often want
is to be able to calculate the /cumulative distribution functions/ and
/quantiles/ etc for these distributions.

[h4 Generic operations common to all distributions are non-member functions]

Want to calculate the PDF (Probability Density Function) of a distribution?
No problem, just use:

   pdf(my_dist, x);  // Returns PDF (density) at point x of distribution my_dist.
   
Or how about the CDF (Cumulative Distribution Function):

   cdf(my_dist, x);  // Returns CDF (integral from -infinity to point x)
                     // of distribution my_dist.
   
And quantiles are just the same:

   quantile(my_dist, p);  // Returns the value of the random variable x
                          // such that cdf(my_dist, x) == p.
                          
If you're wondering why these aren't member functions, it's to 
make the library more easily extensible: if you want to add additional
generic operations - let's say the /n'th moment/ - then all you have to
do is add the appropriate non-member functions, overloaded for each
implemented distribution type.
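
For example, here is a short self-contained sketch (reusing the Students-t
distribution constructed earlier; the printed values are simply whatever the
library computes) that exercises all three functions:

   #include <boost/math/distributions/students_t.hpp>
   #include <iostream>

   int main()
   {
      using namespace boost::math;

      students_t dist(4);   // 4 degrees of freedom.

      std::cout << "pdf at t = 1:         " << pdf(dist, 1.0) << std::endl;
      std::cout << "cdf at t = 1:         " << cdf(dist, 1.0) << std::endl;
      // quantile is the inverse of cdf:
      std::cout << "quantile at p = 0.95: " << quantile(dist, 0.95) << std::endl;
   }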

[tip

[*Random numbers that approximate Quantiles of Distributions]

If you want random numbers that are distributed in a specific way,
for example according to a uniform, normal or triangular distribution,
see [@http://www.boost.org/libs/random/ Boost.Random].

Whilst in principle there's nothing to prevent you from using the 
quantile function to convert a uniformly distributed random
number to another distribution, in practice there are much more 
efficient algorithms available that are specific to random number generation.
] [/tip Random numbers that approximate Quantiles of Distributions]

For example, the binomial distribution has two parameters:
n (the number of trials) and p (the probability of success on one trial).

The `binomial_distribution` constructor therefore has two parameters:

`binomial_distribution(RealType n, RealType p);`
  
For this distribution the random variate is k: the number of successes observed.
The probability density\/mass function (pdf) is therefore written as ['f(k; n, p)].

[note

[*Random Variates and Distribution Parameters]

[@http://en.wikipedia.org/wiki/Random_variate Random variates]
and [@http://en.wikipedia.org/wiki/Parameter distribution parameters]
are conventionally distinguished (for example in Wikipedia and Wolfram MathWorld)
by placing a semi-colon (or sometimes vertical bar)
after the random variate (whose value you 'choose'),
to separate the variate from the parameter(s) that defines the shape of the distribution.
] [/note Random Variates and Distribution Parameters]

As noted above the non-member function `pdf` has one parameter for the distribution object,
and a second for the random variate.  So taking our binomial distribution 
example, we would write:

`pdf(binomial_distribution<RealType>(n, p), k);`

The distribution (effectively the random variate) is said to be 'supported' over a range that is
[@http://en.wikipedia.org/wiki/Probability_distribution
 "the smallest closed set whose complement has probability zero"].
MathWorld uses the word 'defined' for this range.
Non-mathematicians might say it means the 'interesting' smallest range
of random variate x that has the cdf going from zero to unity.
Outside are uninteresting zones where the pdf is zero, and the cdf zero or unity.
Mathematically, the functions may make sense with an (+ or -) infinite value,
but except for a few special cases (in the Normal and Cauchy distributions)
this implementation limits random variates to finite values between the `min` 
and `max` for the `RealType`.
(See [link math_toolkit.backgrounders.implementation.handling_of_floating_point_infinity 
Handling of Floating-Point Infinity] for rationale).

The range of random variate values that is permitted and supported can be 
tested by using two functions `range` and `support`.
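
For example, a hedged sketch (any distribution could be used; a Students-t is
shown here) that queries both functions, each of which returns a `std::pair`
holding the lower and upper limits:

   #include <boost/math/distributions/students_t.hpp>
   #include <iostream>
   #include <utility>

   int main()
   {
      using namespace boost::math;

      students_t dist(5);

      std::pair<double, double> r = range(dist);    // Permitted values of the random variate.
      std::pair<double, double> s = support(dist);  // Values over which the cdf goes from 0 to 1.

      std::cout << "range:   " << r.first << " to " << r.second << std::endl;
      std::cout << "support: " << s.first << " to " << s.second << std::endl;
   }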

[note

[*Discrete Probability Distributions]

Note that the [@http://en.wikipedia.org/wiki/Discrete_probability_distribution 
discrete distributions], including the binomial, negative binomial, Poisson & Bernoulli,
are all mathematically defined as discrete functions:
that is to say the functions `cdf` and `pdf` are only defined for integral values 
of the random variate.

However, because the method of calculation often uses continuous functions
it is convenient to treat them as if they were continuous functions,
and permit non-integral values of their parameters.

Users wanting to enforce a strict mathematical model may use `floor` 
or `ceil` functions on the random variate prior to calling the distribution 
function.

The quantile functions for these distributions are hard to specify
in a manner that will satisfy everyone all of the time.  The default
behaviour is to return an integer result, that has been rounded 
/outwards/: that is to say, lower quantiles - where the probability
is less than 0.5 - are rounded down, while upper quantiles - where
the probability is greater than 0.5 - are rounded up.  This behaviour
ensures that if an X% quantile is requested, then /at least/ the requested
coverage will be present in the central region, and /no more than/
the requested coverage will be present in the tails (a short sketch
illustrating this follows the note).

This behaviour can be changed so that the quantile functions are rounded
differently, or return a real-valued result using 
[link math_toolkit.policy.pol_overview Policies].  It is strongly
recommended that you read the tutorial 
[link math_toolkit.policy.pol_tutorial.understand_dis_quant
Understanding Quantiles of Discrete Distributions] before
using the quantile function on a discrete distribution.  The
[link math_toolkit.policy.pol_ref.discrete_quant_ref reference docs] 
describe how to change the rounding policy
for these distributions.

For similar reasons, continuous distributions with parameters like 
"degrees of freedom", which might appear to be integral, are treated as real values
(and are promoted from integer to floating-point if necessary).
In this case, however, there are a small number of situations where non-integral
degrees of freedom do have a genuine meaning.
]
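
To illustrate the points in the note above, here is a hedged sketch using the
binomial distribution (the comments restate the documented default policy; the
exact numbers printed depend on the parameters chosen):

   #include <boost/math/distributions/binomial.hpp>
   #include <cmath>
   #include <iostream>

   int main()
   {
      using namespace boost::math;

      binomial b(100, 0.1);   // 100 trials, success fraction 0.1.

      // With the default policy, discrete quantiles are rounded outwards:
      // lower quantiles (p < 0.5) down, upper quantiles (p > 0.5) up.
      std::cout << "lower 5% quantile:  " << quantile(b, 0.05) << std::endl;
      std::cout << "upper 95% quantile: " << quantile(b, 0.95) << std::endl;

      // To enforce a strictly discrete model, round the random variate
      // yourself before calling pdf or cdf:
      double k = 4.7;
      std::cout << "cdf at floor(4.7):  " << cdf(b, std::floor(k)) << std::endl;
   }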

[#complements]
[h4 Complements are supported too]

Often you don't want the value of the CDF, but its complement, which is
to say `1-p` rather than `p`.  You could calculate the CDF and subtract
it from `1`, but if `p` is very close to `1` then cancellation error
will cause you to lose significant digits.  In extreme cases, `p` may
actually be equal to `1`, even though the true value of the complement is non-zero.

[link why_complements See also ['"Why complements?"]]

In this library, whenever you want to receive a complement, just wrap
all the function arguments in a call to `complement(...)`, for example:

   students_t dist(5);
   cout << "CDF at t = 1 is " << cdf(dist, 1.0) << endl;
   cout << "Complement of CDF at t = 1 is " << cdf(complement(dist, 1.0)) << endl;

But wait, now that we have a complement, we have to be able to use it as well.
Any function that accepts a probability as an argument can also accept a complement
by wrapping all of its arguments in a call to `complement(...)`, for example:

   students_t dist(5);
   
   for(double i = 10; i < 1e10; i *= 10)
   {
      // Calculate the quantile for a 1 in i chance:
      double t = quantile(complement(dist, 1/i));
      // Print it out:
      cout << "Quantile of students-t with 5 degrees of freedom\n"
              "for a 1 in " << i << " chance is " << t << endl;
   }
 
[tip

[*Critical values are just quantiles]

Some texts talk about quantiles, others about critical values, the basic rule is:

['Lower critical values] are the same as the quantile.

['Upper critical values] are the same as the quantile from the complement
of the probability.

For example, suppose we have a Bernoulli process, giving rise to a binomial
distribution with success ratio 0.1 and 100 trials in total.  The 
['lower critical value] for a probability of 0.05 is given by:

`quantile(binomial(100, 0.1), 0.05)`

and the ['upper critical value] is given by:

`quantile(complement(binomial(100, 0.1), 0.05))`

which return 4.82 and 14.63 respectively.
]

[#why_complements]
[tip

[*Why bother with complements anyway?]

It's very tempting to dispense with complements, and simply subtract
the probability from 1 when required.  However, consider what happens when
the probability is very close to 1: let's say the probability expressed at
float precision is `0.999999940f`, then `1 - 0.999999940f = 5.96046448e-008`,
but the result is actually accurate to just ['one single bit]: the only
bit that didn't cancel out!

Or to look at this another way: consider that we want the risk of falsely
rejecting the null-hypothesis in the Student's t test to be 1 in 1 billion,
for a sample size of 10,000.
This gives a probability of 1 - 10[super -9], which is exactly 1 when 
calculated at float precision.  In this case calculating the quantile from
the complement neatly solves the problem, so for example:

`quantile(complement(students_t(10000), 1e-9))`

returns the expected t-statistic `6.00336`, whereas:

`quantile(students_t(10000), 1-1e-9f)`

raises an overflow error, since it is the same as:

`quantile(students_t(10000), 1)`

which has no finite result.
]

[h4 Parameters can be calculated]

Sometimes it's the parameters that define the distribution that you 
need to find.  Suppose, for example, you have conducted a Students-t test
for equal means and the result is borderline.  Maybe your two samples
differ from each other, or maybe they don't; based on the result
of the test you can't be sure.  A legitimate question to ask then is
"How many more measurements would I have to take before I would get
an X% probability that the difference is real?"  Parameter finders
can answer questions like this, and are necessarily different for
each distribution.  They are implemented as static member functions
of the distributions, for example:

   students_t::find_degrees_of_freedom(
      1.3,        // difference from true mean to detect
      0.05,       // maximum risk of falsely rejecting the null-hypothesis.
      0.1,        // maximum risk of falsely failing to reject the null-hypothesis.
      0.13);      // sample standard deviation
      
This returns the number of degrees of freedom required to obtain a 95%
probability that the observed difference in means is not down to
chance alone.  In the case that a borderline Students-t test result
was previously obtained, this can be used to estimate how large the sample size
would have to become before the observed difference was considered
significant.  It assumes, of course, that the sample mean and standard
deviation are invariant with sample size.
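
A hedged sketch of how the call might be used in practice (the values are the
ones from the fragment above, and the result is rounded up simply because a
fractional number of degrees of freedom is not wanted here):

   #include <boost/math/distributions/students_t.hpp>
   #include <cmath>
   #include <iostream>

   int main()
   {
      using boost::math::students_t;

      double df = students_t::find_degrees_of_freedom(
         1.3,     // difference from true mean to detect
         0.05,    // maximum risk of falsely rejecting the null-hypothesis.
         0.1,     // maximum risk of falsely failing to reject the null-hypothesis.
         0.13);   // sample standard deviation

      std::cout << "Degrees of freedom required: " << std::ceil(df) << std::endl;
   }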
   
[h4 Summary]

* Distributions are objects, which are constructed from whatever
parameters the distribution may have.
* Member functions allow you to retrieve the parameters of a distribution.
* Generic non-member functions provide access to the properties that
are common to all the distributions (PDF, CDF, quantile etc).
* Complements of probabilities are calculated by wrapping the function's
arguments in a call to `complement(...)`.
* Functions that accept a probability can accept a complement of the
probability as well, by wrapping the function's
arguments in a call to `complement(...)`.
* Static member functions allow the parameters of a distribution
to be found from other information.

Now that you have the basics, the next section looks at some worked examples.

[endsect] [/section:overview Overview]

[section:weg Worked Examples]
[include distributions/distribution_construction.qbk]
[include distributions/students_t_examples.qbk]
[include distributions/chi_squared_examples.qbk]
[include distributions/f_dist_example.qbk]
[include distributions/binomial_example.qbk]
[include distributions/negative_binomial_example.qbk]
[include distributions/normal_example.qbk]
[include distributions/error_handling_example.qbk]
[include distributions/find_location_and_scale.qbk]
[include distributions/nag_library.qbk]
[endsect] [/section:weg Worked Examples]

[include background.qbk]

[endsect] [/ section:stat_tut Statistical Distributions Tutorial]

[/ dist_tutorial.qbk
  Copyright 2006 John Maddock and Paul A. Bristow.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]