1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827
|
@cindex statistics
@cindex mean
@cindex standard deviation
@cindex variance
@cindex estimated standard deviation
@cindex estimated variance
@cindex t-test
@cindex range
@cindex min
@cindex max
This chapter describes the statistical functions in the library. The
basic statistical functions include routines to compute the mean,
variance and standard deviation. More advanced functions allow you to
calculate absolute deviations, skewness, and kurtosis as well as the
median and arbitrary percentiles. The algorithms use recurrence
relations to compute average quantities in a stable way, without large
intermediate values that might overflow.
The functions are available in versions for datasets in the standard
floating-point and integer types. The versions for double precision
floating-point data have the prefix @code{gsl_stats} and are declared in
the header file @file{gsl_statistics_double.h}. The versions for integer
data have the prefix @code{gsl_stats_int} and are declared in the header
file @file{gsl_statistics_int.h}. All the functions operate on C
arrays with a @var{stride} parameter specifying the spacing between
elements.
@menu
* Mean and standard deviation and variance::
* Absolute deviation::
* Higher moments (skewness and kurtosis)::
* Autocorrelation::
* Covariance::
* Correlation::
* Weighted Samples::
* Maximum and Minimum values::
* Median and Percentiles::
* Example statistical programs::
* Statistics References and Further Reading::
@end menu
@node Mean and standard deviation and variance
@section Mean, Standard Deviation and Variance
@deftypefun double gsl_stats_mean (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the arithmetic mean of @var{data}, a dataset of
length @var{n} with stride @var{stride}. The arithmetic mean, or
@dfn{sample mean}, is denoted by @math{\Hat\mu} and defined as,
@tex
\beforedisplay
$$
{\Hat\mu} = {1 \over N} \sum x_i
$$
\afterdisplay
@end tex
@ifinfo
@example
\Hat\mu = (1/N) \sum x_i
@end example
@end ifinfo
@noindent
where @math{x_i} are the elements of the dataset @var{data}. For
samples drawn from a gaussian distribution the variance of
@math{\Hat\mu} is @math{\sigma^2 / N}.
@end deftypefun
@deftypefun double gsl_stats_variance (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the estimated, or @dfn{sample}, variance of
@var{data}, a dataset of length @var{n} with stride @var{stride}. The
estimated variance is denoted by @math{\Hat\sigma^2} and is defined by,
@tex
\beforedisplay
$$
{\Hat\sigma}^2 = {1 \over (N-1)} \sum (x_i - {\Hat\mu})^2
$$
\afterdisplay
@end tex
@ifinfo
@example
\Hat\sigma^2 = (1/(N-1)) \sum (x_i - \Hat\mu)^2
@end example
@end ifinfo
@noindent
where @math{x_i} are the elements of the dataset @var{data}. Note that
the normalization factor of @math{1/(N-1)} results from the derivation
of @math{\Hat\sigma^2} as an unbiased estimator of the population
variance @math{\sigma^2}. For samples drawn from a Gaussian distribution
the variance of @math{\Hat\sigma^2} itself is @math{2 \sigma^4 / N}.
This function computes the mean via a call to @code{gsl_stats_mean}. If
you have already computed the mean then you can pass it directly to
@code{gsl_stats_variance_m}.
@end deftypefun
@deftypefun double gsl_stats_variance_m (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean})
This function returns the sample variance of @var{data} relative to the
given value of @var{mean}. The function is computed with @math{\Hat\mu}
replaced by the value of @var{mean} that you supply,
@tex
\beforedisplay
$$
{\Hat\sigma}^2 = {1 \over (N-1)} \sum (x_i - mean)^2
$$
\afterdisplay
@end tex
@ifinfo
@example
\Hat\sigma^2 = (1/(N-1)) \sum (x_i - mean)^2
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_sd (const double @var{data}[], size_t @var{stride}, size_t @var{n})
@deftypefunx double gsl_stats_sd_m (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean})
The standard deviation is defined as the square root of the variance.
These functions return the square root of the corresponding variance
functions above.
@end deftypefun
@deftypefun double gsl_stats_tss (const double @var{data}[], size_t @var{stride}, size_t @var{n})
@deftypefunx double gsl_stats_tss_m (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean})
These functions return the total sum of squares (TSS) of @var{data} about
the mean. For @code{gsl_stats_tss_m} the user-supplied value of
@var{mean} is used, and for @code{gsl_stats_tss} it is computed using
@code{gsl_stats_mean}.
@tex
\beforedisplay
$$
{\rm TSS} = \sum (x_i - mean)^2
$$
\afterdisplay
@end tex
@ifinfo
@example
TSS = \sum (x_i - mean)^2
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_variance_with_fixed_mean (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean})
This function computes an unbiased estimate of the variance of
@var{data} when the population mean @var{mean} of the underlying
distribution is known @emph{a priori}. In this case the estimator for
the variance uses the factor @math{1/N} and the sample mean
@math{\Hat\mu} is replaced by the known population mean @math{\mu},
@tex
\beforedisplay
$$
{\Hat\sigma}^2 = {1 \over N} \sum (x_i - \mu)^2
$$
\afterdisplay
@end tex
@ifinfo
@example
\Hat\sigma^2 = (1/N) \sum (x_i - \mu)^2
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_sd_with_fixed_mean (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean})
This function calculates the standard deviation of @var{data} for a
fixed population mean @var{mean}. The result is the square root of the
corresponding variance function.
@end deftypefun
@node Absolute deviation
@section Absolute deviation
@deftypefun double gsl_stats_absdev (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function computes the absolute deviation from the mean of
@var{data}, a dataset of length @var{n} with stride @var{stride}. The
absolute deviation from the mean is defined as,
@tex
\beforedisplay
$$
absdev = {1 \over N} \sum |x_i - {\Hat\mu}|
$$
\afterdisplay
@end tex
@ifinfo
@example
absdev = (1/N) \sum |x_i - \Hat\mu|
@end example
@end ifinfo
@noindent
where @math{x_i} are the elements of the dataset @var{data}. The
absolute deviation from the mean provides a more robust measure of the
width of a distribution than the variance. This function computes the
mean of @var{data} via a call to @code{gsl_stats_mean}.
@end deftypefun
@deftypefun double gsl_stats_absdev_m (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean})
This function computes the absolute deviation of the dataset @var{data}
relative to the given value of @var{mean},
@tex
\beforedisplay
$$
absdev = {1 \over N} \sum |x_i - mean|
$$
\afterdisplay
@end tex
@ifinfo
@example
absdev = (1/N) \sum |x_i - mean|
@end example
@end ifinfo
@noindent
This function is useful if you have already computed the mean of
@var{data} (and want to avoid recomputing it), or wish to calculate the
absolute deviation relative to another value (such as zero, or the
median).
@end deftypefun
@node Higher moments (skewness and kurtosis)
@section Higher moments (skewness and kurtosis)
@cindex skewness
@cindex kurtosis
@deftypefun double gsl_stats_skew (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function computes the skewness of @var{data}, a dataset of length
@var{n} with stride @var{stride}. The skewness is defined as,
@tex
\beforedisplay
$$
skew = {1 \over N} \sum
{\left( x_i - {\Hat\mu} \over {\Hat\sigma} \right)}^3
$$
\afterdisplay
@end tex
@ifinfo
@example
skew = (1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^3
@end example
@end ifinfo
@noindent
where @math{x_i} are the elements of the dataset @var{data}. The skewness
measures the asymmetry of the tails of a distribution.
The function computes the mean and estimated standard deviation of
@var{data} via calls to @code{gsl_stats_mean} and @code{gsl_stats_sd}.
@end deftypefun
@deftypefun double gsl_stats_skew_m_sd (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean}, double @var{sd})
This function computes the skewness of the dataset @var{data} using the
given values of the mean @var{mean} and standard deviation @var{sd},
@tex
\beforedisplay
$$
skew = {1 \over N}
\sum {\left( x_i - mean \over sd \right)}^3
$$
\afterdisplay
@end tex
@ifinfo
@example
skew = (1/N) \sum ((x_i - mean)/sd)^3
@end example
@end ifinfo
@noindent
These functions are useful if you have already computed the mean and
standard deviation of @var{data} and want to avoid recomputing them.
@end deftypefun
@deftypefun double gsl_stats_kurtosis (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function computes the kurtosis of @var{data}, a dataset of length
@var{n} with stride @var{stride}. The kurtosis is defined as,
@tex
\beforedisplay
$$
kurtosis = \left( {1 \over N} \sum
{\left(x_i - {\Hat\mu} \over {\Hat\sigma} \right)}^4
\right)
- 3
$$
\afterdisplay
@end tex
@ifinfo
@example
kurtosis = ((1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^4) - 3
@end example
@end ifinfo
@noindent
The kurtosis measures how sharply peaked a distribution is, relative to
its width. The kurtosis is normalized to zero for a Gaussian
distribution.
@end deftypefun
@deftypefun double gsl_stats_kurtosis_m_sd (const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{mean}, double @var{sd})
This function computes the kurtosis of the dataset @var{data} using the
given values of the mean @var{mean} and standard deviation @var{sd},
@tex
\beforedisplay
$$
kurtosis = {1 \over N}
\left( \sum {\left(x_i - mean \over sd \right)}^4 \right)
- 3
$$
\afterdisplay
@end tex
@ifinfo
@example
kurtosis = ((1/N) \sum ((x_i - mean)/sd)^4) - 3
@end example
@end ifinfo
@noindent
This function is useful if you have already computed the mean and
standard deviation of @var{data} and want to avoid recomputing them.
@end deftypefun
@node Autocorrelation
@section Autocorrelation
@deftypefun double gsl_stats_lag1_autocorrelation (const double @var{data}[], const size_t @var{stride}, const size_t @var{n})
This function computes the lag-1 autocorrelation of the dataset @var{data}.
@tex
\beforedisplay
$$
a_1 = {\sum_{i = 2}^{n} (x_{i} - \Hat\mu) (x_{i-1} - \Hat\mu)
\over
\sum_{i = 1}^{n} (x_{i} - \Hat\mu) (x_{i} - \Hat\mu)}
$$
\afterdisplay
@end tex
@ifinfo
@example
a_1 = @{\sum_@{i = 2@}^@{n@} (x_@{i@} - \Hat\mu) (x_@{i-1@} - \Hat\mu)
\over
\sum_@{i = 1@}^@{n@} (x_@{i@} - \Hat\mu) (x_@{i@} - \Hat\mu)@}
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_lag1_autocorrelation_m (const double @var{data}[], const size_t @var{stride}, const size_t @var{n}, const double @var{mean})
This function computes the lag-1 autocorrelation of the dataset
@var{data} using the given value of the mean @var{mean}.
@end deftypefun
@node Covariance
@section Covariance
@cindex covariance, of two datasets
@deftypefun double gsl_stats_covariance (const double @var{data1}[], const size_t @var{stride1}, const double @var{data2}[], const size_t @var{stride2}, const size_t @var{n})
This function computes the covariance of the datasets @var{data1} and
@var{data2} which must both be of the same length @var{n}.
@tex
\beforedisplay
$$
covar = {1 \over (n - 1)} \sum_{i = 1}^{n} (x_{i} - \Hat x) (y_{i} - \Hat y)
$$
\afterdisplay
@end tex
@ifinfo
@example
covar = (1/(n - 1)) \sum_@{i = 1@}^@{n@} (x_i - \Hat x) (y_i - \Hat y)
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_covariance_m (const double @var{data1}[], const size_t @var{stride1}, const double @var{data2}[], const size_t @var{stride2}, const size_t @var{n}, const double @var{mean1}, const double @var{mean2})
This function computes the covariance of the datasets @var{data1} and
@var{data2} using the given values of the means, @var{mean1} and
@var{mean2}. This is useful if you have already computed the means of
@var{data1} and @var{data2} and want to avoid recomputing them.
@end deftypefun
@node Correlation
@section Correlation
@cindex correlation, of two datasets
@deftypefun double gsl_stats_correlation (const double @var{data1}[], const size_t @var{stride1}, const double @var{data2}[], const size_t @var{stride2}, const size_t @var{n})
This function efficiently computes the Pearson correlation coefficient
between the datasets @var{data1} and @var{data2} which must both be of
the same length @var{n}.
@tex
\beforedisplay
$$
r = {cov(x, y) \over \Hat\sigma_x \Hat\sigma_y} =
{{1 \over n-1} \sum (x_i - \Hat x) (y_i - \Hat y)
\over
\sqrt{{1 \over n-1} \sum (x_i - {\Hat x})^2}
\sqrt{{1 \over n-1} \sum (y_i - {\Hat y})^2}
}
$$
\afterdisplay
@end tex
@ifinfo
@example
r = cov(x, y) / (\Hat\sigma_x \Hat\sigma_y)
= @{1/(n-1) \sum (x_i - \Hat x) (y_i - \Hat y)
\over
\sqrt@{1/(n-1) \sum (x_i - \Hat x)^2@} \sqrt@{1/(n-1) \sum (y_i - \Hat y)^2@}
@}
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_spearman (const double @var{data1}[], const size_t @var{stride1}, const double @var{data2}[], const size_t @var{stride2}, const size_t @var{n}, double @var{work}[])
This function computes the Spearman rank correlation coefficient between
the datasets @var{data1} and @var{data2} which must both be of the same
length @var{n}. Additional workspace of size 2*@var{n} is required in
@var{work}. The Spearman rank correlation between vectors @math{x} and
@math{y} is equivalent to the Pearson correlation between the ranked
vectors @math{x_R} and @math{y_R}, where ranks are defined to be the
average of the positions of an element in the ascending order of the values.
@end deftypefun
@node Weighted Samples
@section Weighted Samples
The functions described in this section allow the computation of
statistics for weighted samples. The functions accept an array of
samples, @math{x_i}, with associated weights, @math{w_i}. Each sample
@math{x_i} is considered as having been drawn from a Gaussian
distribution with variance @math{\sigma_i^2}. The sample weight
@math{w_i} is defined as the reciprocal of this variance, @math{w_i =
1/\sigma_i^2}. Setting a weight to zero corresponds to removing a
sample from a dataset.
@deftypefun double gsl_stats_wmean (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the weighted mean of the dataset @var{data} with
stride @var{stride} and length @var{n}, using the set of weights @var{w}
with stride @var{wstride} and length @var{n}. The weighted mean is defined as,
@tex
\beforedisplay
$$
{\Hat\mu} = {{\sum w_i x_i} \over {\sum w_i}}
$$
\afterdisplay
@end tex
@ifinfo
@example
\Hat\mu = (\sum w_i x_i) / (\sum w_i)
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_wvariance (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the estimated variance of the dataset @var{data}
with stride @var{stride} and length @var{n}, using the set of weights
@var{w} with stride @var{wstride} and length @var{n}. The estimated
variance of a weighted dataset is calculated as,
@tex
\beforedisplay
$$
\Hat\sigma^2 = {{\sum w_i} \over {(\sum w_i)^2 - \sum (w_i^2)}}
\sum w_i (x_i - \Hat\mu)^2
$$
\afterdisplay
@end tex
@ifinfo
@example
\Hat\sigma^2 = ((\sum w_i)/((\sum w_i)^2 - \sum (w_i^2)))
\sum w_i (x_i - \Hat\mu)^2
@end example
@end ifinfo
@noindent
Note that this expression reduces to an unweighted variance with the
familiar @math{1/(N-1)} factor when there are @math{N} equal non-zero
weights.
@end deftypefun
@deftypefun double gsl_stats_wvariance_m (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{wmean})
This function returns the estimated variance of the weighted dataset
@var{data} using the given weighted mean @var{wmean}.
@end deftypefun
@deftypefun double gsl_stats_wsd (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
The standard deviation is defined as the square root of the variance.
This function returns the square root of the corresponding variance
function @code{gsl_stats_wvariance} above.
@end deftypefun
@deftypefun double gsl_stats_wsd_m (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{wmean})
This function returns the square root of the corresponding variance
function @code{gsl_stats_wvariance_m} above.
@end deftypefun
@deftypefun double gsl_stats_wvariance_with_fixed_mean (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, const double @var{mean})
This function computes an unbiased estimate of the variance of the weighted
dataset @var{data} when the population mean @var{mean} of the underlying
distribution is known @emph{a priori}. In this case the estimator for
the variance replaces the sample mean @math{\Hat\mu} by the known
population mean @math{\mu},
@tex
\beforedisplay
$$
\Hat\sigma^2 = {{\sum w_i (x_i - \mu)^2} \over {\sum w_i}}
$$
\afterdisplay
@end tex
@ifinfo
@example
\Hat\sigma^2 = (\sum w_i (x_i - \mu)^2) / (\sum w_i)
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_wsd_with_fixed_mean (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, const double @var{mean})
The standard deviation is defined as the square root of the variance.
This function returns the square root of the corresponding variance
function above.
@end deftypefun
@deftypefun double gsl_stats_wtss (const double @var{w}[], const size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
@deftypefunx double gsl_stats_wtss_m (const double @var{w}[], const size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{wmean})
These functions return the weighted total sum of squares (TSS) of
@var{data} about the weighted mean. For @code{gsl_stats_wtss_m} the
user-supplied value of @var{wmean} is used, and for @code{gsl_stats_wtss}
it is computed using @code{gsl_stats_wmean}.
@tex
\beforedisplay
$$
{\rm TSS} = \sum w_i (x_i - wmean)^2
$$
\afterdisplay
@end tex
@ifinfo
@example
TSS = \sum w_i (x_i - wmean)^2
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_wabsdev (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function computes the weighted absolute deviation from the weighted
mean of @var{data}. The absolute deviation from the mean is defined as,
@tex
\beforedisplay
$$
absdev = {{\sum w_i |x_i - \Hat\mu|} \over {\sum w_i}}
$$
\afterdisplay
@end tex
@ifinfo
@example
absdev = (\sum w_i |x_i - \Hat\mu|) / (\sum w_i)
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_wabsdev_m (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{wmean})
This function computes the absolute deviation of the weighted dataset
@var{data} about the given weighted mean @var{wmean}.
@end deftypefun
@deftypefun double gsl_stats_wskew (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function computes the weighted skewness of the dataset @var{data}.
@tex
\beforedisplay
$$
skew = {{\sum w_i ((x_i - {\Hat x})/{\Hat \sigma})^3} \over {\sum w_i}}
$$
\afterdisplay
@end tex
@ifinfo
@example
skew = (\sum w_i ((x_i - \Hat x)/\Hat \sigma)^3) / (\sum w_i)
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_wskew_m_sd (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{wmean}, double @var{wsd})
This function computes the weighted skewness of the dataset @var{data}
using the given values of the weighted mean and weighted standard
deviation, @var{wmean} and @var{wsd}.
@end deftypefun
@deftypefun double gsl_stats_wkurtosis (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function computes the weighted kurtosis of the dataset @var{data}.
@tex
\beforedisplay
$$
kurtosis = {{\sum w_i ((x_i - {\Hat x})/{\Hat \sigma})^4} \over {\sum w_i}} - 3
$$
\afterdisplay
@end tex
@ifinfo
@example
kurtosis = ((\sum w_i ((x_i - \Hat x)/\Hat \sigma)^4) / (\sum w_i)) - 3
@end example
@end ifinfo
@end deftypefun
@deftypefun double gsl_stats_wkurtosis_m_sd (const double @var{w}[], size_t @var{wstride}, const double @var{data}[], size_t @var{stride}, size_t @var{n}, double @var{wmean}, double @var{wsd})
This function computes the weighted kurtosis of the dataset @var{data}
using the given values of the weighted mean and weighted standard
deviation, @var{wmean} and @var{wsd}.
@end deftypefun
@node Maximum and Minimum values
@section Maximum and Minimum values
The following functions find the maximum and minimum values of a
dataset (or their indices). If the data contains @code{NaN}s then a
@code{NaN} will be returned, since the maximum or minimum value is
undefined. For functions which return an index, the location of the
first @code{NaN} in the array is returned.
@deftypefun double gsl_stats_max (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the maximum value in @var{data}, a dataset of
length @var{n} with stride @var{stride}. The maximum value is defined
as the value of the element @math{x_i} which satisfies @c{$x_i \ge x_j$}
@math{x_i >= x_j} for all @math{j}.
If you want instead to find the element with the largest absolute
magnitude you will need to apply @code{fabs} or @code{abs} to your data
before calling this function.
@end deftypefun
@deftypefun double gsl_stats_min (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the minimum value in @var{data}, a dataset of
length @var{n} with stride @var{stride}. The minimum value is defined
as the value of the element @math{x_i} which satisfies @c{$x_i \le x_j$}
@math{x_i <= x_j} for all @math{j}.
If you want instead to find the element with the smallest absolute
magnitude you will need to apply @code{fabs} or @code{abs} to your data
before calling this function.
@end deftypefun
@deftypefun void gsl_stats_minmax (double * @var{min}, double * @var{max}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function finds both the minimum and maximum values @var{min},
@var{max} in @var{data} in a single pass.
@end deftypefun
@deftypefun size_t gsl_stats_max_index (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the index of the maximum value in @var{data}, a
dataset of length @var{n} with stride @var{stride}. The maximum value is
defined as the value of the element @math{x_i} which satisfies
@c{$x_i \ge x_j$}
@math{x_i >= x_j} for all @math{j}. When there are several equal maximum
elements then the first one is chosen.
@end deftypefun
@deftypefun size_t gsl_stats_min_index (const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the index of the minimum value in @var{data}, a
dataset of length @var{n} with stride @var{stride}. The minimum value
is defined as the value of the element @math{x_i} which satisfies
@c{$x_i \ge x_j$}
@math{x_i >= x_j} for all @math{j}. When there are several equal
minimum elements then the first one is chosen.
@end deftypefun
@deftypefun void gsl_stats_minmax_index (size_t * @var{min_index}, size_t * @var{max_index}, const double @var{data}[], size_t @var{stride}, size_t @var{n})
This function returns the indexes @var{min_index}, @var{max_index} of
the minimum and maximum values in @var{data} in a single pass.
@end deftypefun
@node Median and Percentiles
@section Median and Percentiles
The median and percentile functions described in this section operate on
sorted data. For convenience we use @dfn{quantiles}, measured on a scale
of 0 to 1, instead of percentiles (which use a scale of 0 to 100).
@deftypefun double gsl_stats_median_from_sorted_data (const double @var{sorted_data}[], size_t @var{stride}, size_t @var{n})
This function returns the median value of @var{sorted_data}, a dataset
of length @var{n} with stride @var{stride}. The elements of the array
must be in ascending numerical order. There are no checks to see
whether the data are sorted, so the function @code{gsl_sort} should
always be used first.
When the dataset has an odd number of elements the median is the value
of element @math{(n-1)/2}. When the dataset has an even number of
elements the median is the mean of the two nearest middle values,
elements @math{(n-1)/2} and @math{n/2}. Since the algorithm for
computing the median involves interpolation this function always returns
a floating-point number, even for integer data types.
@end deftypefun
@deftypefun double gsl_stats_quantile_from_sorted_data (const double @var{sorted_data}[], size_t @var{stride}, size_t @var{n}, double @var{f})
This function returns a quantile value of @var{sorted_data}, a
double-precision array of length @var{n} with stride @var{stride}. The
elements of the array must be in ascending numerical order. The
quantile is determined by the @var{f}, a fraction between 0 and 1. For
example, to compute the value of the 75th percentile @var{f} should have
the value 0.75.
There are no checks to see whether the data are sorted, so the function
@code{gsl_sort} should always be used first.
The quantile is found by interpolation, using the formula
@tex
\beforedisplay
$$
\hbox{quantile} = (1 - \delta) x_i + \delta x_{i+1}
$$
\afterdisplay
@end tex
@ifinfo
@example
quantile = (1 - \delta) x_i + \delta x_@{i+1@}
@end example
@end ifinfo
@noindent
where @math{i} is @code{floor}(@math{(n - 1)f}) and @math{\delta} is
@math{(n-1)f - i}.
Thus the minimum value of the array (@code{data[0*stride]}) is given by
@var{f} equal to zero, the maximum value (@code{data[(n-1)*stride]}) is
given by @var{f} equal to one and the median value is given by @var{f}
equal to 0.5. Since the algorithm for computing quantiles involves
interpolation this function always returns a floating-point number, even
for integer data types.
@end deftypefun
@comment @node Statistical tests
@comment @section Statistical tests
@comment FIXME, do more work on the statistical tests
@comment -@deftypefun double gsl_stats_ttest (const double @var{data1}[], double @var{data2}[], size_t @var{n1}, size_t @var{n2})
@comment -@deftypefunx Statistics double gsl_stats_int_ttest (const double @var{data1}[], double @var{data2}[], size_t @var{n1}, size_t @var{n2})
@comment The function @code{gsl_stats_ttest} computes the t-test statistic for
@comment the two arrays @var{data1}[] and @var{data2}[], of lengths @var{n1} and
@comment -@var{n2} respectively.
@comment The t-test statistic measures the difference between the means of two
@comment datasets.
@node Example statistical programs
@section Examples
Here is a basic example of how to use the statistical functions:
@example
@verbatiminclude examples/stat.c
@end example
The program should produce the following output,
@example
@verbatiminclude examples/stat.txt
@end example
Here is an example using sorted data,
@example
@verbatiminclude examples/statsort.c
@end example
This program should produce the following output,
@example
@verbatiminclude examples/statsort.txt
@end example
@node Statistics References and Further Reading
@section References and Further Reading
The standard reference for almost any topic in statistics is the
multi-volume @cite{Advanced Theory of Statistics} by Kendall and Stuart.
@itemize @w{}
@item
Maurice Kendall, Alan Stuart, and J. Keith Ord.
@cite{The Advanced Theory of Statistics} (multiple volumes)
reprinted as @cite{Kendall's Advanced Theory of Statistics}.
Wiley, ISBN 047023380X.
@end itemize
@noindent
Many statistical concepts can be more easily understood by a Bayesian
approach. The following book by Gelman, Carlin, Stern and Rubin gives a
comprehensive coverage of the subject.
@itemize @w{}
@item
Andrew Gelman, John B. Carlin, Hal S. Stern, Donald B. Rubin.
@cite{Bayesian Data Analysis}.
Chapman & Hall, ISBN 0412039915.
@end itemize
@noindent
For physicists the Particle Data Group provides useful reviews of
Probability and Statistics in the ``Mathematical Tools'' section of its
Annual Review of Particle Physics.
@itemize @w{}
@item
@cite{Review of Particle Properties}
R.M. Barnett et al., Physical Review D54, 1 (1996)
@end itemize
@noindent
The Review of Particle Physics is available online at
the website @uref{http://pdg.lbl.gov/}.
|