#!/usr/bin/env python
# coding: utf-8
# DO NOT EDIT
# Autogenerated from the notebook stats_rankcompare.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT
# # Rank comparison: two independent samples
#
#
# Statsmodels provides statistics and tests for the probability that x1
# has larger values than x2. These measures are based on ordinal
# comparisons using ranks.
#
# Define p as the probability that a random draw from the population of
# the first sample has a larger value than a random draw from the population
# of the second sample, specifically
#
# p = P(x1 > x2) + 0.5 * P(x1 = x2)
#
# This is the measure underlying the Wilcoxon-Mann-Whitney U test, the
# Fligner-Policello test and the Brunner-Munzel test. Inference is based
# on the asymptotic distribution of the Brunner-Munzel test. The half
# probability for ties corresponds to the use of midranks and makes the
# measure valid for discrete variables.
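# As a numerical sketch of this definition, p can be estimated by
# comparing all pairs of observations (the helper below is hypothetical,
# not part of statsmodels):

```python
import numpy as np

def prob_superior(x1, x2):
    # estimate p = P(x1 > x2) + 0.5 * P(x1 == x2) from all pairwise
    # comparisons; ties count one half, matching the use of midranks
    diff = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

prob_superior([1, 2, 2, 3], [1, 1, 2])  # 0.75 (7 wins + 4 ties of 12 pairs)
```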
#
# The null hypothesis of stochastic equality is p = 0.5, which
# corresponds to the Brunner-Munzel test.
#
# This notebook provides a brief overview of the statistics provided in
# statsmodels.
import numpy as np
from statsmodels.stats.nonparametric import (rank_compare_2indep,
rank_compare_2ordinal,
prob_larger_continuous,
cohensd2problarger)
# ## Example
#
# The main function is `rank_compare_2indep` which computes the Brunner-
# Munzel test and returns a `RankCompareResult` instance with additional
# methods.
#
# The data for the example are taken from Munzel and Hauschke (2003) and
# are given as frequency counts. We need to expand them to arrays of
# observations to be able to use them with `rank_compare_2indep`. See
# below for a function that directly takes frequency counts. The labels
# or levels are treated as ordinal; the specific values are irrelevant as
# long as they define an order (`>`, `=`).
levels = [-2, -1, 0, 1, 2]
new = [24, 37, 21, 19, 6]
active = [11, 51, 22, 21, 7]
x1 = np.repeat(levels, new)
x2 = np.repeat(levels, active)
np.bincount(x1 + 2), np.bincount(x2 + 2)
res = rank_compare_2indep(x1, x2)  # use_t=False would use the normal instead of the t distribution
res
# The methods of the results instance are
[i for i in dir(res) if not i.startswith("_") and callable(getattr(res, i))]
print(res.summary()) # returns SimpleTable
ci = res.conf_int()
ci
res.test_prob_superior()
# ## One-sided tests, superiority and noninferiority tests
#
# The hypothesis test functions have an `alternative` keyword to specify
# one-sided tests and a `value` keyword to specify a nonzero or
# nonequality null value. Both keywords together can be used for
# noninferiority or superiority tests with a margin.
#
# A noninferiority test specifies a margin and an alternative so that we
# can test the hypothesis that a new treatment is almost as good as or
# better than a reference treatment.
#
# The general one-sided hypothesis is
#
# H0: p = p0
# versus
# HA: p > p0 alternative is that effect is larger than specified null
# value p0
# HA: p < p0 alternative is that effect is smaller than specified null
# value p0
#
# Note: The null hypothesis of a one-sided test is often specified as a
# weak inequality. However, p-values are usually derived assuming a null
# hypothesis at the boundary value. The boundary value is in most cases the
# worst case for the p-value, points in the interior of the null hypothesis
# usually have larger p-values, and the test is conservative in those cases.
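# The one-sided hypotheses above can be illustrated with a normal
# approximation; the helper below is a hypothetical sketch, not the
# statsmodels implementation, which is based on the Brunner-Munzel
# statistic:

```python
import math

def one_sided_pvalue(stat, alternative):
    # p-value of a z-type statistic under the boundary null value p0,
    # where stat = (estimate - p0) / standard_error
    sf = 0.5 * math.erfc(stat / math.sqrt(2))  # P(Z > stat)
    return sf if alternative == "larger" else 1.0 - sf

one_sided_pvalue(2.0, "larger")   # ~0.023, reject at alpha = 0.05
one_sided_pvalue(2.0, "smaller")  # ~0.977, complementary alternative
```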
#
# Most two-sample hypothesis tests in statsmodels allow for a "nonzero"
# null value, that is, a null value that does not require that the
# effects in the two samples are the same. Some hypothesis tests, like
# Fisher's exact test for contingency tables and most k-sample anova-type
# tests, only allow a null hypothesis that the effect is the same in all
# samples.
#
# The null value p0 for the hypothesis that there is no difference
# between samples depends on the effect statistic. The null value for a
# difference and for a correlation is zero, the null value for a ratio is
# one, and the null value for the stochastic superiority probability is
# 0.5.
#
# Noninferiority and superiority tests are just special cases of one-sided
# hypothesis tests that allow nonzero null hypotheses. TOST equivalence
# tests are a combination of two one-sided tests with nonzero null values.
#
# Note that we are now using "superior" with two different meanings.
# Superiority and noninferiority in the tests refer to whether the
# treatment effect is better than the control. The effect measure in
# `rank_compare` is the probability based on stochastic superiority of
# sample 1 compared to sample 2, but stochastic superiority can be either
# good or bad.
#
#
# **Noninferiority: Smaller values are better**
#
# If having lower values is better, for example if the variable of
# interest is disease occurrence, then a noninferiority test specifies a
# threshold larger than equality with the alternative that the parameter
# is less than the threshold.
#
# In the case of the stochastic superiority measure, equality of the two
# samples implies a probability equal to 0.5. If we specify a
# noninferiority margin of, say, 0.6, then the null and alternative
# hypotheses are
#
# H0: p >= 0.6 (null for inference based on H0: p = 0.6)
# HA: p < 0.6
#
# If we reject the null hypothesis, then our data supports that the
# treatment is noninferior to the control.
#
#
# **Noninferiority: Larger values are better**
#
# In cases where having larger values is better, e.g. skill or health
# measures, noninferiority means that the treatment has values almost as
# high as or higher than the control. This defines the alternative
# hypothesis. Assuming p0 is the noninferiority threshold, e.g. p0 = 0.4,
# the null and alternative hypotheses are
#
# H0: p <= p0 (p0 < 0.5)
# HA: p > p0
#
# If the null hypothesis is rejected, then we have evidence that the
# treatment (sample 1) is noninferior to the reference (sample 2), that
# is, the superiority probability is larger than p0.
#
#
# **Superiority tests**
#
# Superiority tests are usually defined as a one-sided alternative with
# equality as the null and the better inequality as the alternative.
# However, superiority can also be defined with a margin, in which case
# the treatment has to be better by a non-negligible amount specified by
# the margin.
#
# This means the test is the same one-sided test as above with p0 either
# equal to 0.5, or p0 above 0.5 if larger is better and below 0.5 if
# smaller is better.
#
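# A minimal numeric sketch of a superiority test with and without a
# margin, assuming a normal approximation (hypothetical helper, not the
# statsmodels implementation):

```python
import math

def superiority_pvalue(p_hat, se, p0=0.5):
    # one-sided test of H0: p = p0 against HA: p > p0; p0 = 0.5 gives a
    # plain superiority test, p0 > 0.5 a superiority test with a margin
    z = (p_hat - p0) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)

superiority_pvalue(0.62, 0.04)           # p0 = 0.5, z = 3
superiority_pvalue(0.62, 0.04, p0=0.55)  # margin 0.05, z = 1.75
```

# The margin raises the bar: the same estimate yields a larger p-value
# against the shifted null value.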
# **Example: noninferiority smaller is better**
#
# Suppose our noninferiority threshold is p0 = 0.55. The one-sided test
# with alternative "smaller" has a p-value of around 0.0065, and we
# reject the null hypothesis at an alpha of 0.05. The data provide
# evidence that the treatment (sample 1) is noninferior to the control
# (sample 2), that is, we have evidence that the treatment is at most 5
# percentage points worse than the control.
#
res.test_prob_superior(value=0.55, alternative="smaller")
# **Example: noninferiority larger is better**
#
# Now consider the case when having larger values is better and the
# noninferiority threshold is 0.45. The one-sided test has a p-value of
# 0.44, and we cannot reject the null hypothesis. Therefore, we do not
# have evidence that the treatment is at most 5 percentage points worse
# than the control.
res.test_prob_superior(value=0.45, alternative="larger")
# ## Equivalence Test TOST
#
# In an equivalence test we want to show that the treatment has
# approximately the same effect as the control, or that two treatments
# have approximately the same effect. Equivalence is defined by a lower
# and an upper equivalence margin (`low` and `upp`). If the effect
# measure is inside this interval, then the treatments are equivalent.
#
# The null and alternative hypotheses are
#
# H0: p <= p_low or p >= p_upp
# HA: p_low < p < p_upp
#
# In this case the null hypothesis is that the effect measure is outside
# of the equivalence interval. If we reject the null hypothesis, then the
# data provides evidence that treatment and control are equivalent.
#
# In the TOST procedure we evaluate two one-sided tests, one for the null
# hypothesis that the effect is equal to or below the lower threshold and
# one for the null hypothesis that the effect is equal to or above the
# upper threshold. If we reject both tests, then the data provide
# evidence that the effect is inside the equivalence interval. The
# p-value of the `tost_prob_superior` method is the maximum of the
# p-values of the two tests.
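# The "maximum of two one-sided p-values" rule can be sketched with a
# normal approximation (hypothetical helper, not the statsmodels
# implementation):

```python
import math

def norm_sf(z):
    # survival function P(Z > z) of the standard normal
    return 0.5 * math.erfc(z / math.sqrt(2))

def tost_pvalue(est, se, low, upp):
    # reject H0 (effect outside [low, upp]) only if both one-sided tests
    # reject; the combined TOST p-value is the maximum of the two
    p_low = norm_sf((est - low) / se)        # H0: effect <= low
    p_upp = 1.0 - norm_sf((est - upp) / se)  # H0: effect >= upp
    return max(p_low, p_upp)

tost_pvalue(0.5, 0.02, 0.45, 0.55)  # ~0.006, equivalence supported
```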
#
# Suppose our equivalence margins are 0.45 and 0.55; then the p-value of
# the equivalence test is 0.43, and we cannot reject the null hypothesis
# that the two treatments are not equivalent, i.e. the data do not
# provide support for equivalence.
res.tost_prob_superior(low=0.45, upp=0.55)
res.test_prob_superior(value=0.55, alternative="smaller")
res.test_prob_superior(value=0.6, alternative="larger")
res.test_prob_superior(value=0.529937, alternative="smaller")
res.test_prob_superior(value=0.529937017, alternative="smaller")
res.conf_int()
# ### Aside: Burden of proof in hypothesis testing
#
# Hypothesis tests have in general the following two properties
#
# - small samples favor the null hypothesis. Uncertainty in small samples
# is large and the hypothesis test has little power. The study in this case
# is underpowered and we often cannot reject the null because of large
# uncertainty. Non-rejection cannot be taken as evidence in favor of the
# null because the hypothesis test does not have much power to reject the
# null.
# - large samples favor the alternative hypothesis. In large samples
# uncertainty becomes small, and even small deviations from the null
# hypothesis will lead to rejection. In this case we can have a test
# result that shows an effect that is statistically significant but
# substantively irrelevant.
#
# Noninferiority and equivalence tests solve both problems. The first
# problem, favoring the null in small samples, can be avoided by
# reversing the null and alternative hypotheses. The alternative
# hypothesis should be the one we want to show, so the test does not
# support the hypothesis of interest merely because the power is too
# small. The second problem, favoring the alternative in large samples,
# can be avoided by specifying a margin in the hypothesis test, so that
# trivial deviations are still part of the null hypothesis. By using
# this, we are not favoring the alternative in large samples for
# irrelevant deviations from a point null hypothesis.
# Noninferiority tests want to show that a treatment is almost as good as
# the control. Equivalence tests try to show that the statistics in two
# samples are approximately the same, in contrast to a point hypothesis
# that specifies that they are exactly the same.
#
# ## Reversing the samples
#
# In the literature, stochastic dominance in the sense of p unequal to
# 0.5 is often defined using a stochastically smaller measure instead of
# the stochastically larger measure used in statsmodels.
#
# The effect measure reverses the inequality and is then
#
# p2 = P(x1 < x2) + 0.5 * P(x1 = x2)
#
# This can be obtained with the statsmodels functions by reversing the
# order of the samples, that is, we can use (x2, x1) instead of (x1, x2).
# The two definitions, p and p2, are related by p2 = 1 - p.
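# The relation p2 = 1 - p can be checked numerically with a brute-force
# pairwise estimate (hypothetical helper, not part of statsmodels): each
# pair contributes complementary amounts to the two measures.

```python
import numpy as np

def prob_superior(x1, x2):
    # p = P(x1 > x2) + 0.5 * P(x1 == x2) over all pairwise comparisons
    diff = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

a, b = [1, 2, 2, 3], [1, 1, 2]
prob_superior(a, b) + prob_superior(b, a)  # 1.0 by construction
```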
#
# The results instance of rank_compare shows both statistics, but the
# hypothesis tests and confidence interval are only provided for the
# stochastically larger measure.
res = rank_compare_2indep(x2, x1)
res
res.conf_int()
st = res.summary()
st
# ## Ordinal data
#
# The rank-based analysis assumes only that the data are ordinal and uses
# only the ordinal information. Furthermore, the definition in terms of
# midranks allows for discrete data in analyses similar to the
# Brunner-Munzel test.
#
# Statsmodels provides some additional functions for the case when the
# data is discrete and has only a finite number of support points, i.e. is
# ordered categorical.
#
# The data above were given as ordinal counts, but we had to expand them
# to be able to use `rank_compare_2indep`. Instead, we can use
# `rank_compare_2ordinal` directly with the frequency counts. The latter
# function mainly differs from the former by using a more efficient
# computation for this special case. The resulting statistics will be the
# same.
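# As a sketch of the counts-based computation (hypothetical helper, not
# the statsmodels implementation), p can be computed directly from the
# frequency distributions over the shared ordered levels:

```python
import numpy as np

def prob_superior_counts(count1, count2):
    # p = sum_i P(x1 = level_i) * (P(x2 < level_i) + 0.5 * P(x2 = level_i))
    f1 = np.asarray(count1) / np.sum(count1)
    f2 = np.asarray(count2) / np.sum(count2)
    below = np.concatenate(([0.0], np.cumsum(f2)[:-1]))  # P(x2 < level_i)
    return float(np.sum(f1 * (below + 0.5 * f2)))

# the frequency counts for "new" and "active" from the example above
prob_superior_counts([24, 37, 21, 19, 6], [11, 51, 22, 21, 7])
```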
#
# The counts for treatment ("new") and control ("active") in increasing
# order of the underlying values from the above example are
new, active
res_o = rank_compare_2ordinal(new, active)
res_o.summary()
res_o
res = rank_compare_2indep(x1, x2)
res
res_o.test_prob_superior()
res.test_prob_superior()
res_o.conf_int(), res.conf_int()