#!/usr/bin/env python
# coding: utf-8
# DO NOT EDIT
# Autogenerated from the notebook stats_rankcompare.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT
# # Rank comparison: two independent samples
#
#
# Statsmodels provides statistics and tests for the probability that x1
# has larger values than x2. These measures are based on ordinal
# comparisons using ranks.
#
# Define p as the probability that a random draw from the population of
# the first sample has a larger value than a random draw from the population
# of the second sample, specifically
#
# p = P(x1 > x2) + 0.5 * P(x1 = x2)
#
# This is the measure underlying the Wilcoxon-Mann-Whitney U test, the
# Fligner-Policello test and the Brunner-Munzel test. Inference is based
# on the asymptotic distribution of the Brunner-Munzel test. The half
# probability for ties corresponds to the use of midranks and makes the
# measure valid for discrete variables.
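# As a numerical sketch of this definition, p can be estimated by
# comparing all pairs of observations (the helper below is hypothetical,
# not part of statsmodels):

```python
import numpy as np

def prob_superior(x1, x2):
    # estimate p = P(x1 > x2) + 0.5 * P(x1 == x2) from all pairwise
    # comparisons; ties count one half, matching the use of midranks
    diff = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

prob_superior([1, 2, 2, 3], [1, 1, 2])  # 0.75 (7 wins + 4 ties of 12 pairs)
```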
#
# The null hypothesis of stochastic equality is p = 0.5, which
# corresponds to the Brunner-Munzel test.
#
# This notebook provides a brief overview of the statistics provided in
# statsmodels.
import numpy as np
from statsmodels.stats.nonparametric import (rank_compare_2indep,
rank_compare_2ordinal,
prob_larger_continuous,
cohensd2problarger)
# ## Example
#
# The main function is `rank_compare_2indep` which computes the Brunner-
# Munzel test and returns a `RankCompareResult` instance with additional
# methods.
#
# The data for the example are taken from Munzel and Hauschke (2003) and
# are given as frequency counts. We need to expand them to arrays of
# observations to be able to use them with `rank_compare_2indep`. See
# below for a function that directly takes frequency counts. The labels
# or levels are treated as ordinal; the specific values are irrelevant as
# long as they define an order (`>`, `=`).
levels = [-2, -1, 0, 1, 2]
new = [24, 37, 21, 19, 6]
active = [11, 51, 22, 21, 7]
x1 = np.repeat(levels, new)
x2 = np.repeat(levels, active)
np.bincount(x1 + 2), np.bincount(x2 + 2)
res = rank_compare_2indep(x1, x2)  # use_t=False would use the normal instead of the t distribution
res
# The methods of the results instance are
[i for i in dir(res) if not i.startswith("_") and callable(getattr(res, i))]
print(res.summary()) # returns SimpleTable
ci = res.conf_int()
ci
res.test_prob_superior()
# ## One-sided tests, superiority and noninferiority tests
#
# The hypothesis test functions have an `alternative` keyword to specify
# one-sided tests and a `value` keyword to specify a nonzero or
# nonequality null value. Both keywords together can be used for
# noninferiority or superiority tests with a margin.
#
# A noninferiority test specifies a margin and an alternative so that we
# can test the hypothesis that a new treatment is almost as good as or
# better than a reference treatment.
#
# The general one-sided hypothesis is
#
# H0: p = p0
# versus
# HA: p > p0 alternative is that effect is larger than specified null
# value p0
# HA: p < p0 alternative is that effect is smaller than specified null
# value p0
#
# Note: The null hypothesis of a one-sided test is often specified as a
# weak inequality. However, p-values are usually derived assuming a null
# hypothesis at the boundary value. The boundary value is in most cases the
# worst case for the p-value, points in the interior of the null hypothesis
# usually have larger p-values, and the test is conservative in those cases.
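# The one-sided hypotheses above can be illustrated with a normal
# approximation; the helper below is a hypothetical sketch, not the
# statsmodels implementation, which is based on the Brunner-Munzel
# statistic:

```python
import math

def one_sided_pvalue(stat, alternative):
    # p-value of a z-type statistic under the boundary null value p0,
    # where stat = (estimate - p0) / standard_error
    sf = 0.5 * math.erfc(stat / math.sqrt(2))  # P(Z > stat)
    return sf if alternative == "larger" else 1.0 - sf

one_sided_pvalue(2.0, "larger")   # ~0.023, reject at alpha = 0.05
one_sided_pvalue(2.0, "smaller")  # ~0.977, complementary alternative
```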
#
# Most two-sample hypothesis tests in statsmodels allow for a "nonzero"
# null value, that is, a null value that does not require that the
# effects in the two samples are the same. Some hypothesis tests, like
# Fisher's exact test for contingency tables and most k-sample anova-type
# tests, only allow a null hypothesis that the effect is the same in all
# samples.
#
# The null value p0 for the hypothesis that there is no difference
# between samples depends on the effect statistic. The null value for a
# difference and for a correlation is zero, the null value for a ratio is
# one, and the null value for the stochastic superiority probability is
# 0.5.
#
# Noninferiority and superiority tests are just special cases of one-sided
# hypothesis tests that allow nonzero null hypotheses. TOST equivalence
# tests are a combination of two one-sided tests with nonzero null values.
#
# Note that we are now using "superior" with two different meanings.
# Superiority and noninferiority in the tests refer to whether the
# treatment effect is better than the control. The effect measure in
# `rank_compare` is the probability based on stochastic superiority of
# sample 1 compared to sample 2, but stochastic superiority can be either
# good or bad.
#
#
# **Noninferiority: Smaller values are better**
#
# If having lower values is better, for example if the variable of
# interest is disease occurrence, then a noninferiority test specifies a
# threshold larger than equality with the alternative that the parameter
# is less than the threshold.
#
# In the case of the stochastic superiority measure, equality of the two
# samples implies a probability equal to 0.5. If we specify a
# noninferiority margin of, say, 0.6, then the null and alternative
# hypotheses are
#
# H0: p >= 0.6 (null for inference based on H0: p = 0.6)
# HA: p < 0.6
#
# If we reject the null hypothesis, then our data supports that the
# treatment is noninferior to the control.
#
#
# **Noninferiority: Larger values are better**
#
# In cases where having larger values is better, e.g. skill or health
# measures, noninferiority means that the treatment has values almost as
# high as or higher than the control. This defines the alternative
# hypothesis. Assuming p0 is the noninferiority threshold, e.g. p0 = 0.4,
# the null and alternative hypotheses are
#
# H0: p <= p0 (p0 < 0.5)
# HA: p > p0
#
# If the null hypothesis is rejected, then we have evidence that the
# treatment (sample 1) is noninferior to the reference (sample 2), that
# is, the superiority probability is larger than p0.
#
#
# **Superiority tests**
#
# Superiority tests are usually defined as a one-sided alternative with
# equality as the null and the better inequality as the alternative.
# However, superiority can also be defined with a margin, in which case
# the treatment has to be better by a non-negligible amount specified by
# the margin.
#
# This means the test is the same one-sided test as above with p0 either
# equal to 0.5, or p0 above 0.5 if larger is better and below 0.5 if
# smaller is better.
#
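# A minimal numeric sketch of a superiority test with and without a
# margin, assuming a normal approximation (hypothetical helper, not the
# statsmodels implementation):

```python
import math

def superiority_pvalue(p_hat, se, p0=0.5):
    # one-sided test of H0: p = p0 against HA: p > p0; p0 = 0.5 gives a
    # plain superiority test, p0 > 0.5 a superiority test with a margin
    z = (p_hat - p0) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)

superiority_pvalue(0.62, 0.04)           # p0 = 0.5, z = 3
superiority_pvalue(0.62, 0.04, p0=0.55)  # margin 0.05, z = 1.75
```

# The margin raises the bar: the same estimate yields a larger p-value
# against the shifted null value.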
# **Example: noninferiority smaller is better**
#
# Suppose our noninferiority threshold is p0 = 0.55. The one-sided test
# with alternative "smaller" has a p-value of around 0.0065, and we
# reject the null hypothesis at an alpha of 0.05. The data provide
# evidence that the treatment (sample 1) is noninferior to the control
# (sample 2), that is, we have evidence that the treatment is at most 5
# percentage points worse than the control.
#
res.test_prob_superior(value=0.55, alternative="smaller")
# **Example: noninferiority larger is better**
#
# Now consider the case when having larger values is better and the
# noninferiority threshold is 0.45. The one-sided test has a p-value of
# 0.44, and we cannot reject the null hypothesis. Therefore, we do not
# have evidence that the treatment is at most 5 percentage points worse
# than the control.
res.test_prob_superior(value=0.45, alternative="larger")
# ## Equivalence Test TOST
#
# In an equivalence test we want to show that the treatment has
# approximately the same effect as the control, or that two treatments
# have approximately the same effect. Equivalence is defined by a lower
# and an upper equivalence margin (`low` and `upp`). If the effect
# measure is inside this interval, then the treatments are equivalent.
#
# The null and alternative hypotheses are
#
# H0: p <= p_low or p >= p_upp
# HA: p_low < p < p_upp
#
# In this case the null hypothesis is that the effect measure is outside
# of the equivalence interval. If we reject the null hypothesis, then the
# data provides evidence that treatment and control are equivalent.
#
# In the TOST procedure we evaluate two one-sided tests, one for the null
# hypothesis that the effect is equal to or below the lower threshold and
# one for the null hypothesis that the effect is equal to or above the
# upper threshold. If we reject both tests, then the data provide
# evidence that the effect is inside the equivalence interval. The
# p-value of the `tost_prob_superior` method is the maximum of the
# p-values of the two tests.
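# The "maximum of two one-sided p-values" rule can be sketched with a
# normal approximation (hypothetical helper, not the statsmodels
# implementation):

```python
import math

def norm_sf(z):
    # survival function P(Z > z) of the standard normal
    return 0.5 * math.erfc(z / math.sqrt(2))

def tost_pvalue(est, se, low, upp):
    # reject H0 (effect outside [low, upp]) only if both one-sided tests
    # reject; the combined TOST p-value is the maximum of the two
    p_low = norm_sf((est - low) / se)        # H0: effect <= low
    p_upp = 1.0 - norm_sf((est - upp) / se)  # H0: effect >= upp
    return max(p_low, p_upp)

tost_pvalue(0.5, 0.02, 0.45, 0.55)  # ~0.006, equivalence supported
```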
#
# Suppose our equivalence margins are 0.45 and 0.55; then the p-value of
# the equivalence test is 0.43, and we cannot reject the null hypothesis
# that the two treatments are not equivalent, i.e. the data do not
# provide support for equivalence.
res.tost_prob_superior(low=0.45, upp=0.55)
res.test_prob_superior(value=0.55, alternative="smaller")
res.test_prob_superior(value=0.6, alternative="larger")
res.test_prob_superior(value=0.529937, alternative="smaller")
res.test_prob_superior(value=0.529937017, alternative="smaller")
res.conf_int()
# ### Aside: Burden of proof in hypothesis testing
#
# Hypothesis tests have in general the following two properties
#
# - small samples favor the null hypothesis. Uncertainty in small samples
# is large and the hypothesis test has little power. The study in this case
# is underpowered and we often cannot reject the null because of large
# uncertainty. Non-rejection cannot be taken as evidence in favor of the
# null because the hypothesis test does not have much power to reject the
# null.
# - large samples favor the alternative hypothesis. In large samples
# uncertainty becomes small, and even small deviations from the null
# hypothesis will lead to rejection. In this case we can have a test
# result that shows an effect that is statistically significant but
# substantively irrelevant.
#
# Noninferiority and equivalence tests solve both problems. The first
# problem, favoring the null in small samples, can be avoided by
# reversing the null and alternative hypotheses. The alternative
# hypothesis should be the one we want to show, so the test does not
# support the hypothesis of interest merely because the power is too
# small. The second problem, favoring the alternative in large samples,
# can be avoided by specifying a margin in the hypothesis test, so that
# trivial deviations are still part of the null hypothesis. By using
# this, we are not favoring the alternative in large samples for
# irrelevant deviations from a point null hypothesis.
# Noninferiority tests want to show that a treatment is almost as good as
# the control. Equivalence tests try to show that the statistics in two
# samples are approximately the same, in contrast to a point hypothesis
# that specifies that they are exactly the same.
#
# ## Reversing the samples
#
# In the literature, stochastic dominance in the sense of p unequal to
# 0.5 is often defined using a stochastically smaller measure instead of
# the stochastically larger measure used in statsmodels.
#
# The effect measure reverses the inequality and is then
#
# p2 = P(x1 < x2) + 0.5 * P(x1 = x2)
#
# This can be obtained with the statsmodels functions by reversing the
# order of the samples, that is, we can use (x2, x1) instead of (x1, x2).
# The two definitions, p and p2, are related by p2 = 1 - p.
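# The relation p2 = 1 - p can be checked numerically with a brute-force
# pairwise estimate (hypothetical helper, not part of statsmodels): each
# pair contributes complementary amounts to the two measures.

```python
import numpy as np

def prob_superior(x1, x2):
    # p = P(x1 > x2) + 0.5 * P(x1 == x2) over all pairwise comparisons
    diff = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

a, b = [1, 2, 2, 3], [1, 1, 2]
prob_superior(a, b) + prob_superior(b, a)  # 1.0 by construction
```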
#
# The results instance of rank_compare shows both statistics, but the
# hypothesis tests and confidence interval are only provided for the
# stochastically larger measure.
res = rank_compare_2indep(x2, x1)
res
res.conf_int()
st = res.summary()
st
# ## Ordinal data
#
# The rank-based analysis assumes only that the data are ordinal and uses
# only the ordinal information. Furthermore, the definition in terms of
# midranks allows for discrete data in analyses similar to the
# Brunner-Munzel test.
#
# Statsmodels provides some additional functions for the case when the
# data is discrete and has only a finite number of support points, i.e. is
# ordered categorical.
#
# The data above were given as ordinal counts, but we had to expand them
# to be able to use `rank_compare_2indep`. Instead, we can use
# `rank_compare_2ordinal` directly with the frequency counts. The latter
# function mainly differs from the former by using a more efficient
# computation for this special case. The resulting statistics will be the
# same.
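# As a sketch of the counts-based computation (hypothetical helper, not
# the statsmodels implementation), p can be computed directly from the
# frequency distributions over the shared ordered levels:

```python
import numpy as np

def prob_superior_counts(count1, count2):
    # p = sum_i P(x1 = level_i) * (P(x2 < level_i) + 0.5 * P(x2 = level_i))
    f1 = np.asarray(count1) / np.sum(count1)
    f2 = np.asarray(count2) / np.sum(count2)
    below = np.concatenate(([0.0], np.cumsum(f2)[:-1]))  # P(x2 < level_i)
    return float(np.sum(f1 * (below + 0.5 * f2)))

# the frequency counts for "new" and "active" from the example above
prob_superior_counts([24, 37, 21, 19, 6], [11, 51, 22, 21, 7])
```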
#
# The counts for treatment ("new") and control ("active") in increasing
# order of the underlying values from the above example are
new, active
res_o = rank_compare_2ordinal(new, active)
res_o.summary()
res_o
res = rank_compare_2indep(x1, x2)
res
res_o.test_prob_superior()
res.test_prob_superior()
res_o.conf_int(), res.conf_int()