File: contingency_tables.rst

package info (click to toggle)
statsmodels 0.8.0-9
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 35,236 kB
  • sloc: python: 159,888; ansic: 316; makefile: 246; asm: 156; sh: 19
file content (296 lines) | stat: -rw-r--r-- 9,824 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
.. currentmodule:: statsmodels.stats.contingency_tables

.. _contingency_tables:


Contingency tables
==================

Statsmodels supports a variety of approaches for analyzing contingency
tables, including methods for assessing independence, symmetry,
homogeneity, and methods for working with collections of tables from a
stratified population.

The methods described here are mainly for two-way tables.  Multi-way
tables can be analyzed using log-linear models.  Statsmodels does not
currently have a dedicated API for loglinear modeling, but Poisson
regression in :class:`statsmodels.genmod.GLM` can be used for this
purpose.

A contingency table is a multi-way table that describes a data set in
which each observation belongs to one category for each of several
variables.  For example, if there are two variables, one with
:math:`r` levels and one with :math:`c` levels, then we have a
:math:`r \times c` contingency table.  The table can be described in
terms of the number of observations that fall into a given cell of the
table, e.g. :math:`T_{ij}` is the number of observations that have
level :math:`i` for the first variable and level :math:`j` for the
second variable.  Note that each variable must have a finite number of
levels (or categories), which can be either ordered or unordered.  In
different contexts, the variables defining the axes of a contingency
table may be called **categorical variables** or **factor variables**.
They may be either **nominal** (if their levels are unordered) or
**ordinal** (if their levels are ordered).

The underlying population for a contingency table is described by a
**distribution table** :math:`P_{i, j}`.  The elements of :math:`P`
are probabilities, and the sum of all elements in :math:`P` is 1.
Methods for analyzing contingency tables use the data in :math:`T` to
learn about properties of :math:`P`.

The :class:`statsmodels.stats.Table` is the most basic class for
working with contingency tables.  We can create a ``Table`` object
directly from any rectangular array-like object containing the
contingency table cell counts:

.. ipython:: python

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = sm.datasets.get_rdataset("Arthritis", "vcd", cache=True).data

    tab = pd.crosstab(df['Treatment'], df['Improved'])
    tab = tab.loc[:, ["None", "Some", "Marked"]]
    table = sm.stats.Table(tab)

Alternatively, we can pass the raw data and let the Table class
construct the array of cell counts for us:

.. ipython:: python

    table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])


Independence
------------

**Independence** is the property that the row and column factors occur
independently. **Association** is the lack of independence.  If the
joint distribution is independent, it can be written as the outer
product of the row and column marginal distributions:

.. math::

P_{ij} = \sum_k P_{ij} \cdot \sum_k P_{kj} \forall i, j

We can obtain the best-fitting independent distribution for our
observed data, and then view residuals which identify particular cells
that most strongly violate independence:

.. ipython:: python

    print(table.table_orig)
    print(table.fittedvalues)
    print(table.resid_pearson)

In this example, compared to a sample from a population in which the
rows and columns are independent, we have too many observations in the
placebo/no improvement and treatment/marked improvement cells, and too
few observations in the placebo/marked improvement and treated/no
improvement cells.  This reflects the apparent benefits of the
treatment.

If the rows and columns of a table are unordered (i.e. are nominal
factors), then the most common approach for formally assessing
independence is using Pearson's :math:`\chi^2` statistic.  It's often
useful to look at the cell-wise contributions to the :math:`\chi^2`
statistic to see where the evidence for dependence is coming from.

.. ipython:: python

    rslt = table.test_nominal_association()
    print(rslt.pvalue)
    print(table.chi2_contribs)

For tables with ordered row and column factors, we can us the **linear
by linear** association test to obtain more power against alternative
hypotheses that respect the ordering.  The test statistic for the
linear by linear association test is

.. math::

\sum_k r_i c_j T_{ij}

where :math:`r_i` and :math:`c_j` are row and column scores.  Often
these scores are set to the sequences 0, 1, ....  This gives the
'Cochran-Armitage trend test'.

.. ipython:: python

    rslt = table.test_ordinal_association()
    print(rslt.pvalue)

We can assess the association in a :math:`r\times x` table by
constructing a series of :math:`2\times 2` tables and calculating
their odds ratios.  There are two ways to do this.  The **local odds
ratios** construct :math:`2\times 2` tables from adjacent row and
column categories.

.. ipython:: python

    print(table.local_oddsratios)
    taloc = sm.stats.Table2x2(np.asarray([[7, 29], [21, 13]]))
    print(taloc.oddsratio)
    taloc = sm.stats.Table2x2(np.asarray([[29, 7], [13, 7]]))
    print(taloc.oddsratio)

The **cumulative odds ratios** construct :math:`2\times 2` tables by
dichotomizing the row and column factors at each possible point.

.. ipython:: python

    print(table.cumulative_oddsratios)
    tab1 = np.asarray([[7, 29 + 7], [21, 13 + 7]])
    tacum = sm.stats.Table2x2(tab1)
    print(tacum.oddsratio)
    tab1 = np.asarray([[7 + 29, 7], [21 + 13, 7]])
    tacum = sm.stats.Table2x2(tab1)
    print(tacum.oddsratio)

A mosaic plot is a graphical approach to informally assessing
dependence in two-way tables.

::

    from statsmodels.graphics.mosaicplot import mosaic
    mosaic(data)

Symmetry and homogeneity
------------------------

**Symmetry** is the property that :math:`P_{i, j} = P_{j, i}` for
every :math:`i` and :math:`j`.  **Homogeneity** is the property that
the marginal distribution of the row factor and the column factor are
identical, meaning that

.. math::

\sum_j P_{ij} = \sum_j P_{ji} \forall i

Note that for these properties to be applicable the table :math:`P`
(and :math:`T`) must be square, and the row and column categories must
be identical and must occur in the same order.

To illustrate, we load a data set, create a contingency table, and
calculate the row and column margins.  The :class:`Table` class
contains methods for analyzing :math:`r \times c` contingency tables.
The data set loaded below contains assessments of visual acuity in
people's left and right eyes.  We first load the data and create a
contingency table.

.. ipython:: python

    df = sm.datasets.get_rdataset("VisualAcuity", "vcd", cache=True).data
    df = df.loc[df.gender == "female", :]
    tab = df.set_index(['left', 'right'])
    del tab["gender"]
    tab = tab.unstack()
    tab.columns = tab.columns.get_level_values(1)
    print(tab)

Next we create a :class:`SquareTable` object from the contingency
table.

.. ipython:: python

    sqtab = sm.stats.SquareTable(tab)
    row, col = sqtab.marginal_probabilities
    print(row)
    print(col)


The ``summary`` method prints results for the symmetry and homogeneity
testing procedures.

.. ipython:: python

    print(sqtab.summary())

If we had the individual case records in a dataframe called ``data``,
we could also perform the same analysis by passing the raw data using
the ``SquareTable.from_data`` class method.

::

    sqtab = sm.stats.SquareTable.from_data(data[['left', 'right']])
    print(sqtab.summary())


A single 2x2 table
------------------

Several methods for working with individual 2x2 tables are provided in
the :class:`sm.stats.Table2x2` class.  The ``summary`` method displays
several measures of association between the rows and columns of the
table.

.. ipython:: python

    table = np.asarray([[35, 21], [25, 58]])
    t22 = sm.stats.Table2x2(table)
    print(t22.summary())

Note that the risk ratio is not symmetric so different results will be
obtained if the transposed table is analyzed.

.. ipython:: python

    table = np.asarray([[35, 21], [25, 58]])
    t22 = sm.stats.Table2x2(table.T)
    print(t22.summary())


Stratified 2x2 tables
---------------------

Stratification occurs when we have a collection of contingency tables
defined by the same row and column factors.  In the example below, we
have a collection of 2x2 tables reflecting the joint distribution of
smoking and lung cancer in each of several regions of China.  It is
possible that the tables all have a common odds ratio, even while the
marginal probabilities vary among the strata.  The 'Breslow-Day'
procedure tests whether the data are consistent with a common odds
ratio.  It appears below as the `Test of constant OR`.  The
Mantel-Haenszel procedure tests whether this common odds ratio is
equal to one.  It appears below as the `Test of OR=1`.  It is also
possible to estimate the common odds and risk ratios and obtain
confidence intervals for them.  The ``summary`` method displays all of
these results.  Individual results can be obtained from the class
methods and attributes.

.. ipython:: python

    data = sm.datasets.china_smoking.load()

    mat = np.asarray(data.data)
    tables = [np.reshape(list(x)[1:], (2, 2)) for x in mat]

    st = sm.stats.StratifiedTable(tables)
    print(st.summary())


Module Reference
----------------

.. module:: statsmodels.stats.contingency_tables
.. currentmodule:: statsmodels.stats.contingency_tables

.. autosummary::
   :toctree: generated/

   Table
   Table2x2
   SquareTable
   StratifiedTable
   mcnemar
   cochrans_q

See also
--------

Scipy_ has several functions for analyzing contingency tables,
including Fisher's exact test which is not currently in Statsmodels.

.. _Scipy: http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html#contingency-table-functions