1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
|
.. _10min_tut_06_stats:
{{ header }}
.. ipython:: python
import pandas as pd
.. raw:: html
<div class="card gs-data">
<div class="card-header">
<div class="gs-data-title">
Data used for this tutorial:
</div>
</div>
<ul class="list-group list-group-flush">
<li class="list-group-item">
.. include:: includes/titanic.rst
.. ipython:: python
titanic = pd.read_csv("data/titanic.csv")
titanic.head()
.. raw:: html
</li>
</ul>
</div>
How to calculate summary statistics?
------------------------------------
Aggregating statistics
~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../../_static/schemas/06_aggregate.svg
:align: center
.. raw:: html
<ul class="task-bullet">
<li>
What is the average age of the Titanic passengers?
.. ipython:: python
titanic["Age"].mean()
.. raw:: html
</li>
</ul>
Different statistics are available and can be applied to columns with
numerical data. Operations in general exclude missing data and operate
across rows by default.
.. image:: ../../_static/schemas/06_reduction.svg
:align: center
.. raw:: html
<ul class="task-bullet">
<li>
What is the median age and ticket fare price of the Titanic passengers?
.. ipython:: python
titanic[["Age", "Fare"]].median()
The statistic applied to multiple columns of a ``DataFrame`` (the selection of two columns
returns a ``DataFrame``, see the :ref:`subset data tutorial <10min_tut_03_subset>`) is calculated for each numeric column.
.. raw:: html
</li>
</ul>
The aggregating statistic can be calculated for multiple columns at the
same time. Remember the ``describe`` function from the :ref:`first tutorial <10min_tut_01_tableoriented>`?
.. ipython:: python
titanic[["Age", "Fare"]].describe()
Instead of the predefined statistics, specific combinations of
aggregating statistics for given columns can be defined using the
:func:`DataFrame.agg` method:
.. ipython:: python
titanic.agg(
{
"Age": ["min", "max", "median", "skew"],
"Fare": ["min", "max", "median", "mean"],
}
)
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
Details about descriptive statistics are provided in the user guide section on :ref:`descriptive statistics <basics.stats>`.
.. raw:: html
</div>
Aggregating statistics grouped by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../../_static/schemas/06_groupby.svg
:align: center
.. raw:: html
<ul class="task-bullet">
<li>
What is the average age for male versus female Titanic passengers?
.. ipython:: python
titanic[["Sex", "Age"]].groupby("Sex").mean()
As our interest is the average age for each gender, a subselection on
these two columns is made first: ``titanic[["Sex", "Age"]]``. Next, the
:meth:`~DataFrame.groupby` method is applied on the ``Sex`` column to make a group per
category. The average age *for each gender* is calculated and
returned.
.. raw:: html
</li>
</ul>
Calculating a given statistic (e.g. ``mean`` age) *for each category in
a column* (e.g. male/female in the ``Sex`` column) is a common pattern.
The ``groupby`` method is used to support this type of operations. This
fits in the more general ``split-apply-combine`` pattern:
- **Split** the data into groups
- **Apply** a function to each group independently
- **Combine** the results into a data structure
The apply and combine steps are typically done together in pandas.
In the previous example, we explicitly selected the 2 columns first. If
not, the ``mean`` method is applied to each column containing numerical
columns by passing ``numeric_only=True``:
.. ipython:: python
titanic.groupby("Sex").mean(numeric_only=True)
It does not make much sense to get the average value of the ``Pclass``.
If we are only interested in the average age for each gender, the
selection of columns (rectangular brackets ``[]`` as usual) is supported
on the grouped data as well:
.. ipython:: python
titanic.groupby("Sex")["Age"].mean()
.. image:: ../../_static/schemas/06_groupby_select_detail.svg
:align: center
.. note::
The ``Pclass`` column contains numerical data but actually
represents 3 categories (or factors) with respectively the labels ‘1’,
‘2’ and ‘3’. Calculating statistics on these does not make much sense.
Therefore, pandas provides a ``Categorical`` data type to handle this
type of data. More information is provided in the user guide
:ref:`categorical` section.
.. raw:: html
<ul class="task-bullet">
<li>
What is the mean ticket fare price for each of the sex and cabin class combinations?
.. ipython:: python
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
Grouping can be done by multiple columns at the same time. Provide the
column names as a list to the :meth:`~DataFrame.groupby` method.
.. raw:: html
</li>
</ul>
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
A full description on the split-apply-combine approach is provided in the user guide section on :ref:`groupby operations <groupby>`.
.. raw:: html
</div>
Count number of records by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../../_static/schemas/06_valuecounts.svg
:align: center
.. raw:: html
<ul class="task-bullet">
<li>
What is the number of passengers in each of the cabin classes?
.. ipython:: python
titanic["Pclass"].value_counts()
The :meth:`~Series.value_counts` method counts the number of records for each
category in a column.
.. raw:: html
</li>
</ul>
The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records
within each group:
.. ipython:: python
titanic.groupby("Pclass")["Pclass"].count()
.. note::
Both ``size`` and ``count`` can be used in combination with
``groupby``. Whereas ``size`` includes ``NaN`` values and just provides
the number of rows (size of the table), ``count`` excludes the missing
values. In the ``value_counts`` method, use the ``dropna`` argument to
include or exclude the ``NaN`` values.
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
The user guide has a dedicated section on ``value_counts`` , see the page on :ref:`discretization <basics.discretization>`.
.. raw:: html
</div>
.. raw:: html
<div class="shadow gs-callout gs-callout-remember">
<h4>REMEMBER</h4>
- Aggregation statistics can be calculated on entire columns or rows.
- ``groupby`` provides the power of the *split-apply-combine* pattern.
- ``value_counts`` is a convenient shortcut to count the number of
entries in each category of a variable.
.. raw:: html
</div>
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
A full description on the split-apply-combine approach is provided in the user guide pages about :ref:`groupby operations <groupby>`.
.. raw:: html
</div>
|