File: 07_reshape_table_layout.rst

package info (click to toggle)
pandas 2.2.3%2Bdfsg-9
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 66,784 kB
  • sloc: python: 422,228; ansic: 9,190; sh: 270; xml: 102; makefile: 83
file content (380 lines) | stat: -rw-r--r-- 10,341 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
.. _10min_tut_07_reshape:

{{ header }}

.. ipython:: python

    import pandas as pd

.. raw:: html

    <div class="card gs-data">
        <div class="card-header gs-data-header">
            <div class="gs-data-title">
                Data used for this tutorial:
            </div>
        </div>
        <ul class="list-group list-group-flush">
            <li class="list-group-item gs-data-list">

.. include:: includes/titanic.rst

.. ipython:: python

    titanic = pd.read_csv("data/titanic.csv")
    titanic.head()

.. raw:: html

        </li>
        <li class="list-group-item gs-data-list">
            <div data-bs-toggle="collapse" href="#collapsedata2" role="button" aria-expanded="false" aria-controls="collapsedata2">
                <span class="badge bg-secondary">Air quality data</span>
            </div>
            <div class="collapse" id="collapsedata2">
                <div class="card-body">
                    <p class="card-text">

This tutorial uses air quality data about :math:`NO_2` and Particulate matter less than 2.5
micrometers, made available by
`OpenAQ <https://openaq.org>`__ and using the
`py-openaq <http://dhhagan.github.io/py-openaq/index.html>`__ package.
The ``air_quality_long.csv`` data set provides :math:`NO_2` and
:math:`PM_{25}` values for the measurement stations *FR04014*, *BETR801*
and *London Westminster* in respectively Paris, Antwerp and London.

The air-quality data set has the following columns:

-  city: city where the sensor is used, either Paris, Antwerp or London
-  country: country where the sensor is used, either FR, BE or GB
-  location: the id of the sensor, either *FR04014*, *BETR801* or
   *London Westminster*
-  parameter: the parameter measured by the sensor, either :math:`NO_2`
   or Particulate matter
-  value: the measured value
-  unit: the unit of the measured parameter, in this case ‘µg/m³’

and the index of the ``DataFrame`` is ``datetime``, the datetime of the
measurement.

.. note::
    The air-quality data is provided in a so-called *long format*
    data representation with each observation on a separate row and each
    variable a separate column of the data table. The long/narrow format is
    also known as the `tidy data
    format <https://www.jstatsoft.org/article/view/v059i10>`__.

.. raw:: html

                    </p>
                <a href="https://github.com/pandas-dev/pandas/tree/main/doc/data/air_quality_long.csv" class="btn btn-dark btn-sm">To raw data</a>
            </div>
        </div>

.. ipython:: python

    air_quality = pd.read_csv(
        "data/air_quality_long.csv", index_col="date.utc", parse_dates=True
    )
    air_quality.head()

.. raw:: html

            </li>
        </ul>
    </div>

How to reshape the layout of tables
-----------------------------------

Sort table rows
~~~~~~~~~~~~~~~

.. raw:: html

    <ul class="task-bullet">
        <li>

I want to sort the Titanic data according to the age of the passengers.

.. ipython:: python

    titanic.sort_values(by="Age").head()

.. raw:: html

        </li>
    </ul>

.. raw:: html

    <ul class="task-bullet">
        <li>

I want to sort the Titanic data according to the cabin class and age in descending order.

.. ipython:: python

    titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()

With :meth:`DataFrame.sort_values`, the rows in the table are sorted according to the
defined column(s). The index will follow the row order.

.. raw:: html

        </li>
    </ul>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

More details about sorting of tables is provided in the user guide section on :ref:`sorting data <basics.sorting>`.

.. raw:: html

   </div>

Long to wide table format
~~~~~~~~~~~~~~~~~~~~~~~~~

Let’s use a small subset of the air quality data set. We focus on
:math:`NO_2` data and only use the first two measurements of each
location (i.e. the head of each group). The subset of data will be
called ``no2_subset``.

.. ipython:: python

    # filter for no2 data only
    no2 = air_quality[air_quality["parameter"] == "no2"]

.. ipython:: python

    # use 2 measurements (head) for each location (groupby)
    no2_subset = no2.sort_index().groupby(["location"]).head(2)
    no2_subset

.. image:: ../../_static/schemas/07_pivot.svg
   :align: center

.. raw:: html

    <ul class="task-bullet">
        <li>

I want the values for the three stations as separate columns next to each other.

.. ipython:: python

    no2_subset.pivot(columns="location", values="value")

The :meth:`~pandas.pivot` function is purely reshaping of the data: a single value
for each index/column combination is required.

.. raw:: html

        </li>
    </ul>

As pandas supports plotting of multiple columns (see :ref:`plotting tutorial <10min_tut_04_plotting>`) out of the box, the conversion from
*long* to *wide* table format enables the plotting of the different time
series at the same time:

.. ipython:: python

    no2.head()

.. ipython:: python

    @savefig 7_reshape_columns.png
    no2.pivot(columns="location", values="value").plot()

.. note::
    When the ``index`` parameter is not defined, the existing
    index (row labels) is used.

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

For more information about :meth:`~DataFrame.pivot`, see the user guide section on :ref:`pivoting DataFrame objects <reshaping.reshaping>`.

.. raw:: html

   </div>

Pivot table
~~~~~~~~~~~

.. image:: ../../_static/schemas/07_pivot_table.svg
   :align: center

.. raw:: html

    <ul class="task-bullet">
        <li>

I want the mean concentrations for :math:`NO_2` and :math:`PM_{2.5}` in each of the stations in table form.

.. ipython:: python

    air_quality.pivot_table(
        values="value", index="location", columns="parameter", aggfunc="mean"
    )

In the case of :meth:`~DataFrame.pivot`, the data is only rearranged. When multiple
values need to be aggregated (in this specific case, the values on
different time steps), :meth:`~DataFrame.pivot_table` can be used, providing an
aggregation function (e.g. mean) on how to combine these values.

.. raw:: html

        </li>
    </ul>

Pivot table is a well known concept in spreadsheet software. When
interested in the row/column margins (subtotals) for each variable, set
the ``margins`` parameter to ``True``:

.. ipython:: python

    air_quality.pivot_table(
        values="value",
        index="location",
        columns="parameter",
        aggfunc="mean",
        margins=True,
    )

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

For more information about :meth:`~DataFrame.pivot_table`, see the user guide section on :ref:`pivot tables <reshaping.pivot>`.

.. raw:: html

   </div>

.. note::
    In case you are wondering, :meth:`~DataFrame.pivot_table` is indeed directly linked
    to :meth:`~DataFrame.groupby`. The same result can be derived by grouping on both
    ``parameter`` and ``location``:

    ::

        air_quality.groupby(["parameter", "location"])[["value"]].mean()

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

.. raw:: html

   </div>

Wide to long format
~~~~~~~~~~~~~~~~~~~

Starting again from the wide format table created in the previous
section, we add a new index to the ``DataFrame`` with :meth:`~DataFrame.reset_index`.

.. ipython:: python

    no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
    no2_pivoted.head()

.. image:: ../../_static/schemas/07_melt.svg
   :align: center

.. raw:: html

    <ul class="task-bullet">
        <li>

I want to collect all air quality :math:`NO_2` measurements in a single column (long format).

.. ipython:: python

    no_2 = no2_pivoted.melt(id_vars="date.utc")
    no_2.head()

The :func:`pandas.melt` method on a ``DataFrame`` converts the data table from wide
format to long format. The column headers become the variable names in a
newly created column.

.. raw:: html

        </li>
    </ul>

The solution is the short version on how to apply :func:`pandas.melt`. The method
will *melt* all columns NOT mentioned in ``id_vars`` together into two
columns: A column with the column header names and a column with the
values itself. The latter column gets by default the name ``value``.

The parameters passed to :func:`pandas.melt` can be defined in more detail:

.. ipython:: python

    no_2 = no2_pivoted.melt(
        id_vars="date.utc",
        value_vars=["BETR801", "FR04014", "London Westminster"],
        value_name="NO_2",
        var_name="id_location",
    )
    no_2.head()

The additional parameters have the following effects:

-  ``value_vars`` defines which columns to *melt* together
-  ``value_name`` provides a custom column name for the values column
   instead of the default column name ``value``
-  ``var_name`` provides a custom column name for the column collecting
   the column header names. Otherwise it takes the index name or a
   default ``variable``

Hence, the arguments ``value_name`` and ``var_name`` are just
user-defined names for the two generated columns. The columns to melt
are defined by ``id_vars`` and ``value_vars``.

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

Conversion from wide to long format with :func:`pandas.melt` is explained in the user guide section on :ref:`reshaping by melt <reshaping.melt>`.

.. raw:: html

   </div>

.. raw:: html

    <div class="shadow gs-callout gs-callout-remember">
        <h4>REMEMBER</h4>

-  Sorting by one or more columns is supported by ``sort_values``.
-  The ``pivot`` function is purely restructuring of the data,
   ``pivot_table`` supports aggregations.
-  The reverse of ``pivot`` (long to wide format) is ``melt`` (wide to
   long format).

.. raw:: html

   </div>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

A full overview is available in the user guide on the pages about :ref:`reshaping and pivoting <reshaping>`.

.. raw:: html

   </div>