File: 09_timeseries.rst

.. _10min_tut_09_timeseries:

{{ header }}

.. ipython:: python

    import pandas as pd
    import matplotlib.pyplot as plt

.. raw:: html

    <div class="card gs-data">
        <div class="card-header">
            <div class="gs-data-title">
                Data used for this tutorial:
            </div>
        </div>
        <ul class="list-group list-group-flush">
            <li class="list-group-item">
                <div data-toggle="collapse" href="#collapsedata" role="button" aria-expanded="false" aria-controls="collapsedata">
                    <span class="badge badge-dark">Air quality data</span>
                </div>
                <div class="collapse" id="collapsedata">
                    <div class="card-body">
                        <p class="card-text">

For this tutorial, air quality data about :math:`NO_2` and particulate
matter smaller than 2.5 micrometers is used, made available by
`OpenAQ <https://openaq.org>`__ and downloaded using the
`py-openaq <http://dhhagan.github.io/py-openaq/index.html>`__ package.
The ``air_quality_no2_long.csv`` data set provides :math:`NO_2` values
for the measurement stations *FR04014*, *BETR801* and *London
Westminster* in Paris, Antwerp and London, respectively.

.. raw:: html

                        </p>
                    <a href="https://github.com/pandas-dev/pandas/tree/main/doc/data/air_quality_no2_long.csv" class="btn btn-dark btn-sm">To raw data</a>
                </div>
            </div>

.. ipython:: python

    air_quality = pd.read_csv("data/air_quality_no2_long.csv")
    air_quality = air_quality.rename(columns={"date.utc": "datetime"})
    air_quality.head()

.. ipython:: python

    air_quality.city.unique()

.. raw:: html

        </li>
    </ul>
    </div>

How to handle time series data with ease?
-----------------------------------------

.. _10min_tut_09_timeseries.properties:

Using pandas datetime properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. raw:: html

    <ul class="task-bullet">
        <li>

I want to work with the dates in the column ``datetime`` as datetime objects instead of plain text

.. ipython:: python

    air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
    air_quality["datetime"]

Initially, the values in ``datetime`` are character strings and do not
support any datetime operations (e.g. extracting the year or the day of
the week). By applying the ``to_datetime`` function, pandas interprets the
strings and converts them to datetime (i.e. ``datetime64[ns, UTC]``)
objects. In pandas, these datetime objects are called
:class:`pandas.Timestamp` and are similar to ``datetime.datetime`` from
the standard library.
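
As a minimal sketch of this conversion, independent of the tutorial file, the
following uses two made-up timestamp strings (hypothetical values written in
the same style as the data) and inspects the result before and after
``to_datetime``:

.. code-block:: python

    import pandas as pd

    # Hypothetical sample strings, not taken from the actual CSV file
    raw = pd.Series(["2019-06-21 00:00:00+00:00", "2019-06-20 23:00:00+00:00"])

    # Plain strings: no datetime operations are available yet
    print(raw.dtype)

    # After conversion, each element is a timezone-aware pandas.Timestamp
    parsed = pd.to_datetime(raw)
    print(parsed.dtype)
    print(type(parsed.iloc[0]))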

.. raw:: html

        </li>
    </ul>

.. note::
    As many data sets contain datetime information in one of
    the columns, pandas input functions like :func:`pandas.read_csv` and :func:`pandas.read_json`
    can convert these columns to dates while reading the data, using the
    ``parse_dates`` parameter with a list of the columns to read as
    Timestamp:

    ::

        pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])
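
A self-contained sketch of the ``parse_dates`` option is given here. It reads
a small in-memory CSV instead of the tutorial file; the rows are hypothetical
and only a few columns are reproduced. In this tutorial the column is called
``date.utc`` until it is renamed, so that is the name passed to
``parse_dates`` in the sketch:

.. code-block:: python

    import io

    import pandas as pd

    # Hypothetical rows mimicking the layout of the tutorial data
    csv_text = "\n".join(
        [
            "city,date.utc,location,value",
            "Paris,2019-06-21 00:00:00+00:00,FR04014,20.0",
            "Antwerpen,2019-06-20 23:00:00+00:00,BETR801,26.5",
        ]
    )

    # Parse the date column while reading, then rename it as in the tutorial
    df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])
    df = df.rename(columns={"date.utc": "datetime"})
    print(df.dtypes)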

Why are these :class:`pandas.Timestamp` objects useful? Let’s illustrate the added
value with some example cases.

   What is the start and end date of the time series data set we are working
   with?

.. ipython:: python

    air_quality["datetime"].min(), air_quality["datetime"].max()

Using :class:`pandas.Timestamp` for datetimes enables us to compare dates
and to do calculations with them. Hence, we can use this to get the
length of our time series:

.. ipython:: python

    air_quality["datetime"].max() - air_quality["datetime"].min()

The result is a :class:`pandas.Timedelta` object, which defines a time
duration and is similar to ``datetime.timedelta`` from the standard Python
library.
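
The arithmetic behind this can be sketched with two explicitly constructed
timestamps. The values below are chosen for illustration only and are not
read from the data set:

.. code-block:: python

    import pandas as pd

    # Illustrative start and end points (assumed values)
    start = pd.Timestamp("2019-05-07 01:00:00", tz="UTC")
    end = pd.Timestamp("2019-06-21 00:00:00", tz="UTC")

    # Subtracting two Timestamps gives a Timedelta
    duration = end - start
    print(duration)
    print(duration.days)                    # whole days only
    print(duration.total_seconds() / 3600)  # full duration in hours

    # Timestamps can be compared and shifted by a Timedelta
    print(start < end)
    print(start + pd.Timedelta(days=1))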

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

The various time concepts supported by pandas are explained in the user guide section on :ref:`time related concepts <timeseries.overview>`.

.. raw:: html

    </div>

.. raw:: html

    <ul class="task-bullet">
        <li>

I want to add a new column to the ``DataFrame`` containing only the month of the measurement

.. ipython:: python

    air_quality["month"] = air_quality["datetime"].dt.month
    air_quality.head()

By using ``Timestamp`` objects for dates, pandas provides a lot of
time-related properties, for example ``month``, but also ``year``,
``quarter``,… All of these properties are accessible through the ``dt``
accessor.
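
A short sketch of a few of these properties on a stand-alone ``Series``
follows. The two dates are made up for illustration, and
``dt.isocalendar()`` is used here as one way to obtain the ISO week number:

.. code-block:: python

    import pandas as pd

    # Two illustrative dates (assumed values)
    stamps = pd.Series(pd.to_datetime(["2019-05-07 01:00:00", "2019-06-21 00:00:00"]))

    # Collect several datetime components through the dt accessor
    parts = pd.DataFrame(
        {
            "year": stamps.dt.year,
            "quarter": stamps.dt.quarter,
            "day_name": stamps.dt.day_name(),
            "iso_week": stamps.dt.isocalendar().week,
        }
    )
    print(parts)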

.. raw:: html

        </li>
    </ul>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

An overview of the existing date properties is given in the
:ref:`time and date components overview table <timeseries.components>`. More details about using the ``dt`` accessor
to return datetime-like properties are explained in a dedicated section on the :ref:`dt accessor <basics.dt_accessors>`.

.. raw:: html

    </div>

.. raw:: html

    <ul class="task-bullet">
        <li>

What is the average :math:`NO_2` concentration for each day of the week for each of the measurement locations?

.. ipython:: python

    air_quality.groupby(
        [air_quality["datetime"].dt.weekday, "location"])["value"].mean()

Remember the split-apply-combine pattern provided by ``groupby`` from the
:ref:`tutorial on statistics calculation <10min_tut_06_stats>`?
Here, we want to calculate a given statistic (e.g. mean :math:`NO_2`)
**for each weekday** and **for each measurement location**. To group on
weekdays, we use the datetime property ``weekday`` (with Monday=0 and
Sunday=6) of pandas ``Timestamp``, which is also accessible through the
``dt`` accessor. Grouping on both weekday and location computes the mean
separately for each combination of the two.
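
To make the combined grouping concrete, here is a small sketch on a
hand-made table. The locations ``A`` and ``B`` and all values are
hypothetical; naming the weekday ``Series`` is an optional extra that gives
the resulting index level a readable label:

.. code-block:: python

    import pandas as pd

    # Hypothetical measurements: two Mondays and one Tuesday for A,
    # one Monday and one Tuesday for B
    df = pd.DataFrame(
        {
            "datetime": pd.to_datetime(
                [
                    "2019-05-06 10:00",
                    "2019-05-13 10:00",
                    "2019-05-07 10:00",
                    "2019-05-06 10:00",
                    "2019-05-07 10:00",
                ]
            ),
            "location": ["A", "A", "A", "B", "B"],
            "value": [20.0, 30.0, 40.0, 10.0, 50.0],
        }
    )

    # Group on a derived key (weekday) together with an existing column
    weekday = df["datetime"].dt.weekday.rename("weekday")
    means = df.groupby([weekday, "location"])["value"].mean()
    print(means)

    # One column per location can be easier to read
    print(means.unstack("location"))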

.. danger::
    As we are working with a very short time series in these
    examples, the analysis does not provide a long-term representative
    result!

.. raw:: html

        </li>
    </ul>

.. raw:: html

    <ul class="task-bullet">
        <li>

Plot the typical :math:`NO_2` pattern during the day of our time series of all stations together. In other words, what is the average value for each hour of the day?

.. ipython:: python

    fig, axs = plt.subplots(figsize=(12, 4))
    air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
        kind='bar', rot=0, ax=axs
    )
    plt.xlabel("Hour of the day");  # custom x label using Matplotlib
    @savefig 09_bar_chart.png
    plt.ylabel("$NO_2 (µg/m^3)$");

Similar to the previous case, we want to calculate a given statistic
(e.g. mean :math:`NO_2`) **for each hour of the day** and we can use the
split-apply-combine approach again. For this case, we use the datetime property ``hour``
of pandas ``Timestamp``, which is also accessible by the ``dt`` accessor.

.. raw:: html

        </li>
    </ul>

Datetime as index
~~~~~~~~~~~~~~~~~

In the :ref:`tutorial on reshaping <10min_tut_07_reshape>`,
:meth:`~pandas.pivot` was introduced to reshape the data table with each of the
measurement locations as a separate column:

.. ipython:: python

    no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
    no_2.head()

.. note::
    By pivoting the data, the datetime information became the
    index of the table. In general, setting a column as an index can be
    achieved with the ``set_index`` method.

Working with a datetime index (i.e. ``DatetimeIndex``) provides powerful
functionalities. For example, we do not need the ``dt`` accessor to get
the time series properties, but have these properties available on the
index directly:

.. ipython:: python

    no_2.index.year, no_2.index.weekday
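
The same idea can be sketched without pivoting, assuming a small
hypothetical table with a ``datetime`` column. ``set_index`` turns that column
into a ``DatetimeIndex``, after which the properties are read from the index
directly:

.. code-block:: python

    import pandas as pd

    # Hypothetical table with a datetime column
    df = pd.DataFrame(
        {
            "datetime": pd.to_datetime(
                ["2019-05-07 01:00", "2019-05-07 02:00", "2019-05-08 01:00"]
            ),
            "value": [1.0, 2.0, 3.0],
        }
    )

    # On a column, the dt accessor is needed
    years_via_dt = df["datetime"].dt.year

    # On a DatetimeIndex, the same properties are available directly
    indexed = df.set_index("datetime")
    print(type(indexed.index).__name__)
    print(indexed.index.year)
    print(indexed.index.weekday)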

Other advantages are the convenient subsetting of time periods and
the adapted time scale on plots. Let’s apply this to our data.

.. raw:: html

    <ul class="task-bullet">
        <li>

Create a plot of the :math:`NO_2` values in the different stations from the 20th of May until the end of the 21st of May

.. ipython:: python
    :okwarning:

    @savefig 09_time_section.png
    no_2["2019-05-20":"2019-05-21"].plot();

By providing a **string that parses to a datetime**, a specific subset of the data can be selected on a ``DatetimeIndex``.
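
A minimal sketch of this string-based slicing on synthetic hourly data
(hypothetical values, spanning four days) shows that both endpoints are
included and that the end date covers the whole day. ``.loc`` is used here to
make the label-based selection explicit:

.. code-block:: python

    import pandas as pd

    # Synthetic hourly series from 2019-05-19 00:00 to 2019-05-22 23:00
    index = pd.date_range("2019-05-19", periods=96, freq=pd.Timedelta(hours=1))
    series = pd.Series(range(96), index=index)

    # Date strings select everything from the start of the first day
    # up to and including the last timestamp of the second day
    subset = series.loc["2019-05-20":"2019-05-21"]
    print(subset.index.min(), subset.index.max(), len(subset))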

.. raw:: html

        </li>
    </ul>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

More information on the ``DatetimeIndex`` and the slicing by using strings is provided in the section on :ref:`time series indexing <timeseries.datetimeindex>`.

.. raw:: html

    </div>

Resample a time series to another frequency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. raw:: html

    <ul class="task-bullet">
        <li>

Aggregate the current hourly time series values to the monthly maximum value in each of the stations.

.. ipython:: python

    monthly_max = no_2.resample("M").max()
    monthly_max

A very powerful feature of time series data with a datetime index is the
ability to :meth:`~Series.resample` time series to another frequency (e.g.,
converting secondly data into 5-minutely data).

.. raw:: html

        </li>
    </ul>

The :meth:`~Series.resample` method is similar to a groupby operation:

-  it provides a time-based grouping, by using a string (e.g. ``M``,
   ``5H``,…) that defines the target frequency
-  it requires an aggregation function such as ``mean``, ``max``,…
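
The analogy with ``groupby`` can be sketched on synthetic hourly data
(hypothetical values). Grouping on the index truncated to midnight is just
one possible way to express the same daily aggregation by hand:

.. code-block:: python

    import pandas as pd

    # Two days of synthetic hourly values: 0.0, 1.0, ..., 47.0
    index = pd.date_range("2019-05-20", periods=48, freq=pd.Timedelta(hours=1))
    hourly = pd.Series(range(48), index=index, dtype="float64")

    # Time-based grouping ("D" = calendar day) followed by an aggregation
    daily_max = hourly.resample("D").max()
    print(daily_max)

    # A comparable result with an explicit groupby on the day of each timestamp
    same_via_groupby = hourly.groupby(hourly.index.normalize()).max()
    print(same_via_groupby)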

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

An overview of the aliases used to define time series frequencies is given in the :ref:`offset aliases overview table <timeseries.offset_aliases>`.

.. raw:: html

    </div>

When defined, the frequency of the time series is provided by the
``freq`` attribute:

.. ipython:: python

    monthly_max.index.freq
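
As a small sketch of when ``freq`` is and is not defined, consider an
irregular index built by hand (hypothetical dates with a missing day). It has
no frequency, whereas the index produced by resampling does:

.. code-block:: python

    import pandas as pd

    # Irregular dates: 2019-05-22 is missing, so no frequency is attached
    irregular = pd.Series(
        [1.0, 2.0, 4.0],
        index=pd.DatetimeIndex(["2019-05-20", "2019-05-21", "2019-05-23"]),
    )
    print(irregular.index.freq)  # None

    # Resampling to daily bins yields a regular index with a defined frequency;
    # the empty bin for the missing day sums to 0.0
    daily = irregular.resample("D").sum()
    print(daily.index.freq)
    print(daily)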

.. raw:: html

    <ul class="task-bullet">
        <li>

Make a plot of the daily mean :math:`NO_2` value in each of the stations.

.. ipython:: python
    :okwarning:

    @savefig 09_resample_mean.png
    no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));

.. raw:: html

        </li>
    </ul>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

More details on the power of time series ``resampling`` are provided in the user guide section on :ref:`resampling <timeseries.resampling>`.

.. raw:: html

    </div>

.. raw:: html

    <div class="shadow gs-callout gs-callout-remember">
        <h4>REMEMBER</h4>

-  Valid date strings can be converted to datetime objects using the
   ``to_datetime`` function or as part of read functions.
-  Datetime objects in pandas support calculations, logical operations
   and convenient date-related properties using the ``dt`` accessor.
-  A ``DatetimeIndex`` contains these date-related properties and
   supports convenient slicing.
-  ``resample`` is a powerful method to change the frequency of a time
   series.

.. raw:: html

   </div>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

A full overview of time series is given in the pages on :ref:`time series and date functionality <timeseries>`.

.. raw:: html

   </div>