.. _10min_tut_09_timeseries:
{{ header }}
.. ipython:: python
import pandas as pd
import matplotlib.pyplot as plt
.. raw:: html
<div class="card gs-data">
<div class="card-header">
<div class="gs-data-title">
Data used for this tutorial:
</div>
</div>
<ul class="list-group list-group-flush">
<li class="list-group-item">
<div data-toggle="collapse" href="#collapsedata" role="button" aria-expanded="false" aria-controls="collapsedata">
<span class="badge badge-dark">Air quality data</span>
</div>
<div class="collapse" id="collapsedata">
<div class="card-body">
<p class="card-text">
For this tutorial, air quality data about :math:`NO_2` and Particulate
matter less than 2.5 micrometers is used, made available by
`OpenAQ <https://openaq.org>`__ and downloaded using the
`py-openaq <http://dhhagan.github.io/py-openaq/index.html>`__ package.
The ``air_quality_no2_long.csv`` data set provides :math:`NO_2` values
for the measurement stations *FR04014*, *BETR801* and *London
Westminster* in Paris, Antwerp and London, respectively.
.. raw:: html
</p>
<a href="https://github.com/pandas-dev/pandas/tree/main/doc/data/air_quality_no2_long.csv" class="btn btn-dark btn-sm">To raw data</a>
</div>
</div>
.. ipython:: python
air_quality = pd.read_csv("data/air_quality_no2_long.csv")
air_quality = air_quality.rename(columns={"date.utc": "datetime"})
air_quality.head()
.. ipython:: python
air_quality.city.unique()
.. raw:: html
</li>
</ul>
</div>
How to handle time series data with ease?
-----------------------------------------
.. _10min_tut_09_timeseries.properties:
Using pandas datetime properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. raw:: html
<ul class="task-bullet">
<li>
I want to work with the dates in the column ``datetime`` as datetime objects instead of plain text
.. ipython:: python
air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
air_quality["datetime"]
Initially, the values in ``datetime`` are character strings and do not
provide any datetime operations (e.g. extract the year, day of the
week,…). By applying the ``to_datetime`` function, pandas interprets the
strings and converts these to datetime (i.e. ``datetime64[ns, UTC]``)
objects. In pandas, these datetime objects are called :class:`pandas.Timestamp`
and are similar to ``datetime.datetime`` from the standard library.
.. raw:: html
</li>
</ul>
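As a minimal, self-contained sketch of the conversion described above, the example below uses two hypothetical values formatted like the ``datetime`` column (they are illustrative and not read from the tutorial file). It shows the change in dtype and that a single element is a :class:`pandas.Timestamp`.

```python
import pandas as pd

# Hypothetical sample values, formatted like the ``datetime`` column above.
raw = pd.Series(["2019-06-21 00:00:00+00:00", "2019-06-20 23:00:00+00:00"])
print(raw.dtype)  # plain text (object or string dtype, depending on the pandas version)

# Parse the strings; the "+00:00" offset makes the result timezone-aware (UTC).
converted = pd.to_datetime(raw)
print(converted.dtype)

# A single element of the converted Series is a pandas.Timestamp.
first = converted.iloc[0]
print(type(first), first.year, first.day_name())
```

Once converted, properties such as ``year`` or methods such as ``day_name()`` become available on each value.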
.. note::
As many data sets contain datetime information in one of
the columns, pandas input functions like :func:`pandas.read_csv` and :func:`pandas.read_json`
can do the transformation to dates when reading the data using the
``parse_dates`` parameter with a list of the columns to read as
Timestamp:
::
pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])
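The following sketch shows ``parse_dates`` on a small in-memory CSV so that it runs without the data file. The two rows are hypothetical and only mimic the layout of the tutorial data. Note that ``parse_dates`` has to name a column as it appears in the file; in this tutorial the raw column is ``date.utc`` and is renamed to ``datetime`` after reading.

```python
import io

import pandas as pd

# Hypothetical two-row extract using the column names of the raw tutorial file.
csv_text = (
    "city,location,date.utc,value\n"
    "Paris,FR04014,2019-06-21 00:00:00+00:00,20.0\n"
    "Paris,FR04014,2019-06-20 23:00:00+00:00,21.8\n"
)

# Parse the date column while reading, then rename it as in the tutorial.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])
df = df.rename(columns={"date.utc": "datetime"})
print(df.dtypes)
```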
Why are these :class:`pandas.Timestamp` objects useful? Let’s illustrate the added
value with some example cases.
What is the start and end date of the time series data set we are working
with?
.. ipython:: python
air_quality["datetime"].min(), air_quality["datetime"].max()
Using :class:`pandas.Timestamp` for datetimes enables us to do calculations with date
information and to compare dates. Hence, we can use this to get the
length of our time series:
.. ipython:: python
air_quality["datetime"].max() - air_quality["datetime"].min()
The result is a :class:`pandas.Timedelta` object, similar to ``datetime.timedelta``
from the standard Python library and defining a time duration.
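To make the arithmetic concrete, here is a small sketch with two illustrative timestamps (chosen for the example, not computed from the file). Subtracting them yields a ``Timedelta``, which can be inspected or added back to a ``Timestamp``.

```python
import pandas as pd

# Two illustrative, timezone-aware timestamps.
start = pd.Timestamp("2019-05-07 01:00:00", tz="UTC")
end = pd.Timestamp("2019-06-21 00:00:00", tz="UTC")

# Subtracting two Timestamps gives a Timedelta.
span = end - start
print(span)

# A Timedelta exposes its components and its total duration.
hours = span.total_seconds() / 3600
print(span.days, hours)

# Timestamps can be shifted by a Timedelta and compared with each other.
print(start + pd.Timedelta(days=1) < end)
```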
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
The various time concepts supported by pandas are explained in the user guide section on :ref:`time related concepts <timeseries.overview>`.
.. raw:: html
</div>
.. raw:: html
<ul class="task-bullet">
<li>
I want to add a new column to the ``DataFrame`` containing only the month of the measurement
.. ipython:: python
air_quality["month"] = air_quality["datetime"].dt.month
air_quality.head()
By using ``Timestamp`` objects for dates, a lot of time-related
properties are provided by pandas. For example the ``month``, but also
``year``, ``quarter``, ``weekday``,… All of these properties are
accessible by the ``dt`` accessor.
.. raw:: html
</li>
</ul>
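The sketch below collects several ``dt`` properties for two illustrative timestamps. The ISO week number is obtained through ``dt.isocalendar()``, which is an addition here and is not used elsewhere in this tutorial.

```python
import pandas as pd

# Two illustrative timestamps stored in a Series.
stamps = pd.Series(pd.to_datetime(["2019-05-07 01:00", "2019-06-21 00:00"], utc=True))

# Each property is computed element-wise through the ``dt`` accessor.
parts = pd.DataFrame(
    {
        "year": stamps.dt.year,
        "month": stamps.dt.month,
        "quarter": stamps.dt.quarter,
        "weekday": stamps.dt.weekday,  # Monday=0, Sunday=6
        "iso_week": stamps.dt.isocalendar().week,
    }
)
print(parts)
```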
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
An overview of the existing date properties is given in the
:ref:`time and date components overview table <timeseries.components>`. More details about the ``dt`` accessor,
which returns datetime-like properties, are given in a dedicated section on the :ref:`dt accessor <basics.dt_accessors>`.
.. raw:: html
</div>
.. raw:: html
<ul class="task-bullet">
<li>
What is the average :math:`NO_2` concentration for each day of the week for each of the measurement locations?
.. ipython:: python
air_quality.groupby(
[air_quality["datetime"].dt.weekday, "location"])["value"].mean()
Remember the split-apply-combine pattern provided by ``groupby`` from the
:ref:`tutorial on statistics calculation <10min_tut_06_stats>`?
Here, we want to calculate a given statistic (e.g. mean :math:`NO_2`)
**for each weekday** and **for each measurement location**. To group on
weekdays, we use the datetime property ``weekday`` (with Monday=0 and
Sunday=6) of pandas ``Timestamp``, which is also accessible by the
``dt`` accessor. Grouping on both weekdays and locations splits the
calculation of the mean over each of these combinations.
.. danger::
As we are working with a very short time series in these
examples, the analysis does not provide a long-term representative
result!
.. raw:: html
</li>
</ul>
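The grouping can be checked by hand on a tiny table. The four rows below are hypothetical values chosen so that the means are easy to verify; the station names are reused from the tutorial only for familiarity.

```python
import pandas as pd

# Hypothetical measurements: two Mondays and one Tuesday for FR04014,
# and one Monday for BETR801.
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2019-05-06 10:00", "2019-05-13 10:00", "2019-05-07 10:00", "2019-05-06 10:00"],
            utc=True,
        ),
        "location": ["FR04014", "FR04014", "FR04014", "BETR801"],
        "value": [20.0, 30.0, 40.0, 10.0],
    }
)

# Group on a derived key (the weekday) together with an existing column.
weekday_mean = df.groupby([df["datetime"].dt.weekday, "location"])["value"].mean()
print(weekday_mean)
```

The result is indexed by the (weekday, location) pairs, so each combination gets its own mean.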
.. raw:: html
<ul class="task-bullet">
<li>
Plot the typical :math:`NO_2` pattern during the day of our time series of all stations together. In other words, what is the average value for each hour of the day?
.. ipython:: python
fig, axs = plt.subplots(figsize=(12, 4))
air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
kind='bar', rot=0, ax=axs
)
plt.xlabel("Hour of the day"); # custom x label using Matplotlib
@savefig 09_bar_chart.png
plt.ylabel("$NO_2 (µg/m^3)$");
Similar to the previous case, we want to calculate a given statistic
(e.g. mean :math:`NO_2`) **for each hour of the day** and we can use the
split-apply-combine approach again. For this case, we use the datetime property ``hour``
of pandas ``Timestamp``, which is also accessible by the ``dt`` accessor.
.. raw:: html
</li>
</ul>
Datetime as index
~~~~~~~~~~~~~~~~~
In the :ref:`tutorial on reshaping <10min_tut_07_reshape>`,
:meth:`~pandas.pivot` was introduced to reshape the data table with each of the
measurement locations as a separate column:
.. ipython:: python
no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
no_2.head()
.. note::
By pivoting the data, the datetime information became the
index of the table. In general, setting a column as an index can be
achieved by the ``set_index`` method.
Working with a datetime index (i.e. ``DatetimeIndex``) provides powerful
functionalities. For example, we do not need the ``dt`` accessor to get
the time series properties, but have these properties available on the
index directly:
.. ipython:: python
no_2.index.year, no_2.index.weekday
Some other advantages are the convenient subsetting of time periods and
the adapted time scale on plots. Let’s apply this to our data.
.. raw:: html
<ul class="task-bullet">
<li>
Create a plot of the :math:`NO_2` values in the different stations from the 20th of May until the end of the 21st of May
.. ipython:: python
:okwarning:
@savefig 09_time_section.png
no_2["2019-05-20":"2019-05-21"].plot();
By providing a **string that parses to a datetime**, a specific subset of the data can be selected on a ``DatetimeIndex``.
.. raw:: html
</li>
</ul>
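As a sketch of string-based slicing on a ``DatetimeIndex``, the example below builds a hypothetical hourly table, sets the datetime column as index with ``set_index`` and selects two full days. It uses ``.loc`` for the selection, which is a choice made for this example; both end points of the slice are included.

```python
import pandas as pd

# Hypothetical hourly values covering the 19th up to the 22nd of May 2019.
idx = pd.date_range("2019-05-19", periods=96, freq="h", tz="UTC")
table = pd.DataFrame({"datetime": idx, "value": range(96)}).set_index("datetime")

# Properties are available on the index directly, without the ``dt`` accessor.
print(table.index.year[:3], table.index.weekday[:3])

# Strings that parse to a date select every row within that period.
subset = table.loc["2019-05-20":"2019-05-21"]
print(subset.index.min(), subset.index.max())
```

Because the strings only specify a day, the slice runs from the first hour of the 20th to the last hour of the 21st.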
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
More information on the ``DatetimeIndex`` and the slicing by using strings is provided in the section on :ref:`time series indexing <timeseries.datetimeindex>`.
.. raw:: html
</div>
Resample a time series to another frequency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. raw:: html
<ul class="task-bullet">
<li>
Aggregate the current hourly time series values to the monthly maximum value in each of the stations.
.. ipython:: python
monthly_max = no_2.resample("M").max()
monthly_max
A very powerful method on time series data with a datetime index is
:meth:`~Series.resample`, which converts a time series to another frequency (e.g.,
converting secondly data into 5-minutely data).
.. raw:: html
</li>
</ul>
The :meth:`~Series.resample` method is similar to a groupby operation:
- it provides a time-based grouping, by using a string (e.g. ``M``,
``5H``,…) that defines the target frequency
- it requires an aggregation function such as ``mean``, ``max``,…
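The similarity with ``groupby`` can be illustrated on hypothetical hourly data. In the sketch below, resampling to daily maxima gives the same values as grouping on the index floored to whole days. The ``floor`` call is used here only for the comparison and is not part of this tutorial.

```python
import pandas as pd

# Hypothetical hourly series spanning four days.
idx = pd.date_range("2019-05-30", periods=96, freq="h")
series = pd.Series(range(96), index=idx, dtype="float64")

# Time-based grouping with a target frequency, followed by an aggregation.
daily_max = series.resample("D").max()

# A comparable result obtained with an explicit groupby on the day.
grouped_max = series.groupby(series.index.floor("D")).max()

print(daily_max)
print(grouped_max)
```

One difference is that the index returned by ``resample`` carries the target frequency, while the ``groupby`` result is grouped on plain key values.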
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
An overview of the aliases used to define time series frequencies is given in the :ref:`offset aliases overview table <timeseries.offset_aliases>`.
.. raw:: html
</div>
When defined, the frequency of the time series is provided by the
``freq`` attribute:
.. ipython:: python
monthly_max.index.freq
.. raw:: html
<ul class="task-bullet">
<li>
Make a plot of the daily mean :math:`NO_2` value in each of the stations.
.. ipython:: python
:okwarning:
@savefig 09_resample_mean.png
no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));
.. raw:: html
</li>
</ul>
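The sketch below applies a daily mean to a small, irregular table with hypothetical values for two stations. It also shows two details worth knowing: days without any measurement still appear in the result (as missing values), and the resampled index has a defined ``freq`` while the irregular input index does not.

```python
import pandas as pd

# Hypothetical irregular measurements; no values exist for the 21st of May.
idx = pd.to_datetime(["2019-05-20 00:00", "2019-05-20 12:00", "2019-05-22 06:00"])
frame = pd.DataFrame(
    {"FR04014": [10.0, 20.0, 30.0], "BETR801": [1.0, 3.0, 5.0]}, index=idx
)

# Daily mean per station; empty days are kept as rows with missing values.
daily = frame.resample("D").mean()
print(daily)

# The irregular input has no frequency, the resampled result does.
print(frame.index.freq, daily.index.freq)
```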
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
More details on the power of time series ``resampling`` are provided in the user guide section on :ref:`resampling <timeseries.resampling>`.
.. raw:: html
</div>
.. raw:: html
<div class="shadow gs-callout gs-callout-remember">
<h4>REMEMBER</h4>
- Valid date strings can be converted to datetime objects using the
``to_datetime`` function or as part of read functions.
- Datetime objects in pandas support calculations, logical operations
and convenient date-related properties using the ``dt`` accessor.
- A ``DatetimeIndex`` contains these date-related properties and
supports convenient slicing.
- ``resample`` is a powerful method to change the frequency of a time
series.
.. raw:: html
</div>
.. raw:: html
<div class="d-flex flex-row gs-torefguide">
<span class="badge badge-info">To user guide</span>
A full overview of time series functionality is given in the pages on :ref:`time series and date functionality <timeseries>`.
.. raw:: html
</div>