File: data_transformers.rst

.. _data-transformers:

Data Transformers
=================

Before a Vega-Lite or Vega specification can be passed to a renderer, it typically
has to be transformed in a number of ways:

* A Pandas ``DataFrame`` has to be sanitized and serialized to JSON.
* The rows of a ``DataFrame`` might need to be sampled or limited to a maximum number.
* The ``DataFrame`` might be written to a ``.csv`` or ``.json`` file for performance
  reasons.

These data transformations are managed by the data transformation API of Altair.

.. note::

    The data transformation API of Altair should not be confused with the ``transform``
    API of Vega and Vega-Lite.

A data transformer is a Python function that takes a Vega-Lite data ``dict`` or
Pandas ``DataFrame`` and returns a transformed version of either of these types::

    from typing import Union

    import pandas as pd

    Data = Union[dict, pd.DataFrame]

    def data_transformer(data: Data) -> Data:
        # Transform and return the data
        return transformed_data
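As a concrete illustration, here is a minimal custom transformer (a hypothetical ``drop_missing`` helper, not part of Altair) that removes incomplete rows before rendering:

```python
import pandas as pd

def drop_missing(data):
    """Hypothetical data transformer: drop rows containing missing values."""
    if isinstance(data, pd.DataFrame):
        return data.dropna()
    # Already-serialized dicts pass through unchanged
    return data

df = pd.DataFrame({'x': [1, 2, None], 'y': [4, None, 6]})
cleaned = drop_missing(df)  # keeps only rows with no missing values
```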

Dataset Consolidation
~~~~~~~~~~~~~~~~~~~~~
Datasets passed as Pandas dataframes can be represented in the chart in two
ways:

- As literal dataset values in the ``data`` attribute at any level of the
  specification
- As a named dataset in the ``datasets`` attribute of the top-level
  specification.

The former is simpler, but common usage patterns in Altair can often lead to
full datasets being listed multiple times in their entirety within a single
specification.

For this reason, Altair 2.2 and newer will by default move all
directly-specified datasets into the top-level ``datasets`` entry, and
reference them by a unique name determined from the hash of the data
representation. The benefit of using a hash-based name is that even if the
user specifies a dataset in multiple places when building the chart, the
specification will only include one copy.
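The naming scheme can be sketched as follows (an illustrative stand-in, not Altair's actual hashing code; the hash function and name format Altair uses may differ):

```python
import hashlib
import json

def consolidated_name(values):
    # Illustrative sketch: derive a stable name from a hash of the
    # serialized data, so identical datasets map to the same entry
    # in the top-level ``datasets`` dict.
    raw = json.dumps(values, sort_keys=True).encode()
    return 'data-' + hashlib.md5(raw).hexdigest()

values = [{'x': 0, 'y': 1}, {'x': 1, 'y': 3}]
# The same data always yields the same name, so two charts built
# from the same DataFrame reference a single shared dataset.
```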

This behavior can be modified by setting the ``consolidate_datasets`` attribute
of the data transformer.

For example, consider this simple layered chart:

.. altair-plot::
   :chart-var-name: chart
		    
   import altair as alt
   import pandas as pd

   df = pd.DataFrame({'x': range(5),
                      'y': [1, 3, 4, 3, 5]})

   line = alt.Chart(df).mark_line().encode(x='x', y='y')
   points = alt.Chart(df).mark_point().encode(x='x', y='y')
   chart = line + points

If we look at the resulting specification, we see that although the dataset
was specified twice, only one copy of it is output in the spec:

.. altair-plot::
   :output: stdout

   from pprint import pprint
   pprint(chart.to_dict())

This consolidation of datasets is an extra bit of processing that is turned on
by default in all renderers.

If you would like to disable this dataset consolidation for any reason, you can
do so by setting ``alt.data_transformers.consolidate_datasets = False``, or
by using the ``enable()`` context manager to do it only temporarily:

.. altair-plot::
   :output: stdout

   with alt.data_transformers.enable(consolidate_datasets=False):
       pprint(chart.to_dict())
   
Notice that now the dataset is not specified within the top-level ``datasets``
attribute, but rather as values within the ``data`` attribute of each
individual layer. This duplication of data is the reason that dataset
consolidation is set to ``True`` by default.


Built-in Data Transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~

Altair includes a default set of data transformers with the following signatures.

Raise a ``MaxRowsError`` if a ``DataFrame`` has more than ``max_rows`` rows::

    limit_rows(data, max_rows=5000)

Randomly sample a ``DataFrame`` (without replacement) before visualizing::

    sample(data, n=None, frac=None)

Convert a ``DataFrame`` to a separate ``.json`` file before visualization::

    to_json(data, prefix='altair-data')

Convert a ``DataFrame`` to a separate ``.csv`` file before visualization::

    to_csv(data, prefix='altair-data')

Convert a ``DataFrame`` to inline JSON values before visualization::

    to_values(data)
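The behavior of ``limit_rows`` can be sketched like this (a simplified stand-in for illustration, not Altair's actual implementation):

```python
import pandas as pd

class MaxRowsError(ValueError):
    """Raised when a dataset exceeds the configured row limit."""

def limit_rows(data, max_rows=5000):
    # Simplified sketch: pass the data through unchanged unless it is a
    # DataFrame with more rows than the limit, in which case raise.
    if isinstance(data, pd.DataFrame) and len(data) > max_rows:
        raise MaxRowsError(
            f'The dataset has more than {max_rows} rows; '
            'consider enabling a different data transformer.')
    return data
```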

Piping
~~~~~~

Multiple data transformers can be piped together using ``pipe``::

    from altair import limit_rows, to_values
    from toolz.curried import pipe
    pipe(data, limit_rows(10000), to_values)
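Under the hood, ``pipe`` simply threads the data through each function in order, so ``pipe(data, f, g)`` is equivalent to ``g(f(data))``. A minimal stand-in for ``toolz.curried.pipe`` (for illustration only) looks like:

```python
from functools import reduce

def pipe(data, *funcs):
    # Apply each function to the result of the previous one:
    # pipe(x, f, g) == g(f(x))
    return reduce(lambda acc, f: f(acc), funcs, data)

# Example: limit to positive values, then double each one
result = pipe([1, -2, 3],
              lambda xs: [x for x in xs if x > 0],
              lambda xs: [2 * x for x in xs])
```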

Managing Data Transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~

Altair maintains a registry of data transformers, which includes a default
data transformer that is automatically applied to all ``DataFrame``\ s before rendering.

To see the registered transformers::

    >>> import altair as alt
    >>> alt.data_transformers.names()
    ['default', 'json', 'csv']

The default data transformer is the following::

    def default_data_transformer(data):
        return pipe(data, limit_rows, to_values)

The ``json`` and ``csv`` data transformers will save a ``DataFrame`` to a temporary
``.json`` or ``.csv`` file before rendering. There are a number of performance
advantages to these two data transformers:

* The full dataset will not be saved in the notebook document.
* The performance of the Vega-Lite/Vega JavaScript appears to be better
  for standalone JSON/CSV files than for inline values.

There are also disadvantages to the JSON/CSV data transformers:

* The ``DataFrame`` will be exported to a temporary ``.json`` or ``.csv``
  file that sits next to the notebook.
* That notebook will not be able to re-render the visualization without
  that temporary file (or re-running the cell).

In our experience, the performance improvement is significant enough that
we recommend using the ``json`` data transformer for any large datasets::

    alt.data_transformers.enable('json')

We hope that others will write additional data transformers. Imagine, for
example, a transformer which saves the dataset to a JSON file on S3; it could
be registered and enabled as::

    alt.data_transformers.register('s3', lambda data: pipe(data, to_s3('mybucket')))
    alt.data_transformers.enable('s3')


Storing JSON Data in a Separate Directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When creating many charts with ``alt.data_transformers.enable('json')``, the
working directory can get a bit cluttered. To avoid this, we can build a simple
custom data transformer that stores all JSON files in a separate directory::

    import os
    import altair as alt
    from toolz.curried import pipe


    def json_dir(data, data_dir='altairdata'):
        os.makedirs(data_dir, exist_ok=True)
        return pipe(data, alt.to_json(filename=data_dir + '/{prefix}-{hash}.{extension}'))


    alt.data_transformers.register('json_dir', json_dir)
    alt.data_transformers.enable('json_dir', data_dir='mydata')

After enabling this data transformer, the JSON files will be stored in the
directory given by ``data_dir`` when the transformer was enabled
(``'altairdata'`` by default). All we had to do was prefix the ``filename``
argument of the ``alt.to_json`` function with our desired directory and make
sure that the directory actually exists.