.. _developer:
{{ header }}
.. currentmodule:: pandas
*********
Developer
*********
This section will focus on downstream applications of pandas.
.. _apache.parquet:
Storing pandas DataFrame objects in Apache Parquet format
---------------------------------------------------------
The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:
.. code-block:: shell
5: optional list<KeyValue> key_value_metadata
where ``KeyValue`` is
.. code-block:: shell
struct KeyValue {
1: required string key
2: optional string value
}
So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key in the ``FileMetaData`` with the value stored as:
.. code-block:: text
{'index_columns': [<descr0>, <descr1>, ...],
'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
'columns': [<c0>, <c1>, ...],
'pandas_version': $VERSION,
'creator': {
'library': $LIBRARY,
'version': $LIBRARY_VERSION
}}
The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
strings (referring to a column) or dictionaries with values as described below.
The ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
for each column, *including the index columns*. This has JSON form:
.. code-block:: text
{'name': column_name,
'field_name': parquet_column_name,
'pandas_type': pandas_type,
'numpy_type': numpy_type,
'metadata': metadata}
See below for the detailed specification for these.
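
As a rough illustration of the layout above, the following sketch assembles the
metadata dictionary in plain Python and round-trips it through JSON text. The
helper names ``column_metadata`` and ``build_pandas_metadata`` are hypothetical,
the version strings are placeholders, and encoding the value as JSON text under
a ``pandas`` key is an assumption of this sketch, not a statement of how any
particular writer does it.

.. code-block:: python

    import json


    def column_metadata(name, field_name, pandas_type, numpy_type, metadata=None):
        """Return one column entry with the five keys of the JSON form above."""
        return {
            "name": name,
            "field_name": field_name,
            "pandas_type": pandas_type,
            "numpy_type": numpy_type,
            "metadata": metadata,
        }


    def build_pandas_metadata(index_columns, column_indexes, columns):
        """Assemble the top-level dictionary (hypothetical helper)."""
        return {
            "index_columns": index_columns,
            "column_indexes": column_indexes,
            "columns": columns,
            "pandas_version": "1.4.0",  # placeholder value
            "creator": {"library": "pyarrow", "version": "0.13.0"},  # placeholders
        }


    columns = [
        column_metadata("c0", "c0", "int8", "int8"),
        column_metadata(None, "__index_level_0__", "int64", "int64"),
    ]
    meta = build_pandas_metadata(["__index_level_0__"], [], columns)

    # KeyValue entries hold strings, so this sketch encodes the dictionary as JSON.
    key_value = {"pandas": json.dumps(meta)}
    decoded = json.loads(key_value["pandas"])

Because the dictionary contains only strings, lists, dictionaries and ``None``,
decoding the JSON text yields a structure equal to the one that was encoded.
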
Index metadata descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~
``RangeIndex`` can be stored as metadata only, not requiring serialization. The
descriptor format for these is as follows:
.. code-block:: python
index = pd.RangeIndex(0, 10, 2)
{
"kind": "range",
"name": index.name,
"start": index.start,
"stop": index.stop,
"step": index.step,
}
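
Since a range descriptor carries ``start``, ``stop`` and ``step``, a reader can
rebuild the index without touching any data column. The sketch below uses
Python's built-in ``range`` as a stand-in for ``pd.RangeIndex`` (both expose
``start``, ``stop`` and ``step``); the function names are hypothetical.

.. code-block:: python

    def range_descriptor(index, name=None):
        """Describe a range-like object using the descriptor keys shown above."""
        return {
            "kind": "range",
            "name": name,
            "start": index.start,
            "stop": index.stop,
            "step": index.step,
        }


    def range_from_descriptor(descr):
        """Rebuild the range from its descriptor; reject other descriptor kinds."""
        if descr.get("kind") != "range":
            raise ValueError(f"not a range descriptor: {descr!r}")
        return range(descr["start"], descr["stop"], descr["step"])


    descr = range_descriptor(range(0, 10, 2))
    restored = range_from_descriptor(descr)

With pandas available, the same three values could be passed to
``pd.RangeIndex`` together with ``name``.
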
Other index types must be serialized as data columns along with the other
DataFrame columns. The metadata for these is a string indicating the name of
the field in the data columns, for example ``'__index_level_0__'``.
If an index has a non-None ``name`` attribute, and there is no other column
with a name matching that value, then the ``index.name`` value can be used as
the descriptor. Otherwise (for unnamed indexes and ones with names colliding
with other column names) a disambiguating name with pattern matching
``__index_level_\d+__`` should be used. In cases of named indexes as data
columns, the ``name`` attribute is always stored in the column descriptors as
above.
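
The naming rule for serialized indexes can be sketched as follows. This is a
minimal illustration under stated assumptions, not the code used by any
particular writer: ``index_field_name`` and ``index_descriptors`` are
hypothetical helpers, and index names are assumed to be strings or ``None``.

.. code-block:: python

    import re

    # Pattern for the disambiguating names described above.
    INDEX_LEVEL_PATTERN = re.compile(r"__index_level_\d+__")


    def index_field_name(level, index_name, column_names):
        """Pick the descriptor string for one index level."""
        if index_name is not None and index_name not in column_names:
            # A named index that collides with no column keeps its own name.
            return str(index_name)
        # Unnamed or colliding indexes get a positional placeholder name.
        return f"__index_level_{level}__"


    def index_descriptors(index_names, column_names):
        """Apply the rule to every level of a (possibly multi-level) index."""
        return [
            index_field_name(level, name, column_names)
            for level, name in enumerate(index_names)
        ]


    # "id" is free, the second level is unnamed, and "c0" collides with a column.
    descriptors = index_descriptors(["id", None, "c0"], ["c0", "c1"])
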
Column metadata
~~~~~~~~~~~~~~~
``pandas_type`` is the logical type of the column, and is one of:
* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Date and Time Types: ``'datetime', 'datetimetz', 'timedelta'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``
The ``numpy_type`` is the physical storage type of the column, which is the
result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be
any of the supported integer categorical types.
The ``metadata`` field is ``None`` except for:
* ``datetimetz``: ``{'timezone': zone, 'unit': 'ns'}``, e.g. ``{'timezone':
'America/New_York', 'unit': 'ns'}``. The ``'unit'`` is optional, and if
omitted it is assumed to be nanoseconds.
* ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}``
* ``'type'`` is optional, and can be a nested pandas type specification
(but not categorical)
* ``unicode``: ``{'encoding': encoding}``
* The encoding is optional, and if not present is UTF-8
* ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored
in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of:
* ``'pickle'``
* ``'bson'``
* ``'json'``
* ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted
it is assumed to be nanoseconds. This metadata is optional altogether.
For types other than these, the ``'metadata'`` key can be
omitted. Implementations can assume ``None`` if the key is not present.
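
A reader might apply the defaults listed above along the following lines. The
function ``resolve_metadata`` is a hypothetical helper written for this sketch;
it covers only the defaults named in this section (a missing ``'metadata'``
key, the nanosecond unit, and the UTF-8 encoding).

.. code-block:: python

    def resolve_metadata(column):
        """Return the column's metadata with the documented defaults filled in."""
        pandas_type = column["pandas_type"]
        # A missing 'metadata' key is treated the same as an explicit None.
        metadata = dict(column.get("metadata") or {})
        if pandas_type in ("datetimetz", "timedelta"):
            metadata.setdefault("unit", "ns")  # unit defaults to nanoseconds
        elif pandas_type == "unicode":
            metadata.setdefault("encoding", "UTF-8")  # encoding defaults to UTF-8
        # Types without any metadata keep None.
        return metadata or None


    tz_meta = resolve_metadata(
        {"pandas_type": "datetimetz", "metadata": {"timezone": "America/New_York"}}
    )
    text_meta = resolve_metadata({"pandas_type": "unicode", "metadata": None})
    int_meta = resolve_metadata({"pandas_type": "int8"})

Copying the dictionary before filling defaults leaves the caller's column
entry unchanged.
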
As an example of fully-formed metadata:
.. code-block:: text
{'index_columns': ['__index_level_0__'],
'column_indexes': [
{'name': None,
'field_name': 'None',
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}
],
'columns': [
{'name': 'c0',
'field_name': 'c0',
'pandas_type': 'int8',
'numpy_type': 'int8',
'metadata': None},
{'name': 'c1',
'field_name': 'c1',
'pandas_type': 'bytes',
'numpy_type': 'object',
'metadata': None},
{'name': 'c2',
'field_name': 'c2',
'pandas_type': 'categorical',
'numpy_type': 'int16',
'metadata': {'num_categories': 1000, 'ordered': False}},
{'name': 'c3',
'field_name': 'c3',
'pandas_type': 'datetimetz',
'numpy_type': 'datetime64[ns]',
'metadata': {'timezone': 'America/Los_Angeles'}},
{'name': 'c4',
'field_name': 'c4',
'pandas_type': 'object',
'numpy_type': 'object',
'metadata': {'encoding': 'pickle'}},
{'name': None,
'field_name': '__index_level_0__',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None}
],
'pandas_version': '1.4.0',
'creator': {
'library': 'pyarrow',
'version': '0.13.0'
}}
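
Since ``'columns'`` lists index columns alongside data columns, a consumer has
to separate them using ``'index_columns'``. The sketch below does this by
matching ``field_name`` against the string descriptors; ``split_columns`` is a
hypothetical helper, and the dictionary is a trimmed copy of the example above.

.. code-block:: python

    def split_columns(meta):
        """Separate data columns from serialized index columns by field name."""
        # Dictionary descriptors (such as ranges) have no data column to match.
        index_fields = {d for d in meta["index_columns"] if isinstance(d, str)}
        data_columns, index_columns = [], []
        for column in meta["columns"]:
            is_index = column["field_name"] in index_fields
            (index_columns if is_index else data_columns).append(column)
        return data_columns, index_columns


    # A trimmed copy of the example above (two of its six column entries).
    example = {
        "index_columns": ["__index_level_0__"],
        "columns": [
            {"name": "c0", "field_name": "c0", "pandas_type": "int8",
             "numpy_type": "int8", "metadata": None},
            {"name": None, "field_name": "__index_level_0__",
             "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
        ],
    }

    data_columns, index_columns = split_columns(example)
    data_names = [c["name"] for c in data_columns]
    index_fields = [c["field_name"] for c in index_columns]
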