File: quickstart.rst

Quickstart
==========

You may already be using fastparquet via the Dask or Pandas APIs. The options
available, with ``engine="fastparquet"``, are essentially the same as given here
and in our :doc:`api` docs.
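
For example, going through the Pandas API with the fastparquet engine might look like the
following sketch (the file paths are placeholders):

.. code-block:: python

    import pandas as pd

    # Pandas dispatches to fastparquet when engine="fastparquet" is passed;
    # 'myfile.parq' and 'out.parq' are placeholders for your own files.
    df = pd.read_parquet('myfile.parq', engine='fastparquet')
    df.to_parquet('out.parq', engine='fastparquet')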

Reading
-------

To open and read the contents of a Parquet file:

.. code-block:: python

    from fastparquet import ParquetFile
    pf = ParquetFile('myfile.parq')
    df = pf.to_pandas()

The Pandas data-frame, ``df``, will contain all columns in the target file, and all
row-groups concatenated together. If the data is a multi-file collection, such as
generated by Hadoop, the filename to supply is
either the directory name or the ``_metadata`` file contained therein; these are
handled transparently.
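
For instance, for a hypothetical multi-file dataset stored under ``mydata/``, either of the
following should work:

.. code-block:: python

    from fastparquet import ParquetFile

    # point at the directory containing the data files ...
    pf = ParquetFile('mydata/')
    # ... or at the _metadata file inside it, if one was written
    pf = ParquetFile('mydata/_metadata')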

One may wish to investigate the meta-data associated with the data before loading,
for example, to choose which row-groups and columns to load. The properties ``columns``,
``count``, ``dtypes`` and ``statistics`` are available
to assist with this, along with a summary in ``info``.
In addition, if the data is in a hierarchical directory-partitioned
structure, then the property ``cats`` gives the possible values of each partitioning field.
You can get a deeper view of the parquet schema with ``print(pf.schema)``.
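
A quick inspection using these attributes might look like the following sketch (the exact
output depends on your file):

.. code-block:: python

    pf = ParquetFile('myfile.parq')
    print(pf.info)        # summary: number of rows, columns, partitions, ...
    print(pf.columns)     # list of column names
    print(pf.count())     # total number of rows (a plain attribute in some older versions)
    print(pf.dtypes)      # mapping of column name to pandas dtype
    print(pf.statistics)  # per-row-group min/max/null-count statistics
    print(pf.cats)        # partitioning fields and their values, if any
    print(pf.schema)      # the full parquet schema tree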

You may specify which columns to load, which of those to keep as categoricals
(if the data uses dictionary encoding), and which column to use as the
pandas index. By selecting columns, we only access parts of the file,
and efficiently skip columns that are not of interest.

.. code-block:: python

    df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])
    # or
    df2 = pf.to_pandas(['col1', 'col2'], categories={'col1': 12})

where the second version specifies the number of expected labels for that
column. If the data originated from pandas, the categories will already be known.
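
To also use one of the stored columns as the pandas index, pass the ``index`` argument of
``to_pandas``; a minimal sketch with the same hypothetical column names:

.. code-block:: python

    # load two columns and set one of them as the data-frame index
    df2 = pf.to_pandas(['col1', 'col2'], index='col1')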

Furthermore, whole row-groups can be skipped by providing a list of filters. There is no need to
return the filtering column as a column in the data-frame. Note that a row-group is only skipped
if none of its data could possibly meet the specified requirements.

.. code-block:: python

    df3 = pf.to_pandas(['col1', 'col2'], filters=[('col3', 'in', [1, 2, 3, 4])])

(new in :ref:`0.7.0`: row-level filtering)
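
To additionally drop non-matching rows within the row-groups that are kept, the ``row_filter``
argument can be combined with ``filters``; a sketch, assuming ``row_filter`` takes a boolean flag:

.. code-block:: python

    # keep only rows where col3 is in the given set, not just whole row-groups
    df3 = pf.to_pandas(['col1', 'col2'],
                       filters=[('col3', 'in', [1, 2, 3, 4])],
                       row_filter=True)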


Writing
-------

To create a single Parquet file from a dataframe:

.. code-block:: python

    from fastparquet import write
    write('outfile.parq', df)

The function ``write`` provides a number of options. The default is to produce a single output file
with row-groups of up to 50M rows, with plain encoding and no compression. The
performance will therefore be similar to simple binary packing such as ``numpy.save``
for numerical columns.

Further options that may be of interest are:

#. the compression algorithm (typically "snappy": fast, but not especially space-efficient), which can
   vary by column
#. the row-group splits to apply, which may lead to efficiencies on loading, if some row-groups can be skipped.
   Statistics (min/max) are calculated for each column in each row-group on the fly.
#. multi-file saving, enabled with ``file_scheme="hive"`` or ``"drill"``: directory-tree-partitioned output
   with a single metadata file and several data-files, one or more per leaf directory. The values used for
   partitioning are encoded into the paths of the data files.
#. appending to existing data sets with ``append=True``, adding new row-groups. For the specific case of
   "hive"-partitioned data and one file per partition, ``append="overwrite"`` is also allowed, which replaces
   partitions of the data where new data are given, but leaves other existing partitions untouched
   (a sketch follows the example below).

.. code-block:: python

    write('outdir.parq', df, row_group_offsets=[0, 10000, 20000],
          compression='GZIP', file_scheme='hive')
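
Building on that, a hedged sketch of appending to an existing hive-partitioned dataset
(``df_new``, the paths and the partition column are placeholders):

.. code-block:: python

    # add new row-groups to an existing hive-style dataset
    write('outdir.parq', df_new, file_scheme='hive',
          partition_on=['col3'], append=True)

    # replace only those partitions for which new data are supplied,
    # leaving other existing partitions untouched
    write('outdir.parq', df_new, file_scheme='hive',
          partition_on=['col3'], append='overwrite')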


