Backend File-systems
====================

Fastparquet can use file-systems other than the local disk for reading and writing Parquet data.

One example of such a backend file-system is `s3fs <http://s3fs.readthedocs.io>`_, which connects to
AWS's S3 storage. In the following, the login credentials are inferred automatically from the system
(from environment variables or one of several possible configuration files).

.. code-block:: python

    import s3fs
    from fastparquet import ParquetFile

    # Credentials are inferred from the environment or AWS config files
    s3 = s3fs.S3FileSystem()
    myopen = s3.open
    pf = ParquetFile('/mybucket/data.parquet', open_with=myopen)
    df = pf.to_pandas()

The function ``myopen`` passed as ``open_with`` must be callable as ``f(path, mode)``
and return an open, file-like object usable as a context manager.
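
Any callable with that signature will do. As a minimal sketch (the local path and
the gcsfs example are illustrative assumptions, not part of the fastparquet API):

.. code-block:: python

    # The builtin open already has the f(path, mode) signature,
    # so it satisfies the open_with contract for local files:
    pf_local = ParquetFile('/tmp/data.parquet', open_with=open)

    # The open method of other fsspec-compatible file-systems works the
    # same way, e.g. gcsfs for Google Cloud Storage (assuming gcsfs is
    # installed and credentials are configured):
    import gcsfs
    gcs = gcsfs.GCSFileSystem()
    pf_gcs = ParquetFile('mybucket/data.parquet', open_with=gcs.open)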

The resulting ``pf`` object is the same as one generated locally, and creating it requires only a
relatively short read from the remote store. If ``/mybucket/data.parquet`` contains a sub-key named
``_metadata``, that file is read in preference, and the data-set is assumed to be multi-file.
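
Since only the metadata has been read at this point, ``pf`` can be inspected cheaply,
and selecting columns limits how much data is actually transferred. A short sketch
(the column names here are placeholders):

.. code-block:: python

    # Metadata-only inspection: no row-group data has been fetched yet
    print(pf.columns)   # column names from the parquet schema
    print(pf.dtypes)    # inferred pandas dtypes

    # Download and decode only the columns that are needed
    df = pf.to_pandas(columns=['a', 'b'])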


Similarly, by providing an open function and a second one to create any necessary directories
(needed only in multi-file mode), we can write to the S3 file-system. Here ``data`` is assumed
to be a pandas DataFrame:

.. code-block:: python

   from fastparquet import write

   noop = lambda *args, **kwargs: None  # S3 has no real directories to make
   write('/mybucket/output_parq', data, file_scheme='hive',
         row_group_offsets=[0, 500], open_with=myopen, mkdirs=noop)

(In the case of S3, no intermediate directories need to be created, so a no-op ``mkdirs`` is sufficient.)
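
Reading the data back works exactly as in the read example above: with ``file_scheme='hive'``,
fastparquet writes a ``_metadata`` file alongside the data files, so the directory is recognised
as a multi-file data-set. A sketch, continuing the example:

.. code-block:: python

    # The _metadata file written by file_scheme='hive' is found
    # automatically, so the whole directory reads as one data-set
    pf = ParquetFile('/mybucket/output_parq', open_with=myopen)
    df_roundtrip = pf.to_pandas()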

