Create and Store Dask DataFrames
================================
Dask can create DataFrames from various data storage formats like CSV, HDF,
Apache Parquet, and others. For most formats, this data can live on various
storage systems including local disk, network file systems (NFS), the Hadoop
File System (HDFS), and Amazon S3. The exception is HDF, which is only
available on POSIX-like file systems.

See the :doc:`Overview section <dataframe-overview>` for an in-depth
discussion of ``dask.dataframe`` scope, use, and limitations.
API
---
The following functions convert between Dask DataFrames, file formats, and
other Dask or Python collections.

.. currentmodule:: dask.dataframe

File Formats:

.. autosummary::

   read_csv
   read_parquet
   read_hdf
   read_orc
   read_json
   read_sql_table
   from_bcolz
   from_array
   to_csv
   to_parquet
   to_hdf

Dask Collections:

.. autosummary::

   from_delayed
   from_dask_array
   dask.bag.core.Bag.to_dataframe
   DataFrame.to_delayed
   to_records
   to_bag

Pandas:

.. autosummary::

   from_pandas
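
As a rough illustration of how a few of these functions fit together, the
sketch below converts a small pandas DataFrame with ``from_pandas``, stores it
with ``to_csv``, and loads it again with ``read_csv``. The column names, the
``export-*.csv`` file name, and the temporary directory are made up for the
example:

.. code-block:: python

   import os
   import tempfile

   import pandas as pd
   import dask.dataframe as dd

   # Small in-memory pandas DataFrame, used purely for illustration.
   pdf = pd.DataFrame({"id": range(6),
                       "price": [1.0, 2.5, 3.0, 4.5, 5.0, 6.5]})

   # from_pandas splits the pandas object into partitions.
   ddf = dd.from_pandas(pdf, npartitions=2)

   with tempfile.TemporaryDirectory() as tmpdir:
       # The "*" in the target name is replaced by the partition number,
       # so each partition is written to its own file.
       written = ddf.to_csv(os.path.join(tmpdir, "export-*.csv"), index=False)

       # Reading the same glob pattern back gives a lazy Dask DataFrame;
       # compute() materializes it as a pandas DataFrame.
       roundtrip = dd.read_csv(os.path.join(tmpdir, "export-*.csv")).compute()

Writing one file per partition is what lets each partition be stored and
loaded independently.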
Locations
---------
For text, CSV, and Apache Parquet formats, data can come from local disk,
the Hadoop File System, Amazon S3, or other sources. Select the source by
prepending a protocol to the filenames:

.. code-block:: python

   >>> import dask.dataframe as dd
   >>> df = dd.read_csv('my-data-*.csv')
   >>> df = dd.read_csv('hdfs:///path/to/my-data-*.csv')
   >>> df = dd.read_csv('s3://bucket-name/my-data-*.csv')

Remote systems like HDFS or S3 usually require credentials. These are normally
read from configuration files on disk (such as a ``.boto`` file for S3),
but in some cases you may want to pass storage-specific options through to the
storage backend. You can do this with the ``storage_options=`` keyword:

.. code-block:: python

   >>> df = dd.read_csv('s3://bucket-name/my-data-*.csv',
   ...                  storage_options={'anon': True})
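
The protocol prefix can be tried out without any remote service by using the
local ``file://`` protocol. The following is a minimal sketch, assuming a
POSIX-style absolute path so that ``file://`` plus the path forms a valid URL.
The file names and columns are invented for the example:

.. code-block:: python

   import os
   import tempfile

   import pandas as pd
   import dask.dataframe as dd

   with tempfile.TemporaryDirectory() as tmpdir:
       # Write three small CSV files that match the glob used below.
       for i in range(3):
           part = pd.DataFrame({"id": [2 * i, 2 * i + 1],
                                "value": [0.5, 1.5]})
           part.to_csv(os.path.join(tmpdir, f"my-data-{i}.csv"), index=False)

       # Explicit local protocol; a bare path would behave the same way.
       pattern = "file://" + os.path.join(tmpdir, "my-data-*.csv")
       df = dd.read_csv(pattern)

       # Files this small each end up as one partition.
       n_partitions = df.npartitions
       ids = sorted(df["id"].compute())

Swapping the prefix for ``hdfs://`` or ``s3://`` keeps the call shape the same;
only the backend, and possibly ``storage_options``, changes.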
Dask Delayed
------------
For more complex situations not covered by the functions above, you may want to
use :doc:`dask.delayed<delayed>`, which lets you construct Dask DataFrames out
of arbitrary Python function calls that load DataFrames. This can allow you to
handle new formats easily or bake in particular logic around loading data if,
for example, your data is stored with some special format.

See :doc:`documentation on using dask.delayed with
collections<delayed-collections>` or an `example notebook
<https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca>`_ showing
how to create a Dask DataFrame from a nested directory structure of Feather
files (as a stand-in for any custom file format).

Dask delayed is particularly useful when simple ``map`` operations aren't
sufficient to capture the complexity of your data layout.
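
A minimal sketch of this pattern follows. The pipe-delimited file format, the
``load_partition`` helper, and the column names are hypothetical and exist only
to show the shape of the approach:

.. code-block:: python

   import os
   import tempfile

   import pandas as pd
   import dask
   import dask.dataframe as dd


   def load_partition(path):
       """Hypothetical loader for a custom, pipe-delimited text format."""
       frame = pd.read_csv(path, sep="|")
       # Example of logic baked into loading: add a derived column.
       frame["total"] = frame["price"] * frame["quantity"]
       return frame


   with tempfile.TemporaryDirectory() as tmpdir:
       # Create three one-row stand-in files.
       paths = []
       for i in range(3):
           path = os.path.join(tmpdir, f"part-{i}.txt")
           with open(path, "w") as f:
               f.write("id|price|quantity\n")
               f.write(f"{i}|{i * 1.5}|{i + 1}\n")
           paths.append(path)

       # Wrap each call lazily; no file is read at this point.
       parts = [dask.delayed(load_partition)(p) for p in paths]

       # meta describes the column names and dtypes of every partition.
       meta = pd.DataFrame({
           "id": pd.Series(dtype="int64"),
           "price": pd.Series(dtype="float64"),
           "quantity": pd.Series(dtype="int64"),
           "total": pd.Series(dtype="float64"),
       })
       df = dd.from_delayed(parts, meta=meta)
       result = df.compute()

Each delayed call becomes one partition, so the number of partitions equals the
number of files.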
From Raw Dask Graphs
--------------------
This section is mainly for developers wishing to extend ``dask.dataframe``. It
discusses internal API not normally needed by users. Everything below can be
done just as effectively with :doc:`dask.delayed<delayed>` described
just above. You should never need to create a DataFrame object by hand.
To construct a DataFrame manually from a dask graph you need the following
information:

1. Dask: a Dask graph with keys like ``{(name, 0): ..., (name, 1): ...}`` as
   well as any other tasks on which those tasks depend. The tasks
   corresponding to ``(name, i)`` should produce ``pandas.DataFrame`` objects
   that correspond to the columns and divisions information discussed below.
2. Name: the special name used above.
3. Columns: a list of column names.
4. Divisions: a list of index values that separate the different partitions.
   Alternatively, if you don't know the divisions (this is common), you can
   provide a list of ``[None, None, None, ...]`` with one more entry than you
   have partitions. For more information, see the Partitions section in the
   :doc:`DataFrame documentation <dataframe>`.
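
The divisions convention can be inspected on an ordinary Dask DataFrame. The
sketch below uses ``from_pandas`` on a made-up frame with a sorted integer
index, then discards the divisions with ``clear_divisions`` to show the
``None`` placeholders:

.. code-block:: python

   import pandas as pd
   import dask.dataframe as dd

   # Nine rows with a sorted integer index, split into three partitions.
   pdf = pd.DataFrame({"price": range(9)}, index=range(9))
   ddf = dd.from_pandas(pdf, npartitions=3)

   # With a sorted index, from_pandas records the boundary index values:
   # one per partition start, plus the last index value.
   divisions = ddf.divisions
   n_partitions = ddf.npartitions

   # clear_divisions() forgets the boundaries, leaving None placeholders.
   unknown = ddf.clear_divisions()

In both cases the divisions sequence has one more entry than there are
partitions.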

As an example, we manually build a DataFrame that reads several CSV files, one
per day, each with a datetime index. You should **never** do this in practice,
because the ``dd.read_csv`` function does it for you:

.. code-block:: python

   import pandas as pd
   from pandas import Timestamp
   import dask.dataframe as dd

   # One task per partition; each loads a single day's CSV file
   dsk = {('mydf', 0): (pd.read_csv, 'data/2000-01-01.csv'),
          ('mydf', 1): (pd.read_csv, 'data/2000-01-02.csv'),
          ('mydf', 2): (pd.read_csv, 'data/2000-01-03.csv')}
   name = 'mydf'
   columns = ['price', 'name', 'id']
   # Three partitions need four division boundaries
   divisions = [Timestamp('2000-01-01 00:00:00'),
                Timestamp('2000-01-02 00:00:00'),
                Timestamp('2000-01-03 00:00:00'),
                Timestamp('2000-01-03 23:59:59')]

   df = dd.DataFrame(dsk, name, columns, divisions)
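
For comparison, the same one-file-per-day layout can be expressed with
``dask.delayed`` and ``from_delayed``, which accepts the divisions directly.
This is only a sketch: the data is generated on the fly, the ``timestamp``
index name is an assumption, and ``meta`` is left out so that Dask infers it by
loading the first partition:

.. code-block:: python

   import os
   import tempfile

   import pandas as pd
   import dask
   import dask.dataframe as dd

   days = ["2000-01-01", "2000-01-02", "2000-01-03"]

   with tempfile.TemporaryDirectory() as tmpdir:
       # Stand-in data: one CSV per day, two rows each, datetime index.
       paths = []
       for day in days:
           index = pd.DatetimeIndex([f"{day} 00:00:00", f"{day} 12:00:00"],
                                    name="timestamp")
           frame = pd.DataFrame({"price": [1.0, 2.0], "id": [1, 2]},
                                index=index)
           path = os.path.join(tmpdir, f"{day}.csv")
           frame.to_csv(path)
           paths.append(path)

       # One lazy pd.read_csv call per file.
       parts = [dask.delayed(pd.read_csv)(p, index_col="timestamp",
                                          parse_dates=True)
                for p in paths]

       # Three partitions need four boundaries, as in the graph example.
       divisions = [pd.Timestamp(d) for d in days]
       divisions.append(pd.Timestamp("2000-01-03 23:59:59"))

       df = dd.from_delayed(parts, divisions=divisions)
       result = df.compute()

Passing ``meta`` explicitly would avoid the eager load of the first partition.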