.. module:: petl.io
.. _io_usage:

Usage - reading/writing tables
==============================

`petl` uses simple Python functions to provide a rows-and-columns abstraction
for reading and writing data from files, databases, and other sources.

The main features that `petl` was designed for are:

- Pure Python implementation based on `streams <https://docs.python.org/3/library/io.html>`_,
  `iterators <https://docs.python.org/3/library/stdtypes.html?highlight=iterator#iterator-types>`_,
  and other Python types.
- Extensible approach, requiring package dependencies only when their
  functionality is used.
- A DataFrame/table-like paradigm similar to that of pandas, R, and others.
- Lightweight to develop and maintain compared to heavier, full-featured
  frameworks such as PySpark, PyArrow, and other ETL tools.

.. _io_overview:

Brief Overview
--------------

.. _io_extract:

Extract (read)
^^^^^^^^^^^^^^

The "from..." functions extract a table from a file-like source or
database. For everything except :func:`petl.io.db.fromdb` the
``source`` argument provides information about where to extract the
underlying data from. If the ``source`` argument is ``None`` or a
string it is interpreted as follows:

* ``None`` - read from stdin
* string starting with `http://`, `https://` or `ftp://` - read from URL
* string ending with `.gz` or `.bgz` - read from file via gzip decompression
* string ending with `.bz2` - read from file via bz2 decompression
* any other string - read directly from file
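
The rules above can be sketched as a small classifier. This is only an
illustration of the listed cases, assuming they are checked in the order
shown; ``classify_read_source`` is a hypothetical helper and not part of
petl.

.. code-block:: python

    def classify_read_source(source):
        """Return a label describing how a ``source`` value would be read.

        Hypothetical helper for illustration only; it mirrors the list
        above and is not petl's implementation.
        """
        if source is None:
            return "stdin"
        if source.startswith(("http://", "https://", "ftp://")):
            return "url"
        if source.endswith((".gz", ".bgz")):
            return "gzip"
        if source.endswith(".bz2"):
            return "bz2"
        return "file"


    print(classify_read_source("example.csv.gz"))
    print(classify_read_source("https://example.org/data.csv"))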

.. _io_extract_codec:

Some helper classes are also available for reading from other types of
file-like sources, e.g., reading data from a Zip file, a string or a
subprocess, see the section on :ref:`io_helpers` below for more
information.

Be aware that loading data from stdin breaks the table container
convention, because data can usually only be read once. If you are
sure that data will only be read once in your script or interactive
session then this may not be a problem, however note that some
:mod:`petl` functions do access the underlying data source more than
once and so will not work as expected with data from stdin.
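
The read-once behaviour can be illustrated without petl. In this sketch,
which uses only the standard library, an ``io.StringIO`` object stands in
for stdin (an assumption made so that the example is repeatable). A second
pass over the same stream finds nothing left to read.

.. code-block:: python

    import csv
    import io

    # Stand-in for sys.stdin: a text stream that can be consumed only once.
    stream = io.StringIO("foo,bar\na,1\nb,2\n")

    # The first pass reads the header and both data rows.
    first_pass = list(csv.reader(stream))

    # The stream is now exhausted, so a second pass yields no rows.
    second_pass = list(csv.reader(stream))

    print(len(first_pass), len(second_pass))

Here ``first_pass`` holds all three rows, while ``second_pass`` is empty.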

.. _io_load:

Load (write)
^^^^^^^^^^^^

The "to..." functions load data from a table into a file-like source
or database. For functions that accept a ``source`` argument, if the
``source`` argument is ``None`` or a string it is interpreted as
follows:

* ``None`` - write to stdout
* string ending with `.gz` or `.bgz` - write to file via gzip compression
* string ending with `.bz2` - write to file via bz2 compression
* any other string - write directly to file
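
As a rough sketch of the suffix rules above, the following standard-library
code picks a compressing or plain opener from the file name.
``open_for_writing`` is a hypothetical helper, not part of petl, and the
``None`` (stdout) case is left out for brevity.

.. code-block:: python

    import bz2
    import gzip
    import os
    import tempfile


    def open_for_writing(path):
        """Open ``path`` for text writing, compressing according to its suffix.

        Hypothetical helper for illustration only; not petl's implementation.
        """
        if path.endswith((".gz", ".bgz")):
            return gzip.open(path, "wt", encoding="utf-8", newline="")
        if path.endswith(".bz2"):
            return bz2.open(path, "wt", encoding="utf-8", newline="")
        return open(path, "w", encoding="utf-8", newline="")


    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "example.csv.gz")
        with open_for_writing(target) as f:
            f.write("foo,bar\na,1\n")
        # Reading back through gzip confirms the data was compressed on write.
        with gzip.open(target, "rt", encoding="utf-8", newline="") as f:
            round_trip = f.read()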

.. _io_load_codec:

Some helper classes are also available for writing to other types of
file-like sources, e.g., writing to a Zip file or string buffer, see
the section on :ref:`io_helpers` below for more information.

.. _io_builtin_formats:

Built-in File Formats
---------------------

.. module:: petl.io.csv
.. _io_csv:

Python objects
^^^^^^^^^^^^^^

.. autofunction:: petl.io.base.fromcolumns

Delimited files
^^^^^^^^^^^^^^^

.. autofunction:: petl.io.csv.fromcsv
.. autofunction:: petl.io.csv.tocsv
.. autofunction:: petl.io.csv.appendcsv
.. autofunction:: petl.io.csv.teecsv
.. autofunction:: petl.io.csv.fromtsv
.. autofunction:: petl.io.csv.totsv
.. autofunction:: petl.io.csv.appendtsv
.. autofunction:: petl.io.csv.teetsv


.. module:: petl.io.pickle
.. _io_pickle:

Pickle files
^^^^^^^^^^^^

.. autofunction:: petl.io.pickle.frompickle
.. autofunction:: petl.io.pickle.topickle
.. autofunction:: petl.io.pickle.appendpickle
.. autofunction:: petl.io.pickle.teepickle


.. module:: petl.io.text
.. _io_text:

Text files
^^^^^^^^^^

.. autofunction:: petl.io.text.fromtext
.. autofunction:: petl.io.text.totext
.. autofunction:: petl.io.text.appendtext
.. autofunction:: petl.io.text.teetext


.. module:: petl.io.xml
.. _io_xml:

XML files
^^^^^^^^^

.. autofunction:: petl.io.xml.fromxml
.. autofunction:: petl.io.xml.toxml


.. module:: petl.io.html
.. _io_html:

HTML files
^^^^^^^^^^

.. autofunction:: petl.io.html.tohtml
.. autofunction:: petl.io.html.teehtml


.. module:: petl.io.json
.. _io_json:

JSON files
^^^^^^^^^^

.. autofunction:: petl.io.json.fromjson
.. autofunction:: petl.io.json.fromdicts
.. autofunction:: petl.io.json.tojson
.. autofunction:: petl.io.json.tojsonarrays

.. module:: petl.io.streams
.. _io_helpers:

Python I/O streams
^^^^^^^^^^^^^^^^^^

The following classes are helpers for extract (``from...()``) and load
(``to...()``) functions that use a file-like data source.

An instance of any of the following classes can be used as the ``source``
argument to data extraction functions like :func:`petl.io.csv.fromcsv` etc.,
with the exception of :class:`petl.io.sources.StdoutSource` which is
write-only.

An instance of any of the following classes can also be used as the ``source``
argument to data loading functions like :func:`petl.io.csv.tocsv` etc., with the
exception of :class:`petl.io.sources.StdinSource`,
:class:`petl.io.sources.URLSource` and :class:`petl.io.sources.PopenSource`
which are read-only.

The behaviour of each source can usually be configured by passing arguments
to the constructor, see the source code of the :mod:`petl.io.sources` module
for full details.

.. autoclass:: petl.io.sources.StdinSource
.. autoclass:: petl.io.sources.StdoutSource
.. autoclass:: petl.io.sources.MemorySource
.. autoclass:: petl.io.sources.PopenSource

.. module:: petl.io.register
.. _io_register:

Custom I/O streams
^^^^^^^^^^^^^^^^^^

To create custom helpers for :ref:`remote I/O <io_remotes>` or
`compression`, use the following functions:

.. autofunction:: petl.io.sources.register_reader
.. autofunction:: petl.io.sources.register_writer
.. autofunction:: petl.io.sources.get_reader
.. autofunction:: petl.io.sources.get_writer

See the source code of the classes in the :mod:`petl.io.sources` module for
more details.

.. _io_extended_formats:

Supported File Formats
----------------------

.. module:: petl.io.xls
.. _io_xls:

Excel .xls files (xlrd/xlwt)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

    The following functions require `xlrd
    <https://pypi.python.org/pypi/xlrd>`_ and `xlwt
    <https://pypi.python.org/pypi/xlwt-future>`_ to be installed,
    e.g.::

        $ pip install xlrd xlwt-future

.. autofunction:: petl.io.xls.fromxls
.. autofunction:: petl.io.xls.toxls


.. module:: petl.io.xlsx
.. _io_xlsx:

Excel .xlsx files (openpyxl)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

    The following functions require `openpyxl
    <https://bitbucket.org/ericgazoni/openpyxl/wiki/Home>`_ to be
    installed, e.g.::

        $ pip install openpyxl

.. autofunction:: petl.io.xlsx.fromxlsx
.. autofunction:: petl.io.xlsx.toxlsx
.. autofunction:: petl.io.xlsx.appendxlsx


.. module:: petl.io.numpy
.. _io_numpy:

Arrays (NumPy)
^^^^^^^^^^^^^^

.. note::

    The following functions require `numpy <http://www.numpy.org/>`_
    to be installed, e.g.::

        $ pip install numpy

.. autofunction:: petl.io.numpy.fromarray
.. autofunction:: petl.io.numpy.toarray
.. autofunction:: petl.io.numpy.torecarray
.. autofunction:: petl.io.numpy.valuestoarray


.. module:: petl.io.pandas
.. _io_pandas:

DataFrames (pandas)
^^^^^^^^^^^^^^^^^^^

.. note::

    The following functions require `pandas
    <http://pandas.pydata.org/>`_ to be installed, e.g.::

        $ pip install pandas

.. autofunction:: petl.io.pandas.fromdataframe
.. autofunction:: petl.io.pandas.todataframe


.. module:: petl.io.pytables
.. _io_pytables:

HDF5 files (PyTables)
^^^^^^^^^^^^^^^^^^^^^

.. note::

    The following functions require `PyTables
    <https://pytables.github.io/index.html>`_ to be installed, e.g.::

        $ # install HDF5
        $ apt-get install libhdf5-7 libhdf5-dev
        $ # install other prerequisites
        $ pip install cython
        $ pip install numpy
        $ pip install numexpr
        $ # install PyTables
        $ pip install tables

.. autofunction:: petl.io.pytables.fromhdf5
.. autofunction:: petl.io.pytables.fromhdf5sorted
.. autofunction:: petl.io.pytables.tohdf5
.. autofunction:: petl.io.pytables.appendhdf5


.. module:: petl.io.bcolz
.. _io_bcolz:

Bcolz ctables
^^^^^^^^^^^^^

.. note::

    The following functions require `bcolz <http://bcolz.blosc.org>`_
    to be installed, e.g.::

        $ pip install bcolz

.. autofunction:: petl.io.bcolz.frombcolz
.. autofunction:: petl.io.bcolz.tobcolz
.. autofunction:: petl.io.bcolz.appendbcolz

.. module:: petl.io.whoosh
.. _io_whoosh:

Text indexes (Whoosh)
^^^^^^^^^^^^^^^^^^^^^

.. note::

    The following functions require
    `Whoosh <https://pypi.python.org/pypi/Whoosh/>`_
    to be installed, e.g.::

        $ pip install whoosh

.. autofunction:: petl.io.whoosh.fromtextindex
.. autofunction:: petl.io.whoosh.searchtextindex
.. autofunction:: petl.io.whoosh.searchtextindexpage
.. autofunction:: petl.io.whoosh.totextindex
.. autofunction:: petl.io.whoosh.appendtextindex

.. module:: petl.io.avro
.. _io_avro:

Avro files (fastavro)
^^^^^^^^^^^^^^^^^^^^^

.. note::

    The following functions require `fastavro
    <https://github.com/fastavro/fastavro>`_ to be
    installed, e.g.::

        $ pip install fastavro

.. autofunction:: petl.io.avro.fromavro
.. autofunction:: petl.io.avro.toavro
.. autofunction:: petl.io.avro.appendavro

.. literalinclude:: ../petl/test/io/test_avro_schemas.py
   :name: logical_schema
   :language: python
   :caption: Avro schema for logical types 
   :start-after: begin_logical_schema
   :end-before: end_logical_schema

.. literalinclude:: ../petl/test/io/test_avro_schemas.py
   :name: nullable_schema
   :language: python
   :caption: Avro schema with nullable fields
   :start-after: begin_nullable_schema
   :end-before: end_nullable_schema

.. literalinclude:: ../petl/test/io/test_avro_schemas.py
   :name: array_schema
   :language: python
   :caption: Avro schema with array values in fields
   :start-after: begin_array_schema
   :end-before: end_array_schema

.. literalinclude:: ../petl/test/io/test_avro_schemas.py
   :name: complex_schema
   :language: python
   :caption: Example of recursive complex Avro schema
   :start-after: begin_complex_schema
   :end-before: end_complex_schema

.. module:: petl.io.gsheet
.. _io_gsheet:

Google Sheets (gspread)
^^^^^^^^^^^^^^^^^^^^^^^

.. warning::

    This is an experimental feature. The API and behavior may change between
    releases, possibly with breaking changes.

.. note::

    The following functions require `gspread
    <https://github.com/burnash/gspread>`_  to be installed,
    e.g.::

        $ pip install gspread

.. autofunction:: petl.io.gsheet.fromgsheet
.. autofunction:: petl.io.gsheet.togsheet
.. autofunction:: petl.io.gsheet.appendgsheet

.. module:: petl.io.db
.. _io_db:

Databases
---------

.. note::

    For reading and writing to databases, the following functions require
    `SQLAlchemy <http://www.sqlalchemy.org/>`_ and the database-specific driver
    to be installed along with petl, e.g.::

        $ pip install sqlalchemy
        $ pip install pymysql

    The ``sqlite3`` driver ships with the Python standard library and does not
    need to be installed separately.

.. autofunction:: petl.io.db.fromdb
.. autofunction:: petl.io.db.todb
.. autofunction:: petl.io.db.appenddb
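
As a rough illustration of the rows-and-columns shape that a database query
maps onto (a header row of column names followed by data rows), the
following sketch uses only the standard library ``sqlite3`` module. It does
not call petl, and the table and column names are made up for the example.

.. code-block:: python

    import sqlite3

    # In-memory database, so the example leaves nothing behind on disk.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE example (foo TEXT, bar INTEGER)")
    connection.executemany(
        "INSERT INTO example VALUES (?, ?)", [("a", 1), ("b", 2)]
    )

    cursor = connection.execute("SELECT foo, bar FROM example ORDER BY bar")
    # Column names come from the DB-API cursor description.
    header = tuple(column[0] for column in cursor.description)
    rows = [header] + cursor.fetchall()
    connection.close()

    print(rows)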

.. module:: petl.io.remote
.. _io_remotes:

Remote and Cloud Filesystems
----------------------------

The following classes are helpers that let the reading (``from...()``) and
writing (``to...()``) functions access remote data transparently as a
file-like source.

There is no need to instantiate them. They are used by the mechanism described
in :ref:`Extract <io_extract>` and :ref:`Load <io_load>`.

It's possible to read and write just by prefixing the protocol (e.g., `s3://`)
to the source path of the file.

.. note::

    For reading and writing to remote filesystems, the following classes
    require `fsspec <https://filesystem-spec.readthedocs.io/>`_ to be installed
    along with the petl package, e.g.::

        $ pip install fsspec

The supported filesystems and their URI formats are listed in the fsspec
documentation:

- `Built-in Implementations <https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations>`__
- `Other Known Implementations <https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations>`__

Remote sources
^^^^^^^^^^^^^^

.. autoclass:: petl.io.remotes.RemoteSource
.. autoclass:: petl.io.remotes.SMBSource

.. _io_deprecated:

Deprecated I/O sources
^^^^^^^^^^^^^^^^^^^^^^

The following helpers are deprecated and will be removed in a future version.

Their functionality has been replaced by the helpers in :ref:`Remote helpers <io_remotes>`.

.. autoclass:: petl.io.sources.FileSource
.. autoclass:: petl.io.sources.GzipSource
.. autoclass:: petl.io.sources.BZ2Source
.. autoclass:: petl.io.sources.ZipSource
.. autoclass:: petl.io.sources.URLSource