.. image:: https://github.com/pycompression/xopen/workflows/CI/badge.svg
  :target: https://github.com/pycompression/xopen
  :alt:

.. image:: https://img.shields.io/pypi/v/xopen.svg?branch=main
  :target: https://pypi.python.org/pypi/xopen

.. image:: https://img.shields.io/conda/v/conda-forge/xopen.svg
  :target: https://anaconda.org/conda-forge/xopen
  :alt:

.. image:: https://codecov.io/gh/pycompression/xopen/branch/main/graph/badge.svg
  :target: https://codecov.io/gh/pycompression/xopen
  :alt:

=====
xopen
=====

This Python module provides an ``xopen`` function that works like the
built-in ``open`` function but also transparently deals with compressed files.
Supported compression formats are currently gzip, bzip2, xz and optionally Zstandard.

``xopen`` selects the most efficient method for reading or writing a compressed file.
This often means opening a pipe to an external tool, such as
`pigz <https://zlib.net/pigz/>`_, which is a parallel version of ``gzip``,
or `igzip <https://github.com/intel/isa-l/>`_, which is a highly optimized
version of ``gzip``.

If ``threads=0`` is passed to ``xopen()``, no external process is used.
For gzip files, this will then use `python-isal
<https://github.com/pycompression/python-isal>`_ (which binds isa-l) if
it is installed (since ``python-isal`` is a dependency of ``xopen``,
this should always be the case).
Neither ``igzip`` nor ``python-isal`` supports compression levels
greater than 3, so for higher levels, if no external tool is available or ``threads`` has been set to 0,
Python’s built-in ``gzip.open`` is used.

For xz files, a pipe to the ``xz`` program is used because it has built-in support for multithreaded compression.

For bz2 files, `pbzip2 (parallel bzip2) <http://compression.ca/pbzip2/>`_ is used.

``xopen`` falls back to Python’s built-in functions
(``gzip.open``, ``lzma.open``, ``bz2.open``)
if none of the other methods can be used.
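
The fallback idea can be illustrated with a minimal sketch that uses only the
standard library. This is not how ``xopen`` is implemented: the helper
``read_gzip`` is hypothetical, and unlike ``xopen`` it collects the whole
decompressed output in memory instead of streaming it through a pipe::

    import gzip
    import shutil
    import subprocess


    def read_gzip(path: str) -> bytes:
        """Decompress a gzip file, preferring an external program if one exists."""
        # Prefer pigz, then gzip; shutil.which returns None if a program is missing.
        program = shutil.which("pigz") or shutil.which("gzip")
        if program is not None:
            # -c: write to standard output, -d: decompress
            result = subprocess.run(
                [program, "-cd", path], stdout=subprocess.PIPE, check=True
            )
            return result.stdout
        # No external tool found: fall back to Python's built-in gzip module.
        with gzip.open(path, "rb") as f:
            return f.read()

Either branch returns the same bytes; only the speed differs.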

The file format to use is determined from the file name if the extension is recognized
(``.gz``, ``.bz2``, ``.xz`` or ``.zst``).
When reading a file without a recognized file extension, xopen attempts to detect the format
by reading the first couple of bytes from the file.
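
As a rough sketch of this two-step detection (extension first, then leading bytes),
the hypothetical function ``guess_format`` below checks the file name and then
compares a header against the magic numbers defined by the respective file formats.
The function name, its signature and the returned strings are assumptions made for
illustration and are not part of the ``xopen`` API::

    from pathlib import Path
    from typing import Optional

    # Leading "magic" bytes that identify each compressed file format.
    MAGIC = {
        "gz": b"\x1f\x8b",
        "bz2": b"BZh",
        "xz": b"\xfd7zXZ\x00",
        "zst": b"\x28\xb5\x2f\xfd",
    }
    EXTENSIONS = {".gz": "gz", ".bz2": "bz2", ".xz": "xz", ".zst": "zst"}


    def guess_format(path: str, header: bytes = b"") -> Optional[str]:
        """Return "gz", "bz2", "xz" or "zst", or None if nothing matches."""
        # Step 1: trust a recognized file name extension.
        by_name = EXTENSIONS.get(Path(path).suffix)
        if by_name is not None:
            return by_name
        # Step 2: otherwise compare the first bytes of the file content.
        for name, magic in MAGIC.items():
            if header.startswith(magic):
                return name
        return None

Here, ``header`` stands for the first few bytes read from the file; six bytes are
enough to cover the longest magic number listed above.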

``xopen`` is compatible with Python versions 3.7 and later.


Usage
-----

Open a file for reading::

    from xopen import xopen

    with xopen("file.txt.gz") as f:
        content = f.read()

Write to a file in binary mode,
set the compression level
and avoid using an external process::

    from xopen import xopen

    with xopen("file.txt.xz", mode="wb", threads=0, compresslevel=3) as f:
        f.write(b"Hello")


Reproducibility
---------------

xopen writes gzip files in a reproducible manner.

Normally, gzip files contain a timestamp in the file header,
which means that compressing the same data at different times results in different output files.
xopen disables this for all of the supported gzip compression backends.
For example, when using an external process, it sets the command-line option
``--no-name`` (same as ``-n``).
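
The effect of the header timestamp can be reproduced with Python's standard
``gzip`` module alone. The following sketch does not use ``xopen``;
the helper names ``gzip_bytes`` and ``header_mtime`` are made up for this example.
It relies on the gzip header storing the modification time in bytes 4 to 7
(little-endian), where a value of 0 means that no timestamp is available::

    import gzip
    import io
    import struct


    def gzip_bytes(data: bytes, mtime: int) -> bytes:
        """Compress data in memory and write the given timestamp to the header."""
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
            f.write(data)
        return buf.getvalue()


    def header_mtime(raw: bytes) -> int:
        """Read the MTIME field (bytes 4 to 7, little-endian) of a gzip header."""
        return struct.unpack("<I", raw[4:8])[0]


    data = b"Hello" * 100
    # Same input, different timestamps: the resulting files differ.
    assert gzip_bytes(data, mtime=1) != gzip_bytes(data, mtime=2)
    # A fixed timestamp of 0 makes the output repeatable.
    assert gzip_bytes(data, mtime=0) == gzip_bytes(data, mtime=0)
    print(header_mtime(gzip_bytes(data, mtime=0)))  # prints 0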

Note that different gzip compression backends typically do not produce
identical output, so reproducibility is not guaranteed when the execution environment changes
from one ``xopen()`` invocation to the next.
This includes the CPU architecture as `igzip adjusts its algorithm
depending on it <https://github.com/intel/isa-l/issues/140#issuecomment-634877966>`_.

bzip2 and xz compression methods do not store timestamps in the file headers,
so output from them is also reproducible.


Optional Zstandard support
--------------------------

For reading and writing Zstandard (``.zst``) files, either the ``zstd`` command-line
program or the Python ``zstandard`` package needs to be installed.

* If the ``threads`` parameter to ``xopen()`` is ``None`` (the default) or any value greater than 0,
  ``xopen`` uses an external ``zstd`` process.
* If the above fails (because no ``zstd`` program is available) or if ``threads`` is 0,
  the ``zstandard`` package is used.
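
The two rules above can be summarized in a small decision function. This is a
sketch under assumptions, not code taken from ``xopen``: the function name
``choose_zstd_backend``, the returned strings and the ``RuntimeError`` raised
when nothing is available are all hypothetical::

    import importlib.util
    import shutil
    from typing import Optional


    def choose_zstd_backend(
        threads: Optional[int], have_program: bool, have_package: bool
    ) -> str:
        """Pick a Zstandard backend following the two rules listed above."""
        # threads=None (the default) or threads > 0: try the external program.
        if threads != 0 and have_program:
            return "external zstd process"
        # threads=0, or no zstd program found: use the Python package.
        if have_package:
            return "zstandard package"
        # Assumption: how xopen itself reports this case is not shown here.
        raise RuntimeError("neither the zstd program nor the zstandard package found")


    # Probe the current environment for the two possible backends.
    have_program = shutil.which("zstd") is not None
    have_package = importlib.util.find_spec("zstandard") is not None
    if have_program or have_package:
        print(choose_zstd_backend(None, have_program, have_package))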

To ensure that you get the correct ``zstandard`` version, you can specify the ``zstd`` extra for
``xopen``, that is, install it using ``pip install xopen[zstd]``.


Changelog
---------

v1.7.0 (2022-11-03)
~~~~~~~~~~~~~~~~~~~

* #91: Added optional support for Zstandard (``.zst``) files.
  This requires that the Python ``zstandard`` package is installed
  or that the ``zstd`` command-line program is available.

v1.6.0 (2022-08-10)
~~~~~~~~~~~~~~~~~~~

* #94: When writing gzip files, the timestamp and name of the original
  file are omitted (equivalent to using ``gzip --no-name`` (or ``-n``) on the
  command line). This allows files to be written in a reproducible manner.

v1.5.0 (2022-03-23)
~~~~~~~~~~~~~~~~~~~

* #100: Dropped Python 3.6 support
* #101: Added support for piping into and from an external ``xz`` process. Contributed by @fanninpm.
* #102: Support setting the xz compression level. Contributed by @tsibley.

v1.4.0 (2022-01-14)
~~~~~~~~~~~~~~~~~~~

* Add ``seek()`` and ``tell()`` to the ``PipedCompressionReader`` classes
  (for Windows compatibility)

v1.3.0 (2022-01-10)
~~~~~~~~~~~~~~~~~~~

* xopen is now available on Windows (in addition to Linux and macOS).
* For greater compatibility with `the built-in open()
  function <https://docs.python.org/3/library/functions.html#open>`_,
  ``xopen()`` has gained the parameters *encoding*, *errors* and *newline*
  with the same meaning as in ``open()``. Unlike built-in ``open()``, though,
  encoding is UTF-8 by default.
* A parameter *format* has been added that allows forcing the compression
  file format.

v1.2.0 (2021-09-21)
~~~~~~~~~~~~~~~~~~~

* `pbzip2 <http://compression.ca/pbzip2/>`_ is now used to open ``.bz2`` files if
  ``threads`` is greater than zero (contributed by @DriesSchaumont).

v1.1.0 (2021-01-20)
~~~~~~~~~~~~~~~~~~~

* Python 3.5 support is dropped.
* On Linux systems, `python-isal <https://github.com/pycompression/python-isal>`_
  is now added as a requirement. This will speed up the reading of gzip files
  significantly when no external processes are used.

v1.0.0 (2020-11-05)
~~~~~~~~~~~~~~~~~~~

* If installed, the ``igzip`` program (part of
  `Intel ISA-L <https://github.com/intel/isa-l/>`_) is now used for reading
  and writing gzip-compressed files at compression levels 1-3, which results
  in a significant speedup.

v0.9.0 (2020-04-02)
~~~~~~~~~~~~~~~~~~~

* #80: When the file name extension of a file to be opened for reading is not
  available, the content is inspected (if possible) and used to determine
  which compression format applies (contributed by @bvaisvil).
* This release drops Python 2.7 and 3.4 support. Python 3.5 or later is
  now required.

v0.8.4 (2019-10-24)
~~~~~~~~~~~~~~~~~~~

* When reading gzipped files, force ``pigz`` to use only a single process.
  ``pigz`` cannot use multiple cores anyway when decompressing. By default,
  it would use extra I/O processes, which slightly reduces wall-clock time,
  but increases CPU time. Single-core decompression with ``pigz`` is still
  about twice as fast as regular ``gzip``.
* Allow ``threads=0`` for specifying that no external ``pigz``/``gzip``
  process should be used (then regular ``gzip.open()`` is used instead).

v0.8.3 (2019-10-18)
~~~~~~~~~~~~~~~~~~~

* #20: When reading gzipped files, let ``pigz`` use at most four threads by default.
  This limit previously only applied when writing to a file. Contributed by @bernt-matthias.
* Support Python 3.8

v0.8.0 (2019-08-14)
~~~~~~~~~~~~~~~~~~~

* #14: Speed improvements when iterating over gzipped files.

v0.6.0 (2019-05-23)
~~~~~~~~~~~~~~~~~~~

* For reading from gzipped files, xopen will now use a ``pigz`` subprocess.
  This is faster than using ``gzip.open``.
* Python 2 support will be dropped in one of the next releases.

v0.5.0 (2019-01-30)
~~~~~~~~~~~~~~~~~~~

* By default, pigz is now only allowed to use at most four threads. This hopefully reduces
  problems some users had with too many threads when opening many files at the same time.
* xopen now accepts pathlib.Path objects.

v0.4.0 (2019-01-07)
~~~~~~~~~~~~~~~~~~~

* Drop Python 3.3 support
* Add a ``threads`` parameter (passed on to ``pigz``)

v0.3.2 (2017-11-22)
~~~~~~~~~~~~~~~~~~~

* #6: Make multi-block bz2 work on Python 2 by using external bz2file library.

v0.3.1 (2017-11-22)
~~~~~~~~~~~~~~~~~~~

* Drop Python 2.6 support
* #5: Fix PipedGzipReader.read() not returning anything

v0.3.0 (2017-11-15)
~~~~~~~~~~~~~~~~~~~

* Add gzip compression parameter

v0.2.1 (2017-05-31)
~~~~~~~~~~~~~~~~~~~

* #3: Allow appending to bz2 and lzma files where possible

v0.1.1 (2016-12-02)
~~~~~~~~~~~~~~~~~~~

* Fix a deadlock

v0.1.0 (2016-09-09)
~~~~~~~~~~~~~~~~~~~

* Initial release

Credits
-------

The name ``xopen`` was taken from the C function of the same name in the
`utils.h file which is part of
BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.

Some ideas were taken from the `canopener project <https://github.com/selassid/canopener>`_.
If you also want to open S3 files, you may want to use that module instead.

@kyleabeauchamp contributed support for appending to files before this repository was created.


Maintainers
-----------

* Marcel Martin
* Ruben Vorderman
* For a list of contributors, see <https://github.com/pycompression/xopen/graphs/contributors>


Links
-----

* `Source code <https://github.com/pycompression/xopen/>`_
* `Report an issue <https://github.com/pycompression/xopen/issues>`_
* `Project page on PyPI (Python package index) <https://pypi.python.org/pypi/xopen/>`_