.. image:: https://github.com/pycompression/xopen/workflows/CI/badge.svg
  :target: https://github.com/pycompression/xopen
  :alt:

.. image:: https://img.shields.io/pypi/v/xopen.svg?branch=main
  :target: https://pypi.python.org/pypi/xopen

.. image:: https://img.shields.io/conda/v/conda-forge/xopen.svg
  :target: https://anaconda.org/conda-forge/xopen
  :alt:

.. image:: https://codecov.io/gh/pycompression/xopen/branch/main/graph/badge.svg
  :target: https://codecov.io/gh/pycompression/xopen
  :alt:

=====
xopen
=====
This Python module provides an ``xopen`` function that works like the
built-in ``open`` function but also transparently deals with compressed files.
Supported compression formats are currently gzip, bzip2, xz and optionally Zstandard.
``xopen`` selects the most efficient method for reading or writing a compressed file.
This often means opening a pipe to an external tool, such as
`pigz <https://zlib.net/pigz/>`_, which is a parallel version of ``gzip``,
or `igzip <https://github.com/intel/isa-l/>`_, which is a highly optimized
version of ``gzip``.
If ``threads=0`` is passed to ``xopen()``, no external process is used.
For gzip files, this will then use `python-isal
<https://github.com/pycompression/python-isal>`_ (which binds isa-l) if
it is installed (since ``python-isal`` is a dependency of ``xopen``,
this should always be the case).
Neither ``igzip`` nor ``python-isal`` supports compression levels
greater than 3, so for higher levels, if no external tool is available or ``threads`` has been set to 0,
Python’s built-in ``gzip.open`` is used.
For xz files, a pipe to the ``xz`` program is used because it has built-in support for multithreaded compression.
For bz2 files, `pbzip2 (parallel bzip2) <http://compression.ca/pbzip2/>`_ is used.
``xopen`` falls back to Python’s built-in functions
(``gzip.open``, ``lzma.open``, ``bz2.open``)
if none of the other methods can be used.
The file format to use is determined from the file name if the extension is recognized
(``.gz``, ``.bz2``, ``.xz`` or ``.zst``).
When reading a file without a recognized file extension, xopen attempts to detect the format
by reading the first couple of bytes from the file.
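
Content-based detection of this kind can be sketched with the well-known signature ("magic") bytes of each format. This is a minimal illustration and not ``xopen``'s actual implementation; the names ``MAGIC_BYTES`` and ``detect_format_from_content`` are hypothetical.

```python
import bz2
import gzip
import lzma
from typing import Optional

# Leading signature bytes of each supported container format.
MAGIC_BYTES = {
    "gz": b"\x1f\x8b",
    "bz2": b"BZh",
    "xz": b"\xfd7zXZ\x00",
    "zst": b"\x28\xb5\x2f\xfd",
}


def detect_format_from_content(header: bytes) -> Optional[str]:
    """Return a format name if ``header`` starts with a known signature.

    Hypothetical helper for illustration; returns None for unknown data.
    """
    for name, magic in MAGIC_BYTES.items():
        if header.startswith(magic):
            return name
    return None


# Try the helper on data produced by the standard library compressors.
samples = {
    "gzip": gzip.compress(b"Hello"),
    "bzip2": bz2.compress(b"Hello"),
    "xz": lzma.compress(b"Hello"),
    "plain": b"Hello",
}
for label, payload in samples.items():
    # Only the first few bytes are needed to identify the format.
    print(label, detect_format_from_content(payload[:6]))
```

Only a handful of leading bytes is inspected, so the check is cheap even for very large files.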
``xopen`` is compatible with Python versions 3.7 and later.
Usage
-----
Open a file for reading::

    from xopen import xopen

    with xopen("file.txt.gz") as f:
        content = f.read()

Write to a file in binary mode,
set the compression level
and avoid using an external process::

    from xopen import xopen

    with xopen("file.txt.xz", mode="wb", threads=0, compresslevel=3) as f:
        f.write(b"Hello")

Reproducibility
---------------
xopen writes gzip files in a reproducible manner.
Normally, gzip files contain a timestamp in the file header,
which means that compressing the same data at different times results in different output files.
xopen disables this for all of the supported gzip compression backends.
For example, when using an external process, it sets the command-line option
``--no-name`` (same as ``-n``).
Note that different gzip compression backends typically do not produce
identical output, so reproducibility is not guaranteed when the execution environment changes
from one ``xopen()`` invocation to the next.
This includes the CPU architecture as `igzip adjusts its algorithm
depending on it <https://github.com/intel/isa-l/issues/140#issuecomment-634877966>`_.
bzip2 and xz compression methods do not store timestamps in the file headers,
so output from them is also reproducible.
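
The effect of the header timestamp can be illustrated with Python's standard ``gzip`` module alone. The sketch below does not use ``xopen`` and does not show its internals; the helper ``gzip_bytes`` is a hypothetical name. It fixes the modification time to 0 so that repeated compression yields identical bytes.

```python
import gzip
import io


def gzip_bytes(data: bytes, mtime=None) -> bytes:
    """Compress ``data`` in memory.

    With ``mtime=None``, the gzip module stores the current time in the
    header; passing a fixed value such as 0 makes the header constant.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()


data = b"Hello, reproducible world!\n"
first = gzip_bytes(data, mtime=0)
second = gzip_bytes(data, mtime=0)

# With the timestamp fixed, both results are byte-for-byte identical.
print(first == second)
# Bytes 4 to 7 of the gzip header hold the modification time.
print(first[4:8])
```

This identity holds only within a single backend and environment; as noted above, a different backend may produce different compressed bytes for the same input.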
Optional Zstandard support
--------------------------
For reading and writing Zstandard (``.zst``) files, either the ``zstd`` command-line
program or the Python ``zstandard`` package needs to be installed.
* If the ``threads`` parameter to ``xopen()`` is ``None`` (the default) or any value greater than 0,
  ``xopen`` uses an external ``zstd`` process.
* If the above fails (because no ``zstd`` program is available) or if ``threads`` is 0,
  the ``zstandard`` package is used.

To ensure that you get the correct ``zstandard`` version, you can specify the ``zstd`` extra for
``xopen``, that is, install it using ``pip install xopen[zstd]``.
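
The selection rule for Zstandard described above can be sketched as a small decision function. This is an illustration of the documented behavior only, not ``xopen``'s code; ``choose_zstd_backend`` and its return strings are hypothetical, and raising ``RuntimeError`` when neither option is available is an assumption made for this sketch.

```python
import importlib.util
import shutil
from typing import Optional


def choose_zstd_backend(
    threads: Optional[int], have_program: bool, have_package: bool
) -> str:
    """Pick a Zstandard backend following the rule described above."""
    # threads=None (the default) or a positive value: prefer the external tool.
    if (threads is None or threads > 0) and have_program:
        return "external zstd process"
    # threads=0, or no zstd program found: use the Python package.
    if have_package:
        return "zstandard package"
    # Assumption for this sketch: fail loudly if nothing is available.
    raise RuntimeError("neither the zstd program nor the zstandard package is available")


# Probe the current environment using only the standard library.
have_program = shutil.which("zstd") is not None
have_package = importlib.util.find_spec("zstandard") is not None
print("zstd program found:", have_program)
print("zstandard package found:", have_package)

# The rule itself, shown with fixed availability flags.
print(choose_zstd_backend(None, have_program=True, have_package=True))
print(choose_zstd_backend(0, have_program=True, have_package=True))
print(choose_zstd_backend(4, have_program=False, have_package=True))
```

Passing the availability flags as arguments keeps the rule separate from the environment probing, which makes it easy to check each case in isolation.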
Changelog
---------
v1.7.0 (2022-11-03)
~~~~~~~~~~~~~~~~~~~
* #91: Added optional support for Zstandard (``.zst``) files.
This requires that the Python ``zstandard`` package is installed
or that the ``zstd`` command-line program is available.
v1.6.0 (2022-08-10)
~~~~~~~~~~~~~~~~~~~
* #94: When writing gzip files, the timestamp and name of the original
  file are omitted (equivalent to using ``gzip --no-name`` (or ``-n``) on the
  command line). This allows files to be written in a reproducible manner.
v1.5.0 (2022-03-23)
~~~~~~~~~~~~~~~~~~~
* #100: Dropped Python 3.6 support
* #101: Added support for piping into and from an external ``xz`` process. Contributed by @fanninpm.
* #102: Support setting the xz compression level. Contributed by @tsibley.
v1.4.0 (2022-01-14)
~~~~~~~~~~~~~~~~~~~
* Add ``seek()`` and ``tell()`` to the ``PipedCompressionReader`` classes
(for Windows compatibility)
v1.3.0 (2022-01-10)
~~~~~~~~~~~~~~~~~~~
* xopen is now available on Windows (in addition to Linux and macOS).
* For greater compatibility with `the built-in open()
  function <https://docs.python.org/3/library/functions.html#open>`_,
  ``xopen()`` has gained the parameters *encoding*, *errors* and *newline*
  with the same meaning as in ``open()``. Unlike built-in ``open()``, though,
  encoding is UTF-8 by default.
* A parameter *format* has been added that allows forcing the compression
  file format.
v1.2.0 (2021-09-21)
~~~~~~~~~~~~~~~~~~~
* `pbzip2 <http://compression.ca/pbzip2/>`_ is now used to open ``.bz2`` files if
``threads`` is greater than zero (contributed by @DriesSchaumont).
v1.1.0 (2021-01-20)
~~~~~~~~~~~~~~~~~~~
* Python 3.5 support is dropped.
* On Linux systems, `python-isal <https://github.com/pycompression/python-isal>`_
is now added as a requirement. This will speed up the reading of gzip files
significantly when no external processes are used.
v1.0.0 (2020-11-05)
~~~~~~~~~~~~~~~~~~~
* If installed, the ``igzip`` program (part of
`Intel ISA-L <https://github.com/intel/isa-l/>`_) is now used for reading
and writing gzip-compressed files at compression levels 1-3, which results
in a significant speedup.
v0.9.0 (2020-04-02)
~~~~~~~~~~~~~~~~~~~
* #80: When the file name extension of a file to be opened for reading is not
available, the content is inspected (if possible) and used to determine
which compression format applies (contributed by @bvaisvil).
* This release drops Python 2.7 and 3.4 support. Python 3.5 or later is
now required.
v0.8.4 (2019-10-24)
~~~~~~~~~~~~~~~~~~~
* When reading gzipped files, force ``pigz`` to use only a single process.
``pigz`` cannot use multiple cores anyway when decompressing. By default,
it would use extra I/O processes, which slightly reduces wall-clock time,
but increases CPU time. Single-core decompression with ``pigz`` is still
about twice as fast as regular ``gzip``.
* Allow ``threads=0`` for specifying that no external ``pigz``/``gzip``
process should be used (then regular ``gzip.open()`` is used instead).
v0.8.3 (2019-10-18)
~~~~~~~~~~~~~~~~~~~
* #20: When reading gzipped files, let ``pigz`` use at most four threads by default.
This limit previously only applied when writing to a file. Contributed by @bernt-matthias.
* Support Python 3.8
v0.8.0 (2019-08-14)
~~~~~~~~~~~~~~~~~~~
* #14: Speed improvements when iterating over gzipped files.
v0.6.0 (2019-05-23)
~~~~~~~~~~~~~~~~~~~
* For reading from gzipped files, xopen will now use a ``pigz`` subprocess.
This is faster than using ``gzip.open``.
* Python 2 support will be dropped in one of the next releases.
v0.5.0 (2019-01-30)
~~~~~~~~~~~~~~~~~~~
* By default, pigz is now only allowed to use at most four threads. This hopefully reduces
problems some users had with too many threads when opening many files at the same time.
* xopen now accepts pathlib.Path objects.
v0.4.0 (2019-01-07)
~~~~~~~~~~~~~~~~~~~
* Drop Python 3.3 support
* Add a ``threads`` parameter (passed on to ``pigz``)
v0.3.2 (2017-11-22)
~~~~~~~~~~~~~~~~~~~
* #6: Make multi-block bz2 work on Python 2 by using external bz2file library.
v0.3.1 (2017-11-22)
~~~~~~~~~~~~~~~~~~~
* Drop Python 2.6 support
* #5: Fix PipedGzipReader.read() not returning anything
v0.3.0 (2017-11-15)
~~~~~~~~~~~~~~~~~~~
* Add gzip compression parameter
v0.2.1 (2017-05-31)
~~~~~~~~~~~~~~~~~~~
* #3: Allow appending to bz2 and lzma files where possible
v0.1.1 (2016-12-02)
~~~~~~~~~~~~~~~~~~~
* Fix a deadlock
v0.1.0 (2016-09-09)
~~~~~~~~~~~~~~~~~~~
* Initial release
Credits
-------
The name ``xopen`` was taken from the C function of the same name in the
`utils.h file which is part of
BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.
Some ideas were taken from the `canopener project <https://github.com/selassid/canopener>`_.
If you also want to open S3 files, you may want to use that module instead.
@kyleabeauchamp contributed support for appending to files before this repository was created.
Maintainers
-----------
* Marcel Martin
* Ruben Vorderman
* For a list of contributors, see <https://github.com/pycompression/xopen/graphs/contributors>
Links
-----
* `Source code <https://github.com/pycompression/xopen/>`_
* `Report an issue <https://github.com/pycompression/xopen/issues>`_
* `Project page on PyPI (Python package index) <https://pypi.python.org/pypi/xopen/>`_