---------
Tutorials
---------
Using `python-blosc` (or just `blosc`, since we will always be talking about
how to use it from a Python environment) is pretty easy.  It basically mimics
the API of the `zlib` module included in the standard Python library.

Here are some examples of how to use it.  For the full documentation, please
refer to the `Library Reference <https://www.blosc.org/python-blosc/reference.html>`_
section.
Most of the timings in this tutorial have been obtained using a VM with 2 cores
on top of an Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz.
Compressing and decompressing with `blosc`
==========================================
Let's start by creating a NumPy array holding 80 MB of data::
>>> import numpy as np
>>> a = np.linspace(0, 100, int(1e7))
>>> bytes_array = a.tobytes()  # get the raw bytes of the array
and let's compare Blosc's operation with `zlib` (please note that we are
using IPython to leverage its timing capabilities)::
>>> import zlib
>>> %timeit zlib.compress(bytes_array)
1 loops, best of 3: 4.65 s per loop # ~ 17 MB/s
>>> import blosc
>>> %timeit blosc.compress(bytes_array, typesize=8)
100 loops, best of 3: 17.7 ms per loop # ~ 4.5 GB/s and 260x faster than zlib
but Blosc can use different codecs under the hood.  Let's try the LZ4 codec
(instead of BloscLZ, which is the default)::
>>> %timeit blosc.compress(bytes_array, typesize=8, cname='lz4')
100 loops, best of 3: 8.77 ms per loop # ~ 9.1 GB/s and 530x faster than zlib
It is important to note that this is considerably faster than a memcpy()
on this machine::
>>> %timeit a.copy()
10 loops, best of 3: 27.2 ms per loop # ~ 2.9 GB/s
which means that both BloscLZ and LZ4 codecs can be faster than memcpy(),
just as the Blosc slogan promises.
Blosc also ships a Zlib codec, and it actually runs faster than the naked
zlib::
>>> %timeit blosc.compress(bytes_array, typesize=8, cname='zlib')
10 loops, best of 3: 139 ms per loop # ~ 580 MB/s and 33x faster than zlib
The reason why the internal Zlib codec in Blosc is faster than the 'naked'
one is that 1) Blosc can use multithreading (8 threads here) and 2) Blosc
splits the data to compress into smaller blocks that are friendlier to the
caches of modern CPUs.
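If you want to see how much of that comes from multithreading, you can query
and change the number of threads Blosc uses.  This is just a minimal sketch;
the core count and the returned previous settings depend on your machine::
>>> blosc.detect_number_of_cores()    # how many cores Blosc can see
4
>>> blosc.set_nthreads(1)             # compress with a single thread; returns the previous setting
4
>>> blosc.set_nthreads(blosc.detect_number_of_cores())   # back to one thread per core
1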
Now, let's have a look at the compression ratios::
>>> zpacked = zlib.compress(bytes_array); len(zpacked)
52994692
>>> round(len(bytes_array) / float(len(zpacked)), 3)
1.51 # zlib achieves a 1.5x compression ratio
>>> blzpacked = blosc.compress(bytes_array, typesize=8, cname='blosclz'); len(blzpacked)
6986533
>>> round(len(bytes_array) / float(len(blzpacked)), 3)
11.451 # BloscLZ codec reaches more than 11x compression ratio
>>> lz4packed = blosc.compress(bytes_array, typesize=8, cname='lz4'); len(lz4packed)
3716015
>>> round(len(bytes_array) / float(len(lz4packed)), 3)
21.528 # LZ4 codec reaches more than 21x compression ratio
Here we can see how Blosc, with its different codecs, is very efficient at
compressing this kind of binary data.  It is important to note that the
codecs alone are not the only thing responsible for the high compression
ratios.  See what a naked LZ4 codec can do on the same string::
>>> import lz4
>>> lz4_packed = lz4.compress(bytes_array); len(lz4_packed)
80309133
>>> round(len(bytes_array) / float(len(lz4_packed)), 3)
0.996
That is, the naked LZ4 codec cannot compress this array of bytes at all,
whereas through Blosc, LZ4 can compress it quite a lot.  In fact, the same
difference in compression also happens with Zlib::
>>> zlibpacked = blosc.compress(bytes_array, typesize=8, cname='zlib'); len(zlibpacked)
875274
>>> round(len(zpacked) / float(len(zlibpacked)), 3)
60.546 # ZLib codec reaches 60x more compression than naked ZLib
Here is a plot with the different compression ratios achieved:
.. image:: _static/cratio-blosc-codecs.png
:scale: 75 %
The explanation for this apparently shocking result is that Blosc applies
filters (`SHUFFLE` and `BITSHUFFLE` currently, but the list may grow in the
future) prior to the compression stage, and these generally allow better
compression ratios when dealing with binary data.
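If you are curious about the effect of the filter in isolation, you can
disable it for a single call through the `shuffle` argument (covered in more
detail below) and compare output sizes.  A quick sketch, where the exact
sizes will depend on your Blosc build::
>>> unfiltered = blosc.compress(bytes_array, typesize=8, cname='lz4', shuffle=blosc.NOSHUFFLE)
>>> filtered = blosc.compress(bytes_array, typesize=8, cname='lz4', shuffle=blosc.SHUFFLE)
>>> len(filtered) < len(unfiltered)   # shuffling first is what lets LZ4 compress so well here
True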
How about decompression?::
>>> %timeit zlib.decompress(zpacked)
1 loops, best of 3: 275 ms per loop # ~ 290 MB/s
>>> %timeit blosc.decompress(blzpacked)
100 loops, best of 3: 17.3 ms per loop # ~ 4.3 GB/s and 16x faster than zlib
>>> %timeit blosc.decompress(lz4packed)
10 loops, best of 3: 41.9 ms per loop # ~ 1.9 GB/s and 6.5x faster than zlib
>>> %timeit blosc.decompress(zlibpacked)
10 loops, best of 3: 40.7 ms per loop # ~ 2.0 GB/s and 6.8x faster than zlib
Here we see a couple of things:

* All the internal codecs in Blosc are way faster than naked Zlib.
* The fastest codec for decompressing here is BloscLZ (remember that LZ4 was
  the fastest for compression).
The next plot summarizes the speed benchmarks above:
.. image:: _static/speed-blosc-codecs.png
:scale: 75 %
These results should reinforce the idea that there is no single codec that
wins in all areas (compression ratio, compression speed and decompression
speed), and that each has its pros and cons.  It is up to the user to choose
whatever fits their needs best (hint: there is no replacement for
experimentation).
Finally, here is how to discover all the internal codecs available in your
Blosc package::
>>> blosc.cnames
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
*Note*: the actual list of codecs may change depending on how you have
compiled the underlying C-Blosc library.
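Because of that, if your code relies on a specific codec, it is worth
checking for it first.  A minimal sketch::
>>> 'zstd' in blosc.cnames   # make sure the codec was compiled in before requesting it
True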
Using different filters
=======================
In the same way that you can use different codecs inside Blosc, you can also
use different filters (currently `SHUFFLE` and `BITSHUFFLE`).  Depending on
your dataset, these can make the integrated compressors more (or less)
efficient.
Here is an example using the `SHUFFLE` filter::
>>> %time bpacked = blosc.compress(bytes_array, shuffle=blosc.SHUFFLE)
CPU times: user 112 ms, sys: 4 ms, total: 116 ms
Wall time: 29.9 ms
>>> len(bpacked)
6986533
And here is another example using `BITSHUFFLE`::
>>> %time bpacked = blosc.compress(bytes_array, shuffle=blosc.BITSHUFFLE)
CPU times: user 120 ms, sys: 8 ms, total: 128 ms
Wall time: 32.9 ms
>>> len(bpacked)
5942257 # ~ 1.2x smaller than blosclz/shuffle
You can also deactivate filters completely with `NOSHUFFLE`::
>>> %time bpacked = blosc.compress(bytes_array, shuffle=blosc.NOSHUFFLE)
CPU times: user 416 ms, sys: 8 ms, total: 424 ms
Wall time: 107 ms
>>> len(bpacked)
74323591 # just ~7% smaller than the original buffer
So you have quite a bit of flexibility when choosing codecs and filters
inside Blosc.  Again, depending on your dataset and your performance
requirements, you may want to experiment a bit before settling on your
preferred combination.
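For instance, one quick way to run such an experiment is to sweep over every
codec and filter combination and compare the resulting sizes.  This is only a
sketch; the best pair depends entirely on your data and your Blosc build::
>>> results = {}
>>> for cname in blosc.cnames:
...     for filter_ in (blosc.NOSHUFFLE, blosc.SHUFFLE, blosc.BITSHUFFLE):
...         packed = blosc.compress(bytes_array, typesize=8, cname=cname, shuffle=filter_)
...         results[(cname, filter_)] = len(packed)
>>> best = min(results, key=results.get)   # the (codec, filter) pair with the smallest output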
Supporting the buffer interface
===============================
python-blosc supports compressing and decompressing from any bytes-like
object that supports the buffer interface; this includes `memoryview` and
`bytearray`::
>>> input_bytes = b"abcdefghijklmnopqrstuvwxyz"
>>> blosc.compress(input_bytes, typesize=1)
'\x02\x01\x03\x01\x1a\x00\x00\x00\x1a\x00\x00\x00*\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
>>> blosc.compress(memoryview(input_bytes), typesize=1)
'\x02\x01\x03\x01\x1a\x00\x00\x00\x1a\x00\x00\x00*\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
>>> blosc.compress(bytearray(input_bytes), typesize=1)
'\x02\x01\x03\x01\x1a\x00\x00\x00\x1a\x00\x00\x00*\x00\x00\x00abcdefghijklmnopqrstuvwxyz'
>>> compressed = blosc.compress(input_bytes, typesize=1)
>>> blosc.decompress(compressed)
'abcdefghijklmnopqrstuvwxyz'
>>> blosc.decompress(memoryview(compressed))
'abcdefghijklmnopqrstuvwxyz'
>>> blosc.decompress(bytearray(compressed))
'abcdefghijklmnopqrstuvwxyz'
Note, however, that we can only compress binary data, which does *not*
include `unicode` strings.
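If your data starts out as a `unicode` string, encode it to bytes first.
A minimal sketch::
>>> text = u"abcdefghijklmnopqrstuvwxyz"
>>> packed = blosc.compress(text.encode('utf-8'), typesize=1)
>>> blosc.decompress(packed).decode('utf-8') == text
True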
Packaging NumPy arrays
======================
Want to use `blosc` to compress and decompress NumPy objects without having to
worry about passing the typesize for optimal compression, or about creating
the final container for decompression?  `blosc` comes with the `pack_array`
and `unpack_array` functions to do this in a handy way::
>>> a = np.linspace(0, 100, int(1e7))
>>> packed = blosc.pack_array(a)
>>> %timeit blosc.pack_array(a)
10 loops, best of 3: 104 ms per loop # ~ 770 MB/s
>>> %timeit blosc.unpack_array(packed)
10 loops, best of 3: 76.3 ms per loop # ~ 1 GB/s
Although this is a convenient way of compressing/decompressing NumPy
arrays, this method uses pickle/unpickle behind the scenes.  That step
implies additional copies, which take both memory and time.
Compressing from a data pointer
===============================
To avoid the data-copy problem described in the previous section, `blosc`
comes with a couple of lower-level functions: `compress_ptr` and
`decompress_ptr`.  Here they are in action::
>>> c = blosc.compress_ptr(a.__array_interface__['data'][0], a.size, a.dtype.itemsize, 9, True)
>>> %timeit blosc.compress_ptr(a.__array_interface__['data'][0], a.size, a.dtype.itemsize, 9, True)
10 loops, best of 3: 17.8 ms per loop # ~ 4.5 GB/s
>>> a2 = np.empty(a.size, dtype=a.dtype)
>>> %timeit blosc.decompress_ptr(c, a2.__array_interface__['data'][0])
100 loops, best of 3: 11 ms per loop # ~ 7.3 GB/s
>>> (a == a2).all()
True
As you can see, these are really low-level functions: you must pass actual
pointers to where the data lives, as well as the size and itemsize (for
compression).  Needless to say, it is very easy to cause a segfault by
passing incorrect parameters to these functions (a wrong pointer or a wrong
size).
On the other hand, and contrary to the `pack_array` / `unpack_array`
approach, the `compress_ptr` / `decompress_ptr` functions do not need to
make internal copies of the data buffers, so they are extremely fast
(as fast as the underlying C-Blosc library can be), but you have to provide
a container yourself when doing the de-serialization.
`blosc` can only compress buffers of up to 2**31 bytes (2 GB) in a single
call.  Larger `numpy` arrays must be chunked by slicing, for example::
>>> c = b''
>>> for index in np.arange(a.shape[0]):
...     c += blosc.compress_ptr(a[index, ...].__array_interface__['data'][0],
...                             a[index, ...].size, a.dtype.itemsize, 9, True)
You can also use this method with other Python objects such as `bytes` and
`bytearray` by converting them into NumPy arrays with `np.frombuffer()`; it
should work with any object that implements the buffer interface.  As
`np.frombuffer()` does not copy the underlying data, the conversion overhead
is low::
>>> byte_arr = bytearray(b'\x01\x02\x03\x04\x05\x06\x07\x08........')
>>> a = np.frombuffer(byte_arr, dtype=np.int8)
>>> c = blosc.compress_ptr(a.__array_interface__['data'][0], a.size, a.dtype.itemsize, 9, True)
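To get the data back into a mutable buffer you can reverse the trick:
`np.frombuffer()` on a `bytearray` yields a writable array, so
`decompress_ptr` can fill it in place.  A minimal sketch::
>>> out = bytearray(len(byte_arr))                # pre-allocate the destination
>>> out_arr = np.frombuffer(out, dtype=np.int8)   # writable view, no copy
>>> nbytes = blosc.decompress_ptr(c, out_arr.__array_interface__['data'][0])
>>> out == byte_arr
True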
Fine-tuning compression parameters
==================================
There are a number of parameters that affect the de/compression bandwidth
achieved by `blosc`; a combined tuning sketch follows the list:
* The information content of the underlying data, and the `chunksize`: the
  size of the data passed in each call to `compress` and analogous functions.
  `blosc` does not manage chunking itself, but an example implementation can
  be found in the `bloscpack` module.
* `n_threads`: the number of threads to spawn inside `c-blosc`.
  `n_threads` may be changed by calling, for example, `blosc.set_nthreads(16)`.
  `blosc` performance generally scales sub-linearly with the number of threads,
  with a coefficient of roughly 0.5-0.67; i.e. the expected run time relative
  to a single thread is roughly 1.0 / (0.6*n_threads).  For systems with
  hyper-threading, the optimum number of threads is usually a small
  over-subscription of the number of *physical* (not virtual) cores.
* `blocksize`: the size, in bytes, of each contiguous block of memory that is
  compressed.  Normally `blosc` attempts to guess the blocksize automatically,
  but the user can set it manually by calling `blosc.set_blocksize(size_in_bytes)`
  for hand-optimized situations.  Often the L2 cache size (e.g. 256 kB for an
  Intel Haswell) is a good starting point for optimization.
* `shuffle`: as discussed above, the `shuffle` filter mode can substantially
  improve compression ratios when the underlying binary data has a regular
  `typesize` (e.g. NumPy arrays of ints or floats).
* `clevel`: the compression level requested from the codec.  Passed as an
  argument to `compress` and similar functions.
* `cname`: the compressor codec itself.  Each codec has its own characteristics
  that also vary depending on the underlying data.  For example, 'lz4' tends
  to prefer a smaller `blocksize` and does not slow down significantly with
  increasing `clevel`.  In comparison, 'zlib' and 'zstd' both slow down
  substantially with increasing `clevel`, with an inflection point around
  `clevel` = 4-5 for 'zlib' and `clevel` = 2-4 for 'zstd'.  Passed as an
  argument to `compress` and similar functions.
* `releasegil`: optionally, the Python Global Interpreter Lock (GIL) can be
  released during `c-blosc` operations by calling `blosc.set_releasegil(True)`.
  The default is off, as there is a small overhead in releasing the GIL that
  can be significant for a small `blocksize`.  Releasing the GIL is intended
  for situations where other bounds (such as file or network I/O) are the
  rate-limiting ones and a Python `ThreadPool` or similar object is used for
  parallel processing, either with or without `blosc` threads.  Preliminary
  tests suggest that an equal mix of `ThreadPool` and `blosc` threads is near
  optimal; e.g. a computer with 16 cores might use 4 `blosc` threads and
  4 `ThreadPool` threads.  An example of combining `blosc` and a `ThreadPool`
  may be found in `bench/threadpool.py`.
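To see how these knobs fit together, here is a rough sketch of what hand
tuning might look like; the specific values below are illustrative
assumptions, not recommendations::
>>> old_nthreads = blosc.set_nthreads(4)    # spawn 4 threads inside c-blosc (returns the old setting)
>>> old_gil = blosc.set_releasegil(True)    # release the GIL during de/compression
>>> blosc.set_blocksize(2**18)              # 256 kB blocks, roughly an L2 cache size
>>> packed = blosc.compress(bytes_array, typesize=8, clevel=5,
...                         cname='zstd', shuffle=blosc.SHUFFLE)
>>> blosc.set_blocksize(0)                  # 0 lets Blosc choose the blocksize automatically again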
Links to external discussions on `blosc` optimization
=====================================================
* `Synthetic Benchmarks <https://www.blosc.org/pages/synthetic-benchmarks/>`_ by Francesc Alted
* `Genotype compressor benchmark <https://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html>`_ by Alistair Miles