Compression of sparse bitarrays
===============================

In a ``bitarray`` object, each byte in memory represents eight bits.
While this representation is compact and efficient for most data, there
are situations in which it is wasteful.  One such situation is a sparsely
populated bitarray, that is, a bitarray in which only a few bits are 1
and most bits are 0.  In this case, one might consider a data structure
which stores the indices of the 1 bits and not use the ``bitarray``
object at all.  However, having all of bitarray's functionality available
is very convenient.  It may therefore be desirable to convert ``bitarray``
objects into a more compact (index based) format when storing them on disk
or sending them over the network.  This is the use case of the utility
functions ``sc_encode()`` and ``sc_decode()``.

The lower the population count, the more efficient the compression will be:
.. code-block:: python

    >>> from bitarray import bitarray
    >>> from bitarray.util import zeros, sc_encode, sc_decode
    >>> a = zeros(1 << 24, 'little')  # 16 mbits
    >>> a[0xaa] = a[0xbbcc] = a[0xddeeff] = 1
    >>> blob = sc_encode(a)
    >>> blob
    b'\x04\x00\x00\x00\x01\xc3\x03\xaa\x00\x00\xcc\xbb\x00\xff\xee\xdd\x00'
    >>> assert sc_decode(blob) == a

How it works
------------

Consider a ``bitarray`` of length 256, that is, 32 bytes of memory.
If we represent this object by the indices of its 1 bits, using one byte
per index, the representation is more compact whenever the population
(number of 1 bits) is less than 32.  Based on the population, the
function ``sc_encode()`` chooses to represent the object either as raw
bytes or as bytes of indices of 1 bits.  These are the block types 0 and 1.
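The break-even point follows from a little arithmetic.  The sketch below
is illustrative only: the helper names are made up, and header details
beyond one head byte per block are ignored.

.. code-block:: python

    def size_type0():
        # raw block: 1 head byte + 32 bytes of raw bitarray data
        return 1 + 32

    def size_type1(population):
        # index block: 1 head byte + 1 index byte per set bit
        return 1 + population

    # the index block wins exactly when fewer than 32 bits are set
    for pop in range(256):
        assert (size_type1(pop) < size_type0()) == (pop < 32)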
Next, consider a ``bitarray`` of length 65536.  When each section of 256
bits has a population below 32, it can be stored as 256 blocks of type 1.
That is, we need 256 block headers plus one (index) byte for each 1 bit.
However, when the total population is below 256, we can instead introduce
a new block type 2, in which each index is represented by two bytes, and
represent the entire bitarray as a single block (of type 2).
This saves us the 256 block headers (of type 1).
Similarly, for even more sparsely populated bitarrays, it becomes more
efficient to move to blocks which represent each index using 3 or more
bytes.

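The same arithmetic extends to this case.  Another illustrative sketch
with made-up helper names (it ignores the limit of 255 indices imposed by
the count byte, and assumes one head byte per type 1 block):

.. code-block:: python

    def size_as_type1_blocks(population):
        # 256 type 1 head bytes plus one index byte per set bit
        # (assumes every 256-bit section holds fewer than 32 set bits)
        return 256 + population

    def size_as_type2_block(population):
        # head byte + count byte + two index bytes per set bit
        return 2 + 2 * population

    # the single type 2 block is smaller whenever fewer than 254 bits
    # are set, roughly matching the "population below 256" rule above
    for pop in range(256):
        assert (size_as_type2_block(pop) < size_as_type1_blocks(pop)) == (pop < 254)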
The encoding algorithm starts at the front of the ``bitarray``, inspects
the population, and decides which block type to use to encode the
following bits.  Once the first block is written, the algorithm moves on
to inspecting the remaining population, and so on.
This way, a large bitarray with both densely and sparsely populated areas
will be compressed efficiently using different block types.

The binary blob consists of a header which encodes the bit-endianness and the
total length of the bitarray, i.e. the number of bits. The header is followed
by an arbitrary number of blocks. There are 5 block types. Each block starts
with a block header encoding the block type and specifying the size of the
block data that follows.
.. code-block::

    block   head         count     count   bytes       block size
    type    byte                   byte    per index   (encoded)    (decoded)
    -------------------------------------------------------------------------
      0     0x00..0x9f   0..4096    no     raw         1..4097      0..4096
      1     0xa0..0xbf   0..31      no     1           1..32        32
      2     0xc2         0..255     yes    2           2..512       8,192
      3     0xc3         0..255     yes    3           2..767       2,097,152
      4     0xc4         0..255     yes    4           2..1022      536,870,912

As the decoder stops whenever the decoded block size is 0,
the head byte 0x00 (type 0 with no raw bytes) is considered the stop byte.
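With this table, the blob from the first example can be taken apart by
hand.  The sketch below handles block types 2-4 only; it assumes,
consistently with the example blob though not stated explicitly above,
that the low nibble of the first header byte gives the number of length
bytes that follow, and that the length and the indices are stored in
little-endian byte order:

.. code-block:: python

    blob = b'\x04\x00\x00\x00\x01\xc3\x03\xaa\x00\x00\xcc\xbb\x00\xff\xee\xdd\x00'

    nlen = blob[0] & 0x0f                 # assumed: number of length bytes
    nbits = int.from_bytes(blob[1:1 + nlen], 'little')
    pos = 1 + nlen

    indices = []
    while True:
        head = blob[pos]
        if head == 0x00:                  # stop byte (type 0, no raw bytes)
            break
        assert 0xc2 <= head <= 0xc4, "only block types 2-4 handled here"
        nbytes = head - 0xc0              # 0xc2 -> 2, 0xc3 -> 3, 0xc4 -> 4
        count = blob[pos + 1]             # count byte
        pos += 2
        for _ in range(count):
            indices.append(int.from_bytes(blob[pos:pos + nbytes], 'little'))
            pos += nbytes

    print(nbits, [hex(i) for i in indices])
    # -> 16777216 ['0xaa', '0xbbcc', '0xddeeff']

The recovered length (``1 << 24``) and index list match the bits that were
set in the first example.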

Speed
-----

We create a 64 mbit (8 MB) random bitarray with a probability of 1/1024
for each bit being 1.  The table shows a comparison of different
compression methods:

.. code-block::

                  compress (ms)   decompress (ms)    ratio
    -------------------------------------------------------
    serialize          3.876            1.002        1.0000
    sc                 5.502            2.703        0.0158
    gzip             918.937           10.057        0.0169
    bz2               59.500           32.611        0.0117


Statistics
----------

We create 256 mbit (32 MB) random bitarrays with varying probability ``p``
of elements being 1.  After compression, we look at the compression ratio
and the number of blocks of each type:

.. code-block::

        p            ratio        raw    type 1   type 2   type 3   type 4
    ----------------------------------------------------------------------
    1.00000000    1.00024432     8192         0        0        0        0
    0.50000000    1.00024432     8192         0        0        0        0
    0.25000000    1.00024432     8192         0        0        0        0
    0.12500000    0.95681512   261581    494833        0        0        0
    0.06250000    0.53125548      176   1048400        0        0        0
    0.03125000    0.28118369        0   1048576        0        0        0
    0.01562500    0.15618166        0   1048576        0        0        0
    0.00781250    0.09370023        0   1048576        0        0        0
    0.00390625    0.06188744        0    281719     2996        0        0
    0.00195312    0.03140587        0         0     4096        0        0
    0.00097656    0.01582754        0         0     4096        0        0
    0.00048828    0.00804502        0         0     4096        0        0
    0.00024414    0.00415075        0         0     4096        0        0
    0.00012207    0.00220025        0         0     4096        0        0
    0.00006104    0.00122154        0         0     4096        0        0
    0.00003052    0.00072917        0         0     3958        1        0
    0.00001526    0.00038746        0         0      817       13        0
    0.00000763    0.00018182        0         0        0       16        0
    0.00000381    0.00009358        0         0        0       16        0
    0.00000191    0.00004664        0         0        0       16        0
    0.00000095    0.00002545        0         0        0       16        0
    0.00000048    0.00001267        0         0        0       16        0
    0.00000024    0.00000739        0         0        0       16        0
    0.00000012    0.00000426        0         0        0       16        0
    0.00000006    0.00000226        0         0        0        0        1
    0.00000003    0.00000107        0         0        0        0        1
    0.00000001    0.00000060        0         0        0        0        1