1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
|
Bitarray representations
========================
The bitarray library offers many ways to represent bitarray objects.
Here, we take a closer look at those representations and discuss their
advantages and disadvantages.
Binary representation
---------------------
The most common representation of bitarrays is its native binary string
representation, which is great for interactively analyzing bitarray objects:
.. code-block:: python
>>> from bitarray import bitarray
>>> a = bitarray('11001')
>>> repr(a) # same as str(a)
"bitarray('11001')"
>>> a.to01() # the raw string of 0's and 1's
'11001'
However, this representation is very large compared to the bitarray object
itself, and it not efficient for large bitarrays.
Byte representation
-------------------
As bitarray objects are stored in a byte buffer in memory, it is very
efficient (in terms of size and time) to use this representation of large
bitarrays. However, this representation is not very human readable.
.. code-block:: python
>>> a = bitarray('11001110000011010001110001111000010010101111000111100')
>>> a.tobytes() # raw buffer
b'\xce\r\x1cxJ\xf1\xe0'
Here, the number of pad bits within the last byte, as well as the
bit-endianness, is not part of the byte buffer itself. Therefore, extra work
is required to store this information. The utility function ``serialize()``
adds this information to a header byte:
.. code-block:: python
>>> from bitarray.util import serialize, deserialize
>>> x = serialize(a)
>>> x
b'\x13\xce\r\x1cxJ\xf1\xe0'
>>> b = deserialize(x)
>>> assert a == b and a.endian == b.endian
The header byte is structured the following way:
.. code-block:: python
>>> x[0] # 0x13
19
>>> x[0] % 16 # number of pad bits (0..7) within last byte
3
>>> x[0] // 16 # bit-endianness: 0 little, 1 big
1
Hence, valid values for the header byte are in the ranges 0 .. 7
or 16 .. 23 (inclusive). Moreover, if the serialized bitarray is
empty (``x`` only consists of a single byte - the header byte), the
only valid values for the header are 0 or 16 (corresponding to a
little-endian and big-endian empty bitarray).
The functions ``serialize()`` and ``deserialize()`` are the recommended and
fasted way to (de-) serialize bitarray objects to bytes objects (and vice
versa). The exact format of this representation is guaranteed to not
change in future releases.
Hexadecimal representation
--------------------------
As four bits of a bitarray may be represented by a hexadecimal digit,
we can represent bitarrays (whose length is a multiple of 4) as a hexadecimal
string:
.. code-block:: python
>>> from bitarray.util import ba2hex, hex2ba
>>> a = bitarray('1100 1110 0001 1010 0011 1000 1111')
>>> ba2hex(a)
'ce1a38f'
>>> hex2ba('ce1a38f')
bitarray('1100111000011010001110001111')
Note that the representation is different for the same bitarray if the
endianness changes:
.. code-block:: python
>>> a.endian
'big'
>>> b = bitarray(a, 'little')
>>> assert a == b
>>> b.endian
'little'
>>> ba2hex(b)
'3785c1f'
The functions ``ba2hex()`` and ``hex2ba()`` are very efficiently implemented
in C, and take advantage of byte level operations.
Base 2, 4, 8, 16, 32 and 64 representation
------------------------------------------
The utility function ``ba2base()`` allows representing bitarrays by
base ``n``, with possible bases 2, 4, 8, 16, 32 and 64.
The bitarray has to be multiple of length 1, 2, 3, 4, 5 or 6 respectively:
.. code-block:: python
>>> from bitarray.util import ba2base, base2ba
>>> a = bitarray('001010011010100000111011100110110001111100101110000100010010')
>>> len(a) # divisible by 2, 3, 4, 5 and 6
60
>>> ba2base(2, a) # binary
'001010011010100000111011100110110001111100101110000100010010'
>>> ba2base(4, a) # quaternary
'022122200323212301330232010102'
>>> ba2base(8, a) # octal
'12324073466174560422'
>>> ba2base(16, a) # hexadecimal
'29a83b9b1f2e112'
>>> ba2base(32, a) # base 32 (using RFC 4648 Base32 alphabet)
'FGUDXGY7FYIS'
>>> ba2base(64, a) # base 64 (using standard base 64 alphabet)
'Kag7mx8uES'
Note that ``ba2base(2, a)`` is equivalent to ``a.to01()`` and
that ``ba2base(16, a)`` is equivalent to ``ba2hex(a)``.
Unlike ``ba2hex()``, ``ba2base()`` does not take advantage of byte level
operations and is therefore a lot slower, although still implemented in C.
The inverse function is called ``base2ba()``.
See also `this example <../examples/base-n.py>`__.
Variable length representation
------------------------------
In some cases, it is useful to represent bitarrays in a binary format that
is "self terminating" (in the same way that C strings are NUL terminated).
That is, when an encoded bitarray of unknown length is encountered in a
stream of binary data, the format lets us know when the end of the encoded
bitarray is reached.
See `variable length format <./variable_length.rst>`__ for this representation.
Compressed sparse bitarrays
---------------------------
Another representation
is `compressed sparse bitarrays <./sparse_compression.rst>`__,
whose format is also "self terminating". This, format actually uses different
representations dependent on how sparsely the population of the bitarray (even
sections of the bitarray) is.
For large sparse bitarrays, the format reduces (compresses) the amount of data
very efficiently, while only requiring a very tiny overhead for non-sparsely
populated bitarrays.
|