1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273
|
.. currentmodule:: bitstring
.. _Exotic floats:
Exotic Floating Point Formats
=============================
Python floats are typically 64 bits long, but 32 and 16 bit sizes are also supported through the ``struct`` module.
These are the well-known IEEE formats.
Recently, lower precision floating points have become more widely used, largely driven by the requirements of machine learning algorithms and hardware.
As well as the 'half' precision 16 bit standard, a truncated version of the 32 bit standard called 'bfloat16' is used which has the range of a 32-bit float but less precision.
The #bits value in the tables below show how the available bits are split into `sign` + `exponent` + `mantissa`.
There's always 1 bit to determine the sign of the floating point value.
The more bits in the exponent, the larger the range that can be represented.
The more bits in the mantissa, the greater the precision (~significant figures) of the values.
.. list-table::
:header-rows: 1
* - Type
- # bits
- Standard
- +ive Range
- bitstring / struct format
* - Double precision
- 1 + 11 + 52
- IEEE 754
- 10\ :sup:`-308` → 10\ :sup:`308`
- ``'float64'`` / ``'d'``
* - Single precision
- 1 + 8 + 23
- IEEE 754
- 10\ :sup:`-38` → 10\ :sup:`38`
- ``'float32'`` / ``'f'``
* - Half precision
- 1 + 5 + 10
- IEEE 754
- 6×10\ :sup:`-8` → 65504
- ``'float16'`` / ``'e'``
* - bfloat
- 1 + 8 + 7
- ``-``
- 10\ :sup:`-38` → 10\ :sup:`38`
- ``'bfloat'`` / ``-``
An example of creation and interpretation of a bfloat::
>>> a = Bits(bfloat=4.5e23) # No need to specify length as always 16 bits
>>> a
Bits('0x66be')
>>> a.bfloat
4.486248158726163e+23 # Converted to Python float
IEEE 8-bit Floating Point Types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
In bitstring prior to version 4.2 the `p4binary8` and `p3binary8` formats were called `e4m3float` and `e5m2float` respectively.
The two formats are almost identical, the difference being the addition of `inf` values that replace the largest positive and negative values that were previously available.
Neither should be confused with the `e4m3mxfp` and `e5m2mxfp` formats from the Open Compute Project described below.
The 'binary8' formats are part of an ongoing IEEE standardisation process.
This implementation here is based on a publicly available draft of the standard.
There are seven formats defined in the draft standard, but only two are supported here.
If you'd like the other precisions supported then raise a feature request!
The ``p4binary8`` has a single sign bit, 4 bits for the exponent and 3 bits for the mantissa.
For a bit more range and less precision you can use ``p3binary8`` which has 5 bits for the exponent and only 2 for the mantissa.
Note that in the standard they are called `binary8p4` and `binary8p3`, but for bitstring types any final digits should be the length of the type, so these slightly modified names were chosen.
.. list-table::
:header-rows: 1
* - Type
- # bits
- +ive Range
- bitstring format
* - binary8p4
- 1 + 4 + 3
- 10\ :sup:`-3` → 224
- ``'p4binary8'``
* - binary8p3
- 1 + 5 + 2
- 8×10\ :sup:`-6` → 49152
- ``'p3binary8'``
As there are just 256 possible values, both the range and precision of these formats are extremely limited.
It's remarkable that any useful calculations can be performed, but both inference and training of large machine learning models can be done with these formats.
You can easily examine every possible value that these formats can represent using a line like this::
>>> [Bits(uint=x, length=8).p3binary8 for x in range(256)]
or using the :class:`Array` type it's even more concise - we can create an Array and pretty print all the values with this line::
>>> Array('p4binary8', bytearray(range(256))).pp(width=90)
<Array fmt='p4binary', length=256, itemsize=8 bits, total data size=256 bytes>
[
0: 0.0 0.0009765625 0.001953125 0.0029296875 0.00390625 0.0048828125
6: 0.005859375 0.0068359375 0.0078125 0.0087890625 0.009765625 0.0107421875
12: 0.01171875 0.0126953125 0.013671875 0.0146484375 0.015625 0.017578125
18: 0.01953125 0.021484375 0.0234375 0.025390625 0.02734375 0.029296875
24: 0.03125 0.03515625 0.0390625 0.04296875 0.046875 0.05078125
30: 0.0546875 0.05859375 0.0625 0.0703125 0.078125 0.0859375
36: 0.09375 0.1015625 0.109375 0.1171875 0.125 0.140625
42: 0.15625 0.171875 0.1875 0.203125 0.21875 0.234375
48: 0.25 0.28125 0.3125 0.34375 0.375 0.40625
54: 0.4375 0.46875 0.5 0.5625 0.625 0.6875
60: 0.75 0.8125 0.875 0.9375 1.0 1.125
66: 1.25 1.375 1.5 1.625 1.75 1.875
72: 2.0 2.25 2.5 2.75 3.0 3.25
78: 3.5 3.75 4.0 4.5 5.0 5.5
84: 6.0 6.5 7.0 7.5 8.0 9.0
90: 10.0 11.0 12.0 13.0 14.0 15.0
96: 16.0 18.0 20.0 22.0 24.0 26.0
102: 28.0 30.0 32.0 36.0 40.0 44.0
108: 48.0 52.0 56.0 60.0 64.0 72.0
114: 80.0 88.0 96.0 104.0 112.0 120.0
120: 128.0 144.0 160.0 176.0 192.0 208.0
126: 224.0 inf nan -0.0009765625 -0.001953125 -0.0029296875
132: -0.00390625 -0.0048828125 -0.005859375 -0.0068359375 -0.0078125 -0.0087890625
138: -0.009765625 -0.0107421875 -0.01171875 -0.0126953125 -0.013671875 -0.0146484375
144: -0.015625 -0.017578125 -0.01953125 -0.021484375 -0.0234375 -0.025390625
150: -0.02734375 -0.029296875 -0.03125 -0.03515625 -0.0390625 -0.04296875
156: -0.046875 -0.05078125 -0.0546875 -0.05859375 -0.0625 -0.0703125
162: -0.078125 -0.0859375 -0.09375 -0.1015625 -0.109375 -0.1171875
168: -0.125 -0.140625 -0.15625 -0.171875 -0.1875 -0.203125
174: -0.21875 -0.234375 -0.25 -0.28125 -0.3125 -0.34375
180: -0.375 -0.40625 -0.4375 -0.46875 -0.5 -0.5625
186: -0.625 -0.6875 -0.75 -0.8125 -0.875 -0.9375
192: -1.0 -1.125 -1.25 -1.375 -1.5 -1.625
198: -1.75 -1.875 -2.0 -2.25 -2.5 -2.75
204: -3.0 -3.25 -3.5 -3.75 -4.0 -4.5
210: -5.0 -5.5 -6.0 -6.5 -7.0 -7.5
216: -8.0 -9.0 -10.0 -11.0 -12.0 -13.0
222: -14.0 -15.0 -16.0 -18.0 -20.0 -22.0
228: -24.0 -26.0 -28.0 -30.0 -32.0 -36.0
234: -40.0 -44.0 -48.0 -52.0 -56.0 -60.0
240: -64.0 -72.0 -80.0 -88.0 -96.0 -104.0
246: -112.0 -120.0 -128.0 -144.0 -160.0 -176.0
252: -192.0 -208.0 -224.0 -inf
]
You'll see that there is only 1 zero value and only one 'nan' value, together with positive and negative 'inf' values.
When converting from a Python float (which will typically be stored in 64-bits) unrepresented values are rounded to nearest, with ties-to-even.
This is the standard method used in IEEE 754.
Microscaling Formats
^^^^^^^^^^^^^^^^^^^^
A range of formats from the Microscaling Formats (MX) Alliance are supported. These are part of the Open Compute Project, and will usually have an external scale factor associated with them.
Eight-bit floats similar to the IEEE `p3binary8` and `p4binary8` are available, though these seem rather arbitrary and ugly in places in comparison to the IEEE definitions.
There is also a format to use for the scaling factor, an int-like format which is really a float, and some sensible six and four bit float formats.
.. list-table::
:header-rows: 1
* - Type
- # bits
- +ive Range
- bitstring format
* - E5M2
- 1 + 5 + 2
- 10\ :sup:`-6` → 57344
- ``'e5m2mxfp'``
* - E4M3
- 1 + 4 + 3
- 2×10\ :sup:`-3` → 448
- ``'e4m3mxfp'``
* - E3M2
- 1 + 3 + 2
- 0.0625 → 28
- ``'e3m2mxfp'``
* - E2M3
- 1 + 2 + 3
- 0.125 → 7.5
- ``'e2m3mxfp'``
* - E2M1
- 1 + 2 + 1
- 0.5 → 6
- ``'e2m1mxfp'``
* - E8M0
- 0 + 8 + 0
- 10\ :sup:`-38` → 10\ :sup:`38`
- ``'e8m0mxfp'``
* - INT8
- 8
- 0.015625 → 1.984375
- ``'mxint'``
* The E8M0 format is unsigned and designed to use as a scaling factor for blocks of the other formats.
* The INT8 format is like a signed two's complement 8-bit integer but with a scaling factor of 2\ :sup:`-6`. So despite its name it is actually a float. The standard doesn't specify whether the largest negative value (-2.0) is a supported number or not. This implementation allows it.
* The E4M3 format is similar to the `p4binary8` format but with a different exponent bias and it wastes some values. It has no 'inf' values, instead opting to have two 'nan' values and two zero values.
* The E5M2 format is similar to the `p3binary8` format but wastes even more values. It does have positive and negative 'inf' values, but also six 'nan' values and two zero values.
The MX formats are designed to work with an external scaling factor.
This should be in the E8M0 format, which uses a byte to encode the powers of two from 2\ :sup:`-127` to 2\ :sup:`127`, plus a 'nan' value.
This can be specified in bitstring as part of the `Dtype`, and is very useful inside an ``Array``. ::
>>> d = b'some_byte_data'
>>> a = Array(Dtype('e2m1mxfp', scale=2**10), d)
>>> a.pp()
<Array dtype='Dtype('e2m1mxfp', scale=2 ** 10)', length=28, itemsize=4 bits, total data size=14 bytes> [
0: 6144.0 1536.0 4096.0 -6144.0 4096.0 -3072.0 4096.0 3072.0 3072.0 -6144.0 4096.0 1024.0 6144.0 -512.0
14: 6144.0 2048.0 4096.0 3072.0 3072.0 -6144.0 4096.0 2048.0 4096.0 512.0 6144.0 2048.0 4096.0 512.0
]
To change the scale, replace the dtype in the `Array`::
>>> a.dtype = Dtype('e2m1mxfp', scale=2**6)
When initialising an `Array` from a list of values, you can also use the string ``'auto'`` as the scale, and an appropriate scale will be calculated based on the data. ::
>>> a = Array(Dtype('e2m1mxfp', scale='auto'), [0.0, 0.5, 40.5, 106.25, -52.0, -8.0])
>>> a.pp()
<Array dtype='Dtype('e2m1mxfp', scale=2 ** 4)', length=6, itemsize=4 bits, total data size=3 bytes> [
0: 0.0 0.0 32.0 96.0 -48.0 -8.0
]
The scale is calculated based on the maximum absolute value of the data and the maximum representable value of the format.
The auto-scale feature is only available for 8-bit and smaller floating point formats, plus the IEEE 16-bit format.
If all of the data is zero, then the scale is set to 1.
For more details on this and these formats in general see the `OCP Microscaling formats specification. <https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf>`_
Conversion
^^^^^^^^^^
When converting from a Python float to an any of the 8-or-fewer bit formats, the 'rounds to nearest, ties to even' rule is used.
This is the same as is used in the IEEE 754 standard.
Note that for efficiency reasons Python floats are converted to 16-bit IEEE floats before being converted to their final destination.
This can mean that in edge cases the rounding to the 16-bit float will cause the next rounding to go in the other direction.
The 16-bit float has 11 bits of precision, whereas the final format has at most 4 bits of precision, so this shouldn't be a real-world problem, but it could cause discrepancies when comparing with other methods.
I plan to add a slower, more accurate mode - see Issue #342.
Values that are out of range after rounding are dealt with as follows:
- ``p3binary8`` - Out of range values are set to ``+inf`` or ``-inf``.
- ``p4binary8`` - Out of range values are set to ``+inf`` or ``-inf``.
- ``e5m2mxfp`` - Out of range values are dealt with according to the ``bitstring.options.mxfp_overflow`` setting:
- ``'saturate'`` (the default): values are set to the largest positive or negative finite value, as appropriate. Infinities will also be set to the largest finite value, despite the fact that the format has infinities.
- ``'overflow'``: Out of range values are set to ``+inf`` or ``-inf``.
- ``e4m3mxfp`` - Out of range values are dealt with according to the ``bitstring.options.mxfp_overflow`` setting:
- ``'saturate'`` (the default): values are set to the largest positive or negative value, as appropriate.
- ``'overflow'``: Out of range values are set to ``nan``.
- ``e3m2mxfp`` - Out of range values are saturated to the largest positive or negative value.
- ``e2m3mxfp`` - Out of range values are saturated to the largest positive or negative value.
- ``e2m1mxfp`` - Out of range values are saturated to the largest positive or negative value.
- ``mxint`` - Out of range values are saturated to the largest positive or negative value. Note that the most negative value in this case is ``-2.0``, which it is optional for an implementation to support.
- ``e8m0mxfp`` - No rounding is done. A ``ValueError`` will be raised if you try to convert a non-representable value. This is because the format is designed as a scaling factor, so it should generally be specified exactly.
|