.. _performance-tips:

Performance Tips
================

This is a short guide to features present in Numba that can help with obtaining
the best performance from code. Two examples are used, both entirely contrived
and existing purely for pedagogical reasons to motivate discussion. The first is
the computation of the trigonometric identity ``cos(x)^2 + sin(x)^2``; the
second is a simple element-wise square root of a vector followed by a sum
reduction. All performance numbers are indicative only and, unless otherwise
stated, were taken from running on an Intel ``i7-4790`` CPU (4 hardware
threads) with an input of ``np.arange(1.e7)``.

.. note::
   A reasonably effective approach to achieving high performance code is to
   profile the code running with real data and use that to guide performance
   tuning. The information presented here is to demonstrate features, not to
   act as canonical guidance!

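By way of illustration only (this timing harness is not part of the original
examples), timings of the sort quoted below can be reproduced by compiling a
function once and then timing the already-compiled code, so that JIT
compilation cost is excluded from the measurement::

    import time

    import numpy as np
    from numba import njit

    @njit
    def ident_np(x):
        return np.cos(x) ** 2 + np.sin(x) ** 2

    x = np.arange(1.e7)
    ident_np(x)  # the first call compiles for the float64 array type; exclude it

    start = time.perf_counter()
    ident_np(x)
    print(f"execution time: {time.perf_counter() - start:.3f}s")
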
NoPython mode
-------------

The default mode in which Numba's ``@jit`` decorator operates is
:term:`nopython mode`. This mode is the most restrictive about what can be
compiled, but it results in the fastest executable code.

.. note::
   Historically (prior to 0.59.0) the default compilation mode was a fall-back
   mode whereby the compiler would try to compile in :term:`nopython mode` and,
   if that failed, it would fall back to :term:`object mode`. You are likely to
   see ``@jit(nopython=True)``, or its alias ``@njit``, in use in
   code/documentation as this was the recommended best practice method to force
   use of :term:`nopython mode`. Since Numba 0.59.0 this is no longer necessary
   as :term:`nopython mode` is the default mode for ``@jit``.

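For illustration (the function names here are arbitrary), the following sketch
shows the three equivalent spellings you may encounter; on Numba 0.59.0 or
later all of them compile in :term:`nopython mode`::

    from numba import jit, njit

    @jit                      # nopython mode is the default since Numba 0.59.0
    def ident_a(x):
        return x

    @jit(nopython=True)       # older, explicit spelling
    def ident_b(x):
        return x

    @njit                     # alias of @jit(nopython=True)
    def ident_c(x):
        return x
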
Loops
-----

Whilst NumPy has developed a strong idiom around the use of vector operations,
Numba is perfectly happy with loops too. For users familiar with C or Fortran,
writing Python in this style will work fine in Numba (after all, LLVM gets a
lot of use in compiling C-lineage languages). For example::

    import numpy as np
    from numba import njit

    @njit
    def ident_np(x):
        return np.cos(x) ** 2 + np.sin(x) ** 2

    @njit
    def ident_loops(x):
        r = np.empty_like(x)
        n = len(x)
        for i in range(n):
            r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
        return r

The above functions run at almost identical speeds when decorated with
``@njit``; without the decorator the vectorized function is a couple of orders
of magnitude faster.

+-----------------+-------+----------------+
| Function Name   | @njit | Execution time |
+=================+=======+================+
| ``ident_np``    | No    | 0.581s         |
+-----------------+-------+----------------+
| ``ident_np``    | Yes   | 0.659s         |
+-----------------+-------+----------------+
| ``ident_loops`` | No    | 25.2s          |
+-----------------+-------+----------------+
| ``ident_loops`` | Yes   | 0.670s         |
+-----------------+-------+----------------+

A Case for Object mode: LoopLifting
-----------------------------------

Some functions may be incompatible with the restrictive :term:`nopython mode`
but contain compatible loops. You can enable these functions to attempt
:term:`nopython mode` on their loops by setting ``@jit(forceobj=True)``; the
incompatible code segments will then run in :term:`object mode`.

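As a minimal sketch (``lifted_sum`` and the unsupported object creation are
purely illustrative), a function of this shape could look like::

    import numpy as np
    from numba import jit

    @jit(forceobj=True)
    def lifted_sum(A):
        # A plain Python object: this statement is not supported in nopython
        # mode, so the function as a whole is compiled in object mode.
        state = object()
        acc = 0.
        # This loop on its own is nopython-compatible, so Numba may "lift" it
        # out and compile it separately in nopython mode.
        for i in range(len(A)):
            acc += np.sqrt(A[i])
        return acc
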
Whilst using looplifting in object mode can provide some performance increase,
compiling functions entirely in :term:`nopython mode` is key to achieving
optimal performance.

.. _fast-math:

Fastmath
--------

In certain classes of applications strict IEEE 754 compliance is less
important. As a result it is possible to relax some numerical rigour with a
view to gaining additional performance. The way to achieve this behaviour in
Numba is through the use of the ``fastmath`` keyword argument::

    import numpy as np
    from numba import njit

    @njit(fastmath=False)
    def do_sum(A):
        acc = 0.
        # without fastmath, this loop must accumulate in strict order
        for x in A:
            acc += np.sqrt(x)
        return acc

    @njit(fastmath=True)
    def do_sum_fast(A):
        acc = 0.
        # with fastmath, the reduction can be vectorized as floating point
        # reassociation is permitted.
        for x in A:
            acc += np.sqrt(x)
        return acc

+-----------------+-----------------+
| Function Name   | Execution time  |
+=================+=================+
| ``do_sum``      | 35.2 ms         |
+-----------------+-----------------+
| ``do_sum_fast`` | 17.8 ms         |
+-----------------+-----------------+

In some cases you may wish to opt in to only a subset of possible fast-math
optimizations. This can be done by supplying a set of `LLVM fast-math flags
<https://llvm.org/docs/LangRef.html#fast-math-flags>`_ to ``fastmath``::

    import numpy as np
    from numba import njit

    def add_assoc(x, y):
        return (x - y) + y

    print(njit(fastmath=False)(add_assoc)(0, np.inf))               # nan
    print(njit(fastmath=True)(add_assoc)(0, np.inf))                # 0.0
    print(njit(fastmath={'reassoc', 'nsz'})(add_assoc)(0, np.inf))  # 0.0
    print(njit(fastmath={'reassoc'})(add_assoc)(0, np.inf))         # nan
    print(njit(fastmath={'nsz'})(add_assoc)(0, np.inf))             # nan

Parallel=True
-------------
If code contains operations that are parallelisable (:ref:`and supported
<numba-parallel-supported>`) Numba can compile a version that will run in
parallel on multiple native threads (no GIL!). This parallelisation is performed
automatically and is enabled by simply adding the ``parallel`` keyword
argument::

    @njit(parallel=True)
    def ident_parallel(x):
        return np.cos(x) ** 2 + np.sin(x) ** 2

Execution times are as follows:

+--------------------+-----------------+
| Function Name      | Execution time  |
+====================+=================+
| ``ident_parallel`` | 112 ms          |
+--------------------+-----------------+

The execution speed of this function with ``parallel=True`` present is
approximately 5x that of the NumPy equivalent and 6x that of standard
``@njit``.

Numba parallel execution also has support for explicit parallel loop
declaration similar to that in OpenMP. To indicate that a loop should be
executed in parallel the ``numba.prange`` function should be used; this
function behaves like Python ``range`` and, if ``parallel=True`` is not set,
it acts simply as an alias of ``range``. Loops induced with ``prange`` can be
used for embarrassingly parallel computation and also reductions.

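As a minimal sketch (``ident_prange`` is an illustrative name, not one of the
timed examples), an embarrassingly parallel element-wise loop could be written
as::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def ident_prange(x):
        r = np.empty_like(x)
        # each iteration is independent of the others, so prange lets Numba
        # distribute the iterations across the available threads
        for i in prange(len(x)):
            r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
        return r
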
Revisiting the reduce over sum example, assuming it is safe for the sum to be
accumulated out of order, the loop in ``n`` can be parallelised through the use
of ``prange``. Further, the ``fastmath=True`` keyword argument can be added
without concern in this case as the assumption that out of order execution is
valid has already been made through the use of ``parallel=True`` (as each thread
computes a partial sum).

::

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def do_sum_parallel(A):
        # each thread can accumulate its own partial sum, and then a cross
        # thread reduction is performed to obtain the result to return
        n = len(A)
        acc = 0.
        for i in prange(n):
            acc += np.sqrt(A[i])
        return acc

    @njit(parallel=True, fastmath=True)
    def do_sum_parallel_fast(A):
        n = len(A)
        acc = 0.
        for i in prange(n):
            acc += np.sqrt(A[i])
        return acc

Execution times are as follows; ``fastmath`` again improves performance.

+--------------------------+-----------------+
| Function Name            | Execution time  |
+==========================+=================+
| ``do_sum_parallel``      | 9.81 ms         |
+--------------------------+-----------------+
| ``do_sum_parallel_fast`` | 5.37 ms         |
+--------------------------+-----------------+

.. _intel-svml:

Intel SVML
----------

Intel provides a short vector math library (SVML) that contains a large number
of optimised transcendental functions available for use as compiler
intrinsics. If the ``intel-cmplr-lib-rt`` package is present in the
environment (or the SVML libraries are simply locatable!) then Numba
automatically configures the LLVM back end to use the SVML intrinsic functions
wherever possible. SVML provides both high and low accuracy versions of each
intrinsic and the version that is used is determined through the use of the
``fastmath`` keyword. The default is to use the high accuracy versions, which
are accurate to within ``1 ULP``; however, if ``fastmath`` is set to ``True``
then the lower accuracy versions of the intrinsics are used (answers to within
``4 ULP``).

First obtain SVML, using conda for example::

    conda install intel-cmplr-lib-rt

.. note::
   The SVML library was previously provided through the ``icc_rt`` conda
   package. The ``icc_rt`` package has since become a meta-package and as of
   version ``2021.1.1`` it has ``intel-cmplr-lib-rt`` amongst other packages as
   a dependency. Installing the recommended ``intel-cmplr-lib-rt`` package
   directly results in fewer installed packages.

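As a quick check (a suggestion rather than part of the original example set),
the system information report produced by the ``numba`` command line tool
contains an SVML section stating whether the library has been found and is in
use::

    numba -s
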
Rerunning the identity function example ``ident_np`` from above with various
combinations of options to ``@njit`` and with/without SVML yields the following
performance results (input size ``np.arange(1.e8)``). For reference, with just
NumPy the function executed in ``5.84s``:

+-----------------------------------+--------+-------------------+
| ``@njit`` kwargs                  | SVML   | Execution time    |
+===================================+========+===================+
| ``None``                          | No     | 5.95s             |
+-----------------------------------+--------+-------------------+
| ``None``                          | Yes    | 2.26s             |
+-----------------------------------+--------+-------------------+
| ``fastmath=True``                 | No     | 5.97s             |
+-----------------------------------+--------+-------------------+
| ``fastmath=True``                 | Yes    | 1.8s              |
+-----------------------------------+--------+-------------------+
| ``parallel=True``                 | No     | 1.36s             |
+-----------------------------------+--------+-------------------+
| ``parallel=True``                 | Yes    | 0.624s            |
+-----------------------------------+--------+-------------------+
| ``parallel=True, fastmath=True``  | No     | 1.32s             |
+-----------------------------------+--------+-------------------+
| ``parallel=True, fastmath=True``  | Yes    | 0.576s            |
+-----------------------------------+--------+-------------------+

It is evident that SVML significantly increases the performance of this
function. The impact of ``fastmath`` in the case of SVML not being present is
zero; this is expected as there is nothing in the original function that would
benefit from relaxing numerical strictness.

Linear algebra
--------------

Numba supports most of ``numpy.linalg`` in :term:`nopython mode`. The internal
implementation relies on LAPACK and BLAS libraries to do the numerical work
and it obtains the bindings for the necessary functions from SciPy. Therefore,
to achieve good performance in ``numpy.linalg`` functions with Numba it is
necessary to use a SciPy built against a well optimised LAPACK/BLAS library.
In the case of the Anaconda distribution SciPy is built against Intel's MKL
which is highly optimised and, as a result, Numba makes use of this performance.
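
As a minimal sketch (``solve_system`` is an illustrative name, not from the
examples above), a supported ``numpy.linalg`` function can be called directly
from jitted code::

    import numpy as np
    from numba import njit

    @njit
    def solve_system(a, b):
        # np.linalg.solve is supported in nopython mode; the numerical work is
        # done by the LAPACK/BLAS library that SciPy was built against
        return np.linalg.solve(a, b)

    a = np.random.rand(100, 100)
    b = np.random.rand(100)
    x = solve_system(a, b)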