=========================================
Notes on Numba's threading implementation
=========================================
The execution of the work presented by the Numba ``parallel`` targets is
undertaken by the Numba threading layer. Practically, the "threading layer"
is a Numba built-in library that can perform the required concurrent execution.
At the time of writing there are three threading layers available, each
implemented via a different lower level native threading library. More
information on the threading layers and appropriate selection of a threading
layer for a given application/system can be found in the
:ref:`threading layer documentation <numba-threading-layer>`.
The pertinent information to note for the following sections is that the
function in the threading library that performs the parallel execution is the
``parallel_for`` function. The job of this function is to both orchestrate and
execute the parallel tasks.
The relevant source files referenced in this document are:

- ``numba/np/ufunc/tbbpool.cpp``
- ``numba/np/ufunc/omppool.cpp``
- ``numba/np/ufunc/workqueue.c``

  These files contain the TBB, OpenMP, and workqueue threadpool
  implementations, respectively. Each includes the functions
  ``set_num_threads()``, ``get_num_threads()``, and ``get_thread_id()``, as
  well as the relevant logic for thread masking in their respective
  schedulers. Note that the basic thread local variable logic is duplicated in
  each of these files, and not shared between them.

- ``numba/np/ufunc/parallel.py``

  This file contains the Python and JIT compatible wrappers for
  ``set_num_threads()``, ``get_num_threads()``, and ``get_thread_id()``, as
  well as the code that loads the above libraries into Python and launches the
  threadpool.

- ``numba/parfors/parfor_lowering.py``

  This file contains the main logic for generating code for the parallel
  backend. The thread mask is accessed in this file in the code that generates
  scheduler code, and passed to the relevant backend scheduler function (see
  below).

Thread masking
--------------
As part of its design, Numba never launches new threads beyond the threads
that are launched initially with ``numba.np.ufunc.parallel._launch_threads()``
when the first parallel execution is run. This follows from the way threads
were implemented in Numba before thread masking existed. The restriction was
retained to keep the design simple, although it could be removed in the
future. Consequently, the number of threads can be set programmatically, but
only to a value less than or equal to the number of threads that have already
been launched. This is done by "masking" out unused threads, causing
them to do no work. For example, on a 16-core machine, if the user were to
call ``set_num_threads(4)``, Numba would always have 16 threads present, but
12 of them would sit idle for parallel computations. A further call to
``set_num_threads(16)`` would cause those same threads to do work in later
computations.
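
The masking behavior described above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: ``MaskedPool`` and its methods are hypothetical names invented for illustration, the real threading layers implement masking in native code, and the model ignores where the mask is stored.

```python
class MaskedPool:
    """Toy model of a fixed set of launched threads plus a mask (hypothetical)."""

    def __init__(self, launched):
        self.launched = launched   # threads launched once; the set never grows
        self.mask = launched       # by default every launched thread does work

    def set_num_threads(self, n):
        # The mask can only select threads that already exist.
        if not 1 <= n <= self.launched:
            raise ValueError(
                f"The number of threads must be between 1 and {self.launched}")
        self.mask = n

    def active_workers(self):
        # Workers with an index below the mask take part in the computation.
        return list(range(self.mask))

    def idle_workers(self):
        # The remaining workers still exist but are given no work.
        return list(range(self.mask, self.launched))


pool = MaskedPool(launched=16)
pool.set_num_threads(4)
print(len(pool.active_workers()), len(pool.idle_workers()))
pool.set_num_threads(16)
print(len(pool.active_workers()), len(pool.idle_workers()))
```

In this model the 12 masked threads are never destroyed, so raising the mask back to 16 simply makes them eligible for work again.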
:ref:`Thread masking <numba-threading-layer-thread-masking>` was added to make
it possible for a user to programmatically alter the number of threads
performing work in the threading layer. Thread masking proved challenging to
implement as it required the development of a programming model that is suitable
for users, easy to reason about, and could be implemented safely, with
consistent behavior across the various threading layers.
Programming model
~~~~~~~~~~~~~~~~~
The programming model is similar to that found in OpenMP. This model was
chosen because it is familiar to many users, restricted in scope, and simple.
The number of threads in use is specified by calling ``set_num_threads`` and
can be queried by calling ``get_num_threads``. These two functions are
analogous to their OpenMP
counterparts (with the above restriction that the mask must be less than or
equal to the number of launched threads). The execution semantics are also
similar to OpenMP in that once a parallel region is launched, altering the
thread mask has no impact on the currently executing region, but will have an
impact on parallel regions executed subsequently.
The Implementation
~~~~~~~~~~~~~~~~~~
So as to place no further restrictions on user code other than those that
already existed in the threading layer libraries, careful consideration of the
design of thread masking was required. The "thread mask" cannot be stored in a
global value as concurrent use of the threading layer may result in classic
forms of race conditions on the value itself. Numerous designs were discussed
involving various types of mutex on such a global value, all of which were
eventually broken through thought experiment alone. It eventually transpired
that, following some OpenMP implementations, the "thread mask" is best
implemented as a ``thread local``. This means each thread that executes a Numba
parallel function will have a thread local storage (TLS) slot that contains the
value of the thread mask to use when scheduling threads in the ``parallel_for``
function.
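
To make the thread-local idea concrete, the following sketch uses Python's ``threading.local`` as a stand-in for the native TLS slot. The names ``toy_set_num_threads``, ``toy_get_num_threads`` and ``TOY_DEFAULT_MASK`` are hypothetical and exist only for this illustration; they are not Numba's implementation.

```python
import threading

_tls = threading.local()   # each thread sees its own independent attributes
TOY_DEFAULT_MASK = 8       # hypothetical default used only in this sketch


def toy_set_num_threads(n):
    # Writes only to the calling thread's slot; no lock is needed.
    _tls.mask = n


def toy_get_num_threads():
    # Reads the calling thread's slot, falling back to the default.
    return getattr(_tls, "mask", TOY_DEFAULT_MASK)


def worker(n, results, key):
    toy_set_num_threads(n)
    results[key] = toy_get_num_threads()


results = {}
threads = [threading.Thread(target=worker, args=(n, results, f"t{n}"))
           for n in (2, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread never set a mask, so it still sees the default.
results["main"] = toy_get_num_threads()
print(results)
```

Because each thread only ever touches its own slot, concurrent callers cannot race on the mask value, which is the property a global value could not provide.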
The above notion of TLS use for a thread mask is relatively easy to implement:
``get_num_threads`` and ``set_num_threads`` simply need to address the TLS slot
in a given threading layer. This also means that the execution schedule for a
parallel region can be derived from a run time call to ``get_num_threads``. This
is achieved via a well-known and relatively easy to implement pattern:
registering a ``C`` library function and wrapping it in the internal Numba
implementation.
In addition to satisfying the original upfront thread masking requirements, a
few more complicated scenarios needed consideration as follows.
Nested parallelism
******************
In all threading layers a "main thread" will invoke the ``parallel_for``
function and then in the parallel region, depending on the threading layer,
some number of additional threads will assist in doing the actual work.
If the work contains a call to another parallel function (i.e. nested
parallelism) it is necessary for the thread making the call to know what the
"thread mask" of the main thread is so that it can propagate it into the
``parallel_for`` call it makes when executing the nested parallel function.
The implementation of this behavior is threading layer specific but the general
principle is for the "main thread" to always "send" the value of the thread mask
from its TLS slot to all threads in the threading layer that are active in the
parallel region. These active threads then update their TLS slots with this
value prior to performing any work. The net result of this implementation detail
is that:
* thread masks correctly propagate into nested functions
* it's still possible for each thread in a parallel region to safely have a
  different mask with which to call nested functions; if it's not set
  explicitly, then the mask inherited from the "main thread" is used
* threading layers which have dynamic scheduling with threads potentially
  joining and leaving the active pool during a ``parallel_for`` execution are
  successfully accommodated
* any "main thread" thread mask is entirely decoupled from the in-flux nature
  of the thread masks of the threads in the active thread pool
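
The propagation scheme can be sketched in plain Python. This is a toy model under stated assumptions: ``toy_parallel_for``, ``toy_get_mask`` and ``toy_set_mask`` are hypothetical names, and the sketch spawns fresh Python threads for each region purely for simplicity, whereas the real threading layers reuse the threads that were launched once.

```python
import threading

_tls = threading.local()   # stand-in for the per-thread TLS slot


def toy_get_mask():
    return _tls.mask


def toy_set_mask(n):
    _tls.mask = n


def toy_parallel_for(work, n_items):
    mask = toy_get_mask()              # read from the calling thread's slot
    results = [None] * n_items

    def run(worker_index):
        toy_set_mask(mask)             # adopt the caller's mask before working
        for i in range(worker_index, n_items, mask):
            results[i] = work(i)

    workers = [threading.Thread(target=run, args=(w,)) for w in range(mask)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return results


def inner(i):
    # Report the mask visible inside the nested region.
    return toy_get_mask()


def outer(i):
    # A nested "parallel region" launched from a worker thread.
    return toy_parallel_for(inner, 2)


toy_set_mask(3)                        # set on the "main thread" only
nested = toy_parallel_for(outer, 4)
print(nested)
```

Every nested region reports the mask that was set on the main thread, even though no worker set it explicitly. A worker could still call ``toy_set_mask`` itself before launching a nested region to use a different value.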
Python threads independently invoking parallel functions
********************************************************
The threading layer launch sequence is heavily guarded to ensure that the
launch is both thread and process safe and run once per process. In a system
with numerous Python ``threading`` module threads all using Numba, the first
thread through the launch sequence will get its thread mask set appropriately,
but no further threads can run the launch sequence. This means that other
threads will need their initial thread mask set some other way. This is
achieved when ``get_num_threads`` is called and no thread mask is present; in
this case the thread mask will be set to the default. In the implementation,
"no thread mask is present" is represented by the value ``-1`` and the "default
thread mask" (unset) is represented by the value ``0``. The implementation also
immediately calls ``set_num_threads(NUMBA_NUM_THREADS)`` after doing this, so
if either ``-1`` or ``0`` is encountered as a result from ``get_num_threads()`` it
indicates a bug in the above processes.
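
The lazy initialization described above might look roughly like the following in plain Python. This is a sketch only: ``TOY_NUM_THREADS`` is a hypothetical stand-in for ``NUMBA_NUM_THREADS``, the ``toy_`` functions are invented for illustration, and the real logic lives in the native threading layer sources.

```python
import threading

NO_MASK = -1          # "no thread mask is present"
DEFAULT_MASK = 0      # "default thread mask" (unset)
TOY_NUM_THREADS = 8   # hypothetical stand-in for NUMBA_NUM_THREADS

_tls = threading.local()


def toy_set_num_threads(n):
    if not 1 <= n <= TOY_NUM_THREADS:
        raise ValueError(
            f"The number of threads must be between 1 and {TOY_NUM_THREADS}")
    _tls.mask = n


def toy_get_num_threads():
    mask = getattr(_tls, "mask", NO_MASK)
    if mask == NO_MASK:
        # First call on this thread: mark the slot as default, then
        # immediately resolve the default to a concrete thread count.
        _tls.mask = DEFAULT_MASK
        toy_set_num_threads(TOY_NUM_THREADS)
        mask = _tls.mask
    if mask <= 0:
        # Neither sentinel should ever be visible to a caller.
        raise RuntimeError("Invalid number of threads")
    return mask


seen = []
t = threading.Thread(target=lambda: seen.append(toy_get_num_threads()))
t.start()
t.join()
print(seen)
```

The new thread never ran any launch sequence, yet its first query returns a valid mask rather than either sentinel value.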
OS ``fork()`` calls
*******************
The use of TLS was also driven in part by Linux (by far the most popular
platform for Numba use) having a ``fork(2, 3P)`` call that propagates TLS
into child processes; see ``clone(2)``\ 's ``CLONE_SETTLS``.
Thread ID
*********
A private ``get_thread_id()`` function was added to each threading backend,
which returns a unique ID for each thread. This can be accessed from Python by
``numba.np.ufunc.parallel._get_thread_id()`` (it can also be used inside a
JIT compiled function). The thread ID function is useful for testing that the
thread masking behavior is correct, but it should not be used outside of the
tests. For example, one can call ``set_num_threads(4)`` and then collect all
unique ``_get_thread_id()``\ s in a parallel region to verify that only 4
threads are run.
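
The shape of such a test can be sketched with the standard library alone. In this illustration a ``ThreadPoolExecutor`` stands in for the threading layer and ``threading.get_ident()`` stands in for ``_get_thread_id()``; both substitutions are assumptions made so the sketch runs without Numba, and ``MASK`` is a hypothetical name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MASK = 4  # plays the role of set_num_threads(4) in this sketch


def record_thread_id(_):
    # Stand-in for calling _get_thread_id() inside a parallel region.
    return threading.get_ident()


with ThreadPoolExecutor(max_workers=MASK) as pool:
    ids = list(pool.map(record_thread_id, range(1000)))

unique_ids = set(ids)
print(len(unique_ids))

# Fewer threads than the mask may be used, so check an upper bound only.
assert 1 <= len(unique_ids) <= MASK
```

The assertion checks an upper bound rather than equality, since a scheduler is free to use fewer threads than the mask allows.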
Caveats
~~~~~~~
Some caveats to be aware of when testing thread masking:
- The TBB backend may choose to schedule fewer than the given mask number of
  threads. Thus a test such as the one described above may return fewer than 4
  unique threads.
- The workqueue backend is not threadsafe, so attempts to do multithreaded
  nested parallelism with it may result in deadlocks or other undefined
  behavior. The workqueue backend will raise a SIGABRT signal if it detects
  nested parallelism.
- Certain backends may reuse the main thread for computation, but this
  behavior shouldn't be relied upon (for instance, if propagating exceptions).
Use in Code Generation
~~~~~~~~~~~~~~~~~~~~~~
The general pattern for using ``get_num_threads`` in code generation is:

.. code:: python

   from llvmlite import ir as llvmir
   from numba.core import cgutils, types

   # ``builder`` and ``context`` are supplied by the surrounding lowering code.
   # Declare (or look up) the get_num_threads symbol in the current module.
   get_num_threads = cgutils.get_or_insert_function(
       builder.module,
       llvmir.FunctionType(llvmir.IntType(types.intp.bitwidth), []),
       name="get_num_threads")

   # Read the calling thread's mask at run time.
   num_threads = builder.call(get_num_threads, [])

   # Guard against a non-positive value, which would indicate a bug.
   with cgutils.if_unlikely(builder, builder.icmp_signed('<=', num_threads,
                                                 num_threads.type(0))):
       cgutils.printf(builder, "num_threads: %d\n", num_threads)
       context.call_conv.return_user_exc(builder, RuntimeError,
                                         ("Invalid number of threads. "
                                          "This likely indicates a bug in Numba.",))

   # Pass num_threads through to the appropriate backend function here

See the code in ``numba/parfors/parfor_lowering.py``.
The guard against ``num_threads`` being <= 0 is not strictly necessary, but it
can protect against accidentally incorrect behavior in case the thread masking
logic contains a bug.
The ``num_threads`` variable should be passed through to the appropriate
backend function, such as ``do_scheduling`` or ``parallel_for``. If it's used
in some way other than passing it through to the backend function, the above
considerations should be taken into account to ensure the use of the
``num_threads`` variable is safe. It would probably be better to keep such
logic in the threading backends, rather than trying to do it in code
generation.
.. _chunk-details-label:
Parallel Chunksize Details
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are some cases in which the actual parallel work chunk sizes may differ
from the chunk size requested through :func:`numba.set_parallel_chunksize`.
First, if the number of chunks implied by the specified chunk size is less
than the number of configured threads, then Numba will use all of the
configured threads to execute the parallel region. In this case, the actual
chunk size will be less than the requested chunk size. Second, due to
truncation, in cases where the iteration count is slightly less than a
multiple of the chunk size (e.g., 14 iterations and a specified chunk size of
5), the actual chunk size will be larger than the specified chunk size. In
this example, the number of chunks would be 2 and the actual chunk size would
be 7 (i.e. 14 / 2). Lastly, since Numba divides an N-dimensional iteration
space into N-dimensional (hyper)rectangular chunks, there may not be N integer
factors whose product is equal to the chunk size. In this case, some chunks
will have an area/volume larger than the specified chunk size, whereas others
will have one that is smaller.
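
The first two cases can be worked through with a small one-dimensional model. This is a sketch of the arithmetic described above, not Numba's scheduler: ``estimate_chunks`` is a hypothetical helper, and it assumes a positive chunk size and an even split of the iterations across the resulting chunks.

```python
def estimate_chunks(n_iters, chunksize, num_threads):
    """Return the chunk sizes a 1-D loop would get in this simplified model."""
    if n_iters <= 0:
        return []
    # Truncating division: 14 iterations with chunk size 5 gives 2 chunks.
    requested = n_iters // chunksize
    # Use at least one chunk per configured thread, but never more chunks
    # than there are iterations.
    num_chunks = min(max(requested, num_threads, 1), n_iters)
    # Spread the iterations as evenly as possible over the chunks.
    base, extra = divmod(n_iters, num_chunks)
    return [base + 1 if i < extra else base for i in range(num_chunks)]


# Truncation: the actual chunk size (7) exceeds the requested size (5).
print(estimate_chunks(14, 5, 2))

# Too few chunks for the threads: the actual chunk size (25) is below
# the requested size (50).
print(estimate_chunks(100, 50, 4))
```

The multi-dimensional case is not modeled here, since it depends on how the iteration space is factored into (hyper)rectangles.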