**********************************
How does the CPU dispatcher work?
**********************************
The NumPy dispatcher is based on multi-source compiling: a given source is
compiled multiple times with different compiler flags and with different
**C** definitions that affect the code paths. Each compiled object thereby
enables the instruction-sets needed for its required optimizations, and the
resulting objects are finally linked together.
.. figure:: ../figures/opt-infra.png
This mechanism should support all compilers and it doesn't require any
compiler-specific extension, but at the same time it adds a few steps to
normal compilation that are explained as follows.
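As a rough illustration of the idea (not NumPy's actual build machinery), the
sketch below emulates multi-source compiling inside a single file: one kernel
body is instantiated twice under different names, much like one source compiled
into several objects that are later linked together. All names here
(``DEMO_DEFINE_SUM``, ``demo_sum``, ``demo_sum_WIDE``) are hypothetical, and a
macro stands in for the source file that a real build would compile repeatedly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers: token pasting with argument expansion. */
#define DEMO_CAT_(a, b) a##b
#define DEMO_CAT(a, b) DEMO_CAT_(a, b)

/*
 * One "source" (the kernel body), instantiated once per target.
 * UNROLL mimics a C definition that changes the code path taken
 * inside each compiled object.
 */
#define DEMO_DEFINE_SUM(NAME, UNROLL) \
    long NAME(const int *src, size_t len) \
    { \
        long acc = 0; \
        size_t i = 0; \
        assert(src != NULL || len == 0); \
        /* "wide" path: UNROLL items per iteration */ \
        for (; i + (size_t)(UNROLL) <= len; i += (size_t)(UNROLL)) { \
            for (size_t j = 0; j < (size_t)(UNROLL); ++j) { \
                acc += src[i + j]; \
            } \
        } \
        /* scalar tail */ \
        for (; i < len; ++i) { \
            acc += src[i]; \
        } \
        return acc; \
    }

/* "Object" 1: no suffix, comparable to a baseline build. */
DEMO_DEFINE_SUM(demo_sum, 1)
/* "Object" 2: same body, different definition, suffixed symbol. */
DEMO_DEFINE_SUM(DEMO_CAT(demo_sum_, WIDE), 4)
```

Both symbols compute the same result and can coexist in one program because
their names differ, which is the property the later steps rely on.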
1- Configuration
~~~~~~~~~~~~~~~~
Before building the source files, the user configures the required
optimizations via the two command arguments explained above:
- ``--cpu-baseline``: minimal set of required optimizations.
- ``--cpu-dispatch``: dispatched set of additional optimizations.
2- Discovering the environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this part, we check the compiler and platform architecture
and cache some of the intermediary results to speed up rebuilding.
3- Validating the requested optimizations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The requested optimizations are tested against the compiler to determine
which of them it actually supports.
4- Generating the main configuration header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The generated header ``_cpu_dispatch.h`` contains all the definitions and
headers of instruction-sets for the required optimizations that have been
validated during the previous step.
It also contains extra C definitions that are used for defining NumPy's
Python-level module attributes ``__cpu_baseline__`` and ``__cpu_dispatch__``.
**What is in this header?**
The example header was dynamically generated by gcc on an X86 machine.
The compiler supports ``--cpu-baseline="sse sse2 sse3"`` and
``--cpu-dispatch="ssse3 sse41"``, and the result is below.
.. code:: c
// The header should be located at numpy/numpy/core/src/common/_cpu_dispatch.h
/**NOTE
** C definitions prefixed with "NPY_HAVE_" represent
** the required optimizations.
**
** C definitions prefixed with 'NPY__CPU_TARGET_' are protected and
** shouldn't be used by any NumPy C sources.
*/
/******* baseline features *******/
/** SSE **/
#define NPY_HAVE_SSE 1
#include <xmmintrin.h>
/** SSE2 **/
#define NPY_HAVE_SSE2 1
#include <emmintrin.h>
/** SSE3 **/
#define NPY_HAVE_SSE3 1
#include <pmmintrin.h>
/******* dispatch-able features *******/
#ifdef NPY__CPU_TARGET_SSSE3
/** SSSE3 **/
#define NPY_HAVE_SSSE3 1
#include <tmmintrin.h>
#endif
#ifdef NPY__CPU_TARGET_SSE41
/** SSE41 **/
#define NPY_HAVE_SSE41 1
#include <smmintrin.h>
#endif
**Baseline features** are the minimal set of required optimizations configured
via ``--cpu-baseline``. They have no preprocessor guards and they're
always on, which means they can be used in any source.
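A minimal sketch of how a regular source might rely on a baseline feature. It
assumes that ``NPY_HAVE_SSE2`` comes from the generated header, which would also
have pulled in ``<emmintrin.h>``; the function ``demo_add_f64`` is hypothetical.
The scalar loop keeps the sketch compilable when the definition is absent.

```c
#include <assert.h>
#include <stddef.h>

#ifdef NPY_HAVE_SSE2
    /* The generated header normally includes this for us. */
    #include <emmintrin.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for len doubles. */
void demo_add_f64(double *dst, const double *a, const double *b, size_t len)
{
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#ifdef NPY_HAVE_SSE2
    /* Baseline path: no runtime check is needed here. */
    for (; i + 2 <= len; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail; also the whole loop when SSE2 is not in the baseline. */
    for (; i < len; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```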
Does this mean NumPy's infrastructure passes the compiler flags of the
baseline features to all sources?
Definitely, yes. But the :ref:`dispatch-able sources <dispatchable-sources>` are
treated differently.
What if the user specifies certain **baseline features** during the
build, but the machine running the code doesn't support even these
features? The compiled code could then be reached through one of these
definitions, or the compiler itself may have auto-vectorized some piece of
code based on the provided command-line flags.
During the loading of the NumPy module, a validation step detects this
situation and raises a Python runtime error to inform the user. This
prevents the CPU from reaching an illegal instruction, which would crash
the process.
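The idea behind this load-time check can be sketched as follows. This is only
an illustration under assumed names: ``demo_cpu_supports`` stands in for
whatever runtime detection is used, and ``demo_validate_baseline`` is not a
NumPy function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical runtime probe: pretend the CPU only has SSE and SSE2. */
int demo_cpu_supports(const char *feature)
{
    return strcmp(feature, "SSE") == 0 || strcmp(feature, "SSE2") == 0;
}

/*
 * Returns NULL when every baseline feature is available at runtime,
 * otherwise the name of the first missing one. A module loader could
 * turn a non-NULL result into an error at import time instead of
 * letting the process reach an unsupported instruction later.
 */
const char *demo_validate_baseline(const char *const *baseline, size_t count)
{
    assert(baseline != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (!demo_cpu_supports(baseline[i])) {
            return baseline[i];
        }
    }
    return NULL;
}
```

The check runs once, before any optimized code is called, so a mismatch is
reported up front instead of surfacing deep inside a computation.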
**Dispatch-able features** are our dispatched set of additional optimizations
that were configured via ``--cpu-dispatch``. They are not activated by
default and are always guarded by other C definitions prefixed with
``NPY__CPU_TARGET_``. C definitions ``NPY__CPU_TARGET_`` are only
enabled within **dispatch-able sources**.
.. _dispatchable-sources:
5- Dispatch-able sources and configuration statements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dispatch-able sources are special **C** files that can be compiled multiple
times with different compiler flags and with different **C** definitions.
These affect the code paths so that each compiled object enables certain
instruction-sets, according to "**the configuration statements**". The
statements must be declared inside a **C** comment\ ``(/**/)`` that starts
with the special mark **@targets** and sits at the top of each dispatch-able
source. If optimization is disabled via the command argument
``--disable-optimization``, dispatch-able sources are treated as normal
**C** sources.
**What are configuration statements?**
Configuration statements are keywords combined to determine the required
optimizations for the dispatch-able source.
Example:
.. code:: c
/*@targets avx2 avx512f vsx2 vsx3 asimd asimdhp */
// C code
The keywords mainly represent the additional optimizations configured
through ``--cpu-dispatch``, but they can also represent other options such as:
- Target groups: pre-configured configuration statements used for
  managing the required optimizations from outside the dispatch-able source.
- Policies: collections of options used for changing the default
  behaviors or forcing the compilers to perform certain things.
- "baseline": a unique keyword that represents the minimal optimizations
  configured through ``--cpu-baseline``.
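To make the shape of a configuration statement concrete, here is a small sketch
that extracts the keywords following ``@targets`` from the leading comment of a
source. It is not NumPy's parser; ``demo_parse_targets`` and its fixed limits
are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_KEYWORDS 16
#define DEMO_MAX_KEYWORD_LEN 32

/*
 * Copies the whitespace-separated keywords found between "@targets" and
 * the end of the comment into `out`, and returns how many were stored.
 * Returns 0 when the source has no "@targets" mark.
 */
size_t demo_parse_targets(const char *source,
                          char out[DEMO_MAX_KEYWORDS][DEMO_MAX_KEYWORD_LEN])
{
    static const char mark[] = "@targets";
    size_t count = 0;
    const char *p, *end;

    assert(source != NULL);
    p = strstr(source, mark);
    if (p == NULL) {
        return 0;
    }
    p += sizeof(mark) - 1;
    end = strstr(p, "*/");
    if (end == NULL) {
        end = p + strlen(p);
    }
    while (p < end && count < DEMO_MAX_KEYWORDS) {
        size_t len = 0, copy;
        /* skip separators */
        while (p < end && isspace((unsigned char)*p)) {
            ++p;
        }
        /* measure one keyword */
        while (p + len < end && !isspace((unsigned char)p[len])) {
            ++len;
        }
        if (len == 0) {
            break;
        }
        /* truncate overly long keywords but still consume them fully */
        copy = len < DEMO_MAX_KEYWORD_LEN ? len : DEMO_MAX_KEYWORD_LEN - 1;
        memcpy(out[count], p, copy);
        out[count][copy] = '\0';
        ++count;
        p += len;
    }
    return count;
}
```

The sketch treats every token alike, so telling plain targets apart from
target groups, policies or the ``baseline`` keyword is deliberately left out.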
**NumPy's infrastructure handles dispatch-able sources in four steps**:
- **(A) Recognition**: Just like source templates and F2PY, the
  dispatch-able sources require a special extension, ``*.dispatch.c``,
  to mark C dispatch-able source files, and for C++
  ``*.dispatch.cpp`` or ``*.dispatch.cxx``.
  **NOTE**: C++ is not supported yet.
- **(B) Parsing and validating**: In this step, the dispatch-able sources
  selected by the previous step are parsed one by one, and the configuration
  statements of each are validated in order to determine the required
  optimizations.
- **(C) Wrapping**: This is the approach taken by NumPy's
  infrastructure, and it has proved flexible enough to compile a single
  source multiple times with different **C** definitions and flags that
  affect the code paths. For each required additional optimization, the
  infrastructure creates a temporary **C** source that declares the
  relevant **C** definitions and includes the involved source via the
  **C** directive **#include**. For clarification, take a look at the
  following code for AVX512F:
.. code:: c
/*
* this definition is used by NumPy utilities as suffixes for the
* exported symbols
*/
#define NPY__CPU_TARGET_CURRENT AVX512F
/*
* The following definitions enable
* definitions of the dispatch-able features that are defined within the main
* configuration header. These are definitions for the implied features.
*/
#define NPY__CPU_TARGET_SSE
#define NPY__CPU_TARGET_SSE2
#define NPY__CPU_TARGET_SSE3
#define NPY__CPU_TARGET_SSSE3
#define NPY__CPU_TARGET_SSE41
#define NPY__CPU_TARGET_POPCNT
#define NPY__CPU_TARGET_SSE42
#define NPY__CPU_TARGET_AVX
#define NPY__CPU_TARGET_F16C
#define NPY__CPU_TARGET_FMA3
#define NPY__CPU_TARGET_AVX2
#define NPY__CPU_TARGET_AVX512F
// our dispatch-able source
#include "/the/absolute/path/of/hello.dispatch.c"
- **(D) Dispatch-able configuration header**: The infrastructure
  generates a config header for each dispatch-able source. This header
  mainly contains two abstract **C** macros that identify the
  generated objects, so that any **C** source can use them to dispatch
  certain symbols from the generated objects at runtime. The header is
  also used for forward declarations.
  The generated header takes the name of the dispatch-able source with its
  extension replaced by ``.h``. For example, assume we have a dispatch-able
  source called ``hello.dispatch.c`` that contains the following:
.. code:: c
// hello.dispatch.c
/*@targets baseline sse41 avx512f */
#include <stdio.h>
#include "numpy/utils.h" // NPY_CAT, NPY_TOSTR
#ifndef NPY__CPU_TARGET_CURRENT
    // Only the additional optimizations are wrapped. If the keyword 'baseline'
    // is provided within the configuration statements, the infrastructure adds
    // an extra compilation of the dispatch-able source by passing it as-is to
    // the compiler, without any changes.
    #define CURRENT_TARGET(X) X
    #define NPY__CPU_TARGET_CURRENT baseline // for printing only
#else
    // Reaching this point means we're dealing with one of the additional
    // optimizations, so it could be SSE41 or AVX512F.
    #define CURRENT_TARGET(X) NPY_CAT(NPY_CAT(X, _), NPY__CPU_TARGET_CURRENT)
#endif
// The macro 'CURRENT_TARGET' adds the current target as a suffix to the exported
// symbols to avoid duplicate symbols at link time. NumPy already has a similar
// macro called 'NPY_CPU_DISPATCH_CURFX', located at
// numpy/numpy/core/src/common/npy_cpu_dispatch.h
// NOTE: we tend not to add suffixes to the baseline exported symbols
void CURRENT_TARGET(simd_whoami)(const char *extra_info)
{
    printf("I'm " NPY_TOSTR(NPY__CPU_TARGET_CURRENT) ", %s\n", extra_info);
}
Now assume you attached **hello.dispatch.c** to the source tree. The
infrastructure should then generate a temporary config header called
**hello.dispatch.h** that can be reached by any source in the source
tree, and it should contain the following code:
.. code:: c
#ifndef NPY__CPU_DISPATCH_EXPAND_
    // To expand the macro calls in this header
    #define NPY__CPU_DISPATCH_EXPAND_(X) X
#endif
// Undefine the following macros, since config headers may be included
// multiple times within the same source and each config header represents
// different required optimizations, according to the configuration
// statements of the dispatch-able source it was derived from.
#undef NPY__CPU_DISPATCH_BASELINE_CALL
#undef NPY__CPU_DISPATCH_CALL
// nothing strange here, just a normal preprocessor callback
// enabled only if 'baseline' is specified within the configuration statements
#define NPY__CPU_DISPATCH_BASELINE_CALL(CB, ...) \
    NPY__CPU_DISPATCH_EXPAND_(CB(__VA_ARGS__))
// 'NPY__CPU_DISPATCH_CALL' is an abstract macro used for dispatching
// the required optimizations specified within the configuration statements.
//
// @param CHK, a macro that can be used to detect CPU features at runtime.
//   It takes a CPU feature name without string quotes and returns the
//   test result as a boolean value.
//   NumPy already has a macro called "NPY_CPU_HAVE", which fits this requirement.
//
// @param CB, a callback macro that is expected to be called multiple times,
//   depending on the required optimizations. The callback should receive the
//   following arguments:
//   1- The pending calls of @param CHK filled up with the required CPU features,
//      which need to be tested at runtime before executing the call that
//      belongs to the compiled object.
//   2- The required optimization name, same as in 'NPY__CPU_TARGET_CURRENT'
//   3- Extra arguments in the macro itself
//
// By default the callback calls are sorted by highest interest,
// unless the policy "$keep_sort" is in place within the configuration statements.
// See "Dive into the CPU dispatcher" for more clarification.
#define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
    NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
    NPY__CPU_DISPATCH_EXPAND_(CB((CHK(SSE)&&CHK(SSE2)&&CHK(SSE3)&&CHK(SSSE3)&&CHK(SSE41)), SSE41, __VA_ARGS__))
An example of using the config header in light of the above:
.. code:: c
// NOTE: The following macros are defined for demonstration purposes only.
// NumPy already has a collection of macros located at
// numpy/numpy/core/src/common/npy_cpu_dispatch.h that covers all dispatching
// and declaration scenarios.
#include "numpy/npy_cpu_features.h" // NPY_CPU_HAVE
#include "numpy/utils.h" // NPY_CAT, NPY_EXPAND
// An example of a macro that calls all the exported symbols at once,
// after checking whether they're supported by the running machine.
#define DISPATCH_CALL_ALL(FN, ARGS) \
    NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_ALL_CB, FN, ARGS) \
    NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_ALL_CB, FN, ARGS)
// The preprocessor callbacks.
// They use the same suffixes as defined in the dispatch-able source.
#define DISPATCH_CALL_ALL_CB(CHECK, TARGET_NAME, FN, ARGS) \
    if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
#define DISPATCH_CALL_BASELINE_ALL_CB(FN, ARGS) \
    FN NPY_EXPAND(ARGS);
// An example of a macro that calls only the exported symbol of the
// highest-interest optimization supported by the running machine.
#define DISPATCH_CALL_HIGH(FN, ARGS) \
    if (0) {} \
    NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_HIGH_CB, FN, ARGS) \
    NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_HIGH_CB, FN, ARGS)
// The preprocessor callbacks.
// They use the same suffixes as defined in the dispatch-able source.
#define DISPATCH_CALL_HIGH_CB(CHECK, TARGET_NAME, FN, ARGS) \
    else if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
#define DISPATCH_CALL_BASELINE_HIGH_CB(FN, ARGS) \
    else { FN NPY_EXPAND(ARGS); }
// NumPy has a macro called 'NPY_CPU_DISPATCH_DECLARE' that can be used to
// forward-declare any kind of prototype based on
// 'NPY__CPU_DISPATCH_CALL' and 'NPY__CPU_DISPATCH_BASELINE_CALL'.
// However, in this example we just handle it manually.
void simd_whoami(const char *extra_info);
void simd_whoami_AVX512F(const char *extra_info);
void simd_whoami_SSE41(const char *extra_info);
void trigger_me(void)
{
    // Bring in the auto-generated config header, which contains the config
    // macros 'NPY__CPU_DISPATCH_CALL' and 'NPY__CPU_DISPATCH_BASELINE_CALL'.
    // It is highly recommended to include the config header right before
    // executing the dispatching macros, in case another config header is
    // already in scope.
    #include "hello.dispatch.h"
    DISPATCH_CALL_ALL(simd_whoami, ("all"))
    DISPATCH_CALL_HIGH(simd_whoami, ("the highest interest"))
    // An example of including multiple config headers in the same source
    // #include "hello2.dispatch.h"
    // DISPATCH_CALL_HIGH(another_function, ("the highest interest"))
}
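The listings above depend on NumPy's headers and generated files, so they
cannot be compiled on their own. The following standalone sketch mimics the
same macro pattern with stand-ins: every ``DEMO_`` macro, the ``demo_have_*``
flags and the ``demo_whoami*`` functions are hypothetical, and the feature
checks are plain variables instead of real CPU detection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for NPY_CAT / NPY_EXPAND. */
#define DEMO_CAT_(a, b) a##b
#define DEMO_CAT(a, b) DEMO_CAT_(a, b)
#define DEMO_EXPAND(X) X

/* Stub "runtime" feature flags; a real build would query the CPU. */
int demo_have_AVX512F = 0;
int demo_have_SSE41 = 1;
#define DEMO_CPU_HAVE(FEATURE) DEMO_CAT(demo_have_, FEATURE)

/* Stand-ins for the two macros of a generated config header. */
#define DEMO_DISPATCH_BASELINE_CALL(CB, ...) \
    DEMO_EXPAND(CB(__VA_ARGS__))
#define DEMO_DISPATCH_CALL(CHK, CB, ...) \
    DEMO_EXPAND(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
    DEMO_EXPAND(CB((CHK(SSE41)), SSE41, __VA_ARGS__))

/* One symbol per "compiled object"; the baseline one has no suffix. */
const char *demo_whoami(void)         { return "baseline"; }
const char *demo_whoami_AVX512F(void) { return "AVX512F"; }
const char *demo_whoami_SSE41(void)   { return "SSE41"; }

/* Callbacks building an if / else-if / else chain that assigns `name`. */
#define DEMO_HIGH_CB(CHECK, TARGET, FN, ARGS) \
    else if (CHECK) { name = DEMO_CAT(DEMO_CAT(FN, _), TARGET) ARGS; }
#define DEMO_BASELINE_CB(FN, ARGS) \
    else { name = FN ARGS; }

/* Picks the first supported target, in the order the "header" lists them. */
const char *demo_dispatch_whoami(void)
{
    const char *name = NULL;
    if (0) {}
    DEMO_DISPATCH_CALL(DEMO_CPU_HAVE, DEMO_HIGH_CB, demo_whoami, ())
    DEMO_DISPATCH_BASELINE_CALL(DEMO_BASELINE_CB, demo_whoami, ())
    assert(name != NULL); /* every branch of the chain assigns it */
    return name;
}
```

Flipping the two flags changes which symbol ``demo_dispatch_whoami`` selects,
with the unsuffixed baseline function as the fallback when neither is set.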