File: Runtimes.rst

package info (click to toggle)
llvm-toolchain-13 1%3A13.0.1-11
  • links: PTS, VCS
  • area: main
  • in suites: bookworm
  • size: 1,418,840 kB
  • sloc: cpp: 5,290,826; ansic: 996,570; asm: 544,593; python: 188,212; objc: 72,027; lisp: 30,291; f90: 25,395; sh: 24,898; javascript: 9,780; pascal: 9,398; perl: 7,484; ml: 5,432; awk: 3,523; makefile: 2,913; xml: 953; cs: 573; fortran: 539
file content (410 lines) | stat: -rw-r--r-- 18,195 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
.. _openmp_runtimes:

LLVM/OpenMP Runtimes
====================

There are four distinct types of LLVM/OpenMP runtimes 

LLVM/OpenMP Host Runtime (``libomp``)
-------------------------------------

An `early (2015) design document <https://openmp.llvm.org/Reference.pdf>`_ for
the LLVM/OpenMP host runtime, aka.  `libomp.so`, is available as a `pdf
<https://openmp.llvm.org/Reference.pdf>`_.


LLVM/OpenMP Target Host Runtime (``libomptarget``)
--------------------------------------------------

.. _libopenmptarget_environment_vars:

Environment Variables
^^^^^^^^^^^^^^^^^^^^^

``libomptarget`` uses environment variables to control different features of the
library at runtime. This allows the user to obtain useful runtime information as
well as enable or disable certain features. A full list of supported environment
variables is defined below.

    * ``LIBOMPTARGET_DEBUG=<Num>``
    * ``LIBOMPTARGET_PROFILE=<Filename>``
    * ``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=<Num>``
    * ``LIBOMPTARGET_INFO=<Num>``
    * ``LIBOMPTARGET_HEAP_SIZE=<Num>``
    * ``LIBOMPTARGET_STACK_SIZE=<Num>``

LIBOMPTARGET_DEBUG
""""""""""""""""""

``LIBOMPTARGET_DEBUG`` controls whether or not debugging information will be
displayed. This feature is only availible if ``libomptarget`` was built with
``-DOMPTARGET_DEBUG``. The debugging output provided is intended for use by
``libomptarget`` developers. More user-friendly output is presented when using
``LIBOMPTARGET_INFO``.

LIBOMPTARGET_PROFILE
""""""""""""""""""""
``LIBOMPTARGET_PROFILE`` allows ``libomptarget`` to generate time profile output
similar to Clang's ``-ftime-trace`` option. This generates a JSON file based on
`Chrome Tracing`_ that can be viewed with ``chrome://tracing`` or the
`Speedscope App`_. Building this feature depends on the `LLVM Support Library`_
for time trace output. Using this library is enabled by default when building
using the CMake option ``OPENMP_ENABLE_LIBOMPTARGET_PROFILING``. The output will
be saved to the filename specified by the environment variable. For multi-threaded
applications, profiling in ``libomp`` is also needed. Setting the CMake option
``OPENMP_ENABLE_LIBOMP_PROFILING=ON`` to enable the feature. Note that this will
turn ``libomp`` into a C++ library.

.. _`Chrome Tracing`: https://www.chromium.org/developers/how-tos/trace-event-profiling-tool

.. _`Speedscope App`: https://www.speedscope.app/

.. _`LLVM Support Library`: https://llvm.org/docs/SupportLibrary.html

LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD
"""""""""""""""""""""""""""""""""""""

``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`` sets the threshold size for which the
``libomptarget`` memory manager will handle the allocation. Any allocations
larger than this threshold will not use the memory manager and be freed after
the device kernel exits. The default threshold value is ``8KB``. If
``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`` is set to ``0`` the memory manager
will be completely disabled.

LIBOMPTARGET_INFO
"""""""""""""""""

``LIBOMPTARGET_INFO`` allows the user to request different types of runtime
information from ``libomptarget``. ``LIBOMPTARGET_INFO`` uses a 32-bit field to
enable or disable different types of information. This includes information
about data-mappings and kernel execution. It is recommended to build your
application with debugging information enabled, this will enable filenames and
variable declarations in the information messages. OpenMP Debugging information
is enabled at any level of debugging so a full debug runtime is not required.
For minimal debugging information compile with `-gline-tables-only`, or compile
with `-g` for full debug information. A full list of flags supported by
``LIBOMPTARGET_INFO`` is given below. 

    * Print all data arguments upon entering an OpenMP device kernel: ``0x01``
    * Indicate when a mapped address already exists in the device mapping table:
      ``0x02``
    * Dump the contents of the device pointer map at kernel exit: ``0x04``
    * Indicate when an entry is changed in the device mapping table: ``0x08``
    * Print OpenMP kernel information from device plugins: ``0x10``
    * Indicate when data is copied to and from the device: ``0x20``

Any combination of these flags can be used by setting the appropriate bits. For
example, to enable printing all data active in an OpenMP target region along
with ``CUDA`` information, run the following ``bash`` command.

.. code-block:: console

   $ env LIBOMPTARGET_INFO=$((1 << 0x1 | 1 << 0x10)) ./your-application

Or, to enable every flag run with every bit set.

.. code-block:: console

   $ env LIBOMPTARGET_INFO=-1 ./your-application

For example, given a small application implementing the ``ZAXPY`` BLAS routine,
``Libomptarget`` can provide useful information about data mappings and thread
usages.

.. code-block:: c++

    #include <complex>

    using complex = std::complex<double>;

    void zaxpy(complex *X, complex *Y, complex D, std::size_t N) {
    #pragma omp target teams distribute parallel for
      for (std::size_t i = 0; i < N; ++i)
        Y[i] = D * X[i] + Y[i];
    }

    int main() {
      const std::size_t N = 1024;
      complex X[N], Y[N], D;
    #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N])
      zaxpy(X, Y, D, N);
    }

Compiling this code targeting ``nvptx64`` with all information enabled will
provide the following output from the runtime library.

.. code-block:: console

    $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only zaxpy.cpp -o zaxpy
    $ env LIBOMPTARGET_INFO=-1 ./zaxpy

.. code-block:: text

    Info: Entering OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
    Info: to(X[0:N])[16384]
    Info: tofrom(Y[0:N])[16384]
    Info: Creating new map entry with HstPtrBegin=0x00007fff0d259a40,
          TgtPtrBegin=0x00007fdba5800000, Size=16384, RefCount=1, Name=X[0:N]
    Info: Copying data from host to device, HstPtr=0x00007fff0d259a40,
          TgtPtr=0x00007fdba5800000, Size=16384, Name=X[0:N]
    Info: Creating new map entry with HstPtrBegin=0x00007fff0d255a40,
          TgtPtrBegin=0x00007fdba5804000, Size=16384, RefCount=1, Name=Y[0:N]
    Info: Copying data from host to device, HstPtr=0x00007fff0d255a40,
          TgtPtr=0x00007fdba5804000, Size=16384, Name=Y[0:N]
    Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:14:1:
    Info: Host Ptr           Target Ptr         Size (B) RefCount Declaration
    Info: 0x00007fff0d255a40 0x00007fdba5804000 16384    1        Y[0:N] at zaxpy.cpp:13:17
    Info: 0x00007fff0d259a40 0x00007fdba5800000 16384    1        X[0:N] at zaxpy.cpp:13:11
    Info: Entering OpenMP kernel at zaxpy.cpp:6:1 with 4 arguments:
    Info: firstprivate(N)[8] (implicit)
    Info: use_address(Y)[0] (implicit)
    Info: tofrom(D)[16] (implicit)
    Info: use_address(X)[0] (implicit)
    Info: Mapping exists (implicit) with HstPtrBegin=0x00007fff0d255a40,
          TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=2 (incremented), Name=Y
    Info: Creating new map entry with HstPtrBegin=0x00007fff0d2559f0,
          TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1, Name=D
    Info: Copying data from host to device, HstPtr=0x00007fff0d2559f0,
          TgtPtr=0x00007fdba5808000, Size=16, Name=D
    Info: Mapping exists (implicit) with HstPtrBegin=0x00007fff0d259a40,
          TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=2 (incremented), Name=X
    Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40,
          TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=2 (update suppressed)
    Info: Mapping exists with HstPtrBegin=0x00007fff0d2559f0,
          TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1 (update suppressed)
    Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40,
          TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=2 (update suppressed)
    Info: Launching kernel __omp_offloading_10305_c08c86__Z5zaxpyPSt7complexIdES1_S0_m_l6
          with 8 blocks and 128 threads in SPMD mode
    Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40,
          TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=1 (decremented)
    Info: Mapping exists with HstPtrBegin=0x00007fff0d2559f0,
          TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1 (deferred final decrement)
    Info: Copying data from device to host, TgtPtr=0x00007fdba5808000,
          HstPtr=0x00007fff0d2559f0, Size=16, Name=D
    Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40,
          TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=1 (decremented)
    Info: Removing map entry with HstPtrBegin=0x00007fff0d2559f0,
          TgtPtrBegin=0x00007fdba5808000, Size=16, Name=D
    Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:6:1:
    Info: Host Ptr           Target Ptr         Size (B) RefCount Declaration
    Info: 0x00007fff0d255a40 0x00007fdba5804000 16384    1        Y[0:N] at zaxpy.cpp:13:17
    Info: 0x00007fff0d259a40 0x00007fdba5800000 16384    1        X[0:N] at zaxpy.cpp:13:11
    Info: Exiting OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
    Info: to(X[0:N])[16384]
    Info: tofrom(Y[0:N])[16384]
    Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40,
          TgtPtrBegin=0x00007fdba5804000, Size=16384, RefCount=1 (deferred final decrement)
    Info: Copying data from device to host, TgtPtr=0x00007fdba5804000,
          HstPtr=0x00007fff0d255a40, Size=16384, Name=Y[0:N]
    Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40,
          TgtPtrBegin=0x00007fdba5800000, Size=16384, RefCount=1 (deferred final decrement)
    Info: Removing map entry with HstPtrBegin=0x00007fff0d255a40,
          TgtPtrBegin=0x00007fdba5804000, Size=16384, Name=Y[0:N]
    Info: Removing map entry with HstPtrBegin=0x00007fff0d259a40,
          TgtPtrBegin=0x00007fdba5800000, Size=16384, Name=X[0:N]

From this information, we can see the OpenMP kernel being launched on the CUDA
device with enough threads and blocks for all ``1024`` iterations of the loop in
simplified :doc:`SPMD Mode <Offloading>`. The information from the OpenMP data
region shows the two arrays ``X`` and ``Y`` being copied from the host to the
device. This creates an entry in the host-device mapping table associating the
host pointers to the newly created device data. The data mappings in the OpenMP
device kernel show the default mappings being used for all the variables used
implicitly on the device. Because ``X`` and ``Y`` are already mapped in the
device's table, no new entries are created. Additionally, the default mapping
shows that ``D`` will be copied back from the device once the OpenMP device
kernel region ends even though it isn't written to. Finally, at the end of the
OpenMP data region the entries for ``X`` and ``Y`` are removed from the table.

The information level can be controlled at runtime using an internal
libomptarget library call ``__tgt_set_info_flag``. This allows for different
levels of information to be enabled or disabled for certain regions of code.
Using this requires declaring the function signature as an external function so
it can be linked with the runtime library.

.. code-block:: c++

    extern "C" void __tgt_set_info_flag(uint32_t);

    extern foo();

    int main() {
      __tgt_set_info_flag(0x10);
    #pragma omp target
      foo();
    }

.. _libopenmptarget_errors:

Errors:
^^^^^^^

``libomptarget`` provides error messages when the program fails inside the
OpenMP target region. Common causes of failure could be an invalid pointer
access, running out of device memory, or trying to offload when the device is
busy. If the application was built with debugging symbols the error messages
will additionally provide the source location of the OpenMP target region.

For example, consider the following code that implements a simple parallel
reduction on the GPU. This code has a bug that causes it to fail in the
offloading region.

.. code-block:: c++

    #include <cstdio>

    double sum(double *A, std::size_t N) {
      double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:sum)
      for (int i = 0; i < N; ++i)
        sum += A[i];
    
      return sum;
    }
    
    int main() {
      const int N = 1024;
      double A[N];
      sum(A, N);
    }

If this code is compiled and run, there will be an error message indicating what is
going wrong.

.. code-block:: console

    $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
    $ ./sum

.. code-block:: text

    CUDA error: an illegal memory access was encountered 
    Libomptarget error: Copying data from device failed.
    Libomptarget error: Call to targetDataEnd failed, abort target.
    Libomptarget error: Failed to process data after launching the kernel.
    Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
    sum.cpp:5:1: Libomptarget error 1: failure of target construct while offloading is mandatory

This shows that there is an illegal memory access occuring inside the OpenMP
target region once execution has moved to the CUDA device, suggesting a
segmentation fault. This then causes a chain reaction of failures in
``libomptarget``. Another message suggests using the ``LIBOMPTARGET_INFO``
environment variable as described in :ref:`libopenmptarget_environment_vars`. If
we do this it will print the sate of the host-target pointer mappings at the
time of failure.

.. code-block:: console

    $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
    $ env LIBOMPTARGET_INFO=4 ./sum

.. code-block:: text

    info: OpenMP Host-Device pointer mappings after block at sum.cpp:5:1:
    info: Host Ptr           Target Ptr         Size (B) RefCount Declaration
    info: 0x00007ffc058280f8 0x00007f4186600000 8        1        sum at sum.cpp:4:10

This tells us that the only data mapped between the host and the device is the
``sum`` variable that will be copied back from the device once the reduction has
ended. There is no entry mapping the host array ``A`` to the device. In this
situation, the compiler cannot determine the size of the array at compile time
so it will simply assume that the pointer is mapped on the device already by
default. The solution is to add an explicit map clause in the target region.

.. code-block:: c++

    double sum(double *A, std::size_t N) {
      double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:sum) map(to:A[0 : N])
      for (int i = 0; i < N; ++i)
        sum += A[i];
    
      return sum;
    }

LIBOMPTARGET_STACK_SIZE
"""""""""""""""""""""""

This environment variable sets the stack size in bytes for the CUDA plugin. This
can be used to increase or decrease the standard amount of memory reserved for
each thread's stack.

LIBOMPTARGET_HEAP_SIZE
"""""""""""""""""""""""

This environment variable sets the amount of memory in bytes that can be
allocated using ``malloc`` and ``free`` for the CUDA plugin. This is necessary
for some applications that allocate too much memory either through the user or
globalization.

.. toctree::
   :hidden:
   :maxdepth: 1

   Offloading

LLVM/OpenMP Target Host Runtime Plugins (``libomptarget.rtl.XXXX``)
-------------------------------------------------------------------

.. _device_runtime:


.. _remote_offloading_plugin:

Remote Offloading Plugin:
^^^^^^^^^^^^^^^^^^^^^^^^^

The remote offloading plugin permits the execution of OpenMP target regions
on devices in remote hosts in addition to the devices connected to the local
host. All target devices on the remote host will be exposed to the
application as if they were local devices, that is, the remote host CPU or
its GPUs can be offloaded to with the appropriate device number. If the
server is running on the same host, each device may be identified twice:
once through the device plugins and once through the device plugins that the
server application has access to.

This plugin consists of ``libomptarget.rtl.rpc.so`` and
``openmp-offloading-server`` which should be running on the (remote) host. The
server application does not have to be running on a remote host, and can
instead be used on the same host in order to debug memory mapping during offloading.
These are implemented via gRPC/protobuf so these libraries are required to
build and use this plugin. The server must also have access to the necessary
target-specific plugins in order to perform the offloading.

Due to the experimental nature of this plugin, the CMake variable
``LIBOMPTARGET_ENABLE_EXPERIMENTAL_REMOTE_PLUGIN`` must be set in order to
build this plugin. For example, the rpc plugin is not designed to be
thread-safe, the server cannot concurrently handle offloading from multiple
applications at once (it is synchronous) and will terminate after a single
execution. Note that ``openmp-offloading-server`` is unable to
remote offload onto a remote host itself and will error out if this is attempted.

Remote offloading is configured via environment variables at runtime of the OpenMP application:
    * ``LIBOMPTARGET_RPC_ADDRESS=<Address>:<Port>``
    * ``LIBOMPTARGET_RPC_ALLOCATOR_MAX=<NumBytes>``
    * ``LIBOMPTARGET_BLOCK_SIZE=<NumBytes>``
    * ``LIBOMPTARGET_RPC_LATENCY=<Seconds>``

LIBOMPTARGET_RPC_ADDRESS
""""""""""""""""""""""""
The address and port at which the server is running. This needs to be set for
the server and the application, the default is ``0.0.0.0:50051``. A single
OpenMP executable can offload onto multiple remote hosts by setting this to
comma-seperated values of the addresses.

LIBOMPTARGET_RPC_ALLOCATOR_MAX
""""""""""""""""""""""""""""""
After allocating this size, the protobuf allocator will clear. This can be set for both endpoints.

LIBOMPTARGET_BLOCK_SIZE
"""""""""""""""""""""""
This is the maximum size of a single message while streaming data transfers between the two endpoints and can be set for both endpoints.

LIBOMPTARGET_RPC_LATENCY
""""""""""""""""""""""""
This is the maximum amount of time the client will wait for a response from the server.

LLVM/OpenMP Target Device Runtime (``libomptarget-ARCH-SUBARCH.bc``)
--------------------------------------------------------------------