File: OMP120.rst

.. _omp120:

Transformed generic-mode kernel to SPMD-mode [OMP120]
=====================================================

This optimization remark indicates that the execution strategy for the OpenMP
target offloading kernel was changed. Generic-mode kernels are executed by a
single thread that schedules parallel worker threads using a state machine. This
code transformation can move a kernel that was initially generated in generic
mode to SPMD-mode where all threads are active at the same time with no state
machine. This execution strategy is closer to how the threads are actually
executed on a GPU target. This is only possible if the instructions previously
executed by a single thread have no side-effects or can be guarded. If the
instructions have no side-effects they are simply recomputed by each thread.
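The recompute-or-guard strategy can be illustrated with plain host threads. The
sketch below is purely illustrative (it uses ``std::thread``, not the code the
compiler actually generates for a GPU): every thread recomputes a
side-effect-free value, while the side-effecting store is guarded so that only
one thread performs it.

.. code-block:: c++

  #include <iostream>
  #include <thread>
  #include <vector>

  int main() {
    const int num_threads = 4;
    int shared_result = 0; // target of the side-effecting store
    std::vector<std::thread> team;
    for (int tid = 0; tid < num_threads; ++tid) {
      team.emplace_back([tid, &shared_result] {
        // No side effects: safe to simply recompute in every thread.
        int local = 6 * 7;
        // Guard: only the "main" thread performs the store.
        if (tid == 0)
          shared_result = local;
      });
    }
    for (auto &t : team)
      t.join();
    std::cout << shared_result << "\n"; // prints 42
    return 0;
  }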

Generic-mode is often considerably slower than SPMD-mode because of the extra
overhead required to separately schedule worker threads and pass data between
them. This optimization allows users to use generic-mode semantics while
achieving the performance of SPMD-mode. This can be helpful when defining shared
memory between the threads using :ref:`OMP111 <omp111>`.

Examples
--------

Normally, any kernel that contains split OpenMP target and parallel regions will
be executed in generic-mode. Sometimes it is easier to use generic-mode
semantics to define shared memory, or to control the distribution of the
threads more tightly. The following example shows a naive matrix-matrix
multiplication that contains code that will need to be guarded.

.. code-block:: c++

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
        double sum = 0.0;

  #pragma omp parallel for reduction(+:sum) default(firstprivate)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

        C[i*N + j] = sum;
      }
    }
  }

.. code-block:: console

   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
   omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
        double sum = 0.0;
               ^
   omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
   #pragma omp target teams distribute collapse(2) \
   ^

This requires guarding the store to the shared variable ``sum`` and the store to
the matrix ``C``. This can be thought of as generating the code below.

.. code-block:: c++

  void matmul(int M, int N, int K, double *A, double *B, double *C) {
  #pragma omp target teams distribute collapse(2) \
    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
    for (int i = 0; i < M; i++) {
      for (int j = 0; j < N; j++) {
      double sum;
  #pragma omp parallel default(firstprivate) shared(sum)
      {
      #pragma omp barrier
      if (omp_get_thread_num() == 0)
        sum = 0.0;
      #pragma omp barrier

  #pragma omp for reduction(+:sum)
        for (int k = 0; k < K; k++)
          sum += A[i*K + k] * B[k*N + j];

      #pragma omp barrier
      if (omp_get_thread_num() == 0)
        C[i*N + j] = sum;
      #pragma omp barrier
      }
      }
    }
  }
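Ignoring the offloading pragmas, both versions compute the same standard matrix
product. The following host-only sketch (a hypothetical harness, not part of
the remark's output) runs the same loop nest sequentially, which can be useful
for sanity-checking the kernel's semantics on a tiny input:

.. code-block:: c++

  #include <iostream>

  // The matmul loop nest with the OpenMP pragmas stripped out.
  void matmul(int M, int N, int K, const double *A, const double *B,
              double *C) {
    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < K; k++)
          sum += A[i * K + k] * B[k * N + j];
        C[i * N + j] = sum;
      }
  }

  int main() {
    double A[4] = {1, 2, 3, 4}; // 2x2, row-major
    double B[4] = {5, 6, 7, 8};
    double C[4] = {};
    matmul(2, 2, 2, A, B, C);
    // [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50]
    std::cout << C[0] << " " << C[1] << " " << C[2] << " " << C[3] << "\n";
    return 0;
  }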


Diagnostic Scope
----------------

OpenMP target offloading optimization remark.