.. meta::
  :description: This chapter explains the HIP programming model, the contract
    between the programmer and the compiler/runtime executing the code, and
    how it maps to the hardware.
  :keywords: AMD, ROCm, HIP, CUDA, API design
*******************************************************************************
HIP programming model
*******************************************************************************
The HIP programming model makes it easy to map data-parallel C/C++ algorithms to
massively parallel, wide single instruction, multiple data (SIMD) architectures,
such as GPUs.
While the model may be expressed in most imperative languages (for example,
Python via PyHIP), this document focuses on the original C/C++ API of HIP.
A basic understanding of the underlying device architecture helps you make
efficient use of HIP and of general-purpose graphics processing unit (GPGPU)
programming in general.
RDNA & CDNA architecture summary
================================
GPUs are made up of basic building blocks called compute units (CUs) that
execute the threads of a kernel. These CUs provide the necessary resources for
the threads: the arithmetic logic units (ALUs), register files, caches and
shared memory for efficient communication between the threads.
This design allows for efficient execution of kernels while also being able to
scale from small GPUs embedded in APUs with few CUs up to GPUs designed for data
centers with hundreds of CUs. Figures :ref:`rdna3_cu` and :ref:`cdna3_cu` show
examples of such compute units.
For architecture details, check :ref:`hardware_implementation`.
.. _rdna3_cu:
.. figure:: ../data/understand/programming_model/rdna3_cu.png
:alt: Block diagram showing the structure of an RDNA3 Compute Unit. It
consists of four SIMD units, each including a vector and scalar register
file, with the corresponding scalar and vector ALUs. All four SIMDs
share a scalar and instruction cache, as well as the shared memory. Two
of the SIMD units each share an L0 cache.
Block Diagram of an RDNA3 Compute Unit.
.. _cdna3_cu:
.. figure:: ../data/understand/programming_model/cdna3_cu.png
:alt: Block diagram showing the structure of a CDNA3 compute unit. It includes
Shader Cores, the Matrix Core Unit, a Local Data Share used for sharing
memory between threads in a block, an L1 Cache and a Scheduler. The
Shader Cores represent the vector ALUs and the Matrix Core Unit the
matrix ALUs. The Local Data Share is used as the shared memory.
Block Diagram of a CDNA3 Compute Unit.
Heterogeneous programming
=========================
The HIP programming model assumes two execution contexts. One is referred to
as the *host*, while compute kernels execute on a *device*. These contexts
have different capabilities, and therefore slightly different rules apply. The
*host* execution is defined by the C++ abstract machine, while *device*
execution follows the :ref:`SIMT model<programming_model_simt>` of HIP. These
execution contexts are signified in code by the ``__host__`` and ``__device__``
decorators. There are a few key differences between the two:
* The C++ abstract machine assumes a unified memory address space, meaning that
  one can always access any given address in memory (assuming the absence of
  data races). HIP, however, introduces several memory namespaces; an address
  from one means nothing in another. Moreover, not all address spaces are
  accessible from all contexts.

  Looking at :ref:`rdna3_cu` and :ref:`cdna3_cu`, you can see that every CU has
  an instance of storage backing the namespace ``__shared__``. Even if the host
  had access to these regions of memory, the performance benefits of the
  segmented memory subsystem rely on the host not being able to access them
  asynchronously.
* Not all C++ language features map cleanly to typical device architectures.
  Some are very expensive (meaning slow) to implement on GPU devices, and are
  therefore forbidden in device contexts to keep users from relying on features
  that would unexpectedly and severely degrade their program's performance.
  Offload devices targeted by HIP aren't general purpose devices, at least not
  in the sense that a CPU is. HIP focuses on data parallel computations and as
  such caters to throughput optimized architectures, such as GPUs or
  accelerators derived from GPU architectures.
* Asynchrony is at the forefront of the HIP API. Computations launched on the
  device execute asynchronously with respect to the host, and it is the user's
  responsibility to synchronize their data dispatch/fetch with computations on
  the device (see the sketch following the note below).
.. note::

  HIP does perform implicit synchronization on occasions, unlike other APIs
  such as OpenCL or SYCL, where the responsibility for synchronization mostly
  falls on the user.
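
The sketch below illustrates this asynchrony under stated assumptions: the
``scale`` kernel, the buffer size and all names are illustrative, and error
checking is omitted for brevity. The kernel launch returns control to the host
immediately, so the host synchronizes before reading back the results.

.. code-block:: cpp

  #include <hip/hip_runtime.h>

  #include <vector>

  __global__ void scale(float* data, float factor, size_t n)
  {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      data[i] *= factor;
  }

  int main()
  {
    constexpr size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float* device_ptr = nullptr;
    hipMalloc(&device_ptr, n * sizeof(float));
    hipMemcpy(device_ptr, host.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // The launch returns immediately: the host is free to do unrelated work
    // while the device executes the kernel.
    scale<<<n / 256, 256>>>(device_ptr, 2.0f, n);

    // Block the host until all previously issued device work has completed,
    // then read the results back.
    hipDeviceSynchronize();
    hipMemcpy(host.data(), device_ptr, n * sizeof(float), hipMemcpyDeviceToHost);

    hipFree(device_ptr);
  }
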
.. _programming_model_simt:
Single instruction multiple threads (SIMT)
==========================================
The SIMT programming model behind the HIP device-side execution is a middle-ground
between SMT (Simultaneous Multi-Threading) programming known from multicore CPUs,
and SIMD (Single Instruction, Multiple Data) programming mostly known from exploiting
relevant instruction sets on CPUs (for example SSE/AVX/Neon).
A HIP device compiler maps SIMT code written in HIP C++ to an inherently SIMD
architecture (like GPUs). This is done by scalarizing the entire kernel and issuing the scalar
instructions of multiple kernel instances (called threads) to each of the SIMD engine lanes, rather
than exploiting data parallelism within a single instance of a kernel and spreading
identical instructions over the available SIMD engines.
Consider the following kernel:

.. code-block:: cpp

  __global__ void k(float4* a, const float4* b)
  {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int dim = blockDim.x;

    a[tid] += (tid + bid - dim) * b[tid];
  }
The incoming four-vector of floating-point values ``b`` is multiplied by a
scalar and then added element-wise to the four-vector floating-point values of
``a``. On modern SIMD-capable CPU architectures, the four-vector operations are
expected to compile to a single SIMD instruction. On a GPU, however, the kernel
is scalarized: each thread executes the four vector lanes as separate scalar
operations, and parallelism comes from running many such threads at once, as
seen in the following figure:
.. _simt:
.. figure:: ../data/understand/programming_model/simt.svg
:alt: Image representing the instruction flow of a SIMT program. Two identical
arrows pointing downward with blocks representing the instructions
inside and ellipsis between the arrows. The instructions represented in
the arrows are, from top to bottom: ADD, DIV, FMA, FMA, FMA and FMA.
Instruction flow of the sample SIMT program.
In HIP, lanes of the SIMD architecture are fed by mapping threads of a SIMT
execution, one thread down each lane of a SIMD engine. Execution parallelism
usually isn't exploited from the width of the built-in vector types, but across
multiple threads via the built-in thread IDs ``threadIdx.x``, ``blockIdx.x``,
etc.
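
For illustration, a kernel written in this style derives a per-thread index
from the built-in IDs and uses it to pick the element that thread processes.
The ``axpy`` kernel below is a hypothetical example, not part of the original
text:

.. code-block:: cpp

  #include <hip/hip_runtime.h>

  // Each thread computes one element, selected by its grid-wide thread ID.
  __global__ void axpy(float a, const float* x, float* y, size_t n)
  {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard threads that fall outside the data
      y[i] = a * x[i] + y[i];
  }
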
.. _inherent_thread_model:
Inherent thread model
=====================
The SIMT nature of HIP is captured by the ability to execute user-provided
device programs, expressed as single-source C/C++ functions or sources compiled
online/offline to binaries, in bulk.
All threads of a kernel are uniquely identified by a set of integral values,
called thread IDs. The set of integers identifying a thread relates to the
hierarchy in which the threads execute.
The thread hierarchy inherent to how AMD GPUs operate is depicted in the
following figure.
.. _inherent_thread_hierarchy:
.. figure:: ../data/understand/programming_model/thread_hierarchy.svg
:alt: Diagram depicting nested rectangles of varying color. The outermost one
titled "Grid", inside sets of uniform rectangles layered on one another
titled "Block". Each "Block" containing sets of uniform rectangles
layered on one another titled "Warp". Each of the "Warp" titled
rectangles filled with downward pointing arrows inside.
Hierarchy of thread groups.
Warp (or Wavefront)
  The innermost grouping of threads is called a warp, or a wavefront in ISA
  terms. A warp is the most tightly coupled group of threads, both physically
  and logically. Threads inside a warp are also called lanes, and the integral
  value identifying them is the lane ID.

  .. tip::

    Lane IDs aren't queried like other thread IDs, but are user-calculated. As
    a consequence, they are only as multidimensional as the user interprets
    the calculated values to be.

  The size of a warp is architecture dependent and always fixed. For AMD GPUs,
  the wavefront size is 64 threads on CDNA architectures and 32 (or optionally
  64) threads on RDNA architectures. Warps are distinguished by the set of
  communication primitives at their disposal, as discussed in
  :ref:`warp-cross-lane`.
.. _inherent_thread_hierarchy_block:
Block
  The middle grouping is called a block or thread block. The defining feature
  of a block is that all threads in a block share an instance of memory which
  they may use to share data or synchronize with one another.

  The size of a block is user-configurable but is limited by the queryable
  capabilities of the executing hardware. The unique ID of the thread within a
  block is 3-dimensional as provided by the API. When linearizing thread IDs
  within a block, treat dimension ``x`` as the fastest-varying index, followed
  by ``y`` and then ``z`` (see the indexing sketch after this list).
.. _inherent_thread_hierarchy_grid:
Grid
The outermost grouping is called a grid. A grid manifests as a single
dispatch of kernels for execution. The unique ID of each block within a grid
is 3-dimensional, as provided by the API and is queryable by every thread
within the block.
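
The device-code sketch below shows how these IDs are commonly combined; the
variable names are illustrative and not part of the original text:

.. code-block:: cpp

  #include <hip/hip_runtime.h>

  __global__ void thread_ids()
  {
    // Linearized thread ID within the block: x is the fastest-varying index.
    int tid = threadIdx.x
            + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);

    // Lane IDs aren't provided by a query; they are derived by the user,
    // here from the linearized thread ID and the architectural warp size.
    int lane_id = tid % warpSize;
    int warp_id = tid / warpSize;

    // Unique ID of the block within the grid, and a grid-wide thread ID.
    int bid = blockIdx.x
            + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z);
    int global_tid = bid * (blockDim.x * blockDim.y * blockDim.z) + tid;

    (void)lane_id; (void)warp_id; (void)global_tid;  // silence unused warnings
  }
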
Cooperative groups thread model
-------------------------------
The Cooperative groups API introduces new APIs to launch, group, subdivide,
synchronize and identify threads, as well as some predefined group-collective
algorithms, but most importantly a matching threading model to think in terms of.
It relaxes some restrictions of the :ref:`inherent_thread_model` imposed by the
strict 1:1 mapping of architectural details to the programming model. Cooperative
groups let you define your own sets of thread groups which may fit your
use-cases better than the defaults defined by the hardware.
.. note::
The implicit groups defined by kernel launch parameters are still available
when working with cooperative groups.
For further information, see :doc:`Cooperative groups </how-to/cooperative_groups>`.
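
As a brief sketch (the ``tile_sum`` kernel, the tile size of 16, and the
single-block assumption are all illustrative), cooperative groups let the code
name the implicit block group explicitly and subdivide it into
programmer-chosen tiles:

.. code-block:: cpp

  #include <hip/hip_runtime.h>
  #include <hip/hip_cooperative_groups.h>

  namespace cg = cooperative_groups;

  __global__ void tile_sum(const int* in, int* out)
  {
    // Handle to the implicit group of all threads in this block.
    cg::thread_block block = cg::this_thread_block();

    // Subdivide the block into tiles of 16 threads; the grouping is chosen by
    // the programmer rather than dictated by the hardware wavefront size.
    auto tile = cg::tiled_partition<16>(block);

    int value = in[block.thread_rank()];

    // Tile-scoped reduction using shuffles; only threads within the same tile
    // exchange data.
    for (unsigned int offset = tile.size() / 2; offset > 0; offset /= 2)
      value += tile.shfl_down(value, offset);

    if (tile.thread_rank() == 0)
      out[block.thread_rank() / 16] = value;
  }
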
Memory model
============
The hierarchy of threads introduced by the :ref:`inherent_thread_model` is induced
by the memory subsystem of GPUs. The following figure summarizes the memory
namespaces and how they relate to the various levels of the threading model.
.. _memory_hierarchy:
.. figure:: ../data/understand/programming_model/memory_hierarchy.svg
:alt: Diagram depicting nested rectangles of varying color. The outermost one
titled "Grid", inside it are two identical rectangles titled "Block",
inside them are ones titled "Local" with multiple "Warp" titled rectangles.
Blocks have not just Local inside, but also rectangles titled "Shared".
Inside the Grid is a rectangle titled "Global" with three others inside:
"Surface", "Texture" (same color) and "Constant" (different color).
Memory hierarchy.
Local or per-thread memory
Read-write storage only visible to the threads defining the given variables,
also called per-thread memory. The size of a block for a given kernel, and thereby
the number of concurrent warps, are limited by local memory usage.
This relates to an important aspect: occupancy. This is the default memory
namespace.
Shared memory
Read-write storage visible to all the threads in a given block.
Global
  Read-write storage visible to all threads in a given grid. There are
  specialized versions of global memory with different usage semantics, which
  are typically backed by the same hardware that stores global memory.
Constant
Read-only storage visible to all threads in a given grid. It is a limited
segment of global with queryable size.
Texture
Read-only storage visible to all threads in a given grid and accessible
through additional APIs.
Surface
A read-write version of texture memory.
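
The kernel sketch below (hypothetical names, assuming a launch with 256
threads per block) shows where the most common namespaces appear in code:

.. code-block:: cpp

  #include <hip/hip_runtime.h>

  // Constant: read-only, visible to all threads in the grid, limited in size.
  __constant__ float scale_factor;

  // Global: read-write, visible to all threads in the grid; in and out point
  // into global memory allocated with hipMalloc on the host.
  __global__ void copy_scaled(const float* in, float* out)
  {
    // Shared: one instance per block, read-write for all threads of the block.
    __shared__ float staging[256];

    // Local (per-thread): the default namespace for function-scope variables.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    staging[threadIdx.x] = in[idx];
    __syncthreads();  // make the shared-memory writes visible block-wide

    out[idx] = staging[threadIdx.x] * scale_factor;
  }
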
Execution model
===============
HIP programs consist of two distinct scopes:
* The host-side API running on the host processor. There are two APIs
  available:

  * The HIP runtime API which enables use of the single-source programming
    model.
  * The HIP driver API which sits at a lower level and differs mainly by
    omitting some facilities provided by the runtime API, most notably around
    kernel launching and argument setting. It is geared towards implementing
    abstractions atop it, such as the runtime API itself, and offers two
    additional pieces of functionality not provided by the runtime API: the
    ``hipModule`` and ``hipCtx`` APIs (see the sketch following the note
    below). For further details, check
    :doc:`HIP driver API </how-to/hip_porting_driver_api>`.

* The device-side kernels running on GPUs. Both the host and the device-side
  APIs have synchronous and asynchronous functions in them.
.. note::

  Unlike NVIDIA CUDA, HIP does not present two *separate* APIs. HIP only
  extends the HIP runtime API with new APIs for ``hipModule`` and ``hipCtx``.
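
As an illustration of the driver-level workflow, the sketch below loads a code
object and launches a kernel from it by name. The file name ``kernels.hsaco``,
the kernel name ``scale`` and its argument list are placeholders, and error
checking is omitted for brevity.

.. code-block:: cpp

  #include <hip/hip_runtime.h>

  int main()
  {
    // Load a pre-compiled code object and look up a kernel in it by name.
    hipModule_t module;
    hipModuleLoad(&module, "kernels.hsaco");

    hipFunction_t kernel;
    hipModuleGetFunction(&kernel, module, "scale");

    constexpr size_t n = 256;
    float* data = nullptr;
    hipMalloc(&data, n * sizeof(float));
    float factor = 2.0f;
    size_t count = n;

    // With the driver-level launch, kernel arguments are packed by the caller.
    void* args[] = {&data, &factor, &count};
    hipModuleLaunchKernel(kernel,
                          1, 1, 1,    // grid dimensions
                          256, 1, 1,  // block dimensions
                          0,          // dynamic shared memory
                          nullptr,    // stream (null = default stream)
                          args, nullptr);

    hipDeviceSynchronize();
    hipModuleUnload(module);
    hipFree(data);
  }
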
Host-side execution
-------------------
The part of the host-side API that deals with device management and device
queries is synchronous. Asynchronous APIs, such as kernel execution, data
movement and potentially data allocation/freeing, all happen in the context of
device streams.
Streams are FIFO buffers of commands to execute relating to a given device.
Calls that enqueue tasks on a stream return promptly; the enqueued command is
executed asynchronously. All side effects of a command on a stream are visible
to all subsequent commands on the same stream. Multiple streams may point to
the same device and those streams may be fed from multiple concurrent host-side
threads. Execution on multiple streams may be concurrent but isn't required to
be.
Asynchronous APIs involving a stream can be synchronized through stream
events. A user may enqueue a barrier onto a stream referencing an event. The
barrier blocks until the command related to the event completes, at which
point all side effects of the command are visible to commands following the
barrier, even if those side effects manifest on different devices.
Streams also support executing user-defined functions as callbacks on the host.
The stream will not launch subsequent commands until the callback completes.
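
The following sketch puts these pieces together. The stream names, the
commented-out kernel and the buffer size are illustrative, and error checking
is omitted for brevity.

.. code-block:: cpp

  #include <hip/hip_runtime.h>

  int main()
  {
    // Two streams targeting the same device; commands within a stream execute
    // in order, commands in different streams may overlap.
    hipStream_t copy_stream, compute_stream;
    hipStreamCreate(&copy_stream);
    hipStreamCreate(&compute_stream);

    constexpr size_t n = 1 << 20;
    float *host_in, *dev_in;
    hipHostMalloc(reinterpret_cast<void**>(&host_in), n * sizeof(float),
                  hipHostMallocDefault);  // pinned memory for async copies
    hipMalloc(reinterpret_cast<void**>(&dev_in), n * sizeof(float));

    hipEvent_t upload_done;
    hipEventCreate(&upload_done);

    // Enqueue an asynchronous upload and record an event after it.
    hipMemcpyAsync(dev_in, host_in, n * sizeof(float),
                   hipMemcpyHostToDevice, copy_stream);
    hipEventRecord(upload_done, copy_stream);

    // Make compute_stream wait (a barrier) until the upload has completed,
    // then launch work on it. Neither call blocks the host.
    hipStreamWaitEvent(compute_stream, upload_done, 0);
    // some_kernel<<<n / 256, 256, 0, compute_stream>>>(dev_in);

    // Block the host until everything submitted to compute_stream finished.
    hipStreamSynchronize(compute_stream);

    hipEventDestroy(upload_done);
    hipStreamDestroy(copy_stream);
    hipStreamDestroy(compute_stream);
    hipHostFree(host_in);
    hipFree(dev_in);
  }
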
Device-side execution
---------------------
Device-side execution follows the :ref:`SIMT model<programming_model_simt>`
described earlier: a middle-ground between SMT (Simultaneous Multi-Threading)
programming known from multicore CPUs, and SIMD (Single Instruction, Multiple
Data) programming mostly known from exploiting relevant instruction sets on
CPUs (for example SSE/AVX/Neon).
Kernel launch
-------------
Kernels may be launched in multiple ways, all with different syntaxes and
intended use-cases.
* Using the triple-chevron ``<<<...>>>`` operator on a ``__global__`` annotated
function.
* Using ``hipLaunchKernelGGL()`` on a ``__global__`` annotated function.
  .. tip::

    This name is by default a macro expanding to the triple-chevron syntax. In
    cases where language syntax extensions are undesirable, or when launching
    templated and/or overloaded kernel functions, define the
    ``HIP_TEMPLATE_KERNEL_LAUNCH`` preprocessor macro before including the HIP
    headers to turn it into a templated function.
* Using the launch APIs supporting the triple-chevron syntax directly.
  .. caution::

    These APIs are intended to be used/generated by tools such as the HIP
    compiler itself and are not intended for end-user code. Should you be
    writing a tool that has to launch device code using HIP, consider using
    these over the alternatives.
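
The sketch below contrasts the first two launch methods; the ``fill`` kernel
is illustrative and error checking is omitted. ``hipLaunchKernelGGL()`` by
default expands to the triple-chevron form.

.. code-block:: cpp

  #include <hip/hip_runtime.h>

  __global__ void fill(float* data, float value, size_t n)
  {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      data[i] = value;
  }

  int main()
  {
    constexpr size_t n = 1024;
    float* data = nullptr;
    hipMalloc(&data, n * sizeof(float));

    // Triple-chevron launch: <<<grid, block, dynamic shared memory, stream>>>.
    fill<<<dim3(n / 256), dim3(256), 0, 0>>>(data, 1.0f, n);

    // Equivalent launch through hipLaunchKernelGGL, which by default expands
    // to the triple-chevron form above.
    hipLaunchKernelGGL(fill, dim3(n / 256), dim3(256), 0, 0, data, 2.0f, n);

    hipDeviceSynchronize();
    hipFree(data);
  }
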