---
grand_parent: Extended API
parent: Asynchronous Operations
---

# `cuda::device::memcpy_async_tx`

Defined in header `<cuda/barrier>`:

```cuda
template <typename T, size_t Alignment>
inline __device__
void cuda::device::memcpy_async_tx(
  T* dest,
  const T* src,
  cuda::aligned_size_t<Alignment> size,
  cuda::barrier<cuda::thread_scope_block>& bar);
```

Copies `size` bytes from global memory `src` to shared memory `dest` and decrements the transaction count of `bar` by `size` bytes.
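
For the copy to complete a barrier phase, the barrier's expected transaction count must have been raised by the same number of bytes, typically with `cuda::device::barrier_arrive_tx` (see [Tracking asynchronous operations by the mbarrier object] in the PTX documentation). Below is a minimal, hypothetical sketch of that pairing from a single issuing thread; the helper name and parameters are placeholders, and the full multi-threaded pattern is shown in the example further down.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility> // cuda::std::move

// Hypothetical helper: one thread issues the copy and, in the same phase,
// raises the expected transaction count of `bar` by the same number of bytes.
__device__ void copy_and_wait(int* dest, const int* src, size_t size,
                              cuda::barrier<cuda::thread_scope_block>& bar) {
  cuda::device::memcpy_async_tx(dest, src, cuda::aligned_size_t<16>(size), bar);
  auto token = cuda::device::barrier_arrive_tx(bar, 1, size);
  bar.wait(cuda::std::move(token)); // returns once all arrivals have happened and the bytes have landed
}
```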

## Preconditions

* `src`, `dest` are 16-byte aligned and `size` is a multiple of 16, i.e.,
  `Alignment >= 16` (see the sketch after this list).
* `dest` points to a shared memory allocation that is at least `size` bytes wide.
* `src` points to a global memory allocation that is at least `size` bytes wide.
* `bar` is located in shared memory.
* If either `dest` or `src` is an invalid or null pointer, the behavior is
  undefined (even if `size` is zero).

## Requires

* `is_trivially_copyable_v<T>` is true.
 
## Notes

This function requires CUDA Compute Capability 9.0 (Hopper) or higher.

There is no feature flag to check if `cuda::device::memcpy_async_tx` is
available.
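
Since there is no feature flag, one way to keep a code base portable is to dispatch on the target architecture at compile time; on Hopper the copy lowers to the [`cp.async.bulk` PTX instruction]. A minimal sketch, assuming compilation with nvcc (where `__CUDA_ARCH__` is defined in device code); `copy_tile` and its parameters are hypothetical:

```cuda
#include <cuda/barrier>

__device__ void copy_tile(int* smem_dst, const int* gmem_src, size_t bytes,
                          cuda::barrier<cuda::thread_scope_block>& bar) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // Hopper and newer: hardware bulk copy with transaction tracking.
  cuda::device::memcpy_async_tx(smem_dst, gmem_src, cuda::aligned_size_t<16>(bytes), bar);
#else
  // A pre-Hopper path would go here; see the comparison with cuda::memcpy_async below.
#endif
}
```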

**Comparison to `cuda::memcpy_async`**: `memcpy_async_tx` supports a subset of
the operations of `memcpy_async`, but gives more control over synchronization
with a barrier. Currently, `memcpy_async_tx` has no synchronous fallback
mechanism, i.e., it does not work on older hardware (before CUDA Compute
Capability 9.0, i.e., pre-Hopper).
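
Where a pre-Hopper fallback is needed, `cuda::memcpy_async` (also available from `<cuda/barrier>`) is the portable choice. A minimal, hypothetical sketch; the helper name and parameters are placeholders:

```cuda
#include <cuda/barrier>

// Portable path: works on older architectures and degrades to a synchronous
// copy where no asynchronous hardware path is available.
__device__ void copy_tile_portable(int* smem_dst, const int* gmem_src, size_t bytes,
                                   cuda::barrier<cuda::thread_scope_block>& bar) {
  cuda::memcpy_async(smem_dst, gmem_src, cuda::aligned_size_t<16>(bytes), bar);
  bar.arrive_and_wait(); // completion of the copy is tied to this barrier phase
}
```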

## Example

```cuda
#include <cuda/barrier>
#include <cuda/std/utility> // cuda::std::move

#if defined(__CUDA_MINIMUM_ARCH__) && __CUDA_MINIMUM_ARCH__ < 900
static_assert(false, "Insufficient CUDA Compute Capability: cuda::device::memcpy_async_tx is not available.");
#endif // __CUDA_MINIMUM_ARCH__

__device__ alignas(16) int gmem_x[2048];

__global__ void example_kernel() {
  using barrier = cuda::barrier<cuda::thread_scope_block>;
  __shared__ alignas(16) int smem_x[1024];
  __shared__ barrier bar;
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
  }
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    cuda::device::memcpy_async_tx(smem_x, gmem_x, cuda::aligned_size_t<16>(sizeof(smem_x)), bar);
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem_x));
  } else {
    token = bar.arrive(1);
  } 
  bar.wait(cuda::std::move(token));

  // smem_x contains the contents of gmem_x[0], ..., gmem_x[1023]
  smem_x[threadIdx.x] += 1;
}
```

[See it on Godbolt](https://godbolt.org/z/oK7Tazszx){: .btn }
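
In the example, thread 0 both issues the copy and arrives with a transaction count of `sizeof(smem_x)` bytes, while all other threads perform a plain arrive; `bar.wait` therefore returns only after every thread has arrived and the copy has delivered its bytes. Below is a hedged sketch of a host-side launch, assuming a device of Compute Capability 9.0 or higher; the kernel indexes `smem_x[threadIdx.x]`, so one block of 1024 threads matches the buffer size:

```cuda
int main() {
  example_kernel<<<1, 1024>>>();
  return cudaDeviceSynchronize() != cudaSuccess;
}
```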

[`cuda::thread_scope`]: ./memory_model.md
[Tracking asynchronous operations by the mbarrier object]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tracking-asynchronous-operations-by-the-mbarrier-object
[`cp.async.bulk` PTX instruction]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk
[thread.barrier.class paragraph 12]: https://eel.is/c++draft/thread.barrier.class#12