File: memcpy_async.md

package info (click to toggle)
libcudacxx 1.8.1-2
  • links: PTS, VCS
  • area: main
  • in suites: bookworm
  • size: 66,464 kB
  • sloc: cpp: 517,767; ansic: 9,474; python: 6,108; sh: 2,225; asm: 2,154; makefile: 7
file content (149 lines) | stat: -rw-r--r-- 5,591 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
grand_parent: Extended API
parent: Asynchronous Operations
---

# `cuda::memcpy_async`

Defined in header `<cuda/barrier>`:

```cuda
// (1)
template <typename Shape, cuda::thread_scope Scope, typename CompletionFunction>
__host__ __device__
void cuda::memcpy_async(void* destination, void const* source, Shape size,
                        cuda::barrier<Scope, CompletionFunction>& barrier);

// (2)
template <typename Group,
          typename Shape, cuda::thread_scope Scope, typename CompletionFunction>
__host__ __device__
void cuda::memcpy_async(Group const& group,
                        void* destination, void const* source, Shape size,
                        cuda::barrier<Scope, CompletionFunction>& barrier);
```

Defined in header `<cuda/pipeline>`:

```cuda
// (3)
template <typename Shape, cuda::thread_scope Scope>
__host__ __device__
void cuda::memcpy_async(void* destination, void const* source, Shape size,
                        cuda::pipeline<Scope>& pipeline);

// (4)
template <typename Group, typename Shape, cuda::thread_scope Scope>
__host__ __device__
void cuda::memcpy_async(Group const& group,
                        void* destination, void const* source, Shape size,
                        cuda::pipeline<Scope>& pipeline);
```

Defined in header `<cuda/annotated_ptr>`:

```cuda
// (5)
template <typename Dst, typename Src, typename SrcProperty, typename Shape, typename Sync>
__host__ __device__
void memcpy_async(Dst* dst, cuda::annotated_ptr<Src, SrcProperty> src, Shape size, Sync& sync);

// (6)
template<typename Dst, typename DstProperty, typename Src, typename SrcProperty, typename Shape, typename Sync>
__host__ __device__
void memcpy_async(cuda::annotated_ptr<Dst, DstProperty> dst, cuda::annotated_ptr<Src, SrcProperty> src, Shape size, Sync& sync);

// (7)
template<typename Group, typename Dst, typename Src, typename SrcProperty, typename Shape, typename Sync>
__host__ __device__
void memcpy_async(Group const& group, Dst* dst, cuda::annotated_ptr<Src, SrcProperty> src, Shape size, Sync& sync);

// (8)
template<typename Group, typename Dst, typename DstProperty, typename Src, typename SrcProperty, typename Shape, typename Sync>
__host__ __device__
void memcpy_async(Group const& group, cuda::annotated_ptr<Dst, DstProperty> dst, cuda::annotated_ptr<Src, SrcProperty> src, Shape size, Sync& sync);
```

`cuda::memcpy_async` asynchronously copies `size` bytes from the memory
  location pointed to by `source` to the memory location pointed to by
  `destination`.
Both objects are reinterpreted as arrays of `unsigned char`.

1. Binds the asynchronous copy completion to `cuda::barrier` and issues the copy
   in the current thread.
2. Binds the asynchronous copy completion to `cuda::barrier` and cooperatively
   issues the copy across all threads in `group`.
3. Binds the asynchronous copy completion to `cuda::pipeline` and issues the copy
   in the current thread.
4. Binds the asynchronous copy completion to `cuda::pipeline` and cooperatively
   issues the copy across all threads in `group`.
5. 5-8: convenience wrappers using `cuda::annotated_ptr` where `Sync` is 
   either `cuda::barrier` or `cuda::pipeline`.

## Notes

`cuda::memcpy_async` have similar constraints to [`std::memcpy`], namely:
* If the objects overlap, the behavior is undefined.
* If either `destination` or `source` is an invalid or null pointer, the
    behavior is undefined (even if `count` is zero).
* If the objects are [potentially-overlapping] the behavior is undefined.
* If the objects are not of [_TriviallyCopyable_] type the program is
    ill-formed, no diagnostic required.

If _Shape_ is [`cuda::aligned_size_t`], `source` and `destination` are both
  required to be aligned on `cuda::aligned_size_t::align`, else the behavior is
  undefined.

If `cuda::pipeline` is in a _quitted state_ (see [`cuda::pipeline::quit`]), the
  behavior is undefined.

For cooperative variants, if the parameters are not the same across all threads
  in `group`, the behavior is undefined.

## Template Parameters

| `Group` | A type satisfying the [_Group_] concept.                  |
| `Shape` | Either [`cuda::std::size_t`] or [`cuda::aligned_size_t`]. |

## Parameters

| `group`       | The group of threads.                                    |
| `destination` | Pointer to the memory location to copy to.               |
| `source`      | Pointer to the memory location to copy from.             |
| `size`        | The number of bytes to copy.                             |
| `barrier`     | The barrier object used to wait on the copy completion.  |
| `pipeline`    | The pipeline object used to wait on the copy completion. |

## Examples

```cuda
#include <cuda/barrier>

__global__ void example_kernel(char* dst, char* src) {
  cuda::barrier<cuda::thread_scope_system> bar;
  init(&bar, 1);

  cuda::memcpy_async(dst,     src,      1, bar);
  cuda::memcpy_async(dst + 1, src + 8,  1, bar);
  cuda::memcpy_async(dst + 2, src + 16, 1, bar);
  cuda::memcpy_async(dst + 3, src + 24, 1, bar);

  bar.arrive_and_wait();
}
```

[See it on Godbolt](https://godbolt.org/z/od6q9s8fq){: .btn }


[`std::memcpy`]: https://en.cppreference.com/w/cpp/string/byte/memcpy

[potentially-overlapping]: https://en.cppreference.com/w/cpp/language/object#Subobjects

[_TriviallyCopyable_]: https://en.cppreference.com/w/cpp/named_req/TriviallyCopyable

[_ThreadGroup_]: ./thread_group.md

[`cuda::std::size_t`]: https://en.cppreference.com/w/c/types/size_t
[`cuda::aligned_size_t`]: ./shapes/aligned_size_t.md

[`cuda::pipeline::quit`]: ./pipelines/pipeline/quit.md