# torch.cuda

```{eval-rst}
.. automodule:: torch.cuda
```

```{eval-rst}
.. currentmodule:: torch.cuda
```

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    StreamContext
    can_device_access_peer
    current_blas_handle
    current_device
    current_stream
    cudart
    default_stream
    device
    device_count
    device_memory_used
    device_of
    get_arch_list
    get_device_capability
    get_device_name
    get_device_properties
    get_gencode_flags
    get_stream_from_external
    get_sync_debug_mode
    init
    ipc_collect
    is_available
    is_initialized
    is_tf32_supported
    memory_usage
    set_device
    set_stream
    set_sync_debug_mode
    stream
    synchronize
    utilization
    temperature
    power_draw
    clock_rate
    AcceleratorError
    OutOfMemoryError
```

## Random Number Generator

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    get_rng_state
    get_rng_state_all
    set_rng_state
    set_rng_state_all
    manual_seed
    manual_seed_all
    seed
    seed_all
    initial_seed

```

## Communication collectives

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    comm.broadcast
    comm.broadcast_coalesced
    comm.reduce_add
    comm.reduce_add_coalesced
    comm.scatter
    comm.gather
```

## Streams and events

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    Stream
    ExternalStream
    Event
```

## Graphs (beta)

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    is_current_stream_capturing
    graph_pool_handle
    CUDAGraph
    graph
    make_graphed_callables
```

(cuda-memory-management-api)=

```{eval-rst}
.. automodule:: torch.cuda.memory
```

```{eval-rst}
.. currentmodule:: torch.cuda.memory
```

## Memory management

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    empty_cache
    get_per_process_memory_fraction
    list_gpu_processes
    mem_get_info
    memory_stats
    memory_stats_as_nested_dict
    reset_accumulated_memory_stats
    host_memory_stats
    host_memory_stats_as_nested_dict
    reset_accumulated_host_memory_stats
    memory_summary
    memory_snapshot
    memory_allocated
    max_memory_allocated
    reset_max_memory_allocated
    memory_reserved
    max_memory_reserved
    set_per_process_memory_fraction
    memory_cached
    max_memory_cached
    reset_max_memory_cached
    reset_peak_memory_stats
    reset_peak_host_memory_stats
    caching_allocator_alloc
    caching_allocator_delete
    get_allocator_backend
    CUDAPluggableAllocator
    change_current_allocator
    MemPool
```

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    caching_allocator_enable
```

```{eval-rst}
.. currentmodule:: torch.cuda
```

```{eval-rst}
.. autoclass:: torch.cuda.use_mem_pool
```

% FIXME The following doesn't seem to exist. Is it supposed to?
% https://github.com/pytorch/pytorch/issues/27785
% .. autofunction:: reset_max_memory_reserved

## NVIDIA Tools Extension (NVTX)

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    nvtx.mark
    nvtx.range_push
    nvtx.range_pop
    nvtx.range
```

## Jiterator (beta)

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    jiterator._create_jit_fn
    jiterator._create_multi_output_jit_fn
```

## TunableOp

Some operations can be implemented using more than one library or more than
one technique. For example, a GEMM can be implemented on CUDA with the
cuBLAS/cuBLASLt libraries or on ROCm with the hipBLAS/hipBLASLt libraries.
How does one know which implementation is the fastest and should be chosen?
That is what TunableOp provides. Certain operators have been implemented
using multiple strategies as tunable operators; at runtime, all strategies
are profiled and the fastest one is selected for all subsequent operations.

See the {doc}`documentation <cuda.tunable>` for information on how to use it.
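
As a rough sketch, TunableOp can be driven from Python through the
`torch.cuda.tunable` module (or via the `PYTORCH_TUNABLEOP_ENABLED=1`
environment variable); the output filename default below is an assumption
that may vary by version:

```python
import torch

# Turn TunableOp on and allow new solutions to be tuned (both calls are
# part of the torch.cuda.tunable API covered in the linked documentation).
torch.cuda.tunable.enable(True)
torch.cuda.tunable.tuning_enable(True)

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # first call profiles the available GEMM strategies and caches the winner

# Persist the tuning results so later runs can skip re-profiling.
torch.cuda.tunable.write_file()  # default filename is version-dependent
```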

```{toctree}
:hidden: true

cuda.tunable
```

## Stream Sanitizer (prototype)

CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch.
See the {doc}`documentation <cuda._sanitizer>` for information on how to use it.
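
A minimal sketch of triggering it (assuming `enable_cuda_sanitizer()` from
`torch.cuda._sanitizer`; the sanitizer can also be enabled by setting
`TORCH_CUDA_SANITIZER=1` in the environment):

```python
import torch
import torch.cuda._sanitizer as csan

# Enable checking before any kernels are launched.
csan.enable_cuda_sanitizer()

# Create a tensor on a side stream...
with torch.cuda.stream(torch.cuda.Stream()):
    a = torch.ones(8192, device="cuda")

# ...then use it on the default stream without synchronizing. This is the
# kind of cross-stream data race the sanitizer is designed to report.
b = a * 2
```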

```{toctree}
:hidden: true

cuda._sanitizer
```

## GPUDirect Storage (prototype)

The APIs in `torch.cuda.gds` provide thin wrappers around certain cuFile APIs that enable
direct-memory-access transfers between GPU memory and storage, avoiding a bounce buffer in CPU memory. See the
[cuFile API documentation](https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html#cufile-io-api)
for more details.

These APIs require CUDA 12.6 or newer. To use them, you must also ensure that your system is
appropriately configured to use GPUDirect Storage per the
[GPUDirect Storage documentation](https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/contents.html).

See the docs for {class}`~torch.cuda.gds.GdsFile` for an example of how to use these.
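
As a rough illustration only (the path is hypothetical and must live on a
GDS-capable filesystem; defer to the {class}`~torch.cuda.gds.GdsFile` docs
for the authoritative example):

```python
import os
import torch
from torch.cuda.gds import GdsFile

path = "/mnt/gds/tensor.bin"  # hypothetical GDS-capable location

src = torch.randn(1024, device="cuda")
f = GdsFile(path, os.O_CREAT | os.O_RDWR)
f.save_storage(src.untyped_storage(), offset=0)  # GPU -> storage, no CPU bounce buffer

dst = torch.empty(1024, device="cuda")
f.load_storage(dst.untyped_storage(), offset=0)  # storage -> GPU
assert torch.equal(src, dst)
```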

```{eval-rst}
.. currentmodule:: torch.cuda.gds
```

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    gds_register_buffer
    gds_deregister_buffer
    GdsFile

```

% This module needs to be documented. Adding here in the meantime
% for tracking purposes

```{eval-rst}
.. py:module:: torch.cuda.comm
```

```{eval-rst}
.. py:module:: torch.cuda.gds
```

```{eval-rst}
.. py:module:: torch.cuda.graphs
```

```{eval-rst}
.. py:module:: torch.cuda.jiterator
```

```{eval-rst}
.. py:module:: torch.cuda.nccl
```

```{eval-rst}
.. py:module:: torch.cuda.nvtx
```

```{eval-rst}
.. py:module:: torch.cuda.profiler
```

```{eval-rst}
.. py:module:: torch.cuda.random
```

```{eval-rst}
.. py:module:: torch.cuda.sparse
```

```{eval-rst}
.. py:module:: torch.cuda.streams
```