File: memory_kinds_implementation.md

package info (click to toggle)
gasnet 2025.8.0-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 17,424 kB
  • sloc: ansic: 114,758; cpp: 5,158; sh: 4,847; makefile: 2,715; perl: 1,774
file content (646 lines) | stat: -rw-r--r-- 29,170 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
# GASNet-EX Memory Kinds: Implementation Status

# Introduction

This document provides a detailed status of the Memory Kinds feature
implementation and is updated as that status changes.

This document makes references to an external document:

  + GASNet-EX API: Memory Kinds

which is available in the repository as <memory_kinds.pdf>
For brevity, this will be referenced as simply "the MK API".

The majority of this document will address only "CUDA_UVA" and "HIP" memory
kinds.  For information regarding other experimental memory kinds, see the
[Experimental Memory Kinds](#markdown-header-experimental-memory-kinds)
section.

# General Usage

By default, the `configure` script does not enable support for
any non-host memory kinds.  Use of new configure option `--enable-memory-kinds`
enables probes for the necessary headers and libraries for all available device
"kinds" (presently "CUDA_UVA", "HIP" and "ZE") and enables the prototype implementation of
memory kinds if such support is found.  This is the recommended mechanism to
enable memory kinds support, since it will enable additional kinds as they are
added.  For more detailed control for a given kind (such as "cuda-uva" or "hip") the
following take precedence over `--(en|dis)able-memory-kinds`:

  + `--disable-kind-[name]` disables probing for support for the named kind.
  + `--enable-kind-[name]` probes for support for the named kind, with failure
    of the probe being a fatal `configure` error.
  + `--enable-kind-[name]=probe` probes for support for the named kind, with
    failure of the probe being non-fatal (the same behavior requested for all
    kinds by using `--enable-memory-kinds`).

On our main development platforms, the logic in `configure` is sufficient to
locate the required headers and libraries with no additional options.  However,
the following options (and environment equivalents) are available to guide the
probe if needed:  

  + `--with-cuda-home=...` or `CUDA_HOME`
  + `--with-cuda-cflags=...` or `CUDA_CFLAGS`
  + `--with-cuda-libs=...` or `CUDA_LIBS`
  + `--with-cuda-ldflags=...` or `CUDA_LDFLAGS`

  + `--with-hip-home=...` or `HIP_HOME`
  + `--with-hip-cflags=...` or `HIP_CFLAGS`
  + `--with-hip-libs=...` or `HIP_LIBS`
  + `--with-hip-ldflags=...` or `HIP_LDFLAGS`

Generally, it is sufficient to provide the installation prefix of the device
library installation using the `--with-...-home` option or the corresponding
environment variable, since the remaining settings
all have sensible defaults once the installation prefix is known.

Clients can use the preprocessor identifier `GASNET_HAVE_MK_CLASS_CUDA_UVA`
(defined to `1` or undefined) to determine if support for the CUDA_UVA memory
kind was detected at configure time.

Clients can use the preprocessor identifier `GASNET_HAVE_MK_CLASS_HIP`
(defined to `1` or undefined) to determine if support for the HIP memory
kind was detected at configure time.

The preprocessor identifier `GASNET_HAVE_MK_CLASS_MULTIPLE` is more general,
providing the client with an indication if configure detected support for *any*
memory kinds other than host memory.  Again the value is either defined to `1`
or undefined.

# Supported Configurations

All current memory kinds implementation work is limited to two types of devices:

1. Devices with the CUDA Device API and Unified Virtual Addressing (UVA).
   This should include all modern NVIDIA GPUs and CUDA Toolkit versions.

2. Devices with the HIP API, which should include all AMD GPUs supported by AMD ROCm.
   Please consult the ROCm documentation for information on supported devices.

Support is further limited to the following conduit/network combinations, and
only when running drivers with support for PCIe peer-to-peer communication:  

  + ibv-conduit on Linux over NVIDIA/Mellanox InfiniBand hardware  
  + ucx-conduit on Linux over NVIDIA/Mellanox InfiniBand hardware  
  + ofi-conduit (`verbs` provider) on Linux over NVIDIA/Mellanox InfiniBand hardware  
  + ofi-conduit (`verbs` provider) on HPE Cray EX systems using Slingshot-10 NICs  
  + ofi-conduit (`cxi` provider) on HPE Cray EX systems using Slingshot-11 NICs  
  + ofi-conduit *may* work with other providers and/or networks, but the
    GASNet-EX maintainers lack access to test such configurations.  Reports of
    success or failure are encouraged!

In this document we will use "communication offload" in place
of vendor-specific terms for the relevant technology ("GPUDirect RDMA" for
NVIDIA or "ROCmRDMA" for AMD), except when a specific vendor's technology is
intended.  In some cases, additional optional software (such as GPU-specific
Linux kernel modules) is required.  Please consult your GPU vendor's
documentation for assistance determining what driver software is needed for
your specific hardware and Linux distribution.  The Open MPI and MVAPICH
projects also have some documentation regarding deploying of communication
offload for their respective MPI implementations.

With ibv-conduit, `GASNET_SEGMENT_FAST` segment mode is supported.  This is the
default segment mode, but can be specified explicitly at configure time using
the `--enable-segment-fast` option.  To be clear: `--enable-segment-large` and
`--enable-segment-everything` configurations of ibv-conduit do not support
the memory kinds work in the current implementation.

With ucx-conduit, `GASNET_SEGMENT_FAST` and `GASNET_SEGMENT_LARGE` segment
modes are supported.  Fast is the default segment mode, but can be specified
explicitly at configure time using the `--enable-segment-fast` option.  To be
clear: `--enable-segment-everything` configurations of ucx-conduit do not
support the memory kinds work in the current implementation.

With ofi-conduit, `GASNET_SEGMENT_FAST` and `GASNET_SEGMENT_LARGE` segment
modes are supported.  Fast is the default segment mode, but can be specified
explicitly at configure time using the `--enable-segment-fast` option.  To be
clear: `--enable-segment-everything` configurations of ofi-conduit do not
support the memory kinds work in the current implementation.

To the best of our knowledge communications offload support is not available on
current aarch64 (aka ARM64 or ARMv8) systems.  NVIDIA does not support UVA on
32-bit platforms, and AMD does not claim any 32-bit CPU support at all.
This work currently supports only x86-64 and ppc64le.

For configurations on an unsupported CPU architecture which otherwise appear to
to satisfy the configure-testable requirements above, a warning will be issued.
In the case that `--enable-kind-[kind]` was passed but a testable requirement
other than CPU architecture is not satisfied, `configure` will fail.  In other
cases lacking prerequisites, `GASNET_HAVE_MK_CLASS_[KIND]` will be
undefined and attempts to create device segments will fail at runtime.  Future
releases are expected to eventually include a "reference implementation" that
will allow creation of device segments on a wider range of platforms and
transparently stage transfers through host memory bounce buffers, but that is
not yet present.

If support is enabled at configure time, then the implementation will attempt
to determine at runtime if communication offload support is present or not.  If
it is not, then `gex_MK_Create()` will fail with an appropriate message.

## Limits on GPU segment size

Modern GPUs supporting communication offload utilize a "Base Address Register"
mechanism to map the device memory into the PCIe address space.  The amount of
memory which can be mapped by GASNet-EX as a GPU segment is limited by the
`BAR` capability of your GPU, which can also be severely limited by the
motherboard and/or BIOS.

To query the BAR capability of an NVIDIA GPU, run `nvidia-smi -q` and look for the
"Total" value in the "BAR1 Memory Usage" section for an (optimistic) maximum on
the amount of memory which can be mapped.

We are not currently aware of a programmatic means to query the BAR capability of
an AMD GPU.  However, the Linux kernel boot messages (sometimes available via
the `dmesg` utility or in `/var/log/boot`, or similar) may contain lines like
the following:
```
[  115.169185] amdgpu 0000:03:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[  115.169188] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[  115.169190] amdgpu 0000:03:00.0: amdgpu: AGP: 267780096M 0x000000A000000000 - 0x0000FFFFFFFFFFFF
[  115.169201] [drm] Detected VRAM RAM=32752M, BAR=32768M
```
Which in this case shows slightly under 32GB of VRAM and a BAR size of a full
32GB.  In this instance, the VRAM size of `32752M` is the relevant size and,
just like the NVIDIA case, is an (optimistic) maximum limit on mappable memory.

For either GPU vendor, these limits are per GPU across all
process and runtimes on a given node.  Thus a portion is consumed by each
GASNet-EX segment created on a given node, as well as by other uses of
GPUDirect RDMA such as an MPI implementation.  Typically a few tens of MB are
also reserved by the driver itself, though driver overheads as large as 560MB
per process have been observed.

Additionally, the BAR usage is *per-HCA* and thus use of multiple rails may
limit the size of GPU segments.

## UCX Support for Memory Kinds

The support for CUDA and HIP/ROCm device memory has been *optionally* available
in UCX since the 1.6.0 release (which is also the oldest supported by
ucx-conduit).

The ucx-conduit has support for `GEX_MK_CLASS_CUDA_UVA` and `GEX_MK_CLASS_HIP`.
However, see "UCX and CUDA memory" under "Known Issues", below.

## OFI (aka libfabric) Support for Memory Kinds

Our testing has found version 1.12.0 of libfabric to be the oldest which
correctly supports CUDA device memory (modulo the known issues listed later).
For HIP/ROCm device memory the minimum is 1.12.1.  In both cases the support in
libfabric is optional and must be explicitly enabled when the library is built.

In general, absence of the appropriate device support in a build of libfabric
will go undetected at run time and device memory will be treated as host memory.
This can lead to random data corruption, SIGBUS or SIGSEGV errors, which might
be indistinguishable from some of the known issues described later.

To enable device memory support in a build of libfabric from sources, pass its
`configure` script one or both of `--with-cuda` or `--with-rocr` (yes "rocr"
not "rocm").  If the necessary CUDA or ROCm runtime libraries are not in the
default locations, then one may pass an optional argument such as
`--with-cuda=/opt/nvidia/cuda-11.2`.

To query an install of libfabric for CUDA support, the most reliable means
known to us is to run `fi_pingpong` on a compute node with debug tracing
enabled using the two commands which follow:
```sh
$ fi_pingpong &
$ env FI_LOG_LEVEL=debug fi_pingpong localhost |& grep ofi_hmem_init
```
This will yield one or more lines of the form
```
libfabric:[PID]:[TIMESTAMP]:core:core:ofi_hmem_init():222<info> Hmem iface FI_HMEM_[API] not supported
```
Where `[API]` may be `CUDA` or `ROCR`, among others.  These lines enumerate the
device APIs which are *not* supported in the libfabric build.  It is generally
safe to assume that absence of your target API from this output means support is
present in your build of libfabric (assuming it meets the minimum version
requirements, above).

The ofi-conduit has support for `GEX_MK_CLASS_CUDA_UVA` and `GEX_MK_CLASS_HIP`.
However, see multiple "OFI *" entries under "Known Issues", below.

# Known Issues

## Loopback

The current implementation does not handle RMA operations between combinations
of host and GPU memory in the same process (loopback), though this will be
supported in the future.

It should be noted that this temporary limitation precludes use of GASNet for
transfers which could alternatively be performed using `cudaMemcpy()`, or
`cuMemcpy{DtoH,HtoD,DtoD}()` (possibly with some CUDA calls to enable
peer-to-peer access), or by using the `hipMemcpy*()` equivalents.
Therefore, it is recommended practice (and will
*continue* to be so when this limitation is removed) that clients bypass
GASNet-EX to perform such transfers.

For the most up-to-date information on this issue see
[bug 4149](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4149)

## CUDA Primary Context

The current implementation of CUDA_UVA memory kinds uses the device's primary
context for any GASNet-allocated segments, and requires that client-allocated
CUDA device memory passed to `gex_Segment_Create()` be associated with the
device's primary context.

This limitation is not part of the specification, and may be relaxed in a future
release.

## Small Gets into CUDA memory on supported InfiniBand hardware

As documented
[here](https://github.com/linux-rdma/rdma-core/blob/master/providers/mlx5/man/mlx5dv_create_qp.3.md)
current versions of libibverbs for "mlx5" generation HCAs may default to
performing small RMA Gets into an unused space in the work request structure
to minimize the number of PCI bus crossings required.  Such Get operations are
eventually completed by a `memcpy()` to the original destination when the
completion queue entry is reaped.  This `memcpy()` fails (with a `SIGSEGV`)
when the destination is device memory.

As noted in the vendor's documentation, setting `MLX5_SCATTER_TO_CQE=0` in the
environment disables this undesired behavior.  We hope to be able to provide a
better solution (automatic and specific to device memory Gets) in a future
release.

This issue has been seen to impact ibv-, ucx- and ofi-conduits running over
supported InfiniBand hardware, but not on *every* system.  Because this
work-around may increase the latency of all small RMA Gets, including those
into host memory, it is recommended that you set `MLX5_SCATTER_TO_CQE=0` only
if your system exhibits this issue.

For the most up-to-date information on this issue see
[bug 4151](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4151)

## UCX and CUDA memory

Use of GASNet-EX "CUDA_UVA" memory kinds with ucx-conduit may crash due to
incorrect default algorithm selection inside UCX for RMA transfers involving
device memory.  For sufficiently recent UCX versions, there are work-arounds
based on setting environment variables to modify the algorithm selection to be
correct for device memory at the expense of slowing of host memory RMA.

For the most up-to-date information on this issue see
[bug 4384](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4384)

## OFI and Unsupported Providers

Currently, ofi-conduit support for memory kinds is limited to the `verbs` and
`cxi` libfabric providers (when they are built with appropriate `FI_HMEM`
capability support).  If ofi-conduit is compiled with memory kinds enabled and
used with any other provider, the `GEX_MK_CLASS_*` defines will indicate support
for the relevant kinds but calls to `gex_MK_Create()` will return
`GASNET_ERR_RESOURCE`.

It is hoped that in the future builds of ofi-conduit for an unsupported provider
will not define `GEX_MK_CLASS_*`.

## OFI/InfiniBand and CUDA memory

When using ofi-conduit and the `verbs` provider over supported InfiniBand
hardware, RMA Puts with their source in CUDA device memory may crash with a
`SIGSEGV` due to incorrect algorithm selection inside libfabric.

This issue has been seen with a subset of applicable systems, with the majority
of those tested *not* reproducing the problem.  We have been unable to correlate
the problem with the versions of the HCA, firmware, driver or libibverbs.

If you are using an impacted system, this issue can be worked-around
by setting `FI_VERBS_INLINE_SIZE=0` in the environment.  Because this
setting disables a valuable performance optimization, it may increase
the latency of all small RMA Puts, including those from host memory,
and small Active Message.  Therefore, it is strongly recommended that
you set this variable *only* if your system exhibits this issue.

For the most up-to-date information on this second issue see
[bug 4494](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4494)

## OFI/Slingshot-10 and CUDA memory

Because the Slingshot-10 network uses an InfiniBand HCA and libfabric's
`verbs` provider, such systems are impacted by two of the known issues described
above.

1.  Small Gets into CUDA memory on supported InfiniBand hardware  
    The work around of setting `MLX5_SCATTER_TO_CQE=0` is effective.

2.  OFI/InfiniBand and CUDA memory  
    The work around of setting `FI_VERBS_INLINE_SIZE=0` is effective.

This is based on testing with HPE's `libfabric/1.11.0.x.y` environment module
(for multiple values of `x` and `y`).

# Tested Configurations

We have yet to establish the minimum required versions of hardware, drivers or
libraries.  However, the following platforms are our primary development
systems and represent "known good" combinations (modulo those caveats listed
elsewhere in this document):

Summit:  

  + ppc64le (IBM POWER9)
  + CUDA 10.1.243 and 11.3
  + Mellanox ConnectX-5 HCAs
  + NVIDIA Volta-class GPUs

Dirac:  

  + x86_64 (Intel Nehalem)
  + CUDA 11.1 through 11.6
  + Mellanox ConnectX-5 HCAs
  + NVIDIA Maxwell-class GPUs

AMD GPU systems:

  + x86_64 (multiple AMD CPU families)
  + ROCm 4.5
  + Mellanox ConnectX-5 and ConnectX-6 HCAs
  + AMD Instinct-class (MI50, MI100, MI250X) GPUs

HPE Cray EX systems with Slingshot-10:

  + SS10+Nvidia:
    - AMD Milan CPUs
    - Mellanox ConnectX-5 HCAs
    - Nvidia A100 GPUs
    - CUDA 11.6 and 11.7

  + SS10+AMD:
    - AMD Rome CPUs
    - Mellanox ConnectX-5 HCAs
    - AMD MI100 GPUs
    - ROCm 4.5

HPE Cray EX systems with Slingshot-11:

  + SS11+Nvidia:
    - AMD Milan CPUs
    - HPE Cassini NICs
    - Nvidia A100 GPUs
    - CUDA 11.7

  + SS11+AMD:
    - AMD Milan CPUs
    - HPE Cassini NICs
    - AMD MI250X GPUs
    - ROCm 5.1.0

Eventual minimum requirements may be lower than on the platforms listed above, or
possibly higher.

# Experimental Memory Kinds

## oneAPI Level Zero "ZE" Kind

Currently there is experimental support for the "oneAPI Level Zero" API of Intel
GPUs, which is known as the "ZE" kind.  There are known correctness issues,
the performance has yet to be characterized or tuned, and there are multiple
quality-of-implementation issues to be addressed before this support can be
recommended for use in production.

One notable aspect of the experimental status is that the kind-specific members
in `gex_MK_Create_args_t` are subject to change in future releases.

We advise contacting us at gasnet-staff@lbl.gov if you wish to use the ZE kind.

Support for the ZE kind is only present in ofi-conduit, and has only been tested
with the `cxi` provider for HPE Cray EX systems using Slingshot-11 NICs.  It has
*not* been tested with ofi-conduit using the 'verbs' libfabric provider for
Slingshot-10 or other networks using NVIDIA/Mellanox InfiniBand hardware.  
We encourage reports of success or failure with other libfabric providers.

There currently is no ZE memory kind support in ibv- or ucx-conduits.
We are seeking access to InfiniBand-connected clusters of nodes with Intel GPUs
to enable work on ibv- and ucx-conduits, and on the `verbs` provider for
ofi-conduit.  
Please contact us if you can provide such access.

Configure options and environment variables to enable and configure the ZE kind
are analogous to those for CUDA-UVA and HIP, and behave as described earlier
under [General Usage](#markdown-header-general-usage):

  + `--enable-kind-ze[=probe]`
  + `--with-ze-home=...` or `ZE_HOME`
  + `--with-ze-cflags=...` or `ZE_CFLAGS`
  + `--with-ze-libs=...` or `ZE_LIBS`
  + `--with-ze-ldflags=...` or `ZE_LDFLAGS`

Clients can use the preprocessor identifier `GASNET_HAVE_MK_CLASS_ZE`
(defined to `1` or undefined) to determine if support for the ZE memory
kind was detected at configure time.

### Minimum System Requirements for the ZE Kind

Testing of the ZE kind and Slingshot-11 networks has shown the vendor-provided
kernel with SUSE Linux Enterprise Server 15 SP4 (aka "SLES 15.4") to be usable.
However, when using the kernel provided with SLES 15.3, all calls to
`gex_EP_BindSegment()` with ZE device memory segments fail.  
We encourage reports of success or failure with other kernel versions.

Our testing has shown that prior to the 2.1.2 release of HPE's Slingshot Host
Software, 32KiB was the largest device memory segment size which could
successfully be used with the ZE kind and the ofi-conduit `cxi` provider.
Consequently, we strongly recommend the 2.1.2 release of HPE's Slingshot Host
Software as a minimum version when using the `cxi` provider.  
For the most up-to-date information on this issue see
[bug 4679](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4679)

### Known Issues with the ZE Memory Kind

As documented (see `man fi_cxi` on an HPE Cray EX), the libfabric cxi provider
can only support native memory kinds on Intel GPUs when "implicit scaling" is
disabled.  The following two environment variables should be set to accomplish
that:
```
    EnableImplicitScaling=0
    NEOReadDebugKeys=1
```
Otherwise, all calls to `gex_EP_BindSegment()` with ZE device memory segments
fail with a "Function not implemented" error message.

As of the time of writing, testing has shown 2GiB to be the largest device
memory segment size which can reliably be used with the ZE kind and the
ofi-conduit `cxi` provider.  Sizes above 2GiB are unreliable (we see random
communication errors), and 16GiB and above fail in `gex_EP_BindSegment()` with
a "Function not implemented" or "No such device or address" error message.
We encourage reports of success with device segments larger than 2GiB.

See [bug 4679](https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4679) for
more information on the topic of limited device segment sizes.

# Implementation Status Summary

Currently the implementation is sufficient (when hardware, software and
configuration constraints are met) to use the pseudo-code below to perform RMA
operations between combinations of host and GPU memory (subject to any
previously noted temporary prohibitions, such as on loopback transfers).

Please note that all error checking has been elided from this example.  Proper
checking of return codes, if any, is especially important when using this
prototype.  Modifications for `GEX_MK_CLASS_HIP` are straight-forward, requiring
only changes to the initialization of `args`.

```
  // Bootstrap and establish host memory segment for the primordial endpoints
  gex_Client_Init(&myClient, &myEP, &myTM, "MK example", &argc, &argv, 0)
  gex_Segment_Attach(...);

  // Create memory kind object for CUDA device 0
  gex_MK_t dev0Kind;
  gex_MK_Create_args_t args;
  args.gex_flags = 0;
  args.gex_class = GEX_MK_CLASS_CUDA_UVA;
  args.gex_args.gex_class_cuda_uva.gex_CUdevice = 0;
  gex_MK_Create(&dev0_kind, myClient, args, 0);

  // Ask GASNet to allocate a 1GB CUDA UVA segment
  gex_Segment_t dev0Segment;
  gex_Segment_Create(&dev0Segment, myClient, NULL, 1024*1024*1024, dev0Kind, 0);

  // Create an RMA-only endpoint, bind dev0Segment and publish RMA credentials
  gex_EP_t dev0EP;
  gex_EP_Create(&dev0EP, myClient, GEX_EP_CAPABILITY_RMA, 0);
  gex_EP_BindSegment(dev0EP, dev0Segment, 0);
  gex_EP_PublishBoundSegment(myTM, &dev0EP, 1, 0);

  // Note assumptions made in remainder of this example
  assert( gex_EP_QueryIndex(myEP) == 0 );
  assert( gex_EP_QueryIndex(dev0EP) == 1 );

  // Query device addresses needed for RMA
  void *loc_dev0, *rem_dev0;
  loc_dev0 = gex_Segment_QueryAddr(dev0Segment);
  gex_Event_t ev = gex_EP_QueryBoundSegmentNB(gex_TM_Pair(myEP,1), peer_rank, &rem_dev0, NULL, NULL, 0);
  gex_Event_Wait(ev);
  
  // [ do something that places data in GPU memory ]

  // Perform a blocking 4MB GPU-to-GPU Get
  gex_RMA_GetBlocking(gex_TM_Pair(dev0EP,1), loc_dev0, peer_rank, rem_dev0, 4*1024*1024, 0);
```

# Implementation Status by GASNet-EX API

This section describes the known limitations of each of the APIs introduced
recently in order to support memory kinds.  Due to interaction among
APIs, it is impossible to completely avoid forward references.

## `gex_Segment_Attach()`

The `gex_Segment_Attach()` call remains the only supported means by which to
create and bind a segment to the primordial endpoint.  This is true despite the
recent introduction of APIs which, when used in the proper sequence, would
appear to provide a suitable replacement.

While this is technically a limitation of the alternative APIs, it is
documented here for clarity.

## `gex_Segment_Create()`

This API, along with all types and constants required to specify its arguments,
are defined and will link in any conduit.  However, it is useful only when
multi-EP support exists (see `gex_EP_Create()` for the current scope of
multi-EP support).

On those conduits and segment modes listed under "Supported Configurations",
the implementation of this API is believed to be
complete with respect to the MK API.  In particular, it is capable of
creating segments of both client-allocated and GASNet-allocated memory, using
either the defined `kind` value `GEX_MK_HOST` or a kind created using
`gex_MK_Create()` with a class of `GEX_MK_CLASS_CUDA_UVA` or `GEX_MK_CLASS_HIP`.

Notably lacking from both specification and implementation is a means to request
or demand allocation of memory suitable for intra-nbrhd cross-mapping via PSHM.

## `gex_Segment_Destroy()`

This API is fully implemented to the current proposed specification.

## `gex_EP_Create()`

This API, along with all types and constants required to specify its arguments,
are defined and will link in any conduit.  However, it is useful only when
multi-EP support exists, as described in the following paragraphs.

Current builds of GASNet-EX will define a preprocessor macro
`GASNET_MAXEPS` which advertises the optimistic maximum number of endpoints per
process, inclusive of the primordial endpoint created by `gex_Client_Init()`.
Any call to `gex_EP_Create()` which would exceed this limit will fail with
a return of GASNET_ERR_RESOURCE.

Currently, only those conduits and segment modes listed under "Supported
Configurations" have values of `GASNET_MAXEPS` larger than 1 (each currently
33).  Additionally, these conduits support only the `GEX_EP_CAPABILITY_RMA`
capability for non-primordial endpoints.

The `GEX_FLAG_HINT_ACCEL_*` values are currently defined, but ignored.

## `gex_EP_BindSegment()`

This API, along with all types and constants required to specify its arguments,
are defined and will link in any conduit.  However, it is useful only when
multi-EP support exists (see `gex_EP_Create()` for the current scope of
multi-EP support).

## `gex_EP_PublishBoundSegment()`

This API does not appear in the MK API, nor in related documents which
preceded it.  Complete semantics are documented in `docs/GASNet-EX.txt`.

This call is currently necessary as the only means to actively distribute the
RMA credentials required by some conduits (ibv, ucx and ofi among them).  While this task is
performed in `gex_Segment_Attach()` for primordial endpoints, use of this API is
required prior to use of `gex_RMA_*()` APIs using non-primordial endpoints.  It
is hoped that this call can become optional in the future.

This API, along with all types and constants required to specify its arguments,
are defined and will link in any conduit.  However, it is useful only when
multi-EP support exists (see `gex_EP_Create()` for the current scope of
multi-EP support).

## `gex_TM_Pair()`

This API is believed to be fully implemented in all conduits and accepted by
all APIs required to do so by the MK API (notably the `gex_RMA_*()`,
`gex_AM_*()` and `gex_VIS_*()` API families).

Since multi-EP support is currently exclusive to ibv, ucx and ofi conduits (and only in
some segment modes), the use in other conduits is effectively limited to aliasing of the
primordial team.

## `gex_TM_Create()`

At this point in time, there is no support for non-primordial endpoints as
members of teams, and more specifically they are not accepted in the `args`
passed to `gex_TM_Create()`.  This means all RMA communication involving
non-primordial endpoints must currently utilize a `gex_TM_t` returned by
`gex_TM_Pair()`.

## `gex_TM_Dup()`

Not implemented.

## `gex_MK_Create()`

This API is implemented as described in the MK API. This
includes the conditional definition (defined to `1` or undefined) of
`GASNET_HAVE_MK_CLASS_CUDA_UVA` and/or `GASNET_HAVE_MK_CLASS_HIP`, each of which
is defined only when the respective headers and libs were located at configure
time *and* one is using a conduit and segment mode listed under "Supported
Configurations".
Otherwise these feature macros will be undefined.

While these feature macros have only a conditional definition, the
enum values `GEX_MK_CLASS_CUDA_UVA` and `GEX_MK_CLASS_HIP` are both
defined unconditionally in `gasnet_mk.h`.
Any calls to `gex_MK_Create()` specifying a class when *not* supported will
return `GASNET_ERR_BAD_ARG`, as documented in the MK API.

## `gex_MK_Destroy()`

This API is implemented, but is currently only allowed for a `gex_MK_t` not
ever used to create a segment.