# Frequently Asked Questions
## General
### Overview
#### What is UCX?
UCX is a framework (a collection of libraries and interfaces) that provides an efficient
and relatively easy way to construct widely used HPC protocols: MPI tag matching,
RMA operations, rendezvous protocols, stream, fragmentation, remote atomic operations, etc.
#### What is UCP, UCT, UCS?
* **UCT** is a transport layer that abstracts the differences across various hardware architectures and provides a low-level API that enables the implementation of communication protocols. The primary goal of the layer is to provide direct and efficient access to hardware network resources with minimal software overhead. For this purpose, UCT relies on low-level drivers such as uGNI, Verbs, shared memory, ROCM, CUDA. In addition, the layer provides constructs for communication context management (thread-based and application level), and allocation and management of device-specific memories including those found in accelerators. In terms of communication APIs, UCT defines interfaces for immediate (short), buffered copy-and-send (bcopy), and zero-copy (zcopy) communication operations. The short operations are optimized for small messages that can be posted and completed in place. The bcopy operations are optimized for medium size messages that are typically sent through a so-called bouncing-buffer. Finally, the zcopy operations expose zero-copy memory-to-memory communication semantics.
* **UCP** implements higher-level protocols that are typically used by message passing (MPI) and PGAS programming models by using lower-level capabilities exposed through the UCT layer.
UCP is responsible for the following functionality: initialization of the library, selection of transports for communication, message fragmentation, and multi-rail communication. Currently, the API has the following classes of interfaces: Initialization, Remote Memory Access (RMA) communication, Atomic Memory Operations (AMO), Active Message, Tag-Matching, and Collectives.
* **UCS** is a service layer that provides the necessary functionality for implementing portable and efficient utilities.
#### How can I contribute?
1. Fork
2. Fix bug or implement a new feature
3. Open Pull Request
#### How do I get in touch with UCX developers?
Please join our mailing list: https://elist.ornl.gov/mailman/listinfo/ucx-group or
submit issues on GitHub: https://github.com/openucx/ucx/issues
<br/>
### UCX mission
#### What are the key features of UCX?
* **Open source framework supported by vendors**
The UCX framework is maintained and supported by hardware vendors, in addition to the open source community. Every pull request is tested on multiple hardware platforms supported by the vendor community.
* **Performance, performance, performance!**
The framework architecture, data structures, and components are designed to provide optimized access to the network hardware.
* **High-level API for a broad range of HPC programming models.**
UCX provides a high-level and performance-portable network API. The API targets a variety of programming models ranging from high-performance MPI implementations to Apache Spark. The UCP API abstracts differences and fills in the gaps across interconnects implemented in the UCT layer. As a result, implementations of programming models and libraries (MPI, OpenSHMEM, Apache Spark, RAPIDS, etc.) are simplified while providing efficient support for multiple interconnects (uGNI, Verbs, TCP, shared memory, ROCM, CUDA, etc.).
* **Support for interaction between multiple transports (or providers) to deliver messages.**
For example, UCX has the logic (in UCP) to make GPUDirect, InfiniBand, and shared memory work together efficiently to deliver the data where it is needed, without the user having to deal with this.
* **Cross-transport multi-rail capabilities.** The UCX protocol layer can utilize multiple transports,
even on different types of hardware, to deliver messages faster, without the need for
any special tuning.
* **Utilizing hardware offloads for optimized performance**, such as RDMA, hardware tag matching,
hardware atomic operations, etc.
#### What protocols are supported by UCX?
UCP implements RMA put/get, send/receive with tag matching, active messages, and atomic operations. In the near future, we plan to add support for commonly used collective operations.
#### Is UCX replacement for GASNET?
No. GASNET exposes a high-level API for PGAS programming that provides symmetric memory management capabilities and built-in runtime environments. These capabilities are out of the scope of the UCX project.
Instead, GASNET can leverage the UCX framework for a fast and efficient implementation on the network technologies supported by UCX.
#### What is the relation between UCX and network drivers?
The UCX framework does not provide drivers; instead, it relies on the drivers provided by vendors. Currently we use: OFA Verbs, Cray's uGNI, NVIDIA CUDA.
#### What is the relation between UCX and OFA Verbs or Libfabrics?
UCX is a middleware communication framework that relies on device drivers, e.g. RDMA, CUDA, ROCM. RDMA and OS-bypass network devices typically implement device drivers using the RDMA-core Linux subsystem that is supported by UCX. Support for other network abstractions can be added based on requests and contributions from the community.
#### Is UCX a user-level driver?
UCX is not a user-level driver. Typically, drivers aim to expose fine-grained access to network architecture-specific features.
UCX abstracts the differences across various drivers and fills in the gaps using software protocols for architectures that don't provide hardware-level support for all operations.
<br/>
### Dependencies
#### What stuff should I have on my machine to use UCX?
UCX detects the existing libraries on the build machine and enables/disables support
for various features accordingly.
If some of the modules UCX was built with are not found during runtime, they will
be silently disabled.
* **Basic shared memory and TCP support** - always enabled.
* **Optimized shared memory** - requires knem or xpmem drivers. On modern kernels, CMA (cross-memory-attach) will also be used if available.
* **RDMA support** - requires rdma-core or libibverbs library. UCX >= 1.12.0 requires rdma-core >= 28.0 or MLNX_OFED >= 5.0.
* **NVIDIA GPU support** - requires CUDA >= 6.0. UCX >= 1.8 requires CUDA with nv_peer_mem support.
* **AMD GPU support** - requires ROCm version >= 4.0.
#### Does UCX depend on an external runtime environment?
UCX does not depend on an external runtime environment.
`ucx_perftest` (a UCX-based application/benchmark) can be linked with an external runtime environment that can be used for remote `ucx_perftest` launch, but this is an optional configuration, which is only used for environments that do not provide direct access to compute nodes. By default this option is disabled.
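For example, `ucx_perftest` can be launched directly on two nodes without any runtime environment (the hostname and the `-c` CPU-affinity values below are illustrative):

```shell
# On the server node: start ucx_perftest in server mode, pinned to core 0
ucx_perftest -c 0

# On the client node: connect to the server and run a tag-matching latency test
ucx_perftest server-hostname -t tag_lat -c 1
```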
#### I get an error "cannot find package 'github.com/openucx/ucx/bindings/go/src/ucx'" when building Go bindings. How do I fix this?
This error occurs because Go modules are disabled in your local Go environment. To resolve it, set the GO111MODULE environment variable to auto by running: `go env -w GO111MODULE=auto`. This will permanently enable module-aware mode in the local Go environment, allowing Go to locate the necessary packages.
<br/>
### Configuration and tuning
#### How can I specify special configuration and tunings for UCX?
UCX takes parameters from specific **environment variables**, which start with the
prefix `UCX_`.
> **IMPORTANT NOTE:** Setting UCX environment variables to non-default values
may lead to undefined behavior. The environment variables are mostly intended for
advanced users, or for specific tunings or workarounds recommended by the UCX community.
#### Where can I see all UCX environment variables?
* Running `ucx_info -c` prints all environment variables and their default values.
* Running `ucx_info -cf` prints the documentation for all environment variables.
#### UCX configuration file
UCX looks for a configuration file in `{prefix}/etc/ucx/ucx.conf`, where `{prefix}` is the installation prefix configured during compilation.
It allows customization of the various parameters. An environment variable
has precedence over the value defined in `ucx.conf`.
The file can be created using `ucx_info -Cf`.
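For example, a configuration file pre-populated with the current defaults can be generated and then edited (the install prefix below is an assumption; use the prefix configured at build time):

```shell
# Dump the full configuration in ucx.conf format and install it system-wide
ucx_info -Cf > /usr/local/etc/ucx/ucx.conf
```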
#### Build user application with UCX
In order to build the application with UCX development libraries, UCX supports a
metainformation subsystem based on the pkg-config tool. For example, this is how
pkg-config can be incorporated in a Makefile-based build:
```
program: program.c
$(CC) program.c $(shell pkg-config --cflags --libs ucx)
```
When linking with static UCX libraries, the user must list all required
transport modules explicitly. For example, in order to support only the cma and
knem transports, the user has to use:
```
program: program.c
$(CC) -static program.c $(shell pkg-config --cflags --libs --static ucx-cma ucx-knem ucx)
```
Currently, the following transport modules can be used with pkg-config:
<table>
<tr><th>Package name</th><th>Provided transport service</th></tr>
<tr><td>ucx-cma</td><td>Shared memory using <a href="https://lwn.net/Articles/405284">Linux Cross-Memory Attach</a></td></tr>
<tr><td>ucx-knem</td><td>Shared memory using <a href="https://knem.gitlabpages.inria.fr">High-Performance Intra-Node MPI Communication</a></td></tr>
<tr><td>ucx-xpmem</td><td>Shared memory using <a href="https://github.com/hjelmn/xpmem">XPMEM</a></td></tr>
<tr><td>ucx-ib</td><td><a href="https://developer.nvidia.com/networking">Infiniband</a> based network transport</td></tr>
<tr><td>ucx-rdmacm</td><td>Connection manager based on <a href="https://github.com/ofiwg/librdmacm">RDMACM</a></td></tr>
</table>
<br/>
TCP, basic shared memory, and self transports are built into UCT and don't
need additional compilation actions.
##### IMPORTANT NOTE:
The package ucx-ib requires static libraries for `libnl` and `numactl`,
as a dependency of `rdma-core`. Most Linux distributions do not provide these
static libraries by default, so they need to be built and installed manually.
They can be downloaded from the following locations:
<table>
<tr><td>libnl</td><td>https://www.infradead.org/~tgr/libnl</td><td>(tested on version 3.2.25)</td></tr>
<tr><td>numactl</td><td>https://github.com/numactl/numactl</td><td>(tested on version 2.0.14)</td></tr>
</table>
<br/>
---
<br/>
## Network capabilities
### Selecting networks and transports
#### Which network devices does UCX use?
By default, UCX tries to use all available devices on the machine, and selects the
best ones based on performance characteristics (bandwidth, latency, NUMA locality, etc.).
Setting `UCX_NET_DEVICES=<dev1>,<dev2>,...` would restrict UCX to using **only**
the specified devices.
For example:
* `UCX_NET_DEVICES=eth2` - Use the Ethernet device eth2 for TCP sockets transport.
* `UCX_NET_DEVICES=mlx5_2:1` - Use the RDMA device mlx5_2, port 1.
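For example, with an MPI launcher the device restriction can be passed through the environment (the `-x` flag is Open MPI syntax; the device name and application are illustrative):

```shell
# Restrict UCX to a single RDMA device/port for all ranks
mpirun -np 2 -x UCX_NET_DEVICES=mlx5_0:1 ./app
```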
Running `ucx_info -d` would show all available devices on the system that UCX can utilize.
#### Which transports does UCX use?
By default, UCX tries to use all available transports, and selects the best ones
according to their performance capabilities and scale (passed as the estimated number
of endpoints to the *ucp_init()* API).
For example:
* On machines with Ethernet devices only, shared memory will be used for intra-node
communication and TCP sockets for inter-node communication.
* On machines with RDMA devices, RC transport will be used for small scale, and
DC transport (available with Connect-IB devices and above) will be used for large
scale. If DC is not available, UD will be used for large scale.
* If GPUs are present on the machine, GPU transports will be enabled for detecting
memory pointer type and copying to/from GPU memory.
It's possible to restrict the transports in use by setting `UCX_TLS=<tl1>,<tl2>,...`.
`^` at the beginning turns the list into a deny list.
The list of all transports supported by UCX on the current machine can be generated
by `ucx_info -d` command.
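For example, the transport and device names accepted by `UCX_TLS` and `UCX_NET_DEVICES` can be listed by filtering the `ucx_info -d` output (the exact output layout may vary between UCX versions):

```shell
# List available transports and the devices each one can use
ucx_info -d | grep -e Transport -e Device
```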
> **IMPORTANT NOTE**
> In some cases restricting the transports can lead to unexpected and undefined behavior:
> * Using *rc_verbs* or *rc_mlx5* also requires *ud_verbs* or *ud_mlx5* transport for bootstrap.
> * Applications using GPU memory must also specify GPU transports for detecting and
> handling non-host memory.
In addition to the built-in transports it's possible to use aliases which specify multiple transports.
##### List of main transports and aliases
<table>
<tr><td>all</td><td>use all the available transports.</td></tr>
<tr><td>sm or shm</td><td>all shared memory transports.</td></tr>
<tr><td>ugni</td><td>ugni_rdma and ugni_udt.</td></tr>
<tr><td>rc</td><td>RC (=reliable connection), "accelerated" transports are used if possible.</td></tr>
<tr><td>ud</td><td>UD (=unreliable datagram), "accelerated" is used if possible.</td></tr>
<tr><td>dc</td><td>DC - Mellanox scalable offloaded dynamic connection transport</td></tr>
<tr><td>rc_x</td><td>Same as "rc", but using accelerated transports only</td></tr>
<tr><td>rc_v</td><td>Same as "rc", but using Verbs-based transports only</td></tr>
<tr><td>ud_x</td><td>Same as "ud", but using accelerated transports only</td></tr>
<tr><td>ud_v</td><td>Same as "ud", but using Verbs-based transports only</td></tr>
<tr><td>cuda</td><td>CUDA (NVIDIA GPU) memory support: cuda_copy, cuda_ipc, gdr_copy</td></tr>
<tr><td>rocm</td><td>ROCm (AMD GPU) memory support: rocm_copy, rocm_ipc, rocm_gdr</td></tr>
<tr><td>tcp</td><td>TCP over SOCK_STREAM sockets</td></tr>
<tr><td>self</td><td>Loopback transport to communicate within the same process</td></tr>
</table>
<br/>
For example:
- `UCX_TLS=rc` will select RC, UD for bootstrap, and prefer accelerated transports
- `UCX_TLS=rc,cuda` will select RC along with Cuda memory transports
- `UCX_TLS=^rc` will select all available transports, except RC
> **IMPORTANT NOTE**
> `UCX_TLS=^ud` will select all available transports, except UD. However, UD
will still be available for bootstrap. Only `UCX_TLS=^ud,ud:aux` will disable UD
completely.
<br/>
### Multi-rail
#### Does UCX support multi-rail?
Yes.
#### What is the default behavior in a multi-rail environment?
By default, UCX would pick the 2 best network devices, and split large
messages between the rails. For example, for a 100MB message, the first 50MB
would be sent on the first device, and the second 50MB on the second device.
If the network speeds of the devices are not the same, the split will be proportional to
their speed ratio.
The devices to use are selected according to best network speed, PCI bandwidth,
and NUMA locality.
#### Is it possible to use more than 2 rails?
Yes, by setting `UCX_MAX_RNDV_RAILS=<num-rails>`. Currently up to 4 are supported.
#### Is it possible that each process would just use the closest device?
Yes. By setting `UCX_MAX_RNDV_RAILS=1`, each process would use a single network device
according to NUMA locality.
#### Can I disable multi-rail?
Yes, by setting `UCX_NET_DEVICES=<dev>` to the single device that should be used.
<br/>
### Adaptive routing
#### Does UCX support adaptive routing fabrics?
Yes.
#### What do I need to do to run UCX with adaptive routing?
Setting `UCX_IB_AR_ENABLE` activates adaptive routing for both InfiniBand and
RoCE clusters. For InfiniBand, it attempts to select the first Service Level
(SL) with adaptive routing enabled. In the case of RoCE, adaptive routing is
enabled if the hardware configuration supports it. If set to `yes` and the
network does not support it, an error will occur. Conversely, if set to `try`,
any lack of support will be silently ignored.
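For example (whether adaptive routing is actually used depends on the fabric configuration; the application name is illustrative):

```shell
# Require adaptive routing; fail if the network does not support it
UCX_IB_AR_ENABLE=yes ./app

# Opportunistically use adaptive routing; silently fall back otherwise
UCX_IB_AR_ENABLE=try ./app
```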
<br/>
### RoCE
#### How to specify service level with UCX?
Setting `UCX_IB_SL=<sl-num>` will make UCX run on the given service level.
#### How to specify DSCP priority?
Setting `UCX_IB_TRAFFIC_CLASS=<num>` will set the DSCP priority.
#### How to specify which address to use?
Setting `UCX_IB_GID_INDEX=<num>` would make UCX use the specified GID index on
the RoCE port. The system command `show_gids` would print all available addresses
and their indexes.
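For example, the RoCE-related settings can be combined on one command line (the values are illustrative and depend on the fabric configuration):

```shell
# Use the address at GID index 3 with DSCP-based traffic class 106
UCX_IB_GID_INDEX=3 UCX_IB_TRAFFIC_CLASS=106 ./app
```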
---
<br/>
## Working with GPU
### GPU support
#### How does UCX support GPUs?
UCX protocol operations can work with GPU memory pointers the same way as with Host
memory pointers. For example, the 'buffer' argument passed to `ucp_tag_send_nb()` can
be either host or GPU memory.
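For example, sending GPU memory looks the same as sending host memory (a minimal sketch, assuming an already-established `ucp_ep_h` endpoint, UCX built with CUDA support, and error handling omitted; `send_gpu_buffer` is a hypothetical helper name):

```c
#include <cuda_runtime.h>
#include <ucp/api/ucp.h>

/* Send 'count' bytes from GPU memory with the same call used for host memory.
 * 'ep' is an established UCP endpoint; UCX detects the memory type internally. */
static ucs_status_ptr_t send_gpu_buffer(ucp_ep_h ep, size_t count, ucp_tag_t tag)
{
    void *gpu_buf;
    cudaMalloc(&gpu_buf, count);     /* device allocation, not host memory */

    ucp_request_param_t param = {0}; /* default parameters: contiguous bytes */
    return ucp_tag_send_nbx(ep, gpu_buf, count, tag, &param);
}
```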
#### Which GPUs are supported?
Currently, UCX supports NVIDIA GPUs through the CUDA library, and AMD GPUs through the ROCm library.
#### Which UCX APIs support GPU memory?
Currently, only the UCX tagged APIs, stream APIs, and active message APIs fully
support GPU memory. Remote memory access APIs, including atomic operations,
have incomplete support for GPU memory; full support is planned for
future releases.
<table>
<tr><th>API</th><th>GPU memory support level</th></tr>
<tr><td>Tag (ucp_tag_send_XX/ucp_tag_recv_XX)</td><td>Full support</td></tr>
<tr><td>Stream (ucp_stream_send/ucp_stream_recv_XX)</td><td>Full support</td></tr>
<tr><td>Active messages (ucp_am_send_XX/ucp_am_recv_data_XX)</td><td>Full support</td></tr>
<tr><td>Remote memory access (ucp_put_XX/ucp_get_XX)</td><td>Partial support</td></tr>
<tr><td>Atomic operations (ucp_atomic_XX)</td><td>Partial support</td></tr>
</table>
<br/>
#### How to run UCX with GPU support?
In order to run UCX with GPU support, you will need an application which allocates
GPU memory (for example,
[MPI OSU benchmarks with Cuda support](https://mvapich.cse.ohio-state.edu/benchmarks)),
and UCX compiled with GPU support. Then you can run the application as usual (for
example, with MPI), and whenever GPU memory is passed to UCX, it either uses GPUDirect
for zero-copy operations, or copies the data to/from host memory.
> **NOTE:** When specifying `UCX_TLS` explicitly, you must also specify cuda/rocm for GPU memory
> support, otherwise the GPU memory will not be recognized.
> For example: `UCX_TLS=rc,cuda` or `UCX_TLS=dc,rocm`
#### I'm running UCX with GPU memory and getting a segfault, why?
Most likely UCX does not detect that the pointer refers to GPU memory and tries to
access it from the CPU. This can happen if UCX was not compiled with GPU support, or fails
to load the CUDA or ROCm modules due to missing library paths or a version mismatch.
Please run `ucx_info -d | grep cuda` or `ucx_info -d | grep rocm` to check for
UCX GPU support.
In some cases, the internal memory type cache can misdetect GPU memory as host
memory, also leading to invalid memory access. This cache can be disabled by
setting `UCX_MEMTYPE_CACHE=n`.
#### Why am I getting the error "provided PTX was compiled with an unsupported toolchain"?
The application is loading a CUDA binary that was compiled for a newer version of
CUDA than the one installed, and the failure is detected asynchronously by a CUDA API
call from UCX. To fix the issue, install a newer CUDA version or compile the CUDA
binaries with the appropriate `-arch` option to nvcc.
Refer to https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
for the appropriate `-arch` option to pass to nvcc.
<br/>
### Performance considerations
#### Does UCX support zero-copy for GPU memory over RDMA?
Yes. For large messages, UCX can transfer GPU memory using zero-copy RDMA with the
rendezvous protocol. This requires a peer memory driver for the relevant GPU type
to be loaded, or (starting with UCX v1.14.0) dmabuf support on the system.
> **NOTE:** In some cases if the RDMA network device and the GPU are not on
the same NUMA node, such zero-copy transfer is inefficient.
#### What is needed for dmabuf support?
- UCX v1.14.0 or later.
- Linux kernel >= 5.12 (for example, Ubuntu 22.04).
- CUDA 11.7 or later, installed with the `-m=kernel-open` flag.
> **NOTE:** Currently UCX code assumes that dmabuf support is uniform across all
available GPU devices.
---
<br/>
## Introspection
### Protocol selection
#### How can I tell which protocols and transports are being used for communication?
- Set `UCX_LOG_LEVEL=info` to print basic information about transports and devices:
```console
$ mpirun -x UCX_LOG_LEVEL=info -np 2 --map-by node osu_bw D D
[1645203303.393917] [host1:42:0] ucp_context.c:1782 UCX INFO UCP version is 1.13 (release 0)
[1645203303.485011] [host2:43:0] ucp_context.c:1782 UCX INFO UCP version is 1.13 (release 0)
[1645203303.701062] [host1:42:0] parser.c:1918 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1645203303.758427] [host2:43:0] parser.c:1918 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1645203303.759862] [host2:43:0] ucp_worker.c:1877 UCX INFO ep_cfg[2]: tag(self/memory0 knem/memory cuda_copy/cuda rc_mlx5/mlx5_0:1)
[1645203303.760167] [host1:42:0] ucp_worker.c:1877 UCX INFO ep_cfg[2]: tag(self/memory0 knem/memory cuda_copy/cuda rc_mlx5/mlx5_0:1)
# MPI_Init() took 500.788 msec
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
[1645203303.805848] [host2:43:0] ucp_worker.c:1877 UCX INFO ep_cfg[3]: tag(rc_mlx5/mlx5_0:1)
[1645203303.873362] [host1:42:a] ucp_worker.c:1877 UCX INFO ep_cfg[3]: tag(rc_mlx5/mlx5_0:1)
...
```
- When using protocols v2, set `UCX_PROTO_INFO=y` for detailed information:
```console
$ mpirun -x UCX_PROTO_ENABLE=y -x UCX_PROTO_INFO=y -np 2 --map-by node osu_bw D D
[1645027038.617078] [host1:42:0] +---------------+---------------------------------------------------------------------------------------------------+
[1645027038.617101] [host1:42:0] | mpi ep_cfg[2] | tagged message by ucp_tag_send*() from host memory |
[1645027038.617104] [host1:42:0] +---------------+--------------------------------------------------+------------------------------------------------+
[1645027038.617107] [host1:42:0] | 0..8184 | eager short | self/memory0 |
[1645027038.617110] [host1:42:0] | 8185..9806 | eager copy-in copy-out | self/memory0 |
[1645027038.617112] [host1:42:0] | 9807..inf | (?) rendezvous zero-copy flushed write to remote | 55% on knem/memory and 45% on rc_mlx5/mlx5_0:1 |
[1645027038.617115] [host1:42:0] +---------------+--------------------------------------------------+------------------------------------------------+
[1645027038.617307] [host2:43:0] +---------------+---------------------------------------------------------------------------------------------------+
[1645027038.617337] [host2:43:0] | mpi ep_cfg[2] | tagged message by ucp_tag_send*() from host memory |
[1645027038.617341] [host2:43:0] +---------------+--------------------------------------------------+------------------------------------------------+
[1645027038.617344] [host2:43:0] | 0..8184 | eager short | self/memory0 |
[1645027038.617348] [host2:43:0] | 8185..9806 | eager copy-in copy-out | self/memory0 |
[1645027038.617351] [host2:43:0] | 9807..inf | (?) rendezvous zero-copy flushed write to remote | 55% on knem/memory and 45% on rc_mlx5/mlx5_0:1 |
[1645027038.617354] [host2:43:0] +---------------+--------------------------------------------------+------------------------------------------------+
# MPI_Init() took 1479.255 msec
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Size Bandwidth (MB/s)
[1645027038.674035] [host2:43:0] +---------------+--------------------------------------------------------------+
[1645027038.674043] [host2:43:0] | mpi ep_cfg[3] | tagged message by ucp_tag_send*() from host memory |
[1645027038.674047] [host2:43:0] +---------------+-------------------------------------------+------------------+
[1645027038.674049] [host2:43:0] | 0..2007 | eager short | rc_mlx5/mlx5_0:1 |
[1645027038.674052] [host2:43:0] | 2008..8246 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
[1645027038.674055] [host2:43:0] | 8247..17297 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
[1645027038.674058] [host2:43:0] | 17298..inf | (?) rendezvous zero-copy read from remote | rc_mlx5/mlx5_0:1 |
[1645027038.674060] [host2:43:0] +---------------+-------------------------------------------+------------------+
[1645027038.680982] [host2:43:0] +---------------+------------------------------------------------------------------------------------+
[1645027038.680993] [host2:43:0] | mpi ep_cfg[3] | tagged message by ucp_tag_send*() from cuda/GPU0 |
[1645027038.680996] [host2:43:0] +---------------+-----------------------------------------------------------------+------------------+
[1645027038.680999] [host2:43:0] | 0..8246 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
[1645027038.681001] [host2:43:0] | 8247..811555 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
[1645027038.681004] [host2:43:0] | 811556..inf | (?) rendezvous pipeline cuda_copy, fenced write to remote, cuda | rc_mlx5/mlx5_0:1 |
[1645027038.681007] [host2:43:0] +---------------+-----------------------------------------------------------------+------------------+
[1645027038.693843] [host1:42:a] +---------------+--------------------------------------------------------------+
[1645027038.693856] [host1:42:a] | mpi ep_cfg[3] | tagged message by ucp_tag_send*() from host memory |
[1645027038.693858] [host1:42:a] +---------------+-------------------------------------------+------------------+
[1645027038.693861] [host1:42:a] | 0..2007 | eager short | rc_mlx5/mlx5_0:1 |
[1645027038.693863] [host1:42:a] | 2008..8246 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
[1645027038.693865] [host1:42:a] | 8247..17297 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
[1645027038.693867] [host1:42:a] | 17298..inf | (?) rendezvous zero-copy read from remote | rc_mlx5/mlx5_0:1 |
[1645027038.693869] [host1:42:a] +---------------+-------------------------------------------+------------------+
...
```