1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
|
## EFA Libfabric Provider Documentation
The EFA Libfabric provider supports the Amazon Elastic Fabric Adapter (EFA), an
OS bypass network interface available on Amazon EC2 instances. The EFA device
supports both reliable and unreliable datagram send and receive semantics, the
EFA Libfabric provider adds additional functionality in software such as tag
matching, reordering, and software emulation for features the hardware does not
support natively. EFA provides lower and more consistent latency and higher
throughput compared to TCP transports which provides better application
performance for HPC and Machine Learning applications on Amazon EC2.
Please see the [fi_efa(7) man
page](https://ofiwg.github.io/libfabric/master/man/fi_efa.7.html) for more
information on the features and capabilities of the EFA Libfabric provider.
### Background information
The EFA developer documentation assumes a working knowledge of OS bypass
networking and the Libfabric API. The [OFI Programmer's
Guide](https://github.com/ofiwg/ofi-guide/blob/master/OFIGuide.md) provides
motivation for Libfabric and defines the API and structures used by Libfabric
applications.
For more information on EFA, SRD and the [AWS Nitro
System](https://aws.amazon.com/ec2/nitro/), please refer to these resources:
* [A Cloud-Optimized Transport Protocol for Elastic and Scalable
HPC](https://ieeexplore.ieee.org/document/91673990) whitepaper
* [AWS re:Invent 2019 - Monday Night Live with Peter
DeSantis](https://www.youtube.com/watch?v=GPUWATKe15E&feature=youtu.be&t=228)
keynote
* [HPC Application Scaling with Elastic Fabric Adapter (EFA) and Scalable
Reliable Datagram
(SRD)](https://pages.awscloud.com/HPC-Application-Scaling-with-Elastic-Fabric-Adapter-EFA-and-Scalable-Reliable-Datagram-SRD_2020_0004-CMP_OD.html)
tech talk
### Getting started with EFA Libfabric provider development
You will need an Amazon EC2 instance which has EFA support. The [EFA getting
started guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) in
the AWS documentation is a good reference on which Amazon EC2 instances support
EFA and how to setup an EFA enabled instance.
To get started with EFA Libfabric provider development, you will need to either
install a pre-packaged version or build the:
* [EFA kernel
driver](https://github.com/amzn/amzn-drivers/tree/master/kernel/linux/efa) - The
driver is required to enable the EFA device and is utilized by libibverbs and
Libfabric to setup and teardown device resources such as queue pairs,
completion queues, memory registration, and address handles. Some OS
distributions provide an up-to-date version of the EFA kernel driver such as
Amazon Linux 2 and Ubuntu.
* [rdma-core](https://github.com/linux-rdma/rdma-core) - The EFA Libfabric
provider utilizes the libibverbs library which provides an abstraction layer
for the Linux kernel verbs interface. This avoids tightly coupling the
Libfabric provider to the EFA kernel driver and simplifies the Libfabric
provider. Similar to the driver, there are OS distributions that pre-package
rdma-core. EFA device support was added to rdma-core version 24.0. However,
it's best to use the latest rdma-core release for bugfixes and to support the
latest device features.
* Ensure you have configured your instance to increase the locked memory limits
(unlimited is fine) and set aside [huge
pages](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt) for the best
performance. The EFA Libfabric provider will utilize these for the bounce
buffers used for sends and matching receives.
* `FI_HMEM` and CUDA support for NVIDIA GPUDirect + EFA is available, see the
`--with-cuda` configure flag. GPUDirect will be enabled by default if CUDA is
installed in the default system paths.
|