File: ofi.rst

OpenFabrics Interfaces (OFI) / Libfabric support
================================================

.. error:: TODO This section needs to be converted from FAQ Q&A style
           to regular documentation style.

What is OFI / Libfabric?
------------------------

"OFI" stands for the `OpenFabrics Interfaces
<https://libfabric.org/>`_, which are implemented in the ``libfabric``
library.  These two terms are typically used interchangeably.

Open MPI supports many different underlying networks via Libfabric,
including (but not limited to):

* AWS EFA
* Cisco usNIC
* Cray uGNI
* Cornelis Networks Omni-Path
* HPE Slingshot 11

In general, the OFI-based components in Open MPI will auto-select
themselves as appropriate at run time.

That said, the additional questions in this FAQ section provide more
information about specific OFI-based network types and their support.
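
To check whether OFI support is present in a particular Open MPI
installation, the built components can be listed with ``ompi_info``;
the ``grep`` filter below is just an illustrative convenience:

.. code-block::

   shell$ ompi_info | grep -i ofi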

/////////////////////////////////////////////////////////////////////////

What are the Libfabric (OFI) components in Open MPI?
----------------------------------------------------

Open MPI has three main components for Libfabric (a.k.a., OFI)
communications:

#. ``ofi`` MTL: Available since Open MPI v1.10, this component is used
   with the ``cm`` PML for two-sided MPI communication (e.g.,
   ``MPI_SEND`` and ``MPI_RECV``).

   The ``ofi`` MTL requires that the Libfabric provider support
   reliable datagrams with ordered tagged messaging (specifically:
   ``FI_EP_RDM`` endpoints, ``FI_TAGGED`` capabilities, and
   ``FI_ORDER_SAS`` ordering).

#. ``ofi`` BTL: Available since Open MPI v4.0.0, this component is
   primarily intended for one-sided MPI communications (e.g.,
   ``MPI_PUT``).  It can also support BTL send/receive operations.
   The ``ofi`` BTL requires that the Libfabric provider support
   reliable datagrams, RMA and atomic operations, and remote atomic
   completion notifications (specifically: ``FI_EP_RDM`` endpoints,
   ``FI_RMA`` and ``FI_ATOMIC`` capabilities, and
   ``FI_DELIVERY_COMPLETE`` op flags).

#. ``usnic`` BTL: This BTL is used exclusively with Cisco usNIC-based
   networks.  It will auto-select itself over the other OFI-based
   components when run with Cisco usNIC-based networks.

See each Libfabric provider's man page (e.g., ``fi_sockets(7)``) to
understand which providers will work with each of the above-listed
Open MPI components.  Some providers must be paired with one of the
Libfabric utility providers; for example, the verbs provider needs to
be paired with the ``ofi_rxm`` utility provider to offer reliable
datagram endpoint support (``verbs;ofi_rxm``).
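
One way to see which Libfabric providers are available on a given
system is the ``fi_info`` utility that ships with Libfabric (assuming
it is installed and in the ``PATH``):

.. code-block::

   # List the provider names and versions known to Libfabric
   shell$ fi_info -l

   # Show the endpoints offered by a specific provider (e.g., verbs)
   shell$ fi_info -p verbs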

Both ``ofi`` components (the MTL and the BTL) have MCA parameters to
specify the Libfabric provider(s) to include in or exclude from the
selection process.  For example:

.. code-block::

   shell$ mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2 mpi_hello
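
Providers can likewise be excluded from consideration via the
corresponding ``*_provider_exclude`` parameters.  For example (the
provider names and application name here are only illustrative):

.. code-block::

   shell$ mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_exclude shm,tcp mpi_hello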

In addition, each component has its own specific parameters; see
``ompi_info --param <framework> <component> --level 9`` for a full
list.  For example:

.. code-block::

   shell$ ompi_info --param mtl ofi --level 9

.. important:: When using the HPE CXI provider and ``mpirun`` as the
          job launcher, it is recommended that the PRTE
          ``ras_base_launch_orted_on_hn`` MCA parameter be set to 1.
          This can be done by adding
          ``--prtemca ras_base_launch_orted_on_hn 1`` to the job
          launch command line.  This ensures that MPI processes
          launched on the first node of an allocation are able to use
          the CXI provider.
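
For example, a launch line that follows this recommendation might look
like the following (the process count and application name are
placeholders):

.. code-block::

   shell$ mpirun --prtemca ras_base_launch_orted_on_hn 1 --mca pml cm --mca mtl ofi -n 4 ./a.out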
          
For more information, refer to the `Libfabric web site
<https://libfabric.org/>`_.

/////////////////////////////////////////////////////////////////////////

Omni-Path: How can the multi-rail settings be adjusted if multiple HFI (Host Fabric Interface) cards are installed on the system?
---------------------------------------------------------------------------------------------------------------------------------

The Multi-Rail feature allows a single process to use multiple HFIs to
transfer a message, improving message bandwidth.  Multi-rail support is
handled by the PSM2 library and is off by default.  The multi-rail
settings can be adjusted using the following environment variables:

* ``PSM2_MULTIRAIL=[0,1,2]``: 0=disabled, 1=enable multi-rail across
  all HFIs in the system, 2=enable multi-rail within a NUMA node.
* ``PSM2_MULTIRAIL_MAP=unit:port,unit:port...``

The variables above may be exported on the ``mpirun`` command line
(with ``-x``) or set in the environment.  For example:

.. code-block::

   shell$ mpirun -mca mtl [psm2|ofi] -x PSM2_MULTIRAIL=1 -n 2 -H host1,host2 ./a.out
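
As an illustrative sketch, the rails can also be mapped explicitly with
``PSM2_MULTIRAIL_MAP``; the ``0:1,1:1`` unit:port pairs below are
placeholders that should be adjusted to the HFI units and ports
actually present in the system:

.. code-block::

   shell$ mpirun -mca mtl ofi -x PSM2_MULTIRAIL=1 -x PSM2_MULTIRAIL_MAP=0:1,1:1 -n 2 -H host1,host2 ./a.out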

.. note:: When using the OFI MTL, please ensure that the PSM2 OFI
          provider is used for communication with OPA devices.

/////////////////////////////////////////////////////////////////////////

Omni-Path: What is Multi-HFI support in PSM2 and how does it differ from multi-rail?
------------------------------------------------------------------------------------

Multi-HFI support refers to distributing the MPI ranks that are local
to a node across the multiple HFIs in a system in order to load-balance
the hardware resources.  It differs from the Multi-Rail feature, which
is intended to allow a single process to use all HFIs in the system.
For an MPI job with multiple processes on a single node, the default
PSM2 behavior depends on the affinity settings of the MPI processes:
the PSM2 library defaults to using the HFI (Host Fabric Interface) that
is in the same NUMA node as the MPI process.

Users can restrict a process to a single HFI using the following
environment variable (see the example below):

* ``HFI_UNIT=N``: valid values of ``N`` are 0, 1, 2, and 3
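
For example, to pin all processes to HFI unit 0 (the host names and
application name are placeholders):

.. code-block::

   shell$ mpirun -mca mtl ofi -x HFI_UNIT=0 -n 2 -H host1,host2 ./a.out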

More details can be found in the PSM2 Programmer's Guide and the
Omni-Path Fabric Performance Tuning Guide.

Please see the `Cornelis Networks Customer Center <https://customercenter.cornelisnetworks.com/>`_
for more details.