---
layout: page
title: fi_psm2(7)
tagline: Libfabric Programmer's Manual
---
{% include JB/setup %}
# NAME
fi_psm2 \- The PSM2 Fabric Provider
# OVERVIEW
The *psm2* provider runs over the PSM 2.x interface that is supported
by the Intel Omni-Path Fabric. PSM 2.x has all the PSM 1.x features
plus a set of new functions with enhanced capabilities. Since PSM 1.x
and PSM 2.x are not ABI compatible, the *psm2* provider only works with
PSM 2.x and doesn't support the Intel TrueScale Fabric.
# LIMITATIONS
The *psm2* provider doesn't support all the features defined in the
libfabric API. Here are some of the limitations:
Endpoint types
: Only the connectionless endpoint types *FI_DGRAM* and *FI_RDM* are supported.
Endpoint capabilities
: Endpoints can support any combination of data transfer capabilities
*FI_TAGGED*, *FI_MSG*, *FI_ATOMICS*, and *FI_RMA*. These capabilities
can be further refined by *FI_SEND*, *FI_RECV*, *FI_READ*, *FI_WRITE*,
*FI_REMOTE_READ*, and *FI_REMOTE_WRITE* to limit the direction of
operations.
*FI_MULTI_RECV* is supported for the non-tagged message queue only.
Scalable endpoints are supported if the underlying PSM2 library supports
multiple endpoints. This condition must be satisfied both when the
provider is built and when the provider is used. See the *Scalable
endpoints* section for more information.
Other supported capabilities include *FI_TRIGGER*, *FI_REMOTE_CQ_DATA*,
*FI_RMA_EVENT*, *FI_SOURCE*, and *FI_SOURCE_ERR*. Furthermore,
*FI_NAMED_RX_CTX* is supported when scalable endpoints are enabled.
Modes
: *FI_CONTEXT* is required for the *FI_TAGGED* and *FI_MSG*
capabilities. That is, any request in these two categories that
generates a completion must pass a valid pointer to a
*struct fi_context* as the operation context, and the space
referenced by the pointer must remain untouched until the request
has completed. If neither *FI_TAGGED* nor *FI_MSG* is requested,
the *FI_CONTEXT* mode is not required.
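
One common way to satisfy this requirement is to embed the context in a heap-allocated, per-request structure and recover the request from the operation context when the completion arrives. The following is a minimal sketch of that pattern, not code from the provider. To keep it compilable without libfabric headers it uses a stand-in type, `struct ctx_standin`, where real code would use *struct fi_context*; `struct request`, `request_new`, and `request_from_context` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct fi_context from <rdma/fabric.h>, used only so this
 * sketch compiles without libfabric. Real code would use struct fi_context. */
struct ctx_standin {
    void *internal[4];
};

/* Hypothetical per-request bookkeeping owned by the application. */
struct request {
    struct ctx_standin ctx; /* must stay untouched until the request completes */
    int id;                 /* application data associated with the request */
};

/* Allocate the request on the heap so the context outlives the posting call. */
static struct request *request_new(int id)
{
    struct request *req = calloc(1, sizeof(*req));
    if (req != NULL)
        req->id = id;
    return req;
}

/* Recover the enclosing request from the operation context that a
 * completion entry reports back to the application. */
static struct request *request_from_context(void *op_context)
{
    assert(op_context != NULL);
    return (struct request *)((char *)op_context -
                              offsetof(struct request, ctx));
}
```

In real code the address of `req->ctx` would be passed as the context argument of the send or receive call, and the request would be freed only after its completion has been read.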
Progress
: The *psm2* provider requires manual progress. The application is
expected to call the *fi_cq_read* or *fi_cntr_read* function from time
to time when no other libfabric function is called, to ensure that
progress is made in a timely manner. The provider also supports
auto progress mode, but performance can be significantly
impacted if the application depends purely on the provider to
make auto progress.
Scalable endpoints
: Scalable endpoint support depends on the multi-EP feature of the *PSM2*
library. If the *PSM2* library supports this feature, its availability is
further controlled by the environment variable *PSM2_MULTI_EP*. The *psm2*
provider automatically sets this variable to 1 if it is not set. The
feature can be disabled explicitly by setting *PSM2_MULTI_EP* to 0.
When creating a scalable endpoint, the exact number of contexts requested
should be set in the "fi_info" structure passed to the *fi_scalable_ep*
function. This number should be set in "fi_info->ep_attr->tx_ctx_cnt",
"fi_info->ep_attr->rx_ctx_cnt", or both; if the two differ, the greater
value is used. The *psm2* provider allocates all requested contexts upfront
when the scalable endpoint is created. The same context is used for both
Tx and Rx.
For optimal performance, avoid having multiple threads access the same
context, either directly by posting send/recv/read/write requests or
indirectly by polling the associated completion queues or counters.
Using the scalable endpoint as a whole in communication functions is not
supported. Instead, the individual tx or rx contexts of the scalable
endpoint should be used. Similarly, using the address of the scalable
endpoint as the source or destination address doesn't collectively
address all the tx/rx contexts; it addresses only the first tx/rx
context.
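
The two rules above (the default for *PSM2_MULTI_EP* and the number of contexts allocated) can be illustrated with a short sketch. This is not the provider's code. The helper names `multi_ep_requested` and `sep_context_count` are hypothetical, and treating any value other than "0" as enabled is an assumption.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Mimic the documented default: set PSM2_MULTI_EP=1 only if it is unset.
 * Returns 1 if the feature is requested, 0 if it was explicitly disabled.
 * Assumption: any value other than "0" counts as enabled. */
static int multi_ep_requested(void)
{
    const char *val;

    /* Third argument 0: do not overwrite a value the user already set. */
    if (setenv("PSM2_MULTI_EP", "1", 0) != 0)
        return 0;
    val = getenv("PSM2_MULTI_EP");
    return val != NULL && strcmp(val, "0") != 0;
}

/* Number of contexts allocated for a scalable endpoint: the greater of the
 * requested Tx and Rx counts, because one context serves both directions. */
static size_t sep_context_count(size_t tx_ctx_cnt, size_t rx_ctx_cnt)
{
    /* At least one context must be requested (an assumption of this sketch). */
    assert(tx_ctx_cnt > 0 || rx_ctx_cnt > 0);
    return tx_ctx_cnt > rx_ctx_cnt ? tx_ctx_cnt : rx_ctx_cnt;
}
```

For example, requesting 4 Tx contexts and 2 Rx contexts yields 4 contexts in this model, each usable for both Tx and Rx.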
Shared Tx contexts
: A shared Tx context saves PSM contexts only if the endpoints bound to it
are Tx only. The reason is that Rx capability always requires a PSM
context, which can also be used automatically for Tx. As a result,
allocating a shared Tx context for Rx-capable endpoints actually consumes
one extra context instead of saving any.
Unsupported features
: These features are unsupported: connection management, passive endpoint,
and shared receive context.
# RUNTIME PARAMETERS
The *psm2* provider checks for the following environment variables:
*FI_PSM2_UUID*
: PSM requires that each job has a unique ID (UUID). All the processes
in the same job need to use the same UUID in order to be able to
talk to each other. The PSM reference manual advises keeping the UUID
unique to each job. In practice, it generally works fine to reuse a
UUID as long as (1) no two jobs with the same UUID are running at
the same time; and (2) previous jobs with the same UUID have exited
normally. If "resource busy" or "connection failure" issues occur
for no apparent reason, it is advisable to manually set the UUID
to a value different from the default.
The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.
It is possible to create endpoints with a UUID different from the one
set here. To do so, set 'info->ep_attr->auth_key' to the UUID
value and 'info->ep_attr->auth_key_size' to its size (16 bytes) when
calling fi_endpoint() or fi_scalable_ep(). An endpoint can still
communicate only with endpoints that have the same UUID.
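
As an illustration, the sketch below formats 16 raw bytes in the 8-4-4-4-12 hexadecimal form of the default value and exports the result as *FI_PSM2_UUID* for the current process. The helper names `uuid_to_string` and `set_job_uuid` are hypothetical. The sketch assumes that the provider reads the variable during initialization and that the textual form maps to the bytes in order, so the variable should be set before the first libfabric call.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Format 16 raw bytes as 8-4-4-4-12 hexadecimal text.
 * out must have room for 37 bytes (36 characters plus the terminator). */
static void uuid_to_string(const unsigned char uuid[16], char out[37])
{
    assert(uuid != NULL && out != NULL);
    snprintf(out, 37,
             "%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-"
             "%02X%02X%02X%02X%02X%02X",
             uuid[0], uuid[1], uuid[2], uuid[3],
             uuid[4], uuid[5], uuid[6], uuid[7],
             uuid[8], uuid[9], uuid[10], uuid[11],
             uuid[12], uuid[13], uuid[14], uuid[15]);
}

/* Export the UUID as FI_PSM2_UUID for this process.
 * Returns 0 on success, -1 on failure (as setenv does). */
static int set_job_uuid(const unsigned char uuid[16])
{
    char text[37];

    uuid_to_string(uuid, text);
    return setenv("FI_PSM2_UUID", text, 1);
}
```

The same 16 bytes could also be supplied per endpoint through 'info->ep_attr->auth_key', as described above.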
*FI_PSM2_NAME_SERVER*
: The *psm2* provider has a simple built-in name server that can be used
to resolve an IP address or host name into a transport address needed
by the *fi_av_insert* call. The main purpose of this name server is to
allow simple client-server type applications (such as those in *fabtests*)
to be written purely with libfabric, without using any out-of-band
communication mechanism. For such applications, the server would run first
to allow endpoints to be created and registered with the name server, and
then the client would call *fi_getinfo* with the *node* parameter set to
the IP address or host name of the server. The resulting *fi_info*
structure would have the transport address of the endpoint created by the
server in the *dest_addr* field. Optionally, the *service* parameter can
be used in addition to *node*. Note that the *service* number is
interpreted by the provider and is not a TCP/IP port number.
The name server is on by default. It can be turned off by setting the
variable to 0. This may save a small amount of resources, since a separate
thread is created when the name server is on.
The provider detects OpenMPI and MPICH runs and changes the default setting
to off.
*FI_PSM2_TAGGED_RMA*
: The RMA functions are implemented on top of the PSM Active Message functions.
The Active Message functions limit the amount of data that can be transferred
in a single message. Large transfers can be divided into small chunks and
pipelined, but the resulting bandwidth is suboptimal.
When this option is on, the *psm2* provider uses the PSM tag-matching message
queue functions to achieve higher bandwidth for large RMA transfers. It takes
advantage of the extra tag bits available in PSM2 to separate the RMA traffic
from the regular tagged message queue.
The option is on by default. To turn it off, set the variable to 0.
*FI_PSM2_DELAY*
: Time (seconds) to sleep before closing PSM endpoints. This is a workaround
for a bug in some versions of the PSM library.
The default setting is 0.
*FI_PSM2_TIMEOUT*
: Timeout (seconds) for gracefully closing PSM endpoints. A forced close
is issued if the timeout expires.
The default setting is 5.
*FI_PSM2_CONN_TIMEOUT*
: Timeout (seconds) for establishing a connection between two PSM endpoints.
The default setting is 5.
*FI_PSM2_PROG_INTERVAL*
: When auto progress is enabled (requested via the hints to *fi_getinfo*),
a progress thread is created to make progress calls from time to time.
This option sets the interval (microseconds) between progress calls.
The default setting is 1 if affinity is set, or 1000 if not. See
*FI_PSM2_PROG_AFFINITY*.
*FI_PSM2_PROG_AFFINITY*
: When set, specifies the set of CPU cores to which the progress thread's
affinity is set. The format is
`<start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*`,
where each triplet `<start>:<end>:<stride>` defines a block of
core_ids. Both `<start>` and `<end>` can be either the `core_id`
(when >=0) or `core_id - num_cores` (when <0).
By default affinity is not set.
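
To make the format concrete, the sketch below expands an affinity string into a list of core ids. It is an illustration, not the provider's parser. The function name `expand_affinity` is hypothetical, and the handling of omitted fields is an assumption: a missing `<end>` is taken to equal `<start>`, a missing `<stride>` is taken to be 1, and `<end>` is treated as inclusive.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expand "<start>[:<end>[:<stride>]][,...]" into a list of core ids.
 * Assumptions: a missing <end> equals <start>, a missing <stride> is 1,
 * and <end> is inclusive. Negative values count back from num_cores.
 * Returns the number of ids written to cores (at most max_cores). */
static int expand_affinity(const char *spec, int num_cores,
                           int *cores, int max_cores)
{
    const char *p = spec;
    int count = 0;

    assert(spec != NULL && cores != NULL && num_cores > 0);

    while (*p != '\0') {
        int start = 0, end = 0, stride = 1;
        int n = sscanf(p, "%d:%d:%d", &start, &end, &stride);

        if (n >= 1) {
            if (n < 2)
                end = start;
            if (n < 3 || stride < 1)
                stride = 1;
            if (start < 0)
                start += num_cores;
            if (end < 0)
                end += num_cores;
            for (int c = start; c <= end && count < max_cores; c += stride) {
                if (c >= 0 && c < num_cores)
                    cores[count++] = c;
            }
        }

        /* Move to the block after the next comma, if there is one. */
        p = strchr(p, ',');
        if (p == NULL)
            break;
        p++;
    }
    return count;
}
```

Under these assumptions, on a machine with 8 cores the string `0:6:2,-1` expands to cores 0, 2, 4, 6, and 7.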
*FI_PSM2_INJECT_SIZE*
: Maximum message size allowed for fi_inject and fi_tinject calls. This is
an experimental feature that allows some applications to override the
default inject size limit. When the inject size is larger than the default
value, some inject calls might block.
The default setting is 64.
*FI_PSM2_LOCK_LEVEL*
: When set, dictates the level of locking used by the provider. Level
2 means all locks are enabled. Level 1 disables some locks and is suitable
for runs that limit the access to each PSM2 context to a single thread.
Level 0 disables all locks and thus is only suitable for single-threaded
runs.
To use level 0 or level 1, wait objects and auto progress mode cannot be
used because they introduce internal threads that may break the conditions
needed for these levels.
The default setting is 2.
*FI_PSM2_LAZY_CONN*
: There are two strategies for when to establish connections between the PSM2
endpoints that OFI endpoints are built on top of. In eager connection mode,
connections are established when addresses are inserted into the address
vector. In lazy connection mode, connections are established when addresses
are used the first time in communication. Eager connection mode has slightly
lower critical path overhead but lazy connection mode scales better.
This option controls how the two connection modes are used. When set to 1,
lazy connection mode is always used. When set to 0, eager connection mode
is used when required conditions are all met and lazy connection mode is
used otherwise. The conditions for eager connection mode are: (1) multiple
endpoint (and scalable endpoint) support is disabled by explicitly setting
PSM2_MULTI_EP=0; and (2) the address vector type is FI_AV_MAP.
The default setting is 0.
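
The selection rule described above can be written as a small decision function. This is only a restatement of the text in code form, not the provider's implementation. The enum and function names are hypothetical, with `AV_MAP` and `AV_TABLE` standing in for the libfabric constants *FI_AV_MAP* and *FI_AV_TABLE*.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for FI_AV_MAP and FI_AV_TABLE so the sketch needs no headers. */
enum av_type { AV_MAP, AV_TABLE };

enum conn_mode { CONN_EAGER, CONN_LAZY };

/* lazy_conn:        value of FI_PSM2_LAZY_CONN (0 or 1)
 * multi_ep_enabled: false only when PSM2_MULTI_EP=0 was set explicitly
 * av:               type of the address vector in use */
static enum conn_mode choose_conn_mode(int lazy_conn, bool multi_ep_enabled,
                                       enum av_type av)
{
    assert(lazy_conn == 0 || lazy_conn == 1);

    if (lazy_conn == 1)
        return CONN_LAZY;   /* lazy mode is always used when set to 1 */
    if (!multi_ep_enabled && av == AV_MAP)
        return CONN_EAGER;  /* both conditions for eager mode are met */
    return CONN_LAZY;       /* otherwise fall back to lazy mode */
}
```

Because the provider sets *PSM2_MULTI_EP* to 1 when it is unset, this rule implies that eager mode is only reached when the user both disables multi-EP support and uses an *FI_AV_MAP* address vector.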
*FI_PSM2_DISCONNECT*
: The provider has a mechanism to automatically send disconnection notifications
to all connected peers before the local endpoint is closed. In response,
the peers call *psm2_ep_disconnect* to clean up the connection state on their
side. This allows the same PSM2 epid to be used by different dynamically started
processes (clients) to communicate with the same peer (server). This mechanism,
however, introduces extra overhead in the finalization phase. For applications
that never reuse epids within the same session, such overhead is unnecessary.
This option controls whether the automatic disconnection notification mechanism
should be enabled. For the client-server applications mentioned above, the client
side should set this option to 1, but the server should set it to 0.
The default setting is 0.
*FI_PSM2_TAG_LAYOUT*
: Selects how the 96 bits of the PSM2 tag are organized. Currently three choices are
available: *tag60* means 32-4-60 partitioning for CQ data, internal protocol
flags, and application tag. *tag64* means 4-28-64 partitioning for internal
protocol flags, CQ data, and application tag. *auto* means to choose either
*tag60* or *tag64* based on the hints passed to fi_getinfo: *tag60* is used
if remote CQ data support is requested explicitly, either by passing a non-zero value
via *hints->domain_attr->cq_data_size* or by including *FI_REMOTE_CQ_DATA* in
*hints->caps*; otherwise *tag64* is used. If *tag64* is the result of automatic
selection, *fi_getinfo* also returns a second instance of the provider with
the *tag60* layout.
The default setting is *auto*.
Note that if the provider is compiled with the macro *PSMX2_TAG_LAYOUT* defined
as 1 (meaning *tag60*) or 2 (meaning *tag64*), the choice is fixed at compile time
and this runtime option is disabled.
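
The two partitionings can be visualized by packing the fields into a 96-bit value. The sketch below is illustrative only. The type `struct tag96` and the packing functions are hypothetical, and it assumes that the fields are listed from the most significant bits to the least significant bits. The provider's actual in-memory representation may differ.

```c
#include <assert.h>
#include <stdint.h>

/* A 96-bit tag modeled as an upper 32-bit word and a lower 64-bit word.
 * Assumption: fields are laid out from most to least significant bit in
 * the order they are listed in the text. */
struct tag96 {
    uint32_t hi;
    uint64_t lo;
};

#define TAG60_MASK ((UINT64_C(1) << 60) - 1)
#define CQ28_MASK  ((UINT32_C(1) << 28) - 1)

/* tag60: 32-bit CQ data | 4-bit protocol flags | 60-bit application tag */
static struct tag96 pack_tag60(uint32_t cq_data, uint8_t flags,
                               uint64_t app_tag)
{
    struct tag96 t;

    assert(flags < 16 && app_tag <= TAG60_MASK);
    t.hi = cq_data;
    t.lo = ((uint64_t)flags << 60) | app_tag;
    return t;
}

/* tag64: 4-bit protocol flags | 28-bit CQ data | 64-bit application tag */
static struct tag96 pack_tag64(uint8_t flags, uint32_t cq_data,
                               uint64_t app_tag)
{
    struct tag96 t;

    assert(flags < 16 && cq_data <= CQ28_MASK);
    t.hi = ((uint32_t)flags << 28) | cq_data;
    t.lo = app_tag;
    return t;
}
```

The trade-off is visible in the asserts: *tag60* keeps the full 32 bits of CQ data but limits the application tag to 60 bits, while *tag64* gives the application all 64 tag bits and limits CQ data to 28 bits.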
# PSM2 EXTENSIONS
The *psm2* provider supports limited setting of low-level parameters through the
fi_set_val() and fi_get_val() functions. Currently the following parameters
can be set via the domain fid:
*FI_PSM2_DISCONNECT*
: Overrides the global runtime parameter *FI_PSM2_DISCONNECT* for this domain.
See the *RUNTIME PARAMETERS* section for details.
Valid parameter names are defined in the header file *rdma/fi_ext_psm2.h*.
# SEE ALSO
[`fabric`(7)](fabric.7.html),
[`fi_provider`(7)](fi_provider.7.html),
[`fi_psm`(7)](fi_psm.7.html),
[`fi_psm3`(7)](fi_psm3.7.html)