File: pkt-processing.md

package info (click to toggle)
mpich 4.3.2-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 101,184 kB
  • sloc: ansic: 1,040,629; cpp: 82,270; javascript: 40,763; perl: 27,933; python: 16,041; sh: 14,676; xml: 14,418; f90: 12,916; makefile: 9,270; fortran: 8,046; java: 4,635; asm: 324; ruby: 103; awk: 27; lisp: 19; php: 8; sed: 4
file content (81 lines) | stat: -rw-r--r-- 4,570 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
## EFA Libfabric Send/Receive/Completion Paths

### Overview

The EFA provider supports two different endpoint types, `FI_EP_RDM` and
`FI_EP_DGRAM`. This document covers `FI_EP_RDM` as it implements a wire
protocol and software support for some of the Libfabric API such as tag
matching, send after send ordering guarantees, segmentation and reassembly for
large messages, emulation for RMA and atomics, and more.

There are a couple key data structures that are used to implement these
software-level features. The wire protocol that we implement is covered in a
separate document.

### Relevant data structures and functions

`efa_rdm_ep` contains device information and structures for the endpoint including
the device/shm endpoints and completion queues and their state, the packet
pools for recv/send, outstanding app receives to be matched, outstanding sends
in progress, sends and receives queued due to resource exhaustion, unexpected
messages, and structures to track out of order packets and remote peer
capabilities and status.

`efa_rdm_ope` contains information and structures used in send/receive operations. 
It is used in send operation for send posted directly by the app or indirectly 
by emulated read/write operations. When the send is completed a send completion 
will be written and the txe (TX entry) will be released.
It is used in  receive operation for a receive posted by the app. This structure 
is used for tag matching, to queue unexpected messages to be matched later, and to 
keep track of whether long message receives are completed. Just like the txe,
when a receive operation is completed a receive completion is written to the app 
and the `rxe` (RX entry) is released.

`efa_rdm_ep_progress` is the progress handler we register when the completion queue
is created and is called via the util completion queue functions. While the EFA
device will progress sends and receives posted to it, the Libfabric provider
has to process those device completions, potentially copy data out of a bounce
buffer into the application buffer, and write the application completions. This
all happens in this function. The progress handler also progresses long
messages and queued messages.

### Dealing with device resource exhaustion

The EFA device has fixed send and receive queue sizes which the Libfabric
provider has to manage. In general, we try to write an error to the app when
resources are exhausted as the app can manage resource exhaustion better than
the provider. However, there are some cases where we have to queue packets or
store state about a send or receive to be acted on later.

The first case is control messages that have to be queued, for example, we may
send parts of a message and then hit the device limit when sending a segmented,
medium message, or fail to send a control packet containing information that
can't be reconstructed in the future. `efa_rdm_ope_post_send_or_queue` handles
those cases.

We also may queue an rxe/te if we're unable to continue sending segments
or if we fail to post a control message for that entry. You'll find the lists
where those are queued and progressed in `efa_rdm_ep_progress_internal`.

### Dealing with receiver not ready errors (RNR)

Finally, the EFA device may write an error completion for RNR, meaning there is
no receive buffer available for the device to place the payload. This can
happen when the application is not posting receive buffers fast enough, but for
the `FI_EP_RDM` receive buffers are pre posted as packets are processed. When
we get RNR in that case, this means that a peer is overloaded.  This can happen
for any control or data packet we post, so to handle this we queue these
packets to be sent later after we backoff for the remote peer.

The occasional RNR is expected so we configure the device to retransmit a
handful of times without writing an error to the host. This is to avoid the
latency penalty of the device writing an error completion, the provider
processing that completion, and trying the send again. However, once the
Libfabric provider receives an RNR for the same packet that we already tried to
retransmit we start random exponential backoff for that peer. We stop sending
to that peer until the peer exits backoff, meaning we either received a
successful send completion for that peer or the backoff timer expires.

See `efa_rdm_ep_queue_rnr_pkt` for where the packets are queued and backoff timers are
set, and see `efa_rdm_ep_check_peer_backoff_timer` for where those timers are
checked and we allow sends to that remote peer again.