---
layout: page
title: fi_endpoint(3)
tagline: Libfabric Programmer's Manual
---
{% include JB/setup %}
# NAME
fi_endpoint \- Fabric endpoint operations
fi_endpoint / fi_endpoint2 / fi_scalable_ep / fi_passive_ep / fi_close
: Allocate or close an endpoint.
fi_ep_bind
: Associate an endpoint with hardware resources, such as event queues,
completion queues, counters, address vectors, or shared transmit/receive
contexts.
fi_scalable_ep_bind
: Associate a scalable endpoint with an address vector
fi_pep_bind
: Associate a passive endpoint with an event queue
fi_enable
: Transitions an active endpoint into an enabled state.
fi_cancel
: Cancel a pending asynchronous data transfer
fi_ep_alias
: Create an alias to the endpoint
fi_control
: Control endpoint operation.
fi_getopt / fi_setopt
: Get or set endpoint options.
fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
: Open a transmit or receive context.
fi_tc_dscp_set / fi_tc_dscp_get
: Convert between a DSCP value and a network traffic class
fi_rx_size_left / fi_tx_size_left (DEPRECATED)
: Query the lower bound on how many RX/TX operations may be posted without
an operation returning -FI_EAGAIN. These functions have been deprecated
and will be removed in a future version of the library.
# SYNOPSIS
```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
struct fid_ep **ep, void *context);
int fi_endpoint2(struct fid_domain *domain, struct fi_info *info,
struct fid_ep **ep, uint64_t flags, void *context);
int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
struct fid_ep **sep, void *context);
int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
struct fid_pep **pep, void *context);
int fi_tx_context(struct fid_ep *sep, int index,
struct fi_tx_attr *attr, struct fid_ep **tx_ep,
void *context);
int fi_rx_context(struct fid_ep *sep, int index,
struct fi_rx_attr *attr, struct fid_ep **rx_ep,
void *context);
int fi_stx_context(struct fid_domain *domain,
struct fi_tx_attr *attr, struct fid_stx **stx,
void *context);
int fi_srx_context(struct fid_domain *domain,
struct fi_rx_attr *attr, struct fid_ep **rx_ep,
void *context);
int fi_close(struct fid *ep);
int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);
int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);
int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);
int fi_enable(struct fid_ep *ep);
int fi_cancel(struct fid_ep *ep, void *context);
int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);
int fi_control(struct fid *ep, int command, void *arg);
int fi_getopt(struct fid *ep, int level, int optname,
void *optval, size_t *optlen);
int fi_setopt(struct fid *ep, int level, int optname,
const void *optval, size_t optlen);
uint32_t fi_tc_dscp_set(uint8_t dscp);
uint8_t fi_tc_dscp_get(uint32_t tclass);
DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);
DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
```
# ARGUMENTS
*fid*
: On creation, specifies a fabric or access domain. On bind,
identifies the event queue, completion queue, counter, or address vector to
bind to the endpoint. In other cases, it's a fabric identifier of an
associated resource.
*info*
: Details about the fabric interface endpoint to be opened, obtained
from fi_getinfo.
*ep*
: A fabric endpoint.
*sep*
: A scalable fabric endpoint.
*pep*
: A passive fabric endpoint.
*context*
: Context associated with the endpoint or asynchronous operation.
*index*
: Index to retrieve a specific transmit/receive context.
*attr*
: Transmit or receive context attributes.
*flags*
: Additional flags to apply to the operation.
*command*
: Command of control operation to perform on endpoint.
*arg*
: Optional control argument.
*level*
: Protocol level at which the desired option resides.
*optname*
: The protocol option to read or set.
*optval*
: The option value that was read or to set.
*optlen*
: The size of the optval buffer.
# DESCRIPTION
Endpoints are transport level communication portals. There are two
types of endpoints: active and passive. Passive endpoints belong to a
fabric domain and are most often used to listen for incoming connection
requests. However, a passive endpoint may be used to reserve a fabric address
that can be granted to an active endpoint. Active endpoints belong to access
domains and can perform data transfers.
Active endpoints may be connection-oriented or connectionless, and may
provide data reliability. The data transfer interfaces -- messages (fi_msg),
tagged messages (fi_tagged), RMA (fi_rma), and atomics (fi_atomic) --
are associated with active endpoints. In basic configurations, an
active endpoint has transmit and receive queues. In general, operations
that generate traffic on the fabric are posted to the transmit queue.
This includes all RMA and atomic operations, along with sent messages and
sent tagged messages. Operations that post buffers for receiving incoming
data are submitted to the receive queue.
Active endpoints are created in the disabled state. They must
transition into an enabled state before accepting data transfer
operations, including posting of receive buffers. The fi_enable call
is used to transition an active endpoint into an enabled state. The
fi_connect and fi_accept calls will also transition an endpoint into
the enabled state, if it is not already enabled.
In order to transition an endpoint into an enabled state, it must be
bound to one or more fabric resources. An endpoint that will generate
asynchronous completions, either through data transfer operations or
communication establishment events, must be bound to the appropriate
completion queues or event queues, respectively, before being enabled.
Additionally, endpoints that use manual progress must be associated
with relevant completion queues or event queues in order to drive
progress. For endpoints that are only used as the target of RMA or
atomic operations, this means binding the endpoint to a completion
queue associated with receive processing. Connectionless endpoints must
be bound to an address vector.
Once an endpoint has been activated, it may be associated with an address
vector. Receive buffers may be posted to it and
calls may be made to connection establishment routines.
Connectionless endpoints may also perform data transfers.
The behavior of an endpoint may be adjusted by setting its control
data and protocol options. This allows the underlying provider to
redirect function calls to implementations optimized to meet the
desired application behavior.
If an endpoint experiences a critical error, it will transition back
into a disabled state. Critical errors are reported through the
event queue associated with the EP. In certain cases, a disabled endpoint may
be re-enabled. The ability to transition back into an enabled
state is provider specific and depends on the type of error that
the endpoint experienced. When an endpoint is disabled as a result
of a critical error, all pending operations are discarded.
## fi_endpoint / fi_passive_ep / fi_scalable_ep
fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a
new passive endpoint. fi_scalable_ep allocates a scalable endpoint.
The properties and behavior of the endpoint are defined based on the
provided struct fi_info. See fi_getinfo for additional details on
fi_info. fi_info flags that control the operation of an endpoint are
defined below. See section SCALABLE ENDPOINTS.
If an active endpoint is allocated in order to accept a connection request,
the fi_info parameter must be the same as the fi_info structure provided with
the connection request (FI_CONNREQ) event.
An active endpoint may acquire the properties of a passive endpoint by setting
the fi_info handle field to the passive endpoint fabric descriptor. This is
useful for applications that need to reserve the fabric address of an
endpoint prior to knowing if the endpoint will be used on the active or passive
side of a connection. For example, this feature is useful for simulating
socket semantics. Once an active endpoint acquires the properties of a passive
endpoint, the passive endpoint is no longer bound to any fabric resources and
must no longer be used. The user is expected to close the passive endpoint
after opening the active endpoint in order to free up any lingering resources
that had been used.
## fi_endpoint2
Similar to fi_endpoint, but accepts an extra parameter *flags*. It is mainly used for
opening endpoints that use the peer transfer feature. See
[`fi_peer`(3)](fi_peer.3.html).
## fi_close
Closes an endpoint and releases all resources associated with it.
When closing a scalable endpoint, there must be no open transmit or
receive contexts associated with the scalable endpoint. If resources are still
associated with the scalable endpoint when attempting to close, the call will
return -FI_EBUSY.
Outstanding operations posted to the endpoint when fi_close is
called will be discarded. Discarded operations will silently be dropped,
with no completions reported. Additionally, a provider may discard previously
completed operations from the associated completion queue(s). The
behavior to discard completed operations is provider specific.
## fi_ep_bind
fi_ep_bind is used to associate an endpoint with other allocated
resources, such as completion queues, counters, address vectors,
event queues, shared contexts, and memory regions. The types of objects that
must be bound with an endpoint depend on the endpoint type and its
configuration.
Passive endpoints must be bound with an EQ that supports connection
management events. Connectionless endpoints must be bound to a
single address vector. If an endpoint is using a shared transmit
and/or receive context, the shared contexts must be bound to the endpoint.
CQs, counters, AV, and shared contexts must be bound to endpoints
before they are enabled either explicitly or implicitly.
An endpoint must be bound with CQs capable of reporting completions for any
asynchronous operation initiated on the endpoint. For example, if the
endpoint supports any outbound transfers (sends, RMA, atomics, etc.), then
it must be bound to a completion queue that can report transmit completions.
This is true even if the endpoint is configured to suppress successful
completions, in order that operations that complete in error may be reported
to the user.
An active endpoint may direct asynchronous completions to different
CQs, based on the type of operation. This is specified using
fi_ep_bind flags. The following flags may be OR'ed together when
binding an endpoint to a completion domain CQ.
*FI_RECV*
: Directs the notification of inbound data transfers to the specified
completion queue. This includes received messages. This binding
automatically includes FI_REMOTE_WRITE, if applicable to the
endpoint.
*FI_SELECTIVE_COMPLETION*
: By default, data transfer operations write CQ completion entries
into the associated completion queue after they have successfully completed.
Applications can use this bind flag to selectively enable when
completions are generated. If FI_SELECTIVE_COMPLETION is specified,
data transfer operations will not generate CQ entries for _successful_
completions unless FI_COMPLETION is set as an operational flag for the
given operation. Operations that fail asynchronously will still generate
completions, even if a completion is not requested. FI_SELECTIVE_COMPLETION
must be OR'ed with FI_TRANSMIT and/or FI_RECV flags.
When FI_SELECTIVE_COMPLETION is set, the user must determine when a
request that does NOT have FI_COMPLETION set has completed indirectly,
usually based on the completion of a subsequent operation or by using
completion counters. Use of this flag may improve performance by allowing
the provider to avoid writing a CQ completion entry for every operation.
See Notes section below for additional information on how this flag
interacts with the FI_CONTEXT and FI_CONTEXT2 mode bits.
*FI_TRANSMIT*
: Directs the completion of outbound data transfer requests to the
specified completion queue. This includes send message, RMA, and
atomic operations.
An endpoint may optionally be bound to a completion counter. Associating
an endpoint with a counter is in addition to binding the EP with a CQ. When
binding an endpoint to a counter, the following flags may be specified.
*FI_READ*
: Increments the specified counter whenever an RMA read, atomic fetch,
or atomic compare operation initiated from the endpoint has completed
successfully or in error.
*FI_RECV*
: Increments the specified counter whenever a message is
received over the endpoint. Received messages include both tagged
and normal message operations.
*FI_REMOTE_READ*
: Increments the specified counter whenever an RMA read, atomic fetch, or
atomic compare operation is initiated from a remote endpoint that
targets the given endpoint. Use of this flag requires that the
endpoint be created using FI_RMA_EVENT.
*FI_REMOTE_WRITE*
: Increments the specified counter whenever an RMA write or base
atomic operation is initiated from a remote endpoint that targets
the given endpoint. Use of this flag requires that the
endpoint be created using FI_RMA_EVENT.
*FI_SEND*
: Increments the specified counter whenever a message transfer initiated
over the endpoint has completed successfully or in error. Sent messages
include both tagged and normal message operations.
*FI_WRITE*
: Increments the specified counter whenever an RMA write or base atomic
operation initiated from the endpoint has completed successfully or in error.
An endpoint may only be bound to a single CQ or counter for a given
type of operation. For example, an EP may not bind to two counters
both using FI_WRITE. Furthermore, providers may limit CQ and counter
bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
## fi_scalable_ep_bind
fi_scalable_ep_bind is used to associate a scalable endpoint with an
address vector. See section on SCALABLE ENDPOINTS. A scalable
endpoint has a single transport level address and can support multiple
transmit and receive contexts. The transmit and receive contexts share
the transport-level address. Address vectors that are bound to
scalable endpoints are implicitly bound to any transmit or receive
contexts created using the scalable endpoint.
## fi_enable
This call transitions the endpoint into an enabled state. An endpoint
must be enabled before it may be used to perform data transfers.
Enabling an endpoint typically results in hardware resources being
assigned to it. Endpoints making use of completion queues, counters,
event queues, and/or address vectors must be bound to them before being
enabled.
Calling connect or accept on an endpoint will implicitly enable an
endpoint if it has not already been enabled.
fi_enable may also be used to re-enable an endpoint that has been
disabled as a result of experiencing a critical error. Applications
should check the return value from fi_enable to see if a disabled
endpoint has successfully been re-enabled.
## fi_cancel
fi_cancel attempts to cancel an outstanding asynchronous operation.
Canceling an operation causes the fabric provider to search for the
operation and, if it is still pending, complete it as having been
canceled. An error queue entry will be available in the
associated error queue with error code FI_ECANCELED. On the other hand,
if the operation completed before the call to fi_cancel, then the
completion status of that operation will be available in the associated
completion queue. No specific entry related to fi_cancel itself will be posted.
Cancel uses the context parameter associated with an operation to identify
the request to cancel. Operations posted without a valid context parameter --
either no context parameter is specified or the context value was ignored
by the provider -- cannot be canceled. If multiple outstanding operations
match the context parameter, only one will be canceled. In this case, the
operation which is canceled is provider specific.
The cancel operation is asynchronous, but will complete within a bounded
period of time.
## fi_ep_alias
This call creates an alias to the specified endpoint. Conceptually,
an endpoint alias provides an alternate software path from the
application to the underlying provider hardware. An alias EP differs
from its parent endpoint only by its default data transfer flags. For
example, an alias EP may be configured to use a different completion
mode. By default, an alias EP inherits the same data transfer flags
as the parent endpoint. An application can use fi_control to modify
the alias EP operational flags.
When allocating an alias, an application may configure either the transmit
or receive operational flags. This avoids needing a separate call to
fi_control to set those flags. The flags passed to fi_ep_alias must
include FI_TRANSMIT or FI_RECV (not both) with other operational flags OR'ed
in. This will override the transmit or receive flags,
respectively, for operations posted through the alias endpoint.
All allocated aliases must be closed for the underlying endpoint to be
released.
## fi_control
The control operation is used to adjust the default behavior of an
endpoint. It allows the underlying provider to redirect function
calls to implementations optimized to meet the desired application
behavior. As a result, calls to fi_control must be serialized
against all other calls to an endpoint.
The base operation of an endpoint is selected during creation using
struct fi_info. The following control commands and arguments may be
assigned to an endpoint.
**FI_BACKLOG -- int *value**
: This option only applies to passive endpoints. It is used to set the
connection request backlog for listening endpoints.
**FI_GETOPSFLAG -- uint64_t *flags**
: Used to retrieve the current value of flags associated with the data
transfer operations initiated on the endpoint. The control argument must
include FI_TRANSMIT or FI_RECV (not both) flags to indicate the type of
data transfer flags to be returned.
See below for a list of control flags.
**FI_GETWAIT -- void \*\***
: This command allows the user to retrieve the file descriptor associated
with a socket endpoint. The fi_control arg parameter should be an address
where a pointer to the returned file descriptor will be written. See fi_eq.3
for additional details using fi_control with FI_GETWAIT. The file descriptor
may be used for notification that the endpoint is ready to send or receive
data.
**FI_SETOPSFLAG -- uint64_t *flags**
: Used to change the data transfer operation flags associated with an
endpoint. The control argument must include FI_TRANSMIT or FI_RECV (not both)
to indicate the type of data transfer that the flags should apply to, with other
flags OR'ed in. The given flags will override the previous transmit and receive
attributes that were set when the endpoint was created.
Valid control flags are defined below.
## fi_getopt / fi_setopt
Endpoint protocol options may be retrieved using fi_getopt or set
using fi_setopt. Applications specify the level at which a desired option
exists, identify the option, and provide input/output buffers to get
or set the option. fi_setopt provides an application a way to adjust
low-level protocol and implementation specific details of an endpoint.
The following option levels and option names and parameters are defined.
*FI_OPT_ENDPOINT*
- *FI_OPT_BUFFERED_LIMIT - size_t*
: Defines the maximum size of a buffered message that will be reported
to users as part of a receive completion when the FI_BUFFERED_RECV mode
is enabled on an endpoint.
fi_getopt() will return the currently configured threshold, or the
provider's default threshold if one has not been set by the application.
fi_setopt() allows an application to configure the threshold. If the
provider cannot support the requested threshold, it will fail the
fi_setopt() call with FI_EMSGSIZE. Calling fi_setopt() with the
threshold set to SIZE_MAX will set the threshold to the maximum
supported by the provider. fi_getopt() can then be used to retrieve
the set size.
In most cases, the sending and receiving endpoints must be
configured to use the same threshold value, and the threshold must be
set prior to enabling the endpoint.
- *FI_OPT_BUFFERED_MIN - size_t*
: Defines the minimum size of a buffered message that will be reported.
Applications would set this to a size that's big enough to decide whether
to discard or claim a buffered receive or when to claim a buffered receive
on getting a buffered receive completion. The value is typically used by a
provider when sending a rendezvous protocol request where it would send
at least FI_OPT_BUFFERED_MIN bytes of application data along with it. A smaller
sized rendezvous protocol message usually results in better latency for the
overall transfer of a large message.
- *FI_OPT_CM_DATA_SIZE - size_t*
: Defines the size of available space in CM messages for user-defined
data. This value limits the amount of data that applications can exchange
between peer endpoints using the fi_connect, fi_accept, and fi_reject
operations. The size returned is dependent upon the properties of the
endpoint, except in the case of passive endpoints, in which the size reflects
the maximum size of the data that may be present as part of a connection
request event. This option is read only.
- *FI_OPT_MIN_MULTI_RECV - size_t*
: Defines the minimum receive buffer space available when the receive
buffer is released by the provider (see FI_MULTI_RECV). Modifying this
value is only guaranteed to set the minimum buffer space needed on
receives posted after the value has been changed. It is recommended
that applications that want to override the default MIN_MULTI_RECV
value set this option before enabling the corresponding endpoint.
- *FI_OPT_FI_HMEM_P2P - int*
: Defines how the provider should handle peer to peer FI_HMEM transfers for
this endpoint. By default, the provider will choose whether to use peer to peer
support based on the type of transfer (FI_HMEM_P2P_ENABLED). Valid values
defined in fi_endpoint.h are:
* FI_HMEM_P2P_ENABLED: Peer to peer support may be used by the provider
to handle FI_HMEM transfers, and which transfers are initiated using
peer to peer is subject to the provider implementation.
* FI_HMEM_P2P_REQUIRED: Peer to peer support must be used for
transfers. Transfers that cannot be performed using p2p will be
reported as failing.
* FI_HMEM_P2P_PREFERRED: Peer to peer support should be used by the
provider for all transfers if available, but the provider may choose
to copy the data to initiate the transfer if peer to peer support is
unavailable.
* FI_HMEM_P2P_DISABLED: Peer to peer support should not be used.
: fi_setopt() will return -FI_EOPNOTSUPP if the mode requested cannot be supported
by the provider.
: The FI_HMEM_DISABLE_P2P environment variable discussed in
[`fi_mr`(3)](fi_mr.3.html) takes precedence over this setopt option.
- *FI_OPT_XPU_TRIGGER - struct fi_trigger_xpu \**
: This option only applies to the fi_getopt() call. It is used to query
the maximum number of variables required to support XPU
triggered operations, along with the size of each variable.
The user provides a filled out struct fi_trigger_xpu on input. The iface
and device fields should reference an HMEM domain. If the provider does not
support XPU triggered operations from the given device, fi_getopt() will
return -FI_EOPNOTSUPP. On input, var should reference an array of
struct fi_trigger_var data structures, with count set to the size of the
referenced array. If count is 0, the var field will be ignored, and the
provider will return the number of fi_trigger_var structures needed. If
count is > 0, the provider will set count to the needed value, and for
each fi_trigger_var available, set the datatype and count of the variable
used for the trigger.
- *FI_OPT_CUDA_API_PERMITTED - bool \**
: This option only applies to the fi_setopt call. It is used to control an
endpoint's behavior in making calls to the CUDA API. By default, an endpoint
is permitted to call the CUDA API. If the user wishes to prohibit an endpoint from
making such calls, the user can do so by setting this option to false.
If an endpoint's support of CUDA memory relies on making calls to the CUDA API,
it will return -FI_EOPNOTSUPP for the call to fi_setopt.
If either the CUDA library or a CUDA device is not available, the endpoint will
return -FI_EINVAL.
All providers that support FI_HMEM capability implement this option.
## fi_tc_dscp_set
This call converts a DSCP defined value into a libfabric traffic class value.
It should be used when assigning a DSCP value when setting the tclass field
in either domain or endpoint attributes.
## fi_tc_dscp_get
This call returns the DSCP value associated with the tclass field for the
domain or endpoint attributes.
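The round trip between a DSCP code point and a tclass value can be pictured with the following minimal sketch. It does not use the libfabric headers: the helpers `ex_tc_dscp_set` and `ex_tc_dscp_get` and the marker bit `EX_TC_DSCP_FLAG` are hypothetical stand-ins, and the marker value is an assumption made only for illustration. The point is that the tclass value is an encoded form of the DSCP, not the raw code point.

```c
#include <stdint.h>

/* Placeholder marker bit (assumption). The real encoding is defined by
 * the libfabric headers and should not be hard-coded by applications. */
#define EX_TC_DSCP_FLAG 0x100u

/* Stand-in for fi_tc_dscp_set(): tag a DSCP code point as a traffic class. */
static uint32_t ex_tc_dscp_set(uint8_t dscp)
{
    return (uint32_t)dscp | EX_TC_DSCP_FLAG;
}

/* Stand-in for fi_tc_dscp_get(): recover the DSCP, or 0 if the traffic
 * class does not carry one. */
static uint8_t ex_tc_dscp_get(uint32_t tclass)
{
    return (tclass & EX_TC_DSCP_FLAG) ? (uint8_t)tclass : 0;
}
```

In an application, the value returned by fi_tc_dscp_set would be stored in the tclass field of the domain or transmit attributes, and fi_tc_dscp_get would be applied to the tclass value reported back by the provider.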
## fi_rx_size_left (DEPRECATED)
This function has been deprecated and will be removed in a future version
of the library. It may not be supported by all providers.
The fi_rx_size_left call returns a lower bound on the number of receive
operations that may be posted to the given endpoint without that operation
returning -FI_EAGAIN. Depending on the specific details of the subsequently
posted receive operations (e.g., number of iov entries, which receive function
is called, etc.), it may be possible to post more receive operations than
originally indicated by fi_rx_size_left.
## fi_tx_size_left (DEPRECATED)
This function has been deprecated and will be removed in a future version
of the library. It may not be supported by all providers.
The fi_tx_size_left call returns a lower bound on the number of transmit
operations that may be posted to the given endpoint without that operation
returning -FI_EAGAIN. Depending on the specific details of the subsequently
posted transmit operations (e.g., number of iov entries, which transmit
function is called, etc.), it may be possible to post more transmit operations
than originally indicated by fi_tx_size_left.
# ENDPOINT ATTRIBUTES
The fi_ep_attr structure defines the set of attributes associated with
an endpoint. Endpoint attributes may be further refined using the transmit
and receive context attributes as shown below.
{% highlight c %}
struct fi_ep_attr {
enum fi_ep_type type;
uint32_t protocol;
uint32_t protocol_version;
size_t max_msg_size;
size_t msg_prefix_size;
size_t max_order_raw_size;
size_t max_order_war_size;
size_t max_order_waw_size;
uint64_t mem_tag_format;
size_t tx_ctx_cnt;
size_t rx_ctx_cnt;
size_t auth_key_size;
uint8_t *auth_key;
};
{% endhighlight %}
## type - Endpoint Type
If specified, indicates the type of fabric interface communication
desired. Supported types are:
*FI_EP_DGRAM*
: Supports a connectionless, unreliable datagram communication.
Message boundaries are maintained, but the maximum message size may
be limited to the fabric MTU. Flow control is not guaranteed.
*FI_EP_MSG*
: Provides a reliable, connection-oriented data transfer service with
flow control that maintains message boundaries.
*FI_EP_RDM*
: Reliable datagram message. Provides a reliable, connectionless data
transfer service with flow control that maintains message
boundaries.
*FI_EP_SOCK_DGRAM*
: A connectionless, unreliable datagram endpoint with UDP socket-like
semantics. FI_EP_SOCK_DGRAM is most useful for applications designed
around using UDP sockets. See the SOCKET ENDPOINT section for additional
details and restrictions that apply to datagram socket endpoints.
*FI_EP_SOCK_STREAM*
: Data streaming endpoint with TCP socket-like semantics. Provides
a reliable, connection-oriented data transfer service that does
not maintain message boundaries. FI_EP_SOCK_STREAM is most useful for
applications designed around using TCP sockets. See the SOCKET
ENDPOINT section for additional details and restrictions that apply
to stream endpoints.
*FI_EP_UNSPEC*
: The type of endpoint is not specified. This is usually provided as
input, with other attributes of the endpoint or the provider
selecting the type.
## Protocol
Specifies the low-level end to end protocol employed by the provider.
A matching protocol must be used by communicating endpoints to ensure
interoperability. The following protocol values are defined.
Provider specific protocols are also allowed. Provider specific
protocols will be indicated by having the upper bit of the
protocol value set to one.
*FI_PROTO_EFA*
: Proprietary protocol on Elastic Fabric Adapter fabric. It supports both
DGRAM and RDM endpoints.
*FI_PROTO_GNI*
: Protocol runs over Cray GNI low-level interface.
*FI_PROTO_IB_RDM*
: Reliable-datagram protocol implemented over InfiniBand reliable-connected
queue pairs.
*FI_PROTO_IB_UD*
: The protocol runs over InfiniBand unreliable datagram queue pairs.
*FI_PROTO_IWARP*
: The protocol runs over the Internet wide area RDMA protocol transport.
*FI_PROTO_IWARP_RDM*
: Reliable-datagram protocol implemented over iWarp reliable-connected
queue pairs.
*FI_PROTO_NETWORKDIRECT*
: Protocol runs over Microsoft NetworkDirect service provider interface.
This adds reliable-datagram semantics over the NetworkDirect
connection-oriented endpoint semantics.
*FI_PROTO_PSMX*
: The protocol is based on an Intel proprietary protocol known as PSM,
performance scaled messaging. PSMX is an extended version of the
PSM protocol to support the libfabric interfaces.
*FI_PROTO_PSMX2*
: The protocol is based on an Intel proprietary protocol known as PSM2,
performance scaled messaging version 2. PSMX2 is an extended version of the
PSM2 protocol to support the libfabric interfaces.
*FI_PROTO_PSMX3*
: The protocol is Intel's protocol known as PSM3, performance scaled
messaging version 3. PSMX3 is implemented over RoCEv2 and verbs.
*FI_PROTO_RDMA_CM_IB_RC*
: The protocol runs over InfiniBand reliable-connected queue pairs,
using the RDMA CM protocol for connection establishment.
*FI_PROTO_RXD*
: Reliable-datagram protocol implemented over datagram endpoints. RXD is
a libfabric utility component that adds RDM endpoint semantics over
DGRAM endpoint semantics.
*FI_PROTO_RXM*
: Reliable-datagram protocol implemented over message endpoints. RXM is
a libfabric utility component that adds RDM endpoint semantics over
MSG endpoint semantics.
*FI_PROTO_SOCK_TCP*
: The protocol is layered over TCP packets.
*FI_PROTO_UDP*
: The protocol sends and receives UDP datagrams. For example, an
endpoint using *FI_PROTO_UDP* will be able to communicate with a
remote peer that is using Berkeley *SOCK_DGRAM* sockets using
*IPPROTO_UDP*.
*FI_PROTO_UNSPEC*
: The protocol is not specified. This is usually provided as input,
with other attributes of the socket or the provider selecting the
actual protocol.
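Because provider specific protocols are identified by the upper bit of the 32-bit protocol value, an application can distinguish them from the defined values with a single mask. The following is a minimal sketch; the helper name `proto_is_provider_specific` is hypothetical and not part of libfabric.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the upper bit of the 32-bit protocol value is set,
 * which marks the protocol as provider specific. */
static bool proto_is_provider_specific(uint32_t protocol)
{
    return (protocol & UINT32_C(0x80000000)) != 0;
}
```

An application that only interoperates over well-known protocols could use such a check to skip fi_info entries whose ep_attr protocol value is provider specific.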
## protocol_version - Protocol Version
Identifies which version of the protocol is employed by the provider.
The protocol version allows providers to extend an existing protocol
in a backward compatible manner, for example, by adding support for
additional features or functionality. Providers that support different versions
of the same protocol should inter-operate, but only when using the
capabilities defined for the lesser version.
## max_msg_size - Max Message Size
Defines the maximum size for an application data transfer as a single
operation.
## msg_prefix_size - Message Prefix Size
Specifies the size of any required message prefix buffer space. This
field will be 0 unless the FI_MSG_PREFIX mode is enabled. If
msg_prefix_size is > 0 the specified value will be a multiple of
8-bytes.
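When FI_MSG_PREFIX is in effect, the application reserves the prefix space in front of its payload. The sketch below shows one way to lay out such a buffer. The type `struct prefixed_buf` and the helper `prefixed_buf_alloc` are hypothetical names, and the check that the prefix is a multiple of 8 is a sanity test based on the statement above, not something libfabric requires the application to verify.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: a buffer with provider prefix space in front. */
struct prefixed_buf {
    uint8_t *base;      /* start of the allocation (prefix space) */
    uint8_t *payload;   /* where application data begins */
    size_t total_len;   /* prefix plus payload length */
};

/* Allocate prefix + payload bytes. Returns 0 on success, -1 on failure. */
static int prefixed_buf_alloc(struct prefixed_buf *b,
                              size_t msg_prefix_size, size_t payload_len)
{
    if (msg_prefix_size % 8 != 0)
        return -1;                      /* unexpected prefix size */
    if (payload_len > SIZE_MAX - msg_prefix_size)
        return -1;                      /* length would overflow */

    b->total_len = msg_prefix_size + payload_len;
    b->base = malloc(b->total_len ? b->total_len : 1);
    if (!b->base)
        return -1;

    b->payload = b->base + msg_prefix_size;
    return 0;
}
```

The caller writes application data at `payload` and releases the allocation with `free(base)`. See the description of FI_MSG_PREFIX in fi_getinfo(3) for which pointer and length are passed to the data transfer calls.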
## Max RMA Ordered Size
The maximum ordered size specifies the delivery order of transport
data into target memory for RMA and atomic operations. Data ordering
is separate from, but dependent on, message ordering (defined below). Data
ordering is unspecified where message order is not defined.
Data ordering refers to the access of the same target memory by subsequent
operations. When back to back RMA read or write operations access the
same registered memory location, data ordering indicates whether the
second operation reads or writes the target memory after the first
operation has completed. For example, will an RMA read that follows
an RMA write read back the data that was written? Similarly, will an
RMA write that follows an RMA read update the target buffer after the
read has transferred the original data? Data ordering answers these
questions, even in the presence of errors, such as the need to resend
data because of lost or corrupted network traffic.
RMA ordering applies between two operations, and not within a single
data transfer. Therefore, ordering is defined
per byte-addressable memory location. I.e. ordering specifies
whether location X is accessed by the second operation after the first
operation. Nothing is implied about the completion of the first
operation before the second operation is initiated. For example, if
the first operation updates locations X and Y, but the second operation
only accesses location X, there are no guarantees defined relative to
location Y and the second operation.
In order to support large data transfers being broken into multiple packets
and sent using multiple paths through the fabric, data ordering may be
limited to transfers of a specific size or less. Providers specify when
data ordering is maintained through the following values. Note that even
if data ordering is not maintained, message ordering may be.
*max_order_raw_size*
: Read after write size. If set, an RMA or atomic read operation
issued after an RMA or atomic write operation, both of which are
smaller than the size, will be ordered. Where the target memory
locations overlap, the RMA or atomic read operation will see the
results of the previous RMA or atomic write.
*max_order_war_size*
: Write after read size. If set, an RMA or atomic write operation
issued after an RMA or atomic read operation, both of which are
smaller than the size, will be ordered. The RMA or atomic read
operation will see the initial value of the target memory location
before a subsequent RMA or atomic write updates the value.
*max_order_waw_size*
: Write after write size. If set, an RMA or atomic write operation
issued after an RMA or atomic write operation, both of which are
smaller than the size, will be ordered. The target memory location
will reflect the results of the second RMA or atomic write.
An order size value of 0 indicates that ordering is not guaranteed.
A value of -1 guarantees ordering for any data size.
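The rules for interpreting an order size can be condensed into a small predicate. This is a minimal sketch, not part of libfabric: the helper name `data_order_guaranteed` is hypothetical, and treating the size as an inclusive limit is an assumption, since the text above only says that both operations must be smaller than the size.

```c
#include <stdbool.h>
#include <stddef.h>

/* max_order_size is one of max_order_raw_size, max_order_war_size, or
 * max_order_waw_size. first_len and second_len are the transfer sizes of
 * the two operations being compared. */
static bool data_order_guaranteed(size_t max_order_size,
                                  size_t first_len, size_t second_len)
{
    if (max_order_size == 0)
        return false;   /* ordering is not guaranteed */

    /* A value of -1 converts to the largest size_t, so the comparison
     * below also covers the "any data size" case. */
    return first_len <= max_order_size && second_len <= max_order_size;
}
```

For example, with max_order_waw_size set to 64, two 32-byte writes to the same location would be ordered under this reading, while a 32-byte write followed by a 4096-byte write would not be.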
## mem_tag_format - Memory Tag Format
The memory tag format is a bit array used to convey the number of
tagged bits supported by a provider. Additionally, it may be used to
divide the bit array into separate fields. The mem_tag_format
optionally begins with a series of bits set to 0, to signify bits
which are ignored by the provider. Following the initial prefix of
ignored bits, the array will consist of alternating groups of bits set
to all 1's or all 0's. Each group of bits corresponds to a tagged
field. The implication of defining a tagged field is that when a mask
is applied to the tagged bit array, all bits belonging to a single
field will either be set to 1 or 0, collectively.
For example, a mem_tag_format of 0x30FF indicates support for 14
tagged bits, separated into 3 fields. The first field consists of
2-bits, the second field 4-bits, and the final field 8-bits. Valid
masks for such a tagged field would be a bitwise OR'ing of zero or
more of the following values: 0x3000, 0x0F00, and 0x00FF. The provider
may not validate the mask provided by the application for performance
reasons.
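The decomposition in the example above can be reproduced mechanically. The following sketch splits a mem_tag_format value into its fields and checks whether a mask selects only whole fields. The helper names `tag_format_fields` and `tag_mask_valid` are hypothetical and not part of libfabric. Since the provider may not validate masks, an application could use a check like this in debug builds.

```c
#include <stddef.h>
#include <stdint.h>

/* Split fmt into field masks, most significant field first.
 * fields must have room for 64 entries. Returns the number of fields. */
static size_t tag_format_fields(uint64_t fmt, uint64_t fields[64])
{
    size_t n = 0;
    int bit = 63;

    /* Skip the leading 0 bits, which the provider ignores. */
    while (bit >= 0 && ((fmt >> bit) & 1) == 0)
        bit--;

    /* Each run of identical bits (all 1's or all 0's) is one field. */
    while (bit >= 0) {
        uint64_t value = (fmt >> bit) & 1;
        uint64_t mask = 0;

        while (bit >= 0 && ((fmt >> bit) & 1) == value) {
            mask |= UINT64_C(1) << bit;
            bit--;
        }
        fields[n++] = mask;
    }
    return n;
}

/* A mask is valid if it covers each field entirely or not at all and
 * sets no bits outside the tagged range. Returns 1 if valid, else 0. */
static int tag_mask_valid(uint64_t fmt, uint64_t mask)
{
    uint64_t fields[64];
    uint64_t tagged = 0;
    size_t n = tag_format_fields(fmt, fields);

    for (size_t i = 0; i < n; i++) {
        uint64_t part = mask & fields[i];

        tagged |= fields[i];
        if (part != 0 && part != fields[i])
            return 0;   /* field only partially selected */
    }
    return (mask & ~tagged) == 0;
}
```

For the format 0x30FF, `tag_format_fields` yields the three masks 0x3000, 0x0F00, and 0x00FF, and `tag_mask_valid` rejects a mask such as 0x0100 because it selects only part of the second field.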
By identifying fields within a tag, a provider may be able to optimize
their search routines. An application which requests tag fields must
provide tag masks that either set all mask bits corresponding to a
field to all 0 or all 1. When negotiating tag fields, an application
can request a specific number of fields of a given size. A provider
must return a tag format that supports the requested number of fields,
with each field being at least the size requested, or fail the
request. A provider may increase the size of the fields. When reporting
completions (see FI_CQ_FORMAT_TAGGED), it is not guaranteed that the
provider would clear out any unsupported tag bits in the tag field of
the completion entry.
It is recommended that field sizes be ordered from smallest to
largest. A generic, unstructured tag and mask can be achieved by
requesting a bit array consisting of alternating 1's and 0's.
## tx_ctx_cnt - Transmit Context Count
Number of transmit contexts to associate with the endpoint. If not
specified (0), 1 context will be assigned if the endpoint supports
outbound transfers. Transmit contexts are independent transmit queues
that may be separately configured. Each transmit context may be bound
to a separate CQ, and no ordering is defined between contexts.
Additionally, no synchronization is needed when accessing contexts in
parallel.
If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
be configured to use a shared transmit context, if supported by the
provider. Providers that do not support shared transmit contexts will
fail the request.
See the scalable endpoint and shared contexts sections for additional
details.
## rx_ctx_cnt - Receive Context Count
Number of receive contexts to associate with the endpoint. If not
specified, 1 context will be assigned if the endpoint supports inbound
transfers. Receive contexts are independent processing queues that
may be separately configured. Each receive context may be bound to a
separate CQ, and no ordering is defined between contexts.
Additionally, no synchronization is needed when accessing contexts in
parallel.
If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
be configured to use a shared receive context, if supported by the
provider. Providers that do not support shared receive contexts will
fail the request.
See the scalable endpoint and shared contexts sections for additional
details.
## auth_key_size - Authorization Key Length
The length of the authorization key in bytes. This field will be 0 if
authorization keys are not available or used. This field is ignored
unless the fabric is opened with API version 1.5 or greater.
## auth_key - Authorization Key
If supported by the fabric, an authorization key (a.k.a. job
key) to associate with the endpoint. An authorization key is used
to limit communication between endpoints. Only peer endpoints that are
programmed to use the same authorization key may communicate.
Authorization keys are often used to implement job keys, to ensure
that processes running in different jobs do not accidentally
cross traffic. The domain authorization key will be used if auth_key_size
is set to 0. This field is ignored unless the fabric is opened with API
version 1.5 or greater.
# TRANSMIT CONTEXT ATTRIBUTES
Attributes specific to the transmit capabilities of an endpoint are
specified using struct fi_tx_attr.
{% highlight c %}
struct fi_tx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t inject_size;
size_t size;
size_t iov_limit;
size_t rma_iov_limit;
uint32_t tclass;
};
{% endhighlight %}
## caps - Capabilities
The requested capabilities of the context. The capabilities must be
a subset of those requested of the associated endpoint. See the
CAPABILITIES section of fi_getinfo(3) for capability details. If
the caps field is 0 on input to fi_getinfo(3), the applicable
capability bits from the fi_info structure will be used.
The following capabilities apply to the transmit attributes: FI_MSG,
FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND, FI_HMEM,
FI_TRIGGER, FI_FENCE, FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX,
FI_COLLECTIVE, and FI_XPU.
Many applications will be able to ignore this field and rely solely
on the fi_info::caps field. Use of this field provides fine grained
control over the transmit capabilities associated with an endpoint.
It is useful when handling scalable endpoints, with multiple transmit
contexts, for example, and allows configuring a specific transmit
context with fewer capabilities than that supported by the endpoint
or other transmit contexts.
## mode
The operational mode bits of the context. The mode bits will be a
subset of those associated with the endpoint. See the MODE section
of fi_getinfo(3) for details. A mode value of 0 will be ignored on
input to fi_getinfo(3), with the mode value of the fi_info structure
used instead. On return from fi_getinfo(3), the mode will be set
only to those constraints specific to transmit operations.
## op_flags - Default transmit operation flags
Flags that control the behavior of operations submitted against the
context. Applicable flags are listed in the Operation Flags
section.
## msg_order - Message Ordering
Message ordering refers to the order in which transport layer headers
(as viewed by the application) are identified and processed. Relaxed message
order enables data transfers to be sent and received out of order, which may
improve performance by utilizing multiple paths through the fabric
from the initiating endpoint to a target endpoint. Message order
applies only between a single source and destination endpoint pair.
Ordering between different target endpoints is not defined.
Message order is determined using a set of ordering bits. Each set
bit indicates that ordering is maintained between data transfers of
the specified type. Message order is defined for [read | write |
send] operations submitted by an application after [read | write |
send] operations.
Message ordering only applies to the end to end transmission of transport
headers. Message ordering is necessary for, but does not guarantee, the order in
which message data is sent or received by the transport layer. Message
ordering requires matching ordering semantics on the receiving side of a data
transfer operation in order to guarantee that ordering is met.
*FI_ORDER_ATOMIC_RAR*
: Atomic read after read. If set, atomic fetch operations are
transmitted in the order submitted relative to other
atomic fetch operations. If not set, atomic fetches
may be transmitted out of order from their submission.
*FI_ORDER_ATOMIC_RAW*
: Atomic read after write. If set, atomic fetch operations are
transmitted in the order submitted relative to atomic update
operations. If not set, atomic fetches may be transmitted ahead
of atomic updates.
*FI_ORDER_ATOMIC_WAR*
: Atomic write after read. If set, atomic update operations are
transmitted in the order submitted relative to atomic fetch
operations. If not set, atomic updates may be transmitted
ahead of atomic fetches.
*FI_ORDER_ATOMIC_WAW*
: Atomic write after write. If set, atomic update operations are
transmitted in the order submitted relative to other atomic
update operations. If not set, atomic updates may be
transmitted out of order from their submission.
*FI_ORDER_NONE*
: No ordering is specified. This value may be used as input in order
to obtain the default message order supported by the provider. FI_ORDER_NONE
is an alias for the value 0.
*FI_ORDER_RAR*
: Read after read. If set, RMA and atomic read operations are
transmitted in the order submitted relative to other
RMA and atomic read operations. If not set, RMA and atomic reads
may be transmitted out of order from their submission.
*FI_ORDER_RAS*
: Read after send. If set, RMA and atomic read operations are
transmitted in the order submitted relative to message send
operations, including tagged sends. If not set, RMA and atomic
reads may be transmitted ahead of sends.
*FI_ORDER_RAW*
: Read after write. If set, RMA and atomic read operations are
transmitted in the order submitted relative to RMA and atomic write
operations. If not set, RMA and atomic reads may be transmitted ahead
of RMA and atomic writes.
*FI_ORDER_RMA_RAR*
: RMA read after read. If set, RMA read operations are
transmitted in the order submitted relative to other
RMA read operations. If not set, RMA reads
may be transmitted out of order from their submission.
*FI_ORDER_RMA_RAW*
: RMA read after write. If set, RMA read operations are
transmitted in the order submitted relative to RMA write
operations. If not set, RMA reads may be transmitted ahead
of RMA writes.
*FI_ORDER_RMA_WAR*
: RMA write after read. If set, RMA write operations are
transmitted in the order submitted relative to RMA read
operations. If not set, RMA writes may be transmitted
ahead of RMA reads.
*FI_ORDER_RMA_WAW*
: RMA write after write. If set, RMA write operations are
transmitted in the order submitted relative to other RMA
write operations. If not set, RMA writes may be
transmitted out of order from their submission.
*FI_ORDER_SAR*
: Send after read. If set, message send operations, including tagged
sends, are transmitted in the order submitted relative to RMA and atomic
read operations. If not set, message sends may be transmitted ahead
of RMA and atomic reads.
*FI_ORDER_SAS*
: Send after send. If set, message send operations, including tagged
sends, are transmitted in the order submitted relative to other
message sends. If not set, message sends may be transmitted out of
order from their submission.
*FI_ORDER_SAW*
: Send after write. If set, message send operations, including tagged
sends, are transmitted in the order submitted relative to RMA and atomic
write operations. If not set, message sends may be transmitted ahead
of RMA and atomic writes.
*FI_ORDER_WAR*
: Write after read. If set, RMA and atomic write operations are
transmitted in the order submitted relative to RMA and atomic read
operations. If not set, RMA and atomic writes may be transmitted
ahead of RMA and atomic reads.
*FI_ORDER_WAS*
: Write after send. If set, RMA and atomic write operations are
transmitted in the order submitted relative to message send
operations, including tagged sends. If not set, RMA and atomic
writes may be transmitted ahead of sends.
*FI_ORDER_WAW*
: Write after write. If set, RMA and atomic write operations are
transmitted in the order submitted relative to other RMA and atomic
write operations. If not set, RMA and atomic writes may be
transmitted out of order from their submission.
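Since each ordering guarantee is an independent bit, checking whether a provider meets an application's ordering needs reduces to a subset test on the two bit masks. The sketch below is illustrative only: the `EX_ORDER_*` constants are placeholders with made-up values, and `order_satisfied` is a hypothetical helper. Real code would include the libfabric headers and use the FI_ORDER_* definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only (assumption). They are
 * NOT the values of the corresponding FI_ORDER_* flags. */
#define EX_ORDER_SAS (UINT64_C(1) << 0)
#define EX_ORDER_WAW (UINT64_C(1) << 1)
#define EX_ORDER_RAW (UINT64_C(1) << 2)

/* True when every ordering bit the application requires is also set in
 * the msg_order value reported by the provider. */
static bool order_satisfied(uint64_t provided, uint64_t required)
{
    return (provided & required) == required;
}
```

A required value of 0, which corresponds to FI_ORDER_NONE, is satisfied by any provider, which matches the idea that relaxed ordering places no constraint on the transport.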
## comp_order - Completion Ordering
Completion ordering refers to the order in which completed requests are
written into the completion queue. Completion ordering is similar to
message order. Relaxed completion order may enable faster reporting of
completed transfers, allow acknowledgments to be sent over different
fabric paths, and support more sophisticated retry mechanisms.
This can result in lower-latency completions, particularly when
using connectionless endpoints. Strict completion ordering may require
that providers queue completed operations or limit available optimizations.
For transmit requests, completion ordering depends on the endpoint
communication type. For unreliable communication, completion ordering
applies to all data transfer requests submitted to an endpoint.
For reliable communication, completion ordering only applies to requests
that target a single destination endpoint. Completion ordering of
requests that target different endpoints over a reliable transport
is not defined.
Applications should specify the completion ordering that they support
or require. Providers should return the completion order that they
actually provide, with the constraint that the returned ordering is
at least as strict as that specified by the application. Supported completion
order values are:
*FI_ORDER_NONE*
: No ordering is defined for completed operations. Requests submitted
to the transmit context may complete in any order.
*FI_ORDER_STRICT*
: Requests complete in the order in which they are submitted to the
transmit context.
## inject_size
The requested inject operation size (see the FI_INJECT flag) that
the context will support. This is the maximum size data transfer that
can be associated with an inject operation (such as fi_inject) or may
be used with the FI_INJECT data transfer flag.
## size
The size of the transmit context. The mapping of the size value to resources
is provider specific, but it is directly related to the number of command
entries allocated for the endpoint. A smaller size value consumes fewer
hardware and software resources, while a larger size allows queuing more
transmit requests.
While the size attribute guides the size of underlying endpoint transmit
queue, there is not necessarily a one-to-one mapping between a transmit
operation and a queue entry. A single transmit operation may consume
multiple queue entries; for example, one per scatter-gather entry.
Additionally, the size field is intended to guide the allocation of the
endpoint's transmit context. Specifically, for connectionless endpoints,
there may be lower-level queues used to track communication on a per peer basis.
The sizes of any lower-level queues may be significantly smaller than
the endpoint's transmit size, in order to reduce resource utilization.
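Because one operation may consume several queue entries, an application that wants a rough lower bound on how many operations fit in a context can divide the context size by the number of scatter-gather entries it uses per operation. This is only a sketch under the assumption that each scatter-gather entry consumes exactly one queue entry. The actual mapping is provider specific, and `ex_min_postable_ops` is a hypothetical helper, not a libfabric call.

```c
#include <stddef.h>

/* Conservative estimate of how many operations fit in a context of the
 * given size, assuming one queue entry per scatter-gather element. */
static size_t ex_min_postable_ops(size_t ctx_size, size_t iov_count)
{
    if (iov_count == 0)
        iov_count = 1;  /* treat an empty iov as using one entry */

    return ctx_size / iov_count;
}
```

An application should still be prepared for a data transfer call to return -FI_EAGAIN, since the estimate does not account for provider specific overheads or per peer queues.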
## iov_limit
This is the maximum number of IO vectors (scatter-gather elements)
that a single posted operation may reference.
## rma_iov_limit
This is the maximum number of RMA IO vectors (scatter-gather elements)
that an RMA or atomic operation may reference. The rma_iov_limit
corresponds to the rma_iov_count values in RMA and atomic operations.
See struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and
fi_atomic.3, for additional details. This limit applies to both the
number of RMA IO vectors that may be specified when initiating an
operation from the local endpoint, as well as the maximum number of
IO vectors that may be carried in a single request from a remote endpoint.
## Traffic Class (tclass)
Traffic classes can be a differentiated services
code point (DSCP) value, one of the following defined labels, or a
provider-specific definition. If tclass is unset or set to FI_TC_UNSPEC,
the endpoint will use the default traffic class associated with the
domain.
*FI_TC_BEST_EFFORT*
: This is the default in the absence of any other local or fabric configuration.
This class carries the traffic for a number of applications executing
concurrently over the same network infrastructure. Even though it is shared,
network capacity and resource allocation are distributed fairly across the
applications.
*FI_TC_BULK_DATA*
: This class is intended for large data transfers associated with I/O and
is present to separate sustained I/O transfers from other application
inter-process communications.
*FI_TC_DEDICATED_ACCESS*
: This class operates at the highest priority, except the management class.
It carries a high bandwidth allocation, minimum latency targets, and the
highest scheduling and arbitration priority.
*FI_TC_LOW_LATENCY*
: This class supports low latency, low jitter data patterns typically caused by
transactional data exchanges, barrier synchronizations, and collective
operations that are typical of HPC applications. This class often requires
maximum tolerable latencies that data transfers must achieve for correct or
performant operation. Fulfillment of such requests in this class will
typically require accompanying bandwidth and message size limitations so
as not to consume excessive bandwidth at high priority.
*FI_TC_NETWORK_CTRL*
: This class is intended for traffic directly related to fabric (network)
management, which is critical to the correct operation of the network.
Its use is typically restricted to privileged network management applications.
*FI_TC_SCAVENGER*
: This class is used for data that is desired but does not have strict delivery
requirements, such as in-band network or application level monitoring data.
Use of this class indicates that the traffic is considered lower priority
and should not interfere with higher priority workflows.
*fi_tc_dscp_set / fi_tc_dscp_get*
: DSCP values are supported via the DSCP get and set functions. The
definitions for DSCP values are outside the scope of libfabric. See
the fi_tc_dscp_set and fi_tc_dscp_get function definitions for details
on their use.
# RECEIVE CONTEXT ATTRIBUTES
Attributes specific to the receive capabilities of an endpoint are
specified using struct fi_rx_attr.
{% highlight c %}
struct fi_rx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t total_buffered_recv;
size_t size;
size_t iov_limit;
};
{% endhighlight %}
## caps - Capabilities
The requested capabilities of the context. The capabilities must be
a subset of those requested of the associated endpoint. See the
CAPABILITIES section of fi_getinfo(3) for capability details. If
the caps field is 0 on input to fi_getinfo(3), the applicable
capability bits from the fi_info structure will be used.
The following capabilities apply to the receive attributes: FI_MSG,
FI_RMA, FI_TAGGED, FI_ATOMIC, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_RECV,
FI_HMEM, FI_TRIGGER, FI_RMA_PMEM, FI_DIRECTED_RECV, FI_VARIABLE_MSG,
FI_MULTI_RECV, FI_SOURCE, FI_RMA_EVENT, FI_SOURCE_ERR, FI_COLLECTIVE,
and FI_XPU.
Many applications will be able to ignore this field and rely solely
on the fi_info::caps field. Use of this field provides fine grained
control over the receive capabilities associated with an endpoint.
It is useful when handling scalable endpoints, with multiple receive
contexts, for example, and allows configuring a specific receive
context with fewer capabilities than that supported by the endpoint
or other receive contexts.
## mode
The operational mode bits of the context. The mode bits will be a
subset of those associated with the endpoint. See the MODE section
of fi_getinfo(3) for details. A mode value of 0 will be ignored on
input to fi_getinfo(3), with the mode value of the fi_info structure
used instead. On return from fi_getinfo(3), the mode will be set
only to those constraints specific to receive operations.
## op_flags - Default receive operation flags
Flags that control the behavior of operations submitted against the
context. Applicable flags are listed in the Operation Flags
section.
## msg_order - Message Ordering
For a description of message ordering, see the msg_order field in
the _Transmit Context Attribute_ section. Receive context message
ordering defines the order in which received transport message headers
are processed when received by an endpoint. When ordering is set, it
indicates that message headers will be processed in order, based on
how the transmit side has identified the messages. Typically, this means
that messages will be handled in order based on a message level sequence
number.
The following ordering flags, as defined for transmit ordering, also
apply to the processing of received operations: FI_ORDER_NONE,
FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW,
FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR,
FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR,
FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOMIC_WAW.
## comp_order - Completion Ordering
For a description of completion ordering, see the comp_order field in
the _Transmit Context Attribute_ section.
*FI_ORDER_DATA*
: When set, this bit indicates that received data is written into memory
in order. Data ordering applies to memory accessed as part of a single
operation and between operations if message ordering is guaranteed.
*FI_ORDER_NONE*
: No ordering is defined for completed operations. Receive operations may
complete in any order, regardless of their submission order.
*FI_ORDER_STRICT*
: Receive operations complete in the order in which they are processed by
the receive context, based on the receive side msg_order attribute.
## total_buffered_recv
This field is supported for backwards compatibility purposes.
It is a hint to the provider of the total available space
that may be needed to buffer messages that are received for which there
is no matching receive operation. The provider may adjust or ignore
this value. The allocation of internal network buffering among received
messages is provider specific. For instance, a provider may limit the size
of messages which can be buffered or the amount of buffering allocated to
a single message.
If receive side buffering is disabled (total_buffered_recv = 0)
and a message is received by an endpoint, then the behavior is dependent on
whether resource management has been enabled (FI_RM_ENABLED has been set or not).
See the Resource Management section of fi_domain.3 for further clarification.
It is recommended that applications enable resource management if they
anticipate receiving unexpected messages, rather than modifying this value.
## size
The size of the receive context. The mapping of the size value to resources
is provider specific, but it is directly related to the number of command
entries allocated for the endpoint. A smaller size value consumes fewer
hardware and software resources, while a larger size allows queuing more
receive requests.
While the size attribute guides the size of underlying endpoint receive
queue, there is not necessarily a one-to-one mapping between a receive
operation and a queue entry. A single receive operation may consume
multiple queue entries; for example, one per scatter-gather entry.
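The relationship between the size attribute and the number of receives that can be outstanding can be sketched as follows. This is only an illustrative model, assuming one queue entry per scatter-gather element; the function names are hypothetical and the actual mapping is provider specific.

```c
#include <stddef.h>

/* Hypothetical model: a receive consumes one queue entry per
 * scatter-gather element. Real providers may map entries differently. */
size_t ex_entries_for_recv(size_t iov_count)
{
    /* Assume a receive with no IOV elements still takes one entry. */
    return iov_count ? iov_count : 1;
}

/* Number of receives, each with iov_count elements, that fit into a
 * receive queue of `size` entries under the model above. */
size_t ex_max_posted_recvs(size_t size, size_t iov_count)
{
    return size / ex_entries_for_recv(iov_count);
}
```

Under this model, a receive context of size 64 could hold 64 single-buffer receives, but only 16 receives that each reference four scatter-gather elements.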
Additionally, the size field is intended to guide the allocation of the
endpoint's receive context. Specifically, for connectionless endpoints,
there may be lower-level queues used to track communication on a per peer basis.
The sizes of any lower-level queues may be significantly smaller than
the endpoint's receive size, in order to reduce resource utilization.
## iov_limit
This is the maximum number of IO vectors (scatter-gather elements)
that a single posted operation may reference.
# SCALABLE ENDPOINTS
A scalable endpoint is a communication portal that supports multiple
transmit and receive contexts. Scalable endpoints are loosely modeled
after the networking concept of transmit/receive side scaling, also
known as multi-queue. Support for scalable endpoints is domain
specific. Scalable endpoints may improve the performance of
multi-threaded and parallel applications, by allowing threads to
access independent transmit and receive queues. A scalable endpoint
has a single transport level address, which can reduce the memory
requirements needed to store remote addressing data, versus using
standard endpoints. Scalable endpoints cannot be used directly for
communication operations, and require the application to explicitly
create transmit and receive contexts as described below.
## fi_tx_context
Transmit contexts are independent transmit queues. Ordering and
synchronization between contexts are not defined. Conceptually, a
transmit context behaves similarly to a send-only endpoint. A transmit
context may be configured with fewer capabilities than the base
endpoint and with different attributes (such as ordering requirements
and inject size) than other contexts associated with the same scalable
endpoint. Each transmit context has its own completion queue. The
number of transmit contexts associated with an endpoint is specified
during endpoint creation.
The fi_tx_context call is used to retrieve a specific context,
identified by an index (see above for details on transmit context
attributes). Providers may dynamically allocate contexts when
fi_tx_context is called, or may statically create all contexts when
fi_endpoint is invoked. By default, a transmit context inherits the
properties of its associated endpoint. However, applications may
request context specific attributes through the attr parameter.
Support for per transmit context attributes is provider specific and
not guaranteed. Providers will return the actual attributes assigned
to the context through the attr parameter, if provided.
## fi_rx_context
Receive contexts are independent receive queues for receiving incoming
data. Ordering and synchronization between contexts are not
guaranteed. Conceptually, a receive context behaves similarly to a
receive-only endpoint. A receive context may be configured with fewer
capabilities than the base endpoint and with different attributes
(such as ordering requirements and inject size) than other contexts
associated with the same scalable endpoint. Each receive context has
its own completion queue. The number of receive contexts associated
with an endpoint is specified during endpoint creation.
Receive contexts are often associated with steering flows, that
specify which incoming packets targeting a scalable endpoint to
process. However, receive contexts may be targeted directly by the
initiator, if supported by the underlying protocol. Such contexts are
referred to as 'named'. Support for named contexts must be indicated
by setting the FI_NAMED_RX_CTX capability when the corresponding
endpoint is created. Support for named receive contexts is
coordinated with address vectors. See fi_av(3) and fi_rx_addr(3).
The fi_rx_context call is used to retrieve a specific context,
identified by an index (see above for details on receive context
attributes). Providers may dynamically allocate contexts when
fi_rx_context is called, or may statically create all contexts when
fi_endpoint is invoked. By default, a receive context inherits the
properties of its associated endpoint. However, applications may
request context specific attributes through the attr parameter.
Support for per receive context attributes is provider specific and
not guaranteed. Providers will return the actual attributes assigned
to the context through the attr parameter, if provided.
# SHARED CONTEXTS
Shared contexts are transmit and receive contexts explicitly shared
among one or more endpoints. A shareable context allows an application
to use a single dedicated provider resource among multiple transport
addressable endpoints. This can greatly reduce the resources needed
to manage communication over multiple endpoints by multiplexing
transmit and/or receive processing, with the potential cost of
serializing access across multiple endpoints. Support for shareable
contexts is domain specific.
Conceptually, shareable transmit contexts are transmit queues that may be
accessed by many endpoints. The use of a shared transmit context is
mostly opaque to an application. Applications must allocate and bind
shared transmit contexts to endpoints, but operations are posted
directly to the endpoint. Shared transmit contexts are not associated
with completion queues or counters. Completed operations are posted
to the CQs bound to the endpoint. An endpoint may only
be associated with a single shared transmit context.
Unlike shared transmit contexts, applications interact directly with
shared receive contexts. Users post receive buffers directly to a
shared receive context, with the buffers usable by any endpoint bound
to the shared receive context. Shared receive contexts are not
associated with completion queues or counters. Completed receive
operations are posted to the CQs bound to the endpoint. An endpoint
may only be associated with a single shared receive context, and all
connectionless endpoints associated with a shared receive context must
also share the same address vector.
Endpoints associated with a shared transmit context may use dedicated
receive contexts, and vice-versa. Alternatively, an endpoint may use shared
transmit and receive contexts. There is no requirement that the
same group of endpoints sharing a context of one type also share the
context of an alternate type. Furthermore, an endpoint may use a
shared context of one type, but a scalable set of contexts of the
alternate type.
## fi_stx_context
This call is used to open a shareable transmit context (see above for
details on the transmit context attributes). Endpoints associated
with a shared transmit context must use a subset of the transmit
context's attributes. Note that this is the reverse of the
requirement for transmit contexts for scalable endpoints.
## fi_srx_context
This allocates a shareable receive context (see above for details on
the receive context attributes). Endpoints associated with a shared
receive context must use a subset of the receive context's attributes.
Note that this is the reverse of the requirement for receive contexts
for scalable endpoints.
# SOCKET ENDPOINTS
The following feature and description should be considered experimental.
Until the experimental tag is removed, the interfaces, semantics, and data
structures associated with socket endpoints may change between library
versions.
This section applies to endpoints of type FI_EP_SOCK_STREAM and
FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
Socket endpoints are defined with semantics that allow them to more
easily be adopted by developers familiar with the UNIX socket API, or
by middleware that exposes the socket API, while still taking advantage
of high-performance hardware features.
The key difference between socket endpoints and other active endpoints
is that socket endpoints use synchronous data transfers. Buffers passed
into send and receive operations revert to the control of the application
upon returning from the function call. As a result, no data transfer
completions are reported to the application, and socket endpoints are not
associated with completion queues or counters.
Socket endpoints support a subset of message operations: fi_send,
fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
Because data transfers are synchronous, the return value from send and receive
operations indicates the number of bytes transferred on success, or a negative
value on error, including -FI_EAGAIN if the endpoint cannot send or
receive any data because of full or empty queues, respectively.
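The return value convention described above can be handled with a loop like the following. This is a minimal sketch, not part of the library: ex_send_fn is a hypothetical stand-in for a synchronous send call such as fi_send on a socket endpoint, ex_mock_send is a toy endpoint used only to exercise the loop, and the code assumes that -FI_EAGAIN has the same value as -EAGAIN.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a synchronous send: returns the number of
 * bytes transferred, or a negative errno-style value on error. */
typedef ssize_t (*ex_send_fn)(void *ep, const void *buf, size_t len);

/* Send up to len bytes. Returns the bytes sent so far if the endpoint
 * reports -EAGAIN after partial progress, -EAGAIN if nothing could be
 * sent, or another negative value on a hard error. */
ssize_t ex_send_some(ex_send_fn send_fn, void *ep, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t ret = send_fn(ep, p + done, len - done);
        if (ret == -EAGAIN)
            return done ? (ssize_t)done : -EAGAIN; /* queue is full */
        if (ret < 0)
            return ret;                            /* hard error */
        if (ret == 0)
            break;                                 /* no progress, stop */
        done += (size_t)ret;
    }
    return (ssize_t)done;
}

/* Toy endpoint: accepts at most `chunk` bytes per call until `room`
 * bytes of queue space have been used up. */
struct ex_mock_ep { size_t room; size_t chunk; };

ssize_t ex_mock_send(void *ep, const void *buf, size_t len)
{
    struct ex_mock_ep *m = ep;
    size_t n = len < m->chunk ? len : m->chunk;

    (void)buf;
    if (m->room == 0)
        return -EAGAIN;
    if (n > m->room)
        n = m->room;
    m->room -= n;
    return (ssize_t)n;
}
```

When the loop stops on -EAGAIN, an application would typically wait on the endpoint's file descriptor (for example with poll) before retrying with the remaining data.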
Socket endpoints are associated with event queues and address vectors, and
process connection management events asynchronously, similar to other endpoints.
Unlike UNIX sockets, socket endpoints must still be declared as either active
or passive.
Socket endpoints behave like non-blocking sockets. In order to support
select and poll semantics, active socket endpoints are associated with a
file descriptor that is signaled whenever the endpoint is ready to send
and/or receive data. The file descriptor may be retrieved using fi_control.
# OPERATION FLAGS
Operation flags are obtained by OR-ing the following flags together.
Operation flags define the default flags applied to an endpoint's data
transfer operations, where a flags parameter is not available. Data
transfer operations that take flags as input override the op_flags
value of transmit or receive context attributes of an endpoint.
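The override rule can be modeled as shown below. This is a sketch only: the EX_ constants are placeholder bit values chosen for illustration (real code should use the FI_ flag definitions from the library headers), and ex_effective_flags is a hypothetical helper, not a library call.

```c
#include <stdint.h>

/* Placeholder bit values for illustration only. */
#define EX_COMPLETION        (1ULL << 0)
#define EX_INJECT            (1ULL << 1)
#define EX_DELIVERY_COMPLETE (1ULL << 2)

/* Flags that apply to a single data transfer call. Calls that accept a
 * flags argument (has_flags_param != 0) use that argument in place of
 * the op_flags stored in the transmit or receive context attributes. */
uint64_t ex_effective_flags(uint64_t op_flags, int has_flags_param,
                            uint64_t call_flags)
{
    return has_flags_param ? call_flags : op_flags;
}
```

Note that under this model a call that takes flags and is passed 0 does not pick up the context's op_flags; the defaults apply only to calls that have no flags parameter.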
*FI_COMMIT_COMPLETE*
: Indicates that a completion should not be generated (locally or at the
peer) until the result of an operation has been made persistent.
See [`fi_cq`(3)](fi_cq.3.html) for additional details on completion
semantics.
*FI_COMPLETION*
: Indicates that a completion queue entry should be written for data
transfer operations. This flag only applies to operations issued on an
endpoint that was bound to a completion queue with the
FI_SELECTIVE_COMPLETION flag set; otherwise, it is ignored. See the
fi_ep_bind section above for more detail.
*FI_DELIVERY_COMPLETE*
: Indicates that a completion should be generated when the operation has been
processed by the destination endpoint(s). See [`fi_cq`(3)](fi_cq.3.html)
for additional details on completion semantics.
*FI_INJECT*
: Indicates that all outbound data buffers should be returned to the
user's control immediately after a data transfer call returns, even
if the operation is handled asynchronously. This may require that
the provider copy the data into a local buffer and transfer out of
that buffer. A provider can limit the total amount of send data
that may be buffered and/or the size of a single send that can use
this flag. This limit is indicated using inject_size (see inject_size
above).
*FI_INJECT_COMPLETE*
: Indicates that a completion should be generated when the
source buffer(s) may be reused. See [`fi_cq`(3)](fi_cq.3.html) for
additional details on completion semantics.
*FI_MULTICAST*
: Indicates that data transfers will target multicast addresses by default.
Any fi_addr_t passed into a data transfer operation will be treated as a
multicast address.
*FI_MULTI_RECV*
: Applies to posted receive operations. This flag allows the user to
post a single buffer that will receive multiple incoming messages.
Received messages will be packed into the receive buffer until the
buffer has been consumed. Use of this flag may cause a single
posted receive operation to generate multiple completions as
messages are placed into the buffer. The placement of received data
into the buffer may be subject to provider specific alignment
restrictions. The buffer will be released by the provider when the
available buffer space falls below the specified minimum (see
FI_OPT_MIN_MULTI_RECV).
*FI_TRANSMIT_COMPLETE*
: Indicates that a completion should be generated when the transmit
operation has completed relative to the local provider. See
[`fi_cq`(3)](fi_cq.3.html) for additional details on completion semantics.
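As an illustration of the FI_MULTI_RECV behavior described above, the following sketch models how a single posted buffer fills up and is released. The structure and function are hypothetical, min_free plays the role of the FI_OPT_MIN_MULTI_RECV threshold, and the model ignores any provider specific alignment of received data.

```c
#include <stddef.h>

/* Hypothetical model of one buffer posted with FI_MULTI_RECV. */
struct ex_multi_recv {
    size_t capacity; /* total size of the posted buffer */
    size_t used;     /* bytes consumed by received messages */
    size_t min_free; /* release threshold (role of FI_OPT_MIN_MULTI_RECV) */
    int released;    /* nonzero once the buffer has been released */
};

/* Place one message of msg_len bytes into the buffer. Returns 1 if the
 * message was placed (one completion would be generated), or 0 if the
 * buffer was already released or the message does not fit. */
int ex_multi_recv_place(struct ex_multi_recv *b, size_t msg_len)
{
    if (b->released || msg_len > b->capacity - b->used)
        return 0;
    b->used += msg_len;
    /* Release the buffer once free space drops below the threshold. */
    if (b->capacity - b->used < b->min_free)
        b->released = 1;
    return 1;
}
```

For a 1024-byte buffer with a 256-byte threshold, two 400-byte messages are placed and generate two completions; the second leaves 224 bytes free, so the buffer is released and later messages need a newly posted buffer.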
# NOTES
Users should call fi_close to release all resources allocated to the
fabric endpoint.
Endpoints allocated with the FI_CONTEXT or FI_CONTEXT2 mode bits set must
typically provide struct fi_context(2) as their per operation context parameter.
(See fi_getinfo.3 for details.) However, when FI_SELECTIVE_COMPLETION is
enabled to suppress CQ completion entries, and an operation is initiated
without the FI_COMPLETION flag set, then the context parameter is ignored.
An application does not need to pass in a valid struct fi_context(2) into
such data transfers.
Operations that complete in error that are not associated with valid
operational context will use the endpoint context in any error
reporting structures.
Although applications typically associate individual completions with
either completion queues or counters, an endpoint can be attached to
both a counter and completion queue. When combined with using
selective completions, this allows an application to use counters to
track successful completions, with a CQ used to report errors.
Operations that complete with an error increment the error counter
and generate a CQ completion event.
As mentioned in fi_getinfo(3), the ep_attr structure can be used to
query providers that support various endpoint attributes. fi_getinfo
can return provider info structures that can support the minimal set
of requirements (such that the application maintains correctness).
However, it can also return provider info structures that exceed
application requirements. As an example, consider an application
requesting msg_order as FI_ORDER_NONE. The resulting output from
fi_getinfo may have all the ordering bits set. The application can reset
the ordering bits it does not require before creating the endpoint.
The provider is free to implement a stricter ordering than is
required by the application.
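Resetting the unneeded ordering bits amounts to masking the offered value with the bits the application requires. The sketch below uses placeholder EX_ bit values and a hypothetical helper, and assumes that the "no ordering" value is the empty set of bits; real code would operate on the msg_order fields returned by fi_getinfo using the FI_ORDER_ definitions.

```c
#include <stdint.h>

/* Placeholder ordering bits for illustration only. */
#define EX_ORDER_NONE 0ULL
#define EX_ORDER_RAW  (1ULL << 0)
#define EX_ORDER_WAR  (1ULL << 1)
#define EX_ORDER_WAW  (1ULL << 2)
#define EX_ORDER_SAS  (1ULL << 3)

/* Keep only the ordering bits the application actually requires from
 * the set offered by the provider. */
uint64_t ex_trim_order(uint64_t offered, uint64_t required)
{
    return offered & required;
}
```

An application that requested no ordering would pass EX_ORDER_NONE as required and end up with no ordering bits set, even if the provider offered all of them.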
# RETURN VALUES
Returns 0 on success. On error, a negative value corresponding to
fabric errno is returned. For fi_cancel, a return value of 0
indicates that the cancel request was submitted for processing.
For fi_setopt/fi_getopt, a return value of -FI_ENOPROTOOPT
indicates the provider does not support the requested option.
Fabric errno values are defined in `rdma/fi_errno.h`.
# ERRORS
*-FI_EDOMAIN*
: A resource domain was not bound to the endpoint or an attempt was
made to bind multiple domains.
*-FI_ENOCQ*
: The endpoint has not been configured with the necessary completion queue.
*-FI_EOPBADSTATE*
: The endpoint's state does not permit the requested operation.
# SEE ALSO
[`fi_getinfo`(3)](fi_getinfo.3.html),
[`fi_domain`(3)](fi_domain.3.html),
[`fi_cq`(3)](fi_cq.3.html),
[`fi_msg`(3)](fi_msg.3.html),
[`fi_tagged`(3)](fi_tagged.3.html),
[`fi_rma`(3)](fi_rma.3.html),
[`fi_peer`(3)](fi_peer.3.html)