Open Fabrics Enterprise Distribution (OFED)
IPoIB in OFED 1.4.1 Release Notes
May 2009
===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New Features
3. Known Issues
4. IPoIB Configuration Based on DHCP
5. The ib-bonding driver
6. Bug Fixes and Enhancements Since OFED 1.3
7. Bug Fixes and Enhancements Since OFED 1.3.1
8. Bug Fixes and Enhancements Since OFED 1.4
9. Performance tuning
===============================================================================
1. Overview
===============================================================================
IPoIB is a network driver implementation that enables transmitting IP and ARP
protocol packets over an InfiniBand UD channel. The implementation conforms to
the relevant IETF working group's RFCs (http://www.ietf.org).
===============================================================================
2. New Features
===============================================================================
1. This version of OFED improves IPoIB by cutting the CPU overhead of handling
received packets. This improves operation in datagram mode:
Large Receive Offload (LRO) - aggregating multiple incoming packets from a
single stream into a larger buffer before they are passed higher up the
networking stack, thus reducing the number of packets that have to be
processed.
This feature is enabled on HCAs that can support LRO, e.g. ConnectX.
2. Datagram mode: LSO (Large Send Offload) allows the networking stack to pass
SKBs with data size larger than the MTU to the IPoIB driver and have the HCA
HW fragment the data into multiple MSS-sized packets. This release adds a
device capability flag, IB_DEVICE_UD_TSO, for devices that can perform TCP
segmentation offload; a new send work request opcode, IB_WR_LSO; header, hlen
and mss fields for the work request structure; and a new IB_WC_LSO completion
type.
This feature is enabled on HCAs that can support LSO, e.g. ConnectX.
Usage and configuration:
========================
1. To check the current mode used for outgoing connections, enter:
cat /sys/class/net/ib0/mode
2. To disable IPoIB CM at compile time, enter:
cd OFED-1.4
export OFA_KERNEL_PARAMS="--without-ipoib-cm"
./install.pl
3. To change the run-time configuration for IPoIB, edit
/etc/infiniband/openib.conf and change the following parameters:
# Enable IPoIB Connected Mode
SET_IPOIB_CM=yes
# Set IPoIB MTU
IPOIB_MTU=65520
4. You can also change the mode and MTU for a specific interface manually.
To enable connected mode for interface ib0, enter:
echo connected > /sys/class/net/ib0/mode
To increase MTU, enter:
ifconfig ib0 mtu 65520
5. Switching between CM and UD mode can be done at run time:
echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD
echo connected > /sys/class/net/ib0/mode sets the mode of ib0 to CM
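The mode commands above can be wrapped in a small shell helper. This is a
minimal sketch, not part of OFED: the function names and the SYSFS_NET variable
are hypothetical, and SYSFS_NET exists only so the helper can be tried against
a scratch directory instead of the live /sys tree.

```shell
# Hypothetical helpers (not shipped with OFED).
# SYSFS_NET defaults to the sysfs location used in the commands above.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the current IPoIB mode of interface $1.
ipoib_get_mode() {
    cat "$SYSFS_NET/$1/mode"
}

# Set interface $1 to mode $2, accepting only the two values used above.
ipoib_set_mode() {
    case "$2" in
        datagram|connected)
            echo "$2" > "$SYSFS_NET/$1/mode"
            ;;
        *)
            echo "ipoib_set_mode: unknown mode '$2'" >&2
            return 1
            ;;
    esac
}
```

For example, "ipoib_set_mode ib0 connected && ifconfig ib0 mtu 65520" switches
ib0 to connected mode and raises the MTU only if the mode change succeeded.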
===============================================================================
3. Known Issues
===============================================================================
1. If a host has multiple interfaces and (a) each interface belongs to a
different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
they are connected to the same IB Switch, then the host violates the IP rule
requiring different broadcast domains. Consequently, the host may build an
incorrect ARP table.
The correct setting of a multi-homed IPoIB host is achieved by using a
different PKEY for each IP subnet. If a host has multiple interfaces on the
same IP subnet, then to prevent a peer from building an incorrect ARP entry
(neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
causes the network stack to send ARP replies only on the interface with the
IP address specified in the ARP request:
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
Or, globally,
sysctl -w net.ipv4.conf.all.arp_ignore=1
To learn more about the arp_ignore parameter, see Documentation/networking/ip-sysctl.txt.
Note that distributions have the means to make kernel parameters persistent.
2. There are IPoIB alias lines in modprobe.conf which prevent stopping/
unloading the stack (i.e., '/etc/init.d/openibd stop' will fail).
These alias lines cause the drivers to be loaded again by udev scripts.
Workaround: Set OFA_KERNEL_PARAMS="--without-modprobe" before running
install.pl, or remove the alias lines from modprobe.conf.
3. On SLES 10:
The ib1 interface uses the configuration script of ib0.
Workaround: Invoke ifup/ifdown using both the interface name and the
configuration script name (example: ifup ib1 ib1).
4. After a hotplug event, the IPoIB interface falls back to datagram mode, and
MTU is reduced to 2K.
Workaround: Re-enable connected mode and increase MTU manually:
echo connected > /sys/class/net/ib0/mode
ifconfig ib0 mtu 65520
5. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/
and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf
does not prevent the loading of IPoIB on boot.
6. If IPoIB connected mode is enabled, it uses a large MTU for connected mode
messages and a small MTU for datagram (in particular, multicast) messages,
and relies on path MTU discovery to adjust MTU appropriately. Packets sent
in the window before MTU discovery automatically reduces the MTU for a
specific destination will be dropped, producing the following message in the
system log:
"packet len <actual length> (> <max allowed length>) too long to send, dropping"
To warn about this, a message is produced in the system log each time MTU is
set to a value higher than 2K.
7. IPoIB IPv6 support is broken between systems with kernels < 2.6.12 and
systems with kernels >= 2.6.12. The reason is that kernel 2.6.12 puts the link
layer address at an offset of two bytes with respect to older kernels. This
causes the other host to misinterpret the hardware address, so path resolution
fails because it is based on wrong GIDs. As an example, RH 4.x and RH 5.x
cannot interoperate.
8. In connected mode, TCP latency for short messages is larger by approx. 1usec
(~5%) than in datagram mode. As a workaround, use datagram mode.
9. Single-socket TCP bandwidth for kernels < 2.6.18 is lower than with
newer kernels. We recommend kernels from 2.6.18 and up for
best IPoIB performance.
10. Connectivity issues are encountered when using IPv6 on ia64 systems.
11. The IPoIB module uses a Linux implementation for Large Receive Offload
(LRO) in kernel 2.6.24 and later. These kernels require installing the
"inet_lro" module.
12. ConnectX only: If you have a port configured as ETH, and are running IPoIB
in connected mode -- and then change the port type to IB, the IPoIB mode
changes to datagram mode.
13. When working with iSCSI, you must disable LRO (even if you are working in
connected mode). This is because a bug in older kernels causes a kernel
panic.
14. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test
reaches packet sizes of 8192 and larger, it always loses the first packet in
the sequence.
Workaround: Increase the number of pending skb's before a neighbor is
resolved (default is 3). This value can be changed with the sysctl parameter
net.ipv4.neigh.ib0.unres_qlen.
15. IPoIB multicast support is broken in RH4.x kernels. This is because
ndisc_mc_map() does not handle IPoIB hardware addresses.
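For known issue 1, the per-interface arp_ignore settings can be generated for
any number of interfaces. The sketch below only prints the settings and applies
nothing; the function name is hypothetical, and where a distribution keeps its
persistent kernel parameters is an assumption to check locally.

```shell
# Hypothetical helper: print one arp_ignore setting per interface given
# as an argument, in key=value form. Nothing is applied to the system.
arp_ignore_settings() {
    for ifc in "$@"; do
        printf 'net.ipv4.conf.%s.arp_ignore=1\n' "$ifc"
    done
}
```

For example, "arp_ignore_settings ib0 ib1" prints the two keys used in the
sysctl -w commands of known issue 1. Each printed line can be passed to
sysctl -w or added to the distribution's persistent sysctl configuration.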
===============================================================================
4. IPoIB Configuration Based on DHCP
===============================================================================
Setting an IPoIB interface configuration based on DHCP (v3.1.2 which is available
via www.isc.org) is performed similarly to the configuration of Ethernet
interfaces. In other words, you need to make sure that IPoIB configuration files
include the following line:
For RedHat:
BOOTPROTO=dhcp
For SLES:
BOOTPROTO=dhcp
Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be
installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine
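Whether a given ifcfg-ib<n> file requests DHCP can be checked with grep. This
is a minimal sketch: the function name is hypothetical, and the pattern assumes
the value may optionally be wrapped in single or double quotes.

```shell
# Hypothetical helper: succeed if file $1 contains a BOOTPROTO=dhcp line.
# Optional quotes around the value are tolerated (an assumption about
# how the file may have been written).
ifcfg_uses_dhcp() {
    grep -Eq "^BOOTPROTO=[\"']?dhcp[\"']?[[:space:]]*\$" "$1"
}
```

For example, on a RedHat machine:
ifcfg_uses_dhcp /etc/sysconfig/network-scripts/ifcfg-ib0 && echo "ib0 uses DHCP"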
Note: A patch for DHCP is required for supporting IPoIB. The patch file for
DHCP v3.1.2, dhcp.patch, is available under the docs/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an
IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages
convey a client identifier field used to identify the DHCP session. This client
identifier field can be used to associate an IP address with a client identifier
value, such that the DHCP server will grant the same IP address to any client
that conveys this client identifier.
Note: Refer to the DHCP documentation for more details on how to make this
association.
The length of the client identifier field is not fixed in the specification.
4.1 DHCP Server
In order for the DHCP server to provide configuration records for clients, an
appropriate configuration file needs to be created. By default, the DHCP server
looks for a configuration file called dhcpd.conf under /etc. You can either edit
this file or create a new one and provide its full path to the DHCP server using
the -cf flag. See a file example at docs/dhcpd.conf of this package.
The DHCP server must run on a machine which has loaded the IPoIB module.
To run the DHCP server from the command line, enter:
dhcpd <IB network interface name> -d
Example:
host1# dhcpd ib0 -d
4.2 DHCP Client (Optional)
Note: A DHCP client can be used if you need to prepare a diskless machine with
an IB driver.
In order to use a DHCP client identifier, you need to first create a
configuration file that defines the DHCP client identifier. Then run the DHCP
client with this file using the following command:
dhclient -cf <client conf file> <IB network interface name>
Example of a configuration file for the ConnectX (PCI Device ID 25418), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 00:02:c9:03:00:00:10:39;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
}
In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1
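The dhclient.conf examples above share one shape, so the file can be generated
from an interface name and a client identifier. This is a minimal sketch,
assuming the identifier is supplied as colon-separated hexadecimal bytes as in
the examples; the function name is hypothetical.

```shell
# Hypothetical helper: print a dhclient.conf stanza for interface $1
# with client identifier $2 (colon-separated hex bytes).
make_dhclient_conf() {
    # Reject anything that is not colon-separated two-digit hex bytes.
    if ! printf '%s\n' "$2" | grep -Eq '^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2})*$'; then
        echo "make_dhclient_conf: bad client identifier '$2'" >&2
        return 1
    fi
    printf 'interface "%s" {\n' "$1"
    printf 'send dhcp-client-identifier %s;\n' "$2"
    printf '}\n'
}
```

For example, "make_dhclient_conf ib1 00:02:c9:03:00:00:10:39 > dhclient.conf"
writes the ConnectX stanza shown above, without the comment line.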
===============================================================================
5. The ib-bonding driver
===============================================================================
The ib-bonding driver is a High Availability solution for IPoIB interfaces.
It is based on the Linux Ethernet Bonding Driver and was adapted to work with
IPoIB. The ib-bonding package contains a bonding driver and a utility called
ib-bond to manage and control the driver operation.
The ib-bonding driver comes with the ib-bonding package (run rpm -qi ib-bonding
to get the package information).
Using the ib-bonding driver
---------------------------
The ib-bonding driver can be loaded manually or automatically.
1. Manual operation:
Use the utility ib-bond to start, query, or stop the driver. For details on this
utility, read the documentation for the ib-bonding package.
2. Automatic operation:
Use standard OS tools (sysconfig in SuSE and initscripts in RedHat)
to create a configuration that will come up with network restart. For details
on this, read the documentation for the ib-bonding package.
Notes:
* Using /etc/infiniband/openib.conf to create a persistent configuration is
no longer supported.
* On RHEL4_U7, a slave interface cannot be set as primary.
===============================================================================
6. Bug Fixes and Enhancements Since OFED 1.3
===============================================================================
- There is no default configuration for IPoIB interfaces: One should manually
specify the full IP configuration or use the ofed_net.conf file. See
OFED_Installation_Guide.txt for details on ipoib configuration.
- Don't drop multicast sends when they can be queued
- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small
SKBs (bug 989)
- IPoIB failed on stress testing (bug 1004)
- Kernel Oops during "port up/down test" (bug 1040)
- Restarting the stack while iperf 2.0.4 is running on the client side causes
a kernel panic (bug 985)
- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
- Set max CM MTU when moving to CM mode, instead of setting it in openibd script
- Fix CQ size calculations for ipoib
- Bonding: Enable build for SLES10 SP2
- Bonding: Fix issue in using the bonding module for Ethernet slaves (see
documentation for details)
===============================================================================
7. Bug Fixes and Enhancements Since OFED 1.3.1
===============================================================================
- IPoIB: Refresh paths instead of flushing them on SM change events to improve
failover response
- IPoIB: Fix loss of connectivity after bonding failover on both sides
- Bonding: Fix link state detection under RHEL4
- Bonding: Avoid annoying messages from initscripts when starting bond
- Bonding: Set default number of grat. ARP after failover to three (was one)
===============================================================================
8. Bug Fixes and Enhancements Since OFED 1.4
===============================================================================
- Performance tuning is enabled by default for IPoIB CM.
- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails
- Disable NAPI while CQ is being drained (bugzilla #1587)
- rdma_cm: Use rate from IPoIB broadcast when joining IPoIB multicast.
When joining an IPoIB multicast group, use the same rate as in the broadcast
group. Otherwise, if rdma_cm creates this group before IPoIB does, it might get
a different rate. This will cause IPoIB to fail to join the same group later
on, because IPoIB has a strict rate selection.
- Fix unprotected use of priv->broadcast in ipoib_mcast_join_task.
- Do not join broadcast group if interface is brought down
===============================================================================
9. Performance tuning
===============================================================================
When IPoIB is configured to run in connected mode, TCP parameter tuning is
performed at driver startup to improve the throughput of medium and large
messages.
The driver startup scripts set the following parameters:
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.core.netdev_max_backlog=250000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=16777216
net.ipv4.tcp_mem="16777216 16777216 16777216"
net.ipv4.tcp_rmem="4096 87380 16777216"
net.ipv4.tcp_wmem="4096 65536 16777216"
This tuning is effective only for connected mode. If you run in datagram mode,
it actually reduces performance.
If you change the IPoIB run mode to "datagram" while the driver is running,
the tuned parameters do not get reset to their default values. We therefore
recommend that you change the IPoIB mode only while the driver is down
(by setting line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in file
/etc/infiniband/openib.conf, and then restarting the driver).
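Because the tuned parameters are not reset automatically, one option is to
record their values before the driver is started so that they can be restored
later. The sketch below is an illustration under stated assumptions, not part
of the OFED scripts: the function name, the variable name, and the output file
are hypothetical.

```shell
# Parameters set by the driver startup scripts, as listed above.
IPOIB_TUNED_KEYS="net.ipv4.tcp_timestamps net.ipv4.tcp_sack
net.core.netdev_max_backlog net.core.rmem_max net.core.wmem_max
net.core.rmem_default net.core.wmem_default net.core.optmem_max
net.ipv4.tcp_mem net.ipv4.tcp_rmem net.ipv4.tcp_wmem"

# Hypothetical helper: write the current value of each parameter to
# file $1, one key=value line per parameter.
save_tuned_params() {
    : > "$1"
    for key in $IPOIB_TUNED_KEYS; do
        printf '%s=%s\n' "$key" "$(sysctl -n "$key")" >> "$1"
    done
}
```

Run save_tuned_params with a file of your choice while the driver is down. The
saved file can later be loaded with "sysctl -p <file>" to restore the recorded
values.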