File: ipoib_release_notes.txt

	     Open Fabrics Enterprise Distribution (OFED)
		    IPoIB in OFED 1.4.1 Release Notes
			  
			   May 2009


===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New Features
3. Known Issues
4. IPoIB Configuration Based on DHCP
5. The ib-bonding driver
6. Bug Fixes and Enhancements Since OFED 1.3
7. Bug Fixes and Enhancements Since OFED 1.3.1
8. Bug Fixes and Enhancements Since OFED 1.4
9. Performance tuning

===============================================================================
1. Overview
===============================================================================
IPoIB is a network driver implementation that enables transmitting IP and ARP
protocol packets over an InfiniBand UD channel. The implementation conforms to
the relevant IETF working group's RFCs (http://www.ietf.org).


===============================================================================
2. New Features
===============================================================================
1. This version of OFED reduces the CPU overhead of handling received packets
   in IPoIB, which improves operation in datagram mode:
   Large Receive Offload (LRO) aggregates multiple incoming packets from a
   single stream into a larger buffer before they are passed higher up the
   networking stack, thus reducing the number of packets that have to be
   processed.
   This feature is enabled on HCAs that can support LRO, e.g. ConnectX.
2. Datagram mode: Large Send Offload (LSO) allows the networking stack to pass
   SKBs with data size larger than the MTU to the IPoIB driver and have the HCA
   hardware fragment the data into multiple MSS-sized packets. To support this,
   the release adds a device capability flag IB_DEVICE_UD_TSO for devices that
   can perform TCP segmentation offload, a new send work request opcode
   IB_WR_LSO, header, hlen and mss fields in the work request structure, and a
   new IB_WC_LSO completion type.
   This feature is enabled on HCAs that can support LSO, e.g. ConnectX.


Usage and configuration:
========================
1. To check the current mode used for outgoing connections, enter:
   cat /sys/class/net/ib0/mode
2. To disable IPoIB CM at compile time, enter:
   cd OFED-1.4
   export OFA_KERNEL_PARAMS="--without-ipoib-cm"
   ./install.pl
3. To change the run-time configuration for IPoIB, edit
   /etc/infiniband/openib.conf and change the following parameters:
   # Enable IPoIB Connected Mode
   SET_IPOIB_CM=yes
   # Set IPoIB MTU
   IPOIB_MTU=65520

4. You can also change the mode and MTU for a specific interface manually.
   
   To enable connected mode for interface ib0, enter:
   echo connected > /sys/class/net/ib0/mode
   
   To increase MTU, enter:
   ifconfig ib0 mtu 65520

5. Switching between CM and UD mode can be done at run time.

   To set the mode of ib0 to UD, enter:
   echo datagram > /sys/class/net/ib0/mode

   To set the mode of ib0 to CM, enter:
   echo connected > /sys/class/net/ib0/mode
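
The mode switch described above can be wrapped in a small helper. The
following is a minimal sketch, not part of OFED: the function name
ipoib_set_mode and the SYSFS_NET variable are assumptions made for this
example (SYSFS_NET only exists so the function can be tried against a scratch
directory instead of the real /sys/class/net).

```shell
#!/bin/sh
# Sketch: switch an IPoIB interface between datagram and connected mode.
# SYSFS_NET is a hypothetical override; on a real system it is /sys/class/net.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

ipoib_set_mode() {
    iface="$1"
    mode="$2"
    # Only the two mode names used in this document are accepted.
    case "$mode" in
        datagram|connected) ;;
        *)
            echo "usage: ipoib_set_mode <iface> datagram|connected" >&2
            return 2
            ;;
    esac
    mode_file="$SYSFS_NET/$iface/mode"
    if [ ! -e "$mode_file" ]; then
        echo "$mode_file not found (is $iface an IPoIB interface?)" >&2
        return 1
    fi
    echo "$mode" > "$mode_file" || return 1
    # Read the mode back so the caller sees what is now in effect.
    cat "$mode_file"
}
```

Example (as root): ipoib_set_mode ib0 connected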


===============================================================================
3. Known Issues
===============================================================================
1. If a host has multiple interfaces and (a) each interface belongs to a
   different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
   they are connected to the same IB Switch, then the host violates the IP rule
   requiring different broadcast domains. Consequently, the host may build an
   incorrect ARP table.

   The correct setting of a multi-homed IPoIB host is achieved by using a
   different PKEY for each IP subnet. If a host has multiple interfaces on the
   same IP subnet, then to prevent a peer from building an incorrect ARP entry
   (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
   stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
   causes the network stack to send ARP replies only on the interface with the
   IP address specified in the ARP request:

   sysctl -w net.ipv4.conf.ib0.arp_ignore=1
   sysctl -w net.ipv4.conf.ib1.arp_ignore=1

   Or, globally,

   sysctl -w net.ipv4.conf.all.arp_ignore=1

   To learn more about the arp_ignore parameter, see Documentation/networking/ip-sysctl.txt.
   Note that distributions have the means to make kernel parameters persistent.

2. There are IPoIB alias lines in modprobe.conf which prevent stopping/
   unloading the stack (i.e., '/etc/init.d/openibd stop' will fail). 
   These alias lines cause the drivers to be loaded again by udev scripts.

   Workaround: Set OFA_KERNEL_PARAMS="--without-modprobe" before running
   install.pl, or remove the alias lines from modprobe.conf.
   
3. On SLES 10:
   The ib1 interface uses the configuration script of ib0.

   Workaround: Invoke ifup/ifdown using both the interface name and the
   configuration script name (example: ifup ib1 ib1).

4. After a hotplug event, the IPoIB interface falls back to datagram mode, and
   MTU is reduced to 2K.
   Workaround: Re-enable connected mode and increase MTU manually:
   echo connected > /sys/class/net/ib0/mode
   ifconfig ib0 mtu 65520

5. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
   standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/
   and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf
   does not prevent the loading of IPoIB on boot.

6. If IPoIB connected mode is enabled, it uses a large MTU for connected mode
   messages and a small MTU for datagram (in particular, multicast) messages,
   and relies on path MTU discovery to adjust MTU appropriately. Packets sent
   in the window before MTU discovery automatically reduces the MTU for a
   specific destination will be dropped, producing the following message in the
   system log:
   "packet len <actual length> (> <max allowed length>) too long to send, dropping"

   To warn about this, a message is produced in the system log each time MTU is
   set to a value higher than 2K.

7. IPoIB IPv6 support is broken between systems with kernels < 2.6.12 and
   systems with kernels >= 2.6.12. The reason is that kernel 2.6.12 puts the
   link layer address at an offset of two bytes with respect to older kernels.
   This causes the other host to misinterpret the hardware address, so path
   resolution fails because it is based on wrong GIDs. As an example, RH 4.x
   and RH 5.x cannot interoperate.

8. In connected mode, TCP latency for short messages is larger by approx. 1usec
   (~5%) than in datagram mode. As a workaround, use datagram mode.

9. Single-socket TCP bandwidth for kernels < 2.6.18 is lower than with
   newer kernels. We recommend kernels from 2.6.18 and up for
   best IPoIB performance.

10. Connectivity issues encountered when using IPv6 on ia64 systems.

11. The IPoIB module uses a Linux implementation for Large Receive Offload
    (LRO) in kernel 2.6.24 and later. These kernels require installing the
    "inet_lro" module.

12. ConnectX only: If you have a port configured as ETH, and are running IPoIB 
    in connected mode -- and then change the port type to IB, the IPoIB mode 
    changes to datagram mode.

13. When working with iSCSI, you must disable LRO (even if you are working in
    connected mode). This is because a bug in older kernels causes a kernel
    panic when LRO is enabled.

14. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test
    gets to packet size 8192 and larger, it always loses the first packet in
    the sequence.
    Workaround: Increase the number of pending skb's before a neighbor is
    resolved (default is 3). To display the current value, enter:
    sysctl net.ipv4.neigh.ib0.unres_qlen
    To change it, enter:
    sysctl -w net.ipv4.neigh.ib0.unres_qlen=<new value>

15. IPoIB multicast support is broken in RH4.x kernels. This is because
    ndisc_mc_map() does not handle IPoIB hardware addresses.
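
For known issue 1, the arp_ignore setting can be made persistent by placing
the same keys in /etc/sysctl.conf. The following is a minimal sketch that
prints the required lines for a list of interfaces. The function name
arp_ignore_lines is an assumption made for this example, and the exact
persistence mechanism may differ between distributions.

```shell
#!/bin/sh
# Sketch: print sysctl.conf lines that set arp_ignore on each named interface.
# Usage: arp_ignore_lines <value> <iface> [<iface> ...]
arp_ignore_lines() {
    value="$1"
    shift
    for iface in "$@"; do
        echo "net.ipv4.conf.$iface.arp_ignore = $value"
    done
}
```

Example (as root):
arp_ignore_lines 1 ib0 ib1 >> /etc/sysctl.conf && sysctl -p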

===============================================================================
4. IPoIB Configuration Based on DHCP
===============================================================================

Setting an IPoIB interface configuration based on DHCP (v3.1.2, which is
available via www.isc.org) is performed similarly to the configuration of
Ethernet interfaces. In other words, you need to make sure that the IPoIB
configuration files include the following line:
	For RedHat:
	BOOTPROTO=dhcp
	For SLES:
	BOOTPROTO=dhcp
Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be 
installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine

Note: A patch for DHCP is required for supporting IPoIB. The patch file for 
DHCP v3.1.2, dhcp.patch, is available under the docs/ directory.

Standard DHCP fields holding MAC addresses are not large enough to contain an 
IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages 
convey a client identifier field used to identify the DHCP session. This client
identifier field can be used to associate an IP address with a client identifier 
value, such that the DHCP server will grant the same IP address to any client 
that conveys this client identifier.

Note: Refer to the DHCP documentation for more details on how to make this
association.
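
As an illustration of such an association, a dhcpd.conf host declaration can
carry the client identifier together with a fixed address. The sketch below
prints one such declaration. The function name dhcpd_host_entry, the host
name and the IP address in the usage line are hypothetical; only the
"option dhcp-client-identifier" and "fixed-address" statements are standard
dhcpd.conf syntax.

```shell
#!/bin/sh
# Sketch: print a dhcpd.conf host declaration that ties a client identifier
# to a fixed IP address.
# Usage: dhcpd_host_entry <host name> <client identifier> <IP address>
dhcpd_host_entry() {
    name="$1"
    client_id="$2"
    ip="$3"
    cat <<EOF
host $name {
    option dhcp-client-identifier $client_id;
    fixed-address $ip;
}
EOF
}
```

Example:
dhcpd_host_entry ib-node1 00:02:c9:03:00:00:10:39 192.168.0.10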

The length of the client identifier field is not fixed in the specification. 

4.1 DHCP Server
In order for the DHCP server to provide configuration records for clients, an 
appropriate configuration file needs to be created. By default, the DHCP server 
looks for a configuration file called dhcpd.conf under /etc. You can either edit
this file or create a new one and provide its full path to the DHCP server using
the -cf flag. See a file example at docs/dhcpd.conf of this package.
The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:
dhcpd <IB network interface name> -d
Example:
host1# dhcpd ib0 -d

4.2 DHCP Client (Optional)

Note: A DHCP client can be used if you need to prepare a diskless machine with 
an IB driver. 
In order to use a DHCP client identifier, you need to first create a 
configuration file that defines the DHCP client identifier. Then run the DHCP 
client with this file using the following command:
dhclient -cf <client conf file> <IB network interface name>
Example of a configuration file for the ConnectX (PCI Device ID 25418), called 
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 00:02:c9:03:00:00:10:39;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
}

In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1
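
The 8-octet identifier in the ConnectX example has the form of a port GUID.
Assuming that is the intended value, it can be read from the interface itself:
an IPoIB hardware address is 20 octets (4 octets of flags and QP number
followed by a 16-octet GID), and the last 8 octets of the GID are the port
GUID. The following is a minimal sketch under that assumption; the function
name ipoib_port_guid is made up for this example.

```shell
#!/bin/sh
# Sketch: extract the last 8 octets (the port GUID) from a 20-octet IPoIB
# hardware address given as colon-separated hex octets.
ipoib_port_guid() {
    addr="$1"
    # Fail unless there are exactly 20 octets, then print octets 13 to 20.
    echo "$addr" | awk -F: '
        NF != 20 { exit 1 }
        {
            out = $13
            for (i = 14; i <= 20; i++) out = out ":" $i
            print out
        }'
}
```

Example: ipoib_port_guid "$(cat /sys/class/net/ib1/address)"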


===============================================================================
5. The ib-bonding driver
===============================================================================
The ib-bonding driver is a High Availability solution for IPoIB interfaces. 
It is based on the Linux Ethernet Bonding Driver and was adapted to work with
IPoIB. The ib-bonding package contains a bonding driver and a utility called 
ib-bond to manage and control the driver operation. 
The ib-bonding driver comes with the ib-bonding package (run rpm -qi ib-bonding
to get the package information).

Using the ib-bonding driver
---------------------------
The ib-bonding driver can be loaded manually or automatically.

1. Manual operation:
Use the utility ib-bond to start, query, or stop the driver. For details on this
utility, read the documentation for the ib-bonding package.

2. Automatic operation:
   Use standard OS tools (sysconfig in SuSE and initscripts in RedHat)
   to create a configuration that will come up with network restart. For details
   on this, read the documentation for the ib-bonding package.

Notes:
* Using /etc/infiniband/openib.conf to create a persistent configuration is
  no longer supported.
* On RHEL4_U7, a slave interface cannot be set as primary.


===============================================================================
6. Bug Fixes and Enhancements Since OFED 1.3
===============================================================================
- There is no default configuration for IPoIB interfaces: One should manually
  specify the full IP configuration or use the ofed_net.conf file. See
  OFED_Installation_Guide.txt for details on ipoib configuration.
- Don't drop multicast sends when they can be queued
- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small
  SKBs (bug 989)
- IPoIB failed on stress testing (bug 1004)
- Kernel Oops during "port up/down test" (bug 1040)
- Restarting the stack while iperf 2.0.4 is running on the client side causes
  a kernel panic (bug 985)
- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
- Set max CM MTU when moving to CM mode, instead of setting it in openibd script
- Fix CQ size calculations for ipoib
- Bonding: Enable build for SLES10 SP2
- Bonding: Fix issue in using the bonding module for Ethernet slaves (see
  documentation for details)

===============================================================================
7. Bug Fixes and Enhancements Since OFED 1.3.1
===============================================================================
- IPoIB: Refresh paths instead of flushing them on SM change events to improve
  failover response
- IPoIB: Fix loss of connectivity after bonding failover on both sides
- Bonding: Fix link state detection under RHEL4
- Bonding: Avoid annoying messages from initscripts when starting bond
- Bonding: Set default number of grat. ARP after failover to three (was one)

===============================================================================
8. Bug Fixes and Enhancements Since OFED 1.4
===============================================================================
- Performance tuning is enabled by default for IPoIB CM.
- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails
- Disable NAPI while the CQ is being drained (bugzilla #1587)
- rdma_cm: Use rate from IPoIB broadcast when joining IPoIB multicast.
  When joining an IPoIB multicast group, use the same rate as in the broadcast
  group. Otherwise, if rdma_cm creates this group before IPoIB does, it might
  get a different rate. This will cause IPoIB to fail to join the same group
  later on, because IPoIB has a strict rate selection.
- Fix unprotected use of priv->broadcast in ipoib_mcast_join_task
- Do not join broadcast group if interface is brought down

	  
===============================================================================
9. Performance tuning
===============================================================================
When IPoIB is configured to run in connected mode, TCP parameter tuning is
performed at driver startup to improve the throughput of medium and large
messages.
The driver startup scripts set the following parameters:

      net.ipv4.tcp_timestamps=0
      net.ipv4.tcp_sack=0
      net.core.netdev_max_backlog=250000
      net.core.rmem_max=16777216
      net.core.wmem_max=16777216
      net.core.rmem_default=16777216
      net.core.wmem_default=16777216
      net.core.optmem_max=16777216
      net.ipv4.tcp_mem="16777216 16777216 16777216"
      net.ipv4.tcp_rmem="4096 87380 16777216"
      net.ipv4.tcp_wmem="4096 65536 16777216"

This tuning is effective only for connected mode.  If you run in datagram mode,
it actually reduces performance.

If you change the IPoIB run mode to "datagram" while the driver is running,
the tuned parameters are not reset to their default values. We therefore
recommend that you change the IPoIB mode only while the driver is down
(by changing the line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in the file
/etc/infiniband/openib.conf, and then restarting the driver).
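
Because the tuned parameters are not reset automatically, it can help to record
their values before the driver is started. The following is a minimal sketch
that prints the current value of each parameter listed above in sysctl.conf
format. The function name snapshot_tuned_sysctls, the IPOIB_TUNED_KEYS
variable and the PROC_SYS override are assumptions made for this example
(PROC_SYS only exists so the function can be tried against a scratch
directory).

```shell
#!/bin/sh
# Sketch: print the current values of the parameters tuned for connected mode.
# PROC_SYS is a hypothetical override; on a real system it is /proc/sys.
PROC_SYS="${PROC_SYS:-/proc/sys}"

IPOIB_TUNED_KEYS="net.ipv4.tcp_timestamps net.ipv4.tcp_sack
net.core.netdev_max_backlog net.core.rmem_max net.core.wmem_max
net.core.rmem_default net.core.wmem_default net.core.optmem_max
net.ipv4.tcp_mem net.ipv4.tcp_rmem net.ipv4.tcp_wmem"

snapshot_tuned_sysctls() {
    for key in $IPOIB_TUNED_KEYS; do
        # Map a key such as net.ipv4.tcp_sack to net/ipv4/tcp_sack.
        path="$PROC_SYS/$(echo "$key" | tr '.' '/')"
        # Skip parameters that this kernel does not expose.
        [ -r "$path" ] || continue
        # Multi-value entries are tab-separated in /proc; emit spaces.
        echo "$key = $(tr '\t' ' ' < "$path")"
    done
}
```

Example (as root, before starting the driver):
snapshot_tuned_sysctls > /root/ipoib-sysctl.before
The saved file can later be loaded with "sysctl -p /root/ipoib-sysctl.before",
assuming the installed sysctl accepts a file name after -p. The file path is
only an example.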