  Software-RAID HOWTO
  Linas Vepstas, linas@linas.org
  v0.51  27 June 1998

  RAID stands for ''Redundant Array of Inexpensive Disks'', and is meant
  to be a way of creating a fast and reliable disk-drive subsystem out
  of individual disks.  RAID can guard against disk failure, and can
  also improve performance over that of a single disk drive.  This
  document is a tutorial/HOWTO/FAQ for users of the Linux MD kernel
  extension, the associated tools, and their use.  The MD extension
  implements RAID-0 (striping), RAID-1 (mirroring), RAID-4 and RAID-5 in
  software. That is, with MD, no special hardware or disk controllers
  are required to get many of the benefits of RAID.
  ______________________________________________________________________

  Table of Contents


  1. Introduction

  2. Understanding RAID

  3. Setup & Installation Considerations

  4. Error Recovery

  5. Troubleshooting Install Problems

  6. Supported Hardware & Software

  7. Modifying an Existing Installation

  8. Performance, Tools & General Bone-headed Questions

  9. High Availability RAID

  10. Questions Waiting for Answers

  11. Wish List of Enhancements to MD and Related Software



  ______________________________________________________________________


     Preamble
        This document is copyrighted and GPL'ed by Linas Vepstas
        (linas@linas.org).  Permission to use, copy, distribute this
        document for any purpose is hereby granted, provided that the
        author's / editor's name and this notice appear in all copies
        and/or supporting documents; and that an unmodified version of
        this document is made freely available.  This document is
        distributed in the hope that it will be useful, but WITHOUT ANY
        WARRANTY, either expressed or implied.  While every effort has
        been taken to ensure the accuracy of the information documented
        herein, the author / editor / maintainer assumes NO
        RESPONSIBILITY for any errors, or for any damages, direct or
        consequential, as a result of the use of the information
        documented herein.


        RAID, although designed to improve system reliability by adding
        redundancy, can also lead to a false sense of security and
        confidence when used improperly.  This false confidence can lead
        to even greater disasters.  In particular, note that RAID is
        designed to protect against *disk* failures, and not against
        *power* failures or *operator* mistakes.  Power failures, buggy
        development kernels, or operator/admin errors can lead to
        damaged data that is not recoverable!  RAID is *not* a
        substitute for proper backup of your system.  Know what you are
        doing, test, be knowledgeable and aware!

  1.  Introduction


  1. Q: What is RAID?

       A: RAID stands for "Redundant Array of Inexpensive Disks",
       and is meant to be a way of creating a fast and reliable
       disk-drive subsystem out of individual disks.  In the PC
       world, "I" has come to stand for "Independent", where mar
       keting forces continue to differentiate IDE and SCSI.  In
       it's original meaning, "I" meant "Inexpensive as compared to
       refrigerator-sized mainframe 3380 DASD", monster drives
       which made nice houses look cheap, and diamond rings look
       like trinkets.



  2. Q: What is this document?

       A: This document is a tutorial/HOWTO/FAQ for users of the
       Linux MD kernel extension, the associated tools, and their
       use.  The MD extension implements RAID-0 (striping), RAID-1
       (mirroring), RAID-4 and RAID-5 in software.   That is, with
       MD, no special hardware or disk controllers are required to
       get many of the benefits of RAID.


       This document is NOT an introduction to RAID; you must find
       this elsewhere.




  3. Q: What levels of RAID does the Linux kernel implement?

       A: Striping (RAID-0) and linear concatenation are a part of
       the stock 2.x series of kernels.  This code is of production
       quality; it is well understood and well maintained.  It is
       being used in some very large USENET news servers.


       RAID-1, RAID-4 & RAID-5 are a part of the 2.1.63 and greater
       kernels.  For earlier 2.0.x and 2.1.x kernels, patches exist
       that will provide this function.  Don't feel obligated to
       upgrade to 2.1.63; upgrading the kernel is hard; it is
       *much* easier to patch an earlier kernel.  Most of the RAID
       user community is running 2.0.x kernels, and that's where
       most of the historic RAID development has focused.   The
       current snapshots should be considered near-production
       quality; that is, there are no known bugs but there are some
       rough edges and untested system setups.  There are a large
       number of people using Software RAID in a production
       environment.


       RAID-1 hot reconstruction has been recently introduced
       (August 1997) and should be considered alpha quality.
       RAID-5 hot reconstruction will be alpha quality any day now.


  A word of caution about the 2.1.x development kernels: these
  are less than stable in a variety of ways.  Some of the
  newer disk controllers (e.g. the Promise Ultra's) are
  supported only in the 2.1.x kernels.  However, the 2.1.x
  kernels have seen frequent changes in the block device
  driver, in the DMA and interrupt code, in the PCI, IDE and
  SCSI code, and in the disk controller drivers.  The
  combination of these factors, coupled to cheapo hard drives
  and/or low-quality ribbon cables can lead to considerable
  heartbreak.   The ckraid tool, as well as fsck and mount, put
  considerable stress on the RAID subsystem.  This can lead to
  hard lockups during boot, where even the magic alt-SysReq
  key sequence won't save the day.  Use caution with the 2.1.x
  kernels, and expect trouble.  Or stick to the 2.0.34 kernel.




  4. Q: I'm running an older kernel. Where do I get patches?

       A: Software RAID-0 and linear mode are a stock part of all
       current Linux kernels.  Patches for Software RAID-1,4,5 are
       available from
       <http://luthien.nuclecu.unam.mx/~miguel/raid>.  See also the
       quasi-mirror <ftp://linux.kernel.org/pub/linux/daemons/raid/>
       for patches, tools and other goodies.
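
       A rough sketch of applying such a patch to an older kernel (the
       patch file name below is hypothetical; substitute whatever
       matches your kernel version):

           cd /usr/src/linux
           # apply the RAID patch, then reconfigure and rebuild
           zcat /tmp/raid145-patch-2.0.34.gz | patch -p1
           make config && make dep && make zImage
           make modules modules_install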



  5. Q: Are there other Linux RAID references?

       A:

         Generic RAID overview:
          <http://www.dpt.com/uraiddoc.html>.

         General Linux RAID options:
          <http://linas.org/linux/raid.html>.

         Latest version of this document:
          <http://linas.org/linux/Software-RAID/Software-RAID.html>.

         Linux-RAID mailing list archive:
          <http://www.linuxhq.com/lnxlists/>.

         Linux Software RAID Home Page:
          <http://luthien.nuclecu.unam.mx/~miguel/raid>.

         Linux Software RAID tools:
          <ftp://linux.kernel.org/pub/linux/daemons/raid/>.

         How to set up linear/striped Software RAID:
          <http://www.ssc.com/lg/issue17/raid.html>.

         Bootable RAID mini-HOWTO:
          <ftp://ftp.bizsystems.com/pub/raid/bootable-raid>.

         Root RAID HOWTO:
          <ftp://ftp.bizsystems.com/pub/raid/Root-RAID-HOWTO>.

         Linux RAID-Geschichten (RAID stories, in German):
          <http://www.infodrom.north.de/~joey/Linux/raid/>.



  6. Q: Who do I blame for this document?

       A: Linas Vepstas slapped this thing together.  However, most
       of the information, and some of the words were supplied by

         Bradley Ward Allen <ulmo@Q.Net>

         Luca Berra <bluca@comedia.it>

         Brian Candler <B.Candler@pobox.com>

         Bohumil Chalupa <bochal@apollo.karlov.mff.cuni.cz>

         Rob Hagopian <hagopiar@vu.union.edu>

         Anton Hristozov <anton@intransco.com>

         Miguel de Icaza <miguel@luthien.nuclecu.unam.mx>

         Ingo Molnar <mingo@pc7537.hil.siemens.at>

         Alvin Oga <alvin@planet.fef.com>

         Gadi Oxman <gadio@netvision.net.il>

         Vaughan Pratt <pratt@cs.Stanford.EDU>

         Steven A. Reisman <sar@pressenter.com>

         Michael Robinton <michael@bzs.org>

         Martin Schulze <joey@finlandia.infodrom.north.de>

         Geoff Thompson <geofft@cs.waikato.ac.nz>

         Edward Welbon <welbon@bga.com>

         Rod Wilkens <rwilkens@border.net>

         Johan Wiltink <j.m.wiltink@pi.net>

         Leonard N. Zubkoff <lnz@dandelion.com>

         Marc ZYNGIER <zyngier@ufr-info-p7.ibp.fr>


          Copyrights

         Copyright (C) 1994-96 Marc ZYNGIER

         Copyright (C) 1997 Gadi Oxman, Ingo Molnar, Miguel de
          Icaza

         Copyright (C) 1997, 1998 Linas Vepstas

         By copyright law, additional copyrights are implicitly
          held by the contributors listed above.

          Thanks all for being there!


  2.  Understanding RAID


  1. Q: What is RAID?  Why would I ever use it?

  A: RAID is a way of combining multiple disk drives into a
  single entity to improve performance and/or reliability.
  There are a variety of different types and implementations
  of RAID, each with its own advantages and disadvantages.
  For example, by putting a copy of the same data on two disks
  (called disk mirroring, or RAID level 1), read performance
  can be improved by reading alternately from each disk in the
  mirror.  On average, each disk is less busy, as it is
  handling only 1/2 the reads (for two disks), or 1/3 (for three
  disks), etc.  In addition, a mirror can improve reliability:
  if one disk fails, the other disk(s) have a copy of the
  data.  Different ways of combining the disks into one,
  referred to as RAID levels,  can provide greater storage
  efficiency than simple mirroring, or can alter latency
  (access-time) performance, or throughput (transfer rate)
  performance, for reading or writing, while still retaining
  redundancy that is useful for guarding against failures.

  Although RAID can protect against disk failure, it does not
  protect against operator and administrator (human) error, or
  against loss due to programming bugs (possibly due to bugs
  in the RAID software itself).  The net abounds with tragic
  tales of system administrators who have bungled a RAID
  installation, and have lost all of their data.  RAID is not
  a substitute for frequent, regularly scheduled backup.

  RAID can be implemented in hardware, in the form of special
  disk controllers, or in software, as a kernel module that is
  layered in between the low-level disk driver, and the file
  system which sits above it.  RAID hardware is always a "disk
  controller", that is, a device to which one can cable up the
  disk drives. Usually it comes in the form of an adapter card
  that will plug into an ISA/EISA/PCI/S-Bus/MicroChannel slot.
  However, some RAID controllers are in the form of a box that
  connects into the cable in between the usual system disk
  controller, and the disk drives.  Small ones may fit into a
  drive bay; large ones may be built into a storage cabinet
  with its own drive bays and power supply.  The latest RAID
  hardware used with the latest & fastest CPU will usually
  provide the best overall performance, although at a
  significant price.  This is because most RAID controllers
  come with on-board DSP's and memory cache that can off-load
  a considerable amount of processing from the main CPU, as
  well as allow high transfer rates into the large controller
  cache.  Old RAID hardware can act as a "de-accelerator" when
  used with newer CPU's: yesterday's fancy DSP and cache can
  act as a bottleneck, and its performance is often beaten by
  pure-software RAID and new but otherwise plain, run-of-the-
  mill disk controllers.  RAID hardware can offer an advantage
  over pure-software RAID, if it can make use of disk-spindle
  synchronization and its knowledge of the disk-platter
  position with regard to the disk head, and the desired disk-
  block.  However, most modern (low-cost) disk drives do not
  offer this information and level of control anyway, and
  thus, most RAID hardware does not take advantage of it.
  RAID hardware is usually not compatible across different
  brands, makes and models: if a RAID controller fails, it
  must be replaced by another controller of the same type.  As
  of this writing (June 1998), a broad variety of hardware
  controllers will operate under Linux; however, none of them
  currently come with configuration and management utilities
  that run under Linux.

  Software-RAID is a set of kernel modules, together with
  management utilities that implement RAID purely in software,
  and require no extraordinary hardware.  The Linux RAID
  subsystem is implemented as a layer in the kernel that sits
  above the low-level disk drivers (for IDE, SCSI and parallel-port
  drives), and the block-device interface.  The filesystem, be
  it ext2fs, DOS-FAT, or other, sits above the block-device
  interface.  Software-RAID, by its very software nature,
  tends to be more flexible than a hardware solution.  The
  downside is that it of course requires more CPU cycles and
  power to run well than a comparable hardware system.  Of
  course, the cost can't be beat.  Software RAID has one
  further important distinguishing feature: it operates on a
  partition-by-partition basis, where a number of individual
  disk partitions are ganged together to create a RAID
  partition.  This is in contrast to most hardware RAID
  solutions, which gang together entire disk drives into an
  array.  With hardware, the fact that there is a RAID array
  is transparent to the operating system, which tends to
  simplify management.  With software, there are far more
  configuration options and choices, tending to complicate
  matters.

  As of this writing (June 1998), the administration of RAID
  under Linux is far from trivial, and is best attempted by
  experienced system administrators.  The theory of operation
  is complex.  The system tools require modification to
  startup scripts.  And recovery from disk failure is non-
  trivial, and prone to human error.   RAID is not for the
  novice, and any benefits it may bring to reliability and
  performance can be easily outweighed by the extra
  complexity.  Indeed, modern disk drives are incredibly
  reliable and modern CPU's and controllers are quite
  powerful.  You might more easily obtain the desired
  reliability and performance levels by purchasing higher-
  quality and/or faster hardware.



  2. Q: What are RAID levels?  Why so many? What distinguishes them?

        A: The different RAID levels have different performance,
        redundancy, storage capacity, reliability and cost
        characteristics.  Most, but not all, levels of RAID offer
        redundancy against disk failure.  Of those that offer
        redundancy, RAID-1 and RAID-5 are the most popular.  RAID-1
        offers better performance, while RAID-5 provides for more efficient
       use of the available storage space.  However, tuning for
       performance is an entirely different matter, as performance
       depends strongly on a large variety of factors, from the
       type of application, to the sizes of stripes, blocks, and
       files.  The more difficult aspects of performance tuning are
       deferred to a later section of this HOWTO.

       The following describes the different RAID levels in the
       context of the Linux software RAID implementation.


         RAID-linear is a simple concatenation of partitions to
          create a larger virtual partition.  It is handy if you
           have a number of small drives, and wish to create a single,
          large partition.  This concatenation offers no
          redundancy, and in fact decreases the overall
          reliability: if any one disk fails, the combined
          partition will fail.




    RAID-1 is also referred to as "mirroring".  Two (or more)
     partitions, all of the same size, each store an exact
     copy of all data, disk-block by disk-block.  Mirroring
     gives strong protection against disk failure: if one disk
      fails, there is another with an exact copy of the
     same data. Mirroring can also help improve performance in
     I/O-laden systems, as read requests can be divided up
     between several disks.   Unfortunately, mirroring is also
     the least efficient in terms of storage: two mirrored
     partitions can store no more data than a single
     partition.



    Striping is the underlying concept behind all of the
     other RAID levels.  A stripe is a contiguous sequence of
     disk blocks.  A stripe may be as short as a single disk
     block, or may consist of thousands.  The RAID drivers
     split up their component disk partitions into stripes;
     the different RAID levels differ in how they organize the
     stripes, and what data they put in them. The interplay
     between the size of the stripes, the typical size of
     files in the file system, and their location on the disk
     is what determines the overall performance of the RAID
     subsystem.



    RAID-0 is much like RAID-linear, except that the
     component partitions are divided into stripes and then
     interleaved.  Like RAID-linear, the result is a single
     larger virtual partition.  Also like RAID-linear, it
     offers no redundancy, and therefore decreases overall
     reliability: a single disk failure will knock out the
     whole thing.  RAID-0 is often claimed to improve
     performance over the simpler RAID-linear.  However, this
     may or may not be true, depending on the characteristics
      of the file system, the typical size of the file as
     compared to the size of the stripe, and the type of
     workload.  The ext2fs file system already scatters files
     throughout a partition, in an effort to minimize
     fragmentation. Thus, at the simplest level, any given
     access may go to one of several disks, and thus, the
     interleaving of stripes across multiple disks offers no
     apparent additional advantage. However, there are
     performance differences, and they are data, workload, and
     stripe-size dependent.



    RAID-4 interleaves stripes like RAID-0, but it requires
     an additional partition to store parity information.  The
     parity is used to offer redundancy: if any one of the
      disks fails, the data on the remaining disks can be used
     to reconstruct the data that was on the failed disk.
     Given N data disks, and one parity disk, the parity
     stripe is computed by taking one stripe from each of the
     data disks, and XOR'ing them together.  Thus, the storage
      capacity of an (N+1)-disk RAID-4 array is N, which is a
     lot better than mirroring (N+1) drives, and is almost as
     good as a RAID-0 setup for large N.  Note that for N=1,
     where there is one data drive, and one parity drive,
      RAID-4 is a lot like mirroring, in that the two disks are
      copies of each other.  However, RAID-4 does NOT
     offer the read-performance of mirroring, and offers
     considerably degraded write performance. In brief, this
     is because updating the parity requires a read of the old
     parity, before the new parity can be calculated and
      written out.  In an environment with lots of writes, the
      parity disk can become a bottleneck, as each write must
      access the parity disk.  (A small worked example of the
      parity arithmetic appears after this list.)



    RAID-5 avoids the write-bottleneck of RAID-4 by
     alternately storing the parity stripe on each of the
     drives.  However, write performance is still not as good
     as for mirroring, as the parity stripe must still be read
     and XOR'ed before it is written.  Read performance is
     also not as good as it is for mirroring, as, after all,
     there is only one copy of the data, not two or more.
      RAID-5's principal advantage over mirroring is that it
     offers redundancy and protection against single-drive
     failure, while offering far more storage capacity  when
     used with three or more drives.



    RAID-2 and RAID-3 are seldom used anymore, and to some
      degree have been made obsolete by modern disk
     technology.  RAID-2 is similar to RAID-4, but stores ECC
     information instead of parity.  Since all modern disk
     drives incorporate ECC under the covers, this offers
     little additional protection.  RAID-2 can offer greater
     data consistency if power is lost during a write;
     however, battery backup and a clean shutdown can offer
     the same benefits.  RAID-3 is similar to RAID-4, except
     that it uses the smallest possible stripe size. As a
     result, any given read will involve all disks, making
     overlapping I/O requests difficult/impossible. In order
     to avoid delay due to rotational latency, RAID-3 requires
     that all disk drive spindles be synchronized. Most modern
     disk drives lack spindle-synchronization ability, or, if
     capable of it, lack the needed connectors, cables, and
     manufacturer documentation.  Neither RAID-2 nor RAID-3
     are supported by the Linux Software-RAID drivers.



    Other RAID levels have been defined by various
     researchers and vendors.  Many of these represent the
     layering of one type of raid on top of another.  Some
     require special hardware, and others are protected by
     patent. There is no commonly accepted naming scheme for
      these other levels. Sometimes the advantages of these
     other systems are minor, or at least not apparent until
     the system is highly stressed.  Except for the layering
     of RAID-1 over RAID-0/linear, Linux Software RAID does
     not support any of the other variations.
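
     To make the parity arithmetic concrete, here is a tiny
     illustration using shell (bash) arithmetic -- purely a
     demonstration, not part of any RAID tool.  The parity is the
     XOR of the data stripes, and any one lost stripe can be rebuilt
     by XOR'ing the parity with the surviving stripes:

         # three "data stripes", one byte each
         d1=0x5A ; d2=0x3C ; d3=0xF0
         parity=$(( d1 ^ d2 ^ d3 ))
         printf 'parity     = 0x%02X\n' $parity
         # pretend the disk holding d2 failed; rebuild its byte
         printf 'rebuilt d2 = 0x%02X\n' $(( d1 ^ d3 ^ parity ))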



  3.  Setup & Installation Considerations


  1. Q: What is the best way to configure Software RAID?

       A: I keep rediscovering that file-system planning is one of
       the more difficult Unix configuration tasks.  To answer your
       question, I can describe what we did.

       We planned the following setup:
     two EIDE disks, 2.1 gig each.


       disk partition mount pt.  size    device
         1      1       /        300M   /dev/hda1
         1      2       swap      64M   /dev/hda2
         1      3       /home    800M   /dev/hda3
         1      4       /var     900M   /dev/hda4

         2      1       /root    300M   /dev/hdc1
         2      2       swap      64M   /dev/hdc2
         2      3       /home    800M   /dev/hdc3
         2      4       /var     900M   /dev/hdc4





    Each disk is on a separate controller (& ribbon cable).
     The theory is that a controller failure and/or ribbon
     failure won't disable both disks.  Also, we might
     possibly get a performance boost from parallel operations
     over two controllers/cables.

    Install the Linux kernel on the root (/) partition
     /dev/hda1.  Mark this partition as bootable.

    /dev/hdc1 will contain a ``cold'' copy of /dev/hda1. This
     is NOT a raid copy, just a plain old copy-copy. It's
     there just in case the first disk fails; we can use a
     rescue disk, mark /dev/hdc1 as bootable, and use that to
     keep going without having to reinstall the system.  You
     may even want to put /dev/hdc1's copy of the kernel into
     LILO to simplify booting in case of failure.

     The theory here is that in case of severe failure, I can
     still boot the system without worrying about raid
     superblock-corruption or other raid failure modes &
     gotchas that I don't understand.

     /dev/hda3 and /dev/hdc3 will be mirrored as /dev/md0.

     /dev/hda4 and /dev/hdc4 will be mirrored as /dev/md1.

    we picked /var and /home to be mirrored, and in separate
     partitions, using the following logic:

    / (the root partition) will contain relatively static,
     non-changing data: for all practical purposes, it will be
     read-only without actually being marked & mounted read-
     only.

    /home will contain ''slowly'' changing data.

    /var will contain rapidly changing data, including mail
     spools, database contents and web server logs.

     The idea behind using multiple, distinct partitions is
     that if, for some bizarre reason, whether it is human
     error, power loss, or an operating system gone wild,
     corruption is limited to one partition.  In one typical
     case, power is lost while the system is writing to disk.
     This will almost certainly lead to a corrupted
     filesystem, which will be repaired by fsck during the
      next boot.  Although fsck does its best to make the
     repairs without creating additional damage during those
     repairs, it can be comforting to know that any such
     damage has been limited to one partition.  In another
     typical case, the sysadmin makes a mistake during rescue
     operations, leading to erased or destroyed data.
     Partitions can help limit the repercussions of the
     operator's errors.

    Other reasonable choices for partitions might be /usr or
      /opt.  In fact, /opt and /home would make great choices for
      RAID-5 partitions if we had more disks.  A word of
     caution: DO NOT put /usr in a RAID-5 partition.  If a
     serious fault occurs, you may find that you cannot mount
     /usr, and that you want some of the tools on it (e.g. the
     networking tools, or the compiler.)  With RAID-1, if a
     fault has occurred, and you can't get RAID to work, you
     can at least mount one of the two mirrors.  You can't do
     this with any of the other RAID levels (RAID-5, striping,
     or linear append).



     So, to complete the answer to the question:

    install the OS on disk 1, partition 1.  do NOT mount any
     of the other partitions.

    install RAID per instructions.

    configure md0 and md1.

    convince yourself that you know what to do in case of a
     disk failure!  Discover sysadmin mistakes now, and not
     during an actual crisis.  Experiment!  (we turned off
     power during disk activity -- this proved to be ugly but
     informative).

     do some ugly mount/copy/unmount/rename/reboot scheme to
      move /var over to /dev/md1 (a sketch of one such sequence
      follows this list).  Done carefully, this is not dangerous.

    enjoy!
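
     One possible sequence for that last migration step (a sketch
     only; the device names are the ones from the example above, so
     adapt them to your setup, do it in single-user mode, and have
     a backup first):

         mke2fs /dev/md1              # new filesystem on the mirror
         mount /dev/md1 /mnt          # mount it temporarily
         cp -a /var/. /mnt/           # copy everything, preserving modes
         mv /var /var.old             # keep the old copy until satisfied
         mkdir /var
         umount /mnt
         mount /dev/md1 /var          # and add the entry to /etc/fstab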




  2. Q: Can I stripe/mirror the root partition (/)?  Why can't I boot
     Linux directly from the md disks?


        A: Both LILO and Loadlin need a non-striped/non-mirrored
        partition to read the kernel image from.  If you want to
        stripe/mirror the root partition (/), then you'll want to
        create an unstriped, unmirrored partition to hold the
        kernel(s).  Typically, this partition is named /boot.  Then you
        either use the initial ramdisk support (initrd), or patches
        from Harald Hoyer <HarryH@Royal.Net> that allow a striped
        partition to be used as the root device.  (These patches are
        now a standard part of recent 2.1.x kernels.)


       There are several approaches that can be used.  One approach
       is documented in detail in the Bootable RAID mini-HOWTO:
       <ftp://ftp.bizsystems.com/pub/raid/bootable-raid>.


        Alternately, use mkinitrd to build the ramdisk image; a brief
        sketch follows, and Edward Welbon's more detailed description
        appears below.
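
        A minimal sketch of the mkinitrd route, assuming a Red
        Hat-style mkinitrd (the flags, image names and paths below
        are only examples and vary between distributions and
        versions; check your mkinitrd man page):

            # build an initial ramdisk image for the installed kernel
            mkinitrd /boot/initrd-2.0.35.img 2.0.35

            # then point LILO at it in /etc/lilo.conf ...
            #   image=/boot/vmlinuz-2.0.35
            #       label=raid
            #       initrd=/boot/initrd-2.0.35.img
            #       root=/dev/md0
            # ... and re-run /sbin/lilo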


  Edward Welbon <welbon@bga.com> writes:

    ... all that is needed is a script to manage the boot
     setup.  To mount an md filesystem as root, the main thing
     is to build an initial file system image that has the
     needed modules and md tools to start md.  I have a simple
     script that does this.


     For boot media, I have a small cheap SCSI disk (170MB; I
      got it used for $20).  This disk runs on an AHA1452, but
     it could just as well be an inexpensive IDE disk on the
     native IDE.  The disk need not be very fast since it is
     mainly for boot.


    This disk has a small file system which contains the
     kernel and the file system image for initrd.  The initial
     file system image has just enough stuff to allow me to
     load the raid SCSI device driver module and start the
     raid partition that will become root.  I then do an


       echo 0x900 > /proc/sys/kernel/real-root-dev





  (0x900 is for /dev/md0) and exit linuxrc.  The boot proceeds
  normally from there.
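
  The value written is just the device number, computed as
  (major << 8) | minor.  The md driver has major number 9, so for
  example:

       # /dev/md0 is major 9, minor 0  ->  0x900
       # /dev/md1 is major 9, minor 1  ->  0x901
       echo 0x901 > /proc/sys/kernel/real-root-dev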


    I have built most support as a module except for the
     AHA1452 driver that brings in the initrd filesystem.  So
     I have a fairly small kernel. The method is perfectly
      reliable; I have been doing this since before 2.1.26 and
     have never had a problem that I could not easily recover
     from.  The file systems even survived several 2.1.4[45]
     hard crashes with no real problems.


     At one time I had partitioned the raid disks so that the
      initial cylinders of the first raid disk held the kernel
      and the initial cylinders of the second raid disk held
      the initial file system image; instead, I made the initial
      cylinders of the raid disks swap, since they are the
      fastest cylinders (why waste them on boot?).


    The nice thing about having an inexpensive device
     dedicated to boot is that it is easy to boot from and can
     also serve as a rescue disk if necessary. If you are
     interested, you can take a look at the script that builds
     my initial ram disk image and then runs LILO.

       <http://www.realtime.net/~welbon/initrd.md.tar.gz>


  It is current enough to show the picture.  It isn't especially
  pretty and it could certainly build a much smaller
  filesystem image for the initial ram disk.  It would be easy
  to make it more efficient.  But it uses LILO as is.  If
  you make any improvements, please forward a copy to me. 8-)



  3. Q: I have heard that I can run mirroring over striping. Is this
     true?  Can I run mirroring over the loopback device?

       A: Yes, but not the reverse.  That is, you can put a stripe
       over several disks, and then build a mirror on top of this.
       However, striping cannot be put on top of mirroring.


       A brief technical explanation is that the linear and stripe
       personalities use the ll_rw_blk routine for access.  The
       ll_rw_blk routine maps disk devices and  sectors, not
       blocks.  Block devices can be layered one on top of the
       other; but devices that do raw, low-level disk accesses,
       such as ll_rw_blk, cannot.


       Currently (November 1997) RAID cannot be run over the
       loopback devices, although this should be fixed shortly.



  4. Q: I have two small disks and three larger disks.  Can I
     concatenate the two smaller disks with RAID-0, and then create a
     RAID-5 out of that and the larger disks?

        A: Currently (November 1997), for a RAID-5 array, no.
        Currently, one can do this only for a RAID-1 on top of the
        concatenated drives.



  5. Q: What is the difference between RAID-1 and RAID-5 for a two-disk
     configuration (i.e. the difference between a RAID-1 array  built
     out of two disks, and a RAID-5 array built out of two disks)?


       A: There is no difference in storage capacity.  Nor can
       disks be added to either array to increase capacity (see the
       question below for details).


       RAID-1 offers a performance advantage for reads: the RAID-1
       driver uses distributed-read technology to simultaneously
       read two sectors, one from each drive, thus doubling read
       performance.


       The RAID-5 driver, although it contains many optimizations,
       does not currently (September 1997) realize that the parity
       disk is actually a mirrored copy of the data disk.  Thus, it
       serializes data reads.




  6. Q: How can I guard against a two-disk failure?


       A: Some of the RAID algorithms do guard against multiple
       disk failures, but these are not currently implemented for
       Linux.  However, the Linux Software RAID can guard against
       multiple disk failures by layering an array on top of an
  array.  For example, nine disks can be used to create three
  raid-5 arrays.  Then these three arrays can in turn be
  hooked together into a single RAID-5 array on top.  In fact,
  this kind of a configuration will guard against a three-disk
  failure.  Note that a large amount of disk space is
  ''wasted'' on the redundancy information.



           For an NxN raid-5 array,
           N=3, 5 out of 9 disks are used for parity (=55%)
           N=4, 7 out of 16 disks
           N=5, 9 out of 25 disks
           ...
           N=9, 17 out of 81 disks (=~20%)





  In general, an MxN array will use M+N-1 disks for parity.
  The least amount of space is "wasted" when M=N.


  Another alternative is to create a RAID-1 array with three
  disks.  Note that since all three disks contain identical
   data, 2/3 of the space is ''wasted''.




  7. Q: I'd like to understand  how it'd be possible to have something
     like fsck: if the partition hasn't been cleanly unmounted, fsck
     runs and fixes the filesystem by itself more than 90% of the time.
     Since the machine is capable of fixing it by itself with ckraid
     --fix, why not make it automatic?



       A: This can be done by adding lines like the following to
       /etc/rc.d/rc.sysinit:

           mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {
               ckraid --fix /etc/raid.usr.conf
               mdadd /dev/md0 /dev/hda1 /dev/hdc1
           }



       or

           mdrun -p1 /dev/md0
           if [ $? -gt 0 ] ; then
                   ckraid --fix /etc/raid1.conf
                   mdrun -p1 /dev/md0
           fi



        Before presenting a more complete and reliable script, let's
       review the theory of operation.

       Gadi Oxman writes: In an unclean shutdown, Linux might be in
       one of the following states:


     1. The in-memory disk cache was in sync with the RAID set
        when the unclean shutdown occurred; no data was lost.

     2. The in-memory disk cache was newer than the RAID set
        contents when the crash occurred; this results in a
        corrupted filesystem and potentially in data loss.

        This state can be further divided into the following two
        states:


        a. Linux was writing data when the unclean shutdown
           occurred.

        b. Linux was not writing data when the crash occurred.


     Suppose we were using a RAID-1 array. In (2a), it might
     happen that before the crash, a small number of data
     blocks were successfully written only to some of the
     mirrors, so that on the next reboot, the mirrors will no
     longer contain the same data.

     If we were to ignore the mirror differences, the
     raidtools-0.36.3 read-balancing code might choose to read
     the above data blocks from any of the mirrors, which will
     result in inconsistent behavior (for example, the output
     of e2fsck -n /dev/md0 can differ from run to run).


     Since RAID doesn't protect against unclean shutdowns,
     usually there isn't any ''obviously correct'' way to fix
     the mirror differences and the filesystem corruption.

     For example, by default ckraid --fix will choose the
     first operational mirror and update the other mirrors
     with its contents.  However, depending on the exact
     timing at the crash, the data on another mirror might be
     more recent, and we might want to use it as the source
     mirror instead, or perhaps use another method for
     recovery.

     The following script provides one of the more robust
     boot-up sequences.  In particular, it guards against
     long, repeated ckraid's in the presence of uncooperative
     disks, controllers, or controller device drivers.  Modify
     it to reflect your config, and copy it to rc.raid.init.
     Then invoke rc.raid.init after the root partition has
     been fsck'ed and mounted rw, but before the remaining
     partitions are fsck'ed.  Make sure the current directory
     is in the search path.















         mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {
             rm -f /fastboot             # force an fsck to occur
             ckraid --fix /etc/raid.usr.conf
             mdadd /dev/md0 /dev/hda1 /dev/hdc1
         }
         # if a crash occurs later in the boot process,
         # we at least want to leave this md in a clean state.
         /sbin/mdstop /dev/md0

         mdadd /dev/md1 /dev/hda2 /dev/hdc2 || {
             rm -f /fastboot             # force an fsck to occur
             ckraid --fix /etc/raid.home.conf
             mdadd /dev/md1 /dev/hda2 /dev/hdc2
         }
         # if a crash occurs later in the boot process,
         # we at least want to leave this md in a clean state.
         /sbin/mdstop /dev/md1

         mdadd /dev/md0 /dev/hda1 /dev/hdc1
         mdrun -p1 /dev/md0
         if [ $? -gt 0 ] ; then
             rm -f /fastboot             # force an fsck to occur
             ckraid --fix /etc/raid.usr.conf
             mdrun -p1 /dev/md0
         fi
         # if a crash occurs later in the boot process,
         # we at least want to leave this md in a clean state.
         /sbin/mdstop /dev/md0

         mdadd /dev/md1 /dev/hda2 /dev/hdc2
         mdrun -p1 /dev/md1
         if [ $? -gt 0 ] ; then
             rm -f /fastboot             # force an fsck to occur
             ckraid --fix /etc/raid.home.conf
             mdrun -p1 /dev/md1
         fi
         # if a crash occurs later in the boot process,
         # we at least want to leave this md in a clean state.
         /sbin/mdstop /dev/md1

         # OK, just blast through the md commands now.  If there were
         # errors, the above checks should have fixed things up.
         /sbin/mdadd /dev/md0 /dev/hda1 /dev/hdc1
         /sbin/mdrun -p1 /dev/md0

         /sbin/mdadd /dev/md1 /dev/hda2 /dev/hdc2
         /sbin/mdrun -p1 /dev/md1




  In addition to the above, you'll want to create a
  rc.raid.halt which should look like the following:

      /sbin/mdstop /dev/md0
      /sbin/mdstop /dev/md1



  Be sure to modify both rc.sysinit and init.d/halt to include
  this everywhere that filesystems get unmounted before a
  halt/reboot.  (Note that rc.sysinit unmounts and reboots if
  fsck returned with an error.)
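
  For example, in init.d/halt the call belongs right after the
  filesystems have been unmounted (a sketch only; the path to
  rc.raid.halt and the exact layout of the halt script differ
  between distributions):

      # ... unmount all filesystems ...
      umount -a
      # stop the md devices while they are quiescent and clean
      sh /etc/rc.d/rc.raid.halt
      # ... continue with the normal halt or reboot ...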



  8. Q: Can I set up one-half of a RAID-1 mirror with the one disk I
     have now, and then later get the other disk and just drop it in?


        A: With the current tools, no, not in any easy way.  In
        particular, you cannot just copy the contents of one disk onto
        another, and then pair them up.  This is because the RAID
        drivers use a glob of space at the end of the partition to
       store the superblock.  This decreases the amount of space
       available to the file system slightly; if you just naively
       try to force a RAID-1 arrangement onto a partition with an
       existing filesystem, the raid superblock will overwrite a
       portion of the file system and mangle data.  Since the
        ext2fs filesystem scatters files randomly throughout the
       partition (in order to avoid fragmentation), there is a very
       good chance that some file will land at the very end of a
       partition long before the disk is full.


       If you are clever, I suppose you can calculate how much room
       the RAID superblock will need, and make your filesystem
       slightly smaller, leaving room for it when you add it later.
       But then, if you are this clever, you should also be able to
       modify the tools to do this automatically for you.  (The
       tools are not terribly complex).


        Note: A careful reader has pointed out that the following
       trick may work; I have not tried or verified this: Do the
       mkraid with /dev/null as one of the devices.  Then mdadd -r
       with only the single, true disk (do not mdadd /dev/null).
       The mkraid should have successfully built the raid array,
       while the mdadd step just forces the system to run in
       "degraded" mode, as if one of the disks had failed.



  4.  Error Recovery


  1. Q: I have a RAID-1 (mirroring) setup, and lost power while there
     was disk activity.  Now what do I do?


       A: The redundancy of RAID levels is designed to protect
       against a disk failure, not against a power failure.

       There are several ways to recover from this situation.


         Method (1): Use the raid tools.  These can be used to
          sync the raid arrays.  They do not fix file-system
          damage; after the raid arrays are sync'ed, then the file-
          system still has to be fixed with fsck.  Raid arrays can
          be checked with ckraid /etc/raid1.conf (for RAID-1, else,
          /etc/raid5.conf, etc.)

          Calling ckraid /etc/raid1.conf --fix will pick one of the
          disks in the array (usually the first), and use that as
          the master copy, and copy its blocks to the others in the
          mirror.

          To designate which of the disks should be used as the
          master, you can use the --force-source flag: for example,
          ckraid /etc/raid1.conf --fix --force-source /dev/hdc3

      The ckraid command can be safely run without the --fix
      option to verify the inactive RAID array without making
      any changes.  When you are comfortable with the proposed
      changes, supply the --fix option.  (A consolidated sketch of
      this verify-then-fix sequence appears at the end of this
      answer.)


    Method (2): Paranoid, time-consuming, not much better
      than the first way.  Let's assume a two-disk RAID-1 array,
     consisting of partitions /dev/hda3 and /dev/hdc3.  You
     can try the following:

     a. fsck /dev/hda3

     b. fsck /dev/hdc3

     c. decide which of the two partitions had fewer errors,
         or was more easily recovered, or recovered the data
        that you wanted.  Pick one, either one, to be your new
        ``master'' copy.  Say you picked /dev/hdc3.

     d. dd if=/dev/hdc3 of=/dev/hda3

     e. mkraid raid1.conf -f --only-superblock


     Instead of the last two steps, you can instead run ckraid
     /etc/raid1.conf --fix --force-source /dev/hdc3 which
     should be a bit faster.

    Method (3): Lazy man's version of above.  If you don't
     want to wait for long fsck's to complete, it is perfectly
     fine to skip the first three steps above, and move
     directly to the last two steps.  Just be sure to run fsck
     /dev/md0 after you are done.  Method (3) is actually just
     method (1) in disguise.


     In any case, the above steps will only sync up the raid
     arrays.  The file system probably needs fixing as well:
     for this, fsck needs to be run on the active, unmounted
     md device.


     With a three-disk RAID-1 array, there are more
     possibilities, such as using two disks to ''vote'' a
     majority answer.  Tools to automate this do not currently
     (September 97) exist.
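
      Putting method (1) together, a typical RAID-1 recovery
      session looks something like this (the device and config-file
      names are the ones used in the examples above):

          # look first; without --fix nothing is changed
          ckraid /etc/raid1.conf
          # resync, using /dev/hdc3 as the master copy
          ckraid /etc/raid1.conf --fix --force-source /dev/hdc3
          # start the array and repair the filesystem on it,
          # while it is still unmounted
          mdadd /dev/md0 /dev/hda3 /dev/hdc3
          mdrun -p1 /dev/md0
          fsck /dev/md0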



  2. Q: I have a RAID-4 or a RAID-5 (parity) setup, and lost power while
     there was disk activity.  Now what do I do?


       A: The redundancy of RAID levels is designed to protect
       against a disk failure, not against a power failure.

       Since the disks in a RAID-4 or RAID-5 array do not contain a
       file system that fsck can read, there are fewer repair
       options.  You cannot use fsck to do preliminary checking
       and/or repair; you must use ckraid first.


       The ckraid command can be safely run without the --fix
       option to verify the inactive RAID array without making any
       changes.  When you are comfortable with the proposed
        changes, supply the --fix option.


  If you wish, you can try designating one of the disks as a
  ''failed disk''.  Do this with the --suggest-failed-disk-
  mask flag.

  Only one bit should be set in the flag: RAID-5 cannot
  recover two failed disks.  The mask is a binary bit mask:
  thus:

      0x1 == first disk
      0x2 == second disk
      0x4 == third disk
      0x8 == fourth disk, etc.




  Alternately, you can choose to modify the parity sectors, by
  using the --suggest-fix-parity flag.  This will recompute
  the parity from the other sectors.


  The flags --suggest-failed-disk-mask and --suggest-fix-parity
  can be safely used for verification. No changes are made if
  the --fix flag is not specified.  Thus, you can experiment
  with different possible repair schemes.
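
  For example, one might preview several possible repairs, and only
  then commit one of them with --fix (a sketch; check the ckraid
  usage message for the exact argument syntax, and pick the mask
  that matches your situation):

       ckraid /etc/raid5.conf --suggest-failed-disk-mask 0x2
       ckraid /etc/raid5.conf --suggest-fix-parity
       ckraid /etc/raid5.conf --fix --suggest-failed-disk-mask 0x2

  The first two commands only report what would be done; the last
  one actually performs the repair, treating the second disk as the
  failed one.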




  3. Q: My RAID-1 device, /dev/md0 consists of two hard drive
     partitions: /dev/hda3 and /dev/hdc3.  Recently, the disk with
     /dev/hdc3 failed, and was replaced with a new disk.  My best
     friend, who doesn't understand RAID, said that the correct thing to
     do now is to ''dd if=/dev/hda3 of=/dev/hdc3''.  I tried this, but
     things still don't work.


        A: You should keep your best friend away from your computer.
       Fortunately, no serious damage has been done.  You can
       recover from this by running:


       mkraid raid1.conf -f --only-superblock





  By using dd, two identical copies of the partition were
  created. This is almost correct, except that the RAID-1 kernel
  extension expects the RAID superblocks to be different.
  Thus, when you try to reactivate RAID, the software will
  notice the problem, and deactivate one of the two
  partitions.  By re-creating the superblock, you should have a
  fully usable system.



  4. Q: My RAID-1 device, /dev/md0 consists of two hard drive
     partitions: /dev/hda3 and /dev/hdc3.  My best (girl?)friend, who
     doesn't understand RAID, ran fsck on /dev/hda3 while I wasn't
     looking, and now the RAID won't work. What should I do?

       A: You should re-examine your concept of ``best friend''.
        In general, fsck should never be run on the individual
        partitions that compose a RAID array.  Assuming that neither
        of the partitions is or was heavily damaged, no data loss has
       occurred, and the RAID-1 device can be recovered as follows:

          a. make a backup of the file system on /dev/hda3

          b. dd if=/dev/hda3 of=/dev/hdc3

          c. mkraid raid1.conf -f --only-superblock

       This should leave you with a working disk mirror.



  5. Q: Why does the above work as a recovery procedure?

        A: Because each of the component partitions in a RAID-1
        mirror is a perfectly valid copy of the file system.  In a
       pinch, mirroring can be disabled, and one of the partitions
       can be mounted and safely run as an ordinary, non-RAID file
       system.  When you are ready to restart using RAID-1, then
       unmount the partition, and follow the above instructions to
       restore the mirror.   Note that the above works ONLY for
       RAID-1, and not for any of the other levels.


       It may make you feel more comfortable to reverse the
       direction of the copy above: copy from the disk that was
       untouched to the one that was.  Just be sure to fsck the
       final md.



  6. Q: I am confused by the above questions, but am not yet bailing
     out.  Is it safe to run fsck /dev/md0 ?


       A: Yes, it is safe to run fsck on the md devices.  In fact,
       this is the only safe place to run fsck.



  7. Q: If a disk is slowly failing, will it be obvious which one it is?
     I am concerned that it won't be, and this confusion could lead to
     some dangerous decisions by a sysadmin.


       A: Once a disk fails, an error code will be returned from
       the low level driver to the RAID driver.  The RAID driver
       will mark it as ``bad'' in the RAID superblocks of the
       ``good'' disks (so we will later know which mirrors are good
       and which aren't), and continue RAID operation on the
       remaining operational mirrors.


       This, of course, assumes that the disk and the low level
       driver can detect a read/write error, and will not silently
       corrupt data, for example. This is true of current drives
       (error detection schemes are being used internally), and is
       the basis of RAID operation.




  8. Q: What about hot-repair?


       A: Work is underway to complete ``hot reconstruction''.
       With this feature, one can add several ``spare'' disks to
       the RAID set (be it level 1 or 4/5), and once a disk fails,
        it will be reconstructed on one of the spare disks at run
        time, without ever needing to shut down the array.


       However, to use this feature, the spare disk must have been
       declared at boot time, or it must be hot-added, which
       requires the use of special cabinets and connectors that
       allow a disk to be added while the electrical power is on.


       As of October 97, there is a beta version of MD that allows:

         RAID 1 and 5 reconstruction on spare drives

         RAID-5 parity reconstruction after an unclean shutdown

         spare disk to be hot-added to an already running RAID 1
          or 4/5 array

           Automatic reconstruction is (Dec 97) currently disabled
           by default, due to the preliminary nature of this work.
           It can be enabled by changing the value of
           SUPPORT_RECONSTRUCTION in include/linux/md.h.


          If spare drives were configured into the array when it
          was created and kernel-based reconstruction is enabled,
          the spare drive will already contain a RAID superblock
          (written by mkraid), and the kernel will reconstruct its
          contents automatically (without needing the usual mdstop,
          replace drive, ckraid, mdrun steps).


          If you are not running automatic reconstruction, and have
          not configured a hot-spare disk, the procedure described
          by Gadi Oxman <gadio@netvision.net.il> is recommended:

         Currently, once the first disk is removed, the RAID set
          will be running in degraded mode. To restore full
          operation mode, you need to:

         stop the array (mdstop /dev/md0)

         replace the failed drive

         run ckraid raid.conf to reconstruct its contents

         run the array again (mdadd, mdrun).

          At this point, the array will be running with all the
          drives, and again protects against a failure of a single
          drive.
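
           As a concrete sketch of that sequence, assuming a RAID-5
           array /dev/md0 described by /etc/raid5.conf (the
           partition names below are only placeholders):

                mdstop /dev/md0
                (... physically replace the failed drive ...)
                ckraid --fix /etc/raid5.conf
                mdadd /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
                mdrun -p5 /dev/md0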

           Currently, it is not possible to assign a single hot-
           spare disk to several arrays.   Each array requires its
           own hot-spare.




  9. Q: I would like to have an audible alarm for ``you schmuck, one
     disk in the mirror is down'', so that the novice sysadmin knows
     that there is a problem.


       A: The kernel is logging the event with a ``KERN_ALERT''
       priority in syslog.  There are several software packages
       that will monitor the syslog files, and beep the PC speaker,
       call a pager, send e-mail, etc. automatically.



  10.
     Q: How do I run RAID-5 in degraded mode (with one disk failed, and
     not yet replaced)?


       A: Gadi Oxman <gadio@netvision.net.il> writes: Normally, to
       run a RAID-5 set of n drives you have to:


       mdadd /dev/md0 /dev/disk1 ... /dev/disk(n)
       mdrun -p5 /dev/md0





  Even if one of the disks has failed, you still have to mdadd
  it as you would in a normal setup.  (?? try using /dev/null
  in place of the failed disk ???  watch out)  Then the array
  will be active in degraded mode with (n - 1) drives.  If
  ``mdrun'' fails, the kernel has noticed an error (for
  example, several faulty drives, or an unclean shutdown).
  Use ``dmesg'' to display the kernel error messages from
  ``mdrun''.  If the raid-5 set is corrupted due to a power
  loss, rather than a disk crash, one can try to recover by
  creating a new RAID superblock:


       mkraid -f --only-superblock raid5.conf





  A RAID array doesn't provide protection against a power
  failure or a kernel crash, and can't guarantee correct
  recovery.  Rebuilding the superblock will simply cause the
  system to ignore the condition by marking all the drives as
  ``OK'', as if nothing happened.



  11.
     Q: How does RAID-5 work when a disk fails?


       A: The typical operating scenario is as follows:

         A RAID-5 array is active.

         One drive fails while the array is active.


    The drive firmware and the low-level Linux
     disk/controller drivers detect the failure and report an
     error code to the MD driver.

    The MD driver continues to provide an error-free /dev/md0
     device to the higher levels of the kernel (with a
     performance degradation) by using the remaining
     operational drives.

    The sysadmin can umount /dev/md0 and mdstop /dev/md0 as
     usual.

    If the failed drive is not replaced, the sysadmin can
     still start the array in degraded mode as usual, by
     running mdadd and mdrun.



  12.
     Q: I just replaced a failed disk in a RAID-5 array.  After
     rebuilding the array, fsck is reporting many, many errors.  Is this
     normal?


        A: No. And, unless you ran fsck in "verify only; do not
        update" mode, it's quite possible that you have corrupted
       your data.  Unfortunately, a not-uncommon scenario is one of
       accidentally changing the disk order in a RAID-5 array,
       after replacing a hard drive.  Although the RAID superblock
       stores the proper order, not all tools use this information.
       In particular, the current version of ckraid will use the
       information specified with the -f flag (typically, the file
       /etc/raid5.conf) instead of the data in the superblock.  If
       the specified order is incorrect, then the replaced disk
       will be reconstructed incorrectly.   The symptom of this
       kind of mistake seems to be heavy & numerous fsck errors.


       And, in case you are wondering, yes, someone lost all of
       their data by making this mistake.   Making a tape backup of
       all data before reconfiguring a RAID array is strongly
       recommended.



  13.
     Q:

       A:



  14.
     Q: Why is there no question 13?


       A: If you are concerned about RAID, High Availability, and
        UPS, then it's probably a good idea to be superstitious as
       well.



  15.
     Q: The QuickStart says that mdstop is just to make sure that the
     disks are sync'ed. Is this REALLY necessary? Isn't unmounting the
     file systems enough?
       A: The command mdstop /dev/md0 will:

         mark it ''clean''. This allows us to detect unclean
          shutdowns, for example due to a power failure or a kernel
          crash.

         sync the array. This is less important after unmounting a
          filesystem, but is important if the /dev/md0 is accessed
          directly rather than through a filesystem (for example,
          by e2fsck).





  5.  Troubleshooting Install Problems


  1. Q: What is the current best known-stable patch for RAID in the
     2.0.x series kernels?


        A: As of 18 Sept 1997, it is "2.0.30 + pre-9 2.0.31 + Werner
        Fink's swapping patch + the alpha RAID patch".  As of
        November 1997, it is 2.0.31 + ... !?



  2. Q: The RAID patches will not install cleanly for me.  What's wrong?

       A: Make sure that /usr/include/linux is a symbolic link to
       /usr/src/linux/include/linux.

       Make sure that the new files raid5.c, etc.  have been copied
       to their correct locations.  Sometimes the patch command
       will not create new files.  Try the -f flag on patch.
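
        For example (a sketch only; the patch file name and the -p
        level depend on where and how the patch was generated):

             cd /usr/src/linux
             patch -p1 -f < /path/to/the-raid-patch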



  3. Q: While compiling raidtools 0.42, compilation stops while
     trying to include <pthread.h>, but it doesn't exist on my
     system.  How do I fix this?


       A: raidtools-0.42 requires linuxthreads-0.6 from:
       <ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy>
       Alternately, use glibc v2.0.



  4. Q: I get the message: mdrun -a /dev/md0: Invalid argument


       A: Use mkraid to initialize the RAID set prior to the first
       use.  mkraid ensures that the RAID array is initially in a
        consistent state by erasing the RAID partitions. In
        addition, mkraid will create the RAID superblocks.



  5. Q: I get the message: mdrun -a /dev/md0: Invalid argument The setup
     was:

    raid build as a kernel module


    normal install procedure followed ... mdcreate, mdadd, etc.

    cat /proc/mdstat shows

         Personalities :
         read_ahead not set
         md0 : inactive sda1 sdb1 6313482 blocks
         md1 : inactive
         md2 : inactive
         md3 : inactive




    mdrun -a generates the error message /dev/md0: Invalid argument



       A: Try lsmod (or, alternately, cat /proc/modules) to see if
       the raid modules are loaded.  If they are not, you can load
       them explicitly with the modprobe raid1 or modprobe raid5
       command.  Alternately,  if you are using the autoloader, and
       expected kerneld to load them and it didn't this is probably
       because your loader is missing the info to load the modules.
       Edit /etc/conf.modules and add the following lines:


           alias md-personality-3 raid1
           alias md-personality-4 raid5






  6. Q: While doing mdadd -a I get the error: /dev/md0: No such file or
     directory.  Indeed, there seems to be no /dev/md0 anywhere.  Now
     what do I do?


       A: The raid-tools package will create these devices when you
        run make install as root.  Alternately, you can do the
        following:

           cd /dev
           ./MAKEDEV md







  7. Q: After creating a raid array on /dev/md0, I try to mount it and
     get the following error:
      mount: wrong fs type, bad option, bad superblock on /dev/md0, or
     too many mounted file systems. What's wrong?

       A: You need to create a file system on /dev/md0 before you
       can mount it.  Use mke2fs.
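
        For example (the 4096-byte block size is the value
        recommended for RAID-4/5 later in this document, and /mnt is
        just an example mount point):

             mke2fs -b 4096 /dev/md0
             mount /dev/md0 /mnt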




  8. Q: Truxton Fulton wrote:

  On my Linux 2.0.30 system, while doing a mkraid for a RAID-1
  device, during the clearing of the two individual
  partitions, I got "Cannot allocate free page" errors
  appearing on the console, and "Unable to handle kernel
  paging request at virtual address ..." errors in the system
  log.  At this time, the system became quite unusable, but it
  appears to recover after a while.  The operation appears to
  have completed with no other errors, and I am successfully
  using my RAID-1 device.  The errors are disconcerting
  though.  Any ideas?




       A: This was a well-known bug in the 2.0.30 kernels.  It is
       fixed in the 2.0.31 kernel; alternately, fall back to
       2.0.29.



  9. Q: I'm not able to mdrun a RAID-1, RAID-4 or RAID-5 device.  If I
     try to mdrun an mdadd'ed device, I get the message ''invalid raid
     superblock magic''.


       A: Make sure that you've run the mkraid part of the install
       procedure.



  10.
     Q: When I access /dev/md0, the kernel spits out a lot of errors
     like md0: device not running, giving up !  and I/O error.... I've
     successfully added my devices to the virtual device.


       A: To be usable, the device must be running. Use mdrun -px
       /dev/md0 where x is l for linear, 0 for RAID-0 or 1 for
       RAID-1, etc.



  11.
     Q: I've created a linear md-dev with 2 devices.  cat /proc/mdstat
     shows the total size of the device, but df only shows the size of
     the first physical device.


       A: You must mkfs your new md-dev before using it the first
       time, so that the filesystem will cover the whole device.



  12.
     Q: I've set up /etc/mdtab using mdcreate, I've mdadd'ed, mdrun and
     fsck'ed my two /dev/mdX partitions.  Everything looks okay before a
     reboot.  As soon as I reboot, I get an fsck error on both
     partitions: fsck.ext2: Attempt to read block from filesystem
     resulted in short read while trying to open /dev/md0.  Why?! How
     do I fix it?!


       A: During the boot process, the RAID partitions must be
       started before they can be fsck'ed.  This must be done in
       one of the boot scripts.  For some distributions, fsck is
       called from /etc/rc.d/rc.S, for others, it is called from
  /etc/rc.d/rc.sysinit. Change this file so that mdadd -ar is
  run *before* fsck -A is executed.  Better yet, it is
  suggested that ckraid be run if mdadd returns with an error.
  How to do this is discussed in greater detail in question 14
  of the section ''Error Recovery''.



  13.
     Q: I get the message invalid raid superblock magic while trying to
     run an array which consists of partitions which are bigger than
     4GB.


       A: This bug is now fixed. (September 97)  Make sure you have
       the latest raid code.



  14.
     Q: I get the message Warning: could not write 8 blocks in inode
     table starting at 2097175 while trying to run mke2fs on a partition
     which is larger than 2GB.


       A: This seems to be a problem with mke2fs (November 97).  A
       temporary work-around is to get the mke2fs code, and add
       #undef HAVE_LLSEEK to e2fsprogs-1.10/lib/ext2fs/llseek.c
       just before the first #ifdef HAVE_LLSEEK and recompile
       mke2fs.



  15.
     Q: ckraid currently isn't able to read /etc/mdtab


       A: The RAID0/linear configuration file format used in
       /etc/mdtab is obsolete, although it will be supported for a
        while more.  The current, up-to-date config files are
        named /etc/raid1.conf, etc.



  16.
     Q: The personality modules (raid1.o) are not loaded automatically;
     they have to be manually modprobe'd before mdrun. How can this be
     fixed?


       A: To autoload the modules, we can add the following to
       /etc/conf.modules:

           alias md-personality-3 raid1
           alias md-personality-4 raid5






  17.
     Q: I've mdadd'ed 13 devices, and now I'm trying to mdrun -p5
     /dev/md0 and get the message: /dev/md0: Invalid argument


  A: The default configuration for software RAID is 8 real
  devices. Edit linux/md.h, change the line #define MAX_REAL 8
  so that MAX_REAL is a larger number, and rebuild the kernel.



  18.
     Q: I can't make md work with partitions on our latest SPARCstation
     5.  I suspect that this has something to do with disk-labels.


       A: Sun disk-labels sit in the first 1K of a partition.  For
       RAID-1, the Sun disk-label is not an issue since ext2fs will
       skip the label on every mirror.  For other raid levels (0,
       linear and 4/5), this appears to be a problem; it has not
       yet (Dec 97) been addressed.



  6.  Supported Hardware & Software


  1. Q: I have SCSI adapter brand XYZ (with or without several
     channels), and disk brand(s) PQR and LMN, will these work with md
     to create a linear/striped/mirrored personality?


       A: Yes!  Software RAID will work with any disk controller
       (IDE or SCSI) and any disks.  The disks do not have to be
        identical, nor do the controllers.  For example, a RAID
        mirror can be created with one half the mirror being a SCSI
        disk, and the other an IDE disk.  The disks do not even have
        to be the same size.  There are no restrictions on the
        mixing & matching of disks and controllers.


       This is because Software RAID works with disk partitions,
       not with the raw disks themselves.  The only recommendation
       is that for RAID levels 1 and 5, the disk partitions that
       are used as part of the same set be the same size. If the
       partitions used to make up the RAID 1 or 5 array are not the
       same size, then the excess space in the larger partitions is
       wasted (not used).



  2. Q: I have a twin channel BT-952, and the box states that it
     supports hardware RAID 0, 1 and 0+1.   I have made a RAID set with
     two drives, the card apparently recognizes them when it's doing
     its BIOS startup routine. I've been reading in the driver source
     code, but found no reference to the hardware RAID support.  Anybody
     out there working on that?


       A: The Mylex/BusLogic FlashPoint boards with RAIDPlus are
       actually software RAID, not hardware RAID at all.  RAIDPlus
        is only supported on Windows 95 and Windows NT, not on
        Netware or any of the Unix platforms.  Aside from booting and
       configuration, the RAID support is actually in the OS
       drivers.


       While in theory Linux support for RAIDPlus is possible, the
       implementation of RAID-0/1/4/5 in the Linux kernel is much
       more flexible and should have superior performance, so
       there's little reason to support RAIDPlus directly.
  3. Q: I want to run RAID with an SMP box.  Is  RAID SMP-safe?

       A: "I think so" is the best answer available at the time I
       write this (April 98).  A number of users report that they
       have been using RAID with SMP for nearly a year, without
       problems.  However, as of April 98 (circa kernel 2.1.9x),
       the following problems have been noted on the mailing list:

          Adaptec AIC7xxx SCSI drivers are not SMP safe (General
           note: Adaptec adapters have a long & lengthy history of
           problems & flakiness in general.  Although they seem to
           be the most easily available, widespread and cheapest
           SCSI adapters, they should be avoided.  After factoring
           for time lost, frustration, and corrupted data, Adaptecs
           will prove to be the costliest mistake you'll ever make.
           That said, if you have SMP problems with 2.1.88, try the
           patch ftp://ftp.bero-online.ml.org/pub/linux/aic7xxx-5.0.7-linux21.tar.gz
           I am not sure if this patch has been pulled into later
           2.1.x kernels.  For further info, take a look at the mail
           archives for March 98 at
           http://www.linuxhq.com/lnxlists/linux-raid/lr_9803_01/ As
           usual, due to the rapidly changing nature of the latest
           experimental 2.1.x kernels, the problems described in
           these mailing lists may or may not have been fixed by the
           time you read this. Caveat Emptor.  )



         IO-APIC with RAID-0 on SMP has been reported to crash in
          2.1.90





  7.  Modifying an Existing Installation


  1. Q: Are linear MD's expandable?  Can a new hard-drive/partition be
     added, and the size of the existing file system expanded?


       A: Miguel de Icaza <miguel@luthien.nuclecu.unam.mx> writes:

       I changed the ext2fs code to be aware of multiple-devices
        instead of the regular one device per file system
        assumption.


       So, when you want to extend a file system, you run a utility
       program that makes the appropriate changes on the new device
       (your extra partition) and then you just tell the system to
       extend the fs using the specified device.


       You can extend a file system with new devices at system
       operation time, no need to bring the system down (and
       whenever I get some extra time, you will be able to remove
       devices from the ext2 volume set, again without even having
       to go to single-user mode or any hack like that).


       You can get the patch for 2.1.x kernel from my web page:


  <http://www.nuclecu.unam.mx/~miguel/ext2-volume>





  2. Q: Can I add disks to a RAID-5 array?


       A: Currently, (September 1997) no, not without erasing all
       data. A conversion utility to allow this does not yet exist.
       The problem is that the actual structure and layout of a
       RAID-5 array depends on the number of disks in the array.

       Of course, one can add drives by backing up the array to
       tape, deleting all data, creating a new array, and restoring
       from tape.



  3. Q: What would happen to my RAID1/RAID0 sets if I shift one of the
     drives from being /dev/hdb to /dev/hdc?

     Because of cabling/case size/stupidity issues, I had to make my
     RAID sets on the same IDE controller (/dev/hda and /dev/hdb). Now
     that I've fixed some stuff, I want to move /dev/hdb to /dev/hdc.

     What would happen if I just change the /etc/mdtab and
     /etc/raid1.conf files to reflect the new location?

       A: For RAID-0/linear, one must be careful to specify the
        drives in exactly the same order. Thus, in the above
        example, if the original config is


       mdadd /dev/md0 /dev/hda /dev/hdb





  Then the new config *must* be


       mdadd /dev/md0 /dev/hda /dev/hdc






  For RAID-1/4/5, the drive's ''RAID number'' is stored in its
  RAID superblock, and therefore the order in which the disks
  are specified is not important.

  RAID-0/linear does not have a superblock due to its older
  design, and the desire to maintain backwards compatibility
  with this older design.



  4. Q: Can I convert a two-disk RAID-1 mirror to a three-disk RAID-5
     array?



  A: Yes.  Micheal at BizSystems has come up with a clever,
  sneaky way of doing this.  However, like virtually all
  manipulations of RAID arrays once they have data on them, it
  is dangerous and prone to human error.  Make a backup before
  you start.

  I will make the following assumptions:
  ---------------------------------------------
  disks
  original: hda - hdc
  raid1 partitions hda3 - hdc3
  array name /dev/md0

  new hda - hdc - hdd
  raid5 partitions hda3 - hdc3 - hdd3
  array name: /dev/md1

  You must substitute the appropriate disk and partition numbers for
  your system configuration. This will hold true for all config file
  examples.
  --------------------------------------------
  DO A BACKUP BEFORE YOU DO ANYTHING
  1) recompile kernel to include both raid1 and raid5
  2) install new kernel and verify that raid personalities are present
  3) disable the redundant partition on the raid 1 array. If this is a
   root mounted partition (mine was) you must be more careful.

   Reboot the kernel without starting raid devices or boot from rescue
   system ( raid tools must be available )

   start non-redundant raid1
  mdadd -r -p1 /dev/md0 /dev/hda3

  4) configure raid5 but with a 'funny' config file; note that there
    is no hda3 entry and hdc3 is repeated. This trick is needed because
    the raid tools will not otherwise let you build a degraded array.
  -------------------------------
  # raid-5 configuration
  raiddev                 /dev/md1
  raid-level              5
  nr-raid-disks           3
  chunk-size              32

  # Parity placement algorithm
  parity-algorithm        left-symmetric

  # Spare disks for hot reconstruction
  nr-spare-disks          0

  device                  /dev/hdc3
  raid-disk               0

  device                  /dev/hdc3
  raid-disk               1

  device                  /dev/hdd3
  raid-disk               2
  ---------------------------------------
   mkraid /etc/raid5.conf
  5) activate the raid5 array in non-redundant mode

  mdadd -r -p5 -c32k /dev/md1 /dev/hdc3 /dev/hdd3

  6) make a file system on the array

  mke2fs -b {blocksize} /dev/md1

  The blocksize recommended by some is 4096 rather than the default
  1024; this improves the memory utilization for the kernel raid
  routines and matches the blocksize to the page size. I compromised
  and used 2048 since I have a relatively high number of small files
  on my system.

  7) mount the two raid devices somewhere

  mount -t ext2 /dev/md0 mnt0
  mount -t ext2 /dev/md1 mnt1

  8) move the data

  cp -a mnt0/. mnt1

  9) verify that the data sets are identical
  10) stop both arrays
  11) correct the information for the raid5.conf file
    change /dev/md1 to /dev/md0
    change the first disk to read /dev/hda3

  12) upgrade the new array to full redundant status
   (THIS DESTROYS REMAINING raid1 INFORMATION)

  ckraid --fix /etc/raid5.conf








  8.  Performance, Tools & General Bone-headed Questions


  1. Q: I've created a RAID-0 device on /dev/sda2 and /dev/sda3. The
     device is a lot slower than a single partition. Isn't md a pile of
     junk?

        A: To have a RAID-0 device running at full speed, you must
        have partitions from different disks.  Besides, putting both
        halves of an array on the same disk fails to give you any
        protection whatsoever against disk failure.



  2. Q: How does RAID-0 handle a situation where the different stripe
     partitions are different sizes?  Are the stripes uniformly
     distributed?


        A: To understand this, let's look at an example with three
        partitions; one that is 50MB, one 90MB and one 125MB.

        Let's call D0 the 50MB disk, D1 the 90MB disk and D2 the
        125MB disk.  When you start the device, the driver
        calculates 'strip zones'.  In this case, it finds 3 zones,
       defined like this:


                   Z0 : (D0/D1/D2) 3 x 50 = 150MB  total in this zone
                   Z1 : (D1/D2)  2 x 40 = 80MB total in this zone
                   Z2 : (D2) 125-50-40 = 35MB total in this zone.




       You can see that the total size of the zones is the size of
       the virtual device, but, depending on the zone, the striping
       is different.  Z2 is rather inefficient, since there's only
       one disk.
  Since ext2fs and most other Unix file systems distribute
  files all over the disk, you have a  35/265 = 13% chance
  that a file will end up on Z2, and not get any of the
  benefits of striping.

  (DOS tries to fill a disk from beginning to end, and thus,
  the oldest files would end up on Z0.  However, this strategy
  leads to severe filesystem fragmentation, which is why no
  one besides DOS does it this way.)



  3. Q: What's the use of having RAID-linear when RAID-0 will do the
     same thing, but provide higher performance?

        A: It's not obvious that RAID-0 will always provide better
        performance; in fact, in some cases, it could make things
        worse.  The ext2fs file system scatters files all over a
        partition, and it attempts to keep all of the blocks of a
        file contiguous, basically in an attempt to prevent
        fragmentation.  Thus, ext2fs behaves "as if" there were a
        (variable-sized) stripe per file.  If there are several
        disks concatenated into a single RAID-linear, this will
        result in files being statistically distributed on each of
        the disks.  Thus, at least for ext2fs, RAID-linear will
        behave a lot like RAID-0 with large stripe sizes.
        Conversely, RAID-0 with small stripe sizes can cause
        excessive disk activity leading to severely degraded
        performance if several large files are accessed
        simultaneously.  This issue is explored further in another
        question below.



  4. Q: I have some Brand X hard disks and a Brand Y controller, and
     am considering using md.  Does it significantly increase the
     throughput?  Is the performance really noticeable?


       A: The answer depends on the configuration that you use.


          Linux MD RAID-0 and RAID-linear performance:
             If the system is heavily loaded with lots of I/O,
             statistically, some of it will go to one disk, and
             some to the others.  Thus, performance will improve
             over a single large disk.   The actual improvement
             depends a lot on the actual data, stripe sizes, and
             other factors.   In a system with low I/O usage, the
             performance is equal to that of a single disk.



          Linux MD RAID-1 (mirroring) read performance:
             MD implements read balancing. That is, the  RAID-1
             code will alternate between each of the (two or more)
             disks in the mirror, making alternate reads to each.
             In a low-I/O situation, this won't change performance
             at all: you will have to wait for one disk to complete
             the read.  But, with two disks in a high-I/O
             environment, this could as much as double the read
             performance, since reads can be issued to each of the
             disks in parallel.  For N disks in the mirror, this
             could improve performance N-fold.



     Linux MD RAID-1 (mirroring) write performance:
        Must wait for the write to occur to all of the disks
        in the mirror.  This is because a copy of the data
        must be written to each of the disks in the mirror.
        Thus, performance will be roughly equal to the write
        performance to a single disk.


     Linux MD RAID-4/5 read performance:
        Statistically, a given block can be on any one of a
        number of disk drives, and thus RAID-4/5 read
        performance is a lot like that for RAID-0.  It will
        depend on the data, the stripe size, and the
        application.  It will not be as good as the read
        performance of a mirrored array.


     Linux MD RAID-4/5 write performance:
        This will in general be considerably slower than that
        for a single disk.  This is because the parity must be
        written out to one drive as well as the data to
        another.  However, in order to compute the new parity,
        the old parity and the old data must be read first.
        The old data, new data and old parity must all be
        XOR'ed together to determine the new parity: this
        requires considerable CPU cycles in addition to the
        numerous disk accesses.



  5. Q: What is the optimal RAID-5 configuration for performance?

        A: Since RAID-5 experiences an I/O load that is equally
        distributed across several drives, the best performance will
        be obtained when the RAID set is balanced by using identical
        drives, identical controllers, and the same (low) number of
        drives on each controller.

        Note, however, that using identical components will raise
        the probability of multiple simultaneous failures, for
        example due to a sudden jolt or drop, overheating, or a
        power surge during an electrical storm. Mixing brands and
        models helps reduce this risk.



  6. Q: What is the optimal block size for a RAID-4/5 array?


        A: When using the current (November 1997) RAID-4/5
        implementation, it is strongly recommended that the file
        system be created with mke2fs -b 4096 instead of the default
        1024 byte filesystem block size.


       This is because the current RAID-5 implementation allocates
       one 4K memory page per disk block; if a disk block were just
       1K in size, then 75% of the memory which RAID-5 is
       allocating for pending I/O would not be used.  If the disk
       block size matches the memory page size, then the driver can
       (potentially) use all of the page.  Thus, for a filesystem
       with a 4096 block size as opposed to a 1024 byte block size,
       the RAID driver will potentially queue 4 times as much
       pending I/O to the low level drivers without allocating
       additional memory.

  Note: the above remarks do NOT apply to the Software
  RAID-0/1/linear driver.


  Note: the statements about 4K memory page size apply to the
  Intel x86 architecture.   The page size on Alpha, Sparc, and
  other CPUs is different; I believe it is 8K on Alpha/Sparc
  (????).  Adjust the above figures accordingly.


  Note: if your file system has a lot of small files (files
  less than 10KBytes in size), a considerable fraction of the
  disk space might be wasted.  This is because the file system
  allocates disk space in multiples of the block size.
  Allocating large blocks for small files clearly results in a
  waste of disk space: thus, you may want to stick to small
  block sizes, get a larger effective storage capacity, and
  not worry about the "wasted" memory due to the block-
  size/page-size mismatch.


  Note: most ''typical'' systems do not have that many small
  files.  That is, although there might be thousands of small
  files, this would lead to only some 10 to 100MB wasted
  space, which is probably an acceptable tradeoff for
  performance on a multi-gigabyte disk.

  However, for news servers, there might be tens or hundreds
  of thousands of small files.  In such cases, the smaller
  block size, and thus the improved storage capacity, may be
  more important than the more efficient I/O scheduling.


  Note: there exists an experimental file system for Linux
  which packs small files and file chunks onto a single block.
  It apparently has some very positive performance
  implications when the average file size is much smaller than
  the block size.


  Note: Future versions may implement schemes that obsolete
  the above discussion. However, this is difficult to
  implement, since dynamic run-time allocation can lead to
  dead-locks; the current implementation performs a static
  pre-allocation.



  7. Q: How does the chunk size (stripe size) influence the speed of my
     RAID-0, RAID-4 or RAID-5 device?


       A: The chunk size is the amount of data contiguous on the
       virtual device that is also contiguous on the physical
       device.  In this HOWTO, "chunk" and "stripe" refer to the
       same thing: what is commonly called the "stripe" in other
       RAID documentation is called the "chunk" in the MD man
       pages.  Stripes or chunks apply only to RAID 0, 4 and 5,
       since stripes are not used in mirroring (RAID-1) and simple
       concatenation (RAID-linear).  The stripe size affects both
       read and write latency (delay), throughput (bandwidth), and
        contention between independent operations (ability to
        simultaneously service overlapping I/O requests).

       Assuming the use of the ext2fs file system, and the current
       kernel policies about read-ahead, large stripe sizes are
  almost always better than small stripe sizes, and stripe
  sizes from about a fourth to a full disk cylinder in size
  may be best.  To understand this claim, let us consider the
  effects of large stripes on small files, and small stripes
  on large files.  The stripe size does not affect the read
  performance of small files:  For an array of N drives, the
  file has a 1/N probability of being entirely within one
  stripe on any one of the drives.  Thus, both the read
  latency and bandwidth will be comparable to that of a single
  drive.  Assuming that the small files are statistically well
  distributed around the filesystem, (and, with the ext2fs
  file system, they should be), roughly N times more
  overlapping, concurrent reads should be possible without
  significant collision between them.  Conversely, if very
  small stripes are used, and a large file is read
  sequentially, then a read will be issued to all of the disks
  in the array.  For the read of a single large file, the
  latency will almost double, as the probability of a block
  being 3/4'ths of a revolution or farther away will increase.
  Note, however, the trade-off: the bandwidth could improve
  almost N-fold for reading a single, large file, as N drives
  can be reading simultaneously (that is, if read-ahead is
  used so that all of the disks are kept active).  But there
  is another, counter-acting trade-off:  if all of the drives
  are already busy reading one file, then attempting to read a
  second or third file at the same time will cause significant
  contention, ruining performance as the disk ladder
  algorithms lead to seeks all over the platter.  Thus,  large
  stripes will almost always lead to the best performance. The
  sole exception is the case where one is streaming a single,
  large file at a time, and one requires the top possible
  bandwidth, and one is also using a good read-ahead
  algorithm, in which case small stripes are desired.


  Note that this HOWTO previously recommended small stripe
  sizes for news spools or other systems with lots of small
  files. This was bad advice, and here's why:  news spools
  contain not only many small files, but also large summary
  files, as well as large directories.  If the summary file is
  larger than the stripe size, reading it will cause many
  disks to be accessed, slowing things down as each disk
  performs a seek.  Similarly, the current ext2fs file system
  searches directories in a linear, sequential fashion.  Thus,
  to find a given file or inode, on average half of the
  directory will be read. If this directory is spread across
  several stripes (several disks), the directory read (e.g.
  due to the ls command) could get very slow. Thanks to Steven
  A. Reisman <sar@pressenter.com> for this correction.  Steve
  also adds:

        I found that using a 256k stripe gives much better
        performance.  I suspect that the optimum size would be the
        size of a disk cylinder (or maybe the size of the disk
        drive's sector cache).  However, disks nowadays have
        recording zones with different sector counts (and sector
        caches vary among different disk models).  There's no way to
        guarantee stripes won't cross a cylinder boundary.




  The tools accept the stripe size specified in KBytes.  You'll want
  to specify a multiple of the page size for your CPU (4KB on the x86).
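
  For example, to get the 256KB chunks mentioned above, the
  raid5.conf entry (in the config-file format shown earlier in this
  document) and a matching mdadd call would look roughly like this
  (a sketch; the partition names are only placeholders):

       chunk-size              256

       mdadd -r -p5 -c256k /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1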


  8. Q: What is the correct stride factor to use when creating the
     ext2fs file system on the RAID partition?  By stride, I mean the -R
     flag on the mke2fs command:

     mke2fs -b 4096 -R stride=nnn  ...



  What should the value of nnn be?

       A: The -R stride flag is used to tell the file system about
       the size of the RAID stripes.  Since only RAID-0,4 and 5 use
       stripes, and RAID-1 (mirroring) and RAID-linear do not, this
       flag is applicable only for RAID-0,4,5.

       Knowledge of the size of a stripe allows mke2fs to allocate
       the block and inode bitmaps so that they don't all end up on
       the same physical drive.  An unknown contributor wrote:

       I noticed last spring that one drive in a pair always had a
        larger I/O count, and tracked it down to these meta-data
       blocks.  Ted added the -R stride= option in response to my
       explanation and request for a workaround.


  For a 4KB block file system, with stripe size 256KB, one would use -R
  stride=64.
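
  In other words, stride = (chunk size) / (filesystem block size);
  here 256KB / 4KB = 64, so (with /dev/md0 as an example device):

       mke2fs -b 4096 -R stride=64 /dev/md0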

  If you don't trust the -R flag, you can get a similar effect in a
  different way.   Steven A. Reisman <sar@pressenter.com> writes:

       Another consideration is the filesystem used on the RAID-0
       device.  The ext2 filesystem allocates 8192 blocks per
       group.  Each group has its own set of inodes.  If there are
       2, 4 or 8 drives, these inodes cluster on the first disk.
       I've distributed the inodes across all drives by telling
       mke2fs to allocate only 7932 blocks per group.




  9. Q: Where can I put the md commands in the startup scripts, so that
     everything will start automatically at boot time?


       A: Rod Wilkens <rwilkens@border.net> writes:

       What I did is put ``mdadd -ar'' in the
       ``/etc/rc.d/rc.sysinit'' right after the kernel loads the
       modules, and before the ``fsck'' disk check.  This way, you
       can put the ``/dev/md?'' device in the ``/etc/fstab''. Then
       I put the ``mdstop -a'' right after the ``umount -a''
       unmounting the disks, in the ``/etc/rc.d/init.d/halt'' file.


  For raid-5, you will want to look at the return code for mdadd, and if
  it failed, do a


       ckraid --fix /etc/raid5.conf





  to repair any damage.
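
  Putting the pieces together, a sketch of the relevant script
  fragments (script names and surrounding commands vary between
  distributions; adjust the config file name to your RAID level):

       # in /etc/rc.d/rc.sysinit, after the modules are loaded
       # and before the "fsck -A" line:
       mdadd -ar
       if [ $? -ne 0 ] ; then
            ckraid --fix /etc/raid5.conf
            mdadd -ar
       fi

       # in /etc/rc.d/init.d/halt, right after "umount -a":
       mdstop -a
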
  10.
     Q: I was wondering if it's possible to setup striping with more
     than 2 devices in md0? This is for a news server, and I have 9
     drives... Needless to say I need much more than two.  Is this
     possible?


        A: Yes.  Simply list all of the drives' partitions when
        setting up the array (mdcreate / mdadd); striping is not
        limited to two devices.  Note that the default limit is 8
        devices per array; see the MAX_REAL question in the
        ''Troubleshooting Install Problems'' section if you need
        more.



  11.
     Q: When is Software RAID superior to Hardware RAID?

        A: Normally, Hardware RAID is considered superior to
        Software RAID, because hardware controllers often have a
        large cache, and can do a better job of scheduling
        operations in parallel.  However, integrated Software RAID
        can (and does) gain certain advantages from being close to
        the operating system.


       For example, ... ummm. Opaque description of caching of
       reconstructed blocks in buffer cache elided ...


       On a dual PPro SMP system, it has been reported that
       Software-RAID performance exceeds the performance of a well-
       known hardware-RAID board vendor by a factor of 2 to 5.


       Software RAID is also a very interesting option for high-
       availability redundant server systems.  In such a
        configuration, two CPUs are attached to one set of SCSI
       disks.  If one server crashes or fails to respond, then the
       other server can mdadd, mdrun and mount the software RAID
       array, and take over operations.  This sort of dual-ended
       operation is not always possible with many hardware RAID
       controllers, because of the state configuration that the
       hardware controllers maintain.



  12.
     Q: If I upgrade my version of raidtools, will it have trouble
     manipulating older raid arrays?  In short, should I recreate my
     RAID arrays when upgrading the raid utilities?


       A: No, not unless the major version number changes.  An MD
       version x.y.z consists of three sub-versions:

            x:      Major version.
            y:      Minor version.
            z:      Patchlevel version.




       Version x1.y1.z1 of the RAID driver supports a RAID array
       with version x2.y2.z2 in case (x1 == x2) and (y1 >= y2).

       Different patchlevel (z) versions for the same (x.y) version
       are designed to be mostly compatible.


  The minor version number is increased whenever the RAID
  array layout is changed in a way which is incompatible with
  older versions of the driver. New versions of the driver
  will maintain compatibility with older RAID arrays.

  The major version number will be increased if it will no
  longer make sense to support old RAID arrays in the new
  kernel code.


  For RAID-1, it's not likely that either the disk layout or
  the superblock structure will change anytime soon.  Almost
  any optimization or new feature (reconstruction,
  multithreaded tools, hot-plug, etc.) does not affect the
  physical layout.



  13.
     Q: The command mdstop /dev/md0 says that the device is busy.


       A: There's a process that has a file open on /dev/md0, or
       /dev/md0 is still mounted.  Terminate the process or umount
       /dev/md0.
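
        One way to find the offending process (assuming the fuser
        utility is installed on your system):

             fuser -v -m /dev/md0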



  14.
     Q: Are there performance tools?

        A: There is a new utility called iotrace in the
        linux/iotrace directory. It reads /proc/io-trace and
        analyses/plots its output.  If you feel your system's block
        IO performance is too low, just look at the iotrace output.



  15.
     Q: I was reading the RAID source, and saw the value SPEED_LIMIT
     defined as 1024K/sec.  What does this mean?  Does this limit
     performance?


       A: SPEED_LIMIT is used to limit RAID reconstruction speed
       during automatic reconstruction.  Basically, automatic
       reconstruction allows you to e2fsck and mount immediately
       after an unclean shutdown, without first running ckraid.
       Automatic reconstruction is also used after a failed hard
       drive has been replaced.


       In order to avoid overwhelming the system while
       reconstruction is occurring, the reconstruction thread
        monitors the reconstruction speed and slows it down if it
        is too fast.  The 1M/sec limit was arbitrarily chosen as a
       reasonable rate which allows the reconstruction to finish
       reasonably rapidly, while creating only a light load on the
       system so that other processes are not interfered with.



  16.
     Q: What about ''spindle synchronization'' or ''disk
     synchronization''?


  A: Spindle synchronization is used to keep multiple hard
  drives spinning at exactly the same speed, so that their
  disk platters are always perfectly aligned.  This is used by
  some hardware controllers to better organize disk writes.
  However, for software RAID, this information is not used,
  and spindle synchronization might even hurt performance.



  17.
     Q: How can I set up swap spaces using raid 0?  Wouldn't striped
      swap areas over 4+ drives be really fast?

        A: Leonard N. Zubkoff replies: It is really fast, but you
        don't need to use MD to get striped swap.  The kernel
        automatically stripes across equal priority swap spaces.
        For example, the following entries from /etc/fstab stripe
        swap space across five drives in three groups:


       /dev/sdg1       swap    swap    pri=3
       /dev/sdk1       swap    swap    pri=3
       /dev/sdd1       swap    swap    pri=3
       /dev/sdh1       swap    swap    pri=3
       /dev/sdl1       swap    swap    pri=3
       /dev/sdg2       swap    swap    pri=2
       /dev/sdk2       swap    swap    pri=2
       /dev/sdd2       swap    swap    pri=2
       /dev/sdh2       swap    swap    pri=2
       /dev/sdl2       swap    swap    pri=2
       /dev/sdg3       swap    swap    pri=1
       /dev/sdk3       swap    swap    pri=1
       /dev/sdd3       swap    swap    pri=1
       /dev/sdh3       swap    swap    pri=1
       /dev/sdl3       swap    swap    pri=1





  18.
     Q: I want to maximize performance.  Should I use multiple
     controllers?

        A: In many cases, the answer is yes.  Using several
        controllers to perform disk access in parallel will improve
        performance.  However, the actual improvement depends on
       your actual configuration.  For example, it has been
       reported (Vaughan Pratt, January 98) that a single 4.3GB
       Cheetah attached to an Adaptec 2940UW can achieve a rate of
       14MB/sec (without using RAID).  Installing two disks on one
       controller, and using a RAID-0 configuration results in a
       measured performance of 27 MB/sec.


       Note that the 2940UW controller is an "Ultra-Wide" SCSI
       controller, capable of a theoretical burst rate of 40MB/sec,
       and so the above measurements are not surprising.  However,
       a slower controller attached to two fast disks would be the
       bottleneck.  Note also, that most out-board SCSI enclosures
       (e.g. the kind with hot-pluggable trays) cannot be run at
       the 40MB/sec rate, due to cabling and electrical noise
       problems.


       If you are designing a multiple controller system, remember
  that most disks and controllers typically run at 70-85% of
  their rated max speeds.


  Note also that using one controller per disk can reduce the
  likelihood of system outage due to a controller or cable
  failure (In theory -- only if the device driver for the
  controller can gracefully handle a broken controller. Not
  all SCSI device drivers seem to be able to handle such a
  situation without panicking or otherwise locking up).


  9.  High Availability RAID


  1. Q: RAID can help protect me against data loss.  But how can I also
     ensure that the system is up as long as possible, and not prone to
     breakdown?  Ideally, I want a system that is up 24 hours a day, 7
     days a week, 365 days a year.


       A: High-Availability is difficult and expensive.  The harder
       you try to make a system be fault tolerant, the harder and
       more expensive it gets.   The following hints, tips, ideas
       and unsubstantiated rumors may help you with this quest.

         IDE disks can fail in such a way that the failed disk on
          an IDE ribbon can also prevent the good disk on the same
          ribbon from responding, thus making it look as if two
          disks have failed.   Since RAID does not protect against
          two-disk failures, one should either put only one disk on
          an IDE cable, or if there are two disks, they should
          belong to different RAID sets.

         SCSI disks can fail in such a way that the failed disk on
          a SCSI chain can prevent any device on the chain from
          being accessed.  The failure mode involves a short of the
          common (shared) device ready pin; since this pin is
          shared, no arbitration can occur until the short is
          removed.  Thus, no two disks on the same SCSI chain
          should belong to the same  RAID array.

         Similar remarks apply to the disk controllers.  Don't
          load up the channels on one controller; use multiple
          controllers.

         Don't use the same brand or model number for all of the
          disks.  It is not uncommon for severe electrical storms
          to take out two or more disks.  (Yes, we all use surge
          suppressors, but these are not perfect either).   Heat &
          poor ventilation of the disk enclosure are other disk
          killers.  Cheap disks often run hot.  Using different
          brands of disk & controller decreases the likelihood that
          whatever took out one disk (heat, physical shock,
          vibration, electrical surge) will also damage the others
          on the same date.

         To guard against controller or CPU failure, it should be
          possible to build a SCSI disk enclosure that is "twin-
          tailed": i.e. is connected to two computers.  One
          computer will mount the file-systems read-write, while
          the second computer will mount them read-only, and act as
          a hot spare.  When the hot-spare is able to determine
          that the master has failed (e.g.  through a watchdog), it
          will cut the power to the master (to make sure that it's
          really off), and then fsck & remount read-write.   If
     anyone gets this working, let me know.

    Always use an UPS, and perform clean shutdowns.  Although
     an unclean shutdown may not damage the disks, running
     ckraid on even small-ish arrays is painfully slow.   You
     want to avoid running ckraid as much as possible.  Or you
     can hack on the kernel and get the hot-reconstruction
     code debugged ...

    SCSI cables are well-known to be very temperamental
     creatures, and prone to cause all sorts of problems.  Use
     the highest quality cabling that you can find for sale.
      Use e.g. bubble-wrap to make sure that ribbon cables do
      not get too close to one another and cross-talk.
     Rigorously observe cable-length restrictions.

     Take a look at SSA (Serial Storage Architecture).
     Although it is rather expensive, it is rumored to be less
     prone to the failure modes that SCSI exhibits.

     Enjoy yourself, it's later than you think.


  10.  Questions Waiting for Answers


  1. Q: I want to use the stock RAID-0 available in the 2.0.34 kernel.
     Where can I find the mdtools I need to run this? The newer tools
     require the raid-1/4/5 patches to be installed in order to compile.



  2. Q: For testing the raw disk throughput...  is there a character
     device for raw read/raw writes instead of /dev/sdaxx that we can
     use to measure performance on the raid drives?  Is there a GUI
     based tool to use to watch the disk throughput?


  11.  Wish List of Enhancements to MD and Related Software

  Bradley Ward Allen <ulmo@Q.Net> wrote:

       Ideas include:

         Boot-up parameters to tell the kernel which devices are
          to be MD devices (no more ``mdadd'')

         Making MD transparent to ``mount''/``umount'' such that
          there is no ``mdrun'' and ``mdstop''

         Integrating ``ckraid'' entirely into the kernel, and
          letting it run as needed

          (So far, all I've done is suggest getting rid of the
          tools and putting them into the kernel; that's how I feel
          about it, this is a filesystem, not a toy.)

         Deal with arrays that can easily survive N disks going
          out simultaneously or at separate moments, where N is a
          whole number > 0 settable by the administrator

         Handle kernel freezes, power outages, and other abrupt
          shutdowns better

         Don't disable a whole disk if only parts of it have
          failed, e.g., if the sector errors are confined to less
     than 50% of access over the attempts of 20 dissimilar
     requests, then it continues just ignoring those sectors
     of that particular disk.

    Bad sectors:

    A mechanism for saving which sectors are bad, someplace
     onto the disk.

    If there is a generalized mechanism for marking degraded
     bad blocks that upper filesystem levels can recognize,
     use that. Program it if not.

    Perhaps alternatively a mechanism for telling the upper
     layer that the size of the disk got smaller, even
     arranging for the upper layer to move out stuff from the
      areas being eliminated.  This would help with degraded
      blocks as well.

    Failing the above ideas, keeping a small (admin settable)
     amount of space aside for bad blocks (distributed evenly
     across disk?), and using them (nearby if possible)
     instead of the bad blocks when it does happen.  Of
     course, this is inefficient.  Furthermore, the kernel
     ought to log every time the RAID array starts each bad
     sector and what is being done about it with a ``crit''
     level warning, just to get the administrator to realize
     that his disk has a piece of dust burrowing into it (or a
     head with platter sickness).

    Software-switchable disks:

     ``disable this disk''
        would block until kernel has completed making sure
        there is no data on the disk being shut down that is
        needed (e.g., to complete an XOR/ECC/other error
        correction), then release the disk from use (so it
        could be removed, etc.);

     ``enable this disk''
        would mkraid a new disk if appropriate and then start
        using it for ECC/whatever operations, enlarging the
        RAID5 array as it goes;

     ``resize array''
        would respecify the total number of disks and the
        number of redundant disks, and the result would often
        be to resize the size of the array; where no data loss
        would result, doing this as needed would be nice, but
        I have a hard time figuring out how it would do that;
        in any case, a mode where it would block (for possibly
        hours (kernel ought to log something every ten seconds
        if so)) would be necessary;

     ``enable this disk while saving data''
        which would save the data on a disk as-is and move it
        to the RAID5 system as needed, so that a horrific save
        and restore would not have to happen every time
        someone brings up a RAID5 system (instead, it may be
        simpler to only save one partition instead of two, it
        might fit onto the first as a gzip'd file even);
        finally,

     ``re-enable disk''
        would be an operator's hint to the OS to try out a
        previously failed disk (it would simply call disable
        then enable, I suppose).


  Other ideas off the net:


         finalrd analog to initrd, to simplify root raid.

         a read-only raid mode, to simplify the above

         Mark the RAID set as clean whenever there are no "half
          writes" done. -- That is, whenever there are no write
          transactions that were committed on one disk but still
          unfinished on another disk.

          Add a "write inactivity" timeout (to avoid frequent seeks
          to the RAID superblock when the RAID set is relatively
          busy).