File: xfsdump.html

<html>
<head><title>xfsdump Internals</title> </head>
<body bgcolor="#ffffff">

<h2>xfsdump Internals<br></h2>
<hr>

<h3>Table Of Contents</h3>
<ul>
  <li><a href="#caveat">Linux Caveats</a>

  <li><a href="#intro">What's in a dump</a>

  <li><a href="#dump_format">Dump Format</a>
  <ul>
    <li><a href="#media_files">Media Files</a>
    <li><a href="#inode_map">Inode Map</a>
    <li><a href="#dirs">Directories</a>
    <li><a href="#non_dirs">Non-directory files</a>
  </ul>

  <li><a href="#tape_format">Format on Tape</a>

  <li><a href="#run_time_structure">Run Time Structure</a>

  <li><a href="#xfsdump">xfsdump</a>
     <ul>
      <li><a href="#control_flow_dump">Control Flow of xfsdump</a>
      <ul>
	<li><a href="#main">The main function of xfsdump</a>
	<ul>
	  <li><a href="#drive_init1">drive_init1</a>
	  <li><a href="#content_init_dump">content_init</a>
	</ul>
	<li><a href="#dump_tape">Dumping to Tape</a>
	<ul>
	   <li><a href="#content_stream_dump">content_stream_dump</a>
	   <li><a href="#dump_file_reg">dump_file_reg</a>
	</ul>
      </ul>
      <li><a href="#reg_split">Splitting a Regular File</a>
      <ul>
	<li><a href="#split_mstream">Splitting a dump over multiple streams</a>
      </ul>
     </ul>

  <li><a href="#xfsrestore">xfsrestore</a>
    <ul>
      <li><a href="#control_flow_restore">Control Flow of xfsrestore</a>
      <li><a href="#pers_inv">Persistent Inventory and State File</a>
      <li><a href="#dirent_tree">Restore's directory entry tree</a>
      <li><a href="#cum_restore">Cumulative Restore</a>
      <ul>
	<li><a href="#tree_post">Cumulative Restore Tree Postprocessing</a>
      </ul>
      <li><a href="#partial_reg">Partial Registry</a>
    </ul>

  <li><a href="#drive_strategy">Drive Strategies</a>
    <ul>
    <li><a href="#drive_scsitape">Drive Scsitape</a>
      <ul>
      <li><a href="#reading">Reading</a>
      </ul>
    <li><a href="#librmt">Librmt</a>
    <li><a href="#drive_minrmt">Drive Minrmt</a>
    <li><a href="#drive_simple">Drive Simple</a>
    </ul>

  <li><a href="#inventory">Online Inventory</a>

  <li><a href="#Q&A">Questions and Answers</a>
  <ul>
    <li><a href="#DMF">How is -a and -z handled by xfsdump ?</a>
    <li><a href="#dump_size_est">How does it compute estimated dump size ?</a>
    <li><a href="#dump_size_ac">Is the dump size message accurate ?</a>
  </ul>

  <li><a href="#out_quest">Outstanding Questions</a>

</ul>

<hr>
<h3><a name="caveat">Linux Caveats</a></h3>
These notes were written for xfsdump and xfsrestore on IRIX, so they
refer to some features that are not supported on Linux.
For example, the references to multiple streams/threads/drives do not
pertain to xfsdump/xfsrestore on Linux. Also, the DMF support in xfsdump
is not yet useful on Linux.

<hr>
<h3><a name="intro">What's in a dump</a></h3>
Xfsdump is used to dump out an XFS filesystem to a file, tape
or stdout. The dump includes the following filesystem object types:
<ul>
<li>directories (S_IFDIR)
<li>regular files (S_IFREG)
<li>sockets (S_IFSOCK)
<li>symlinks (S_IFLNK)
<li>character special files (S_IFCHR)
<li>block special files (S_IFBLK)
<li>named pipes (S_IFIFO)
</ul>
It does not dump files from <i>/var/xfsdump</i> which is where the
xfsdump inventory is located.
Other data which is stored:
<ul>
<li> file attributes (stored in stat data) of owner, group, permissions,
and date stamps
<li> any extended attributes associated with these file objects
<li> extent information, stored so that holes can be reconstructed
on restore
<li> actual file data of the extents
</ul>

<hr>
<h3><a name="dump_format">Dump Format</a></h3>

The dump format is the layout of the data for storage in a dump.
It is defined mostly at an abstraction layer above the media dump format
(tape or data file).
The tape format, for example, has extra header records,
and a tape dump is split into multiple media files, whereas
the data file format uses a single media file.
<p>


<h4><a name="media_files">Media Files</a></h4>
<img src="media_files.gif">
<p>
Media files are probably used to provide a way of
recovering more data in xfsrestore(1) should there be
some media error: they provide a self-contained unit
for restoration.
If the dump media is a disk file (drive_simple.c) then I
believe that only one media file is used, whereas on tape
media, multiple media files are used, depending upon the size
of the media file. The media file size is set according
to the drive type (in IRIX): QIC: 50MB; DAT: 512MB; Exabyte: 2GB; DLT: 4GB;
others: 256MB. This value (the media file size) can now be changed
with the "-d" option. Also, on tape, the dump is finished by an inventory
media file followed by a terminating null media file.
<p>
A global header is placed at the start of each media file.
<hr>
<img src="global_hdr.gif" align=right>
<pre>
<b>global_hdr_t</b> (4K bytes)
magic# = "xFSdump0"
version#
checksum
time of dump
ip address
dump id
hostname
dump label
pad to 1K bytes
<b>drive_hdr_t</b> (3K bytes)
    drive count
    drive index
    strategy id = on-file, on-tape, on-rmt-tape
    pad to 512 bytes

    specific (512 bytes)
        tape:
	    <b>rec_hdr</b>
	    magic# - tape magic = 0x13579bdf02468aceLL
	    version#
	    block size
	    record size
	    drive capabilities
	    record's byte offset in media file
	    byte offset of first mark set
	    size (bytes) of record containing user data
	    checksum (if -C used)
	    ischecksum (= 1 if -C used)
	    dump uuid
	    pad to 512 bytes

    upper: (2K bytes)
	<b>media_hdr_t</b>
	media-label
	previous media-label
	media-id
	previous media-id
	5 media indexes - (indices of object/file within stream/media-object)
	strategy id = on-file, on-tape, on-rmt-tape
	strategy specific data:
	  field to denote if media file is a terminator (old fmt)
	upper: (to 2K)
</pre>
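<p>
As a rough illustration of how a reader might sanity-check this layout,
the sketch below scans a 4K buffer read from the start of a media file
for the global magic string at offset 0 and, for the tape strategy, the
rec_hdr magic buried at byte offset 1536 (1K + 512). The function name
and return convention are hypothetical; byte order handling in the real
code may differ.

```c
#include <string.h>
#include <stdint.h>

#define GH_MAGIC   "xFSdump0"              /* 8 bytes, no NUL */
#define TAPE_MAGIC 0x13579bdf02468aceULL   /* rec_hdr tape magic */

/* Hypothetical check of a 4K global header buffer.
 * Returns 0 if not a global header, 1 if it is one without a tape
 * rec_hdr, 2 if the tape-strategy rec_hdr magic is present too. */
static int
looks_like_global_hdr(const unsigned char *buf)
{
	uint64_t tape_magic;

	if (memcmp(buf, GH_MAGIC, 8) != 0)
		return 0;
	/* rec_hdr sits inside the drive_hdr at offset 1536 (1K + 512) */
	memcpy(&tape_magic, buf + 1536, sizeof(tape_magic));
	return tape_magic == TAPE_MAGIC ? 2 : 1;
}
```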

<p>
Note that the <i>strategy id</i> is checked on restore to ensure that
the dump strategy and the strategy used by restore
are the same, with the exception that drive_scsitape matches
drive_minrmt. This strategy check has caused problems for customers
in the past.
In particular, if one sends xfsdump's stdout to a tape
(i.e. xfsdump -L test -M test - / >/dev/tape) then one cannot
restore this tape using xfsrestore by specifying the tape with the -f option.
There was also a problem for a time where, if one used a drive with
the TS tape driver, xfsdump would not recognise the driver and
would select the drive_simple strategy.

<hr>


<h4><a name="inode_map">Inode Map</a></h4>
<img src="inode_map.gif">


<h4><a name="dirs">Directories</a></h4>
<img src="directories.gif">


<h4><a name="non_dirs">Non-directory files</a></h4>
<img src="files.gif">
<br>
Regular files, as can be seen from above, have a list
of extents followed by the file's extended attributes.
If the file is large and/or the dump is to multiple streams,
then the file can be dumped in multiple records or extent groups.
(See <a href="#reg_split">Splitting a Regular File</a>).

<h3><a name="tape_format">Format on Tape</a></h3>
At the beginning of each tape record is a header. However, for
the first record of a media file, the record header is buried
inside the global header at byte offset 1536 (1K + 512), as is shown in
the global header diagram.
Reproduced again:
<pre>
<b>rec_hdr</b>
magic# - tape magic = 0x13579bdf02468aceLL
version#
block-size
record-size
drive capabilities
record's byte offset in media file
byte offset of first mark set
size (bytes) of record containing user data
checksum (if -C used)
ischecksum (= 1 if -C used)
dump uuid
pad to 512 bytes
</pre>
<p>
I cannot see where the block-size ("tape_blksz") is ever used!
The record-size ("tape_recsz") is used as the byte count to do
the actual write and read system calls.
<p>
There is another layer of software for the actual data on the tape.
Although one may write out an inode map or directory entries,
one does not pass these record buffers straight to the
write system call. Instead, these data objects are
written to buffers (akin to &lt;stdio&gt;). Another thread reads
from these buffers (unless it is running single-threaded) and writes
them to tape.
Specifically, inside a loop,
one calls <b>do_get_write_buf</b>,
copies over the data one wants stored and then
calls <b>do_write_buf</b>, until the entire data buffer
has been copied over.
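<p>
The buffered-write loop can be sketched as follows. The function-pointer
signatures here are simplified stand-ins for the real drive ops, which take
a drive_t pointer and return error codes; the names mirror the ops above
but this is illustrative code, not xfsdump's.

```c
#include <string.h>
#include <stddef.h>

/* Simplified, hypothetical signatures for the two drive ops. */
typedef char *(*get_write_buf_t)(size_t wanted, size_t *actual);
typedef int   (*write_buf_t)(char *buf, size_t used);

/* Copy `len` bytes of data into drive buffers, one chunk per
 * get/write pair, until everything has been handed over. */
static int
write_all(const char *data, size_t len,
	  get_write_buf_t get_write_buf, write_buf_t write_buf)
{
	while (len > 0) {
		size_t actual;
		char *buf = get_write_buf(len, &actual);
		if (!buf)
			return -1;
		if (actual > len)
			actual = len;      /* copy no more than remains */
		memcpy(buf, data, actual);
		if (write_buf(buf, actual) != 0)
			return -1;
		data += actual;
		len  -= actual;
	}
	return 0;
}

/* Minimal in-memory "drive" for demonstration: hands out an 8-byte
 * staging buffer and appends each written chunk to `sink`. */
static char sink[64];
static size_t sink_off;
static char chunk[8];

static char *demo_get(size_t wanted, size_t *actual)
{ (void)wanted; *actual = sizeof(chunk); return chunk; }

static int demo_put(char *buf, size_t used)
{ memcpy(sink + sink_off, buf, used); sink_off += used; return 0; }
```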

<hr>

<h3><a name="run_time_structure">Run Time Structure</a></h3>

This section reviews the run time structure and failure handling in
dump/restore (see IRIX PV 784355).

The diagram below gives a schematic of the runtime structure
of a dump/restore session to multiple drives.
<p>
<pre>

1.           main process	main.c
	       /  |   \
	      /   |    \
2.	stream  stream  stream	dump/content.c restore/content.c
       manager  manager manager
	   |      |      |
3.	 drive  drive   drive	common/drive.[hc]
	object  object  object
	   |      |      |
4.	   O      O      O	ring buffers common/ring.[ch]
           |      |      |
5.	worker  worker  worker	ring_create(... ring_worker_entry ...)
	thread  thread  thread
	   |      |      |
6.	 drive  drive   drive	physical drives
	device  device  device

</pre>
<p>
Each stream is broken into two threads of control: a stream manager;
and a drive manager. The drive manager provides an abstraction of the
tape device that allows multiple classes of device to be handled
(including normal files). The stream manager implements the actual
dump or restore functionality. The main process and stream managers
interact with the drive managers through a set of device ops
(e.g.: do_write, do_set_mark, ... etc).
<p>
The process hierarchy is shown above. main() first initialises
the drive managers with calls to the drive_init functions. In
addition to choosing and assigning drive strategies and ops for each
drive object, the drive managers initialise a ring buffer and (for
devices other than simple UNIX files) sproc off a worker thread
that handles IO to the tape device. This initialisation happens in the
drive_manager code and is not directly visible from main().
<p>
main() takes direct responsibility for initialising the stream
managers, calling the child management facility to perform the
sprocs. Each child begins execution in childmain(), runs either
content_stream_dump or content_stream_restore and exits with the
return code from these functions.
<p>
Both the stream manager processes and the drive manager workers
set their signal disposition to ignore HUP, INT, QUIT, PIPE,
ALRM, CLD (and for the stream manager TERM as well).
<p>
The drive manager worker processes are much simpler: they are
initialised with a call to ring_create and begin execution in
ring_worker_func. The ring structure must also be initialised with
two ops that are called by the spawned thread: a ring read op and a write op.
The stream manager communicates with the tape manager across this ring
structure using Ring_put's and Ring_get's.
<p>
The worker thread sits in a loop processing messages that come across
the ring buffer. It ignores signals and does not terminate until it
receives a RING_OP_DIE message. It then exits 0.
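<p>
The worker's message loop can be sketched like this. The struct and the
way messages are fetched are stand-ins for the real ring in common/ring.c
(where Ring_get blocks on the shared buffer); only the dispatch-until-DIE
shape is the point here.

```c
/* Ring message opcodes; RING_OP_DIE is the only way the worker stops. */
enum ring_op { RING_OP_READ, RING_OP_WRITE, RING_OP_DIE };

/* Hypothetical stand-in for the ring: a scripted message stream and
 * counters in place of the real read/write ops. */
struct ring_sketch {
	const enum ring_op *msgs;   /* stand-in for blocking Ring_get */
	int reads, writes;
};

/* Process messages until RING_OP_DIE.  Returns 0; the real worker
 * calls exit(0) at this point, having ignored signals throughout. */
static int
ring_worker_sketch(struct ring_sketch *r)
{
	for (;;) {
		enum ring_op op = *r->msgs++;        /* Ring_get */
		switch (op) {
		case RING_OP_READ:  r->reads++;  break;  /* ring read op */
		case RING_OP_WRITE: r->writes++; break;  /* ring write op */
		case RING_OP_DIE:   return 0;
		}
	}
}
```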
<p>
The main process sleeps waiting for any of its children to die
(i.e. waiting for a SIGCLD). All children that it cares about (stream
managers and ring buffer workers) are registered through the child
manager abstraction. When a child dies, its wait status and other info are
stored with its entry in the child manager. main() ignores the deaths
of children (and grandchildren) that are not registered through the child
manager. The return status of these subprocesses is checked
and in the case of an error is used to determine the overall exit code.
<p>
We do not expect worker threads to ever die unexpectedly: they ignore
most signals and only exit when they receive a RING_OP_DIE at which
point they drop out of the message processing loop and always signal success.
<p>
Thus the only child processes that can affect the return status of
dump or restore are the stream managers, and these processes take
their exit status from the values returned by
<b>content_stream_dump</b> and <b>content_stream_restore</b>.

<hr>

<h3><a name="xfsdump">xfsdump</a></h3>

<h4><a name="control_flow_dump">Control Flow of xfsdump</a></h4>

Below is a higher-level summary of the control flow. Further details
are given later.
<ul>
<li> initialize the drive strategy for a tape, file, minimal remote tape
<li> create the global header
</ul>

<p>
<b>content_init</b> (xfsdump version)
<p>
Up to 5 phases are performed, which create and prune the inode map,
calculate an estimate of the file data size and, using that,
create inode ranges for multi-stream dumps if pertinent.
<ul>
<li> <b>phase 1</b>: create a subtree list based on the -s subtree spec.
<li> <b>phase 2</b>: create the inode map <br>
  The inode map stores the type of the inode (directory or non-directory)
  and a state value saying whether it has changed or not.
  The inode map is built by processing each inode (using bulkstat);
  to work out whether an inode should be marked as changed,
  its date stamp is compared with the date of the base or interrupted
  dump.
  We also accumulate the size for non-dir regular files (bs_blocks * bs_blksize)
<li><b>phase 3</b>: prune the unneeded subtrees due to the set of
  unchanged directories or the subtrees specified in -s (phase 1).
  This works by marking higher level directories as unchanged
  (MAP_DIR_NOCHNG) in the inode map.
<li><b>phase 4</b>: estimate non-dir (file) size if pruning was done
  since phase 2.
  It calculates this by processing each inode (using bulkstat)
  and looking up the inode map to see if it is a changed non-dir (file).
  If it is then it uses (bs_blocks * bs_blksize) as in phase 2.
<li><b>phase 5</b>: if we have multiple streams, then
  it splits up the dump to try to give each stream a set of inodes
  which has an equal amount of file data.
  See the section on "Splitting a dump over multiple streams" below.
</ul>
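<p>
The phase-2 change test described above can be sketched as a comparison of
the later of an inode's ctime and mtime against a reference time (the base
dump's time, or the interrupted dump's time when resuming). The function
and its boundary handling are illustrative; the real test lives in cb_add
and may differ at the boundary.

```c
#include <time.h>

/* Hypothetical sketch of the "has this inode changed?" test: the inode
 * is included in an incremental dump if the later of its ctime and
 * mtime is at or after the reference time. */
static int
inode_changed(time_t ctime_, time_t mtime_, time_t reftime)
{
	time_t latest = ctime_ > mtime_ ? ctime_ : mtime_;
	return latest >= reftime;
}
```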

<ul>
<li> if there is 1 stream, then we call <b>content_stream_dump</b> directly;
  if there are multiple streams, then we create child sprocs which call
  <b>content_stream_dump</b>.
</ul>

<p>
<b>content_stream_dump</b>
<ul>
<li> write global header
<li> loop dumping media files
    <ul>
    <li> dump the changed/needed directories by processing all inodes from bulkstat
      <ul>
      <li> dump the filehdr based on the bulkstat structure
      <li> dump the directory entries (using getdents())
      <li> dump a null dirent terminator
      <li> dump extended attributes on directory if it has them
      </ul>
    <li> dump the changed/needed files by processing all inodes from bulkstat
      (check the multistream range to see if it should be dumped by
       this particular stream)
      <ul>
      <li> dump the filehdr
      <li> dump the extents (called extent groups - max at 16Mb)
        <ul>
	<li> align to page boundary by dumping EXTENTHDR_TYPE_ALIGN records
	<li> dump data as EXTENTHDR_TYPE_DATA records
        </ul>
      <li> dump a null terminator, EXTENTHDR_TYPE_LAST
      </ul>
    <li> if not EOM then write null file header
    <li> end the media file
    <li> update online inventory
    </ul>
<li> if multiple-media dump (i.e. tape dump and not file dump) then
  <ul>
  <li> dump the session inventory to a media file
  <li> dump the terminator to a media file
  </ul>
</ul>
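<p>
Since each file's data is dumped in extent groups capped at 16MB, the
number of extent-group records a file needs follows directly from its data
size. A small sketch of that arithmetic (the constant and function name
are illustrative):

```c
#include <stdint.h>

/* A file's data is dumped in extent groups of at most 16 MiB each,
 * followed by an EXTENTHDR_TYPE_LAST terminator. */
#define EXTENT_GROUP_MAX ((uint64_t)16 * 1024 * 1024)

/* Number of extent groups needed for `nbytes` of file data. */
static uint64_t
extent_group_count(uint64_t nbytes)
{
	if (nbytes == 0)
		return 0;
	return (nbytes + EXTENT_GROUP_MAX - 1) / EXTENT_GROUP_MAX;
}
```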

<hr>

<h5><a name="main">The main function of xfsdump</a></h5>

<pre>
* <b><a name="drive_init1">drive_init1</a></b> - initialize drive manager for each stream
  - go thru cmd options looking for -f device
  - each device requires a drive-manager and hence an sproc
    (sproc = IRIX lightweight process)
  - if supposed to run single threaded then can only
    support one device

  - ?? each drive but drive-0 can complete file from other stream
  - allocate drive structures for each one -f d1,d2,d3
  - if "-" specified for std out then only one drive allowed

  - for each drive it tries to pick best strategy manager
    - there are 3 strategies
      1) simple - for dump on file
      2) scsitape - for dump on tape
      3) minrmt - minimal protocol for remote tape (non-SGI)
    - for given drive it is scored by each strategy given
      the drive record which basically has device name,
      and args
    - set drive's strategy to the best one and
      set its strategy's mark separation and media file size
    - instantiate the strategy
      - set flags given the args
      - for drive_scsitape/ds_instantiate
	  - if single-threaded then allocate a buffer of
	    STAPE_MAX_RECSZ page aligned
	  - otherwise, create a ring buffer
      - note if remote tape (has ":" in name)
      - set capabilities of BSF, FSF, etc.

* <b>create global header</b>
  - store magic#, version, date, hostid, uuid, hostname
  - process args for session-id, dump-label, ...

* if have sprocs, then install signal handlers and hold the
  signals (don't deliver but keep 'em pending)

* <b><a name="content_init_dump">content_init</a></b>

  * inomap_build() - stores stream start-points and builds inode map

  - <b>phase1</b>: parsing subtree selections (specified by -s options)
    <b>INPUT</b>:
	- sub directory entries (from -s)
    <b>FLOW</b>:
	- go thru each subtree and
	  call diriter(callback=subtreelist_parse_cb)
	  - diriter on subtreelist_parse_cb
	    - open_by_handle() on dir handle
	    - getdents()
	    - go thru each entry
		- bulkstat for given entry inode
		- gets stat buf for callback - use inode# and mode (type)
		- call callback (subtreelist_parse_cb())
	  * subtreelist_parse_cb
	    - ensure arg subpath matches dir.entry subpath
	    - if so then add to subtreelist
	    - recurse thru rest of subpaths (i.e. each dir in path)
    <b>OUTPUT</b>:
	- linked list of inogrp_t = pagesize of inode nums
	- list of inodes corresponding to subtree path names

  - premptchk: progress report, return if got a signal

  - <b>phase2</b>: creating inode map (initial dump list)
    <b>INPUT</b>:
      - bulkstat records on all the inodes in the file system
    <b>FLOW</b>:
      - bigstat_init on cb_add()
	  - loops doing bulkstats (using syssgi() or ioctl())
	    until system call returns non-zero value
	  - each bulkstat returns a buffer of struct xfs_bstat records
	    (buffer of size bulkreq.ocount)
	  - loop thru each struct xfs_bstat record for an inode
	    calling cb_add()
	  * cb_add
	    - looks at latest mtime|ctime and
	      if inode is resumed:
		 compares with cb_resumetime for change
	      if have cb_last:
		 compares with cb_lasttime for change
	    - add inode to map (map_add) and note if has changed or not
	    - call with state of either
		changed - MAP_DIR_CHANGE, MAP_NDR_CHANGE
		not changed - MAP_DIR_SUPPRT or MAP_NDR_NOCHNG
	    - for changed non-dir REG inode,
	      data size for its dump is added by bs_blocks * bs_blksize
	    - for non-changed dir, it sets flag for &lt;pruneneeded&gt;
	      => we don't want to process this later !
	  * map_add
	    - segment = &lt;base, 64-low, 64-mid, 64-high&gt;
		      = like 64 * 3-bit values (use 0-5)
		      i.e. for 64 inodes, given start inode number
		#define MAP_INO_UNUSED  0 /* ino not in use by fs -
                                             Used for lookup failure */
		#define MAP_DIR_NOCHNG  1 /* dir, ino in use by fs,
                                             but not dumped */
		#define MAP_NDR_NOCHNG  2 /* non-dir, ino in use by fs,
                                             but not dumped */
		#define MAP_DIR_CHANGE  3 /* dir, changed since last dump */

		#define MAP_NDR_CHANGE  4 /* non-dir, changed since last dump */

		#define MAP_DIR_SUPPRT  5 /* dir, unchanged
                                             but needed for hierarchy */
		- hunk = 4 pages worth of segments, max inode#, next ptr in list
	    - i.e. map = linked list of 4 pages of segments of 64 inode states
    <b>OUTPUT</b>:
	- inode map = list of all inodes of file system and
	  for each one there is an associated state variable
	  describing type of inode and whether it has changed
	- the inode numbers are stored in chunks of 64
	  (with only the base inode number explicitly stored)

  - premptchk: progress report, return if got a signal

  - if &lt;pruneneeded&gt; (i.e. non-changed dirs) OR subtrees specified (-s)
    - <b>phase3</b>:  pruning inode map (pruning unneeded subtrees)
	<b>INPUT</b>:
	    - subtree list
	    - inode map
	<b>FLOW</b>:
	- bigstat_iter on cb_prune() per inode
	* cb_prune
	  - if have subtrees and subtree list contains inode
	    -> need to traverse every group (inogrp_t) and
               every page of inode#s
	    - diriter on cb_count_in_subtreelist
	      * cb_count_in_subtreelist:
	      - looks up each inode# (in directory iteration) in subtreelist
	      - if exists then increment counter
	    - if at least one inode in list
	      - diriter on cb_cond_del
	      * cb_cond_del:
            - TODO

	<b>OUTPUT</b>:
            - TODO

- TODO: phase4 and phase5

- if single-threaded (miniroot or pipeline) then
    * drive_init2
	- for each drive
	    * drive_allochdrs
	    * do_init
    * <b>content_stream_dump</b>
    - return

- else (multithreaded std. case)
    * drive_init2 (see above)
    * drive_init3
	- for each drive
	    * do_sync
    - for each stream create a child manager
	* cldmgr_create
	    * childmain
		* <b>content_stream_dump</b>
		* do_quit

- loop waiting for children to die
* content_complete

</pre>
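The 64-inode segments of the phase-2 inode map above can be sketched in C. This is an illustrative sketch only, assuming the 3-bit state of each inode is spread across three 64-bit bit-plane words per segment; the struct, field, and helper names below are ours, not the real inomap interface.

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of one inode-map segment: states for 64 consecutive inodes,
 * 3 bits each, stored as three 64-bit bit-planes plus the base inode
 * number (only the base is stored explicitly, as noted above). */
#define SEG_INO_CNT 64

typedef struct seg {
    uint64_t base;   /* inode number of slot 0 */
    uint64_t lobits; /* bit 0 of each inode's state */
    uint64_t mebits; /* bit 1 of each inode's state */
    uint64_t hibits; /* bit 2 of each inode's state */
} seg_t;

/* state codes from the notes above */
#define MAP_INO_UNUSED  0
#define MAP_DIR_NOCHNG  1
#define MAP_NDR_NOCHNG  2
#define MAP_DIR_CHANGE  3
#define MAP_NDR_CHANGE  4
#define MAP_DIR_SUPPRT  5

static void seg_set_state(seg_t *s, uint64_t ino, unsigned state)
{
    unsigned slot = (unsigned)(ino - s->base); /* 0..63 */
    uint64_t bit = (uint64_t)1 << slot;
    s->lobits = (s->lobits & ~bit) | ((state & 1) ? bit : 0);
    s->mebits = (s->mebits & ~bit) | ((state & 2) ? bit : 0);
    s->hibits = (s->hibits & ~bit) | ((state & 4) ? bit : 0);
}

static unsigned seg_get_state(const seg_t *s, uint64_t ino)
{
    unsigned slot = (unsigned)(ino - s->base);
    return (unsigned)(((s->lobits >> slot) & 1)
                    | (((s->mebits >> slot) & 1) << 1)
                    | (((s->hibits >> slot) & 1) << 2));
}
```

A hunk then holds 4 pages worth of such segments plus a max inode# and a next pointer, forming the linked list described above.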

<hr>
<h5><a name="dump_tape">Dumping to Tape</a></h5>

<pre>
* <b><a name="content_stream_dump">content_stream_dump</a></b>
  * Media_mfile_begin
    write out global header (includes media header; see below)

  - loop dumping media files
    * inomap_dump()
      - dumps out the linked list of hunks of state maps of inodes

    * dump_dirs()
      - bulkstat through all inodes of file system

      * dump_dir()
        - lookup inode# in inode map
        - if state is UNUSED or NOCHANGED then skip inode dump
        - jdm_open() = open_by_handle() on directory
        * dump_filehdr()
          - write out 256-byte padded file header
          - header = &lt;offset, flags, checksum, 128-byte bulk stat structure &gt;
          - bulkstat struct derived from struct xfs_bstat
            - stnd. stat stuff + extent size, #of extents, DMI stuff
          - if HSM context then
            - modify bstat struct to make it offline
        - loops calling getdents()
          - does a bulkstat or bulkstat-single of dir inode
          * dump_dirent()
            - fill in direnthdr_t record
            - &lt;ino, gen & DENTGENMASK, record size,
                  checksum, variable length name (8-char padded)&gt;
              - gen is from statbuf.bs_gen
            - write out record
        - dump null direnthdr_t record
        - if dumpextattr flag on and it
          has extended attributes (check bs_xflags)
          * dump_extattrs
            * dump_filehdr() with flags of FILEHDR_FLAGS_EXTATTR
              - for root and non-root attributes
                - get attribute list (attr_list_by_handle())
            * dump_extattr_list
              - TODO

    - bigstat iter on dump_file()
      - go thru each inode in file system and apply dump_file
      * dump_file()
	- if file's inode# is less than the start-point then skip it
	  -> presume other sproc handling dumping of that inode
	- if file's inode# is greater than the end-point then stop the loop
	- look-up inode# in inode map
	- if not in inode-map OR hasn't changed then skip it
	- else if stat says it is NOT a non-dir (i.e. it is a dir) then we have an error
	- if have an hsm context then initialize context
	- call dump function depending on file type (S_IFREG, S_IFCHR, etc.)

	  * <b>dump_file_reg</b> (for S_IFREG):
	    -> see below

	  * dump_file_spec (for S_IFCHAR|BLK|FIFO|NAM|LNK|SOCK):
	    - dump file header
	    - if file is S_IFLNK (symlink) then
	      - read link by handle into buffer
	      - dump extent header of type, EXTENTHDR_TYPE_DATA
	      - write out link buffer (i.e. symlink string)

	  - if dumpextattr flag on and it
	    has extended attributes (check bs_xflags)
	    * dump_extattrs (see the same call in the dir case above)

    - set mark

    - if haven't hit EOM (end of media) then
      - write out null file header
      - set mark

    - end media file by do_end_write()

    - if got an inventory stream then
      * inv_put_mediafile
	- create an inventory-media-file struct (invt_mediafile_t)
	  - &lt; media-obj-id, label, index, start-ino#, start-offset,
		 end-ino#, end-offset, size = #recs in media file, flag &gt;
	* stobj_put_mediafile

  - end of loop of media file dumping
  - lock and increment the thread done count

  - if dump supports multiple media files (tapes do but dump-files don't) then
    - if multi-threaded then
      - wait for all threads to have finished dumping
        (loops sleeping for 1 second each iter)
    * dump_session_inv
      * inv_get_sessioninfo
        (get inventory session data buffer)
        * stobj_get_sessinfo
        * stobj_pack_sessinfo
      * Media_mfile_begin
      - write out inventory buffer
      * Media_mfile_end
      * inv_put_mediafile (as described above)
    * dump_terminator
      * Media_mfile_begin
      * Media_mfile_end
</pre>
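The 256-byte padded file header written by dump_filehdr() above can be sketched as a C struct. The field names and the opaque 128-byte bulkstat blob are illustrative assumptions, not xfsdump's actual filehdr_t/bstat_t declarations; the point is only the fixed layout and padding.

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the 256-byte padded file header:
 * <offset, flags, checksum, 128-byte bulkstat structure>, padded. */
#define FILEHDR_SZ 256
#define BSTAT_SZ   128

typedef struct filehdr {
    int64_t  fh_offset;              /* byte offset of this header */
    int32_t  fh_flags;               /* e.g. FILEHDR_FLAGS_EXTATTR */
    uint32_t fh_checksum;            /* header checksum */
    unsigned char fh_stat[BSTAT_SZ]; /* bulkstat-derived record */
    unsigned char fh_pad[FILEHDR_SZ - 8 - 4 - 4 - BSTAT_SZ];
} filehdr_t;
```

The explicit pad array keeps the on-media record at exactly 256 bytes regardless of trailing compiler padding.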
<hr>

<pre>
* <b><a name="dump_file_reg">dump_file_reg</a></b> (for S_IFREG):
  - if this is the start inode, then set the start offset
  - fixup offset for resumed dump
  * init_extent_group_context
    - init context - reset getbmapx struct fields with offset=0, len=-1
    - open file by handle
    - ensure Mandatory lock not set
  - loop dumping extent group
    - dump file header
    * dump_extent_group() [content.c]
      - set up realtime I/O size
      - loop over all extents
	- dump extent
	  - stop if we reach stop-offset
	  - stop if offset is past file size i.e. reached end
	  - stop if exceeded per-extent size

	  - if next-bmap is at or past end-bmap then get a bmap
	    - fcntl( gcp->eg_fd, F_GETBMAPX, gcp->eg_bmap[] )
	    - if have an hsm context then
	      - call HsmModifyExtentMap()
	    - next-bmap = eg_bmap[1]
	    - end-bmap = eg_bmap[eg_bmap[0].bmv_entries+1]

	  - if bmap entry is a hole (bmv_block == -1) then
	    - if dumping ext.attributes then
	      - dump extent header with bmap's offset,
		extent-size and type EXTENTHDR_TYPE_HOLE

	    - move onto next bmap
	      - if bmap's (offset + len)*512 > next-offset then
		update next-offset to this
	      - inc ptr

	  - if bmap entry has zero length then
	    - move onto next bmap

	  - get extsz and offset from bmap's bmv_offset*512 and bmv_length*512

	  - about 8 different conditions to test for
	    - cause function to return OR
	    - cause extent size to change OR...

	  - if realtime or extent at least a PAGE worth then
	    - align write buffer to a page boundary
	    - dump extent header of type, EXTENTHDR_TYPE_ALIGN

	  - dump extent header of type, EXTENTHDR_TYPE_DATA
	  - loop thru extent data to write extsz worth of bytes
	    - ask for a write buffer of extsz but get back actualsz
	    - lseek to offset
	    - read data of actualsz from file into buffer
	    - write out buffer
	    - if at end of file and have left over space in the extent then
	      - pad out the rest of the extent
	    - if next offset is at or past next-bmap's offset+len then
	      - move onto next bmap
    - dump null extent header of type, EXTENTHDR_TYPE_LAST
    - update bytecount and media file size
  - close the file

</pre>

<hr>

<h4><a name="reg_split">Splitting a Regular File</a></h4>
If a regular file is larger than 16MB
(maxextentcnt = drivep->d_recmarksep
              = recommended max. separation between marks),
then it is broken up into multiple extent groups, each with its
own filehdr_t.
A regular file can also be split if we are dumping to multiple
streams and the file would span a stream boundary.
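The 16MB split arithmetic can be sketched as a small helper (the helper name is ours; the threshold is the scsitape default d_recmarksep quoted above):

```c
#include <stdint.h>
#include <assert.h>

/* How many extent groups (each with its own filehdr_t) a regular
 * file of a given size is broken into, assuming the default
 * maxextentcnt of 16MB.  Illustrative sketch only. */
#define MAXEXTENTCNT ((uint64_t)16 * 1024 * 1024)

static uint64_t extent_group_count(uint64_t filesz)
{
    if (filesz == 0)
        return 1;   /* assume even an empty file gets one group */
    return (filesz + MAXEXTENTCNT - 1) / MAXEXTENTCNT;
}
```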

<h4><a name="split_mstream">Splitting a dump over multiple streams (Phase 5)</a></h4>
If one is dumping to multiple streams, then xfsdump calculates an
estimate of the dump size and divides it by the number of streams to
determine how much data to allocate to each stream.
The inodes are processed in order from <i>bulkstat</i> in the function
<i>cb_startpt</i>. Thus we start allocating inodes to the first stream
until we reach the allocated amount, and then must decide how to
proceed on to the next stream. At this point we have 3 possible actions:
<dl>
<dt>Hold
<dd>Include this file in the current stream.
<dt>Bump
<dd>Start a new stream beginning with this file.
<dt>Split
<dd>Split this file across 2 streams in different extent groups.
</dl>
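A rough sketch of this three-way decision, under heavily simplified assumptions: the real logic in cb_startpt weighs more factors (minimum split sizes, file types, resumed dumps, etc.), and all names here are hypothetical.

```c
#include <stdint.h>
#include <assert.h>

/* Hold / Bump / Split decision at a stream boundary (sketch). */
typedef enum { ACTION_HOLD, ACTION_BUMP, ACTION_SPLIT } startpt_action_t;

static startpt_action_t
choose_action(uint64_t stream_room,  /* bytes left in current stream */
              uint64_t filesz,       /* size of the file being placed */
              int splittable)        /* e.g. regular file, worth splitting */
{
    if (filesz <= stream_room)
        return ACTION_HOLD;          /* fits: include in current stream */
    if (!splittable || stream_room == 0)
        return ACTION_BUMP;          /* start a new stream with this file */
    return ACTION_SPLIT;             /* part here, rest in the next stream */
}
```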

<p>
<img src="split_algorithm.gif">
<p>

<hr>

<h3><a name="xfsrestore">xfsrestore</a></h3>

<h4><a name="control_flow_restore">Control Flow of xfsrestore</a></h4>

<b>content_init</b> (xfsrestore version)
<p>
Initialize the mmap files of:
<ul>
<li>"$dstdir/xfsrestorehousekeepingdir/state"
<li>"$dstdir/xfsrestorehousekeepingdir/dirattr"
<li>"$dstdir/xfsrestorehousekeepingdir/dirextattr"
<li>"$dstdir/xfsrestorehousekeepingdir/namreg"
<li>"$dstdir/xfsrestorehousekeepingdir/inomap"
<li>"$dstdir/xfsrestorehousekeepingdir/tree"
</ul>

<b>content_stream_restore</b>

<ul>
<li> one stream does while others wait:
  <ul>
  <li> validates command line dump spec against the online inventory
  <li> incorporates the online inventory into the persistent inventory
  </ul>

<li> one stream does while others wait:
  <ul>
  <li> if which session to restore is still unknown then
     <ul>
     <li> search media files of dump to match command args or ask the
       user to select the media file
     <li> add found media file to persistent inventory
     </ul>
  </ul>

<li> one stream does while others wait:
  <ul>
  <li> search for directory dump
  <li> calls <b>dirattr_init</b> if necessary
  <li> calls <b>namreg_init</b> if necessary
  <li> initialize the directory tree (<b>tree_init</b>)
  <li> read the dirents into the tree
       (<a href="#applydirdump"><b>applydirdump</b></a>)
  </ul>

<li> one stream does while others wait:
  <ul>
  <li> do tree post processing (<b>treepost</b>)
    <ul>
    <li> create the directories (<b>mkdir</b>)
    <li> cumulative restore file system fixups
    </ul>
  </ul>

<li> all threads can process each media file of their dumps for
  restoring the non-directory files
  <ul>
  <li>loop over each media file
     <ul>
     <li> read in file header
     <li> call <b>applynondirdump</b> for file hdr
	 <ul>
	 <li> restore extended attributes for file
	      (if it is last extent group of file)
	 <li> restore file
	    <ul>
	    <li>loop thru all hardlink paths from tree for inode
                (<b>tree_cb_links</b>) and call <b>restore_file_cb</b>
                <ul>
                <li> if a hard link then link(path1, path2)
                <li> else restore the non-dir object:
                   <ul>
                   <li> S_IFREG -> <b>restore_reg</b> - restore regular file
                      <ul>
                      <li>truncate file to bs_size
                      <li>set the bs_xflags for extended attributes
                      <li>set DMAPI fields if necessary
                      <li>loop processing the extent headers
                         <ul>
                         <li>if type LAST then exit loop
                         <li>if type ALIGN then eat up the padding
                         <li>if type HOLE then ignore
                         <li>if type DATA then copy the data into
                             the file for the extent;
                             seeking to extent start if necessary
                         </ul>
                      <li>register the extent group in the partial registry
                      <li>set timestamps using utime(2)
                      <li>set permissions using fchmod(2)
                      <li>set owner/group using fchown(2)
                      </ul>
                   <li> S_IFLNK -> <b>restore_symlink</b>
                   <li> else -> <b>restore_spec</b>
                   </ul>
                </ul>
            <li>if no hardlinks references for inode in tree then
                restore file into orphanage directory
	    </ul>
	 <li> update stats
	 <li> loop
	   <ul>
	   <li> get mark
	   <li> read file header
	   <li> if corrupt then go to next mark
	   <li> else exit loop
	   </ul>
	 </ul>
     </ul>
  </ul>

<li> one stream does while others wait:
  <ul>
  <li> finalize
      <ul>
      <li> restore directory attributes
      <li> remove orphanage directory
      <li> remove persistent inode map
      </ul>
  </ul>
</ul>

<hr>

<b>content_init</b> in a bit more detail (xfsrestore version)
<ul>
<li> create house-keeping-directory for persistent mmap file data
  structures. For cumulative and interrupted restores,
  we need to keep restore session data between invocations of xfsrestore.
<li> mmap the "state" file and create if not already existing.
  Initially just mmap the header.  (More details below)
<li> if continuing interrupted session then
  <ul>
  <li> initialize and mmap the directory attribute data
    and dirextattr file (<b>dirattr_init</b>)
  <li> initialize name registry data (<b>namreg_init</b>)
  <li> initialize and mmap the inode map (<b>inomap_sync_pers</b>)
  <li> initialize and mmap the dirent tree (<b>tree_sync</b>)
  <p>
  <li> finalize -> restore directory attributes, delete inode map
  </ul>
<li> mmap the state file for the header and the subtree selections
<li> update the state header with the command line predicates
<li> update the subtree selections via the -s option
<li> create extended attribute buffers for each stream
<li> mmap the state file for the persistent inventory descriptors
<p>
<li> initialize and mmap the directory attribute data
  and dirextattr file (<b>dirattr_init</b>)
<li> initialize name registry data (<b>namreg_init</b>)
<li> initialize and mmap the inode map (<b>inomap_sync_pers</b>)
<li> initialize and mmap the dirent tree (<b>tree_sync</b>)
</ul>

<hr>

<h4><a name="pers_inv">Persistent Inventory and State File</a></h4>

The persistent inventory is found inside the "state" file.
The state file is an mmap'ed file called
<b>$dstdir/xfsrestorehousekeepingdir/state</b>.
The state file (<i>struct pers</i> from content.c) contains
a header of:
<ul>
<li>command line arguments from 1st session,
<li>partial registry data structure for use with multiple streams
    and extended attributes,
<li>various session state such as
    dumpid, dump label, number of inodes restored so far, etc.
</ul>
<br>
This is followed by pages for the subtree selections and then
the persistent inventory.
<br>
So the 3 main sections look like:
<pre>
<b>"state" mmap file</b>
---------------------
| State Header      |
| (number of pages  |
|  to hold pers_t)  |
| pers_t:           |
| accum. state      |
|   - cmd opts      |
|   - etc...        |
| session state     |
|   - dumpid        |
|   - accum.time    |
|   - ino count     |
|   - etc...        |
|   - stream head   |
---------------------
| Subtree           |
| Selections        |
| (stpgcnt * pgsz)  |
---------------------
| Persistent        |
| Inventory         |
| Descriptors       |
| (descpgcnt * pgsz)|
|                   |
---------------------
</pre>


<b>Persistent Inventory Tree</b>
<pre>
e.g. drive1         drive2        drive3
|-------------|  |---------|   |---------|
| stream1     |->| stream2 |-->| stream3 |
|(pers_strm_t)|  |         |   |         |
|-------------|  |---------|   |---------|
		    ||
		    \/
                 e.g. tape21        tape22         tape23
		 |------------|   |---------|   |---------|
		 |  obj1      |-->|  obj2   |-->|  obj3   |
		 |(pers_obj_t)|   |         |   |         |
		 |------------|   |---------|   |---------|
				    ||
				    \/
				 |-------------|   |---------|   |---------|
				 | file1       |-->|  file2  |-->|  file3  |
				 |(pers_file_t)|   |         |   |         |
				 |-------------|   |---------|   |---------|
</pre>




[TODO: persistent inventory needs investigation]

<hr>
<h4><a name="dirent_tree">Restore's directory entry tree</a></h4>

As can be seen in the directory dump format above, part of the dump
consists of directories and their associated directory entries.
The other part consists of the files, which are identified only by
their inode#, sourced from <i>bulkstat</i> during the dump.
When restoring a dump, the first step is reconstructing the
tree of directory nodes. This tree can then be used to associate
each file with its directory, so that it is restored to the correct
location in the directory structure.
<p>
The tree is an mmap'ed file called
<b>$dstdir/xfsrestorehousekeepingdir/tree</b>.
Different sections of it will be mmap'ed separately.
It is of the following format:
<pre>
--------------------
|  Tree Header     | <--- ptr to root of tree, hash size,...
|  (pgsz = 16K)    |
--------------------
|  Hash Table      | <--- inode# ==map==> tree node
--------------------
|  Node Header     | <--- describes allocation of nodes
|  (pgsz = 16K)    |
--------------------
|  Node Segment#1  | <--- typically 1 million tree nodes
--------------------
|  ...             |
|                  |
--------------------
|  Node Segment#N  |
--------------------
</pre>

<p>
The tree header is described by restore/tree.c/treePersStorage,
and it has such things as pointers to the root of the tree and
the size of the hash table.
<pre>
        ino64_t p_rootino - ino of root
        nh_t p_rooth - handle of root node
        nh_t p_orphh - handle to orphanage node
        size64_t p_hashsz - size of hash array
        size_t p_hashmask - hash mask (private to hash abstraction)
        bool_t p_ownerpr - whether to restore directory owner/group attributes
        bool_t p_fullpr - whether restoring a full level 0 non-resumed dump
        bool_t p_ignoreorphpr - set if positive subtree or interactive
</pre>
<p>
The hash table maps the inode number to the tree node. It is a
chained hash table with the "next" link stored in the tree node
in the <i>n_hashh</i> field of struct node in restore/tree.c.
The size of the hash table is based on the number of directories
and non-directories (which approximates the number of directory
entries - it won't include extra hard links). The size of the table
is capped below at 1 page and capped above at the smaller of
virtual-memory-limit/4/8 (i.e. vmsz/32) and the range of 2^32.
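The sizing rule above can be sketched as follows; the constants and names are ours, not those of restore/tree.c, and "vmsz" stands in for the virtual memory limit.

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of sizing restore's ino -> node hash table:
 * start from the estimated dirent count, cap below at one page worth
 * of buckets, cap above at vmsz/4/8 or 2^32, whichever is smaller. */
#define PGSZ     16384u   /* assumed page granularity, as in the tree file */
#define ENTRY_SZ 8u       /* assumed bytes per bucket (one handle) */

static uint64_t hash_table_entries(uint64_t dircnt, uint64_t nondircnt,
                                   uint64_t vmsz)
{
    uint64_t want = dircnt + nondircnt;  /* ~ number of dirents */
    uint64_t lo = PGSZ / ENTRY_SZ;       /* at least one page of buckets */
    uint64_t hi = vmsz / 4 / 8;          /* i.e. vmsz/32 */
    if (hi > (uint64_t)1 << 32)
        hi = (uint64_t)1 << 32;          /* range of 2^32 */
    if (want < lo)
        want = lo;
    if (want > hi)
        want = hi;
    return want;
}
```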
<p>
The node header is described by restore/node.c/node_hdr_t and
it contains fields to help in the allocation of nodes.
<pre>
        size_t nh_nodesz -  internal node size
        ix_t nh_nodehkix -
        size_t nh_nodesperseg - num nodes per segment
        size_t nh_segsz - size in bytes of segment
        size_t nh_winmapmax - maximum number of windows
                              based on using up to vmsz/4
        size_t nh_nodealignsz - node alignment
        nix_t nh_freenix - pointer to singly linked freelist
        off64_t nh_firstsegoff - offset to 1st segment
        off64_t nh_virgsegreloff - (see diagram)
                 offset (relative to beginning of first segment) into
                 backing store of segment containing one or
                 more virgin nodes. relative to beginning of segmented
                 portion of backing store. bumped only when all of the
                 nodes in the segment have been placed on the free list.
                 when bumped, nh_virginrelnix is simultaneously set back
                 to zero.
        nix_t nh_virgrelnix - (see diagram)
                 relative node index within the segment identified by
                 nh_virgsegreloff of the next node not yet placed on the
                 free list. never reaches nh_nodesperseg: instead set
                 to zero and bump nh_virgsegreloff by one segment.
</pre>
<p>
All the directory entries are stored in a node segment. Each segment
holds around 1 million nodes (NODESPERSEGMIN); the actual value is
slightly greater because the segment size in bytes must be a multiple
of both the node size and the page size. However, the code determining
the number of nodes was changed recently due to problems at a site.
The number of nodes is now based on the
value of <i>dircnt+nondircnt</i>, in an attempt to
fit most of the entries into 1 segment. As the value of
<i>dircnt+nondircnt</i> is only an approximation to the number of
directory entries, we cap it below at 1 million entries, as was done
previously.
<p>
Each segment is mmap'ed separately. In fact, the actual allocation
of nodes is handled by a few abstractions:
a <b>node abstraction</b> and a <b>window abstraction</b>.
At the node abstraction, when one wants to allocate a node
using <i><b>node_alloc()</b></i>, one first checks the free-list of
nodes. If the free list is empty, then a new window is mapped and
a chunk of 8192 nodes is put on the free list by linking
each node through its first 8 bytes (ignoring the node fields).
<p>
<pre>

  SEGMENT (default was about 1 million nodes)
|----------|
| |------| |
| |      | |
| | 8192 | |
| | nodes| |   nodes already used in tree
| | used | |
| |      | |
| |------| |
|          |
| |------| |
| |   --------| <-----nh_freenix (ptr to node-freelist)
| |node1 | |  |
| |------| |  | node-freelist (linked list of free nodes)
| |   ----<---|
| |node2 | |
| |------| |
............
|----------|


</pre>
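The freelist trick in the diagram above (reusing the first 8 bytes of each free node as the "next" link) can be sketched like this. Sizes and names are illustrative, not those of restore/node.c.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Sketch: link a chunk of 8192 nodes into a singly linked freelist
 * by overwriting each free node's first 8 bytes with the index of
 * the next free node (a free node's fields don't matter yet). */
#define NODE_SZ  128u
#define CHUNK    8192u
#define NIX_NULL ((uint64_t)-1)

static uint64_t freelist_head = NIX_NULL;

/* put CHUNK nodes, starting at node index 'first', on the freelist */
static void freelist_fill(unsigned char *seg, uint64_t first)
{
    for (uint64_t i = 0; i < CHUNK; i++) {
        uint64_t nix = first + i;
        uint64_t next = (i + 1 < CHUNK) ? nix + 1 : freelist_head;
        memcpy(seg + nix * NODE_SZ, &next, sizeof next);
    }
    freelist_head = first;
}

/* pop one node index off the freelist (NIX_NULL when empty) */
static uint64_t freelist_pop(unsigned char *seg)
{
    uint64_t nix = freelist_head;
    if (nix != NIX_NULL)
        memcpy(&freelist_head, seg + nix * NODE_SZ, sizeof freelist_head);
    return nix;
}
```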


<h5><a name="win_abs">Window Abstraction</a></h5>
The window abstraction manages the mapping and unmapping of the
segments (of nodes) of the dirent tree.
In the node allocation, mentioned above, if our node-freelist is
empty we call <i><b>win_map()</b></i> to map in a chunk of 8192 nodes
for the node-freelist.
<p>
Consider the <i><b>win_map</b>(offset, return_memptr)</i> function:
<pre>
One is asking for an offset within a segment.
It looks up its <i>bag</i> for the segment (given the offset), and
if it's already mapped then
    if the window has a refcnt of zero, then remove it from the win-freelist
    it uses that address within the mmap region and
    increments refcnt.
else if it's not in the bag then
    if win-freelist is not empty then
        munmap the oldest mapped segment
	remove head of win-freelist
        remove the old window from the bag
    else /* empty free-list */
        allocate a new window
    endif
    mmap the segment
    increment refcnt
    insert window into bag of mapped segments
endif
</pre>
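The win_map() pseudocode above can be sketched in C. This simplifies heavily: the real win.c keeps windows in a bag plus a doubly linked LRU win-freelist and actually mmaps/munmaps segments, whereas this sketch scans a fixed array and only simulates mapping. All names are ours.

```c
#include <stdint.h>
#include <assert.h>

#define NWIN 2   /* pretend only 2 windows fit in the vm budget */

typedef struct win {
    int      w_used;     /* slot currently holds a "mapped" segment */
    uint64_t w_off;      /* segment offset (the bag lookup key) */
    unsigned w_refcnt;   /* maps minus unmaps */
    uint64_t w_lastuse;  /* LRU stamp */
} win_t;

static win_t wins[NWIN];
static uint64_t lru_clock;

static win_t *win_map(uint64_t off)
{
    win_t *victim = 0;
    int i;
    for (i = 0; i < NWIN; i++) {          /* "bag" lookup by offset */
        win_t *w = &wins[i];
        if (w->w_used && w->w_off == off) {
            w->w_refcnt++;                /* already mapped: take a ref */
            w->w_lastuse = ++lru_clock;
            return w;
        }
    }
    for (i = 0; i < NWIN; i++) {          /* free slot, else LRU refcnt==0 */
        win_t *w = &wins[i];
        if (!w->w_used) { victim = w; break; }
        if (w->w_refcnt == 0 &&
            (!victim || w->w_lastuse < victim->w_lastuse))
            victim = w;
    }
    if (!victim)
        return 0;                         /* every window pinned */
    victim->w_used = 1;                   /* "munmap" old, "mmap" new */
    victim->w_off = off;
    victim->w_refcnt = 1;
    victim->w_lastuse = ++lru_clock;
    return victim;
}

static void win_unmap(win_t *w)
{
    if (w->w_refcnt)
        w->w_refcnt--;                    /* refcnt==0 makes it evictable */
}
```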
<p>
The window abstraction maintains an LRU win-freelist, not to be
confused with the node-freelist. The win-freelist consists
of windows (stored in a bag) which are doubly linked, ordered by
the time they were last used.
The node-freelist, in contrast, is used to get a new node
in the node allocation.
<p>
Note that the windows are stored in 2 lists. They are doubly
linked in the LRU win-freelist and are also stored in a <i>bag</i>.
A bag is just a doubly linked searchable list where
the elements are allocated using <i>calloc()</i>.
It uses the bag as a container of mmaped windows which can be
searched using the bag key of window-offset.
<pre>

BAG:  |--------|     |--------|     |--------|     |--------|     |-------|
      | win A  |<--->| win B  |<--->| win C  |<--->| win D  |<--->| win E |
      | ref=2  |     | ref=1  |     | ref=0  |     | ref=0  |     | ref=0 |
      | offset |     | offset |     | offset |     | offset |     | offset|
      |--------|     |--------|     |--------|     |--------|     |-------|
                                      ^                               ^
                                      |                               |
                                      |                               |
                     |----------------|       |-----------------------|
LRU             |----|---|               |----|---|
win-freelist:   | oldest |               | 2nd    |
                | winptr |<------------->| oldest |<----....
                |        |               | winptr |
                |--------|               |--------|

</pre>

<p>
<b>Call Chain</b><br>

Below are some call-chain scenarios showing how the allocation of
dirent tree nodes is done at different stages.
<p>
<pre>
1st time we allocate a dirent node:

applydirdump()
  Go thru each directory entry (dirent)
    tree_addent()
      if new entry then
         Node_alloc()
           node_alloc()
             win_map()
               mmap 1st segment/window
               insert win into bag
	       refcnt++
             make node-freelist of 8192 nodes (linked list)
             remove list node from freelist
             win_unmap()
               refcnt--
               put win on win-freelist (as refcnt==0)
             return node

2nd time we call tree_addent():

      if new entry then
         Node_alloc()
           node_alloc()
             get node off node-freelist (8190 nodes left now)
             return node

8193rd time, when we have used up 8192 nodes and the node-freelist is empty:

      if new entry then
         Node_alloc()
           node_alloc()
             there is no node left on node-freelist
             win_map at the address after the old node-freelist
               find this segment in bag
                 refcnt==0, so remove from LRU win-freelist
                 refcnt++
                 return addr
             make a node-freelist of 8192 nodes from where left off last time
             win_unmap
               refcnt--
               put on LRU win-freelist as refcnt==0
             get node off node-freelist (8191 nodes left now)
             return node

When whole segment used up and thus all remaining node-freelist
nodes are gone then
(i.e. in old scheme would have used up all 1 million nodes
 from first segment):

      if new entry then
         Node_alloc()
           node_alloc()
             if no node-freelist then
               win_map()
                 new segment not already mapped
                 LRU win-freelist is not empty (we have 1st segment)
                 remove head from LRU win-freelist
                 remove win from bag
                 munmap its segment
                 mmap the new segment
                 add to bag
                 refcnt++
               make a new node-freelist of 8192 nodes
               win_unmap()
                 refcnt--
                 put on LRU win-freelist as refcnt==0
               get node off node-freelist (8191 nodes left now)
               return node

</pre>

Pseudo-code snippets of the directory tree creation functions (from notes)
give one an idea of the flow of control for processing dirents
and adding them to the tree and other auxiliary structures:
<pre>

<b>content_stream_restore</b>()
  ...
  Get next media file
  dirattr_init() - initialize directory attribute structure
  namereg_init() - initialize name registry structure
  tree_init() - initialize dirent tree
  applydirdump() - process the directory dump and create tree - see below
  treepost() - tree post processing where mkdirs happen
  ...

<a name="applydirdump"><b>applydirdump</b>()</a>
  ...
  inomap_restore_pers() - read ino map
  read directories and their entries
    loop 'til null hdr
       dirh = <b>tree_begindir</b>(fhdr, dah) - process dir filehdr
       loop 'til null entry
         rv = read_dirent()
         <b>tree_addent</b>(dirh, dhdrp->dh_ino, dh_gen, dh_name, namelen)
       endloop
       tree_enddir(dirh)
    endloop
  ...

<b>tree_begindir</b>(fhdrp - fileheader, dahp - dirattrhandle)
  ...
  ino = fhdrp->fh_stat.bs_ino
  hardh = link_hardh(ino, gen) - lookup inode in tree
  if (hardh == NH_NULL) then
    new directory - 1st time seen
    dah = dirattr_add(fhdrp) - add dir header to dirattr structure
    hardh = Node_alloc(ino, gen,....,NF_ISDIR|NF_NEWORPH)
    link_in(hardh) - link into tree
    adopt(p_orphh, hardh, NRH_NULL) - put dir in orphanage directory
  else
    ...
  endif

<b>tree_addent</b>(parent, inode, size, name, namelen)
  hardh = link_hardh(ino, gen)
  if (hardh == NH_NULL) then
    new entry - 1st time seen
    nrh = namreg_add(name, namelen)
    hardh = Node_alloc(ino, gen, NRH_NULL, DAH_NULL, NF_REFED)
    link_in(hardh)
    adopt(parent, hardh, nrh)
  else
    ...
  endif

</pre>

<p>

<hr>
<h4><a name="cum_restore">Cumulative Restore</a></h4>
A cumulative restore works a bit differently than one might expect.
It tries to restore the state of the filesystem at the time of
the incremental dump. As the man page states:
"This can involve adding, deleting, renaming, linking,
 and unlinking files and directories." From a coding point of view,
this means we need to know what the dirent tree was like previously
compared with what it is like now, so that we can see what was
added and deleted. This means that the dirent tree, which is stored
as an mmap'ed file in
<i>restoredir/xfsrestorehousekeepingdir/tree</i>, should not be deleted
between cumulative restores (as we need to keep using it).
<p>
So on the first level 0 restore, the dirent tree is created.
When the directories are restored and the files are restored,
the corresponding tree nodes are marked as <i>NF_REAL</i>.
On the next level cumulative restore, when it is processing the
dirents, it looks them up in the tree (created on the previous restore).
If an entry already exists then it marks it as <i>NF_REFED</i>.
<p>
In case a dirent has gone away between incremental dumps,
xfsrestore does an extra pass in the tree preprocessing
which traverses the tree looking for non-referenced (not <i>NF_REFED</i>)
nodes, so that if they exist in the FS (i.e. are <i>NF_REAL</i>)
they can be deleted (so that the FS resembles what it was at the time
of the incremental dump).
Note there are more conditionals in the code than just that -
but that is the basic plan.
It is elaborated further below.

<h4><a name="tree_post">Cumulative Restore Tree Postprocessing</a></h4>
After the dirent tree is created or updated from the directory dump
during a cumulative restore, a 4-step postprocessing pass runs (<b>treepost</b>):
<p>
<table border>
<caption><b>Steps of Tree Postprocessing</b></caption>
<tr>
   <th>Function</th><th>What it does</th>
</tr>
<tr>
   <td><b>1. noref_elim_recurse</b></td>
   <td><ul>
   <li>remove deleted dirs
   <li>rename moved dirs to orphanage
   <li>remove extra deleted hard links
   <li>rename moved non-dirs to orphanage
   </ul></td>
</tr>
<tr>
   <td><b>2. mkdirs_recurse</b></td>
   <td><ul>
   <li>mkdirs on (dir & !real & ref & sel)
   </ul></td>
</tr>
<tr>
   <td><b>3. rename_dirs</b></td>
   <td><ul>
   <li>rename moved dirs from orphanage to destination
   </ul></td>
</tr>
<tr>
   <td><b>4. proc_hardlinks</b></td>
   <td><ul>
   <li>rename moved non-dirs from orphanage to destination
   <li>remove deleted non-dirs (real & !ref & sel)
   <li>create a link on rename error (don't understand this one)
   </ul></td>
</tr>
</table>

<p>
Step 1 was changed so that files which are deleted and not moved
are deleted early on, otherwise, it can stop a parent directory
from being deleted.
The new step is:
<p>
<table border>
<tr>
   <th>Function</th><th>What it does</th>
</tr>
<tr>
   <td><b>1. noref_elim_recurse</b></td>
   <td><ul>
   <li>remove deleted dirs
   <li>rename moved dirs to orphanage
   <li>remove extra deleted hard links
   <li>rename moved non-dirs to orphanage
   <li>remove deleted non-dirs which aren't part of a rename
   </ul></td>
</tr>
</table>
<p>
One will notice that renames are not performed directly.
Instead, entries are renamed into the orphanage, directories are
created, then entries are moved from the orphanage to their
intended destinations. This is done because renames may not
succeed until the directories are created. And the directories
are not created first because we may be able to create an entry
simply by moving an existing one.
The step of "removing deleted non-dirs" in <i>proc_hardlinks</i>
should no longer happen, since it is now done earlier.

<p>
<hr>
<h4><a name="partial_reg">Partial Registry</a></h4>

The partial registry is a data structure used in <i>xfsrestore</i>
to ensure that, for files which have been split into multiple extent groups,
the extended attributes are not restored until the entire file has been
restored. The reason for this is apparently so that DMAPI attributes
aren't restored until we have the complete file. Each extent group dumped
carries an identical copy of the extended attributes (EAs) for that file,
so without this data structure we would apply the first EAs we came across.
<p>
The data structure is of the form:
<pre>
Array of M entries:
-------------------
0: inode#
   Array for each drive
     drive1: <start-offset> <end-offset>
     ...
     driveN: <start-offset> <end-offset>
-------------------
1: inode#
   Array for each drive
-------------------
2: inode#
   Array for each drive
-------------------
...
-------------------
M-1: inode#
     Array for each drive
-------------------

Where N = number of drives (streams); M = 2 * N - 1
</pre>

There can only be 2*N-1 entries in the partial registry because
each stream can contribute an entry for its current inode and
one for a previous inode which is split - except for the 1st stream,
which cannot have a previous split.
<pre>
      stream 1        stream 2         stream 3      ...  stream N
  |---------------|----------------|-------------------|------------|
  |            ------   -----   ------   -----      -------  -----  |
  |            C  | P     C        | P     C           |  P    C    |
  |---------------|----------------|-------------------|------------|

       current       prev.+curr.        prev.+curr.      prev.+curr.

Where C = current; P = previous
</pre>

So when an extent group is processed which doesn't cover the whole file,
the extent range for that file is updated in the partial
registry. If the file doesn't yet exist in the array, a new entry is
added; if it does, the extent range for the given drive is updated.
It is worth remembering that one drive (stream) can hold multiple
extent groups for a file (if it is &gt;16MB), in which case the
existing range is simply extended (extent groups are split up in order).
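The registration and completeness logic can be sketched as follows. This is a minimal sketch with invented names (<i>partial_reg</i>, <i>partial_done</i>, and the fixed-size array stand in for the real code in content.c):

```c
#include <assert.h>

#define NDRIVES  2
#define NENTRIES (2 * NDRIVES - 1)      /* M = 2*N - 1 */

typedef struct { long start, end; } range_t;    /* restored byte range */

typedef struct {
    long    inode;                      /* 0 => slot free */
    range_t drive[NDRIVES];             /* one range per drive/stream */
} partial_t;

static partial_t registry[NENTRIES];

/* Record that 'drive' restored bytes [start,end) of 'inode'.  Extent
 * groups from one drive arrive in order, so a group that continues
 * the previous one just extends the range.  M = 2N-1 guarantees a
 * free slot always exists. */
void partial_reg(long inode, int drive, long start, long end)
{
    int i, free_slot = -1;

    for (i = 0; i < NENTRIES; i++) {
        if (registry[i].inode == inode)
            break;
        if (registry[i].inode == 0 && free_slot < 0)
            free_slot = i;
    }
    if (i == NENTRIES)
        i = free_slot;                  /* start a new entry */
    registry[i].inode = inode;
    if (registry[i].drive[drive].end == start)
        registry[i].drive[drive].end = end;       /* extend in order */
    else
        registry[i].drive[drive] = (range_t){ start, end };
}

/* Has the whole file [0,size) now been restored across all drives? */
int partial_done(long inode, long size)
{
    int i, d, progressed = 1;
    long covered = 0;

    for (i = 0; i < NENTRIES && registry[i].inode != inode; i++)
        ;
    if (i == NENTRIES)
        return 0;
    /* greedy sweep: keep extending 'covered' with any range that
     * starts at or before it and ends beyond it */
    while (progressed && covered < size) {
        progressed = 0;
        for (d = 0; d < NDRIVES; d++) {
            range_t r = registry[i].drive[d];
            if (r.start <= covered && r.end > covered) {
                covered = r.end;
                progressed = 1;
            }
        }
    }
    return covered >= size;
}
```

Once <i>partial_done</i> reports the whole file restored, the EAs for that inode can safely be applied.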
<p>
A bug was discovered in this area of the code for <i>DMF offline</i> files,
which have an associated file size but no data blocks allocated and
thus no extents. The offline files were wrongly added to the partial
registry because, on restore, they never reached the full size of the
file (because they are offline!). These types of files, which restore
no data, are now special-cased.
<p>
<hr>


<h3><a name="drive_strategy">Drive Strategies</a></h3>
The I/O which happens when reading and writing the dump
can be to a tape, file, stdout or
to a tape remotely via rsh(1) (or $RSH)  and rmt(1) (or $RMT).
There are 3 pieces of code called strategies which
handle the dump I/O:
<ul>
<li>drive_scsitape
<li>drive_minrmt
<li>drive_simple
</ul>
There is an associated data structure - below is one
for drive_scsitape:
<pre>
    drive_strategy_t drive_strategy_scsitape = {
	    DRIVE_STRATEGY_SCSITAPE,        /* ds_id */
	    "scsi tape (drive_scsitape)",   /* ds_description */
	    ds_match,                       /* ds_match */
	    ds_instantiate,                 /* ds_instantiate */
	    0x1000000ll,                    /* ds_recmarksep  16 MB */
	    0x10000000ll,                   /* ds_recmfilesz 256 MB */
    };
</pre>
The choice of which strategy to use is made by a
scoring scheme which is probably not warranted IMHO.
(A direct command-line option would be simpler and less confusing.)
The scoring function is called ds_match.

<table border>
<tr>
   <th>strategy</th><th>IRIX scoring</th><th>Linux scoring</th>
</tr>
<tr>
   <td>drive_scsitape</td>
   <td>
   score badly with -10 if:
       <ul>
       <li>stdio pathname
       <li>if colon (':') in pathname (assumes remote) and
	    <ul>
	    <li> open on pathname fails
	    <li> MTIOCGET ioctl fails
	    </ul>
       <li>or no colon in the pathname and the driver name is not "tpsc" or "ts_"
       </ul>
   Otherwise, if the syscalls complete OK, we score 10.
   </td>
   <td>
   score like IRIX but instead of checking drivername associated
   with path (not available on Linux), score -10 if the following:
	<ul>
	<li>stat fails
	<li>it is not a character device
	<li>its real path does not contain "/nst", "/st" nor "/mt".
	</ul>
   </td>
</tr>
<tr>
    <td>drive_minrmt</td>
    <td>
	<ul>
	<li>score badly with -10 if stdio pathname
	<li>score 10 if have all of the following:
	    <ul>
	    <li>colon is in the pathname (assumes remote from this)
	    <li>blocksize set with -b option
	    <li>minrmt chosen with -m option
	    </ul>
	<li>otherwise score badly with -10
	</ul>
    </td>
    <td>score like IRIX but do not require a colon in the pathname;
	i.e. one can use this strategy on Linux without requiring a
	remote pathname
    </td>
</tr>
<tr>
    <td>drive_simple</td>
    <td>
	<ul>
	<li>score badly with -1 if
	    <ul>
	    <li>stat fails on local pathname
	    <li>pathname is a local directory
	    </ul>
	<li>otherwise score with 1
	</ul>
    </td>
    <td>identical to IRIX</td>
</tr>
</table>
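The selection itself is just "run every ds_match and keep the best score". Here is a toy version; the scoring functions are simplified stand-ins for the real ds_match logic, not the actual implementations:

```c
#include <assert.h>
#include <string.h>

typedef struct strategy {
    const char *name;
    int (*match)(const char *path);     /* plays the role of ds_match */
} strategy_t;

/* Toy scoring functions, loosely following the Linux column above. */
static int match_scsitape(const char *path)
{
    if (strcmp(path, "-") == 0)
        return -10;                     /* stdio pathname */
    return (strstr(path, "/nst") || strstr(path, "/st")) ? 10 : -10;
}

static int match_minrmt(const char *path)
{
    (void)path;
    return -10;                 /* needs -b and -m; assume not given */
}

static int match_simple(const char *path)
{
    (void)path;
    return 1;                   /* low score: the fallback strategy */
}

/* run every ds_match and keep the strategy with the best score */
const char *choose_strategy(const char *path)
{
    strategy_t tab[] = {
        { "drive_scsitape", match_scsitape },
        { "drive_minrmt",   match_minrmt   },
        { "drive_simple",   match_simple   },
    };
    int i, best = -1000, besti = 0;

    for (i = 0; i < 3; i++) {
        int score = tab[i].match(path);
        if (score > best) {
            best = score;
            besti = i;
        }
    }
    return tab[besti].name;
}
```

Note how drive_simple's low positive score makes it the fallback: it only wins when the other strategies score negatively.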

<p>
Each strategy is organised like a "class" with functions/methods
in the data structure:
<pre>
        do_init,
        do_sync,
        do_begin_read,
        do_read,
        do_return_read_buf,
        do_get_mark,
        do_seek_mark,
        do_next_mark,
        do_end_read,
        do_begin_write,
        do_set_mark,
        do_get_write_buf,
        do_write,
        do_get_align_cnt,
        do_end_write,
        do_fsf,
        do_bsf,
        do_rewind,
        do_erase,
        do_eject_media,
        do_get_device_class,
        do_display_metrics,
        do_quit,
</pre>

<h4><a name="drive_scsitape">Drive Scsitape</a></h4>
This strategy is the main one used for dumps to tape and
dumps to a remote tape. This strategy on IRIX can be used for remote
dumps to another IRIX machine. On Linux, this strategy is
used for remote dumps to Linux or IRIX machines. Remote dumping uses
the librmt library, see below.
<p>
If xfsdump/xfsrestore is running single-threaded (-Z option)
or is running on Linux (which is not multi-threaded) then
records are read/written straight to the tape. If it is running
multi-threaded then a circular buffer is used as an intermediary
between the client and worker threads.
<p>
Initially <i>drive_init1()</i> calls <i>ds_instantiate()</i> which,
if dump/restore is running multi-threaded,
creates the ring buffer with <i>ring_create</i>, which initialises
the state to RING_STAT_INIT and sets up the worker thread with
ring_worker_entry.
<pre>
ds_instantiate()
  ring_create(...,ring_read, ring_write,...)
    - allocate and init buffers
    - set rm_stat = RING_STAT_INIT
    start up worker thread with ring_worker_entry
</pre>
The worker spends its time in a loop getting items from the
active queue, doing the read or write operation and placing the result
back on the ready queue.
<pre>
worker
======
ring_worker_entry()
  loop
    ring_worker_get() - get from active queue
    case rm_op
      RING_OP_READ -> ringp->r_readfunc
      RING_OP_WRITE -> ringp->r_writefunc
      ..
    endcase
    ring_worker_put() - puts on ready queue
  endloop
</pre>


<p>
<h5><a name="reading">Reading</a></h5>

Prior to reading, one needs to call <i>do_begin_read()</i>,
which calls <i>prepare_drive()</i>. <i>prepare_drive()</i> opens
the tape drive if necessary and gets its status.
It then works out the tape record size to use
(<i>set_best_blk_and_rec_sz</i>); on IRIX this uses the
current max block size of the scsi tape device
(mtinfo.maxblksz from ioctl(fd, MTIOCGETBLKINFO, &amp;mtinfo)).

<p>
On IRIX (from <i>set_best_blk_and_rec_sz</i>):
<ul>
<li>
local tape -> tape_recsz = min(STAPE_MAX_RECSZ = 2 MB, mtinfo.maxblksz)<br>
         which typically would mean 2 MB.
<li>
remote tape -> tape_recsz = STAPE_MIN_MAX_BLKSZ = 240 KB
</ul>
<p>
On Linux:
<ul>
<li>
local tape ->
    <ul>
    <li>
    tape_recsz = STAPE_MAX_LINUX_RECSZ = 1 MB<br>
    <li> or if -b cmdlineblksize specified then<br>
    tape_recsz = min(STAPE_MAX_RECSZ = 2 MB, cmdlineblksize)<br>
     which typically would mean cmdlineblksize.
    </ul>
<li>
remote tape -> tape_recsz = STAPE_MIN_MAX_BLKSZ = 240 KB
</ul>
<p>
If we have a fixed-size device, it initially tries to read
min(2 MB, current max blksize),
but if it reads fewer bytes than this,
it will try again with STAPE_MIN_MAX_BLKSZ = 240 KB.
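The Linux record-size choice above can be summarised as a small function. The constants come from the text; the helper name <i>pick_tape_recsz</i> is an assumption for illustration, not the actual code:

```c
#include <assert.h>

/* constants quoted in the text */
#define STAPE_MAX_RECSZ        (2 * 1024 * 1024)   /* 2 MB   */
#define STAPE_MAX_LINUX_RECSZ  (1 * 1024 * 1024)   /* 1 MB   */
#define STAPE_MIN_MAX_BLKSZ    (240 * 1024)        /* 240 KB */

/* cmdlineblksz == 0 means -b was not given */
long pick_tape_recsz(int is_remote, long cmdlineblksz)
{
    if (is_remote)
        return STAPE_MIN_MAX_BLKSZ;                /* remote tape */
    if (cmdlineblksz)                              /* -b given */
        return cmdlineblksz < STAPE_MAX_RECSZ
             ? cmdlineblksz : STAPE_MAX_RECSZ;
    return STAPE_MAX_LINUX_RECSZ;                  /* local default */
}
```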

<p>
<pre>
prepare_drive()
  open drive (repeat & timeout if EBUSY)
  get tape status (repeat 'til timeout or online)
  set up tape rec size to try
  loop trying to read a record using straight Read()
      if variable blksize then
	 ok = nread>0 & !EOD & !EOT & !FileMark
      else fixed blksize then
	 ok = nread==tape_recsz & !EOD & !EOT & !FileMark
      endif
      if ok then
	validate_media_file_hdr()
      else
        could be an error or try again with newsize
        (complicated logic in this code!)
      endif
  endloop
</pre>

<p>
For each <i>do_read</i> call in the multi-threaded case,
we have two sides to the story: the client, whose calls come
from code in <i>content.c</i>, and the worker, a simple
thread just satisfying I/O requests.
From the point of view of the ring buffer, these are the steps
which happen for reading:
<ol>
<li>client removes msg from ready queue
<li>client wants to read, so sets op field to READ (RING_OP_READ)
   and puts on active queue
<li>worker removes msg from active queue,
   invokes client read function,
   sets status field: OK/ERROR,
   puts msg on ready queue
<li>client removes this msg from ready queue
</ol>

<p>

The client read code looks like the following:
<pre>
client
======
do_read()
  getrec()
    singlethreaded -> read_record() -> Read()
    else ->
      loop 'til contextp->dc_recp is set to a buffer
	Ring_get() -> ring.c/ring_get()
	  remove msg from ready queue
	      block on ready queue - qsemP( ringp->r_ready_qsemh )
	      msgp = &ringp->r_msgp[ ringp->r_ready_out_ix ];
	      cyclic_inc(ringp->r_ready_out_ix)
        case rm_stat:
	  RING_STAT_INIT, RING_STAT_NOPACK, RING_STAT_IGNORE
            put read msg on active queue
		contextp->dc_msgp->rm_op = RING_OP_READ
		Ring_put(contextp->dc_ringp,contextp->dc_msgp);
          RING_STAT_OK
            contextp->dc_recp = contextp->dc_msgp->rm_bufp
          ...
        endcase
      endloop
</pre>
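The four-step message lifecycle listed above can be walked through single-threaded. This sketch uses invented names (<i>ring_get</i>, <i>ring_put</i>, <i>ring_worker_step</i> loosely mirror ring.c's ring_get/ring_put/ring_worker_entry); the real ring uses semaphores and a concurrent worker thread:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RING_LEN 4                      /* toy size */

typedef enum { RING_OP_NONE, RING_OP_READ } ring_op_t;
typedef enum { RING_STAT_INIT, RING_STAT_OK } ring_stat_t;

typedef struct {
    ring_op_t   op;
    ring_stat_t stat;
    char        buf[16];                /* stands in for a tape record */
} msg_t;

typedef struct {
    msg_t msg[RING_LEN];
    int ready[RING_LEN],  ready_in,  ready_out,  nready;   /* done msgs  */
    int active[RING_LEN], active_in, active_out, nactive;  /* work to do */
} ring_t;

/* all messages start out on the ready queue (RING_STAT_INIT) */
void ring_init(ring_t *r)
{
    int i;
    memset(r, 0, sizeof(*r));
    for (i = 0; i < RING_LEN; i++)
        r->ready[i] = i;
    r->nready = RING_LEN;               /* ready_in wraps back to 0 */
}

/* client side: take a message off the ready queue */
int ring_get(ring_t *r)
{
    int ix = r->ready[r->ready_out];
    r->ready_out = (r->ready_out + 1) % RING_LEN;
    r->nready--;
    return ix;
}

/* client side: hand a message to the worker via the active queue */
void ring_put(ring_t *r, int ix)
{
    r->active[r->active_in] = ix;
    r->active_in = (r->active_in + 1) % RING_LEN;
    r->nactive++;
}

/* worker side: service one request, put it back on the ready queue */
void ring_worker_step(ring_t *r)
{
    int ix = r->active[r->active_out];
    r->active_out = (r->active_out + 1) % RING_LEN;
    r->nactive--;
    if (r->msg[ix].op == RING_OP_READ)          /* r_readfunc stand-in */
        snprintf(r->msg[ix].buf, sizeof(r->msg[ix].buf), "record");
    r->msg[ix].stat = RING_STAT_OK;
    r->ready[r->ready_in] = ix;
    r->ready_in = (r->ready_in + 1) % RING_LEN;
    r->nready++;
}
```

In the real code the two sides run concurrently: the client blocks on the ready queue's semaphore while the worker drains the active queue.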

<h4><a name="librmt">Librmt</a></h4>
Librmt is a standard library on IRIX which provides a set of
remote I/O functions:
<ul>
<li>rmtopen
<li>rmtclose
<li>rmtioctl
<li>rmtread
<li>rmtwrite
</ul>
On Linux, a librmt library is provided as part of the
xfsdump distribution.
The remote functions are used to dump/restore to remote
tape drives on remote machines. It does this by using
rsh or ssh to run rmt(1) on the remote machine.
The main caveat, however, comes into play for the <i>rmtioctl</i>
function.  Unfortunately, the values for mt operations and status
codes are different on different machines.
For example, the offline command op
on IRIX is 6 and on Linux it is 7. On Linux, 6 is rewind and
on IRIX 7 is a no-op.
So for the Linux xfsdump, the <i>rmtioctl</i> function has been rewritten
to check what the remote OS is (e.g. <i>rsh host uname</i>)
and do the appropriate mapping of codes.
As well as the different mt op codes, the mtget structures
differ for IRIX and Linux and for Linux 32 bit and Linux 64 bit.
The size of the mtget structure is used to determine which
structure it is and the value of <i>mt_type</i> is used to
determine if endian conversion needs to be done.
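For the offline example quoted above, the remapping alone might look like this. This is a sketch; the real <i>rmtioctl</i> maps whole tables of op and status codes for each remote OS, not just this one:

```c
#include <assert.h>
#include <string.h>

/* Return the numeric "offline" mt op code for the remote host.
 * Codes from the text: 6 on IRIX, 7 on Linux.  Sending the IRIX
 * value to a Linux rmt would rewind the tape instead. */
int map_offline_op(const char *remote_os)
{
    if (strcmp(remote_os, "IRIX") == 0)
        return 6;
    return 7;                                      /* Linux */
}
```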
<p>

<h4><a name="drive_minrmt">Drive Minrmt</a></h4>
The minrmt strategy was written based on (copied from) the scsitape
strategy. It has been simplified so that the state of the
tape driver is not needed (i.e. the status of EOT, BOT, EOD, FMK, ...
is not used) and the current block size of the tape driver
is not used. Instead, error handling is based on the return
codes from reading and writing, and the block size must be given
as a parameter. It was designed for talking
to remote NON-IRIX hosts, where the status codes can vary.
However, as was mentioned in the discussion of librmt on Linux,
the mt operations vary on foreign hosts as well as the status
codes, so this is only a limited solution.

<h4><a name="drive_simple">Drive Simple</a></h4>
The simple strategy was designed for dumping to files
or stdout. It is simpler in that it does <b>NOT</b> have to worry
about:
<ul>
<li>the ring buffer
<li>talking to the scsitape driver with various operations and status
<li>multiple media files
</ul>

<p>
<hr>
<h3><a name="inventory">Online Inventory</a></h3>
xfsdump keeps a record of previous xfsdump executions in the online inventory
stored in /var/xfsdump/inventory or for Linux, /var/lib/xfsdump/inventory.
This inventory is used to determine which previous dump an incremental dump
should be based on.  That is, when doing a level > 0 dump for a filesystem,
xfsdump will refer to the online inventory to work out when the last dump for
that filesystem was performed in order to work out which files will be
included in the current dump.  I believe the online inventory is also used
by xfsrestore in order to determine which tapes will be needed to completely
restore a dump.
<p>
xfsinvutil is a utility originally designed to remove unwanted information
from the online inventory.  Recently it has been beefed up to allow interactive
browsing of the inventory and the ability to merge/import one inventory into
another.  (See Bug 818332.)
<p>
The inventory consists of three types of files:
<p>
<table border width="100%">
<caption><b>Inventory files</b></caption>
<tr>
	<th>Filename</th>
  <th>Description</th>
</tr>
<tr>
	<td>fstab</td>
  <td>There is one fstab file which contains the list of filesystems that are referenced in the
  inventory.</td>
</tr>
<tr>
	<td>*.InvIndex</td>
  <td>There is one InvIndex file per filesystem, which contains pointers to the StObj files sorted
  temporally.</td>
</tr>
<tr>
	<td>*.StObj</td>
  <td>There may be many StObj files per filesystem.  Each file contains information about up to five
  individual xfsdump executions.  The information relates to what tapes were used, which inodes are
  stored in which media files, etc.</td>
</tr>
</table>
<p>
The files are constructed like so:
<h4>fstab</h4>
<table border width="100%">
<caption><b>fstab structure</b></caption>
<tr>
	<th>Quantity</th>
  <th>Data structure</th>
</tr>
	<tr>
	<td>1</td>
    <td>
<pre>
typedef struct invt_counter {
    INVT_COUNTER_FIELDS
        uint32_t      ic_vernum;/* on disk version number for posterity */\
        u_int         ic_curnum;/* number of sessions/invindices recorded \
                                   so far */                              \
        u_int         ic_maxnum;/* maximum number of sessions/inv_indices \
                                   that we can record on this stobj */

    char              ic_padding[0x20 - INVT_COUNTER_FIELDS_SIZE];
} invt_counter_t;
</pre>
		</td>
	</tr>
	<tr>
	<td>1 per filesystem</td>
	<td>
<pre>
typedef struct invt_fstab {
    uuid_t  ft_uuid;
    char    ft_mountpt[INV_STRLEN];
    char    ft_devpath[INV_STRLEN];
    char    ft_padding[16];
} invt_fstab_t;
</pre>
		</td>
	</tr>
</table>
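Given this layout (one invt_counter_t followed by ic_curnum invt_fstab_t records), locating a record in the fstab file is simple offset arithmetic. The sketch below uses an assumed value for INV_STRLEN and a stand-in for uuid_t; see the inventory headers for the real definitions:

```c
#include <assert.h>

#define INV_STRLEN 128          /* assumed; see the inventory headers */

typedef struct { unsigned char b[16]; } inv_uuid_t;  /* uuid_t stand-in */

typedef struct invt_counter {
    unsigned int ic_vernum;
    unsigned int ic_curnum;     /* number of fstab entries recorded */
    unsigned int ic_maxnum;
    char         ic_padding[0x20 - 3 * sizeof(unsigned int)];
} invt_counter_t;

typedef struct invt_fstab {
    inv_uuid_t ft_uuid;
    char       ft_mountpt[INV_STRLEN];
    char       ft_devpath[INV_STRLEN];
    char       ft_padding[16];
} invt_fstab_t;

/* byte offset of the k-th filesystem record in the fstab file */
long fstab_offset(unsigned int k)
{
    return (long)(sizeof(invt_counter_t) + k * sizeof(invt_fstab_t));
}
```

Note the counter's padding fixes its on-disk size at 0x20 bytes, so the record offsets are stable across versions that keep the same field sizes.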


<h4>InvIndex</h4>
<table border width="100%">
<caption><b>InvIndex structure</b></caption>
<tr>
	<th>Quantity</th>
  <th>Data structure</th>
</tr>
	<tr>
	<td>1</td>
    <td>
<pre>
typedef struct invt_counter {
    INVT_COUNTER_FIELDS
        uint32_t      ic_vernum;/* on disk version number for posterity */\
        u_int         ic_curnum;/* number of sessions/invindices recorded \
                                   so far */                              \
        u_int         ic_maxnum;/* maximum number of sessions/inv_indices \
                                   that we can record on this stobj */
    char              ic_padding[0x20 - INVT_COUNTER_FIELDS_SIZE];
} invt_counter_t;
</pre>
		</td>
	</tr>
	<tr>
	<td>1 per StObj file</td>
	<td>
<pre>
typedef struct invt_entry {
    invt_timeperiod_t ie_timeperiod;
    char              ie_filename[INV_STRLEN];
    char              ie_padding[16];
} invt_entry_t;
</pre>
		</td>
	</tr>
</table>

<h4>StObj</h4>
<table border width="100%">
<caption><b>StObj structure</b></caption>
<tr>
	<th>Quantity</th>
  <th>Data structure</th>
</tr>
	<tr>
	<td>1</td>
    <td>
<pre>
typedef struct invt_sescounter {
    INVT_COUNTER_FIELDS
        uint32_t      ic_vernum;/* on disk version number for posterity */\
        u_int         ic_curnum;/* number of sessions/invindices recorded \
                                   so far */                              \
        u_int         ic_maxnum;/* maximum number of sessions/inv_indices \
                                   that we can record on this stobj */
    off64_t  ic_eof;   /* current end of the file, where the next
                          media file or stream will be written to */
    char     ic_padding[0x20 - ( INVT_COUNTER_FIELDS_SIZE + sizeof( off64_t) )];
} invt_sescounter_t;
</pre>
		</td>
	</tr>
	<tr>
	<td>fixed space for<br>
        INVT_STOBJ_MAXSESSIONS (ie. 5)</td>
	<td>
<pre>
typedef struct invt_seshdr {
    off64_t    sh_sess_off;    /* offset to rest of the sessioninfo */
    off64_t    sh_streams_off; /* offset to start of the set of
                                  stream hdrs */
    time_t     sh_time;        /* time of the dump */
    uint32_t   sh_flag;        /* for misc flags */
    u_char     sh_level;       /* dump level */
    u_char     sh_pruned;      /* pruned by invutil flag */
    char       sh_padding[22];
} invt_seshdr_t;
</pre>
		</td>
	</tr>
	<tr>
	<td>fixed space for<br>
        INVT_STOBJ_MAXSESSIONS (ie. 5)</td>
	<td>
<pre>
typedef struct invt_session {
    uuid_t   s_sesid;	/* this session's id: 16 bytes*/
    uuid_t   s_fsid;	/* file system id */
    char     s_label[INV_STRLEN];  /* session label */
    char     s_mountpt[INV_STRLEN];/* path to the mount point */
    char     s_devpath[INV_STRLEN];/* path to the device */
    u_int    s_cur_nstreams;/* number of streams created under
                               this session so far */
    u_int    s_max_nstreams;/* number of media streams in
                               the session */
    char     s_padding[16];
} invt_session_t;</pre>
		</td>
	</tr>
  <tr>
	<td rowspan=2>any number</td>
	  <td>
<pre>
typedef struct invt_stream {
    /* duplicate info from mediafiles for speed */
    invt_breakpt_t  st_startino;   /* the starting pt */
    invt_breakpt_t  st_endino;     /* where we actually ended up. this
                                      means we've written upto but not
                                      including this breakpoint. */
    off64_t         st_firstmfile;  /*offsets to the start and end of*/
    off64_t         st_lastmfile;	  /* .. linked list of mediafiles */
    char            st_cmdarg[INV_STRLEN]; /* drive path */
    u_int           st_nmediafiles; /* number of mediafiles */
    bool_t          st_interrupted;	/* was this stream interrupted ? */
    char            st_padding[16];
} invt_stream_t;
</pre>
		</td>
	</tr>
	<tr>
		<td>
<pre>
typedef struct invt_mediafile {
    uuid_t           mf_moid;	    /* media object id */
    char             mf_label[INV_STRLEN];	/* media file label */
    invt_breakpt_t   mf_startino; /* file that we started out with */
    invt_breakpt_t   mf_endino;	  /* the dump file we ended this
                                     media file with */
    off64_t          mf_nextmf;   /* links to other mfiles */
    off64_t          mf_prevmf;
    u_int            mf_mfileidx; /* index within the media object */
    u_char           mf_flag;     /* Currently MFILE_GOOD, INVDUMP */
    off64_t          mf_size;     /* size of the media file */
    char             mf_padding[15];
} invt_mediafile_t;
</pre>
	</td>
  </tr>
</table>

<p>
The data structures above converted to a block diagram look something
like this:
<p>
<img src="inventory.gif">

<p>
The source code for accessing the inventory is contained in the inventory
directory.  The source code for xfsinvutil is contained in the invutil
directory.  xfsinvutil only uses some header files from the inventory
directory for data structure definitions -- it uses its own code to access
and modify the inventory.
<p>
<hr>
<h3><a name="Q&A">Questions and Answers</a></h3>

<dl>

<dt><b><a name="DMF">How is -a and -z handled by xfsdump ?</a></b>
<dd>
If -a is NOT used, then it looks like nothing special happens
for files which have DMF state attached to them.
So if the file uses too many blocks compared to our maxsize parameter (-z),
it will not get dumped: no inode and no data.
The only evidence will be its entry in the inode
map (which is dumped), which records it as a no-change non-dir, and
its directory entry in the directory dump. The latter means
that an <i>ls</i> in xfsrestore will show the file, but it
cannot be restored.
<p>
If -a <b>is</b> used and the file has some DMF state then we do some magic.
However, the magic really only seems to occur for dual-state files
(or possibly also unmigrating files).
<p>
A file is marked as dual-state/unmigrating by looking at the DMF attribute,
dmfattrp->state[1], i.e. whether it equals DMF_ST_DUALSTATE or
DMF_ST_UNMIGRATING. If so, we set dmf_f_ctxtp->candidate = 1.
If we have such a changed dual-state file, we
mark it as changed in the inode map so it will be dumped.
If it is a dual-state file, its apparent size will be zero, so it
will go on to the dumping stage.
<p>
When we go to dump the extents of the dual-state file, we
do something different: we store the extents as a single extent
which is a hole. I.e. this is the "NOT dumping data" bit.
<p>
When we go to dump the file-hdr of the dual-state file, we
set, statp->bs_dmevmask |= (1<<DM_EVENT_READ);
<p>
When we go to dump the extended attributes of the dual-state file, we
skip dumping the DMF ones!
However, at the end of dumping the attributes, we then go
and add a new DMF attribute for it:
<pre>
        dmfattrp->state[1] = DMF_ST_OFFLINE;
        *valuepp = (char *)dmfattrp;
        *namepp = DMF_ATTR_NAME;
        *valueszp = DMF_ATTR_LEN;
</pre>
<br>
<b>Summary:</b>
<ul>
<li>dual-state files (and unmigrating files) dumped with -a
    cause magic to happen:
    <ul>
    <li>if the file has changed then it will _always_ be marked
       to be dumped (irrespective of file size/blocks)
    <li>its extent data will be dumped as 1 extent with a hole
    <li>its DMF attributes won't be dumped but a replacement
       DMF attribute will be dumped in its place
    <li>the stat buf's bs_dmevmask will be or'ed with DM_EVENT_READ
    </ul>
<li>for all other cases,
     if the file has changed and its blocks cause it to exceed the
     maxsize param (-z) then the file will be marked as NOT-CHANGED
     in the inode map and so will NOT be dumped at all
</ul>
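The -z / -a interaction in the summary can be sketched as a predicate. The helper name <i>will_dump_file</i> is invented, the real code marks the inode map rather than returning a flag, and maxsize is taken in bytes here for simplicity:

```c
#include <assert.h>

/* Sketch of the -z / -a decision; invented helper. */
int will_dump_file(long nblocks, long blksz,
                   long maxsize,       /* -z limit, 0 = no limit */
                   int dmf_candidate)  /* set under -a for dual-state */
{
    if (dmf_candidate)
        return 1;       /* -a dual-state: always dumped, as a hole */
    if (maxsize && nblocks * blksz > maxsize)
        return 0;       /* marked NOT-CHANGED: no inode, no data */
    return 1;
}
```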
<p>

<dt><b><a name="dump_size_est">How does it compute estimated dump size ?</a></b>
<dd>
A dump consists of media files (only one in the case of a dump to a file,
and usually many when dumped to a tape, depending on the device type).
A media file consists of:
<ul>
  <li> global header
  <li> inode map (inode# + state(e.g.dump or not?) )
  <li> directories
  <li> non-directory files
</ul>
<p>
A directory consists of a header, directory-entry headers for
its entries (&lt;inode#, gen#, entry-sz, csum, entry-name&gt;),
and an extended-attribute header and attributes.
<p>
A non-directory file consists of a file header, extent-headers
(for each extent), file data and extended-attribute header
and attributes. Some types of files don't have extent headers or data.
<p>
The xfsdump code says:
<pre>
        size_estimate = GLOBAL_HDR_SZ
                        +
                        inomap_getsz( )
                        +
                        inocnt * ( u_int64_t )( FILEHDR_SZ + EXTENTHDR_SZ )
                        +
                        inocnt * ( u_int64_t )( DIRENTHDR_SZ + 8 )
                        +
                        datasz;
</pre>

So this accounts for the:
<ul>
  <li>global header
  <li>inode map
  <li>all the files
  <li>all the directory entries
     ( "+8" presumably to account for the average file name length,
       where 8 chars are already included in the header; as this structure
       is padded to the next 8-byte boundary, it accounts for names
       with lengths between 8 and 15 chars)
  <li>data
</ul>

<p>
What the estimate doesn't seem to account for (that I can think of):
<ul>
  <li> no extended attributes
  <li> assumes that a file will only have one extent
  <li> no tape block headers (for tape media)
</ul>

<p>
"Datasz" is calculated by adding up for every regular inode file,
its (number of data blocks) * (block size).
However, if "-a" is used, then instead of doing this,
if the file is dualstate/offline then the file's
data won't be dumped and it adds zero for it.
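As a function, the estimate formula quoted above looks like this. The header sizes here are assumed placeholder values for illustration, not the real FILEHDR_SZ etc.:

```c
#include <assert.h>

typedef unsigned long long u64;

/* assumed placeholder sizes, not the real on-media header sizes */
#define GLOBAL_HDR_SZ 4096ULL
#define FILEHDR_SZ     256ULL
#define EXTENTHDR_SZ    32ULL
#define DIRENTHDR_SZ    24ULL

u64 dump_size_estimate(u64 inomapsz, u64 inocnt, u64 datasz)
{
    return GLOBAL_HDR_SZ                           /* global header */
           + inomapsz                              /* inomap_getsz() */
           + inocnt * (FILEHDR_SZ + EXTENTHDR_SZ)  /* one hdr + one extent hdr per file */
           + inocnt * (DIRENTHDR_SZ + 8)           /* dirent hdr + name-length fudge */
           + datasz;                               /* summed file data */
}
```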
<p>


<dt><b><a name="dump_size_ac">Is the "dump size (non-dir files) : 910617928 bytes" the actual number of bytes it wrote to that tape ?</a></b>

<dd>
It is the number of bytes written to the dump for the non-directory
files' extents (not including the file header nor the extent-header terminator).
(I don't think this includes the tape block headers for a tape dump
either.)
It includes, for each file:
<ul>
  <li>any hole hdrs
  <li>alignment hdrs
  <li>alignment padding
  <li>extent headers for data
  <li>actual _data_ of extents
</ul>

From code:
<pre>
    bytecnt += sizeof( filehdr_t );
    dump_extent_group(...,&bc,...);
	bytecnt = 0;
	bytecnt += sizeof( extenthdr_t );  /* extent header for hole */
	bytecnt += sizeof( extenthdr_t );  /* ext. alignment header */
	bytecnt += ( off64_t )cnt_to_align /* alignment padding */
	bytecnt += sizeof( extenthdr_t );  /* extent header for data */
	bytecnt += ( off64_t )actualsz;    /* actual extent data in file */
	bytecnt += ( off64_t )reqsz; /* write padding to make up extent size */
    sc_stat_datadone += ( size64_t )bc;
</pre>


It doesn't include the initial file header:
<pre>
    rv = dump_filehdr( ... );
    bytecnt += sizeof( filehdr_t );
</pre>
nor the extent hdr terminator:
<pre>
    rv = dump_extenthdr( ..., EXTENTHDR_TYPE_LAST,...);
    bytecnt += sizeof( extenthdr_t );
    contextp->cc_mfilesz += bytecnt;
</pre>
It only adds this data size into the media file size.

</dl>
<p>
<hr>
<h3><a name="out_quest">Outstanding Questions</a></h3>
<ul>
<li>How is the inode map on the tape used by xfsrestore ?
<li>Is the final inventory media file on the media ever used/restored ?
<li>How are tape marks used and written ?
<li>What is the difference between a record and a block ?
    <ul><li>I don't think there is a difference.</ul>
<li>Where are tape_recsz and tape_blksz used ?
    <ul><li>Tape_recsz is used for the read/write byte cnt but
    I don't think tape_blksz is used.</ul>
<li>What is the persistent inventory used for ?
</ul>

</body>
</html>