.TH "OCFS2" "7" "January 2012" "Version @VERSION@" "OCFS2 Manual Pages"
.SH "NAME"
OCFS2 \- A Shared-Disk Cluster File System for Linux

.SH "INTRODUCTION"
.PP

\fBOCFS2\fR is a \fBfile system\fR. It allows users to store and retrieve data. The data
is stored in files that are organized in a hierarchical directory tree. It is a \fBPOSIX compliant\fR
file system that supports the standard interfaces and the behavioral semantics as spelled out
by that specification.

It is also a \fBshared disk cluster\fR file system, one that allows multiple nodes to access the
same disk at the same time. This is where the fun begins as allowing a file system to be
accessible on multiple nodes opens a can of worms. What if the nodes are of different
architectures? What if a node dies while writing to the file system? What data consistency
can one expect if processes on two nodes are reading and writing concurrently? What if
one node removes a file while it is still being used on another node?

Unlike most shared file systems where the answer is fuzzy, the answer in OCFS2 is very
well defined. It behaves on all nodes exactly like a \fBlocal\fR file system. If a file is
removed, the directory entry is removed but the inode is kept as long as it is in use across
the cluster. When the last user closes the descriptor, the inode is marked for deletion.

The data consistency model follows the same principle. It works as if the two processes
that are running on two different nodes are running on the same node. A read on a node
gets the last write irrespective of the IO mode used. The modes can be \fIbuffered\fR, \fIdirect\fR,
\fIasynchronous\fR, \fIsplice\fR or \fImemory mapped\fR IOs. It is fully \fBcache coherent\fR.

Take for example the REFLINK feature that allows a user to create multiple write-able
snapshots of a file. This feature, like all others, is fully cluster-aware. A file
being written to on multiple nodes can be safely reflinked on another. The snapshot
created is a point-in-time image of the file that includes both the file data and all its
attributes (including extended attributes).

It is a \fBjournaling\fR file system. When a node dies, a surviving node transparently replays
the journal of the dead node. This ensures that the file system metadata is always
consistent. It also defaults to ordered data journaling to ensure the file data is flushed
to disk before the journal commit, to remove the small possibility of stale data appearing
in files after a crash.

It is \fBarchitecture\fR and \fBendian neutral\fR. It allows concurrent mounts on nodes with
different processors like x86, x86_64, IA64 and PPC64. It handles little and big endian,
32-bit and 64-bit architectures.

It is \fBfeature rich\fR. It supports \fIindexed directories\fR, \fImetadata checksums\fR,
\fIextended attributes\fR, \fIPOSIX ACLs\fR, \fIquotas\fR, \fIREFLINKs\fR, \fIsparse files\fR,
\fIunwritten extents\fR and \fIinline-data\fR.

It is \fBfully integrated\fR with the mainline Linux kernel. The file system was merged
into Linux kernel 2.6.16 in early 2006.

It is \fBquickly installed\fR. It is available with almost all Linux distributions.
The file system is \fBon-disk compatible\fR across all of them.

It is \fBmodular\fR. The file system can be configured to operate with other cluster
stacks like \fIPacemaker\fR and \fICMAN\fR along with its own stack, \fIO2CB\fR.

It is \fBeasily configured\fR. The O2CB cluster stack configuration involves editing two
files, one for cluster layout and the other for cluster timeouts.

It is \fBvery efficient\fR. The file system consumes very little resources. It is used
to store virtual machine images in limited memory environments like Xen and KVM.

In summary, OCFS2 is an efficient, easily configured, modular, quickly installed, fully
integrated and compatible, feature-rich, architecture and endian neutral, cache coherent,
ordered data journaling, POSIX-compliant, shared disk cluster file system.

.SH "OVERVIEW"
.PP

OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing
both high performance and high availability.

As it provides local file system semantics, it can be used with almost all applications.
Cluster-aware applications can make use of cache-coherent parallel I/Os from multiple nodes
to scale out applications easily. Other applications can make use of the clustering
facilities to fail over a running application in the event of a node failure.

The notable features of the file system are:
.TP
\fBTunable Block size\fR
The file system supports block sizes of 512, 1K, 2K and 4K bytes. 4KB is almost always
recommended. This feature is available in all releases of the file system.

.TP
\fBTunable Cluster size\fR
A cluster size is also referred to as an allocation unit. The file system supports
cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M bytes. For most use
cases, 4KB is recommended. However, a larger value is recommended for volumes hosting
mostly very large files like database files, virtual machine images, etc. A large
cluster size allows the file system to store large files more efficiently. This feature
is available in all releases of the file system.
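
For example, a volume intended for virtual machine images could be formatted with a
4KB block size and a 1MB cluster size as follows (the device name and label below are
illustrative):

.nf
.ft 6
# mkfs.ocfs2 -b 4K -C 1M -L "vmstore" /dev/sda1
.ft
.fi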

.TP
\fBEndian and Architecture neutral\fR
The file system can be mounted concurrently on nodes having different architectures.
Like 32-bit, 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64, s390x).
This feature is available in all releases of the file system.

.TP
\fBBuffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes\fR
The file system supports all modes of I/O for maximum flexibility and performance.
It also supports cluster-wide \fBshared writeable mmap(2)\fR. The support for buffered,
direct and asynchronous I/O is available in all releases. The support for splice I/O
was added in Linux kernel \fB2.6.20\fR and for shared writeable mmap(2) in \fB2.6.23\fR.

.TP
\fBMultiple Cluster Stacks\fR
The file system includes a flexible framework to allow it to function with userspace
cluster stacks like Pacemaker (\fBpcmk\fR) and CMAN (\fBcman\fR), its own in-kernel
cluster stack \fBo2cb\fR and \fIno\fR cluster stack.

The support for \fBo2cb\fR cluster stack is available in all releases.

The support for \fIno\fR cluster stack, or \fBlocal\fR mount, was added in Linux
kernel \fB2.6.20\fR.

The support for userspace cluster stack was added in Linux kernel \fB2.6.26\fR.

.TP
\fBJournaling\fR
The file system supports both \fBordered\fR (default) and \fBwriteback\fR data journaling
modes to provide file system consistency in the event of power failure or system crash.
It uses \fBJBD2\fR in Linux kernel \fB2.6.28\fR and later. It used \fBJBD\fR in earlier
kernels.

.TP
\fBExtent-based Allocations\fR
The file system allocates and tracks space in ranges of clusters. This is unlike block
based file systems that have to track each and every block. This feature allows the
file system to be very efficient when dealing with both large volumes and large files.
This feature is available in all releases of the file system.

.TP
\fBSparse files\fR
Sparse files are files with holes. With this feature, the file system delays allocating
space until a write is issued to a cluster. This feature was added in Linux kernel \fB2.6.22\fR
and requires enabling on-disk feature \fBsparse\fR.
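
For example, a large sparse file can be created with \fBtruncate(1)\fR; the allocated
size reported by \fBls(1)\fR stays tiny because no clusters are allocated until written
(the file name below is illustrative):

.nf
.ft 6
$ truncate -s 10G sparsefile
$ ls -lsh sparsefile
.ft
.fi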

.TP
\fBUnwritten Extents\fR
An unwritten extent is also referred to as user pre-allocation. It allows an application
to request a range of clusters to be allocated, but not initialized, within a file.
Pre-allocation allows the file system to optimize the data layout with fewer, larger
extents. It also provides a performance boost, delaying initialization until the user
writes to the clusters. This feature was added in Linux kernel \fB2.6.23\fR and requires
enabling on-disk feature \fBunwritten\fR.
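
From the shell, pre-allocation can be requested with \fBfallocate(1)\fR, which invokes
\fBfallocate(2)\fR underneath (the file name below is illustrative):

.nf
.ft 6
$ fallocate -l 1G prealloc.img
.ft
.fi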

.TP
\fBHole Punching\fR
Hole punching allows an application to remove arbitrary allocated regions within a
file. Creating holes, essentially. This is more efficient than zeroing the same extents.
This feature is especially useful in virtualized environments as it allows a block discard
in a guest file system to be converted to a hole punch in the host file system thus
allowing users to reduce disk space usage. This feature was added in Linux kernel \fB2.6.23\fR
and requires enabling on-disk features \fBsparse\fR and \fBunwritten\fR.
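
From the shell, a hole can be punched with \fBfallocate(1)\fR. For example, to
deallocate the first megabyte of a file (the file name below is illustrative):

.nf
.ft 6
$ fallocate --punch-hole --offset 0 --length 1M somefile
.ft
.fi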

.TP
\fBInline-data\fR
Inline data is also referred to as data-in-inode as it allows storing small files
and directories in the inode block. This not only saves space but also has a positive
impact on cold-cache directory and file operations. The data is transparently moved
out to an extent when it no longer fits inside the inode block. This feature was added
in Linux kernel \fB2.6.24\fR and requires enabling on-disk feature \fBinline-data\fR.

.TP
\fBREFLINK\fR
REFLINK is also referred to as fast copy. It allows users to atomically (and instantly)
copy regular files. In other words, create multiple writeable snapshots of regular files.
It is called REFLINK because it looks and feels more like a (hard) \fBlink(2)\fR than a
traditional snapshot. Like a link, it is a regular user operation, subject to the security
attributes of the inode being reflinked and not to the super user privileges typically
required to create a snapshot. Like a link, it operates within a file system. But unlike
a link, it links the inodes at the data extent level allowing each reflinked inode to grow
independently as and when written to. Up to four billion inodes can share a data extent.
This feature was added in Linux kernel \fB2.6.32\fR and requires enabling on-disk feature
\fBrefcount\fR.

.TP
\fBAllocation Reservation\fR
File contiguity plays an important role in file system performance. When a file is
fragmented on disk, reading and writing to the file involves many seeks, leading to
lower throughput. Contiguous files, on the other hand, minimize seeks, allowing the
disks to perform IO at the maximum rate.

With allocation reservation, the file system reserves a window in the bitmap for all
extending files allowing each to grow as contiguously as possible. As this extra space
is not actually allocated, it is available for use by other files if the need arises.
This feature was added in Linux kernel \fB2.6.35\fR and can be tuned using the mount
option \fBresv_level\fR.
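
For example, to mount a volume with a larger reservation window (the device and mount
point below are illustrative; refer to \fBmount.ocfs2(8)\fR for the valid range of
values):

.nf
.ft 6
# mount -o resv_level=4 /dev/sda1 /ocfs2
.ft
.fi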

.TP
\fBIndexed Directories\fR
An indexed directory allows users to perform quick lookups of a file in very large
directories. It also results in faster creates and unlinks and thus provides better
overall performance. This feature was added in Linux kernel \fB2.6.30\fR and requires
enabling on-disk feature \fBindexed-dirs\fR.

.TP
\fBFile Attributes\fR
This refers to EXT2-style file attributes, such as immutable, modified using
\fBchattr(1)\fR and queried using \fBlsattr(1)\fR. This feature was added in Linux
kernel \fB2.6.19\fR.
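
For example, to mark a file immutable and then verify the flag (the file name below is
illustrative):

.nf
.ft 6
# chattr +i myfile
# lsattr myfile
.ft
.fi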

.TP
\fBExtended Attributes\fR
An extended attribute refers to a name:value pair that can be associated with file
system objects like regular files, directories, symbolic links, etc. \fIOCFS2\fR allows
associating an \fIunlimited\fR number of attributes per object. The attribute names can be
up to 255 bytes in length, terminated by the first NUL character. While it is not
required, printable names (ASCII) are recommended. The attribute values can be up
to 64 KB of arbitrary binary data. These attributes can be modified and listed using
standard Linux utilities \fBsetfattr(1)\fR and \fBgetfattr(1)\fR. This feature was
added in Linux kernel \fB2.6.29\fR and requires enabling on-disk feature \fBxattr\fR.
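
For example, to attach and then query a user attribute (the attribute name, value and
file name below are illustrative):

.nf
.ft 6
$ setfattr -n user.location -v "nyc" myfile
$ getfattr -n user.location myfile
.ft
.fi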

.TP
\fBMetadata Checksums\fR
This feature allows the file system to detect silent corruptions in all metadata blocks
like inodes and directories. This feature was added in Linux kernel \fB2.6.29\fR and
requires enabling on-disk feature \fBmetaecc\fR.

.TP
\fBPOSIX ACLs and Security Attributes\fR
POSIX ACLs allow assigning fine-grained discretionary access rights for files and
directories. This security scheme is a lot more flexible than the traditional file
access permissions, which impose a strict user-group-other model.

Security attributes allow the file system to support other security regimes like SELinux,
SMACK, AppArmor, etc.

Both these security extensions were added in Linux kernel \fB2.6.29\fR and require
enabling on-disk feature \fBxattr\fR.
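
The ACLs are managed using the standard utilities \fBsetfacl(1)\fR and \fBgetfacl(1)\fR.
For example (the user and file names below are illustrative):

.nf
.ft 6
$ setfacl -m u:jeff:rw myfile
$ getfacl myfile
.ft
.fi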

.TP
\fBUser and Group Quotas\fR
This feature allows setting up usage quotas on a per-user and per-group basis using the
standard utilities like \fBquota(1)\fR, \fBsetquota(8)\fR, \fBquotacheck(8)\fR, and
\fBquotaon(8)\fR. This feature was added in Linux kernel \fB2.6.29\fR and requires
enabling on-disk features \fBusrquota\fR and \fBgrpquota\fR.
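
For example, after mounting a volume with the \fBusrquota\fR mount option, a per-user
block quota could be set and queried as follows (the user name, limits and paths below
are illustrative):

.nf
.ft 6
# mount -o usrquota /dev/sda1 /ocfs2
# setquota -u jeff 1024000 2048000 0 0 /ocfs2
# quota -u jeff
.ft
.fi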

.TP
\fBUnix File Locking\fR
The Unix operating system has historically provided two system calls to lock files.
\fBflock(2)\fR or BSD locking and \fBfcntl(2)\fR or POSIX locking. \fIOCFS2\fR
extends both file locks to the cluster. File locks taken on one node interact with those
taken on other nodes.

The support for clustered \fBflock(2)\fR was added in Linux kernel \fB2.6.26\fR.
All \fBflock(2)\fR options are supported, including the kernel's ability to cancel
a lock request when an appropriate kill signal is received by the user. This feature
is supported with all cluster-stacks including \fBo2cb\fR.

The support for clustered \fBfcntl(2)\fR was added in Linux kernel \fB2.6.28\fR.
But because it requires group communication to make the locks coherent, it is only
supported with userspace cluster stacks, \fBpcmk\fR and \fBcman\fR and \fInot\fR
with the default cluster stack \fBo2cb\fR.

.TP
\fBComprehensive Tools Support\fR
The file system has a comprehensive EXT3-style toolset that tries to use similar
parameters for ease-of-use. It includes mkfs.ocfs2(8) (format), tunefs.ocfs2(8)
(tune), fsck.ocfs2(8) (check), debugfs.ocfs2(8) (debug), etc.

.TP
\fBOnline Resize\fR
The file system can be dynamically grown using \fBtunefs.ocfs2(8)\fR. This feature
was added in Linux kernel \fB2.6.25\fR.
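
For example, to grow the file system to fill an already-enlarged device (the device
name below is illustrative; refer to \fBtunefs.ocfs2(8)\fR for the \fB-S\fR
volume-size option):

.nf
.ft 6
# tunefs.ocfs2 -S /dev/sda1
.ft
.fi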

.SH "RECENT CHANGES"
.PP

The O2CB cluster stack has a \fBglobal heartbeat\fR mode. It allows users to specify
heartbeat regions that are consistent across all nodes. The cluster stack also allows
online addition and removal of both nodes and heartbeat regions.

\fBo2cb(8)\fR is the new cluster configuration utility. It is an easy to use utility
that allows users to create the cluster configuration on a node that is not part of the
cluster. It replaces the older utility \fBo2cb_ctl(8)\fR, which has been deprecated.

\fBocfs2console(8)\fR has been obsoleted.

\fBo2info(8)\fR is a new utility that can be used to provide file system information.
It allows non-privileged users to see the enabled file system features, block and
cluster sizes, extended file stat, free space fragmentation, etc.

\fBo2hbmonitor(8)\fR is an \fBo2hb\fR heartbeat monitor. It is an extremely lightweight
utility that logs messages to the system logger once the heartbeat delay exceeds the
warn threshold. This utility is useful in identifying volumes encountering I/O delays.

\fBdebugfs.ocfs2(8)\fR has some new commands. \fInet_stats\fR shows the \fBo2net\fR
message times between various nodes. This is useful in identifying nodes that are
slowing down cluster operations. \fIstat_sysdir\fR allows the user to dump the entire system
directory that can be used to debug issues. \fIgrpextents\fR dumps the complete free space
fragmentation in the cluster group allocator.
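
These commands can be run non-interactively with the \fB-R\fR option. For example
(the device name below is illustrative):

.nf
.ft 6
# debugfs.ocfs2 -R "net_stats" /dev/sda1
.ft
.fi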

\fBmkfs.ocfs2(8)\fR now enables \fIxattr\fR, \fIindexed-dirs\fR, \fIdiscontig-bg\fR,
\fIrefcount\fR, \fIextended-slotmap\fR and \fIclusterinfo\fR feature flags by default,
in addition to the older defaults, \fIsparse\fR, \fIunwritten\fR and \fIinline-data\fR.

\fBmount.ocfs2(8)\fR allows users to specify the level of cache coherency between nodes.
By default the file system operates in full coherency mode that also serializes the
direct I/Os. While this mode is technically correct, it limits the I/O throughput in a
clustered database. This mount option allows the user to limit the cache coherency
to only the buffered I/Os to allow multiple nodes to do concurrent direct writes to
the same file. This feature works with Linux kernel \fB2.6.37\fR and later.
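
For example, assuming the \fBcoherency\fR mount option described in \fBmount.ocfs2(8)\fR,
a volume hosting database files could be mounted as follows (the device and mount point
below are illustrative):

.nf
.ft 6
# mount -o coherency=buffered /dev/sda1 /ocfs2
.ft
.fi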

.SH "COMPATIBILITY"
.PP

The OCFS2 development team goes to great lengths to maintain compatibility. It attempts
to maintain both on-disk and network protocol compatibility across all releases of the
file system. It does so even while adding new features that entail on-disk format and
network protocol changes. To do this successfully, it follows a few rules:

.in +4n
\fB1\fR. The on-disk format changes are managed by a set of feature flags that can be
turned on and off. The file system in kernel detects these features during mount and
continues only if it understands all of them. Users encountering an unsupported feature have the
option of either disabling that feature or upgrading the file system to a newer release.

\fB2\fR. The latest release of ocfs2-tools is compatible with all versions of the file
system. All utilities detect the features enabled on disk and continue only if they
understand all of them. Users encountering an unsupported feature have to upgrade the tools to
a newer release.

\fB3\fR. The network protocol version is negotiated by the nodes to ensure all nodes
understand the active protocol version.
.in

.TP
\fBFEATURE FLAGS\fR
The feature flags are split into three categories, namely, \fBCompat\fR, \fBIncompat\fR
and \fBRO Compat\fR.

\fBCompat\fR, or compatible, is a feature that the file system does not need to fully
understand to safely read/write to the volume. An example of this is the backup-super
feature that added the capability to backup the super block in multiple locations in the
file system. As the backup super blocks are typically not read nor written to by the file
system, an older file system can safely mount a volume with this feature enabled.

\fBIncompat\fR, or incompatible, is a feature that the file system needs to fully
understand to read/write to the volume. Most features fall under this category.

\fBRO Compat\fR, or read-only compatible, is a feature that the file system needs to
fully understand to write to the volume. Older software can safely read a volume with
this feature enabled. An example of this would be user and group quotas. As quotas are
manipulated only when the file system is written to, older software can safely mount
such volumes in read-only mode.

The list of feature flags, the version of the kernel it was added in, the earliest
version of the tools that understands it, etc., is as follows:

.TS
CENTER ALLBOX;
LB LB LB LB LB
LI C C C C.
Feature Flags	Kernel Version	Tools Version	Category	Hex Value
backup-super	All	ocfs2-tools 1.2	Compat	1
strict-journal-super	All	All	Compat	2
local	Linux 2.6.20	ocfs2-tools 1.2	Incompat	8
sparse	Linux 2.6.22	ocfs2-tools 1.4	Incompat	10
inline-data	Linux 2.6.24	ocfs2-tools 1.4	Incompat	40
extended-slotmap	Linux 2.6.27	ocfs2-tools 1.6	Incompat	100
xattr	Linux 2.6.29	ocfs2-tools 1.6	Incompat	200
indexed-dirs	Linux 2.6.30	ocfs2-tools 1.6	Incompat	400
metaecc	Linux 2.6.29	ocfs2-tools 1.6	Incompat	800
refcount	Linux 2.6.32	ocfs2-tools 1.6	Incompat	1000
discontig-bg	Linux 2.6.35	ocfs2-tools 1.6	Incompat	2000
clusterinfo	Linux 2.6.37	ocfs2-tools 1.8	Incompat	4000
unwritten	Linux 2.6.23	ocfs2-tools 1.4	RO Compat	1
grpquota	Linux 2.6.29	ocfs2-tools 1.6	RO Compat	2
usrquota	Linux 2.6.29	ocfs2-tools 1.6	RO Compat	4
.TE
.BR

To query the features enabled on a volume, do:

.nf
.ps 8
.ft 6
$ o2info --fs-features /dev/sdf1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr
indexed-dirs refcount discontig-bg clusterinfo unwritten
.ft
.ps
.fi

.TP
\fBENABLING AND DISABLING FEATURES\fR

The format utility, \fBmkfs.ocfs2(8)\fR, allows a user to enable and disable specific
features using the \fB--fs-features\fR option. The features are provided as a comma-separated
list. The enabled features are listed as is. The disabled features are prefixed with
\fBno\fR.  The example below shows the file system being formatted with sparse disabled
and inline-data enabled.

.nf
.ft 6
# mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1
.ft
.fi

After formatting, users can toggle features using the tune utility, \fBtunefs.ocfs2(8)\fR.
This is an \fIoffline\fR operation. The volume needs to be unmounted across the cluster.
The example below shows the sparse feature being enabled and inline-data disabled.

.nf
.ft 6
# tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1
.ft
.fi

Care should be taken before enabling and disabling features. Users planning to use a
volume with an older version of the file system are better off not enabling newer
features, as disabling them later may not succeed.

An example would be disabling the sparse feature; this requires filling every hole.
The operation can only succeed if the file system has enough free space.

.TP
\fBDETECTING FEATURE INCOMPATIBILITY\fR

Say one tries to mount a volume with an incompatible feature. What happens then? How
does one detect the problem? How does one know the name of that incompatible feature?

To begin with, one should look for error messages in \fBdmesg(8)\fR. Mount failures that
are due to an incompatible feature will always result in an error message like the following:

.nf
.ps 9
.ft 6
ERROR: couldn't mount because of unsupported optional features (200).
.ft
.ps
.fi

Here the file system is unable to mount the volume due to an unsupported optional
feature. That means the feature is an \fBIncompat\fR feature. By referring to the
table above, one can then deduce that the user failed to mount a volume with the \fBxattr\fR
feature enabled. (The value in the error message is in hexadecimal.)

Another example of an error message due to incompatibility is as follows:

.nf
.ps 9
.ft 6
ERROR: couldn't mount RDWR because of unsupported optional features (1).
.ft
.ps
.fi

Here the file system is unable to mount the volume in RW mode. That means the
feature is a \fBRO Compat\fR feature. Another look at the table and it becomes
apparent that the volume had the \fBunwritten\fR feature enabled.

In both cases, the user has the option of disabling the feature. In the second case,
the user has the choice of mounting the volume in the RO mode.

.SH "GETTING STARTED"
.PP

The OCFS2 software is split into two components, namely, kernel and tools. The kernel
component includes the core file system and the cluster stack, and is packaged along
with the kernel. The tools component is packaged as \fBocfs2-tools\fR and needs to
be specifically installed. It provides utilities to format, tune, mount, debug and
check the file system.

To install \fBocfs2-tools\fR, refer to the package handling utility of your distribution.

The next step is selecting a cluster stack. The options include:

.in +4n
\fBA\fR. No cluster stack, or \fBlocal mount\fR.

\fBB\fR. In-kernel \fBo2cb\fR cluster stack with \fBlocal\fR or \fBglobal\fR heartbeat.

\fBC\fR. Userspace cluster stacks \fBpcmk\fR or \fBcman\fR.
.in

The file system allows changing cluster stacks easily using \fBtunefs.ocfs2(8)\fR.
To list the cluster stacks stamped on the OCFS2 volumes, do:

.nf
.ps 9
.ft 6
# mounted.ocfs2 -d
Device     Stack  Cluster     F  UUID                              Label
/dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
/dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
/dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
/dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
/dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch
.ft
.ps
.fi

.TP
\fBNON-CLUSTERED OR LOCAL MOUNT\fR

To format an \fIOCFS2\fR volume as a non-clustered (\fBlocal\fR) volume, do:

.nf
.ft 6
# mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1
.ft
.fi

To convert an existing clustered volume to a non-clustered volume, do:

.nf
.ft 6
# tunefs.ocfs2 --fs-features=local /dev/sda1
.ft
.fi

Non-clustered volumes do not interact with the cluster stack. One can have both
clustered and non-clustered volumes mounted at the same time.

While formatting a non-clustered volume, users should consider the possibility of later
converting that volume to a clustered one. If there is a possibility of that, then the
user should add enough node-slots using the -N option. Adding node-slots during format
creates journals with large extents. If created later, the journals will be
fragmented, which is not good for performance.
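
For example, to format a local volume with four node slots in anticipation of a later
conversion to a clustered volume (the label and device below are illustrative):

.nf
.ft 6
# mkfs.ocfs2 -N 4 -L "mylabel" --fs-features=local /dev/sda1
.ft
.fi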

.TP
\fBCLUSTERED MOUNT WITH O2CB CLUSTER STACK\fR

Only one of the two heartbeat modes can be active at any one time. Changing heartbeat
modes is an offline operation.

Both heartbeat modes require /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb to
be populated as described in \fBocfs2.cluster.conf(5)\fR and \fBo2cb.sysconfig(5)\fR
respectively. The only difference in set up between the two modes is that \fBglobal\fR
requires heartbeat devices to be configured whereas \fBlocal\fR does not.

Refer to \fBo2cb(7)\fR for more information.

.RS
.TP
\fBLOCAL HEARTBEAT\fR
This is the default heartbeat mode. The user needs to populate the configuration files
as described in \fBocfs2.cluster.conf(5)\fR and \fBo2cb.sysconfig(5)\fR. In this mode,
the cluster stack heartbeats on all mounted volumes. Thus, one does not have to specify
heartbeat devices in cluster.conf.

Once configured, the \fBo2cb\fR cluster stack can be onlined and offlined as follows:

.nf
.ft 6
# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK

# service o2cb offline
Clean userdlm domains: OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK
.ft
.fi

.TP
\fBGLOBAL HEARTBEAT\fR
The configuration is similar to \fBlocal\fR heartbeat. The one additional step in
this mode is that it requires heartbeat devices to be also configured.

These heartbeat devices are OCFS2 formatted volumes with global heartbeat enabled
on disk. These volumes can later be mounted and used as clustered file systems.

The steps to format a volume with global heartbeat enabled are listed in \fBo2cb(7)\fR.
Also listed there are the steps to list all volumes with the cluster stack stamped on disk.

In this mode, the heartbeat is started when the cluster is onlined and stopped when
the cluster is offlined.

.nf
.ft 6
# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
Starting global heartbeat for cluster "webcluster": OK

# service o2cb offline
Clean userdlm domains: OK
Stopping global heartbeat on cluster "webcluster": OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK

# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "webcluster": Online
  Heartbeat dead threshold: 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
  Heartbeat mode: Global
Checking O2CB heartbeat: Active
  77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
Nodes in O2CB cluster: 92 96
.ft
.fi

.RE

.TP
\fBCLUSTERED MOUNT WITH USERSPACE CLUSTER STACK\fR

Configure and online the userspace stack \fBpcmk\fR or \fBcman\fR before using
\fBtunefs.ocfs2(8)\fR to update the cluster stack on disk.

.nf
.ft 6
# tunefs.ocfs2 --update-cluster-stack /dev/sdd1
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? y
.ft
.fi

Refer to the cluster stack documentation for information on starting and stopping
the cluster stack.

.SH "FILE SYSTEM UTILITIES"
.PP

This section lists the utilities that are used to manage \fIOCFS2\fR file systems.
This includes tools to format, tune, check, mount and debug the file system. Each utility
has a man page that lists its capabilities in detail.

.TP
\fBmkfs.ocfs2(8)\fR
This is the file system \fIformat\fR utility. All volumes have to be formatted prior to
use. As this utility overwrites the volume, use it with care. Double check to ensure
the volume is not in use on any node in the cluster.

As a precaution, the utility will abort if the volume is locally mounted. It also
detects use across the cluster if the volume is in use by OCFS2. But these checks are not comprehensive
and can be overridden. So use it with care.

While it is not always required, the cluster should be online.

.TP
\fBtunefs.ocfs2(8)\fR
This is the file system \fItune\fR utility. It allows users to change certain on-disk
parameters like label, uuid, number of node-slots, volume size and the size of the
journals. It also allows turning on and off the file system features as listed above.

This utility requires the cluster to be online.

.TP
\fBfsck.ocfs2(8)\fR
This is the file system \fIcheck\fR utility. It detects and fixes on-disk errors. All the
check codes and their fixes are listed in \fBfsck.ocfs2.checks(8)\fR.

This utility requires the cluster to be online to ensure the volume is not in use on
another node and to prevent the volume from being mounted for the duration of the check.

.TP
\fBmount.ocfs2(8)\fR
This is the file system \fImount\fR utility. It is invoked indirectly by the \fBmount(8)\fR
utility.

This utility detects the cluster status and aborts if the cluster is offline or does
not match the cluster stamped on disk.

.TP
\fBo2cluster(8)\fR
This is the file system \fIcluster stack update\fR utility. It allows users to update
the on-disk cluster stack to the one provided.

This utility only updates the disk if the utility is reasonably assured that the file system
is not in use on any node.

.TP
\fBo2info(1)\fR
This is the file system \fIinformation\fR utility. It provides information like the features
enabled on disk, block size, cluster size, free space fragmentation, etc.

It can be used by both privileged and non-privileged users. Users having read permission
on the device can provide the path to the device. Other users can provide the path to a
file on a mounted file system.

.TP
\fBdebugfs.ocfs2(8)\fR
This is the file system \fIdebug\fR utility. It allows users to examine all file system
structures including walking directory structures, displaying inodes, backing up files,
etc., without mounting the file system.

This utility requires the user to have read permission on the device.

.TP
\fBo2image(8)\fR
This is the file system \fIimage\fR utility. It allows users to copy the file system metadata
skeleton, including the inodes, directories, bitmaps, etc. As it excludes data, the
resulting image file is tremendously smaller than the file system itself.

The image file created can be used in debugging on-disk corruptions.

.TP
\fBmounted.ocfs2(8)\fR
This is the file system \fIdetect\fR utility. It detects all \fIOCFS2\fR volumes in the
system and lists their labels, uuids and cluster stacks.

.SH "O2CB CLUSTER STACK UTILITIES"
.PP

This section lists the utilities that are used to manage the \fIO2CB\fR cluster stack.
Each utility has a man page that lists its capabilities in detail.
.TP
\fBo2cb(8)\fR
This is the cluster \fIconfiguration\fR utility. It allows users to update the cluster
configuration by adding and removing nodes and heartbeat regions. This utility is used
by the \fIo2cb\fR init script to online and offline the cluster.

This is a \fBnew\fR utility and replaces \fBo2cb_ctl(8)\fR which has been deprecated.

.TP
\fBocfs2_hb_ctl(8)\fR
This is the cluster heartbeat utility. It allows users to start and stop \fBlocal\fR
heartbeat. This utility is invoked by \fBmount.ocfs2(8)\fR and should not be invoked
directly by the user.

.TP
\fBo2hbmonitor(8)\fR
This is the disk heartbeat monitor. It tracks the elapsed time since the last heartbeat
and logs warnings once that time exceeds the warn threshold.

.SH "FILE SYSTEM NOTES"
.PP

This section includes some useful notes that may prove helpful to the user.
.TP
\fBBALANCED CLUSTER\fR
A cluster is a computer. This is a fact and not a slogan. What this means is that an errant
node in the cluster can affect the behavior of other nodes. If one node is slow, the cluster
operations will slow down on all nodes. To prevent that, it is best to have a balanced
cluster. This is a cluster that has equally powered and loaded nodes.

The standard recommendation for such clusters is to have identical hardware and
software across all the nodes. However, that is not a hard and fast rule. After all,
we have taken the effort to ensure that OCFS2 works in a mixed architecture environment.

If one uses OCFS2 in a mixed architecture environment, try to ensure that the nodes are
equally powered and loaded. The use of a load balancer can assist with the latter. Power
refers to the number of processors, speed, amount of memory, I/O throughput, network
bandwidth, etc. In reality, having equally powered heterogeneous nodes is not always
practical. In that case, make the lower node numbers more powerful than the higher
node numbers. The O2CB cluster stack favors lower node numbers in all of its tiebreaking logic.

This is not to suggest you should add a single core node in a cluster of quad cores. No
amount of node number juggling will help you there.

.TP
\fBFILE DELETION\fR
In Linux, rm(1) removes the directory entry. It does not necessarily delete the corresponding
inode. But by removing the directory entry, it gives the illusion that the inode has been deleted.
This puzzles users when they do not see a corresponding up-tick in the reported free space.
The reason is that inode deletion has a few more hurdles to cross.

First is the hard link count, that indicates the number of directory entries pointing to that
inode. As long as an inode has one or more directory entries pointing to it, it cannot be deleted.
The file system has to wait for the removal of all those directory entries. In other words, wait
for that count to drop to zero.

The second hurdle is the POSIX semantics allowing files to be unlinked even while they are
in-use. In OCFS2, that translates to in-use across the cluster. The file system has to wait
for all processes across the cluster to stop using the inode.

Once these conditions are met, the inode is deleted and the freed space is visible after the
next sync.

Now the amount of space freed depends on the allocation. Only space that is actually allocated
to that inode is freed. The example below shows a sparsely allocated file of size 51TB of which
only 2.4GB is actually allocated.

.nf
.ft 6
$ ls -lsh largefile 
2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile
.ft
.fi

Furthermore, for reflinked files, only private extents are freed. Shared extents are freed
when the last inode accessing them is deleted. The example below shows a 4GB file that shares
3GB with other reflinked files. Deleting it will increase the free space by 1GB. However, if
it is the only remaining file accessing the shared extents, the full 4GB will be freed.
(More information on the shared-du(1) utility is provided below.)

.nf
.ft 6
$ shared-du -m -c --shared-size reflinkedfile
4000    (3000)  reflinkedfile
.ft
.fi

The deletion itself is a multi-step process. Once the hard link count falls to zero, the
inode is moved to the orphan_dir system directory where it remains until the last process,
across the cluster, stops using the inode. Then the file system frees the extents and adds
the freed space count to the truncate_log system file where it remains until the next sync.
The freed space is made visible to the user only after that sync.

.TP
\fBDIRECTORY LISTING\fR
ls(1) may be a simple command, but it is not cheap. What is expensive is not the part
where it reads the directory listing, but the second part where it reads all the inodes, also
referred to as an inode stat(2). If the inodes are not in cache, this can entail disk I/O.
Now, while a cold cache inode stat(2) is expensive in all file systems, it is especially so in
a clustered file system as it needs to take a cluster lock on each inode.

A hot cache stat(2), on the other hand, has been shown to perform on OCFS2 like it does on
EXT3.

In other words, the second ls(1) will be quicker than the first. However, it is not
guaranteed. Say you have a million files in a file system and not enough kernel memory
to cache all the inodes. In that case, each ls(1) will involve some cold cache stat(2)s.

.TP
\fBALLOCATION RESERVATION\fR
Allocation reservation allows multiple concurrently extending files to grow as contiguously
as possible. One way to demonstrate its functioning is to run a script that extends
multiple files in a circular order. The script below does that by writing one hundred
4KB chunks to four files, one after another.

.nf
.ft 6
$ for i in $(seq 0 99);
> do
>   for j in $(seq 4);
>   do
>     dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
>   done;
> done;
.ft
.fi

When run on a system running Linux kernel 2.6.34 or earlier, we end up with files with
100 extents each. That is full fragmentation. As the files are being extended one after
another, the on-disk allocations are fully interleaved.

.nf
.ft 6
$ filefrag file1 file2 file3 file4
file1: 100 extents found
file2: 100 extents found
file3: 100 extents found
file4: 100 extents found
.ft
.fi

When run on a system running Linux kernel 2.6.35 or later, we see files with 7 extents
each. That is a lot fewer than before. Fewer extents mean more on-disk contiguity and
that always leads to better overall performance.

.nf
.ft 6
$ filefrag file1 file2 file3 file4
file1: 7 extents found
file2: 7 extents found
file3: 7 extents found
file4: 7 extents found
.ft
.fi

.TP
\fBREFLINK OPERATION\fR
This feature allows a user to create a writeable snapshot of a regular file. In this
operation, the file system creates a new inode with the same extent pointers as the
original inode. Multiple inodes are thus able to share data extents. This adds a twist
in file system administration because none of the existing file system utilities in
Linux expect this behavior. du(1), a utility used to compute file space usage,
simply adds the blocks allocated to each inode. As it does not know about shared
extents, it overestimates the space used. Say, we have a 5GB file in a volume having
42GB free.

.nf
.ft 6
$ ls -l
total 5120000
-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile

$ du -m myfile*
5000    myfile

$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdd1             50G   8.2G   42G  17% /ocfs2
.ft
.fi

If we were to reflink it 4 times, we would expect the directory listing to report five 5GB
files, but df(1) to report no loss of available space. du(1), on the other hand, would
report the disk usage to climb to 25GB.

.nf
.ft 6
$ reflink myfile myfile-ref1
$ reflink myfile myfile-ref2
$ reflink myfile myfile-ref3
$ reflink myfile myfile-ref4

$ ls -l
total 25600000
-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile
-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref1
-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref2
-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref3
-rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref4

$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdd1             50G   8.2G   42G  17% /ocfs2

$ du -m myfile*
5000    myfile
5000    myfile-ref1
5000    myfile-ref2
5000    myfile-ref3
5000    myfile-ref4
25000 total
.ft
.fi

Enter \fBshared-du(1)\fR, a shared extent-aware du. This utility reports the shared
extents per file in parentheses and the overall footprint. As expected, it lists the
overall footprint at 5GB. One can view the details of the extents using \fBshared-filefrag(1)\fR.
Both these utilities are available at http://oss.oracle.com/~smushran/reflink-tools/. 
We are currently in the process of pushing the changes to the upstream maintainers of
these utilities.

.nf
.ft 6
$ shared-du -m -c --shared-size myfile*
5000    (5000)  myfile
5000    (5000)  myfile-ref1
5000    (5000)  myfile-ref2
5000    (5000)  myfile-ref3
5000    (5000)  myfile-ref4
25000 total
5000 footprint

# shared-filefrag -v myfile
Filesystem type is: 7461636f
File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
ext logical physical expected length flags
0         0  2247937            8448
1      8448  2257921  2256384  30720
2     39168  2290177  2288640  30720
3     69888  2322433  2320896  30720
4    100608  2354689  2353152  30720
7    192768  2451457  2449920  30720
 . . .
37  1073408  2032129  2030592  30720 shared
38  1104128  2064385  2062848  30720 shared
39  1134848  2096641  2095104  30720 shared
40  1165568  2128897  2127360  30720 shared
41  1196288  2161153  2159616  30720 shared
42  1227008  2193409  2191872  30720 shared
43  1257728  2225665  2224128  22272 shared,eof
myfile: 44 extents found
.ft
.fi

.TP
\fBDATA COHERENCY\fR
One of the challenges in a shared file system is data coherency when multiple nodes are
writing to the same set of files. NFS, for example, provides close-to-open data coherency
that results in the data being flushed to the server when the file is closed on the client.
This leaves open a wide window for stale data being read on another node.

A simple test to check the data coherency of a shared file system involves concurrently
appending the same file. Like running "uname -a >>/dir/file" using a parallel distributed
shell like dsh or pconsole. If coherent, the file will contain the results from all nodes.

.nf
.ft 6
.ps 8
# dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
# cat /ocfs2/test
Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
.ps
.ft
.fi

OCFS2 is a \fBfully cache coherent\fR cluster file system.

.TP
\fBDISCONTIGUOUS BLOCK GROUP\fR
Most file systems pre-allocate space for inodes during format. OCFS2 dynamically
allocates this space when required.

However, this dynamic allocation has been problematic when the free space is very
fragmented, because the file system required the inode and extent allocators to
grow in contiguous fixed-size chunks.

The discontiguous block group feature takes care of this problem by allowing the
allocators to grow in smaller, variable-sized chunks.

This feature was added in Linux kernel \fB2.6.35\fR and requires enabling on-disk
feature \fBdiscontig-bg\fR.

.TP
\fBBACKUP SUPER BLOCKS\fR
A file system super block stores critical information that is hard to recreate.
In OCFS2, it stores the block size, cluster size, and the locations of the root and
system directories, among other things. As this block is close to the start of the
disk, it is very susceptible to being overwritten by an errant write.
Say, dd if=file of=/dev/sda1.

Backup super blocks are copies of the super block. These blocks are dispersed in the
volume to minimize the chances of being overwritten. On the small chance that the
original gets corrupted, the backups are available to scan and fix the corruption.

\fBmkfs.ocfs2(8)\fR enables this feature by default. Users can disable this by
specifying \fB--fs-features=nobackup-super\fR during format.

\fBo2info(1)\fR can be used to view whether the feature has been enabled on a device.

.nf
.ps 8
.ft 6
# o2info --fs-features /dev/sdb1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr
indexed-dirs refcount discontig-bg clusterinfo unwritten
.ft
.ps
.fi

In OCFS2, the super block is on the third block. The backups are located at the \fB1G,
4G, 16G, 64G, 256G and 1T\fR byte offsets. The actual number of backup blocks depends
on the size of the device. The super block is not backed up on devices smaller than 1GB.

\fBfsck.ocfs2(8)\fR refers to these six offsets by numbers, 1 to 6. Users can specify
any backup with the -r option to recover the volume. The example below uses the second
backup. If successful, \fBfsck.ocfs2(8)\fR overwrites the corrupted super block with
the backup.

.nf
.ps 8
.ft 6
# fsck.ocfs2 -f -r 2 /dev/sdb1
fsck.ocfs2 1.8.0
[RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
Checking OCFS2 filesystem in /dev/sdb1:
  Label:              webhome
  UUID:               B3E021A2A12B4D0EB08E9E986CDC7947
  Number of blocks:   13107196
  Block size:         4096
  Number of clusters: 13107196
  Cluster size:       4096
  Number of slots:    8

/dev/sdb1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.
.ft
.ps
.fi

.TP
\fBSYNTHETIC FILE SYSTEMS\fR
The OCFS2 development effort included two synthetic file systems, configfs and dlmfs. It
also makes use of a third, debugfs.

.RS
.TP
\fBconfigfs\fR
configfs has since been accepted as a generic kernel component and is also used by
netconsole and fs/dlm. OCFS2 tools use it to communicate the list of nodes in the
cluster, details of the heartbeat device, cluster timeouts, and so on to the in-kernel
node manager. The o2cb init script mounts this file system at /sys/kernel/config.

.TP
\fBdlmfs\fR
dlmfs exposes the in-kernel o2dlm to user space. While it was developed
primarily for OCFS2 tools, it has seen usage by others looking to add a cluster
locking dimension in their applications. Users interested in doing the same should
look at the libo2dlm library provided by ocfs2-tools. The o2cb init script mounts this
file system at /dlm.

.TP
\fBdebugfs\fR
OCFS2 uses debugfs to expose its in-kernel information to user space. For example,
listing the file system cluster locks, dlm locks, dlm state, o2net state, etc. Users can
access the information by mounting the file system at /sys/kernel/debug. To automount,
add the following to /etc/fstab:

.nf
.ft 6
debugfs /sys/kernel/debug debugfs defaults 0 0
.ft
.fi
.RE

.TP
\fBDISTRIBUTED LOCK MANAGER\fR
One of the key technologies in a cluster is the lock manager, which maintains the locking
state of all resources across the cluster. An easy implementation of a lock manager
involves designating one node to handle everything. In this model, if a node wanted to
acquire a lock, it would send the request to the lock manager. However, this model has a
weakness: the lock manager’s death causes the cluster to seize up.

A better model is one where all nodes manage a subset of the lock resources. Each node
maintains enough information for all the lock resources it is interested in. In the event
of a node death, the remaining nodes pool the information to reconstruct the lock
state maintained by the dead node. In this scheme, the locking overhead is distributed
amongst all the nodes. Hence, the term distributed lock manager.

O2DLM is a distributed lock manager. It is based on the specification titled "Programming
Locking Applications" written by Kristin Thomas, which is available at the following link:
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf

.TP
\fBDLM DEBUGGING\fR
O2DLM has a rich debugging infrastructure that allows it to show the state of the lock
manager, all the lock resources, among other things.
The figure below shows the dlm state of a nine-node cluster that has just lost three
nodes: 12, 32, and 35. It can be ascertained that node 7, the recovery master, is
currently recovering node 12 and has received the lock states of the dead node from all
other live nodes.

.nf
.ps 9
.ft 6
# cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001  Key: 0x10748e61
Thread Pid: 24542  Node: 7  State: JOINED
Number of Joins: 1  Joining Node: 255
Domain Map: 7 31 33 34 40 50
Live Map: 7 31 33 34 40 50
Lock Resources: 48850 (439879)
MLEs: 0 (1428625)
  Blocking: 0 (1066000)
  Mastery: 0 (362625)
  Migration: 0 (0)
Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
Purge Count: 0  Refs: 1
Dead Node: 12
Recovery Pid: 24543  Master: 7  State: ACTIVE
Recovery Map: 12 32 35
Recovery Node State:
        7 - DONE
        31 - DONE
        33 - DONE
        34 - DONE
        40 - DONE
        50 - DONE
.ft
.ps
.fi

The figure below shows the state of a dlm lock resource that is mastered (owned) by
node 25, with 6 locks in the granted queue and node 26 holding the EX (writelock) lock
on that resource.

.nf
.ps 9
.ft 6
# debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
Lockres: M000000000000000022d63c00000000   Owner: 25    State: 0x0
Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
Refs: 8    Locks: 6    On Lists: None
Reference Map: 26 27 28 94 95
 Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
 Granted     94    NL     -1    94:3169409       2     No   No    None
 Granted     28    NL     -1    28:3213591       2     No   No    None
 Granted     27    NL     -1    27:3216832       2     No   No    None
 Granted     95    NL     -1    95:3178429       2     No   No    None
 Granted     25    NL     -1    25:3513994       2     No   No    None
 Granted     26    EX     -1    26:3512906       2     No   No    None
.ft
.ps
.fi

The figure below shows a lock from the file system perspective. Specifically, it shows a
lock that is in the process of being upconverted from NL to EX. Locks in this state
are referred to in the file system as busy locks and can be listed using the debugfs.ocfs2
command, "fs_locks -B".

.nf
.ps 9
.ft 6
# debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
Lockres: M000000000000000000000b9aba12ec  Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0  EX Holders: 0
Pending Action: Convert  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: No Lock
PR > Gets: 0  Fails: 0    Waits Total: 0us  Max: 0us  Avg: 0ns
EX > Gets: 1  Fails: 0    Waits Total: 544us  Max: 544us  Avg: 544185ns
Disk Refreshes: 1
.ft
.ps
.fi

With this debugging infrastructure in place, users can debug hang issues as follows
(a combined command sketch follows the list):

.in +4n
* Dump the busy fs locks for all the OCFS2 volumes on the node with hanging
processes. If no locks are found, then the problem is not related to O2DLM.

* Dump the corresponding dlm lock for all the busy fs locks. Note down the
owner (master) of all the locks.

* Dump the dlm locks on the master node for each lock.
.in
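
The sketch below strings these steps together; /dev/sda1 and the lockres name are
placeholders to be replaced with the values observed on the system.

.nf
.ps 9
.ft 6
# debugfs.ocfs2 -R "fs_locks -B" /dev/sda1            (busy locks on the hanging node)
# debugfs.ocfs2 -R "dlm_locks <lockres>" /dev/sda1    (note the Owner of each busy lock)
# debugfs.ocfs2 -R "dlm_locks <lockres>" /dev/sda1    (repeat on each Owner node)
.ft
.ps
.fi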

At this stage, one should note that the hanging node is waiting to get an AST from the
master. The master, on the other hand, cannot send the AST until the current holder has
down-converted that lock, which it will do upon receiving a Blocking AST. However, a
node can only down-convert if all the lock holders have stopped using that lock.
After dumping the dlm lock on the master node, identify the current lock holder and
dump both the dlm and fs locks on that node.

The trick here is to see whether the Blocking AST message has been relayed to the file
system; in the fs lock output, a relayed message shows up as a Blocking Mode other than
No Lock. If it has not been relayed, the problem is in the dlm layer. If it has, then the
most common reason is a lingering lock holder, whose counts (RO Holders and EX Holders)
are maintained in the fs lock.

At this stage, printing the list of processes helps.

.nf
.ft 6
$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
.ft
.fi

Make a note of all processes in the D (uninterruptible sleep) state. At least one of them
is responsible for the hang on the first node.
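
To filter just the D state processes from that listing, one can, for example, match on
the second column:

.nf
.ft 6
$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | awk '$2 ~ /^D/'
.ft
.fi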

The challenge then is to figure out why those processes are hanging. Failing that, at
least get enough information (like alt-sysrq t output) for the kernel developers to review.
What to do next depends on where the process is hanging. If it is waiting for I/O to
complete, the problem could be anywhere in the I/O subsystem, from the block device
layer through the drivers to the disk array. If the hang concerns a user lock (flock(2)),
the problem could be in the user’s application. A possible solution could be to kill the
holder. If the hang is due to tight or fragmented memory, free up some memory by
killing non-essential processes.
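
The alt-sysrq t output mentioned above can also be triggered without console access,
assuming the sysrq facility is enabled on the node; the task list is written to the
kernel log.

.nf
.ft 6
# echo t > /proc/sysrq-trigger
# dmesg
.ft
.fi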

The thing to note is that the symptom for the problem was on one node but the cause is
on another. The issue can only be resolved on the node holding the lock. Sometimes, the
best solution will be to reset that node. Once the node is dead, the O2DLM recovery
process will clear all the locks it owned and let the cluster continue to operate. As harsh
as that sounds, at times it is the only solution. The good news is that, by following the
trail, you now have enough information to file a bug and get the real issue resolved.

.TP
\fBNFS EXPORTING\fR
OCFS2 volumes can be exported as NFS volumes. This support is limited to NFS version
3, which translates to Linux kernel version 2.4 or later.

If the version of the Linux kernel on the system exporting the volume is older than
\fB2.6.30\fR, then the NFS clients must mount the volumes using the \fInordirplus\fR
mount option. This disables the READDIRPLUS RPC call to work around a bug in NFSD,
detailed in the following link:

.nf
.ps 9
.ft 6
http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
.ft
.ps
.fi

Users running NFS version 2 can export the volume after disabling subtree checking
(export option no_subtree_check). Be warned that disabling the check has security implications
(documented in the exports(5) man page) that users must evaluate on their own.
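
The two workarounds above are sketched below; the export path, client host and mount
point are illustrative placeholders.

.nf
.ps 9
.ft 6
# /etc/exports entry with subtree checking disabled (NFS version 2)
/ocfs2vol    *(rw,no_subtree_check)

# NFS version 3 client mount for a server kernel older than 2.6.30
mount -t nfs -o vers=3,nordirplus server:/ocfs2vol /mnt
.ft
.ps
.fi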

.TP
\fBFILE SYSTEM LIMITS\fR
OCFS2 has no intrinsic limit on the total number of files and directories in the file
system. In general, it is only limited by the size of the device. But there is one limit
imposed by the current file system: it can address at most 2^32 (approximately four
billion) clusters. A file system with a 1MB cluster size can thus grow to 4PB, while a
file system with a 4KB cluster size can address up to 16TB.
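
Both limits follow directly from the 32-bit cluster count:

.nf
.ft 6
2^32 clusters x 1MB/cluster (2^20 bytes) = 2^52 bytes = 4PB
2^32 clusters x 4KB/cluster (2^12 bytes) = 2^44 bytes = 16TB
.ft
.fi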

.TP
\fBSYSTEM OBJECTS\fR
The OCFS2 file system stores its internal meta-data, including bitmaps, journals, etc., as
system files. These are grouped in a system directory. These files and directories are not
accessible via the file system interface but can be viewed using the \fBdebugfs.ocfs2(8)\fR
tool.

To list the system directory (referred to as double-slash), do:

.nf
.ps 8
.ft 6
# debugfs.ocfs2 -R "ls -l //" /dev/sde1
        66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 .
        66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 ..
        67     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 bad_blocks
        68     -rw-r--r--   1  0  0      1179648 19-Jul-2011 13:36 global_inode_alloc
        69     -rw-r--r--   1  0  0         4096 19-Jul-2011 14:35 slot_map
        70     -rw-r--r--   1  0  0      1048576 19-Jul-2011 13:36 heartbeat
        71     -rw-r--r--   1  0  0  53686960128 19-Jul-2011 13:36 global_bitmap
        72     drwxr-xr-x   2  0  0         3896 25-Jul-2011 15:05 orphan_dir:0000
        73     drwxr-xr-x   2  0  0         3896 19-Jul-2011 13:36 orphan_dir:0001
        74     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0000
        75     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0001
        76     -rw-r--r--   1  0  0    121634816 19-Jul-2011 13:36 inode_alloc:0000
        77     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 inode_alloc:0001
        78     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:36 journal:0000
        79     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:37 journal:0001
        80     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0000
        81     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0001
        82     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0000
        83     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0001
.ft 1
.ps
.fi

The file names that end with numbers are slot-specific and are referred to as node-local
system files. The set of node-local files used by a node can be determined from the slot
map. To list the slot map, do:

.nf
.ft 6
# debugfs.ocfs2 -R "slotmap" /dev/sde1
    Slot#    Node#
        0       32
        1       35
        2       40
        3       31
        4       34
        5       33
.ft 1
.fi
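
In the slot map above, for example, node 32 occupies slot 0 and therefore uses the
node-local system files suffixed 0000 (journal:0000, inode_alloc:0000, etc.).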

For more information, refer to the OCFS2 support guides available in the Documentation
section at http://oss.oracle.com/projects/ocfs2.

.TP
\fBHEARTBEAT, QUORUM, AND FENCING\fR
Heartbeat is an essential component in any cluster. It is charged with accurately
designating nodes as dead or alive. A mistake here could lead to a cluster hang or to
corruption.

\fIo2hb\fR is the disk heartbeat component of \fBo2cb\fR. It periodically updates a
timestamp on disk, indicating to others that this node is alive. It also reads all the
timestamps to identify other live nodes. Other cluster components, like \fIo2dlm\fR
and \fIo2net\fR, use the \fIo2hb\fR service to get node up and down events.
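
The raw heartbeat timestamps can be inspected with the \fIhb\fR command of
\fBdebugfs.ocfs2(8)\fR. This is a sketch; /dev/sde1 stands in for the heartbeat device.

.nf
.ft 6
# debugfs.ocfs2 -R "hb" /dev/sde1
.ft
.fi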

The quorum is the group of nodes in a cluster that is allowed to operate on the shared
storage. When there is a failure in the cluster, nodes may be split into groups that can
communicate within their group and with the shared storage, but not with other groups.
\fIo2quo\fR determines which group is allowed to continue and initiates fencing of
the other group(s).

Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2
mounted will fence itself when it realizes that it does not have quorum in a degraded
cluster. It does this so that other nodes won’t be stuck trying to access its resources.

\fBo2cb\fR uses a machine reset to fence. This is the quickest route for the node to
rejoin the cluster.

.TP
\fBPROCESSES\fR

.RS
.TP
\fB[o2net]\fR
One per node. It is a work-queue thread started when the cluster is brought on-line
and stopped when it is taken off-line. It handles network communication for all mounts.
It gets the list of active nodes from O2HB and sets up a TCP/IP communication
channel with each live node. It sends regular keep-alive packets to detect any
interruption on the channels.

.TP
\fB[user_dlm]\fR
One per node. It is a work-queue thread started when dlmfs is loaded and stopped
when it is unloaded (dlmfs is a synthetic file system that allows user space
processes to access the in-kernel dlm).

.TP
\fB[ocfs2_wq]\fR
One per node. It is a work-queue thread started when the OCFS2 module is loaded
and stopped when it is unloaded. It is assigned background file system tasks that
may take cluster locks like flushing the truncate log, orphan directory recovery and
local alloc recovery. For example, orphan directory recovery runs in the background
so that it does not affect recovery time.

.TP
\fB[o2hb-14C29A7392]\fR
One per heartbeat device. It is a kernel thread started when the heartbeat region is
populated in configfs and stopped when it is removed. It writes every two seconds
to a block in the heartbeat region, indicating that this node is alive. It also reads the
region to maintain a map of live nodes. It notifies subscribers like o2net and o2dlm
of any changes in the live node map.

.TP
\fB[ocfs2dc]\fR
One per mount. It is a kernel thread started when a volume is mounted and stopped
when it is unmounted. It downgrades locks in response to blocking ASTs (BASTs)
requested by other nodes.

.TP
\fB[jbd2/sdf1-97]\fR
One per mount. It is part of JBD2, which OCFS2 uses for journaling.

.TP
\fB[ocfs2cmt]\fR
One per mount. It is a kernel thread started when a volume is mounted and stopped
when it is unmounted. It works with kjournald2.

.TP
\fB[ocfs2rec]\fR
It is started whenever a node has to be recovered. This thread performs file system
recovery by replaying the journal of the dead node. It is scheduled to run after dlm
recovery has completed.

.TP
\fB[dlm_thread]\fR
One per dlm domain. It is a kernel thread started when a dlm domain is created and
stopped when it is destroyed. This thread sends ASTs and blocking ASTs in response
to lock level convert requests. It also frees unused lock resources.

.TP
\fB[dlm_reco_thread]\fR
One per dlm domain. It is a kernel thread that handles dlm recovery when another
node dies. If this node is the dlm recovery master, it re-masters every lock resource
owned by the dead node.

.TP
\fB[dlm_wq]\fR
One per dlm domain. It is a work-queue thread that o2dlm uses to queue blocking
tasks.
.RE
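
To see which of these threads are currently active on a node, a plain process listing
suffices. The suffixes in names like o2hb-14C29A7392 and jbd2/sdf1-97 vary per system.

.nf
.ft 6
$ ps -ef | grep -E 'o2net|o2hb|ocfs2|dlm|jbd2'
.ft
.fi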

.TP
\fBFUTURE WORK\fR
File system development is a never-ending cycle. Faster and larger disks, faster and
more numerous processors, larger caches, etc. keep changing the sweet spot for
performance, forcing developers to rethink long-held beliefs. Add to that new use cases,
which force developers to be innovative in providing solutions that meld seamlessly
with existing semantics.

We are currently looking to add features like transparent compression, transparent
encryption, delayed allocation, and multi-device support, as well as to improve
performance on newer generations of machines.

If you are interested in contributing, email the development team at ocfs2-devel@oss.oracle.com.

.SH "ACKNOWLEDGEMENTS"
.PP

The principal developers of the OCFS2 file system, its tools and the O2CB cluster stack,
are \fIJoel Becker\fR, \fIZach Brown\fR, \fIMark Fasheh\fR, \fIJan Kara\fR, \fIKurt Hackel\fR,
\fITao Ma\fR, \fISunil Mushran\fR, \fITiger Yang\fR and \fITristan Ye\fR.

Other developers who have contributed to the file system via bug fixes, testing, etc.
are \fIWim Coekaerts\fR, \fISrinivas Eeda\fR, \fIColy Li\fR, \fIJeff Mahoney\fR,
\fIMarcos Matsunaga\fR, \fIGoldwyn Rodrigues\fR, \fIManish Singh\fR and \fIWengang Wang\fR.

The members of the Linux Cluster community including \fIAndrew Beekhof\fR,
\fILars Marowsky-Bree\fR, \fIFabio Massimo Di Nitto\fR and \fIDavid Teigland\fR.

The members of the Linux File system community including \fIChristoph Hellwig\fR and
\fIChris Mason\fR.

The corporations that have contributed resources for this project including \fIOracle\fR,
\fISUSE Labs\fR, \fIEMC\fR, \fIEmulex\fR, \fIHP\fR, \fIIBM\fR, \fIIntel\fR and
\fINetwork Appliance\fR.

.SH "SEE ALSO"
.BR debugfs.ocfs2(8)
.BR fsck.ocfs2(8)
.BR fsck.ocfs2.checks(8)
.BR mkfs.ocfs2(8)
.BR mount.ocfs2(8)
.BR mounted.ocfs2(8)
.BR o2cluster(8)
.BR o2image(8)
.BR o2info(1)
.BR o2cb(7)
.BR o2cb(8)
.BR o2cb.sysconfig(5)
.BR o2hbmonitor(8)
.BR ocfs2.cluster.conf(5)
.BR tunefs.ocfs2(8)

.SH "AUTHOR"
Oracle Corporation

.SH "COPYRIGHT"
Copyright \(co 2004, 2012 Oracle. All rights reserved.