BTRFS RAID Stripe Tree Design
=============================


Problem Statement
-----------------


RAID on zoned devices
---------------------
The current implementation of RAID profiles in BTRFS is based on the implicit
assumption that data placement within the device chunks used to map block
groups is deterministic.
With deterministic data placement, all physical on-disk extents of one logical
file extent are positioned at the same offset relative to the starting LBA of
a device chunk. This removes the need to read any meta-data to access an
on-disk file extent. Figure 1 shows an example of this layout.


	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 1     |      |    D 1     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 2     |      |    D 2     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	Figure 1: Deterministic data placement
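
As an illustration, the sketch below (in C, with hypothetical types and names,
not the actual BTRFS chunk mapping code) shows how the deterministic case can
be resolved purely by arithmetic: the physical address on every device follows
directly from the logical address, so no extra meta-data has to be read.

#include <stdint.h>

/* Hypothetical, simplified description of one device chunk of a block group. */
struct device_chunk {
	uint64_t devid;
	uint64_t physical_start;	/* start LBA of the chunk on this device */
};

struct chunk_map {
	uint64_t logical_start;		/* logical start of the block group */
	int num_devices;
	struct device_chunk devices[2];
};

/* Physical location of @logical on device @i of a mirrored chunk. */
static uint64_t map_deterministic(const struct chunk_map *map, int i,
				  uint64_t logical)
{
	return map->devices[i].physical_start + (logical - map->logical_start);
}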



With non-deterministic data placement, the on-disk extents of a logical file
extent can be scattered anywhere inside the chunk. To read the data back with
non-deterministic data placement, additional meta-data describing the position
of the extents inside a chunk is needed. Figure 2 shows an example of this
style of data placement.


	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 1     |      |    D 2     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 2     |      |    D 1     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	Figure 2: Non-deterministic data placement

As BTRFS support for zoned block devices uses the Zone Append operation for
writing file data extents, there is no guarantee that the written extents end
up at the same offset within different device chunks. This implies that, to
use RAID with zoned devices, non-deterministic data placement must be
supported and additional meta-data describing the location of file extents
within device chunks is needed.
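
The toy simulation below (plain C, not a real Zone Append driver) illustrates
why the offsets diverge: with Zone Append the device, not the host, picks the
write location by advancing its zone write pointer and only reports the used
LBA on completion, so two mirrors whose zones are filled differently place the
same extent at different relative offsets.

#include <stdint.h>
#include <stdio.h>

/* Toy model of one zone; the device picks the write location. */
struct zone {
	uint64_t start_lba;
	uint64_t write_pointer;	/* relative to start_lba */
};

/* Simulated Zone Append: returns the LBA the data was actually written to. */
static uint64_t zone_append(struct zone *z, uint64_t len)
{
	uint64_t lba = z->start_lba + z->write_pointer;

	z->write_pointer += len;
	return lba;
}

int main(void)
{
	/* Two mirrored devices whose zones already hold different amounts of data. */
	struct zone dev1 = { .start_lba = 0x10000, .write_pointer = 0x1000 };
	struct zone dev2 = { .start_lba = 0x10000, .write_pointer = 0x3000 };

	/* Writing the same 64k extent to both mirrors ... */
	uint64_t lba1 = zone_append(&dev1, 0x10000);
	uint64_t lba2 = zone_append(&dev2, 0x10000);

	/* ... yields different offsets within the two device chunks. */
	printf("dev1 LBA 0x%llx, dev2 LBA 0x%llx\n",
	       (unsigned long long)lba1, (unsigned long long)lba2);
	return 0;
}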


Lessons learned from RAID 5
---------------------------
The BTRFS implementation of RAID levels 5 and 6 suffers from the well-known
RAID write hole problem. The problem exists because sub-stripe write
operations are not done using copy-on-write (COW) but using read-modify-write
(RMW). With out-of-place writing like COW, no blocks are overwritten, so there
is no risk of exposing stale data or corrupting a data stripe's parity if a
sudden power loss or another unexpected event prevents the write from
completing correctly.
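
To make the write hole concrete, the following small example (a sketch, not
the BTRFS RAID56 code) computes XOR parity for two data stripes and then
performs a sub-stripe RMW update that is interrupted between the data write
and the parity write; reconstructing the other data stripe from the stale
parity afterwards yields bad data. With COW the old data and parity are never
touched, so this window does not exist.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Two data stripes and their parity, P = D1 ^ D2. */
	uint8_t d1 = 0xaa, d2 = 0x55;
	uint8_t p  = d1 ^ d2;

	/*
	 * Sub-stripe RMW update of d1: read the old d1 and p, compute the
	 * new parity, then write the new d1 and the new parity in place.
	 */
	uint8_t new_d1 = 0x33;
	uint8_t new_p  = (uint8_t)(p ^ d1 ^ new_d1);

	d1 = new_d1;
	/* <-- power loss here: new_p never reaches the disk */
	(void)new_p;

	/* Rebuilding a failed d2 from the stale parity now gives bad data. */
	uint8_t rebuilt_d2 = d1 ^ p;
	printf("expected d2 0x%02x, rebuilt d2 0x%02x\n", d2, rebuilt_d2);
	return 0;
}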

RAID Stripe Tree Design overview
--------------------------------

To solve the problems stated above, additional meta-data is introduced: a RAID
Stripe Tree holds the logical to physical translation for the RAID stripes.
For each logical file extent (struct btrfs_file_extent_item), a stripe extent
(struct btrfs_stripe_extent) is created. Each btrfs_stripe_extent entry is a
container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
holds the device ID, the physical start location of the file extent's
sub-stripe on that device, and the stride's length.
Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item's
disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid of the
btrfs_key and disk_num_bytes as its offset. The key's type is
BTRFS_STRIPE_EXTENT_KEY.

On-disk format
--------------

struct btrfs_file_extent_item {
        /* […] */
        __le64 disk_bytenr; ------------------------------------+
        __le64 disk_num_bytes;  --------------------------------|----+
	/* […] */                                               |    |
};                                                              |    |
                                                                |    |
struct btrfs_key key = {   -------------------------------------|----|--+
	.objectid = btrfs_file_extent_item::disk_bytenr, <------+    |  |
	.type = BTRFS_STRIPE_EXTENT_KEY,                             |  |
	.offset = btrfs_file_extent_item::disk_num_bytes, <----------+  |
};                                                                      |
                                                                        |
struct btrfs_raid_stride { <------------------+                         |
       __le64 devid;                          |                         |
       __le64 physical;                       |                         |
       __le64 length;                         |                         |
};                                            |                         |
                                              |                         |
struct btrfs_stripe_extent {  <---------------|-------------------------+
	u8 encoding;                          |
	u8 reserved[7];                       |
       struct btrfs_raid_stride strides[]; ---+
};
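
Putting the pieces together, the sketch below (simplified in-memory structures,
a placeholder key-type constant and a hypothetical tree-search helper, not the
actual kernel code) shows how a read would use the tree: build the key from the
file extent item, look up the stripe extent and walk the strides to find the
device and physical address for a given offset. Treating the strides as laid
out back to back matches the RAID0 dumps below; other encodings interpret the
stride array differently.

#include <stdint.h>
#include <stddef.h>

/* Simplified in-memory copies of the on-disk structures shown above. */
struct btrfs_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct btrfs_raid_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

struct btrfs_stripe_extent {
	uint8_t encoding;
	uint8_t reserved[7];
	struct btrfs_raid_stride strides[];
};

/* Placeholder value; the real constant lives in the on-disk format headers. */
#define BTRFS_STRIPE_EXTENT_KEY 230

/* Hypothetical search helper standing in for the real tree lookup. */
const struct btrfs_stripe_extent *stripe_tree_search(const struct btrfs_key *key,
						     int *nr_strides);

/* Translate @offset inside one file extent to (devid, physical). */
static int map_logical(uint64_t disk_bytenr, uint64_t disk_num_bytes,
		       uint64_t offset, uint64_t *devid, uint64_t *physical)
{
	struct btrfs_key key = {
		.objectid = disk_bytenr,
		.type = BTRFS_STRIPE_EXTENT_KEY,
		.offset = disk_num_bytes,
	};
	int nr_strides;
	const struct btrfs_stripe_extent *se = stripe_tree_search(&key, &nr_strides);
	uint64_t covered = 0;

	if (!se)
		return -1;

	for (int i = 0; i < nr_strides; i++) {
		const struct btrfs_raid_stride *stride = &se->strides[i];

		if (offset < covered + stride->length) {
			*devid = stride->devid;
			*physical = stride->physical + (offset - covered);
			return 0;
		}
		covered += stride->length;
	}
	return -1;
}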


User-space support
------------------


mkfs
----
Supporting the RAID Stripe Tree in user-space requires three things from mkfs.
The first is creating the root of the RAID Stripe Tree itself. The second is
setting the incompat flag, so that a filesystem with a RAID Stripe Tree cannot
be mounted by a kernel version without the appropriate support. Lastly, mkfs
must allow RAID profiles on zoned devices once the tree is present.
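
Expressed as a mock sketch in C (every identifier below is hypothetical; the
real mkfs.btrfs code is organised differently), the three steps are simply
performed in order:

#include <stdbool.h>
#include <stdint.h>

/* Mock mkfs state; not the real mkfs.btrfs data structures. */
struct mkfs_context {
	uint64_t incompat_flags;
	bool raid_stripe_tree_created;
	bool allow_zoned_raid;
};

#define MOCK_INCOMPAT_RAID_STRIPE_TREE (1ULL << 0)	/* placeholder bit */

/* Stands in for writing an empty RAID Stripe Tree root to disk. */
static int create_raid_stripe_tree_root(struct mkfs_context *ctx)
{
	ctx->raid_stripe_tree_created = true;
	return 0;
}

static int setup_raid_stripe_tree(struct mkfs_context *ctx)
{
	int ret;

	/* 1. Create the root of the RAID Stripe Tree itself. */
	ret = create_raid_stripe_tree_root(ctx);
	if (ret)
		return ret;

	/* 2. Set the incompat flag so unsupported kernels refuse to mount. */
	ctx->incompat_flags |= MOCK_INCOMPAT_RAID_STRIPE_TREE;

	/* 3. With the tree present, allow RAID profiles on zoned devices. */
	ctx->allow_zoned_raid = true;
	return 0;
}

int main(void)
{
	struct mkfs_context ctx = { 0 };

	return setup_raid_stripe_tree(&ctx);
}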


Check
-----
The 'btrfs check' support for the RAID Stripe Tree is not implemented yet. Its
task is to read the struct btrfs_stripe_extent entries for each struct
btrfs_file_extent_item and verify that a correct mapping between the two
exists. If data checksum verification is requested as well, the tree must also
be read to perform the logical to physical translation; otherwise the data
cannot be read and the checksums cannot be verified.
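
The core of that verification could look like the sketch below (mock in-memory
records and hypothetical field names, since the check code does not exist yet):
every file extent must have a stripe extent keyed by its disk_bytenr and
disk_num_bytes, and the strides of that stripe extent must cover the whole
on-disk extent.

#include <stdbool.h>
#include <stdint.h>

/* Mock record for one struct btrfs_file_extent_item. */
struct file_extent {
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
};

/* Mock record for one struct btrfs_stripe_extent and its key. */
struct stripe_extent {
	uint64_t objectid;		/* key objectid == disk_bytenr */
	uint64_t offset;		/* key offset == disk_num_bytes */
	uint64_t stride_length_sum;	/* sum of all stride lengths */
};

/* Verify that @fe has a matching stripe extent that covers it completely. */
static bool verify_file_extent(const struct file_extent *fe,
			       const struct stripe_extent *ses, int nr_ses)
{
	for (int i = 0; i < nr_ses; i++) {
		const struct stripe_extent *se = &ses[i];

		if (se->objectid != fe->disk_bytenr ||
		    se->offset != fe->disk_num_bytes)
			continue;
		/* The strides together must cover the whole extent. */
		return se->stride_length_sum >= fe->disk_num_bytes;
	}
	return false;	/* no stripe extent found: broken mapping */
}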


Example tree dumps
------------------

Example 1: Write 128k to an empty FS

RAID0
        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
RAID1
        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 131072
                        stripe 1 devid 2 physical XXXXXXXXX length 131072
RAID10
        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
                        stripe 2 devid 3 physical XXXXXXXXX length 65536
                        stripe 3 devid 4 physical XXXXXXXXX length 65536

Example 2: Pre-fill one 64k stripe, write 4k to the second stripe, write 64k,
then write another 4k.

RAID0
        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 2 physical XXXXXXXXX length 4096
        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID0
                        stripe 0 devid 2 physical XXXXXXXXX length 61440
                        stripe 1 devid 1 physical XXXXXXXXX length 4096
        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
RAID1
        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
                        stripe 1 devid 2 physical XXXXXXXXX length 4096
        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
                        stripe 1 devid 2 physical XXXXXXXXX length 4096
RAID10
        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 3 physical XXXXXXXXX length 4096
                        stripe 1 devid 4 physical XXXXXXXXX length 4096
        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
                        encoding: RAID10
                        stripe 0 devid 3 physical XXXXXXXXX length 61440
                        stripe 1 devid 4 physical XXXXXXXXX length 61440
                        stripe 2 devid 1 physical XXXXXXXXX length 4096
                        stripe 3 devid 2 physical XXXXXXXXX length 4096
        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
                        stripe 1 devid 2 physical XXXXXXXXX length 4096

Example 3: Pre-fill the stripe with 32k of data, then write 128k of data and
then overwrite 8k in the middle.

RAID0
        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
                        stripe 2 devid 1 physical XXXXXXXXX length 32768
        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 8192
RAID1
        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 32768
        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 131072
                        stripe 1 devid 2 physical XXXXXXXXX length 131072
        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 8192
                        stripe 1 devid 2 physical XXXXXXXXX length 8192
RAID10
        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 32768
        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 32768
                        stripe 2 devid 3 physical XXXXXXXXX length 65536
                        stripe 3 devid 4 physical XXXXXXXXX length 65536
                        stripe 4 devid 1 physical XXXXXXXXX length 32768
                        stripe 5 devid 2 physical XXXXXXXXX length 32768
        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 8192
                        stripe 1 devid 2 physical XXXXXXXXX length 8192


Glossary
--------


RAID
	Redundant Array of Independent Disks. A storage mechanism where data
	is not stored on a single disk alone but is either mirrored (in the
	case of RAID 1) or split across multiple disks (RAID 0). Other RAID
	levels such as RAID 5 and RAID 6 stripe the data across multiple
	disks and add parity information to enable data recovery in case of
	a disk failure.


LBA
	Logical Block Address. LBAs describe the address space of a block
	device as a linearly increasing address map. LBAs are internally
	mapped to different physical locations by the device firmware.


Zoned Block Device
	A special kind of block device that partitions its LBA space into
	so-called zones. These zones can impose write constraints on the
	host, e.g., allowing only sequential writes aligned to the zone's
	write pointer.


Zone Append
	A write operation where the start LBA of a zone is specified instead
	of a destination LBA for the data to be written. On completion, the
	device reports back to the host the LBA at which the data was
	actually written.


Copy-on-Write
	A write technique where the data is not overwritten, but a new version
	of it is written out of place.


Read-Modify-Write
	A write technique where the data to be written is first read from
	the block device, modified in memory, and then written back in place
	on the block device.