BTRFS RAID Stripe Tree Design
=============================
Problem Statement
-----------------
RAID on zoned devices
---------------------
The current implementation of RAID profiles in BTRFS is based on the implicit
assumption that data placement is deterministic in the device chunks used for
mapping block groups.
With deterministic data placement, all physical on-disk extents of one logical
file extent are positioned at the same offset relative to the starting LBA of
a device chunk. This removes the need to read any meta-data to find an
on-disk file extent. Figure 1 shows an example of this placement scheme.
+------------+ +------------+
| | | |
| | | |
| | | |
+------------+ +------------+
| | | |
| D 1 | | D 1 |
| | | |
+------------+ +------------+
| | | |
| D 2 | | D 2 |
| | | |
+------------+ +------------+
| | | |
| | | |
| | | |
| | | |
+------------+ +------------+
Figure 1: Deterministic data placement
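Because placement is deterministic, the translation in Figure 1 is pure
arithmetic. A minimal C sketch follows, using a hypothetical, simplified
chunk mapping for illustration only (the real mapping is stored in the chunk
tree):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical, simplified chunk mapping.  With deterministic placement
 * every copy of a logical extent sits at the same offset into its device
 * chunk, so no per-extent metadata is needed for the translation.
 */
struct chunk_map {
	uint64_t logical_start;  /* start of the block group's logical range */
	uint64_t physical_start; /* start LBA of the device chunk */
};

static uint64_t logical_to_physical(const struct chunk_map *map,
				    uint64_t logical)
{
	return map->physical_start + (logical - map->logical_start);
}
```

Every mirror stores the data at the same relative offset, so the same
arithmetic works on each device of the RAID set.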
With non-deterministic data placement, the on-disk extents of a logical file
extent can be scattered around inside the chunk. To read back the data with
non-deterministic data placement, additional meta-data describing the position
of the extents inside a chunk is needed. Figure 2 shows an example for this
style of data placement.
+------------+ +------------+
| | | |
| | | |
| | | |
+------------+ +------------+
| | | |
| D 1 | | D 2 |
| | | |
+------------+ +------------+
| | | |
| D 2 | | D 1 |
| | | |
+------------+ +------------+
| | | |
| | | |
| | | |
| | | |
+------------+ +------------+
Figure 2: Non-deterministic data placement
As BTRFS support for zoned block devices uses the Zone Append operation for
writing file data extents, there is no guarantee that the written extents have
the same offset within different device chunks. This implies that to be able
to use RAID with zoned devices, non-deterministic data placement must be
supported and additional meta-data describing the location of file extents
within device chunks is needed.
Lessons learned from RAID 5
---------------------------
The BTRFS implementation of RAID levels 5 and 6 suffers from the well-known
RAID write hole problem. The problem exists because sub-stripe write
operations are not performed as copy-on-write (COW) but as Read-Modify-Write
(RMW). With out-of-place writing as done by COW, no blocks are overwritten,
so there is no risk of exposing stale data or corrupting a data stripe's
parity when a sudden power loss or another unexpected event prevents the
write from completing correctly.
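The parity arithmetic behind a sub-stripe RMW can be sketched in C. The XOR
identity below is standard RAID 5 parity math; the helper name is
hypothetical. The key point is that the new data block and the new parity
block land on different devices, so the two writes cannot be atomic:

```c
#include <assert.h>
#include <stdint.h>

/*
 * RAID 5 parity is the XOR of all data blocks in a stripe.  A sub-stripe
 * update only needs the old data block and the old parity:
 *
 *   new_parity = old_parity ^ old_data ^ new_data
 *
 * The data block and the parity block live on different devices, so the
 * two resulting writes are not atomic.  A crash between them leaves the
 * stripe inconsistent: the write hole.
 */
static uint8_t rmw_parity(uint8_t old_parity, uint8_t old_data,
			  uint8_t new_data)
{
	return old_parity ^ old_data ^ new_data;
}
```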
RAID Stripe Tree Design overview
--------------------------------
To solve the problems stated above, additional meta-data is introduced: a RAID
Stripe Tree holds the logical to physical translation for the RAID stripes.
For each logical file extent (struct btrfs_file_extent_item) a stripe extent
is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
holds the device ID and the physical start location on that device of the
sub-stripe of a file extent, as well as the stride's length.
Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
btrfs_key and disk_num_bytes as the offset. The key’s type is
BTRFS_STRIPE_EXTENT_KEY.
On-disk format
--------------
struct btrfs_file_extent_item {
/* […] */
__le64 disk_bytenr; ------------------------------------+
__le64 disk_num_bytes; --------------------------------|----+
/* […] */ | |
}; | |
| |
struct btrfs_key key = { -------------------------------------|----|--+
.objectid = btrfs_file_extent_item::disk_bytenr, <------+ | |
.type = BTRFS_STRIPE_EXTENT_KEY, | |
.offset = btrfs_file_extent_item::disk_num_bytes, <----------+ |
}; |
|
struct btrfs_raid_stride { <------------------+ |
__le64 devid; | |
__le64 physical; | |
__le64 length; | |
}; | |
| |
struct btrfs_stripe_extent { <---------------|-------------------------+
u8 encoding; |
u8 reserved[7]; |
struct btrfs_raid_stride strides[]; ---+
};
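To illustrate how a read could be translated using these entries, here is a
hedged C sketch. The stride structure mirrors the on-disk layout above, in
host byte order for simplicity (on disk the fields are little-endian). The
lookup assumes a striped encoding such as RAID0, where the strides
concatenate to form the logical extent; map_logical_offset is a hypothetical
helper, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-memory mirror of the on-disk stride, host byte order for simplicity. */
struct btrfs_raid_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/*
 * Hypothetical lookup: walk the strides in order, assuming a striped
 * encoding (e.g. RAID0) where the strides concatenate to form the logical
 * extent.  'offset' is relative to the extent's disk_bytenr.  For mirrored
 * encodings like RAID1 each stride instead covers the whole extent on one
 * device.
 */
static int map_logical_offset(const struct btrfs_raid_stride *strides,
			      size_t nr_strides, uint64_t offset,
			      uint64_t *devid, uint64_t *physical)
{
	for (size_t i = 0; i < nr_strides; i++) {
		if (offset < strides[i].length) {
			*devid = strides[i].devid;
			*physical = strides[i].physical + offset;
			return 0;
		}
		offset -= strides[i].length;
	}
	return -1; /* offset past the end of the stripe extent */
}
```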
User-space support
------------------
mkfs
----
Supporting the RAID Stripe Tree in user-space requires three additions to
mkfs. The first is creating the root of the RAID Stripe Tree itself. Then
mkfs must set the incompat flag, so that a filesystem with a RAID Stripe
Tree cannot be mounted by a kernel version without the appropriate support.
Lastly, mkfs must allow RAID profiles on zoned devices once the tree is
present.
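The incompat-flag step can be sketched as follows. The superblock stub and
the flag's bit value are hypothetical simplifications for illustration; the
real flag definition and accessors live in the btrfs on-disk format headers:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit value only; the real definition is in the btrfs
 * on-disk format headers. */
#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 13)

/* Hypothetical, simplified superblock for illustration. */
struct super_block_stub {
	uint64_t incompat_flags;
};

/* A kernel refuses to mount when it sees unknown incompat bits, which is
 * exactly the gate mkfs needs here. */
static void set_raid_stripe_tree_incompat(struct super_block_stub *sb)
{
	sb->incompat_flags |= BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE;
}
```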
Check
-----
Support for the RAID Stripe Tree in ‘btrfs check’ is not implemented yet.
Its task will be to read the struct btrfs_stripe_extent entries for each
struct btrfs_file_extent_item and verify that a correct mapping between the
two exists. If data checksum verification is requested as well, the tree
must also be read to perform the logical to physical translation; otherwise
the data cannot be read and the checksums cannot be verified.
Example tree dumps
------------------
Example 1: Write 128k to an empty FS
RAID0
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
RAID1
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 131072
stripe 1 devid 2 physical XXXXXXXXX length 131072
RAID10
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
stripe 2 devid 3 physical XXXXXXXXX length 65536
stripe 3 devid 4 physical XXXXXXXXX length 65536
Example 2: Pre-fill one 64k stripe, write 4k to the 2nd stripe, write 64k,
then write 16k.
RAID0
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 2 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID0
stripe 0 devid 2 physical XXXXXXXXX length 61440
stripe 1 devid 1 physical XXXXXXXXX length 4096
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 4096
RAID1
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 4096
stripe 1 devid 2 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 4096
stripe 1 devid 2 physical XXXXXXXXX length 4096
RAID10
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 65536
stripe 1 devid 2 physical XXXXXXXXX length 65536
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 3 physical XXXXXXXXX length 4096
stripe 1 devid 4 physical XXXXXXXXX length 4096
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
encoding: RAID10
stripe 0 devid 3 physical XXXXXXXXX length 61440
stripe 1 devid 4 physical XXXXXXXXX length 61440
stripe 2 devid 1 physical XXXXXXXXX length 4096
stripe 3 devid 2 physical XXXXXXXXX length 4096
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 4096
stripe 1 devid 2 physical XXXXXXXXX length 4096
Example 3: Pre-fill a stripe with 32k of data, then write 128k of data and
then overwrite 8k in the middle.
RAID0
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 65536
stripe 2 devid 1 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 8192
RAID1
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 131072
stripe 1 devid 2 physical XXXXXXXXX length 131072
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
encoding: RAID1
stripe 0 devid 1 physical XXXXXXXXX length 8192
stripe 1 devid 2 physical XXXXXXXXX length 8192
RAID10
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 32768
stripe 2 devid 3 physical XXXXXXXXX length 65536
stripe 3 devid 4 physical XXXXXXXXX length 65536
stripe 4 devid 1 physical XXXXXXXXX length 32768
stripe 5 devid 2 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
encoding: RAID10
stripe 0 devid 1 physical XXXXXXXXX length 8192
stripe 1 devid 2 physical XXXXXXXXX length 8192
Glossary
--------
RAID
Redundant Array of Independent Disks. This is a storage mechanism
where data is not stored on a single disk alone but either mirrored
(in case of RAID 1) or split across multiple disks (RAID 0). Other
RAID levels like RAID5 and RAID6 stripe the data across multiple disks
and add parity information to enable data recovery in case of a disk
failure.
LBA
Logical Block Address. LBAs describe the address space of a block
device as a linearly increasing address map. LBAs are internally
mapped to different physical locations by the device firmware.
Zoned Block Device
Zoned Block Devices are a special kind of block device that partitions
its LBA space into so-called zones. These zones can impose write
constraints on the host, e.g., allowing only sequential writes aligned
to a zone's write pointer.
Zone Append
A write operation where the start LBA of a zone is specified instead of
a destination LBA for the data to be written. On completion, the device
reports back to the host the LBA at which the data was actually
written.
Copy-on-Write
A write technique where the data is not overwritten, but a new version
of it is written out of place.
Read-Modify-Write
A write technique where the data to be written is first read from the
block device, modified in memory and the modified data written back in
place on the block device.