File: disk-format.rst

package info (click to toggle)
raft 0.22.1-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 2,504 kB
  • sloc: ansic: 37,539; makefile: 264; sh: 77; python: 22
file content (65 lines) | stat: -rw-r--r-- 3,291 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
Disk format
===========

The implementation of metadata and log persistency is virtually the same as the
one found in  `LogCabin`_.

The disk files consist of metadata files, closed segments, and open
segments. Metadata files are used to track Raft metadata, such as the server's
current term, vote, and log's start index. Segments contain contiguous entries
that are part of the log. Closed segments are never written to again (but may be
renamed and truncated if a suffix of the log is truncated). Open segments are
where newly appended entries go. Once an open segment reaches the maximum
allowed size, it is closed and a new one is used. There are usually about 3 open
segments at any given time, the one with the lower index is the one actively
being written, and the other ones have been fallocate'd and are ready to be used
as soon as the active one gets closed.

Metadata files are named "metadata1" and "metadata2". The code alternates
between these so that there is always at least one readable metadata file.

On startup, the readable metadata file with the higher version number is used.

The format of a metadata file is:

* [8 bytes] Format (currently 1).
* [8 bytes] Incremental version number.
* [8 bytes] Current term.
* [8 bytes] ID of server we voted for.

All values are in little endian encoding.

Closed segments are named by the format string "%lu-%lu" with their start and
end indexes, both inclusive. Closed segments always contain at least one entry,
and the end index is always at least as large as the start index. Closed segment
files may occasionally include data past their filename's end index (these are
ignored but a warning is logged). This can happen if the suffix of the segment
is truncated and a crash occurs at an inopportune time (the segment file is
first renamed, then truncated, and a crash occurs in between).

Open segments are named by the format string "open-%lu" with a unique
number. These should not exist when the server shuts down cleanly, but they
exist while the server is running and may be left around during a crash.  Open
segments either contain entries which come after the last closed segment or are
full of zeros. When the server crashes while appending to an open segment, the
end of that file may be corrupt. We can't distinguish between a corrupt file and
a partially written entry. The code assumes it's a partially written entry, logs
a warning, and ignores it.

Truncating a suffix of the log will remove all entries that are no longer part
of the log. Truncating a prefix of the log will only remove complete segments
that are before the new log start index. For example, if a segment has entries
10 through 20 and the prefix of the log is truncated to start at entry 15, that
entire segment will be retained.

Each segment file starts with a segment header, which currently contains just an
8-byte version number for the format of that segment. The current format
(version 1) is just a concatenation of serialized entry batches.

Each batch has the following format:

* [4 bytes] CRC32 checksum of the batch header, little endian.
* [4 bytes] CRC32 checksum of the batch data, little endian.
* [  ...  ] Batch of one or more entries.

.. _LogCabin: https://github.com/logcabin/logcabin/blob/master/Storage/SegmentedLog.h