moss design overview
--------------------
moss as an in-memory KV collection and write-back cache
=======================================================
moss originally began as a 100% Go, in-memory, ordered, key-value
collection library with optional write-back caching. Users can
optionally provide their own lower-level storage to which moss can
asynchronously persist batched-up mutations.
+---------------------+
| moss |
+---------------------+
| lower-level storage |
+---------------------+
Figure 1.
That is, users can use moss as a write-back cache in front of
forestdb, rocksdb, boltdb, etc.
Later on, the project added its own, optional, built-in lower-level
storage implementation, called "mossStore", which uses an append-only
design that is described in a later section.
To implement the cache, moss follows an LSM-array (Log Structured
Merge array) design, which allows mutations in the cache to be
readable, iterable, and snapshot'able.
When the user atomically commits a batch of key-val mutations into a
moss "collection", the batch of mutations is sorted by key, becoming
an immutable "segment" of key-val mutations. The segment is then
pushed onto the end of a logical sequence of segments. See:
collection.go / collection.ExecuteBatch().
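As a minimal sketch of that flow, assuming the public API shown in
the project README (NewCollection, NewBatch, ExecuteBatch):

    package main

    import "github.com/couchbase/moss"

    func main() {
        // An in-memory collection with default options.
        coll, err := moss.NewCollection(moss.CollectionOptions{})
        if err != nil {
            panic(err)
        }
        coll.Start() // Starts the background merger/persister goroutines.
        defer coll.Close()

        // A batch of key-val mutations; the (0, 0) size hints let moss
        // grow the batch as needed.
        batch, err := coll.NewBatch(0, 0)
        if err != nil {
            panic(err)
        }
        defer batch.Close()

        batch.Set([]byte("car-0"), []byte("tesla"))
        batch.Set([]byte("car-1"), []byte("honda"))

        // ExecuteBatch sorts the batch by key into an immutable segment
        // and atomically pushes that segment onto the collection's top
        // section.
        if err := coll.ExecuteBatch(batch, moss.WriteOptions{}); err != nil {
            panic(err)
        }
    }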
The logical sequence of segments that is maintained by a collection
has different sections...
+---------------------+
| collection |
| |
| * top |
| * mid |
| * base |
| * clean |
| |
+---------------------+
Figure 2.
The top, mid, and base sections are considered dirty, containing
segments with mutations that haven't yet been written back to the
optional lower-level storage. The clean section is considered
non-dirty, having already been persisted to the optional lower-level
store.
A section is implemented as a "segmentStack". See: segment_stack.go.
The top section is where incoming batches of mutations are pushed.
+---------------------+
| collection |
| |
| * top |
| segment-0 |
| * mid |
| * base |
| * clean |
| |
+---------------------+
Figure 3.
Eventually, following the LSM-based approach, the logical sequence of
segments becomes too long or too tall...
+---------------------+
| collection |
| |
| * top |
| segment-4 |
| segment-3 |
| segment-2 |
| segment-1 |
| segment-0 |
| * mid |
| * base |
| * clean |
| |
+---------------------+
Figure 4.
So, the segments need to be merged together. A background "merger"
goroutine handles this in a loop, by atomically moving segments from
the top section to the mid section.
+---------------------+
| collection |
| |
| * top |
| * mid |
| segment-4 |
| segment-3 |
| segment-2 |
| segment-1 |
| segment-0 |
| * base |
| * clean |
| |
+---------------------+
Figure 5.
This opens up more logical, configuration-controlled space in the top
section for more incoming batches, such as segment-5...
+---------------------+
| collection |
| |
| * top |
| segment-5 |
| * mid |
| segment-4 |
| segment-3 |
| segment-2 |
| segment-1 |
| segment-0 |
| * base |
| * clean |
| |
+---------------------+
Figure 6.
The merger then asynchronously merges the segments in the mid section
and atomically swaps in the shorter mid section when done, and
repeats. See: segment_stack_merge.go.
+---------------------+
| collection |
| |
| * top |
| segment-5 |
| * mid |
| segment-0...4 |
| * base |
| * clean |
| |
+---------------------+
Figure 7.
To support optional write-back persistence, a background "persister"
goroutine runs in a loop, atomically moving the mid section into the
base section...
+---------------------+
| collection |
| |
| * top |
| segment-5 |
| * mid |
| * base |
| segment-0...4 |
| * clean |
| |
+---------------------+
Figure 8.
The persister then writes out all the mutations in the base section to
the lower-level storage. When done with persistence, the persister
atomically clears out the base section or optionally moves it into the
clean section, and repeats. See: persister.go.
+---------------------+
| collection |
| |
| * top |
| segment-5 |
| * mid |
| * base |
| * clean |
| |
+---------------------+
| lower-level-storage |
| |
| mutations from... |
| segment-0...4 |
| |
+---------------------+
Figure 9.
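As a rough sketch of how an application might plug a lower-level
store into this write-back loop, assuming the LowerLevelInit /
LowerLevelUpdate hooks in CollectionOptions (collection.go) and
Store.Persist() (store.go); treat the exact field names and
signatures here as assumptions:

    // Continuing from the earlier sketch's imports; mossStore is used
    // here as the lower-level storage, but any snapshot-capable store
    // could sit behind the same hooks.
    store, err := moss.OpenStore("/tmp/moss-data", moss.StoreOptions{})
    if err != nil {
        panic(err)
    }

    llSnapshot, err := store.Snapshot() // Initial lower-level snapshot.
    if err != nil {
        panic(err)
    }

    coll, _ := moss.NewCollection(moss.CollectionOptions{
        LowerLevelInit: llSnapshot,
        LowerLevelUpdate: func(higher moss.Snapshot) (moss.Snapshot, error) {
            // Invoked by the persister with the dirty base section;
            // persist it and hand back a fresh lower-level snapshot.
            return store.Persist(higher, moss.StorePersistOptions{})
        },
    })
    coll.Start()
    defer coll.Close()

The OpenStoreCollection() helper in store.go appears to provide a
similar wiring out of the box (see the mossStore section below).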
A snapshot is implemented by copying all the segment pointers from all
the sections and returning a brand new, immutable segmentStack. See:
collection.go / collection.Snapshot().
For example, if the snapshot was taken when the collection was in the
state represented by Figure 6, then the snapshot would look like...
+---------------------+
| segmentStack |
| |
| segment-5 |
| segment-4 |
| segment-3 |
| segment-2 |
| segment-1 |
| segment-0 |
| |
+---------------------+
Figure 10.
The lower-level storage must also provide snapshot semantics for
snapshotting to work.
Per the LSM design approach, the key-val mutations from a more recent
or higher segment in a segmentStack shadow the key-val mutations from
an older or lower segment in a segmentStack. See: segment_stack.go /
segmentStack.Get().
An iterator is implemented by performing a heap-merge from multiple
segments. See: iterator.go.
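A short sketch of both reads, continuing the earlier example and
assuming the Snapshot / Iterator API from the project README (Get,
StartIterator, ErrIteratorDone):

    // coll is a started moss.Collection from the earlier sketch.
    ss, err := coll.Snapshot() // Immutable view over the current segments.
    if err != nil {
        panic(err)
    }
    defer ss.Close()

    // Get() consults segments newest-to-oldest, so newer mutations
    // shadow older ones for the same key.
    val, err := ss.Get([]byte("car-0"), moss.ReadOptions{})
    if err != nil {
        panic(err)
    }
    println("car-0 =>", string(val))

    // StartIterator() heap-merges the segments into one ordered stream.
    iter, err := ss.StartIterator(nil, nil, moss.IteratorOptions{})
    if err != nil {
        panic(err)
    }
    defer iter.Close()

    for {
        k, v, err := iter.Current()
        if err == moss.ErrIteratorDone {
            break
        }
        if err != nil {
            panic(err)
        }
        println(string(k), "=>", string(v))
        if err = iter.Next(); err == moss.ErrIteratorDone {
            break
        }
    }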
mossStore
=========
One option for lower-level storage is the built-in storage
implementation called mossStore. It uses an append-only file-format
design, where mossStore appends new segments and a "footer" to the end
of a file, with page-aligned start offsets. The footer contains
metadata such as the offsets and lengths of the segments written to
the file. See: store.go / SegmentLoc struct and pageAlignCeil().
Once the file footer is appended, mossStore performs an mmap() to read
the just-appended segments. See: store_footer.go /
Footer.loadSegments().
The already-written parts of the file are immutable and any mmap()'ed
regions are also read-only.
On a restart or reopen of a previous mossStore file, mossStore scans
the file backwards looking for the last good footer. See:
store_footer.go / ScanFooter().
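A minimal sketch of opening (or reopening) a mossStore-backed
collection, assuming the OpenStoreCollection() helper in store.go:

    // Opens (or creates) the append-only file under the directory,
    // scans backwards for the last good footer, and returns a
    // collection that is already wired to persist into the store.
    store, coll, err := moss.OpenStoreCollection("/tmp/moss-data",
        moss.StoreOptions{}, moss.StorePersistOptions{})
    if err != nil {
        panic(err)
    }
    defer store.Close()
    defer coll.Close()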
mossStore supports navigating to previous, read-only snapshots of the
database, by scanning the file for older footers. See:
store_previous.go.
mossStore can revert to a previous snapshot of the database, by
copying an older footer and just appending it to the end of the file
as a brand-new footer. See: store_revert.go.
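A hedged sketch of both operations, assuming the SnapshotPrevious()
and SnapshotRevert() methods suggested by store_previous.go and
store_revert.go:

    // Continuing from the store opened above.
    curr, err := store.Snapshot() // Backed by the newest footer.
    if err != nil {
        panic(err)
    }

    // Walk one footer back in the file; assumed to return nil when no
    // older footer exists.
    prev, err := store.SnapshotPrevious(curr)
    curr.Close()
    if err != nil {
        panic(err)
    }

    if prev != nil {
        // Revert by appending a copy of the older footer to the end of
        // the file as the brand-new, latest footer.
        if err = store.SnapshotRevert(prev); err != nil {
            panic(err)
        }
        prev.Close()
    }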
Full compaction is supported by mossStore to handle the situation of
an ever-growing append-only file, where active data from file X is
copied to a file X+1, and then file X is deleted. See:
store_compact.go.
To track file-related handle resources, mossStore uses ref-counting.
For file handles, see file.go. For mmap()'ed regions, see mmap.go.
One implication of this mossStore design is that a file written on a
machine with one endian'ness cannot be opened on a machine with a
different endian'ness, especially since mossStore has an optimization
where large arrays of uint64's are read/written from/to storage as
large, contiguous byte arrays. See: slice_util.go.
Child Collections
=================
moss allows multiple, related key-value collections to be logically
& atomically grouped as child collections within a parent collection.
The following properties hold true for these child collections:
* Logically separate key-space.
* Independent iteration & point queries.
* Fast deletions and recreations.
* Atomic grouping via the batch API. This means that all child
collection batches are processed together in an all-or-none manner
during moss's background merges and compactions.
For example, say A and B are child collections:
Batch1: | A-10-docs + B-3-docs |
Batch2: | B-4-docs |
Batch3: | A-2-docs |
After background merging, only the following can happen:
Batch(1+2) | A-10-docs + B-7-docs |
Batch3: | A-2-docs |
OR
Batch(1+2+3) | A-12-docs + B-7-docs |
Background merges and compactions will never split up the child
collection mutations that were executed in the same batch.
So there will never be a state like the following:
Batch(1+2): | A-10-docs|
Batch3: | B-7-docs|
This feature is similar to multiple tables in a relational database.
Here is a use case: say a collection represents the people in a
software company, where the keys are the user ids of the employees.
Now, if the company wishes to store metadata about its employees,
such as the number of people in development, sales, and marketing, in
the same collection, it could do so by prefixing the user ids with a
"role" id. That wastes key space and makes deletion of employees an
expensive operation. Instead, the "roles" can simply be a child
collection, where the keys are the actual roles and the values are
the counts of employees in each role.
Internally, child collections are stored recursively within the
respective Snapshot implementations.
For example, segmentStack implements child collections as follows:
top-level or default collection
+---------------------+
| segmentStack |
| |
| segment-2 |
| segment-1 |
| segment-0 |
| | child-collection
| "roles" => -------------> +---------------------+
| | | segmentStack |
+---------------------+ | |
| segment-2 |
| segment-1 |
| segment-0 |
| |
+---------------------+
Note: Since child collections are recursively linked with higher
collections, any child collection snapshots that are opened must be
explicitly closed in order to prevent resource leaks.
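A rough usage sketch, assuming the child-collection batch and
snapshot methods (NewChildCollectionBatch, ChildCollectionSnapshot);
treat the exact method names and options as assumptions:

    // coll is a started moss.Collection.
    batch, _ := coll.NewBatch(0, 0)
    defer batch.Close()

    // Mutations for the top-level (default) collection.
    batch.Set([]byte("emp-1000"), []byte("alice"))

    // Mutations for the "roles" child collection, committed atomically
    // together with the parent batch.
    rolesBatch, err := batch.NewChildCollectionBatch("roles", moss.BatchOptions{})
    if err != nil {
        panic(err)
    }
    rolesBatch.Set([]byte("development"), []byte("42")) // e.g., 42 developers.

    if err = coll.ExecuteBatch(batch, moss.WriteOptions{}); err != nil {
        panic(err)
    }

    // Reading back: child collection snapshots hang off the parent
    // snapshot and must be closed explicitly, per the note above.
    ss, _ := coll.Snapshot()
    defer ss.Close()

    rolesSS, err := ss.ChildCollectionSnapshot("roles")
    if err != nil {
        panic(err)
    }
    defer rolesSS.Close()

    devCount, _ := rolesSS.Get([]byte("development"), moss.ReadOptions{})
    println("development =>", string(devCount))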