File: DESIGN.md

moss design overview
--------------------

moss as an in-memory KV collection and write-back cache
=======================================================

moss originally began as a 100% Go, in-memory, ordered, key-value
collection library with optional write-back caching.  Users can
optionally provide their own lower-level storage to which moss can
asynchronously persist batched-up mutations.

    +---------------------+
    | moss                |
    +---------------------+
    | lower-level storage |
    +---------------------+

    Figure 1.

That is, users can use moss as a write-back cache in front of
forestdb, rocksdb, boltdb, etc.

Later on, the project added its own optional, built-in lower-level
storage implementation, called "mossStore", which follows an
append-only design described in a later section.

To implement the cache, moss follows an LSM-array (Log Structured
Merge array) design, which allows mutations in the cache to be
readable, iterable, and snapshot'able.

When the user atomically commits a batch of key-val mutations into a
moss "collection", the batch of mutations is sorted by key, becoming
an immutable "segment" of key-val mutations.  The segment is then
pushed onto the end of a logical sequence of segments.  See:
collection.go / collection.ExecuteBatch().
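
For illustration, here is a minimal sketch of committing a batch
through the public API (error handling elided for brevity):

    import "github.com/couchbase/moss"

    func example() {
        coll, _ := moss.NewCollection(moss.CollectionOptions{})
        coll.Start()
        defer coll.Close()

        // The committed batch becomes a sorted, immutable segment
        // pushed onto the top section of the collection.
        batch, _ := coll.NewBatch(2, 32) // hints: total ops, key-val bytes
        batch.Set([]byte("car-0"), []byte("tesla"))
        batch.Set([]byte("car-1"), []byte("honda"))
        coll.ExecuteBatch(batch, moss.WriteOptions{})
        batch.Close()
    }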

The logical sequence of segments that is maintained by a collection
has different sections...

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    | * mid               |
    | * base              |
    | * clean             |
    |                     |
    +---------------------+

    Figure 2.

The top, mid, base sections are considered dirty, containing segments
with mutations that haven't yet been written back to the optional
lower-level storage.  The clean section is considered non-dirty,
having already been persisted to the optional lower-level store.

A section is implemented as a "segmentStack".  See: segment_stack.go.
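
Conceptually, a segmentStack is just an ordered list of immutable
segments.  A simplified sketch (the real struct in segment_stack.go
carries more fields):

    // Simplified sketch, not the actual definition.
    type segmentStack struct {
        a []*segment // ordered: a[0] is oldest, a[len(a)-1] is newest
    }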

The top section is where incoming batches of mutations are pushed.

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    |     segment-0       |
    | * mid               |
    | * base              |
    | * clean             |
    |                     |
    +---------------------+

    Figure 3.

Eventually, following the LSM-based approach, the logical sequence of
segments becomes too long or too tall...

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    |     segment-4       |
    |     segment-3       |
    |     segment-2       |
    |     segment-1       |
    |     segment-0       |
    | * mid               |
    | * base              |
    | * clean             |
    |                     |
    +---------------------+

    Figure 4.

So, the segments need to be merged together.  A background "merger"
goroutine handles this in a loop by atomically moving segments from
the top section to the mid section.

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    | * mid               |
    |     segment-4       |
    |     segment-3       |
    |     segment-2       |
    |     segment-1       |
    |     segment-0       |
    | * base              |
    | * clean             |
    |                     |
    +---------------------+

    Figure 5.

This frees up more room in the top section (whose size is
configuration-controlled) for more incoming batches, such as
segment-5...

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    |     segment-5       |
    | * mid               |
    |     segment-4       |
    |     segment-3       |
    |     segment-2       |
    |     segment-1       |
    |     segment-0       |
    | * base              |
    | * clean             |
    |                     |
    +---------------------+

    Figure 6.

The merger then asynchronously merges the segments in the mid section,
atomically swaps in the shorter, merged mid section when done, and
repeats.  See: segment_stack_merge.go.

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    |     segment-5       |
    | * mid               |
    |     segment-0...4   |
    | * base              |
    | * clean             |
    |                     |
    +---------------------+

    Figure 7.

To support optional write-back persistence, a background "persister"
goroutine runs in a loop, atomically moving the mid section into the
base section...

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    |     segment-5       |
    | * mid               |
    | * base              |
    |     segment-0...4   |
    | * clean             |
    |                     |
    +---------------------+

    Figure 8.

The persister then writes out all the mutations in the base section to
the lower-level storage.  When done with persistence, the persister
atomically clears out the base section or optionally moves it into the
clean section, and repeats.  See: persister.go.

    +---------------------+
    | collection          |
    |                     |
    | * top               |
    |     segment-5       |
    | * mid               |
    | * base              |
    | * clean             |
    |                     |
    +---------------------+
    | lower-level-storage |
    |                     |
    | mutations from...   |
    |     segment-0...4   |
    |                     |
    +---------------------+

    Figure 9.
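
For users who bring their own lower-level storage, the hand-off
happens through hooks on CollectionOptions.  A sketch of wiring in a
custom store might look like the following, assuming the
LowerLevelInit / LowerLevelUpdate fields; the lowerStore type and its
methods are hypothetical placeholders:

    // Sketch: hooking a custom lower-level store into the persister.
    func newWriteBackCollection(store *lowerStore) (moss.Collection, error) {
        return moss.NewCollection(moss.CollectionOptions{
            LowerLevelInit: store.Snapshot(),
            LowerLevelUpdate: func(higher moss.Snapshot) (moss.Snapshot, error) {
                // Persist the dirty base-section mutations, then return
                // a snapshot of the lower-level store so moss can drop
                // its now-clean in-memory copies.
                if err := store.Persist(higher); err != nil {
                    return nil, err
                }
                return store.Snapshot(), nil
            },
        })
    }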

A snapshot is implemented by copying all the segment pointers from all
the sections and returning a brand new, immutable segmentStack.  See:
collection.go / collection.Snapshot().

For example, if the snapshot was taken when the collection was in the
state represented by Figure 6, then the snapshot would look like...

    +---------------------+
    | segmentStack        |
    |                     |
    |     segment-5       |
    |     segment-4       |
    |     segment-3       |
    |     segment-2       |
    |     segment-1       |
    |     segment-0       |
    |                     |
    +---------------------+

    Figure 10.

The lower-level storage must also provide snapshot semantics for
snapshotting to work.
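
From the user's perspective, and continuing the earlier API sketch,
reading through a snapshot looks roughly like (error handling
elided):

    // A snapshot is cheap: it copies segment pointers, not data.
    ss, _ := coll.Snapshot()
    defer ss.Close() // snapshots must be closed to release resources

    val, _ := ss.Get([]byte("car-0"), moss.ReadOptions{})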

Per the LSM design approach, the key-val mutations from a more recent
or higher segment in a segmentStack shadow the key-val mutations from
an older or lower segment.  See: segment_stack.go /
segmentStack.Get().
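
In simplified form (an illustrative sketch, not the actual code; the
lookup() helper on a segment is hypothetical), a get walks the
segments from newest to oldest and returns on the first hit:

    // Sketch of shadowing: the newest segment containing the key wins.
    func (ss *segmentStack) get(key []byte) ([]byte, bool) {
        for i := len(ss.a) - 1; i >= 0; i-- { // newest to oldest
            if val, ok := ss.a[i].lookup(key); ok {
                return val, true // shadows any older mutations
            }
        }
        return nil, false
    }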

An iterator is implemented by performing a heap-merge from multiple
segments.  See: iterator.go.
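
The idea behind the heap-merge, stripped of moss's details, is
sketched below; this merges plain sorted string slices and ignores
the shadowing that the real iterator must also resolve:

    import "container/heap"

    // cursor tracks a position within one sorted segment's keys.
    type cursor struct {
        keys []string
        pos  int
    }

    // cursorHeap is a min-heap ordered by each cursor's current key.
    type cursorHeap []*cursor

    func (h cursorHeap) Len() int      { return len(h) }
    func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
    func (h cursorHeap) Less(i, j int) bool {
        return h[i].keys[h[i].pos] < h[j].keys[h[j].pos]
    }

    func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
    func (h *cursorHeap) Pop() interface{} {
        old := *h
        c := old[len(old)-1]
        *h = old[:len(old)-1]
        return c
    }

    // mergedKeys returns all keys across the segments, ascending.
    func mergedKeys(segments [][]string) []string {
        h := &cursorHeap{}
        for _, keys := range segments {
            if len(keys) > 0 {
                *h = append(*h, &cursor{keys: keys})
            }
        }
        heap.Init(h)
        var out []string
        for h.Len() > 0 {
            c := (*h)[0] // cursor with the smallest current key
            out = append(out, c.keys[c.pos])
            c.pos++
            if c.pos < len(c.keys) {
                heap.Fix(h, 0) // re-sift after advancing this cursor
            } else {
                heap.Pop(h) // this segment is exhausted
            }
        }
        return out
    }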

mossStore
=========

One option for lower-level storage is the built-in storage
implementation called mossStore.  It uses an append-only file-format
design, where mossStore appends new segments and a "footer" to the end
of a file, with page-aligned start offsets.  The footer contains
metadata such as the offsets and lengths of the segments written to
the file.  See: store.go / SegmentLoc struct and pageAlignCeil().
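
Schematically, the footer's bookkeeping amounts to the following
(a simplified sketch; see store.go for the real, richer structs):

    // Simplified sketch of the on-disk bookkeeping, not the actual
    // definitions in store.go.
    type segmentLoc struct {
        Offset uint64 // page-aligned start of the segment in the file
        Length uint64 // number of bytes the segment occupies
    }

    type footer struct {
        SegmentLocs []segmentLoc // one entry per live, persisted segment
    }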

Once the file footer is appended, mossStore performs an mmap() to read
the just-appended segments.  See: store_footer.go /
Footer.loadSegments().

The already-written parts of the file are immutable and any mmap()'ed
regions are also read-only.

On a restart or reopen of a previous mossStore file, mossStore scans
the file backwards looking for the last good footer.  See:
store_footer.go / ScanFooter().

mossStore supports navigating to previous, read-only snapshots of the
database, by scanning the file for older footers.  See:
store_previous.go.

mossStore can revert to a previous snapshot of the database, by
copying an older footer and just appending it to the end of the file
as a brand-new footer.  See: store_revert.go.

Full compaction is supported by mossStore to handle the situation of
an ever-growing append-only file, where active data from file X is
copied to a file X+1, and then file X is deleted.  See:
store_compact.go.

To track file-related handle resources, mossStore uses ref-counting.
For file handles, see file.go.  For mmap()'ed regions, see mmap.go.

One implication of this mossStore design is that a file written on a
machine of one endianness cannot be opened on a machine of a
different endianness, since mossStore has an optimization where large
arrays of uint64's are read/written from/to storage as large,
contiguous byte arrays.  See: slice_util.go.
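
That optimization is essentially a reinterpret-cast.  An illustrative
sketch (using Go 1.17+'s unsafe.Slice, not moss's actual code) shows
why the bytes land in the host CPU's native byte order:

    import "unsafe"

    // Illustrative only: viewing a []uint64 as raw bytes emits them in
    // the host's native byte order, so the file bytes differ between
    // big-endian and little-endian machines.
    func uint64sAsBytes(a []uint64) []byte {
        if len(a) == 0 {
            return nil
        }
        return unsafe.Slice((*byte)(unsafe.Pointer(&a[0])), len(a)*8)
    }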

Child Collections
=================

Moss allows multiple related key-value collections to be logically
and atomically grouped as child collections within a parent
collection.  The following properties hold for these child
collections:
* Logically separate key-space.
* Independent iteration & point queries.
* Fast deletions and recreations.
* Atomic grouping via the batch API. This means that all child
  collection batches are processed together in an all-or-none manner
  during moss's background merges and compactions.

For example, say A and B are child collections:
    Batch1: | A-10-docs + B-3-docs |
    Batch2: | B-4-docs |
    Batch3: | A-2-docs |

After background merging, only the following states are possible:
    Batch(1+2): | A-10-docs + B-7-docs |
    Batch3: | A-2-docs |
    OR
    Batch(1+2+3): | A-12-docs + B-7-docs |

Background merges and compactions will never split up child
collections that were executed in the same batch.

So there will never be a state like the following:
    Batch(1+2): | A-10-docs |
    Batch3: | B-7-docs |

This feature is similar to multiple tables in a relational database.

Here is a use case: say a collection represents the people in a
software company, where the keys are employee user ids.  If the
company wishes to store metadata about its employees, such as the
number of people in development, sales, and marketing, in the same
collection, it could do so by prefixing the user ids with a "role"
id.  But that wastes key space and makes deletion of employees an
expensive operation.  Instead, the "roles" can simply be a child
collection, where the keys are the actual roles and the values are
the counts of employees in each role.
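
A sketch of that use case with the child-collection batch API follows
(assuming batch and snapshot methods named NewChildCollectionBatch and
ChildCollectionSnapshot; error handling elided):

    // Sketch: one atomic batch touching both the default collection
    // and a "roles" child collection.
    batch, _ := coll.NewBatch(0, 0)
    defer batch.Close()

    batch.Set([]byte("emp-1000"), []byte("alice"))

    childBatch, _ := batch.NewChildCollectionBatch("roles",
        moss.BatchOptions{})
    childBatch.Set([]byte("development"), []byte("42"))

    coll.ExecuteBatch(batch, moss.WriteOptions{}) // all-or-none

    // Reads go through a per-child-collection snapshot.
    ss, _ := coll.Snapshot()
    defer ss.Close()
    childSS, _ := ss.ChildCollectionSnapshot("roles")
    defer childSS.Close()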

Internally, child collections are stored recursively within the
respective Snapshot implementations.

For example, segmentStack implements child collections as follows:

    top-level or default collection
    +---------------------+
    | segmentStack        |
    |                     |
    |     segment-2       |
    |     segment-1       |
    |     segment-0       |
    |                     |             child-collection
    |     "roles" => -------------> +---------------------+
    |                     |         | segmentStack        |
    +---------------------+         |                     |
                                    |     segment-2       |
                                    |     segment-1       |
                                    |     segment-0       |
                                    |                     |
                                    +---------------------+

Note: Since child collections are recursively linked with their
parent collections, any child collection snapshots that are opened
must be explicitly closed in order to prevent resource leaks.