1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273
|
<!---
This document contains embedded graphviz diagrams inside ```dot blocks.
To convert it to rendered form using render.py:
$ ./render.py wrapping-upb.in.md
You can also live-preview this document with all diagrams using Markdown Preview Enhanced
in Visual Studio Code:
https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
--->
# μpb Arena Fusion
μpb generally follows a thread-compatibility model where only operations on a
`const` pointer may occur concurrently from multiple threads; non-const
operations must not race with each other or operations on `const` pointers.
Fusion, via `upb_Arena_Fuse`, binds the lifetime of two arenas together, such
that none are freed until all of the transitively fused arenas reach refcount 0.
This is useful to avoid dangling pointers when a message in one arena references
a message in another - such as when a child message is constructed in isolation
then set onto a parent message - by fusing the parent's arena with the child's,
the child's lifetime is tied to the parent's without needing to copy it.
Both modifying the refcount and fusing are thread safe; if you want to
conceptually perform allocations concurrently from multiple threads with the
same arena lifetime, one way to achieve that would be to share a `const
upb_Arena* parent` plus a dedicated arena per thread, and fuse the
thread-specific arena with the `parent` arena. By contrast, reference counting
cannot help to achieve concurrent allocations on multiple threads, it only helps
in synchronizing lifetimes from multiple things observing the same arena (which
may be a non-`const` pointer for multiple writers in a single threaded context,
or a `const` pointer for multiple readers in a multithreaded context).
To be usable everywhere, it has a lock-free implementation, documented below.
## Data Structure
Each arena has three pointer sized members used to track the relationship
between arenas. Together, they implement a hybrid disjoint set forest and
doubly-linked list.
```c
// Tagged pointer - tracked as black arrows in diagrams
UPB_ATOMIC(uintptr_t) parent_or_count;
// Linked list - tracked as red arrows in diagrams
UPB_ATOMIC(struct upb_ArenaInternal*) next;
// Linked list - previous tracked as blue arrows in diagrams, tail as dashed
UPB_ATOMIC(uintptr_t) previous_or_tail;
```
The disjoint set is used to check whether two arenas are already fused and to
provide agreement on a single reference count across all fused arenas. The
linked list is used to implement freeing arenas once the refcount reaches 0, and
to implement `upb_Arena_SpaceAllocated`.
Two arenas are considered fused when the disjoint set says they are; the linked
list is guaranteed to converge before the refcount reaches 0, but may only track
a subset of fused arenas while racing with fuse calls.
## Finding the root
Each set of fused arenas is uniqiuely identified by the root node of its tree.
To find the root for a given arena, traverse the `parent_or_count` pointer until
you reach a node with a count instead of a parent pointer - that's the root. To
avoid expensive lookups if frequently repeated, this data structure uses path
splitting to halve the distance from the root for every node traversed.
Repeatedly finding the root for a series of nodes that are part of the same
fused set will converge to O(1) quickly. Let's find the root of D, starting
with:
```dot
digraph {
C [label="C (next)"]
D [label="D (current)"]
A -> B [dir=back]
B -> C [dir=back]
C -> D [dir=back]
}
```
We traverse our parents; each time we pass one we link it to its grandparent, so
that future lookups are faster.
```dot
digraph {
B [label="B (next)"]
C [label="C (current)"]
A -> B [dir=back]
B -> C [dir=back]
B -> D [dir=back]
}
```
Then:
```dot
digraph {
A [label="A (next)"]
B [label="B (current)"]
A -> B [dir=back]
A -> C [dir=back]
B -> D [dir=back]
}
```
D and C's paths got shorter; if we query D's root again, all subsequent lookups
will only require one step.
## Fusing
Fusing starts by identifying the roots of each arena to fuse - if they already
share a root, there's nothing to be done. We'll work through the example of
fusing C and D - which is the same as fusing A and B, as they're the roots.
```dot
digraph {
A [label="A (refs=2)"]
B [label="B (refs=3)"]
A -> C [dir=back]
A -> C [weight=0.001, color="red"]
C -> A [weight=0.001, color="blue"]
A -> C [style=dashed]
B -> D [dir=back]
B -> D [weight=0.001, color="red"]
D -> B [weight=0.001, color="blue"]
B -> D [style=dashed]
}
```
A and B are roots, so they store a refcount in their `parent_or_count` and the
start of the linked list in their `next` pointers; C and D are not, so they
store pointers to their parent node in `parent_or_count`. C and D store `NULL`
in `next` as they happen to be the last nodes in their set.
### Pass refcount
As soon as the parent changes, all future refcount operations will switch to the
new root. To avoid decrementing to 0 while the old root still has active
references, we add the lower-addressed node's refcount to the new root.
```dot
digraph {
A [label="A (refs=5)"]
B [label="B (refs=3)"]
A -> C [dir=back]
A -> C [weight=0.001, color="red"]
C -> A [weight=0.001, color="blue"]
A -> C [style=dashed]
B -> D [dir=back]
B -> D [weight=0.001, color="red"]
D -> B [weight=0.001, color="blue"]
B -> D [style=dashed]
}
```
The total transferred refcount is tracked throughout the operation so it can be
fixed up at the end if retries resulted in over-transfer.
### Union
Now the arena with a lower address becomes the new root (A in this case), by
compare-and-exchange the higher address arena's `parent_or_count` pointer.
```dot
digraph {
A [label="A (refs=5)"]
A -> C [dir=back]
A -> C [weight=0.001, color="red"]
C -> A [weight=0.001, color="blue"]
A -> C [style=dashed]
B -> D [dir=back]
B -> D [weight=0.001, color="red"]
D -> B [weight=0.001, color="blue"]
B -> D [style=dashed]
A -> B [dir=back]
}
```
This atomically eliminates B's refcount and makes it one of A's children. Any
change to B's refcount (or fusing it to a different arena with a lower address)
will make the swap fail, and the process starts again from the beginning.
### Refcount fixups
If B's refcount changed while performing the union, we increase or decrease A's
refcount by the difference between the initial refcount we added and B's final
refcount, observed as part of the compare-and-exchange of B's `parent_or_count`.
### Fuse linked lists
To fuse the linked lists, we need to make A's tail point to B. The root node is
always the start of the linked list, so that the list can't contain cycles. From
A's tail pointer, we traverse until we hit the end (C in this case), and CAS its
`next` to B.
```dot
digraph {
A [label="A (refs=5)"]
A -> C [dir=back]
A -> C [weight=0.001, color="red"]
C -> A [weight=0.001, color="blue"]
A -> C [style=dashed]
B -> D [dir=back]
B -> D [weight=0.001, color="red"]
D -> B [weight=0.001, color="blue"]
B -> D [style=dashed]
A -> B [dir=back]
C -> B [weight=0.001, color="red"]
}
```
With this, B is now reachable from the root and the final free operation will be
able to see it, traversing A->C->B->D.
### Update tail pointer
If we stopped at the previous step, it's easy to see how fusing many arenas
could result in O(n^2) behavior, as we'd spend all of our time traversing the
linked list from the root to find the tail. To avoid this, we update A's tail
pointer to point to B's tail - so the next fuse operation will not have to
traverse C or B.
```dot
digraph {
A [label="A (refs=5)"]
A -> C [dir=back]
A -> C [weight=0.001, color="red"]
C -> A [weight=0.001, color="blue"]
A -> D [style=dashed]
B -> D [dir=back]
B -> D [weight=0.001, color="red"]
D -> B [weight=0.001, color="blue"]
B -> D [style=dashed]
A -> B [dir=back]
C -> B [weight=0.001, color="red"]
}
```
### Update back links
To make it possible to always traverse the list in both directions for
`upb_Arena_SpaceAllocated`, we need to make the matching back link for our
doubly-linked list. Since we linked C to B, we need to link B to C via B's
`previous_or_tail` pointer.
```dot
digraph {
A [label="A (refs=5)"]
A -> C [dir=back]
A -> C [weight=0.001, color="red"]
C -> A [weight=0.001, color="blue"]
A -> D [style=dashed]
B -> D [dir=back]
B -> D [weight=0.001, color="red"]
D -> B [weight=0.001, color="blue"]
B -> C [weight=0.001, color="blue"]
A -> B [dir=back]
C -> B [weight=0.001, color="red"]
}
```
## Counting allocated space
Traversing the linked list while the arenas are still live can be tricky, as we
can't necessarily reach our own node via the linked list from the root. Instead,
we scan the linked list in both directions starting from the caller-specified
node. This makes space counting weakly consistent - it's possible to see that A
and B are fused, but not see A's space reflected in `SpaceAllocated(B)` if the
fuse operation that joined them is still in progress. But counting allocated
space is always consistent with itself and any fully complete fuses - nodes are
only appended or prepended to the list, so starting the traversal at the same
point each time guarantees that future scans see a superset of previous ones.
|