1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
|
Uniform message delivery
========================
Author: Bela Ban
Definition
----------
Dynamic uniformity is the property that a message M is delivered to the group members only
after all members acknowledge M's reception ([1] ch 13.9). This ensures that either all operational
members or none receive M. The latter could occur if M's sender
crashed right after sending M, but before *any* of the members received M. If at least one member P
received M, and P doesn't crash, then M will be delivered to all members.
This is different from two-phase commit (2PC) protocols in the database world, where even non-operational
(= crashed) members must eventually deliver M, by logging the PREPARE and COMMIT actions, and processing
the log when restarting after a crash.
2PC is not necessary in the group communication world, because the state of a new member can be initialized
from an existing member by means of state transfer.
Design
------
There is an additional UNIFORM protocol, which needs to sit somewhere on top of NAKACK (note that uniformity doesn't
apply to unicast messages, only to multicast messages). UNIFORM also needs reliable unicast messages, so it must sit
on top of UNICAST, too.
init():
- preparing-table (sender),
committing-table (sender): map<M, List<Address>> where M is <sender:seqno> and List is a list of
members from which we expect ACKs for M
- prepared-table (receiver): map<M, Entry> where M is <sender:seqno>
and Entry contains (1) the real message received as part of PREPARE and
(2) the list of members to which this PREPARE was sent
- committed-table (receiver): list<M> where M is <sender:seqno>
On send(M):
- Add PREPARE header to M
- Put message M in preparing-table (key is M (dest + seqno), value is list of members from which we expect
ACKs (this is the membership view at the time M was sent)
- Pass message down
On reception of PREPARE(M) from P:
- Add M to prepared-table if M is not yet in committed-table
- Send PREPARE-ACK(M) to P
On reception of PREPARE-ACK(M) from P:
- Remove P from M's list in preparing-table
- If M's list is empty:
- Remove M from preparing-table
- Add M to committing-table
- Send COMMIT(M) message to group
On reception of COMMIT(M) from P:
- Remove M from prepared-table (if present)
- Add M to committed-table (if not yet present)
- Pass M up the stack (delivering it)
- Send COMMIT-ACK(M) to P
On reception of COMMIT-ACK(M) from P:
- Remove P from M's list in committing-table
- If M's list is empty:
- Remove M from committing-table
- Send ACK(M) to group
On reception of ACK(M):
- Remove M from committed-table
- On reception of COLLECT(P) from R:
- Get all messages from P in prepared-table and committed-table which are not yet in list L
sent as part of the COLLECT message and add them to L
- Send a COLLECT-ACK to R containing the updated L
On reception of VIEW(V):
- We need to agree on all pending PREPARE, COMMIT and ACK messages for a sender P who is not in V anymore.
(This is similar to the LEAVE/STOP use case below)
- Some member R drives this flush protocol, R might be the coordinator, or a deterministically chosen member (e.g.
the one right next to P)
- Get a list L of all messages from P in prepared-table and committed-table
- Send a COLLECT(P) message to the group, the message contains L
- Wait for all COLLECT-ACK responses:
- Merge response into 2 lists: L1 and L2
- Add all elements of L1 into preparing-table, for each element of L1 send a PREPARE message
- Add all elements of L2 into committing-table, for each element of L2 send a COMMIT message
- The rest is driven by the regular processing of the sender
On reception of SUSPECT(P):
-Ignored
On LEAVE/STOP:
- Flush preparing-table: wait for all PREPARE-ACKs but don't send new PREPARE messages
- Flush committing-table: wait for all COMMIT-ACKs but don't send new COMMIT messages
- Flush acking-table: wait until all ACKs have been received, don't add new elements
Misc
----
- Provide JMX stats of all tables, so we can introspect them at runtime, e.g. to ensure that they don't grow
infinitely
- The PREPARE, COMMIT and ACK messages contain also a member list at which the message is targeted
- The VIEW processing needs to be handled in a separate thread. When a view V2 arrives while V1 is still
handled, and if V2 contains fewer members than V1, then the wait for COLLECT-ACK might wait forever. Therefore,
all members from V1 not in V2 will be removed from the processing of V1, so the wait for COLLECT-ACK can terminate
- The fact that processing of a message M from P is continued by another member R when P crashes, will lead to
eventual delivery of M as long as at least 1 member has received M and that member remains operational for M
to be committed across the group
References
----------
[1] Birman, Kenneth P.: Building Secure and Reliable Network Applications.
Manning, Greenwich CT, 1996
|