1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
|
FLUSH protocol for default stack
================================
Author: Bela Ban
See http://jira.jboss.com/jira/browse/JGRP-82
Overview
--------
Flushing means that all members of a group flush their pending messages. The process of flushing acquiesces the cluster
so that state transfer or a join can be done. It is also called the stop-the-world model as nobody will be able to
send messages while a flush is in process.
When is it needed ?
-------------------
(1) On state transfer. When a member requests state transfer it tells everyone to stop sending messages and waits
for everyone's ack. Then it asks the application for its state and ships it back to the requester. After the
requester has received and set the state successfully, the requester tells everyone to resume sending messages.
(2) On view changes (e.g.a join). Before installing a new view V2, flushing would ensure that all messages *sent* in the
current view V1 are indeed *delivered* in V1, rather than in V2 (in all non-faulty members). This is essentially
Virtual Synchrony.
Design
------
As another protocol FLUSH which handles a FLUSH event. FLUSH sits just below the channel, e.g. above STATE_TRANSFER
and FC. STATE_TRANSFER and GMS protocol request flush by sending a SUSPEND event up the stack, where it is handled
by the FLUSH protocol.
When done (e.g. view was installed or state transferred), the protocol sends up a RESUME event, which will allow
everyone in the cluster to resume sending.
FLUSH protocol (state transfer)
-------------------------------
- State requester sends SUSPEND event: multicast START_FLUSH to everyone (including self)
- On reception of START_FLUSH:
- send BLOCK event up the stack to application level
- once BLOCK returns flip latch on FLUSH.down(), all threads invoking FLUSH.down() will block on mutex
- get DIGEST from NAKACK below and append it to FLUSH_COMPLETED message
- send FLUSH_COMPLETED back to flush requester
- On reception of FLUSH_COMPLETED at flush coordinator:
- Record FLUSH_COMPLETED in vector (1 per member)
- Once FLUSH_COMPLETED messages have been received from all members:
- allow thread that passed SUSPEND event to return
- State requester sends RESUME event: multicast STOP_FLUSH message
- On reception of STOP_FLUSH message:
- flip latch off at FLUSH.down() and send UNBLOCK event up to application level
FLUSH protocol (view installation for join/crash/leave)
-------------------------------------------------------
- Coordinator initiates FLUSH prior to view V2 installation
- Coordinator sends SUSPEND event: multicast START_FLUSH to every surviving/existing member in previous view V1 (including self)
- On reception of START_FLUSH (existing members in V1):
- send BLOCK event up the stack to application level
- once BLOCK returns flip latch on FLUSH.down(), all threads invoking FLUSH.down() will block on mutex
- get DIGEST from NAKACK below and append it to FLUSH_COMPLETED message
- send FLUSH_COMPLETED back to flush requester
- On reception of FLUSH_OK at each existing member in V1:
- On reception of FLUSH_COMPLETED at flush coordinator:
- Record FLUSH_COMPLETED in vector (1 per member)
- Once FLUSH_COMPLETED messages have been received from all members:
- allow thread that passed SUSPEND event to return
- Coordinator stops FLUSH after view V2 installation
- On RESUME event: multicast STOP_FLUSH message
- On reception of STOP_FLUSH message:
- flip latch off at FLUSH.down() and send UNBLOCK event up to application level
Crash handling during FLUSH
----------------------------
- On suspect(P)
- Do not expect any FLUSH message exchange from P (effectively exclude from FLUSH)
- If P is the flush requester and Q is next member in view
- Run the FLUSH protocol from scratch with Q as flush requester
Notes
-----
We are only concerned with multicast message in FLUSH. We always let unicast messages
down the latch at FLUSH.down() since virtual synchrony is only relevant to multicasts.
FLUSH on view change
--------------------
Whenever there is a view change, GMS needs to use FLUSH to acquiesce the group, before sending the view change message
and returning the new view to the joiner. Here's pseudo code of the coordinator A in group {A,B,C}:
- On reception of JOIN(D):
- FLUSH(A,B,C) - wait until flush returns, we flush A, B and C, but onviously not D
- Compute new view V2={A,B,C,D}
- Multicast V2 to {A,B,C}, wait for all acks
- Send JOIN_RSP with V2 to D, wait for ack from D
- RESUME sending messages
This algorithm also works for a number of JOIN and LEAVE requests (view bundling):
- On handleJoinAndLeaveRequests(M, S, L, J), where M is the current membership, S the set of suspected members,
L the set of leaving members and J the set of joined members:
- FLUSH(M)
- Compute new view V2=M - S - L + J
- Multicast V2 to M-S, wait for all acks
- Send LEAVE_RSP to L, wait for all acks
- Send JOIN_RSP to J, wait for all acks
- Resume sending messages
FLUSH with join & state transfer
--------------------------------
http://jira.jboss.com/jira/browse/JGRP-236 (DONE)
Needs description (TODO)
Description below is outdated - needs revisiting (TODO)
Design change: flushing and block() / unblock() callbacks
---------------------------------------------------------
The change consists of moving the blocking of message sending in FLUSH.down() from the first phase to the second phase:
Phase #1: send START_FLUSH across the cluster, on reception everyone invokes block(). We do *not* (as previously done)
block messages in FLUSH.down(). Method block() might send some messages, but when block() returns (or
Channel.blockOk() is called), FLUSH_OK will be broadcast across the cluster. Because FIFO is in place, we
can be sure that all *multicast* (group) messages from member P are received *before* P's FLUSH_OK message.
Phase #2: send FLUSH_OK across the cluster. When every member has received all FLUSH_OKs from all other member, every
member blocks messages in FLUSH.down().
Phase #3: send FLUSH_COMPLETED to the initiator of the flush
Phase #4: send STOP_FLUSH across the cluster, e.g. when state has been transferred, or member has joined
Use case: block() callback needs to complete transactions which are in the 2PC (two-phase commit) phase
- Members {A,B,C}
- C just joined and now starts a state transfer, to acquire the state from A
- A and B have transactions in PREPARED or PREPARING state, so these need to be completed
- C initiates the flush protocol
- START_FLUSH is broadcast across the cluster
- A and B have their block() method invoked, they broadcast PREPARE and/or COMMIT messages to complete existing
prepared or preparing transaction, and blocks or rolls back other transactions
- A and B have to wait for PREPARE or COMMIT acks from all members, they block until these are received (inside
of the block() callback)
- Once all acks have been received, A and B return from block(), they now broadcast FLUSH_OK messages across the cluster
- Once everyone has received FLUSH_OK messages from A, B and C, we can be assured that we have also received all
PREPARE/COMMIT/ROLLBACK messages and their acks from all members
- Now FLUSH.down() at A,B,C starts blocking sending of messages
Issues:
- What happens is a member sends another message M *after* returning from block() but *before* that member has received
all FLUSH_OK messages and therefore blocks ?
- This algorithm only works for multicast messages, not unicast messages because multicast and unicast messages are not
ordered with respect to each other
|