FLUSH protocol for default stack
================================

Author: Bela Ban
See http://jira.jboss.com/jira/browse/JGRP-82


Overview
--------
Flushing means that all members of a group flush their pending messages. The process of flushing
acquiesces the cluster so that a state transfer or a join can be done. It is also called the
stop-the-world model, as nobody is able to send messages while a flush is in progress.


When is it needed?
------------------
(1) On state transfer. When a member requests a state transfer, it tells everyone to stop sending
    messages and waits for everyone's ack. Then it asks the application for its state and ships it
    back to the requester. After the requester has received and set the state successfully, the
    requester tells everyone to resume sending messages.

(2) On view changes (e.g. a join). Before installing a new view V2, flushing ensures that all
    messages *sent* in the current view V1 are indeed *delivered* in V1, rather than in V2, in all
    non-faulty members. This is essentially Virtual Synchrony.


Design
------
Flushing is implemented as another protocol, FLUSH, which handles the FLUSH event. FLUSH sits just
below the channel, e.g. above STATE_TRANSFER and FC. The STATE_TRANSFER and GMS protocols request a
flush by sending a SUSPEND event up the stack, where it is handled by the FLUSH protocol. When done
(e.g. the view was installed or the state transferred), the protocol sends up a RESUME event, which
allows everyone in the cluster to resume sending.


FLUSH protocol (state transfer)
-------------------------------
- State requester sends SUSPEND event: multicast START_FLUSH to everyone (including self)
- On reception of START_FLUSH:
  - send BLOCK event up the stack to the application level
  - once BLOCK returns, flip the latch on FLUSH.down(): all threads invoking FLUSH.down() will
    block on a mutex (see the sketch after this section)
  - get the DIGEST from NAKACK below and append it to the FLUSH_COMPLETED message
  - send FLUSH_COMPLETED back to the flush requester
- On reception of FLUSH_COMPLETED at the flush coordinator:
  - record FLUSH_COMPLETED in a vector (1 per member)
  - once FLUSH_COMPLETED messages have been received from all members:
    - allow the thread that passed the SUSPEND event to return
- State requester sends RESUME event: multicast STOP_FLUSH message
- On reception of STOP_FLUSH message:
  - flip the latch off at FLUSH.down() and send an UNBLOCK event up to the application level
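
The latch and the FLUSH_COMPLETED bookkeeping can be sketched with plain Java monitors. This is
only a minimal illustration, not the real FLUSH protocol class: FlushLatchSketch, its method names
and the clusterSize parameter are hypothetical, and a Runnable stands in for passing a message
further down the stack.

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch of the FLUSH latch; not the real JGroups FLUSH class.
    public class FlushLatchSketch {
        private final Object latch = new Object();
        private boolean flushInProgress;                             // the latch from the design
        private final Set<String> flushCompleted = new HashSet<>();  // one ack per member

        // Every message sent down the stack passes through here; senders block
        // while a flush is in progress. Sending inside the monitor is a
        // simplification: it keeps a message from slipping past a latch that is
        // raised concurrently, at the cost of serializing senders.
        public void down(Runnable sendAction) throws InterruptedException {
            synchronized (latch) {
                while (flushInProgress)
                    latch.wait();
                sendAction.run(); // pass the message further down the stack
            }
        }

        // On reception of START_FLUSH, after BLOCK has returned from the application.
        public void startFlush() {
            synchronized (latch) {
                flushInProgress = true;
                flushCompleted.clear();
            }
        }

        // On reception of FLUSH_COMPLETED at the flush coordinator; returns true once
        // all members have acked, i.e. the thread that passed SUSPEND may return.
        public boolean onFlushCompleted(String member, int clusterSize) {
            synchronized (latch) {
                flushCompleted.add(member);
                return flushCompleted.size() == clusterSize;
            }
        }

        // On reception of STOP_FLUSH: flip the latch off and wake all blocked
        // senders (corresponds to sending UNBLOCK up to the application).
        public void stopFlush() {
            synchronized (latch) {
                flushInProgress = false;
                latch.notifyAll();
            }
        }
    }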


FLUSH protocol (view installation for join/crash/leave)
-------------------------------------------------------
- Coordinator initiates FLUSH prior to view V2 installation
- Coordinator sends SUSPEND event: multicast START_FLUSH to every surviving/existing member in the
  previous view V1 (including self)
- On reception of START_FLUSH (existing members in V1):
  - send BLOCK event up the stack to the application level
  - once BLOCK returns, flip the latch on FLUSH.down(): all threads invoking FLUSH.down() will
    block on a mutex
  - get the DIGEST from NAKACK below and append it to the FLUSH_COMPLETED message
  - send FLUSH_COMPLETED back to the flush requester
- On reception of FLUSH_COMPLETED at the flush coordinator:
  - record FLUSH_COMPLETED in a vector (1 per member)
  - once FLUSH_COMPLETED messages have been received from all members:
    - allow the thread that passed the SUSPEND event to return
- Coordinator stops FLUSH after view V2 installation
- On RESUME event: multicast STOP_FLUSH message
- On reception of STOP_FLUSH message:
  - flip the latch off at FLUSH.down() and send an UNBLOCK event up to the application level


Crash handling during FLUSH
---------------------------
- On suspect(P):
  - do not expect any FLUSH message exchange from P (effectively exclude P from the FLUSH)
  - if P is the flush requester and Q is the next member in the view:
    - run the FLUSH protocol from scratch, with Q as the flush requester


Notes
-----
We are only concerned with multicast messages in FLUSH. We always let unicast messages pass the
latch at FLUSH.down(), since virtual synchrony is only relevant to multicasts.


FLUSH on view change
--------------------
Whenever there is a view change, GMS needs to use FLUSH to acquiesce the group before sending the
view change message and returning the new view to the joiner. Here's pseudo code for coordinator A
in group {A,B,C}:

- On reception of JOIN(D):
  - FLUSH(A,B,C)
  - wait until the flush returns; we flush A, B and C, but obviously not D
  - compute new view V2={A,B,C,D}
  - multicast V2 to {A,B,C}, wait for all acks
  - send JOIN_RSP with V2 to D, wait for ack from D
  - RESUME sending messages

This algorithm also works for a number of JOIN and LEAVE requests (view bundling), as sketched in
the code after this list:

- On handleJoinAndLeaveRequests(M, S, L, J), where M is the current membership, S the set of
  suspected members, L the set of leaving members and J the set of joining members:
  - FLUSH(M)
  - compute new view V2 = M - S - L + J
  - multicast V2 to M - S, wait for all acks
  - send LEAVE_RSP to L, wait for all acks
  - send JOIN_RSP to J, wait for all acks
  - resume sending messages
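
The view computation itself is simple set arithmetic. Below is a minimal, self-contained sketch;
ViewBundlingSketch and the String member names are hypothetical, and FLUSH(M), the acks and the
response messages are only indicated as comments.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the coordinator's view bundling computation; member names are
    // plain Strings here, and the messaging steps are elided.
    public class ViewBundlingSketch {

        // V2 = M - S - L + J, exactly as in the pseudo code above.
        static List<String> computeNewView(List<String> m, List<String> s,
                                           List<String> l, List<String> j) {
            List<String> v2 = new ArrayList<>(m);
            v2.removeAll(s); // drop suspected members
            v2.removeAll(l); // drop leaving members
            v2.addAll(j);    // append joining members
            return v2;
        }

        public static void main(String[] args) {
            List<String> m = List.of("A", "B", "C", "D"); // current membership
            List<String> s = List.of("D");                // D is suspected
            List<String> l = List.of("B");                // B is leaving
            List<String> j = List.of("E");                // E is joining
            // 1. FLUSH(M) would acquiesce the current membership here
            List<String> v2 = computeNewView(m, s, l, j);
            System.out.println(v2);                       // prints [A, C, E]
            // 2. multicast V2 to M - S, send LEAVE_RSP to L and JOIN_RSP to J,
            //    then resume sending messages
        }
    }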


FLUSH with join & state transfer
--------------------------------
http://jira.jboss.com/jira/browse/JGRP-236 (DONE)

Needs description (TODO)

Description below is outdated - needs revisiting (TODO)


Design change: flushing and block() / unblock() callbacks
---------------------------------------------------------
The change consists of moving the blocking of message sending in FLUSH.down() from the first phase
to the second phase:

Phase #1: Send START_FLUSH across the cluster; on reception, everyone invokes block(). We do *not*
(as previously done) block messages in FLUSH.down(). Method block() might send some messages, but
when block() returns (or Channel.blockOk() is called), FLUSH_OK will be broadcast across the
cluster. Because FIFO is in place, we can be sure that all *multicast* (group) messages from member
P are received *before* P's FLUSH_OK message.

Phase #2: Send FLUSH_OK across the cluster. When every member has received the FLUSH_OKs from all
other members, every member blocks messages in FLUSH.down().

Phase #3: Send FLUSH_COMPLETED to the initiator of the flush.

Phase #4: Send STOP_FLUSH across the cluster, e.g. when the state has been transferred, or the
member has joined.


Use case: block() callback needs to complete transactions which are in the 2PC (two-phase commit)
phase

- Members {A,B,C}
- C just joined and now starts a state transfer, to acquire the state from A
- A and B have transactions in PREPARED or PREPARING state, so these need to be completed
- C initiates the flush protocol
- START_FLUSH is broadcast across the cluster
- A and B have their block() method invoked; they broadcast PREPARE and/or COMMIT messages to
  complete existing prepared or preparing transactions, and block or roll back other transactions
- A and B have to wait for the PREPARE or COMMIT acks from all members; they block until these are
  received (inside the block() callback)
- Once all acks have been received, A and B return from block(); they now broadcast FLUSH_OK
  messages across the cluster
- Once everyone has received the FLUSH_OK messages from A, B and C, we can be assured that we have
  also received all PREPARE/COMMIT/ROLLBACK messages and their acks from all members
- Now FLUSH.down() at A, B and C starts blocking the sending of messages

Issues:
- What happens if a member sends another message M *after* returning from block(), but *before*
  that member has received all FLUSH_OK messages and therefore blocks?
- This algorithm only works for multicast messages, not for unicast messages, because multicast and
  unicast messages are not ordered with respect to each other. Unicasts therefore bypass the latch
  entirely (see Notes above); a sketch of that check follows this list.
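
A minimal sketch of the multicast-only check, assuming that (as in JGroups) a null destination
denotes a multicast; the class name and the method signature are hypothetical.

    // Hypothetical sketch: unicasts always pass, multicasts block while a
    // flush is in progress.
    public class MulticastOnlyLatchSketch {
        private final Object latch = new Object();
        private boolean flushInProgress;

        public void down(String destination, Runnable sendAction) throws InterruptedException {
            if (destination != null) { // unicast: never latched, since virtual
                sendAction.run();      // synchrony only concerns multicasts
                return;
            }
            synchronized (latch) {     // multicast: subject to the flush latch
                while (flushInProgress)
                    latch.wait();
                sendAction.run();
            }
        }

        public void startFlush() {
            synchronized (latch) { flushInProgress = true; }
        }

        public void stopFlush() {
            synchronized (latch) { flushInProgress = false; latch.notifyAll(); }
        }
    }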