Separation of merges from view handling
=======================================
Author: Bela Ban
JIRA: https://jira.jboss.org/jira/browse/JGRP-1009
Goal:
-----
We don't want concurrent merges and view changes (join or leave processing). During a merge, join and leave requests
should be discarded. Likewise, while a join or leave request is being processed, merge requests should be discarded.
We already discard join and leave requests during a merge (the ViewHandler is suspended), but not the other way round.
The problem described in JGRP-1009 is spurious merges: while a join or leave request is being processed and the
new view is being disseminated, a merge that is permitted to run may lead the merge leader to detect different views
(the view arrives at different members at different times, maybe a few milliseconds apart) and initiate a merge.
This won't happen if merges are discarded during view processing / installation.
Design:
-------
There are 3 types of events we need to take into account:
- The coord receiving a JOIN/JOIN_WITH_STATE/LEAVE/SUSPECT event (anything which leads to a new view being installed)
- The coord receiving a MERGE event (e.g. from MERGE2 somewhere below in the stack)
- The coord receiving a MERGE-REQ request (from a coord in a different partition)
On reception of a JOIN/JOIN_WITH_STATE/LEAVE/SUSPECT event
----------------------------------------------------------
- If the ViewHandler is suspended --> discard the event
- Else, add the event
- When starting to process the event(s) in the queue:
  - Suspend the ViewHandler
  - Start the Resumer task (which resumes the ViewHandler after N seconds)
  - Resume the ViewHandler when done processing
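The steps above can be sketched as follows. This is a minimal illustration, not the JGroups ViewHandler: the class name ViewHandlerSketch, its methods and the use of String for requests are assumptions made for this example, and the Resumer is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; names and signatures are assumptions, not the JGroups API.
final class ViewHandlerSketch {
    private final Queue<String> queue = new ArrayDeque<>();
    private boolean suspended;

    /** Queues a JOIN/LEAVE/SUSPECT request; returns false if it was discarded. */
    synchronized boolean add(String request) {
        if (suspended)
            return false;              // suspended --> discard the event
        queue.add(request);            // else, add the event
        return true;
    }

    /** Suspends the handler and hands out all queued requests as one batch. */
    synchronized List<String> startProcessing() {
        suspended = true;              // a fuller version would also start the Resumer here
        List<String> batch = new ArrayList<>(queue);
        queue.clear();
        return batch;
    }

    /** Called when the batch has been processed. */
    synchronized void resume() {
        suspended = false;
    }

    synchronized boolean isSuspended() {
        return suspended;
    }
}
```

In this sketch, a request that arrives between startProcessing() and resume() is dropped rather than queued, which matches the "discard" rule above.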
On reception of a MERGE event
-----------------------------
- If the ViewHandler is suspended --> discard the event
- Else:
  - If there are JOIN/LEAVE/etc events in the queue: discard the event and start the processing of the queued events
  - Else:
    - Process the MERGE event
    - Suspend the ViewHandler
    - Start the Resumer task (which resumes the ViewHandler after N seconds)
    - Resume the ViewHandler when done processing
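The decision for a MERGE event depends only on whether the ViewHandler is suspended and whether requests are queued. A small sketch of that decision, with hypothetical names (MergeEventSketch, Decision) that do not come from JGroups:

```java
// Hypothetical sketch of the MERGE event decision; not the JGroups implementation.
final class MergeEventSketch {
    enum Decision {
        DISCARD_SUSPENDED,       // ViewHandler suspended: drop the MERGE event
        DISCARD_PROCESS_QUEUE,   // JOIN/LEAVE/etc queued: drop the MERGE event, process the queue
        START_MERGE              // nothing pending: suspend the ViewHandler and run the merge
    }

    static Decision onMergeEvent(boolean suspended, int queuedRequests) {
        if (suspended)
            return Decision.DISCARD_SUSPENDED;
        if (queuedRequests > 0)
            return Decision.DISCARD_PROCESS_QUEUE;
        return Decision.START_MERGE;
    }
}
```

Keeping the decision a pure function of the two inputs makes the precedence explicit: queued view changes win over a merge.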
On reception of a MERGE-REQ
---------------------------
- If the ViewHandler is suspended --> reject the MERGE-REQ (send MERGE-RSP with merge_rejected=true)
- Else:
  - Suspend the ViewHandler
  - Start the Resumer task
  - When the merge is done --> resume the ViewHandler
  - On Resumer timeout: resume the ViewHandler
    (this could happen for instance when a remote coord starts a merge, then crashes before merge completion)
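A sketch of the MERGE-REQ handling, assuming a simplified response type with a single mergeRejected flag. MergeReqSketch and MergeRsp are illustrative names only; the real JGroups messages carry more state, and the Resumer is omitted here.

```java
// Hypothetical sketch of MERGE-REQ handling; not the JGroups implementation.
final class MergeReqSketch {
    /** Simplified stand-in for a MERGE-RSP. */
    static final class MergeRsp {
        final boolean mergeRejected;

        MergeRsp(boolean mergeRejected) {
            this.mergeRejected = mergeRejected;
        }
    }

    private boolean suspended;

    synchronized MergeRsp onMergeReq() {
        if (suspended)
            return new MergeRsp(true);   // busy with a view change or another merge: reject
        suspended = true;                // a fuller version would also start the Resumer here
        return new MergeRsp(false);
    }

    /** Called when the merge is done, or by the Resumer on timeout. */
    synchronized void resume() {
        suspended = false;
    }

    synchronized boolean isSuspended() {
        return suspended;
    }
}
```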
Resuming the view handler:
--------------------------
The following 4 cases can resume the view handler:
#1 JOIN/LEAVE
-------------
- When the view has been installed by the coord, the view handler is resumed
- The view handler needs to be resumed also if the view installation fails, e.g. due to a failed flush
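One way to guarantee the resume on both paths is a try/finally around the view installation. The sketch below assumes the installation is passed in as a Runnable and signals failure (for example a failed flush) with a RuntimeException; InstallViewSketch and its methods are hypothetical names.

```java
// Hypothetical sketch: resume the view handler whether or not view installation succeeds.
final class InstallViewSketch {
    private boolean suspended;

    /** Runs the installation while suspended; returns true if it completed without error. */
    boolean process(Runnable installView) {
        suspend();
        try {
            installView.run();
            return true;
        } catch (RuntimeException e) {
            return false;              // e.g. a failed flush
        } finally {
            resume();                  // runs on success and on failure
        }
    }

    private synchronized void suspend() {
        suspended = true;
    }

    private synchronized void resume() {
        suspended = false;
    }

    synchronized boolean isSuspended() {
        return suspended;
    }
}
```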
#2 MERGE
--------
- On completion of the merge (successful or failed), the view handler is resumed
#3 MERGE-REQ
------------
- Resume the view handler when getting a MergeView
- Special case: if the merge leader crashes before merge completion:
  - On a MERGE-REQ, record the merge leader's address (when suspending the view handler)
  - When the merge completes, null the merge leader's address again
  - When we get a view excluding the merge leader, and the leader's address is non-null, resume the
    view handler and null the merge leader's address
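The bookkeeping for the crashed-leader case might look like the sketch below. Addresses are plain strings and MergeLeaderTracker is a hypothetical name; both are assumptions made to keep the example self-contained.

```java
import java.util.Collection;

// Hypothetical sketch of the merge leader bookkeeping; not the JGroups implementation.
final class MergeLeaderTracker {
    private boolean suspended;
    private String mergeLeader;        // non-null only while a remote merge is in progress

    /** MERGE-REQ accepted: suspend and remember who leads the merge. */
    synchronized void onMergeReq(String leader) {
        suspended = true;
        mergeLeader = leader;
    }

    /** MergeView received: the merge completed normally. */
    synchronized void onMergeView() {
        mergeLeader = null;
        suspended = false;
    }

    /** Regular view received: resume if the recorded merge leader is no longer a member. */
    synchronized void onView(Collection<String> members) {
        if (mergeLeader != null && !members.contains(mergeLeader)) {
            mergeLeader = null;
            suspended = false;
        }
    }

    synchronized boolean isSuspended() {
        return suspended;
    }

    synchronized String mergeLeader() {
        return mergeLeader;
    }
}
```

A view that still contains the merge leader leaves the handler suspended; only the leader's exclusion (or the Resumer) releases it.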
#4 The Resumer kicks in
-----------------------
- The Resumer is started whenever the view handler is suspended
- It resumes the view handler when run
- When the view handler is resumed regularly, the Resumer is stopped
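A Resumer along these lines could be built on a ScheduledExecutorService, as sketched below. ResumerSketch, the millisecond timeout parameter and the generation counter are assumptions of this example; the counter keeps a stale timeout from resuming a later suspension.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a Resumer; not the JGroups implementation.
final class ResumerSketch {
    private final ScheduledExecutorService timer;
    private final long timeoutMs;
    private boolean suspended;
    private long generation;               // identifies the current suspension
    private ScheduledFuture<?> resumer;

    ResumerSketch(ScheduledExecutorService timer, long timeoutMs) {
        this.timer = timer;
        this.timeoutMs = timeoutMs;
    }

    /** Suspends and starts the Resumer; returns false if already suspended. */
    synchronized boolean suspend() {
        if (suspended)
            return false;
        suspended = true;
        final long id = ++generation;
        resumer = timer.schedule(() -> onTimeout(id), timeoutMs, TimeUnit.MILLISECONDS);
        return true;
    }

    /** Regular resume: also stops the Resumer. */
    synchronized void resume() {
        suspended = false;
        if (resumer != null) {
            resumer.cancel(false);
            resumer = null;
        }
    }

    /** Resumer body: resumes only the suspension it was started for. */
    private synchronized void onTimeout(long id) {
        if (suspended && id == generation)
            resume();
    }

    synchronized boolean isSuspended() {
        return suspended;
    }
}
```

A caller would typically pass in a shared timer, for example one created with Executors.newSingleThreadScheduledExecutor(), and shut it down when the channel closes.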
Issues:
-------
- What if the client sends a JOIN_WITH_STATE, the coord processes the JOIN, but suspends the queue after it and before
processing the GET_STATE ?