Probabilistic Broadcast for JGroups
		===================================


JavaGroups currently uses virtual synchrony (VS) in its main protocol
suite. VS is suited for tightly coupled, lockstep replication. Typical
examples are clusters, replicated databases etc. Group size is 100
max, and it is targeted to LANs rather than WANs.

The problem with VS is that is has to enforce that all members have
received all messages in a view before proceeding to the next
view. This is done by a FLUSH protocol, which ensures (by
retransmission) that each member has seen all messages in the current
view. During the FLUSH protocol, all members are essentially
blocked. Messages can be sent, but they will be sent only when the
FLUSH protocol has terminated (in one of the subsequent view, not in
the current one). The FLUSH protocol itself may need to be restarted,
e.g. in the case when a participating member fails during the FLUSH.

When one node (or a link) in a VS group is slow, it will bring the
performance of the entire group down, as members proceed at the pace
of the slowest members (at least during membership
changes). (Otherwise, the likely result is just growing buffers and
retransmissions, as messages waiting to be delivered are buffered).

The bimodel multicast (or probabilistic broadcast) protocols (PBCAST)
developed at Cornell try to solve this problem by providing
probabilistic reliability guarantees rather than hard ones. In a
nutshell, the probability of a very small number of members receiving
a message is high and the probability of all members receiving it is
high as well. The probability of some members receiving a message is
very small, because the 'epidemic' nature of PBCAST infects the group
exponentially, making sure every member receives a message, or none.

PBCAST protocols therefore scale very well, both in terms of group
member size as well as over WANs with intermittent link/node
failures. By implementing a PBCAST protocol, JavaGroups can now be
used in WAN settings. However, there are no hard reliability
guarantees anymore, just probabilitic ones. Yes there are a number of
applications, which don't need hard reliability, and can live with
probabilistic guarantees, for example replicated naming services and
publish-subscribe applications. In these settings, eventual
convergence of replicated state and low-cost of the protocol is more
important than lock-step replication.

The JavaGroups API will not be changed at all. However, applications
with a protocol stack configured to use PBCAST have to be aware that
views are only an approximation of the membership, not a hard
guarantee.

The PBCAST protocol is located in the ./pbcast subdirectory of
./Protocols. The major changes are:


GMS
---
Unlike VS, the JavaGroups implementation of PBCAST does not per se
guarantee that the set of messages delivered in a view V is the same
at all members. Therefore, applications cannot rely on the fact that
when they send a message in view V, it will be received by all current
non-faulty members in V.

Views are delivered at each receiver at a certain position in the
incoming message stream. However, as PBCAST only provides FIFO (which
guarantees that messages from sender P are seen in the order sent by
P), it is possible that messages sent by senders P and Q in view V1
can be received in different views at each receiver. However, it is
possible to add total order by implementing a TOTAL protocol and
adding it on top of a given protocol stack. This would then
essentially provide VS.

Consider the following example: P send messages m1 and m2 in view V1
(consisting of P, Q and R). While it sends the messages, a new member
S joins the group. Since there is no FLUSH protocol that ensures that
m1 and m2 are delivered in V1, the following could happen: m1 is
delivered to Q and R in V1. Message m2 is delivered to Q, but is lost
to R (e.g. dropped by a lossy link). Now, the new view V2 is installed
by Q (which is the coordinator). Now, m2 is retransmitted by P to
R. Clearly, VS would drop m2 because it was sent in a previous
view. However, PBCAST faces two choices: either accept the message and
deliver it or drop it as well. If we accept it, the FIFO properties
for P are upheld, if we drop it, the next message m3 from P will not
be delivered until m2 was seen by R. (Message IDs are not reset to 0
because we have no total order over views beeing delivered at each
member at the same location in the message stream, as shown
above). Therefore, we have to accept the message.

This leads to the conclusion that views are not used as a demarcation
between message sets, but rather as indication that the group
membership has changed. Therefore, protocols in the PBCAST suite will
only use views to update their internal membership list, but never
make the assumption that all members will see the view change at the
same logical location in their message streams.


FLUSH
-----
Not used anymore, as we're not flushing messages when proceeding to
the next view.


NAKACK
------
Not used anymore. Functionality will be covered by PBCAST. NAKACK made
assumptions about views and messages and can therefore not be used.


VIEW_ENFORCER
-------------
Not used anymore. Messages sent in one view can be delivered in
another one, although this usually doesn't happen. But we cannot make
any assumptions about it.


STATE_TRANSFER
--------------
Not used anymore. New protocol for state transfer, especially geared
towards big states (transfer in multiple transfers). However,
STATE_TRANSFER could still be used (a TOTAL protocol has to be
present).


QUEUE
-----
May be used by the new state transfer protocol


STABLE
------
Not used anymore. Functionality will be covered by PBCAST protocol.


Refs
----
[1] http://www.cs.cornell.edu/Info/Projects/Spinglass/index.html