<?xml version="1.0" encoding="UTF-8"?>
<chapter id="user-advanced">
<title>Advanced Concepts</title>
<para>This chapter discusses some of the more advanced concepts of JGroups
with respect to using it and setting it up correctly.</para>
<section>
<title>Using multiple channels</title>
<para>When using a fully virtual synchronous protocol stack, the
performance may not be great because of the larger number of protocols
present. For certain applications, however, throughput is more important
than ordering, e.g. for video/audio streams or airplane tracking. In the
latter case, it is important that airplanes are handed over between
control domains correctly, but if there are a (small) number of radar
tracking messages (which determine the exact location of the plane)
missing, it is not a problem. The first type of messages do not occur very
often (typically a number of messages per hour), whereas the second type
of messages would be sent at a rate of 10-30 messages/second. The same
applies for a distributed whiteboard: messages that represent a video or
audio stream have to be delivered as quick as possible, whereas messages
that represent figures drawn on the whiteboard, or new participants
joining the whiteboard have to be delivered according to a certain
order.</para>
<para>The requirements for such applications can be met by using two
separate stacks: one for control messages such as group membership, floor
control etc., and the other for data messages such as video/audio
streams (actually one might consider using one channel for audio and one
for video). The control channel might use virtual synchrony, which is
relatively slow, but enforces ordering and retransmission, and the data
channel might use a simple UDP channel, possibly including a fragmentation
layer, but no retransmission layer (losing packets is preferred to costly
retransmission).</para>
<para>The <classname>Draw2Channels</classname> demo program (in the
<classname>org.jgroups.demos</classname> package) demonstrates how to use
two different channels.</para>
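<para>
A minimal sketch of such a two-channel setup might look as follows (the configuration
file names and cluster names below are placeholders, not part of JGroups):
<screen>
// control channel: virtually synchronous stack with ordering and retransmission
JChannel controlChannel=new JChannel("vsync-stack.xml");
// data channel: plain UDP stack with fragmentation but no retransmission
JChannel dataChannel=new JChannel("udp-lossy-stack.xml");
controlChannel.connect("whiteboard-control");
dataChannel.connect("whiteboard-data");
</screen>
</para>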
</section>
<section>
<title id="SharedTransport">The shared transport: sharing a transport between multiple channels in a JVM</title>
<para>
To save resources (threads, sockets and CPU cycles), transports of channels residing within the same
JVM can be shared. If we have 4 channels inside a JVM (as is the case in an application server
such as JBoss), then we have 4 separate transports (1 per channel), each with its own
thread pools and sockets.
</para>
<para>
If those transports happen to be the same (all 4 channels use UDP, for example), then we can share them and
only create 1 instance of UDP. That transport instance is created and started only once, when the first
channel is created, and is deleted when the last channel is closed.
</para>
<para>
Each channel created over a shared transport has to join a different cluster. An exception will be thrown
if a channel sharing a transport tries to connect to a cluster to which another channel over the same
transport is already connected.
</para>
<para>
When we have 3 channels (C1 connected to "cluster-1", C2 connected to "cluster-2" and C3 connected to
"cluster-3") sending messages over the same shared transport, the cluster name
with which the channel connected is used to multiplex messages over the shared transport: a header with
the cluster name ("cluster-1") is added when C1 sends a message.
</para>
<para>
When a message with a "cluster-1" header is received by the shared transport, the header is used to
demultiplex the message and dispatch it to the right channel (C1 in this example) for processing.
</para>
<para>
How channels can share a single transport is shown in <xref linkend="SharedTransportFig"/>.
</para>
<figure id="SharedTransportFig"><title>A shared transport</title>
<graphic fileref="images/SharedTransport.png" format="PNG" align="center" />
</figure>
<para>
Here we see 4 channels which share 2 transports. Note that the first 3 channels which share transport
"tp_one" have the same protocols on top of the shared transport. This is <emphasis>not</emphasis>
required; the protocols above "tp_one" could be different for each of the 3 channels as long
as all applications residing on the same shared transport have the same requirements for the transport's
configuration.
</para>
<para>
To use shared transports, all we need to do is add a "singleton_name" property to the transport
configuration. All channels whose transports carry the same singleton name will share one transport
instance, as sketched below.
</para>
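<para>
For example, the following transport configuration (a sketch; only the relevant attributes
are shown, and the name "shared-udp" is arbitrary) causes all channels created from it to
share one UDP transport instance:
<screen>
<UDP singleton_name="shared-udp"
     mcast_addr="228.10.10.10"
     mcast_port="45588"/>
</screen>
Channels created from configurations with a different (or no) singleton name get their own
transport instance.
</para>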
</section>
<section>
<title>Transport protocols</title>
<para>A <emphasis>transport protocol</emphasis> refers to the protocol at
the bottom of the protocol stack which is responsible for sending and
receiving messages to/from the network. There are a number of transport
protocols in JGroups. They are discussed in the following sections.</para>
<para>A typical protocol stack configuration using UDP is:</para>
<screen>
<config>
<UDP
mcast_addr="${jgroups.udp.mcast_addr:228.10.10.10}"
mcast_port="${jgroups.udp.mcast_port:45588}"
discard_incompatible_packets="true"
max_bundle_size="60000"
max_bundle_timeout="30"
ip_ttl="${jgroups.udp.ip_ttl:2}"
enable_bundling="true"
thread_pool.enabled="true"
thread_pool.min_threads="1"
thread_pool.max_threads="25"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="false"
thread_pool.queue_max_size="100"
thread_pool.rejection_policy="Run"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="1"
oob_thread_pool.max_threads="8"
oob_thread_pool.keep_alive_time="5000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="Run"/>
<PING timeout="2000"
num_initial_members="3"/>
<MERGE2 max_interval="30000"
min_interval="10000"/>
<FD_SOCK/>
<FD timeout="10000" max_tries="5" shun="true"/>
<VERIFY_SUSPECT timeout="1500" />
<pbcast.NAKACK
use_mcast_xmit="false" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="400000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
shun="false"
view_bundling="true"/>
<FC max_credits="20000000"
min_threshold="0.10"/>
<FRAG2 frag_size="60000" />
<pbcast.STATE_TRANSFER />
</config>
</screen>
<para>In a nutshell the properties of the protocols are:</para>
<variablelist>
<varlistentry>
<term>UDP</term>
<listitem>
<para>This is the transport protocol. It uses IP multicasting to send messages to the entire cluster, or
individual nodes. Other transports include TCP, TCP_NIO and TUNNEL.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>PING</term>
<listitem>
<para>Uses IP multicast (by default) to find initial members. Once
found, the current coordinator can be determined and a unicast JOIN
request will be sent to it in order to join the cluster.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>MERGE2</term>
<listitem>
<para>Merges subgroups back into one group; kicks in after a cluster partition heals.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>FD_SOCK</term>
<listitem>
<para>Failure detection based on sockets (in a ring form between
members). Generates a notification if a member fails.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>FD</term>
<listitem>
<para>Failure detection based on heartbeat and are-you-alive messages (in a ring form between
members). Generates a notification if a member fails.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>VERIFY_SUSPECT</term>
<listitem>
<para>Double-checks whether a suspected member is really dead;
otherwise the suspicion generated by the protocol below is discarded.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>pbcast.NAKACK</term>
<listitem>
<para>Ensures (a) message reliability and (b) FIFO. Message
reliability guarantees that a message will be received. If not,
the receiver(s) will request retransmission. FIFO guarantees that all
messages from sender P will be received in the order P sent them.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>UNICAST</term>
<listitem>
<para>Same as NAKACK for unicast messages: messages from sender P
will not be lost (retransmission if necessary) and will be received in FIFO
order (conceptually the same as TCP in TCP/IP).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>pbcast.STABLE</term>
<listitem>
<para>Deletes messages that have been seen by all members (distributed message garbage collection).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>pbcast.GMS</term>
<listitem>
<para>Membership protocol. Responsible for joining/leaving members and installing new views.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>FRAG2</term>
<listitem>
<para>Fragments large messages into smaller ones and reassembles
them at the receiver side. Works for both multicast and unicast messages.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>STATE_TRANSFER</term>
<listitem>
<para>
Ensures that state is correctly transferred from an existing member (usually the coordinator) to a
new member.
</para>
</listitem>
</varlistentry>
</variablelist>
<section>
<title>UDP</title>
<para>UDP uses IP multicast for sending messages to all members of a
group and UDP datagrams for unicast messages (sent to a single member).
When started, it opens a unicast and multicast socket: the unicast
socket is used to send/receive unicast messages, whereas the multicast
socket sends/receives multicast messages. The channel's address will be
the address and port number of the <emphasis>unicast</emphasis>
socket.</para>
<section>
<title>Using UDP and plain IP multicasting</title>
<para>A protocol stack with UDP as transport protocol is typically
used with groups whose members run on the same host or are distributed
across a LAN. Before running such a stack, a programmer has to ensure
that IP multicast is enabled across subnets; it often is not. Refer to
<xref linkend="ItDoesntWork" /> for a test program that determines
whether members can reach each other via IP multicast. If this does
not work, the protocol stack cannot use UDP with IP multicast as
transport. In this case, the stack has to either use UDP without IP
multicasting or another transport such as TCP.</para>
</section>
<section id="IpNoMulticast">
<title>Using UDP without IP multicasting</title>
<para>The protocol stack with UDP and PING as the bottom protocols uses
IP multicasting by default to send messages to all members (UDP) and
for discovery of the initial members (PING). However, if multicasting
cannot be used, the UDP and PING protocols can be configured to send
multiple unicast messages instead of one multicast message <footnote>
<para>Although not as efficient (and using more bandwidth), it is
sometimes the only possibility to reach group members.</para>
</footnote> (UDP) and to access a well-known server (
<emphasis>GossipRouter</emphasis> ) for initial membership information
(PING).</para>
<para>To configure UDP to use multiple unicast messages to send a
group message instead of using IP multicasting, the
<parameter>ip_mcast</parameter> property has to be set to
<literal>false</literal> .</para>
<para>To configure PING to access a GossipRouter instead of using IP
multicast the following properties have to be set:</para>
<variablelist>
<varlistentry>
<term>gossip_host</term>
<listitem>
<para>The name of the host on which GossipRouter is
started</para>
</listitem>
</varlistentry>
<varlistentry>
<term>gossip_port</term>
<listitem>
<para>The port on which GossipRouter is listening</para>
</listitem>
</varlistentry>
<varlistentry>
<term>gossip_refresh</term>
<listitem>
<para>The number of milliseconds between refreshes of our
address entry with the GossipRouter</para>
</listitem>
</varlistentry>
</variablelist>
<para>Before any members are started the GossipRouter has to be
started, e.g.</para>
<screen>
java org.jgroups.stack.GossipRouter -port 5555 -bindaddress localhost
</screen>
<para>This starts the GossipRouter on the local host on port 5555. The
GossipRouter is essentially a lookup service for groups and members.
It is a process that runs on a well-known host and port and accepts
GET(group) and REGISTER(group, member) requests. The REGISTER request
registers a member's address and group with the GossipRouter. The GET
request retrieves all member addresses given a group name. Each member
has to periodically (every <parameter>gossip_refresh</parameter> milliseconds)
re-register its address with the GossipRouter, otherwise the entry
for that member will be removed (accommodating crashed
members).</para>
<para>The following example shows how to disable the use of IP
multicasting and use a GossipRouter instead. Only the bottom two
protocols are shown, the rest of the stack is the same as in the
previous example:
<screen>
<UDP ip_mcast="false" mcast_addr="224.0.0.35" mcast_port="45566" ip_ttl="32"
mcast_send_buf_size="150000" mcast_recv_buf_size="80000"/>
<PING gossip_host="localhost" gossip_port="5555" gossip_refresh="15000"
timeout="2000" num_initial_members="3"/>
</screen>
</para>
<para>The property <parameter>ip_mcast</parameter> is set to
<literal>false</literal> in <classname>UDP</classname> and the gossip
properties in <classname>PING</classname> define the GossipRouter to
be on the local host at port 5555 with a refresh rate of 15 seconds.
If PING is parameterized with the GossipRouter's address
<emphasis>and</emphasis> port, then gossiping is enabled; if only one
(or neither) of the two parameters is given, gossiping is
<emphasis>disabled</emphasis>.</para>
<para>Make sure to run the GossipRouter before starting any members,
otherwise the members will not find each other and each member will
form its own group <footnote>
<para>This can actually be used to test the MERGE2 protocol: start
two members (forming two singleton groups because they don't find
each other), then start the GossipRouter. After some time, the two
members will merge into one group</para>
</footnote> .</para>
</section>
</section>
<section>
<title>TCP</title>
<para>TCP is a replacement for UDP as the bottom layer in cases where IP
multicast based on UDP is not desired. This may be the case when
operating over a WAN, where routers may discard IP multicast packets. As a rule of
thumb, UDP is used as the transport for LANs, whereas TCP is used for
WANs.</para>
<para>The properties for a typical stack based on TCP might look like
this (edited/protocols removed for brevity):
<screen>
<TCP start_port="7800" />
<TCPPING timeout="3000"
initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"
port_range="1"
num_initial_members="3"/>
<VERIFY_SUSPECT timeout="1500" />
<pbcast.NAKACK
use_mcast_xmit="false" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
discard_delivered_msgs="true"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="400000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
shun="true"
view_bundling="true"/>
</screen>
</para>
<variablelist>
<varlistentry>
<term>TCP</term>
<listitem>
<para>The transport protocol; uses TCP (from TCP/IP) to send
unicast and multicast messages. In the latter case, it sends
multiple unicast messages.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>TCPPING</term>
<listitem>
<para>Discovers the initial membership to determine the coordinator.
A join request is then sent to the coordinator.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>VERIFY_SUSPECT</term>
<listitem>
<para>Double-checks that a suspected member is really dead.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>pbcast.NAKACK</term>
<listitem>
<para>Reliable and FIFO message delivery</para>
</listitem>
</varlistentry>
<varlistentry>
<term>pbcast.STABLE</term>
<listitem>
<para>Distributed garbage collection of messages seen by all
members.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>pbcast.GMS</term>
<listitem>
<para>Membership services. Takes care of joining new members and removing
old ones, and emits view changes.</para>
</listitem>
</varlistentry>
</variablelist>
<para>Since TCP already offers some of the reliability guarantees that
UDP doesn't, some protocols (e.g. FRAG and UNICAST) are not needed on
top of TCP.</para>
<para>When using TCP, each message to the group is sent as multiple
unicast messages (one to each member). Because IP
multicasting cannot be used to discover the initial members, another
mechanism has to be used to find the initial membership. There are a
number of alternatives:</para>
<itemizedlist>
<listitem>
<para>PING with GossipRouter: same solution as described in <xref
linkend="IpNoMulticast" /> . The <parameter>ip_mcast</parameter>
property has to be set to <literal>false</literal> . GossipRouter
has to be started before the first member is started.</para>
</listitem>
<listitem>
<para>TCPPING: uses a list of well-known group members that it
solicits for initial membership</para>
</listitem>
<listitem>
<para>TCPGOSSIP: essentially the same as the above PING <footnote>
<para>PING and TCPGOSSIP will be merged in the future.</para>
</footnote> . The only difference is that TCPGOSSIP allows for
multiple GossipRouters instead of only one.</para>
</listitem>
<listitem>
<para>JDBC_PING: using a shared database via JDBC or DataSource.</para>
</listitem>
</itemizedlist>
<para>The next two sections illustrate the use of TCP with both TCPPING
and TCPGOSSIP.</para>
<section>
<title>Using TCP and TCPPING</title>
<para>A protocol stack using TCP and TCPPING looks like this (other
protocols omitted):
<screen>
<TCP start_port="7800" />
<TCPPING initial_hosts="HostA[7800],HostB[7800]" port_range="5"
timeout="3000" num_initial_members="3" />
</screen>
</para>
<para>The concept behind TCPPING is that no external daemon such as
GossipRouter is needed. Instead some selected group members assume the
role of well-known hosts from which initial membership information can
be retrieved. In the example <parameter>HostA</parameter> and
<parameter>HostB</parameter> are designated members that will be used
by TCPPING to look up the initial membership. The property
<parameter>start_port</parameter> in <classname>TCP</classname> means
that each member should try to assign port 7800 for itself. If this is
not possible it will try the next higher port (
<literal>7801</literal> ) and so on, until it finds an unused
port.</para>
<para><classname>TCPPING</classname> will try to contact both
<parameter>HostA</parameter> and <parameter>HostB</parameter>,
starting at port <literal>7800</literal> and ending at port
<literal>7800 + port_range</literal>, in the above example ports
<literal>7800</literal> - <literal>7805</literal>. Assuming that at
least one of <parameter>HostA</parameter> or
<parameter>HostB</parameter> is up, a response will be received. To be
absolutely sure to receive a response, all the hosts on which members
of the group will be running can be added to the configuration
string.</para>
</section>
<section>
<title>Using TCP and TCPGOSSIP</title>
<para>As mentioned before <classname>TCPGOSSIP</classname> is
essentially the same as <classname>PING</classname> with properties
<parameter>gossip_host</parameter> ,
<parameter>gossip_port</parameter> and
<parameter>gossip_refresh</parameter> set. However, in TCPGOSSIP these
properties are called differently as shown below (only the bottom two
protocols are shown):
<screen>
<TCP />
<TCPGOSSIP initial_hosts="localhost[5555],localhost[5556]" gossip_refresh_rate="10000"
num_initial_members="3" />
</screen>
</para>
<para>The <parameter>initial_hosts</parameter> property combines
both the host and the port of a GossipRouter, and it is possible to
specify more than one GossipRouter. In the example there are two
GossipRouters at ports <literal>5555</literal> and
<literal>5556</literal> on the local host. Also,
<parameter>gossip_refresh_rate</parameter> defines how many
milliseconds to wait between refreshing the entry with the
GossipRouters.</para>
<para>The advantage of having multiple GossipRouters is that, as long
as at least one is running, new members will always be able to
retrieve the initial membership. Note that the GossipRouter should be
started before any of the members.</para>
</section>
</section>
<section>
<title>TUNNEL</title>
<section>
<title>Using TUNNEL to tunnel a firewall</title>
<para>Firewalls are usually placed at the connection to the internet.
They shield local networks from outside attacks by screening incoming
traffic and rejecting connection attempts by outside machines to hosts
inside the firewall. Most firewall systems allow hosts inside the
firewall to connect to hosts outside it (outgoing traffic); however,
incoming traffic is most often disabled entirely.</para>
<para><emphasis>Tunnels</emphasis> are protocols which
encapsulate other protocols by multiplexing them at one end and
demultiplexing them at the other end. Any protocol can be tunneled by
a tunnel protocol.</para>
<para>The most restrictive setups of firewalls usually disable
<emphasis>all</emphasis> incoming traffic, and only enable a few
selected ports for outgoing traffic. In the solution below, it is
assumed that one TCP port is enabled for outgoing connections to the GossipRouter.</para>
<para>JGroups has a mechanism that allows a programmer to tunnel a
firewall. The solution involves a GossipRouter, which has to be outside of the firewall,
so other members (possibly also behind firewalls) can access it.</para>
<para>The solution works as follows. A channel inside a firewall has
to use the TUNNEL protocol instead of UDP or TCP as the bottommost layer. The
recommended discovery protocol is PING; starting with the 2.8 release, you do
not have to specify any GossipRouters in PING.
<screen>
<TUNNEL gossip_router_hosts="127.0.0.1[12001]" />
<PING />
</screen>
</para>
<para><classname>TCPGOSSIP</classname> uses the GossipRouter (outside
the firewall) at port <literal>12001</literal> to register its address
(periodically) and to retrieve the initial membership for its
group. It is not recommended to use TCPGOSSIP for discovery if TUNNEL is
already used. TCPGOSSIP might be used in rare scenarios where registration and
initial member discovery <emphasis>have to be done</emphasis> through a gossip
router independent of the transport protocol being used. Starting with the 2.8 release,
TCPGOSSIP accepts one or multiple router hosts as a comma-delimited list
of host[port] elements specified in the initial_hosts property.</para>
<para><classname>TUNNEL</classname> establishes a TCP connection to the
<emphasis>GossipRouter</emphasis> process (also outside the firewall) that
accepts messages from members and passes them on to other members.
This connection is initiated by the host inside the firewall and
persists as long as the channel is connected to a group. GossipRouter will
use the <emphasis>same connection</emphasis> to send incoming messages
to the channel that initiated the connection. This is perfectly legal,
as TCP connections are full duplex. Note that, if GossipRouter tried to
establish its own TCP connection to the channel behind the firewall,
it would fail. But it is okay to reuse the existing TCP connection,
established by the channel.</para>
<para>Note that <classname>TUNNEL</classname> has to be given the
hostname and port of the GossipRouter process. This example assumes a GossipRouter
is running on the local host at port <literal>12001</literal>. Both
TUNNEL and TCPGOSSIP (or PING) access the same GossipRouter.
Starting with the 2.8 release, the TUNNEL transport accepts one or multiple router
hosts as a comma-delimited list of host[port] elements specified in the
gossip_router_hosts property.</para>
<para>Any time a message has to be sent, TUNNEL forwards the message
to GossipRouter, which distributes it to its destination: if the message's
destination field is null (send to all group members), then GossipRouter
looks up the members that belong to that group and forwards the
message to all of them via the TCP connection they established when
connecting to GossipRouter. If the destination is a valid member address,
then that member's TCP connection is looked up, and the message is
forwarded to it <footnote>
<para>To do so, GossipRouter has to maintain a table between groups,
member addresses and TCP connections.</para>
</footnote> .</para>
<para>
Starting with the 2.8 release, the GossipRouter is no longer a single
point of failure. In a setup with multiple GossipRouters, the routers do
not communicate among themselves; instead, a single point of failure is avoided
by having each channel simply connect to multiple available routers. In
case one or more routers go down, cluster members are still able to
exchange messages through the remaining available router instances, if there
are any.
For each send invocation, a channel goes through the list of available
connections to routers and attempts to send the message on each connection
until it succeeds. If the message could not be sent on any of the
connections, an exception is raised. The default policy for connection
selection is random; however, a plug-in interface is provided for
other policies.
The GossipRouter configuration is static and is not updated for the
lifetime of the channel; the list of available routers has to be provided
in the channel's configuration file.</para>
<para>To tunnel a firewall using JGroups, the following steps have to
be taken:</para>
<orderedlist>
<listitem>
<para>Check that a TCP port (e.g. 12001) is enabled in
the firewall for outgoing traffic</para>
</listitem>
<listitem>
<para>Start the GossipRouter:
<screen>
java org.jgroups.stack.GossipRouter -port 12001
</screen>
</para>
</listitem>
<listitem>
<para>Configure the TUNNEL protocol layer as instructed
above.</para>
</listitem>
<listitem>
<para>Create a channel</para>
</listitem>
</orderedlist>
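<para>
Steps 3 and 4 might look like this in code (a sketch; the configuration file name is a
placeholder for a stack whose bottom protocols are TUNNEL and PING, as shown above):
<screen>
JChannel ch=new JChannel("tunnel.xml"); // stack with TUNNEL and PING at the bottom
ch.connect("demo-group");               // registers with the GossipRouter via TUNNEL
</screen>
</para>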
<para>The general setup is shown in <xref linkend="TunnelingFig" />
.</para>
<figure id="TunnelingFig">
<title>Tunneling a firewall</title>
<mediaobject>
<imageobject>
<imagedata align="center" fileref="images/Tunneling.png" />
</imageobject>
<textobject>
<phrase>A diagram representing tunneling a firewall.</phrase>
</textobject>
</mediaobject>
</figure>
<para>First, the GossipRouter process is created on host
B. Note that host B should be outside the firewall, and all channels in
the same group should use the same GossipRouter process.
When a channel on host A is created, its
<classname>TCPGOSSIP</classname> protocol will register its address
with the GossipRouter and retrieve the initial membership (assume this
is C). Now, a TCP connection with the GossipRouter is established by A; this
will persist until A crashes or voluntarily leaves the group. When A
multicasts a message to the group, GossipRouter looks up all group members
(in this case, A and C) and forwards the message to all members, using
their TCP connections. In the example, A would receive its own copy of
the multicast message it sent, and another copy would be sent to
C.</para>
<para>This scheme allows, for example, <emphasis>Java applets</emphasis>,
which are only allowed to connect back to the host from which they
were downloaded, to use JGroups: the HTTP server would be located on
host B and the GossipRouter daemon would also run on that host.
An applet downloaded to either A or C would be allowed to make a TCP
connection to B. Also, applications behind a firewall would be able to
talk to each other, joining a group.</para>
<para>However, there are several drawbacks. First, having to maintain a TCP connection for
as long as the channel is connected might use up resources in the host system
(e.g. in the GossipRouter), leading to scalability problems. Second, this
scheme is inappropriate when only a few channels are located behind
firewalls and the vast majority can indeed use IP multicast to
communicate. Finally, it is not always possible to enable outgoing
traffic on the required port(s) in a firewall, e.g. when a user does not 'own' the
firewall.</para>
</section>
</section>
</section>
<section>
<title>The concurrent stack</title>
<para>
The concurrent stack (introduced in 2.5) provides a number of improvements over previous releases,
which had some deficiencies:
<itemizedlist>
<listitem>
Large number of threads: each protocol had by default 2 threads, one for the up and one for the
down queue. They could be disabled per protocol by setting up_thread or down_thread to false.
In the new model, these threads have been removed.
</listitem>
<listitem>
Sequential delivery of messages: JGroups used to have a single queue for incoming messages,
processed by one thread. Therefore, messages from different senders were still processed in
FIFO order. In 2.5 these messages can be processed in parallel.
</listitem>
<listitem>
Out-of-band messages: when an application doesn't care about the ordering properties of a message,
the OOB flag can be set and JGroups will deliver this particular message without regard for any
ordering.
</listitem>
</itemizedlist>
</para>
<section>
<title>Overview</title>
<para>
The architecture of the concurrent stack is shown in <xref linkend="ConcurrentStackFig"/>. The changes
were made entirely inside the transport protocol (TP, with subclasses UDP, TCP and TCP_NIO). Therefore,
to configure the concurrent stack, the user has to modify the config for (e.g.) UDP in the XML file.
</para>
<para>
<figure id="ConcurrentStackFig"><title>The concurrent stack</title>
<graphic fileref="images/ConcurrentStack.png" format="PNG" align="left" />
</figure>
</para>
<para>
The concurrent stack consists of 2 thread pools (java.util.concurrent.Executor): the out-of-band (OOB)
thread pool and the regular thread pool. Packets are received by multicast or unicast receiver threads
(UDP) or a ConnectionTable (TCP, TCP_NIO). Packets marked as OOB (with Message.setFlag(Message.OOB)) are
dispatched to the OOB thread pool, and all other packets are dispatched to the regular thread pool.
</para>
<para>
When a thread pool is disabled, then we use the thread of the caller (e.g. multicast or unicast
receiver threads or the ConnectionTable) to send the message up the stack and into the application.
Otherwise, the packet will be processed by a thread from the thread pool, which sends the message up
the stack. When all current threads are busy, another thread might be created, up to the maximum number
of threads defined. Alternatively, the packet might get queued up until a thread becomes available.
</para>
<para>
The point of using a thread pool is that the receiver threads should only receive the packets and forward
them to the thread pools for processing, because unmarshalling and processing is slower than simply
receiving the message and can benefit from parallelization.
</para>
<section>
<title>Configuration</title>
<para>Note that this is preliminary and names or properties might change</para>
<para>
We are thinking of exposing the thread pools programmatically, meaning that a developer might be able to set both
thread pools programmatically, e.g. using something like TP.setOOBThreadPool(Executor executor).
</para>
<para>
Here's an example of the new configuration:
<screen>
<![CDATA[
<UDP
mcast_addr="228.10.10.10"
mcast_port="45588"
thread_pool.enabled="true"
thread_pool.min_threads="1"
thread_pool.max_threads="100"
thread_pool.keep_alive_time="20000"
thread_pool.queue_enabled="false"
thread_pool.queue_max_size="10"
thread_pool.rejection_policy="Run"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="1"
oob_thread_pool.max_threads="4"
oob_thread_pool.keep_alive_time="30000"
oob_thread_pool.queue_enabled="true"
oob_thread_pool.queue_max_size="10"
oob_thread_pool.rejection_policy="Run"/>
]]>
</screen>
</para>
<para>
The attributes for the 2 thread pools are prefixed with thread_pool and oob_thread_pool respectively.
</para>
<para>
The attributes are listed below. They roughly correspond to the options of a
java.util.concurrent.ThreadPoolExecutor in JDK 5.
<table>
<title>Attributes of thread pools</title>
<tgroup cols="2">
<colspec align="left" />
<thead>
<row>
<entry align="center">Name</entry>
<entry align="center">Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>enabled</entry>
<entry>Whether or not to use a thread pool. If set to false, the caller's thread
is used.</entry>
</row>
<row>
<entry>min_threads</entry>
<entry>The minimum number of threads to use.</entry>
</row>
<row>
<entry>max_threads</entry>
<entry>The maximum number of threads to use.</entry>
</row>
<row>
<entry>keep_alive_time</entry>
<entry>Number of milliseconds until an idle thread is removed from the pool</entry>
</row>
<row>
<entry>queue_enabled</entry>
<entry>Whether or not to use a (bounded) queue. If enabled, when all minimum
threads are busy, work items are added to the queue. When the queue is full,
additional threads are created, up to max_threads. When max_threads have been
reached, the rejection policy is consulted.</entry>
</row>
<row>
<entry>queue_max_size</entry>
<entry>The maximum number of elements in the queue. Ignored if the queue is
disabled</entry>
</row>
<row>
<entry>rejection_policy</entry>
<entry>Determines what happens when the thread pool (and queue, if enabled) is
full. The default is to run on the caller's thread. "Abort" throws a runtime
exception. "Discard" discards the message, "DiscardOldest" discards the
oldest entry in the queue. Note that these values might change; for example, a
"Wait" value might get added in the future.</entry>
</row>
<row>
<entry>thread_naming_pattern</entry>
<entry>Determines how the threads running from the concurrent stack's
thread pools are named. Valid values include any combination of the letters "c" and "l", where
"c" includes the cluster name and "l" includes the local address of the channel.
The default is "cl".
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
</section>
<section>
<title>Elimination of up and down threads</title>
<para>
By removing the 2 queues per protocol and the associated 2 threads, we effectively reduce the number of
threads needed to handle a message, and thus the context switching overhead. We also get clear and unambiguous
semantics for Channel.send(): now, all messages are sent down the stack on the caller's thread and
the send() call only returns once the message has been put on the network. In addition, an exception will
only be propagated back to the caller if the message has not yet been placed in a retransmit buffer.
Otherwise, JGroups simply logs the error message but keeps retransmitting the message. Therefore,
if the caller gets an exception, the message should be re-sent.
</para>
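<para>
In code, this might look as follows (a sketch; <classname>channel</classname> is assumed
to be a connected channel):
<screen>
Message msg=new Message(null, null, "hello world"); // null destination: send to all members
try {
    channel.send(msg); // returns once the message has been put on the network
}
catch(Exception ex) {
    // the message was not yet placed in a retransmit buffer; it is safe to re-send it here
}
</screen>
</para>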
<para>
On the receiving side, a message is handled by a thread pool, either the regular or OOB thread pool. Both
thread pools can be completely eliminated, so that we can save even more threads and thus further
reduce context switching. The point is that the developer is now able to control the threading behavior
almost completely.
</para>
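<para>
For example, to eliminate both pools and have the receiver threads themselves carry
messages up the stack, the transport could be configured as follows (a sketch; all
other attributes omitted):
<screen>
<UDP thread_pool.enabled="false"
     oob_thread_pool.enabled="false"/>
</screen>
</para>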
</section>
<section>
<title>Concurrent message delivery</title>
<para>
Up to version 2.5, all messages received were processed by a single thread, even if the messages were
sent by different senders. For instance, if sender A sent messages 1, 2 and 3, and B sent messages 34 and 35,
and if A's messages were all received first, then B's messages 34 and 35 could only be processed after
messages 1-3 from A had been processed!
</para>
<para>
Now, we can process messages from different senders in parallel, e.g. messages 1, 2 and 3 from A can be
processed by one thread from the thread pool and messages 34 and 35 from B can be processed on a different
thread.
</para>
<para>
As a result, we get a speedup of almost N for a cluster of N nodes if every node is sending messages and we
configure the thread pool to have at least N threads. There is actually a unit test
(ConcurrentStackTest.java) which demonstrates this.
</para>
</section>
<section id="Scopes">
<title>Scopes: concurrent message delivery for messages from the same sender</title>
<para>
In the previous paragraph, we showed how the concurrent stack delivers messages from different senders
concurrently. But all (non-OOB) messages from the same sender P are delivered in the order in which
P sent them. However, this is not good enough for certain types of applications.
</para>
<para>
Consider the case of an application which replicates HTTP sessions. If we have sessions X, Y and Z, then
updates to these sessions are delivered in the order in which they were performed, e.g. X1, X2, X3,
Y1, Z1, Z2, Z3, Y2, Y3, X4. This means that update Y1 has to wait until updates X1-3 have been delivered.
If these updates take some time, e.g. spent in lock acquisition or deserialization, then all subsequent
messages are delayed by the sum of the times taken by the messages ahead of them in the delivery order.
</para>
<para>
However, in most cases, updates to different web sessions should be completely unrelated, so they could
be delivered concurrently. For instance, a modification to session X should not have any effect on
session Y, therefore updates to X, Y and Z can be delivered concurrently.
</para>
<para>
One solution to this is out-of-band (OOB) messages (see next paragraph). However, OOB messages do not
guarantee ordering, so updates X1-3 could be delivered as X1, X3, X2. If this is not wanted, and
messages should instead be delivered concurrently between sessions but
ordered <emphasis>within</emphasis> a given session, then we can resort to <emphasis>scoped messages</emphasis>.
</para>
<para>
Scoped messages apply only to <emphasis>regular</emphasis> (non-OOB) messages, and are delivered
concurrently between scopes, but ordered within a given scope. For example, if we used the sessions above
(e.g. the jsessionid) as scopes, then the delivery could be as follows ('->' means sequential, '||' means concurrent):
<screen>
X1 -> X2 -> X3 -> X4 || Y1 -> Y2 -> Y3 || Z1 -> Z2 -> Z3
</screen>
This means that all updates to X are delivered in parallel to updates to Y and updates to Z. However, within
a given scope, updates are delivered in the order in which they were performed, so X1 is delivered before
X2, which is delivered before X3 and so on.
</para>
<para>
Taking the above example, using scoped messages, update Y1 does <emphasis>not</emphasis> have to wait for
updates X1-3 to complete, but is processed immediately.
</para>
<para>
To set the scope of a message, use method Message.setScope(short).
</para>
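<para>
For example, in the HTTP session replication scenario above, the scope could be derived
from the session ID (a sketch; <classname>sessionId</classname> and
<classname>update</classname> are application-defined, and hashing into a short is just
one way to pick scopes that are "as unique as possible"):
<screen>
Message msg=new Message(null, null, update);
short scope=(short)sessionId.hashCode(); // hash the jsessionid into a scope
msg.setScope(scope);
channel.send(msg);
</screen>
Updates with different scopes can then be delivered concurrently, while updates to the
same session remain ordered.
</para>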
<para>
Scopes are implemented in a separate protocol called <xref linkend="SCOPE"/>. This protocol
has to be placed somewhere above ordering protocols like UNICAST or NAKACK (or SEQUENCER for that matter).
</para>
<note>
<title>Uniqueness of scopes</title>
<para>
Note that scopes should be <emphasis>as unique as possible</emphasis>. Compare this to hashing: the fewer collisions
there are, the better the concurrency will be. So if, for example, two web sessions pick the same
scope, then updates to those sessions will be delivered in the order in which they were sent, and
not concurrently. While this doesn't cause erroneous behavior, it defeats the purpose of SCOPE.
</para>
<para>
Also note that, if multicast and unicast messages have the same scope, they will be delivered
in sequence. So if A multicasts messages to the group with scope 25, and A also unicasts messages
to B with scope 25, then A's multicasts and unicasts will be delivered in order at B! Again,
this is correct, but since multicasts and unicasts are unrelated, it might slow things down!
</para>
</note>
</section>
<section>
<title>Out-of-band messages</title>
<para>
OOB messages completely ignore any ordering constraints the stack might have. Any message marked as OOB
will be processed by the OOB thread pool. This is necessary in cases where we don't want the message
processing to wait until all other messages from the same sender have been processed, e.g. in the
heartbeat case: if sender P sends 5 messages and then a response to a heartbeat request received from
some other node, then the time taken to process P's 5 messages might exceed the heartbeat
timeout, so that P might get falsely suspected! However, if the heartbeat response is marked as OOB,
then it will get processed by the OOB thread pool, concurrently with the previously
sent 5 messages, and will not trigger a false suspicion.
</para>
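<para>
Marking a message as OOB is a one-liner (a sketch; <classname>channel</classname> and
<classname>requester</classname>, the node that sent the heartbeat request, are assumptions):
<screen>
Message rsp=new Message(requester, null, "i-am-alive"); // heartbeat response
rsp.setFlag(Message.OOB); // deliver via the OOB thread pool, ignoring ordering
channel.send(rsp);
</screen>
</para>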
<para>
The 2 unit tests UNICAST_OOB_Test and NAKACK_OOB_Test demonstrate how OOB messages influence the ordering,
for both unicast and multicast messages.
</para>
</section>
<section>
<title>Replacing the default and OOB thread pools</title>
<para>
In 2.7, there are 3 thread pools and 4 thread factories in TP:
<table>
<title>Thread pools and factories in TP</title>
<tgroup cols="2">
<colspec align="left" />
<thead>
<row>
<entry align="center">Name</entry>
<entry align="center">Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>Default thread pool</entry>
<entry>This is the pool for handling incoming messages. It can be fetched using
getDefaultThreadPool() and replaced using setDefaultThreadPool(). When setting a
thread pool, the old thread pool (if any) will be shut down and all of its tasks
cancelled first.
</entry>
</row>
<row>
<entry>OOB thread pool</entry>
<entry>This is the pool for handling incoming OOB messages. Methods to get and set
it are getOOBThreadPool() and setOOBThreadPool()</entry>
</row>
<row>
<entry>Timer thread pool</entry>
<entry>This is the thread pool for the timer. The max number of threads is set through
the timer.num_threads property. The timer thread pool cannot be set; it can only
be retrieved using getTimer(). However, the thread factory of the timer
can be replaced (see below)</entry>
</row>
<row>
<entry>Default thread factory</entry>
<entry>This is the thread factory (org.jgroups.util.ThreadFactory) of the default
thread pool, which handles incoming messages. A thread factory is used to
name threads and possibly make them daemons.
It can be accessed using
getDefaultThreadPoolThreadFactory() and setDefaultThreadPoolThreadFactory()</entry>
</row>
<row>
<entry>OOB thread factory</entry>
<entry>This is the thread factory for the OOB thread pool. It can be retrieved
using getOOBThreadPoolThreadFactory() and set using method
setOOBThreadPoolThreadFactory()</entry>
</row>
<row>
<entry>Timer thread factory</entry>
<entry>This is the thread factory for the timer thread pool. It can be accessed
using getTimerThreadFactory() and setTimerThreadFactory()</entry>
</row>
<row>
<entry>Global thread factory</entry>
<entry>The global thread factory can be used (e.g. by protocols) to create threads
which don't live in the transport, e.g. the FD_SOCK server socket handler thread.
Each protocol has a method getTransport(). Once the TP is obtained, getThreadFactory()
can be called to get the global thread factory. The global thread factory
can be replaced with setThreadFactory()</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
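<para>
A sketch of replacing the default thread pool follows (assuming the usual
java.util.concurrent imports and a JChannel named <classname>ch</classname>; the pool
sizes are arbitrary):
<screen>
TP transport=ch.getProtocolStack().getTransport();
// a custom pool: 4-16 threads, 30s idle timeout, bounded queue of 100 work items
ThreadPoolExecutor pool=new ThreadPoolExecutor(4, 16, 30, TimeUnit.SECONDS,
                                               new LinkedBlockingQueue<Runnable>(100));
transport.setDefaultThreadPool(pool); // shuts down the old pool and cancels its tasks
</screen>
</para>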
</section>
<section>
<title>Sharing of thread pools between channels in the same JVM</title>
<para>
In 2.7, the default and OOB thread pools can be shared between instances running inside the same JVM. The
advantage here is that multiple channels running within the same JVM can pool (and therefore save) threads.
The disadvantage is that thread naming will not show to which channel instance an incoming thread
belongs.
</para>
<para>
Note that not only can we share thread pools between JChannels within the same JVM, but we can also
share entire transports. For details see <xref linkend="SharedTransport"/>.
</para>
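<para>
A sketch of sharing one pool between two channels, using the setters described in the
previous section (<classname>c1</classname> and <classname>c2</classname> are assumed
to be existing JChannels; pool sizes are arbitrary):
<screen>
ThreadPoolExecutor shared=new ThreadPoolExecutor(4, 16, 30, TimeUnit.SECONDS,
                                                 new LinkedBlockingQueue<Runnable>(100));
c1.getProtocolStack().getTransport().setDefaultThreadPool(shared);
c2.getProtocolStack().getTransport().setDefaultThreadPool(shared);
</screen>
</para>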
</section>
</section>
<section>
<title>Misc</title>
<section>
<title>Shunning</title>
<para>
Note that in 2.8, shunning has been removed, so this section only applies to versions up to 2.7.
</para>
Let's say we have 4 members in a group: {A,B,C,D}. When a member (say D) is expelled from the group, e.g.
because it didn't respond to are-you-alive messages, and later comes back, then it is shunned. Shunning
causes a member to leave the group and re-join, if this is enabled on the Channel. To enable automatic
re-connects, the AUTO_RECONNECT option has to be set on the Channel:
<screen>
channel.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
</screen>
<para>To enable shunning, set FD.shun and GMS.shun to true.</para>
Let's look at a more detailed example. Say member D is overloaded, and doesn't respond to are-you-alive
messages (done by the failure detection (FD) protocol). It is therefore suspected and excluded. The new
view for A, B and C will be {A,B,C}, however for D the view is still {A,B,C,D}. So when D comes back and
sends messages to the group, or to any individual member, those messages will be discarded, because A,B and
C don't see D in their view. D is shunned when A,B or C receive an are-you-alive message from D, or D
shuns itself when it receives a view which doesn't include D.<para/>
So shunning is always a unilateral decision. However, things may be different if all members exclude each
other from the group. For example, say we have a switch connecting A, B, C and D. If someone pulls all
plugs on the switch, or powers the switch down, then A, B, C and D will all form singleton groups, that is,
each member thinks it's the only member in the group. When the switch goes back to normal, then each member
will shun everybody else (a real shun fest :-)). This is clearly not desirable, so in this case shunning
should be turned off:
<screen>
<FD timeout="2000" max_tries="3" shun="false"/>
...
<pbcast.GMS join_timeout="3000" shun="false"/>
</screen>
</section>
<section>
<title>Using a custom socket factory</title>
<para>
JGroups creates all of its sockets through a SocketFactory, which is located in the transport (TP) or
TP.ProtocolAdapter (in a shared transport). The factory has methods to create sockets (Socket,
ServerSocket, DatagramSocket and MulticastSocket)
<footnote>
<para>
Currently, SocketFactory does not support creation of NIO sockets / channels.
</para>
</footnote>,
close sockets, and list all open sockets. Every socket creation method has a service name, which could
be for example "jgroups.fd_sock.srv_sock". The service name is used to look up a port (e.g. in a config
file) and create the correct socket.
</para>
<para>
To provide one's own socket factory, the following has to be done: if we have a non-shared transport,
the code below creates a SocketFactory implementation and sets it in the transport:
</para>
<screen>
JChannel ch;
MySocketFactory factory=new MySocketFactory(); // e.g. extends DefaultSocketFactory
ch=new JChannel("config.xml");
ch.setSocketFactory(factory);
ch.connect("demo");
</screen>
<para>
If a shared transport is used, then we have to set multiple socket factories: one in the shared transport,
and one per channel in the corresponding TP.ProtocolAdapter:
</para>
<screen>
JChannel c1=new JChannel("config.xml"), c2=new JChannel("config.xml");
TP transport=c1.getProtocolStack().getTransport();
transport.setSocketFactory(new MySocketFactory("transport"));
c1.setSocketFactory(new MySocketFactory("first-cluster"));
c2.setSocketFactory(new MySocketFactory("second-cluster"));
c1.connect("first-cluster");
c2.connect("second-cluster");
</screen>
<para>
First, we grab one of the channels to fetch the transport and set a SocketFactory in it. Then we
set one SocketFactory per channel that resides on the shared transport. When JChannel.connect() is
called, the SocketFactory will be set in TP.ProtocolAdapter.
</para>
</section>
</section>
<section>
<title>Handling network partitions</title>
<para>
Network partitions can be caused by switch, router or network interface crashes, among other things. If we
have a cluster {A,B,C,D,E} spread across 2 subnets {A,B,C} and {D,E} and the switch to which D and E are
connected crashes, then we end up with a network partition, with subclusters {A,B,C} and {D,E}.
</para>
<para>
A, B and C can ping each other, but not D or E, and vice versa. We now have 2 coordinators, A and D. Both
subclusters operate independently; for example, if we maintain a shared state, subcluster {A,B,C} replicates
changes to A, B and C.
</para>
<para>
This means that if, during the partition, some clients access {A,B,C} and others {D,E}, then we end up
with different states in both subclusters. When a partition heals, the merge protocol (e.g. MERGE2) will
notify A and D that there were 2 subclusters and merge them back into {A,B,C,D,E}, with A being the new
coordinator and D ceasing to be coordinator.
</para>
<para>
The question is what happens to the 2 diverged substates?
</para>
<para>
There are 2 solutions to merging substates: first, we can attempt to create a new state from the 2 substates,
and second, we can shut down all members of the <emphasis>non-primary partition</emphasis>, such that they
have to re-join and possibly reacquire the state from a member in the primary partition.
</para>
<para>
In both cases, the application has to handle a MergeView (subclass of View), as shown in the code below:
<screen>
public void viewAccepted(View view) {
    if(view instanceof MergeView) {
        MergeView tmp=(MergeView)view;
        Vector<View> subgroups=tmp.getSubgroups();
        // merge state or determine primary partition,
        // and run this in a separate thread!
    }
}
</screen>
</para>
<para>
It is essential that the merge view handling code run on a separate thread if it needs more than a few
milliseconds, or else it would block the calling thread.
</para>
<para>
The MergeView contains a list of views; each view represents a subgroup and lists the members which
formed that group.
</para>
<section>
<title>Merging substates</title>
<para>
The application has to merge the substates from the various subgroups ({A,B,C} and {D,E}) back into one
single state for {A,B,C,D,E}. This task <emphasis>has</emphasis> to be done by the application because
JGroups knows nothing about the application state, other than that it is a byte buffer.
</para>
<para>
If the in-memory state is backed by a database, then the solution is easy: simply discard the in-memory
state and fetch it (eagerly or lazily) from the DB again. This of course assumes that the members of
the 2 subgroups were able to write their changes to the DB. However, this is often not the case, as
connectivity to the DB might have been severed by the network partition.
</para>
<para>
Another solution could involve tagging the state with time stamps. On merging, we could compare the
time stamps for the substates and let the substate with the more recent time stamps win.
</para>
<para>
Yet another solution could be to increment a counter for the state each time the state is modified. The
state with the highest counter wins.
</para>
<para>
Again, the merging of state can only be done by the application. Whatever algorithm is picked to merge
state, it has to be deterministic.
</para>
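<para>
As an illustration, a counter-based merge might look like the sketch below (a hypothetical example:
VersionedState and its fields are not part of JGroups, and the serialized application state is
represented as a plain byte array):
</para>
<screen>
// Hypothetical: each replica tags its state with a modification counter
static class VersionedState {
    long version;  // incremented on every modification
    byte[] data;   // the serialized application state
}

// Deterministic merge: the substate with the highest version wins. All members
// run the same code on the same input, so they all pick the same winner.
static VersionedState merge(java.util.List<VersionedState> substates) {
    VersionedState winner=substates.get(0);
    for(VersionedState s: substates)
        if(s.version > winner.version)
            winner=s;
    return winner;
}
</screen>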
</section>
<section>
<title>The primary partition approach</title>
<para>
The primary partition approach is simple: on merging, one subgroup is designated as the
<emphasis>primary partition</emphasis> and all others as non-primary partitions. The members in the primary
partition don't do anything, whereas the members in the non-primary partitions need to drop their state and
re-initialize from fresh state obtained from a member of the primary partition.
</para>
<para>
The code to find the primary partition needs to be deterministic, so that all members pick the <emphasis>
same</emphasis> primary partition. This could be for example the first view in the MergeView, or we could
sort all members of the new MergeView and pick the subgroup which contained the new coordinator (the one
from the consolidated MergeView). Another possible solution could be to pick the largest subgroup, and, if
there is a tie, sort the tied views lexicographically (all Addresses have a compareTo() method) and pick the
subgroup with the lowest ranked member.
</para>
<para>
Here's code which picks as primary partition the first view in the MergeView, then re-acquires the state from
the <emphasis>new</emphasis> coordinator of the combined view:
<screen>
public static void main(String[] args) throws Exception {
    final JChannel ch=new JChannel("/home/bela/udp.xml");
    ch.setReceiver(new ExtendedReceiverAdapter() {
        public void viewAccepted(View new_view) {
            handleView(ch, new_view);
        }
    });
    ch.connect("x");
    while(ch.isConnected())
        Util.sleep(5000);
}

private static void handleView(JChannel ch, View new_view) {
    if(new_view instanceof MergeView) {
        ViewHandler handler=new ViewHandler(ch, (MergeView)new_view);
        // requires a separate thread as we don't want to block JGroups
        handler.start();
    }
}

private static class ViewHandler extends Thread {
    JChannel ch;
    MergeView view;

    private ViewHandler(JChannel ch, MergeView view) {
        this.ch=ch;
        this.view=view;
    }

    public void run() {
        Vector<View> subgroups=view.getSubgroups();
        View tmp_view=subgroups.firstElement(); // picks the first subgroup as the primary partition
        Address local_addr=ch.getLocalAddress();
        if(!tmp_view.getMembers().contains(local_addr)) {
            System.out.println("Not member of the new primary partition ("
                               + tmp_view + "), will re-acquire the state");
            try {
                ch.getState(null, 30000);
            }
            catch(Exception ex) {
                // state transfer failed; log and handle the error here
            }
        }
        else {
            System.out.println("Member of the new primary partition ("
                               + tmp_view + "), will do nothing");
        }
    }
}
</screen>
</para>
<para>
The handleView() method is called from viewAccepted(), which is invoked whenever there is a new view. It spawns
a new thread which gets the subgroups from the MergeView and picks the first subgroup to be the primary
partition. Then, if the node was a member of the primary partition, it does nothing; if not, it reacquires
the state from the coordinator of the primary partition (A).
</para>
<para>
The downside of the primary partition approach is that work (= state changes) done in the non-primary
partition is discarded on merging. However, that's only problematic if the data was held purely in
memory; if it is backed by persistent storage, the state merging approach discussed above can be used instead.
</para>
<para>
It would be simpler to shut down the non-primary partition as soon as the network partition is detected, but
that is a non-trivial problem, as we don't know whether {D,E} simply crashed, or whether they're still alive
but were partitioned away by the crash of a switch. This is called the <emphasis>split brain syndrome</emphasis>:
none of the members has enough information to determine, by simply exchanging messages, whether it is in the
primary or the non-primary partition.
</para>
</section>
<section>
<title>The Split Brain syndrome and primary partitions</title>
<para>
In certain situations, we can avoid having multiple subgroups that each make progress independently,
and thus avoid having to discard the state of the non-primary partitions on merging.
</para>
<para>
If we have a fixed membership, e.g. the cluster always consists of 5 nodes, then we can run code on
a view reception that determines the primary partition. This code
<itemizedlist>
<listitem>assumes that the primary partition has to have at least 3 nodes</listitem>
<listitem>ensures that any subcluster with fewer than 3 nodes doesn't accept modifications. For shared
state, this could be done by simply making the {D,E} partition read-only: clients can access the
{D,E} partition and read state, but not modify it.
</listitem>
<listitem>
As an alternative, subclusters with fewer than 3 members could shut down, so in this case D and
E would leave the cluster.
</listitem>
</itemizedlist>
</para>
<para>
The algorithm is shown in pseudo code below:
<screen>
On initialization:
    - mark the node as read-only

On view change V:
    - if V has >= N members:
        - if currently read-only: fetch the state from the coordinator and switch to read-write
    - else:
        - switch to read-only
</screen>
</para>
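<para>
A sketch of this algorithm in Java is shown below (assuming a fixed cluster size of 5 and a majority
threshold of 3; the details of fetching the state and of rejecting modifications are omitted, and all
names are illustrative):
</para>
<screen>
static final int MAJORITY=3;          // majority needed to accept modifications
volatile boolean read_write=false;    // start in read-only mode

public void viewAccepted(View view) {
    if(view.size() >= MAJORITY) {
        if(!read_write) {
            // fetch the state from the coordinator, e.g. via ch.getState(null, 30000),
            // then start accepting modifications
            read_write=true;
        }
    }
    else
        read_write=false; // minority partition: only reads are allowed
}
</screen>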
<para>
Of course, the above mechanism requires that at least 3 nodes are up at any given time, so upgrades have
to be done in a staggered way, taking only one node down at a time. In the worst case, however, this
mechanism leaves the cluster read-only and notifies a system admin, who can fix the issue. This is still
better than shutting the entire cluster down.
</para>
</section>
</section>
<section>
<title>Flushing: making sure every node in the cluster received a message</title>
<para>
When sending messages, the properties of the default stacks (udp.xml, tcp.xml) are that all messages are delivered
reliably to all (non-crashed) members. However, there are no guarantees with respect to the view in which a message
will get delivered. For example, when a member A with view V1={A,B,C} multicasts message M1 to the group and D joins
at about the same time, then D may or may not receive M1, and there is no guarantee that A, B and C receive M1 in
V1 or V2={A,B,C,D}.
</para>
<para>
To change this, we can turn on virtual synchrony (by adding FLUSH to the top of the stack), which guarantees that
<itemizedlist>
<listitem>
A message M sent in V1 will be delivered in V1. So, in the example above, M1 would get delivered in
view V1 by A, B and C, but not by D.
</listitem>
<listitem>
The set of messages seen by members in V1 is the same for all members before a new view V2 is installed.
This is important, as it ensures that all members in a given view see the same messages. For example,
in a group {A,B,C}, C sends 5 messages. A receives all 5 messages, but B doesn't. Now C crashes before
it can retransmit the messages to B. FLUSH will now ensure that, before installing V2={A,B} (excluding
C), B gets C's 5 messages. This is done through the flush protocol, which has all members reconcile
their messages before a new view is installed. In this case, A will send C's 5 messages to B.
</listitem>
</itemizedlist>
</para>
<para>
Sometimes it is important to know that every node in the cluster received all messages up to a certain point,
even if there is no new view being installed. To do this (initiate a manual flush), an application programmer
can call Channel.startFlush() to start a flush and Channel.stopFlush() to terminate it.
</para>
<para>
Channel.startFlush() flushes all pending messages out of the system. This stops all senders (calling
Channel.down() during a flush will block until the flush has completed)<footnote><para>Note that block()
will be called in a Receiver when the flush is about to start and unblock() will be called when it ends</para></footnote>.
When startFlush() returns, the caller knows that (a) no messages will get sent anymore until stopFlush() is
called and (b) all members have received all messages sent before startFlush() was called.
</para>
<para>
Channel.stopFlush() terminates the flush protocol; blocked senders can then resume sending messages.
</para>
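<para>
For example, a manual flush could be run as follows (a sketch, assuming the 2.x
startFlush(boolean automatic_resume) variant; passing false means senders stay blocked until
stopFlush() is called):
</para>
<screen>
ch.startFlush(false); // returns when all members have received all pending messages
try {
    // here, every member is known to have received all messages sent before startFlush();
    // e.g. a consistent snapshot could be taken now
}
finally {
    ch.stopFlush(); // terminates the flush; blocked senders resume
}
</screen>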
<para>
Note that the FLUSH protocol has to be present on top of the stack, or else the flush will fail.
</para>
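<para>
For example, a flush-enabled configuration lists FLUSH as the last (= topmost) protocol, e.g.
(a fragment; the lower protocols are omitted):
</para>
<screen>
...
<pbcast.GMS .../>
<pbcast.STATE_TRANSFER .../>
<pbcast.FLUSH .../>
</screen>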
</section>
<section>
<title>Large clusters</title>
<para>
This section is a collection of best practices and tips and tricks for running large clusters on JGroups.
By large clusters, we mean several hundred nodes in a cluster.
</para>
<section>
<title>Reducing chattiness</title>
<para>
When we have a chatty protocol, scaling to a large number of nodes might be a problem: too many messages
are sent and - because they are generated in addition to the regular traffic - this can have a
negative impact on the cluster. For example, more of the regular messages are dropped and
have to be retransmitted, which hurts performance; or heartbeats are dropped, leading to false
suspicions. So while the negative effects of chatty protocols may not be seen in small clusters, they
<emphasis>will</emphasis> be seen in large clusters!
</para>
<section>
<title>Discovery</title>
<para>
A discovery protocol (e.g. PING, TCPPING, MPING etc.) is run at startup to discover the initial
membership, and periodically by the merge protocol to detect partitioned subclusters.
</para>
<para>
When we send a multicast discovery request to a large cluster, every node in the cluster might
reply with a discovery response sent back to the sender. So, in a cluster of 300 nodes,
the discovery requester might receive up to 299 discovery responses! Even worse, because num_ping_requests
in Discovery is set to 2 by default, we send 2 discovery requests and might therefore receive up to
num_ping_requests * (N-1) = 598 discovery responses, even though the coordinator could often be
determined after the first few responses already!
</para>
<para>
To reduce the large number of responses, we can set a max_rank property: the value defines which
members are going to send a discovery response. The rank is the index of a member in a cluster: in
{A,B,C,D,E}, A's index is 1, B's index is 2 and so on. A max_rank of 3 would trigger discovery
responses from only A, B and C, but not from D or E.
</para>
<para>
We highly recommend setting max_rank in large clusters.
</para>
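<para>
For example, with IP multicast based discovery, the PING protocol could be configured as follows
(the timeout and num_initial_members values are illustrative):
</para>
<screen>
<PING timeout="3000" num_initial_members="10" max_rank="3"/>
</screen>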
<para>
This functionality was implemented in
<ulink url="https://jira.jboss.org/browse/JGRP-1181">https://jira.jboss.org/browse/JGRP-1181</ulink>.
</para>
</section>
<section>
<title>Failure detection protocols</title>
<para>
Failure detection protocols determine when a member is unresponsive, and subsequently
<emphasis>suspect</emphasis> it. Usually (FD, FD_ALL), messages (heartbeats) are used to determine
the health of a member, but we can also use TCP connections (FD_SOCK) to connect to a member P, and
suspect P when the connection is closed.
</para>
<para>
Heartbeating requires messages to be sent around, and we need to be careful to limit the number of
messages a failure detection protocol sends, both (1) to detect crashed members and (2) once a member
has been suspected. The following sections discuss how to configure FD_ALL and FD_SOCK, the most
commonly used failure detection protocols, for use in large clusters.
</para>
<section>
<title>FD_SOCK</title>
</section>
<section>
<title>FD_ALL</title>
</section>
</section>
</section>
</section>
<section id="RelayAdvanced">
<title>Bridging between remote clusters</title>
<para>
In 2.12, the RELAY protocol was added to JGroups (for the properties see <xref linkend="RELAY">RELAY</xref>).
It allows for bridging of remote clusters. For example, if we have a cluster in New York (NYC) and another
one in San Francisco (SFO), then RELAY allows us to bridge NYC and SFO, so that multicast messages sent in
NYC will be forwarded to SFO and vice versa.
</para>
<para>
The NYC and SFO clusters could for example use IP multicasting (UDP as transport), and the bridge could use
TCP as transport. The SFO and NYC clusters don't even need to use the same cluster name.
</para>
<para>
<xref linkend="RelayFig"/> shows how the two clusters are bridged.
</para>
<para>
<figure id="RelayFig"><title>Relaying between different clusters</title>
<graphic fileref="images/RELAY.png" format="PNG" align="left" width="15cm"/>
</figure>
</para>
<para>
The cluster on the left side, with nodes A (the coordinator), B and C, is called "NYC" and uses IP
multicasting (UDP as transport). The cluster on the right side ("SFO") has nodes D (coordinator), E and F.
</para>
<para>
The bridge between the local clusters NYC and SFO is essentially another cluster with the coordinators
(A and D) of the local clusters as members. The bridge typically uses TCP as transport, but any of the
supported JGroups transports could be used (including UDP, if supported across a WAN, for instance).
</para>
<para>
Only a coordinator relays traffic between the local and remote cluster. When A crashes or leaves, then the
next-in-line (B) takes over and starts relaying.
</para>
<para>
Relaying is done via the RELAY protocol added to the top of the stack. The bridge is configured with
the bridge_props property, e.g. bridge_props="/home/bela/tcp.xml". This creates a JChannel inside RELAY.
</para>
<para>
Note that property "site" must be set in both subclusters. In the example above, we could set site="nyc"
for the NYC subcluster and site="sfo" for the SFO subcluster.
</para>
<para>
The design is described in detail in JGroups/doc/design/RELAY.txt (part of the source distribution). In
a nutshell, multicast messages received in a local cluster are wrapped and forwarded to the remote cluster
by a relay (= the coordinator of a local cluster). When a remote cluster receives such a message, it is
unwrapped and put onto the local cluster.
</para>
<para>
JGroups uses subclasses of UUID (PayloadUUID) to ship the site name with an address. When we see an address
with site="nyc" on the SFO side, RELAY knows it is a remote address and forwards messages destined for it
to the NYC subcluster, and vice versa.
When C multicasts a message in the NYC cluster, A will forward it to D, which will re-broadcast the message on
its local cluster, with the sender being D. This means that the sender of the local broadcast will appear
as D (so all retransmit requests go to D), but the original sender C is preserved in the header.
At the RELAY protocol, the sender will be replaced with the original sender (C) having site="nyc".
When node F wants to reply to the sender of the multicast, the destination
of the message will be C; the message is intercepted by the RELAY protocol and forwarded to the current
relay (D). D then picks the correct destination (C) and sends the message to the remote cluster, where
A makes sure that C (the original sender) receives it.
</para>
<para>
An important design goal of RELAY is to be able to have completely autonomous clusters, so NYC doesn't for
example have to block waiting for credits from SFO, or a node in the SFO cluster doesn't have to ask a node
in NYC for retransmission of a missing message.
</para>
<section>
<title>Views</title>
<para>
RELAY presents a <emphasis>global view</emphasis> to the application, e.g. a view received by
nodes could be {D,E,F,A,B,C}. This view is the same on all nodes, and a global view is generated by
taking the two local views, e.g. A|5 {A,B,C} and D|2 {D,E,F}, comparing the coordinators' addresses
(the UUIDs for A and D) and concatenating the views into a list. So if D's UUID is greater than
A's UUID, we first add D's members into the global view ({D,E,F}), and then A's members.
</para>
<para>
Therefore, we'll always see all of A's members, followed by all of D's members, or the other way round.
</para>
<para>
To see which nodes are local and which ones are remote, we can iterate through the addresses (PayloadUUIDs)
and use the site name (PayloadUUID.getPayload()) to differentiate, for example, between "nyc" and "sfo".
</para>
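<para>
A sketch of this is shown below (assuming, as described above, that getPayload() returns the site
name that was set via the "site" property; the cast is illustrative):
</para>
<screen>
// e.g. inside viewAccepted(View view)
for(Address member: view.getMembers()) {
    String site=(String)((PayloadUUID)member).getPayload();
    if("nyc".equals(site))
        System.out.println(member + " is a local (NYC) node");
    else
        System.out.println(member + " is a remote (" + site + ") node");
}
</screen>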
</section>
<section>
<title>Configuration</title>
<para>
To setup a relay, we need essentially 3 XML configuration files: 2 to configure the local clusters and
1 for the bridge.
</para>
<para>
To configure the first local cluster, we can copy udp.xml from the JGroups distribution and add RELAY on top
of it: <RELAY bridge_props="/home/bela/tcp.xml" />. Let's say we call this config relay.xml.
</para>
<para>
The second local cluster can be configured by copying relay.xml to relay2.xml. Then change the
mcast_addr and/or mcast_port, so we actually have 2 different clusters in case we run instances of
both clusters in the same network. Of course, if the nodes of one cluster are run in a different
network from the nodes of the other cluster, and they cannot talk to each other, then we can simply
use the same configuration.
</para>
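<para>
For example, relay2.xml might change the transport as follows (the address and port values are
illustrative):
</para>
<screen>
<UDP mcast_addr="228.10.10.11" mcast_port="45589" ... />
</screen>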
<para>
The 'site' property needs to be configured in both relay.xml and relay2.xml, and it has to be different
in each. For example, relay.xml could use site="nyc" and relay2.xml could use site="sfo".
</para>
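<para>
Putting both properties together, the top of relay.xml would then contain (a fragment; the rest of the
stack is omitted):
</para>
<screen>
<RELAY bridge_props="/home/bela/tcp.xml" site="nyc"/>
</screen>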
<para>
The bridge is configured by taking the stock tcp.xml and making sure both local clusters can see each
other through TCP.
</para>
</section>
</section>
<section id="DaisyChaining">
<title>Daisychaining</title>
<para>
Daisychaining refers to a way of disseminating messages sent to the entire cluster.
</para>
<para>
The idea behind it is that it is inefficient to broadcast a message in clusters where IP multicasting is
not available. For example, if we only have TCP available (as is the case in most clouds today), then we
have to send a broadcast (or group) message N-1 times. If we want to broadcast M to a cluster of 10,
we send the same message 9 times.
</para>
<para>
Example: in the cluster of 10 above, if A broadcasts M, it sends M to each of the other 9 members in turn.
If the link to the switch sustains 1GB/sec and M is 1GB in size, then sending M to the 9 other members takes
9 seconds, even if we parallelize the sending of M, because all 9 copies have to pass over A's single link
to the switch. (Note that I'm conveniently ignoring the fact that the switch will start dropping packets
if it is overloaded, causing TCP to retransmit and slowing things down.)
</para>
<para>
Let's introduce the concept of a round. A round is the time it takes to send or receive a message.
In the above example, a round takes 1 second if we send 1 GB messages.
In the existing N-1 approach, it takes X * (N-1) rounds to send X messages to a cluster of N nodes.
So to broadcast 10 messages to a cluster of 10, it takes 90 rounds.
</para>
<para>
Enter DAISYCHAIN.
</para>
<para>
The idea is that, instead of sending a message to N-1 members, we only send it to our neighbor, which
forwards it to its neighbor, and so on. For example, in {A,B,C,D,E}, D would broadcast a message by
forwarding it to E, E forwards it to A, A to B, B to C and C to D. We use a time-to-live field,
which gets decremented on every forward, and a message gets discarded when the time-to-live is 0.
</para>
<para>
The advantage is that, instead of taxing the link between a member and the switch to send N-1 messages,
we distribute the traffic more evenly across the links between the nodes and the switch.
Let's take a look at an example, where A broadcasts messages m1 and m2 in
cluster {A,B,C,D}, '-->' means sending:
</para>
<section>
<title>Traditional N-1 approach</title>
<para>
<itemizedlist mark='opencircle'>
<listitem>Round 1: A(m1) --> B</listitem>
<listitem>Round 2: A(m1) --> C</listitem>
<listitem>Round 3: A(m1) --> D</listitem>
<listitem>Round 4: A(m2) --> B</listitem>
<listitem>Round 5: A(m2) --> C</listitem>
<listitem>Round 6: A(m2) --> D</listitem>
</itemizedlist>
It takes 6 rounds to broadcast m1 and m2 to the cluster.
</para>
</section>
<section>
<title>Daisychaining approach</title>
<para>
<itemizedlist mark='opencircle'>
<listitem>Round 1: A(m1) --> B</listitem>
<listitem>Round 2: A(m2) --> B || B(m1) --> C</listitem>
<listitem>Round 3: B(m2) --> C || C(m1) --> D</listitem>
<listitem>Round 4: C(m2) --> D</listitem>
</itemizedlist>
It takes only 4 rounds to broadcast m1 and m2 to the cluster.
</para>
<para>In round 1, A sends m1 to B.</para>
<para>In round 2, A sends m2 to B, while B also forwards m1 (received in round 1) to C.</para>
<para>In round 3, A is done; B forwards m2 to C and C forwards m1 to D (in parallel, denoted by '||').</para>
<para>In round 4, C forwards m2 to D.</para>
</section>
<section>
<title>Switch usage</title>
<para>
Let's take a look at this in terms of switch usage: in the N-1 approach, A can only send at the capacity
of its link to the switch (e.g. 125MB/sec on a 1GbE link), no matter how many members there are in the
cluster. (Note that A can also receive at the same rate in parallel, with today's full duplex links.)
</para>
<para>
So the link between A and the switch gets hot.
</para>
<para>
In the daisychaining approach, link usage is more even: if we look for example at round 2, A sending
to B and B sending to C uses 2 different links, so there are no constraints regarding capacity of a
link. The same goes for B sending to C and C sending to D.
</para>
<para>
In terms of rounds, the daisy chaining approach uses X + (N-2) rounds to send X messages, so for a cluster
size of 10 and broadcasting 10 messages, it requires only 10 + 8 = 18 rounds, compared to 90 for the N-1 approach!
</para>
</section>
<section>
<title>Performance</title>
<para>
To measure the performance of DAISYCHAIN, a performance test (test.Perf) was run with 4 nodes connected
to a 1 GB switch, every node sending 1 million 8K messages, for a total of 32GB received by
every node. The config used was tcp.xml.
</para>
<para>
The N-1 approach yielded a throughput of 73 MB/node/sec, and the daisy chaining approach 107 MB/node/sec!
</para>
</section>
<section>
<title>Configuration</title>
<para>
DAISYCHAIN can be placed directly on top of the transport, regardless of whether it is UDP or TCP, e.g.
<screen>
<TCP .../>
<DAISYCHAIN .../>
<TCPPING .../>
</screen>
</para>
</section>
</section>
<section>
<title>Ergonomics</title>
<para>
Ergonomics is similar to the dynamic setting of optimal values for the JVM, e.g. garbage collection,
memory sizes etc. In JGroups, ergonomics means that we try to dynamically determine and set optimal
values for protocol properties. Examples are thread pool size, flow control credits, heartbeat
frequency and so on.
</para>
</section>
</chapter>