1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189
|
Replicated Cache
================
Author: Bela Ban
Idea
----
To have a big virtual memory by aggregating the physical memory of all nodes in a cluster. E.g. if we have hosts
A, B, C and D and each host has 2GB of physical memory, then we can have 4 * 2 GB = 8 GB of virtual memory.
ReplCache is a hashmap which distributes its keys and values across the cluster, based on consistent hashing.
The difference to PartitionedHashMap is that ReplCache allows a user to define *per data item* whether it should be
replicated (for high availability) or not and if yes, how many times.
There are 3 methods: put(), get() and remove().
When adding a new key/value pair, put() takes the 'replication count' as argument, along with the key, value and
a timeout.
A replication count of 1 means the data is stored only on 1 node in the cluster, and there is no replication.
The data will be placed on a node in the cluster that corresponds to the consistent hashcode of the data's key.
A replication count of -1 means that the data item is replicated to all cluster nodes.
A replication count of K means the element is stored K times, e.g. PUT(KEY, VAL, 2, timeout) means that an element
will be created on 2 nodes. When the view changes, the cluster makes sure that the above KEY, VAL is always present
on 2 nodes. Note that K has to be less than or equal to N (= number of nodes). When K > N, then ReplCache treats
K as -1 (replicate to all nodes).
K == 0 is invalid and will be ignored; the data will not be stored in the cluster.
TBD: a replication count which defines a percentage, e.g. 0.3 means replicate to 30% of all nodes.
The advantage of defining replication counts per element is that we can define what reliability we want for
individual data items. For example, an element that can easily be retrieved from disk or database probably does
fine with a count of 1 (= no replication). Here, we use the cache just as a speedup to prevent DB access.
An important item that is costly to recreate, or cannot be recreated at all, should probably have a count of -1.
The point of being able to define replication counts per data item is that we can save memory. If we
compare this to RAID 1, then - because we're replicating every single data item - we can effectively only use
half of the memory (disk space) allocated to the RAID system. With per data replication counts, we can increase the
net memory that can be used (unless of course all elements are added with a count of -1 !).
Put() always results in a multicast across the cluster. Each node determines by itself whether it will add the KEY|VAL
or not. This is done by computing a set of consistent hashes from KEY, mapping them to a set of servers and determining
whether the node's address is in that set. If so, a node will add the KEY,VAL to its local cache, else not.
Get() first checks the level 1 cache (L1 cache, not mentioned so far), and the regular cache (L2 cache).
If the data is found, it will be returned, else we multicast a GET request (bounded with a timeout).
Every node returns its key/value from the local cache. Before returning from get(), we add the result to our L1 cache.
(The L1 cache will independently evict timed-out items, or evict items when it needs more space.
Items with a timeout of -1 are never placed into the L1 cache).
Remove() simply multicasts the KEY across the cluster. Every node removes KEY from its local cache, and the L1 cache
if enabled.
Design decisions
----------------
There are a few considerations and assumptions that influenced the design:
- Keys and values must be small. We do *not* provide technology which breaks large data items into multiple chunks
and distributes or replicates these chunks individually
- IP multicasting should be used in the transport. If we used TCP, communication would get costly (N-1 issue)
- K cannot be reduced for the same key, e.g. if K == 3 and then K == 2 for the same key, we'll have some leftover
data with K == 3. If this is necessary, remove the key first before calling put() again.
API
---
put(KEY, VAL, K, TIMEOUT):
--------------------------
Places KEY,VAL into the hashmaps of selected cluster nodes. Existing data will be overwritten. KEY and VAL have to
be serializable.
K is the replication count and can be:
-1: replicate everywhere
1: create only on 1 node
> 1 <= N: store on K nodes
TIMEOUT (ms):
-1: no caching
0: cache until removed
> 0: cache for TIMEOUT milliseconds
The put() method creates a message with KEY, VAL, K and TIMEOUT and multicasts it. Each node which receives the message
does the following:
- If K == -1: add it to the local cache and return
- If K == 1: compute the server based on the consistent hashcode for KEY and see whether local_addr == server. If
so, add the KEY, VAL to the local cache and return. Else, drop the message.
- If K > 0: compute K consistent hashes from KEY. If local_addr is part of the set of server addresses, add KEY,VAL
to the local cache. Else, drop the message.
VAL get(KEY):
-------------
- Look up KEY in the L1 cache. If found,and not expired, return it
- Multicast a GET request.
- If a non-null response has been received: add it to the L1 cache (if not -1) and return
- Else return null
void remove(KEY):
-----------------
- Multicast a REMOVE(KEY) message
- On reception, every node removes KEY from its local cache
View change:
------------
- The handling of view changes needs to be done in a separate thread. Suggestion: do this in a timer task, and
start the task 100ms after the view change callback. This is to ensure that every cluster node installed the view.
- Old view is V, new view is V-NEW
- For a new node P
- For each key KEY
- If K == -1: copy KEY to P (only the coord does this)
- If K == 1: compute new hash (based on V-NEW), if not same as local node -> move KEY to P
- If K > 1: re-balance
- For each left node Q
- For each key KEY:
- If K == -1: no-op
- If K == 1: compute new hash (based on V-NEW), if not same as local node -> move KEY to P
- If K > 1: re-balance
Re-balance (K > 1):
-------------------
- Old view is V, new view is V-NEW
- move() installs a KEY regardless of whether it would be accepted in the current view
- For each key KEY:
- If K == -1:
- For each newly joined node P:
- The coord sends KEY to P (move())
- If K == 1:
- Compute 1 hashcode (based on V-NEW) and determine the new node N
- If N != local node --> move KEY to N
- If K > 1:
- Compute hashes for V (NODES)
- Compute hashes for V-NEW (NEW-NODES)
- If NODES == NEW-NODES --> return
- Else
- Multicast KEY (make sure everyone has new view installed, to compute correct acceptance)
- If local node is not in NEW-NODES --> remove KEY from local node
- Compute nodes for V (NODES)
- Compute nodes for V-NEW (NEW-NODES)
- If NODES == NEW-NODES --> return
- Else
- Multicast KEY (every node will check the hash against itself and store if needed)
- If the local node is not in NEW-NODES --> remove KEY
Stopping the cache:
-------------------
- For each key KEY:
- If K is -1: no-op (every node already has KEY)
- If K == 1: compute new hash (based on the view excluding the current node) and copy KEY to the new node
- If K > 1: no-op (somebody else has KEY, view change will copy to more nodes if needed)
|