Replicated Cache
================

Author: Bela Ban
Id: $Id: ReplCache.txt,v 1.6 2008/12/23 16:11:02 belaban Exp $


Idea
----

To have a big virtual memory by aggregating the physical memory of all nodes in a cluster. E.g. if we have
hosts A, B, C and D, and each host has 2GB of physical memory, then we can have 4 * 2 GB = 8 GB of virtual
memory.

ReplCache is a hashmap which distributes its keys and values across the cluster, based on consistent
hashing. There are 3 methods: put(), get() and remove().

When adding a new key/value pair, put() takes the 'replication factor' as an argument, alongside the key,
the value and a timeout. A replication factor of 0 means no replication. A replication factor of -1 means
that the element is replicated to all cluster nodes. A replication factor of K means replicate the element
K times, e.g. put(KEY, VAL, 2, timeout) means that the element will be created on 2 nodes. When the view
changes, the cluster makes sure that the above KEY,VAL is always present on 2 nodes. Note that K has to be
less than or equal to N (= number of nodes). When K > N, ReplCache treats K as -1 (replicate to all
nodes).

TBD: a replication factor which defines a percentage, e.g. 0.3 means replicate to 30% of all nodes.

The advantage of defining replication factors per element is that we can define what reliability we want
for individual data items. For example, an element that can easily be retrieved from disk or database
probably does fine with a factor of 0 (= no replication). Here, we use the cache just as a speedup to
prevent DB access. An important item that is costly to recreate, or cannot be recreated at all, should
probably have a factor of -1.

The point of being able to define replication factors per data item is that we can save memory this way.
If we compare this to RAID 0+1, then - because we're replicating every single data item - we can
effectively only use half of the memory (disk space) allocated to the RAID system. With per-data-item
replication factors, we can increase the net memory that can be used (unless of course all elements are
added with a factor of -1 !).

Put() always results in a multicast across the cluster. Each node determines by itself whether it will add
the KEY,VAL or not. This is done by computing a set of consistent hashes from KEY, mapping them to a set
of servers and determining whether the node's own address is in that set. If so, the node adds KEY,VAL to
its local cache, else it doesn't.

Get() first checks the level 1 cache (L1 cache, not mentioned so far). If the data is found, it is
returned, else we multicast a GET request (bounded by a timeout). Every node returns its key/value from
the local cache. Before returning from get(), we add the result to our L1 cache. (The L1 cache
independently evicts timed-out items, or evicts items when it needs more space. Items with a timeout of -1
are never placed into the L1 cache.)

Remove() simply multicasts the KEY across the cluster. Every node removes KEY from its local cache, and
from the L1 cache if enabled.
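To make the above concrete, here is a minimal usage sketch in Java. It follows the description above, but
the package name, constructor arguments and exact method signatures (e.g. the short replication factor and
the start()/stop() lifecycle calls) are assumptions and may differ from the actual ReplCache class:

import org.jgroups.blocks.ReplCache; // assumed package/class location

public class ReplCacheUsageSketch {
    public static void main(String[] args) throws Exception {
        // Assumed constructor: a JGroups configuration and a cluster name
        ReplCache<String,String> cache = new ReplCache<String,String>("udp.xml", "replcache-demo");
        cache.start(); // join the cluster

        // Factor 0: store on exactly one node (no replication), expire after 60 seconds
        cache.put("name", "Bela", (short)0, 60000L);

        // Factor 2: store on 2 nodes, cache until removed (timeout 0)
        cache.put("config", "foo=bar", (short)2, 0L);

        // Factor -1: replicate to all nodes
        cache.put("critical", "do-not-lose", (short)-1, 0L);

        String val = cache.get("name"); // checks the L1 cache first, then multicasts a GET
        System.out.println("name=" + val);

        cache.remove("config");         // multicasts a REMOVE to all nodes

        cache.stop(); // leave the cluster
    }
}

The replication factor is chosen per element: data that is cheap to recreate uses 0, important data uses a
higher factor or -1.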
Design points
-------------

There are a few design considerations:

- Keys and values are small. We do *not* provide technology which breaks large data items into multiple
  chunks and distributes or replicates these chunks individually
- IP multicasting is the transport. If we used TCP, communication would get costly (N-1 issue)


API
---

put(KEY, VAL, K, TIMEOUT):

Places KEY,VAL into the hashmaps of selected cluster nodes. Existing data will be overwritten. KEY and VAL
have to be serializable.

K can be -1 (replicate everywhere), 0 (create on only 1 node) or > 0 and <= N (replicate to K nodes).

TIMEOUT (ms) can be -1 (no caching), 0 (cache until removed) or > 0 (cache for TIMEOUT ms).

On reception of put():
- The selected target nodes add KEY,VAL to their local cache if the consistent hash matches their
  local_addr
- *Everyone* removes KEY from their L1 cache. (Optimization: only the non-selected nodes do this; the
  selected nodes also add KEY,VAL to their L1 caches)

The put() method creates a message with KEY, VAL, K and TIMEOUT and multicasts it. Each node which
receives the message does the following (the selection logic is sketched in code at the end of this note):
- If K == -1: add it to the local cache and return
- If K == 0: compute the server based on the consistent hashcode for KEY and see whether
  local_addr == server. If so, add KEY,VAL to the local cache and return. Else, drop the message.
- If K > 0: compute K consistent hashes from KEY. If local_addr is part of the set of server addresses,
  add KEY,VAL to the local cache. Else, drop the message.

VAL get(KEY):
- Look up KEY in the L1 cache. If found, and not expired, return it
- Multicast a GET request
- If a non-null response has been received: add it to the L1 cache (unless its timeout is -1) and return it
- Else return null

void remove(KEY):
- Multicast a REMOVE(KEY) message
- On reception, every node removes KEY from its local cache (and from its L1 cache, if enabled)

View changes:

For a new or left node P, every node N does the following (also sketched in code at the end of this note):
- For each local KEY:
  - If the K factor is -1: replicate KEY,VAL to P
  - If the K factor is 0: compute the consistent hash and pick server S
    - If S == P: the server which hosted KEY before P joined moves KEY,VAL to P (PUT message), and
      removes KEY from its local hashmap
    - Else: do nothing (no rebalancing needed)
  - If the K factor is > 0:
    - Compute K consistent hashes and determine the K servers which are supposed to be hosting KEY
    - For each server S:
      - Do the same as above (factor == 0)
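The selection step referenced above (deciding which nodes host a given KEY, and whether the local node is
one of them) could be sketched as follows. This is not ReplCache's actual code: picking K consecutive
members starting at hash(KEY) mod N is a simplified stand-in for the real consistent-hash computation, and
the class and method names are made up for this sketch.

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the node-selection rules of "On reception of put()"
public class NodeSelection {

    // Returns the servers which should host KEY for a given replication factor K
    static <A> Set<A> serversFor(Object key, int k, List<A> members) {
        Set<A> servers = new LinkedHashSet<A>();
        if (k == -1 || k >= members.size()) {        // -1 (or K > N): every node hosts the key
            servers.addAll(members);
            return servers;
        }
        int numServers = Math.max(k, 1);             // K == 0 means exactly one node, no copies
        int start = Math.floorMod(key.hashCode(), members.size());
        for (int i = 0; i < numServers; i++)         // simplified: K consecutive members, wrapping around
            servers.add(members.get((start + i) % members.size()));
        return servers;
    }

    // A node adds KEY,VAL to its local cache only if its own address is among the selected servers
    static <A> boolean shouldStoreLocally(Object key, int k, List<A> members, A localAddr) {
        return serversFor(key, k, members).contains(localAddr);
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("A", "B", "C", "D");
        System.out.println("Servers for 'name', K=2: " + serversFor("name", 2, view));
        System.out.println("Does A store 'name'?     " + shouldStoreLocally("name", 2, view, "A"));
    }
}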
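Building on the serversFor() helper above, the view-change handling could look roughly like this.
CacheEntry and sendPut() are placeholders invented for the sketch, and letting only the first previous
host transfer a key is a simplification to avoid duplicate transfers; the real implementation may differ.

import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the view-change handling described above
public class ViewChangeSketch {

    // Assumed shape of a local cache entry (value plus the metadata needed for rebalancing)
    static class CacheEntry {
        final Object value;
        final int replicationFactor;   // -1, 0 or K
        final long timeout;
        CacheEntry(Object value, int replicationFactor, long timeout) {
            this.value = value; this.replicationFactor = replicationFactor; this.timeout = timeout;
        }
    }

    // Called on every node when the view changes from oldView to newView
    static <A> void handleViewChange(Map<Object,CacheEntry> localCache,
                                     List<A> oldView, List<A> newView, A localAddr) {
        Iterator<Map.Entry<Object,CacheEntry>> it = localCache.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Object,CacheEntry> e = it.next();
            int factor = e.getValue().replicationFactor;
            Set<A> oldServers = NodeSelection.serversFor(e.getKey(), factor, oldView);
            Set<A> newServers = NodeSelection.serversFor(e.getKey(), factor, newView);

            // Servers which should now host the key but didn't before get a copy. Only the first
            // previous host sends it, so the transfer isn't done N times (simplification).
            if (localAddr.equals(oldServers.iterator().next())) {
                for (A server : newServers)
                    if (!oldServers.contains(server))
                        sendPut(server, e.getKey(), e.getValue());
            }

            // If this node is no longer among the selected servers, drop the key locally
            if (!newServers.contains(localAddr))
                it.remove();
        }
    }

    static <A> void sendPut(A target, Object key, CacheEntry entry) {
        // Placeholder: would send PUT(key, value, factor, timeout) to the target node
    }
}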