File: FAQ-16.html

package info (click to toggle)
squid 2.6.5-6etch5
links: PTS
area: main
in suites: etch
size: 12,540 kB
ctags: 13,801
sloc: ansic: 105,278; sh: 6,083; makefile: 1,297; perl: 1,245; awk: 40
file content (671 lines) | stat: -rw-r--r-- 24,751 bytes
parent folder | download | duplicates (2)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
 <META NAME="GENERATOR" CONTENT="LinuxDoc-Tools 0.9.21">
 <TITLE>SQUID Frequently Asked Questions: Cache Digests</TITLE>
 <LINK HREF="FAQ-17.html" REL=next>
 <LINK HREF="FAQ-15.html" REL=previous>
 <LINK HREF="FAQ.html#toc16" REL=contents>
</HEAD>
<BODY>
<A HREF="FAQ-17.html">Next</A>
<A HREF="FAQ-15.html">Previous</A>
<A HREF="FAQ.html#toc16">Contents</A>
<HR>
<H2><A NAME="cache-digests"></A> <A NAME="s16">16.</A> <A HREF="FAQ.html#toc16">Cache Digests</A></H2>

<P><EM>Cache Digest FAQs compiled by
<A HREF="mailto:ndoherty@eei.ericsson.se">Niall Doherty</A>.</EM></P>

<H2><A NAME="ss16.1">16.1</A> <A HREF="FAQ.html#toc16.1">What is a Cache Digest?</A>
</H2>

<P>A Cache Digest is a summary of the contents of an Internet Object Caching
Server.
It contains, in a compact (i.e. compressed) format, an indication of whether
or not particular URLs are in the cache.</P>

<P>A "lossy" technique is used for compression, which means that very high
compression factors can be achieved at the expense of not having 100%
correct information.</P>


<H2><A NAME="ss16.2">16.2</A> <A HREF="FAQ.html#toc16.2">How and why are they used?</A>
</H2>

<P>Cache servers periodically exchange their digests with each other.</P>

<P>When a request for an object (URL) is received from a client a cache
can use digests from its peers to find out which of its peers (if any)
have that object.
The cache can then request the object from the closest peer (Squid
uses the NetDB database to determine this).</P>

<P>Note that Squid will only make digest queries in those digests that are
<EM>enabled</EM>.
It will disable a peers digest IFF it cannot fetch a valid digest
for that peer.
It will enable that peers digest again when a valid one is fetched.</P>

<P>The checks in the digest are very fast and they eliminate the need
for per-request queries to peers. Hence:</P>

<P>
<UL>
<LI>Latency is eliminated and client response time should be improved.</LI>
<LI>Network utilisation may be improved.</LI>
</UL>
</P>

<P>Note that the use of Cache Digests (for querying the cache contents of peers)
and the generation of a Cache Digest (for retrieval by peers) are independent.
So, it is possible for a cache to make a digest available for peers, and not
use the functionality itself and vice versa.</P>


<H2><A NAME="ss16.3">16.3</A> <A HREF="FAQ.html#toc16.3">What is the theory behind Cache Digests?</A>
</H2>

<P>Cache Digests are based on Bloom Filters - they are a method for
representing a set of keys with lookup capabilities;
where lookup means "is the key in the filter or not?".</P>

<P>In building a cache digest:</P>

<P>
<UL>
<LI> A vector (1-dimensional array) of m bits is allocated, with all
bits initially set to 0.</LI>
<LI> A number, k, of independent hash functions are chosen, h1, h2,
..., hk, with range { 1, ..., m }
(i.e. a key hashed with any of these functions gives a value between 1
and m inclusive).</LI>
<LI> The set of n keys to be operated on are denoted by:
A = { a1, a2, a3, ..., an }.</LI>
</UL>
</P>


<H3>Adding a Key</H3>

<P>To add a key the value of each hash function for that key is calculated.
So, if the key was denoted by <EM>a</EM>, then h1(a), h2(a), ...,
hk(a) are calculated.</P>

<P>The value of each hash function for that key represents an index into
the array and the corresponding bits are set to 1. So, a digest with
6 hash functions would have 6 bits to be set to 1 for each key added.</P>

<P>Note that the addition of a number of <EM>different</EM> keys could
cause one particular bit to be set to 1 multiple times.</P>


<H3>Querying a Key</H3>

<P>To query for the existence of a key the indices into the array are
calculated from the hash functions as above.</P>

<P>
<UL>
<LI> If any of the corresponding bits in the array are 0 then the key is
not present.</LI>
<LI> If all of the corresponding bits in the array are 1 then the key is
<EM>likely</EM> to be present.</LI>
</UL>
</P>

<P>Note the term <EM>likely</EM>.
It is possible that a <EM>collision</EM> in the digest can occur, whereby
the digest incorrectly indicates a key is present.
This is the price paid for the compact representation.
While the probability of a collision can never be reduced to zero it can
be controlled.
Larger values for the ratio of the digest size to the number of entries added
lower the probability.
The number of hash functions chosen also influence the probability.</P>


<H3>Deleting a Key</H3>


<P>To delete a key, it is not possible to simply set the associated bits
to 0 since any one of those bits could have been set to 1 by the addition
of a different key!</P>

<P>Therefore, to support deletions a counter is required for each bit position
in the array.
The procedures to follow would be:</P>

<P>
<UL>
<LI> When adding a key, set appropriate bits to 1 and increment the
corresponding counters.</LI>
<LI> When deleting a key, decrement the appropriate counters (while > 0),
and if a counter reaches 0 <EM>then</EM> the corresponding bit is set to 0.</LI>
</UL>
</P>


<H2><A NAME="ss16.4">16.4</A> <A HREF="FAQ.html#toc16.4">How is the size of the Cache Digest in Squid determined?</A>
</H2>

<P>Upon initialisation, the <EM>capacity</EM> is set to the number
of objects that can be (are) stored in the cache.
Note that there are upper and lower limits here.</P>

<P>An arbitrary constant, bits_per_entry (currently set to 5), is
used to calculate the size of the array using the following formula:</P>

<P>
<PRE>
  number of bits in array = capacity * bits_per_entry + 7
</PRE>
</P>

<P>The size of the digest, in bytes, is therefore:</P>

<P>
<PRE>
  digest size = int (number of bits in array / 8)
</PRE>
</P>

<P>When a digest rebuild occurs, the change in the cache size (capacity)
is measured.
If the capacity has changed by a large enough amount (10%) then
the digest array is freed and reallocated memory, otherwise the
same digest is re-used.</P>


<H2><A NAME="ss16.5">16.5</A> <A HREF="FAQ.html#toc16.5">What hash functions (and how many of them) does Squid use?</A>
</H2>

<P>The protocol design allows for a variable number of hash functions (k).
However, Squid employs a very efficient method using a fixed number - four.</P>

<P>Rather than computing a number of independent hash functions over a URL
Squid uses a 128-bit MD5 hash of the key (actually a combination of the URL
and the HTTP retrieval method) and then splits this into four equal
chunks.</P>
<P>Each chunk, modulo the digest size (m), is used as the value for one of
the hash functions - i.e. an index into the bit array.</P>

<P>Note: As Squid retrieves objects and stores them in its cache on disk,
it adds them to the in-RAM index using a lookup key which is an MD5 hash
- the very one discussed above.
This means that the values for the Cache Digest hash functions are
already available and consequently the operations are extremely
efficient!</P>

<P>Obviously, modifying the code to support a variable number of hash functions
would prove a little more difficult and would most likely reduce efficiency.</P>


<H2><A NAME="ss16.6">16.6</A> <A HREF="FAQ.html#toc16.6">How are objects added to the Cache Digest in Squid?</A>
</H2>

<P>Every object referenced in the index in RAM is checked to see if
it is suitable for addition to the digest.</P>
<P>A number of objects are not suitable, e.g. those that are private,
not cachable, negatively cached etc. and are skipped immediately.</P>

<P>A <EM>freshness</EM> test is next made in an attempt to guess if
the object will expire soon, since if it does, it is not worthwhile
adding it to the digest.
The object is checked against the refresh patterns for staleness...</P>

<P>Since Squid stores references to objects in its index using the MD5 key
discussed earlier there is no URL actually available for each object -
which means that the pattern used will fall back to the default pattern, ".".
This is an unfortunate state of affairs, but little can be done about
it.
A <EM>cd_refresh_pattern</EM> option will be added to the configuration
file soon which will at least make the confusion a little clearer :-)</P>

<P>Note that it is best to be conservative with your refresh pattern
for the Cache Digest, i.e.
do <EM>not</EM> add objects if they might become stale soon.
This will reduce the number of False Hits.</P>


<H2><A NAME="ss16.7">16.7</A> <A HREF="FAQ.html#toc16.7">Does Squid support deletions in Cache Digests? What are diffs/deltas?</A>
</H2>

<P>Squid does not support deletions from the digest.
Because of this the digest must, periodically, be rebuilt from scratch to
erase stale bits and prevent digest pollution.</P>

<P>A more sophisticated option is to use <EM>diffs</EM> or <EM>deltas</EM>.
These would be created by building a new digest and comparing with the
current/old one.
They would essentially consist of aggregated deletions and additions
since the <EM>previous</EM> digest.</P>

<P>Since less bandwidth should be required using these it would be possible
to have more frequent updates (and hence, more accurate information).</P>

<P>Costs:</P>

<P>
<UL>
<LI>RAM - extra RAM needed to hold two digests while comparisons takes place.</LI>
<LI>CPU - probably a negligible amount.</LI>
</UL>
</P>


<H2><A NAME="ss16.8">16.8</A> <A HREF="FAQ.html#toc16.8">When and how often is the local digest built?</A>
</H2>

<P>The local digest is built:</P>

<P>
<UL>
<LI> when store_rebuild completes after startup
(the cache contents have been indexed in RAM), and</LI>
<LI> periodically thereafter. Currently, it is rebuilt every hour
(more data and experience is required before other periods, whether
fixed or dynamically varying, can "intelligently" be chosen).
The good thing is that the local cache decides on the expiry time and
peers must obey (see later).</LI>
</UL>
</P>

<P>While the [new] digest is being built in RAM the old version (stored
on disk) is still valid, and will be returned to any peer requesting it.
When the digest has completed building it is then swapped out to disk,
overwriting the old version.</P>

<P>The rebuild is CPU intensive, but not overly so.
Since Squid is programmed using an event-handling model, the approach
taken is to split the digest building task into chunks (i.e.  chunks
of entries to add) and to register each chunk as an event.
If CPU load is overly high, it is possible to extend the build period
- as long as it is finished before the next rebuild is due!</P>

<P>It may prove more efficient to implement the digest building as a separate
process/thread in the future...</P>


<H2><A NAME="ss16.9">16.9</A> <A HREF="FAQ.html#toc16.9">How are Cache Digests transferred between peers?</A>
</H2>

<P>Cache Digests are fetched from peers using the standard HTTP protocol
(note that a <EM>pull</EM> rather than <EM>push</EM> technique is
used).</P>

<P>After the first access to a peer, a <EM>peerDigestValidate</EM> event
is queued
(this event decides if it is time to fetch a new version of a digest
from a peer).
The queuing delay depends on the number of peers already queued
for validation - so that all digests from different peers are not
fetched simultaneously.</P>

<P>A peer answering a request for its digest will specify an expiry
time for that digest by using the HTTP <EM>Expires</EM> header.
The requesting cache thus knows when it should request a fresh
copy of that peers digest.</P>

<P>Note: requesting caches use an If-Modified-Since request in case the peer
has not rebuilt its digest for some reason since the last time it was
fetched.</P>


<H2><A NAME="ss16.10">16.10</A> <A HREF="FAQ.html#toc16.10">How and where are Cache Digests stored?</A>
</H2>


<H3>Cache Digest built locally</H3>

<P>Since the local digest is generated purely for the benefit of its neighbours
keeping it in RAM is not strictly required.
However, it was decided to keep the local digest in RAM partly because of
the following:</P>

<P>
<UL>
<LI> Approximately the same amount of memory will be (re-)allocated on every
rebuild of the digest,</LI>
<LI> the memory requirements are probably quite small (when compared to other
requirements of the cache server),</LI>
<LI> if ongoing updates of the digest are to be supported (e.g. additions/deletions) it will be necessary to perform these operations on a digest
in RAM, and</LI>
<LI> if diffs/deltas are to be supported the "old" digest would have to
be swapped into RAM anyway for the comparisons.</LI>
</UL>
</P>

<P>When the digest is built in RAM, it is then swapped out to disk, where it is
stored as a "normal" cache item - which is how peers request it.</P>


<H3>Cache Digest fetched from peer</H3>

<P>When a query from a client arrives, <EM>fast lookups</EM> are
required to decide if a request should be made to a neighbour cache.
It it therefore required to keep all peer digests in RAM.</P>

<P>Peer digests are also stored on disk for the following reasons:</P>

<P>
<UL>
<LI><EM>Recovery</EM>
- If stopped and restarted, peer digests can be reused from the local
on-disk copy (they will soon be validated using an HTTP IMS request
to the appropriate peers as discussed earlier), and</LI>
<LI><EM>Sharing</EM>
- peer digests are stored as normal objects in the cache. This
allows them to be given to neighbour caches.</LI>
</UL>
</P>


<H2><A NAME="ss16.11">16.11</A> <A HREF="FAQ.html#toc16.11">How are the Cache Digest statistics in the Cache Manager to be interpreted?</A>
</H2>

<P>Cache Digest statistics can be seen from the Cache Manager or through the
<EM>squidclient</EM> utility.
The following examples show how to use the <EM>squidclient</EM> utility
to request the list of possible operations from the localhost, local
digest statistics from the localhost, refresh statistics from the
localhost and local digest statistics from another cache, respectively.</P>

<P>
<PRE>
  squidclient mgr:menu

  squidclient mgr:store_digest

  squidclient mgr:refresh

  squidclient -h peer mgr:store_digest
</PRE>
</P>

<P>The available statistics provide a lot of useful debugging information.
The refresh statistics include a section for Cache Digests which
explains why items were added (or not) to the digest.</P>

<P>The following example shows local digest statistics for a 16GB
cache in a corporate intranet environment
(may be a useful reference for the discussion below).</P>

<P>
<PRE>
store digest: size: 768000 bytes
         entries: count: 588327 capacity: 1228800 util: 48%
         deletion attempts: 0
         bits: per entry: 5 on: 1953311 capacity: 6144000 util: 32%
         bit-seq: count: 2664350 avg.len: 2.31
         added: 588327 rejected: 528703 ( 47.33 %) del-ed: 0
         collisions: on add: 0.23 % on rej: 0.23 %
</PRE>
</P>

<P><EM>entries:capacity</EM> is a measure of how many items "are likely" to
be added to the digest.
It represents the number of items that were in the local cache at the
start of digest creation - however, upper and lower limits currently
apply.
This value is multiplied by <EM>bits: per entry</EM> (an arbitrary constant)
to give <EM>bits:capacity</EM>, which is the size of the cache digest in bits.
Dividing this by 8 will give <EM>store digest: size</EM> which is the
size in bytes.</P>

<P>The number of items represented in the digest is given by
<EM>entries:count</EM>.
This should be equal to <EM>added</EM> minus <EM>deletion attempts</EM>.</P>
<P>Since (currently) no modifications are made to the digest after the initial
build (no additions are made and deletions are not supported)
<EM>deletion attempts</EM> will always be 0 and <EM>entries:count</EM>
should simply be equal to <EM>added</EM>.</P>

<P><EM>entries:util</EM> is not really a significant statistic.
At most it gives a measure of how many of the items in the store were
deemed suitable for entry into the cache compared to how many were
"prepared" for.</P>

<P><EM>rej</EM> shows how many objects were rejected.
Objects will not be added for a number of reasons, the most common being
refresh pattern settings.
Remember that (currently) the default refresh pattern will be used for
checking for entry here and also note that changing this pattern can
significantly affect the number of items added to the digest!
Too relaxed and False Hits increase, too strict and False Misses increase.
Remember also that at time of validation (on the peer) the "real" refresh
pattern will be used - so it is wise to keep the default refresh pattern
conservative.</P>

<P><EM>bits: on</EM> indicates the number of bits in the digest that are set
to 1.
<EM>bits: util</EM> gives this figure as a percentage of the total number
of bits in the digest.
As we saw earlier, a figure of 50% represents the optimal trade-off.
Values too high (say > 75%) would cause a larger number of collisions,
and hence False Hits,
while lower values mean the digest is under-utilised (using unnecessary RAM).
Note that low values are normal for caches that are starting to fill up.</P>

<P>A bit sequence is an uninterrupted sequence of bits with the same value.
<EM>bit-seq: avg.len</EM> gives some insight into the quality of the hash
functions.
Long values indicate problem, even if <EM>bits:util</EM> is 50%
(> 3 = suspicious, > 10 = very suspicious).</P>


<H2><A NAME="ss16.12">16.12</A> <A HREF="FAQ.html#toc16.12">What are False Hits and how should they be handled?</A>
</H2>

<P>A False Hit occurs when a cache believes a peer has an object
and asks the peer for it <EM>but</EM> the peer is not able to
satisfy the request.</P>

<P>Expiring or stale objects on the peer are frequent causes of False
Hits.
At the time of the query actual refresh patterns are used on the
peer and stale entries are marked for revalidation.
However, revalidation is prohibited unless the peer is behaving
as a parent, or <EM>miss_access</EM> is enabled.
Thus, clients can receive error messages instead of revalidated
objects!</P>

<P>The frequency of False Hits can be reduced but never eliminated
completely, therefore there must be a robust way of handling them
when they occur.
The philosophy behind the design of Squid is to use lightweight
techniques and optimise for the common case and robustly handle the
unusual case (False Hits).</P>

<P>Squid will soon support the HTTP <EM>only-if-cached</EM> header.
Requests for objects made to a peer will use this header and if
the objects are not available, the peer can reply appropriately
allowing Squid to recognise the situation.
The following describes what Squid is aiming towards:</P>

<P>
<UL>
<LI>Cache Digests used to obtain good estimates of where a
requested object is located in a Cache Hierarchy.</LI>
<LI>Persistent HTTP Connections between peers.
There will be no TCP startup overhead and both latency and
network load will be similar for ICP (i.e. fast).</LI>
<LI>HTTP False Hit Recognition using the <EM>only-if-cached</EM>
HTTP header - allowing fall back to another peer or, if no other
peers are available with the object, then going direct (or
<EM>through</EM> a parent if behind a firewall).</LI>
</UL>
</P>


<H2><A NAME="ss16.13">16.13</A> <A HREF="FAQ.html#toc16.13">How can Cache Digest related activity be traced/debugged?</A>
</H2>


<H3>Enabling Cache Digests</H3>

<P>If you wish to use Cache Digests (available in Squid version 2) you need to
add a <EM>configure</EM> option, so that the relevant code is compiled in:</P>

<P>
<PRE>
  ./configure --enable-cache-digests ...
</PRE>
</P>


<H3>What do the access.log entries look like?</H3>

<P>If a request is forwarded to a neighbour due a HIT in that neighbour's
Cache Digest the hierarchy (9th) field of the access.log file for
the <EM>local cache</EM> will look like <EM>CACHE_DIGEST_HIT/neighbour</EM>.
The Log Tag (4th field) should obviously show a MISS.</P>

<P>On the peer cache the request should appear as a normal HTTP request
from the first cache.</P>


<H3>What does a False Hit look like?</H3>

<P>The easiest situation to analyse is when two caches (say A and B) are
involved neither of which uses the other as a parent.
In this case, a False Hit would show up as a CACHE_DIGEST_HIT on A and
<EM>NOT</EM> as a TCP_HIT on B (or vice versa).
If B does not fetch the object for A then the hierarchy field will
look like <EM>NONE/-</EM> (and A will have received an Access Denied
or Forbidden message).
This will happen if the object is not "available" on B and B does not
have <EM>miss_access</EM> enabled for A (or is not acting as a parent
for A).</P>


<H3>How is the cause of a False Hit determined?</H3>

<P>Assume A requests a URL from B and receives a False Hit</P>
<P>
<UL>
<LI> Using the <EM>squidclient</EM> utility <EM>PURGE</EM> the URL from A, e.g.

<P>
<PRE>
  squidclient -m PURGE 'URL'
</PRE>
</P>

</LI>
<LI> Using the <EM>squidclient</EM> utility request the object from A, e.g.

<P>
<PRE>
  squidclient 'URL'
</PRE>
</P>

</LI>
</UL>
</P>

<P>The HTTP headers of the request are available.
Two header types are of particular interest:</P>

<P>
<UL>
<LI> <EM>X-Cache</EM> - this shows whether an object is available or not.</LI>
<LI> <EM>X-Cache-Lookup</EM> - this keeps the result of a store table lookup
<EM>before</EM> refresh causing rules are checked (i.e. it indicates if the
object is available before any validation would be attempted).</LI>
</UL>
</P>

<P>The X-Cache and X-Cache-Lookup headers from A should both show MISS.</P>

<P>If A requests the object from B (which it will if the digest lookup indicates
B has it - assuming B is closest peer of course :-) then there will be another
set of these headers from B.</P>

<P>If the X-Cache header from B shows a MISS a False Hit has occurred.
This means that A thought B had an object but B tells A it does not
have it available for retrieval.
The reason why it is not available for retrieval is indicated by the
X-Cache-Lookup header. If:</P>

<P>
<UL>
<LI><EM>X-Cache-Lookup = MISS</EM> then either A's (version of
B's) digest is out-of-date or corrupt OR a collision occurred
in the digest (very small probability) OR B recently purged
the object.</LI>
<LI><EM>X-Cache-Lookup = HIT</EM> then B had the object, but
refresh rules (or A's max-age requirements) prevent A from
getting a HIT (validation failed).</LI>
</UL>
</P>


<H3>Use The Source</H3>

<P>If there is something else you need to check you can always look at the
source code.
The main Cache Digest functionality is organised as follows:</P>

<P>
<UL>
<LI> <EM>CacheDigest.c (debug section 70)</EM> Generic Cache Digest routines</LI>
<LI> <EM>store_digest.c (debug section 71)</EM> Local Cache Digest routines</LI>
<LI> <EM>peer_digest.c (debug section 72)</EM> Peer Cache Digest routines</LI>
</UL>
</P>

<P>Note that in the source the term <EM>Store Digest</EM> refers to the digest
created locally.
The Cache Digest code is fairly self-explanatory (once you understand how Cache
Digests work):</P>


<H2><A NAME="ss16.14">16.14</A> <A HREF="FAQ.html#toc16.14">What about ICP?</A>
</H2>


<P>COMING SOON!</P>


<H2><A NAME="ss16.15">16.15</A> <A HREF="FAQ.html#toc16.15">Is there a Cache Digest Specification?</A>
</H2>

<P>There is now, thanks to
<A HREF="mailto:martin@net.lut.ac.uk">Martin Hamilton</A> and
<A HREF="mailto:rousskov@ircache.net">Alex Rousskov</A>.</P>

<P>Cache Digests, as implemented in Squid 2.1.PATCH2, are described in
<A HREF="/CacheDigest/cache-digest-v5.txt">cache-digest-v5.txt</A>.</P>
<P>You'll notice the format is similar to an Internet Draft.
We decided not to submit this document as a draft because Cache Digests
will likely undergo some important changes before we want to try to make
it a standard.</P>

<H2><A NAME="ss16.16">16.16</A> <A HREF="FAQ.html#toc16.16">Would it be possible to stagger the timings when cache_digests are retrieved from peers?</A>
</H2>

<P><EM>Note: The information here is current for version 2.2.</EM></P>
<P>Squid already has code to spread the digest updates. The algorithm is
currently controlled by a few hard-coded constants in <EM>peer_digest.c</EM>. For
example, <EM>GlobDigestReqMinGap</EM> variable determines the minimum interval
between two requests for a digest. You may want to try to increase the
value of GlobDigestReqMinGap from 60 seconds to whatever you feel
comfortable with (but it should be smaller than hour/number_of_peers, of
course).</P>

<P>Note that whatever you do, you still need to give Squid enough time and
bandwidth to fetch all the digests. Depending on your environment, that
bandwidth may be more or less than an ICP would require. Upcoming digest
deltas (x10 smaller than the digests themselves) may be the only way to
solve the ``big scale'' problem.</P>



<HR>
<A HREF="FAQ-17.html">Next</A>
<A HREF="FAQ-15.html">Previous</A>
<A HREF="FAQ.html#toc16">Contents</A>
</BODY>
</HTML>