File: custom_heuristics.rst

package info (click to toggle)
python-cachecontrol 0.14.3-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 376 kB
  • sloc: python: 2,026; makefile: 167; sh: 8
file content (170 lines) | stat: -rw-r--r-- 5,526 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
..
  SPDX-FileCopyrightText: SPDX-FileCopyrightText: 2015 Eric Larson

  SPDX-License-Identifier: Apache-2.0

===========================
 Custom Caching Strategies
===========================

There are times when a server provides responses that are logically
cacheable, but they lack the headers necessary to cause CacheControl
to cache the response. `The HTTP Caching Spec
<http://tools.ietf.org/html/rfc7234>`_ does allow for caching systems
to cache requests that lack caching headers. In these situations, the
caching system can use heuristics to determine an appropriate amount
of time to cache a response.

By default, in CacheControl the decision to cache must be explicit by
default via the caching headers. When there is a need to cache
responses that wouldn't normally be cached, a user can provide a
heuristic to adjust the response in order to make it cacheable.

For example when running a test suite against a service, caching all
responses might be helpful speeding things up while still making real
calls to the API.


Caching Heuristics
==================

A cache heuristic allows specifying a caching strategy by adjusting
response headers before the response is considered for caching.

For example, if we wanted to implement a caching strategy where every
request should be cached for a week, we can implement the strategy in
a `cachecontrol.heuristics.Heuristic`. ::

  import calendar
  from cachecontrol.heuristics import BaseHeuristic
  from datetime import datetime, timedelta
  from email.utils import parsedate, formatdate


  class OneWeekHeuristic(BaseHeuristic):

      def update_headers(self, response):
          date = parsedate(response.headers['date'])
          expires = datetime(*date[:6]) + timedelta(weeks=1)
          return {
              'expires' : formatdate(calendar.timegm(expires.timetuple())),
              'cache-control' : 'public',
          }

      def warning(self, response):
          msg = 'Automatically cached! Response is Stale.'
          return '110 - "%s"' % msg


When a response is received and we are testing for whether it is
cacheable, the heuristic is applied before checking its headers. We
also set a `warning header
<http://tools.ietf.org/html/rfc7234#section-5.5>`_ to communicate why
the response might be stale. The original response is passed into the
warning header in order to use its values. For example, if the
response has been expired for more than 24 hours a `Warning 113
<http://tools.ietf.org/html/rfc7234#section-5.5.4>`_ should be used.

In order to use this heuristic, we pass it to our `CacheControl`
constructor. ::


  from requests import Session
  from cachecontrol import CacheControl


  sess = CacheControl(Session(), heuristic=OneWeekHeuristic())
  sess.get('http://google.com')
  r = sess.get('http://google.com')
  assert r.from_cache

The google homepage specifically uses a negative expires header and
private cache control header to avoid caches. We've managed to work
around that aspect and cache the response using our heuristic.


Best Practices
==============

Cache heuristics are still a new feature, which means that the support
is somewhat rudimentary. There likely to be best practices and common
heuristics that can meet the needs of many use cases. For example, in
the above heuristic it is important to change both the `expires` and
`cache-control` headers in order to make the response cacheable.

If you do find a helpful best practice or create a helpful heuristic,
please consider sending a pull request or opening a issue.


Expires After
-------------

CacheControl bundles an `ExpiresAfter` heuristic that is aimed at
making it relatively easy to automatically cache responses for a
period of time. Here is an example

.. code-block:: python

   import requests
   from cachecontrol import CacheControlAdapter
   from cachecontrol.heuristics import ExpiresAfter

   adapter = CacheControlAdapter(heuristic=ExpiresAfter(days=1))

   sess = requests.Session()
   sess.mount('http://', adapter)

The arguments are the same as the `datetime.timedelta`
object. `ExpiresAfter` will override or add the `Expires` header and
override or set the `Cache-Control` header to `public`.


Last Modified
-------------

CacheControl bundles an `LastModified` heuristic that emulates
the behavior of Firefox, following RFC7234. Roughly stated,
this sets the expiration on a page to 10% of the difference
between the request timestamp and the last modified timestamp.
This is capped at 24-hr.

.. code-block:: python

   import requests
   from cachecontrol import CacheControlAdapter
   from cachecontrol.heuristics import LastModified

   adapter = CacheControlAdapter(heuristic=LastModified())

   sess = requests.Session()
   sess.mount('http://', adapter)


Site Specific Heuristics
------------------------

If you have a specific domain that you want to apply a specific
heuristic to, use a separate adapter. ::

  import requests
  from cachecontrol import CacheControlAdapter
  from mypkg import MyHeuristic


  sess = requests.Session()
  sess.mount(
      'http://my.specific-domain.com',
      CacheControlAdapter(heuristic=MyHeuristic())
  )

In this way you can limit your heuristic to a specific site.


Warning!
========

Caching is hard and while HTTP does a reasonable job defining rules
for freshness, overriding those rules should be done with
caution. Many have been frustrated by over aggressive caches, so
please carefully consider your use case before utilizing a more
aggressive heuristic.