<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>11.8. Handling compressed data</title>
<link rel="stylesheet" href="../diveintopython.css" type="text/css">
<link rev="made" href="mailto:f8dy@diveintopython.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.52.2">
<meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free">
<meta name="description" content="Python from novice to pro">
<link rel="home" href="../toc/index.html" title="Dive Into Python">
<link rel="up" href="index.html" title="Chapter 11. HTTP Web Services">
<link rel="previous" href="redirects.html" title="11.7. Handling redirects">
<link rel="next" href="alltogether.html" title="11.9. Putting it all together">
</head>
<body>
<table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td id="navigation" align="right" valign="top"> <a href="redirects.html" title="Prev: “Handling redirects”">&lt;&lt;</a> <a href="alltogether.html" title="Next: “Putting it all together”">&gt;&gt;</a></td>
</tr>
<tr>
<td colspan="3" id="logocontainer">
<h1 id="logo"><a href="../index.html" accesskey="1">Dive Into Python</a></h1>
<p id="tagline">Python from novice to pro</p>
</td>
<td colspan="3" align="right">
</td>
</tr>
</table>
<div class="section" lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a name="oa.gzip"></a>11.8. Handling compressed data
</h2>
</div>
</div>
<div></div>
</div>
<div class="abstract">
<p>The last important HTTP feature you want to support is compression. Many web services can send compressed data,
which can cut the amount of data sent over the wire by 60% or more. This is especially true of XML web services, since
XML data compresses very well.
</p>
</div>
<p>Servers won't give you compressed data unless you tell them you can handle it.</p>
<div class="example"><a name="d0e29136"></a><h3 class="title">Example 11.14. Telling the server you would like compressed data</h3><pre class="screen">
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>import</span> urllib2, httplib</span>
<tt class="prompt">>>> </tt><span class="userinput">httplib.HTTPConnection.debuglevel = 1</span>
<tt class="prompt">>>> </tt><span class="userinput">request = urllib2.Request(<span class='pystring'>'http://diveintomark.org/xml/atom.xml'</span>)</span>
<tt class="prompt">>>> </tt><span class="userinput">request.add_header(<span class='pystring'>'Accept-encoding'</span>, <span class='pystring'>'gzip'</span>)</span> <a name="oa.gzip.1.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">opener = urllib2.build_opener()</span>
<tt class="prompt">>>> </tt><span class="userinput">f = opener.open(request)</span>
<span class="computeroutput">connect: (diveintomark.org, 80)
send: '
GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/2.1
Accept-encoding: gzip</span> <a name="oa.gzip.1.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
<span class="computeroutput">'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 15 Apr 2004 22:24:39 GMT
header: Server: Apache/2.0.49 (Debian GNU/Linux)
header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT
header: ETag: "e842a-3e53-55d97640"
header: Accept-Ranges: bytes
header: Vary: Accept-Encoding
header: Content-Encoding: gzip</span> <a name="oa.gzip.1.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12">
<span class="computeroutput">header: Content-Length: 6289</span> <a name="oa.gzip.1.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
<span class="computeroutput">header: Connection: close
header: Content-Type: application/atom+xml</span>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the key: once you've created your <tt class="classname">Request</tt> object, add an <tt class="literal">Accept-encoding</tt> header to tell the server you can accept gzip-encoded data. <tt class="literal">gzip</tt> is the name of the compression algorithm you're using. HTTP defines other compression schemes as well (such as <tt class="literal">deflate</tt> and <tt class="literal">compress</tt>), but <tt class="literal">gzip</tt> is by far the one most widely supported by web servers.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">There's your header going across the wire.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">And here's what the server sends back: the <tt class="literal">Content-Encoding: gzip</tt> header means that the data you're about to receive has been gzip-compressed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.1.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <tt class="literal">Content-Length</tt> header is the length of the compressed data, not the uncompressed data. As you'll see in a minute, the actual length of
the uncompressed data was 15955, so gzip compression cut your bandwidth by over 60%!
</td>
</tr>
</table>
</div>
</div>
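<p>If you are working in a newer <span class="application">Python</span>, the same idea can be sketched without touching the network. This is only an illustration under an assumption the example above does not make: it uses <tt class="filename">urllib.request</tt>, the <span class="application">Python</span> 3 counterpart of <tt class="filename">urllib2</tt>. It builds the request, adds the header, and then inspects what would be sent.</p>

```python
import urllib.request

# Build the request object; nothing is sent over the network at this point.
request = urllib.request.Request('http://diveintomark.org/xml/atom.xml')

# Tell the server that this client can accept gzip-encoded responses.
request.add_header('Accept-encoding', 'gzip')

# Look at the header that would go across the wire with this request.
accept_encoding = request.get_header('Accept-encoding')
print(accept_encoding)
```

<p>Inspecting the <tt class="classname">Request</tt> object like this is a cheap way to confirm the header is set before you turn on debugging and open the URL for real.</p>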
<div class="example"><a name="d0e29222"></a><h3 class="title">Example 11.15. Decompressing the data</h3><pre class="screen">
<tt class="prompt">>>> </tt><span class="userinput">compresseddata = f.read()</span> <a name="oa.gzip.2.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">len(compresseddata)</span>
<span class="computeroutput">6289</span>
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>import</span> StringIO</span>
<tt class="prompt">>>> </tt><span class="userinput">compressedstream = StringIO.StringIO(compresseddata)</span> <a name="oa.gzip.2.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>import</span> gzip</span>
<tt class="prompt">>>> </tt><span class="userinput">gzipper = gzip.GzipFile(fileobj=compressedstream)</span> <a name="oa.gzip.2.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">data = gzipper.read()</span> <a name="oa.gzip.2.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>print</span> data</span> <a name="oa.gzip.2.5"></a><img src="../images/callouts/5.png" alt="5" border="0" width="12" height="12">
<span class="computeroutput">&lt;?xml version="1.0" encoding="iso-8859-1"?&gt;
&lt;feed version="0.3"
xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:lang="en"&gt;
&lt;title mode="escaped"&gt;dive into mark&lt;/title&gt;
&lt;link rel="alternate" type="text/html" href="http://diveintomark.org/"/&gt;
&lt;-- rest of feed omitted for brevity --&gt;</span>
<tt class="prompt">>>> </tt><span class="userinput">len(data)</span>
<span class="computeroutput">15955</span>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Continuing from the previous example, <tt class="varname">f</tt> is the file-like object returned from the URL opener. Using its <tt class="methodname">read()</tt> method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first
step towards getting the data you really want.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">OK, this step is a bit of a messy workaround. <span class="application">Python</span> has a <tt class="filename">gzip</tt> module, which reads (and actually writes) gzip-compressed files on disk. But you don't have a file on disk; you have a gzip-compressed
buffer in memory, and you don't want to write out a temporary file just so you can uncompress it. So what you're going to
do is create a file-like object out of the in-memory data (<tt class="varname">compresseddata</tt>), using the <tt class="filename">StringIO</tt> module. You first saw the <tt class="filename">StringIO</tt> module in <a href="../scripts_and_streams/index.html#kgp.openanything.stringio.example" title="Example 10.4. Introducing StringIO">the previous chapter</a>, but now you've found another use for it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now you can create an instance of <tt class="classname">GzipFile</tt>, and tell it that its “<span class="quote">file</span>” is the file-like object <tt class="varname">compressedstream</tt>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">This is the line that does all the actual work: “<span class="quote">reading</span>” from <tt class="classname">GzipFile</tt> will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. <tt class="varname">gzipper</tt> is a file-like object which represents a gzip-compressed file. That “<span class="quote">file</span>” is not a real file on disk, though; <tt class="varname">gzipper</tt> is really just “<span class="quote">reading</span>” from the file-like object you created with <tt class="filename">StringIO</tt> to wrap the compressed data, which is only in memory in the variable <tt class="varname">compresseddata</tt>. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by “<span class="quote">reading</span>” from the file-like object returned by the opener you built with <tt class="function">urllib2.build_opener</tt>. And amazingly, this all just works. No step in the chain has any idea that the previous step is faking it.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.2.5"><img src="../images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Look ma, real data. (15955 bytes of it, in fact.)</td>
</tr>
</table>
</div>
</div>
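<p>The same chain can be sketched in <span class="application">Python</span> 3 without a server. Treat this as a minimal sketch under stated assumptions: <tt class="classname">io.BytesIO</tt> stands in for <tt class="filename">StringIO</tt> (compressed data is bytes), the payload is made up and compressed locally with <tt class="function">gzip.compress</tt>, and the helper names <tt class="function">gunzip_bytes</tt> and <tt class="function">fraction_saved</tt> are hypothetical. The last line redoes the bandwidth arithmetic with the two lengths from the examples above.</p>

```python
import gzip
import io


def gunzip_bytes(compresseddata):
    """Decompress a gzip-compressed buffer that lives only in memory."""
    # Wrap the in-memory bytes in a file-like object...
    compressedstream = io.BytesIO(compresseddata)
    # ...and tell GzipFile that this object is its "file".
    gzipper = gzip.GzipFile(fileobj=compressedstream)
    try:
        # "Reading" from the GzipFile is what decompresses the data.
        return gzipper.read()
    finally:
        gzipper.close()


def fraction_saved(compressed_length, uncompressed_length):
    """Return the fraction of bytes that compression kept off the wire."""
    return 1.0 - compressed_length / uncompressed_length


# A made-up, repetitive XML-like payload standing in for the real feed.
original = b'<entry><title>dive into mark</title></entry>\n' * 200
compresseddata = gzip.compress(original)

data = gunzip_bytes(compresseddata)
print(len(compresseddata), len(data))

# The lengths reported in the examples above: 6289 compressed, 15955 uncompressed.
saved = fraction_saved(6289, 15955)
print(round(saved, 3))
```

<p>The helper never writes a temporary file; the compressed bytes, the file-like wrapper, and the decompressed result all stay in memory.</p>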
<p>“<span class="quote">But wait!</span>” I hear you cry. “<span class="quote">This could be even easier!</span>” I know what you're thinking. You're thinking that <tt class="varname">opener.open</tt> returns a file-like object, so why not cut out the <tt class="filename">StringIO</tt> middleman and just pass <tt class="varname">f</tt> directly to <tt class="methodname">GzipFile</tt>? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work.
</p>
<div class="example"><a name="d0e29389"></a><h3 class="title">Example 11.16. Decompressing the data directly from the server</h3><pre class="screen">
<tt class="prompt">>>> </tt><span class="userinput">f = opener.open(request)</span> <a name="oa.gzip.3.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">f.headers.get(<span class='pystring'>'Content-Encoding'</span>)</span> <a name="oa.gzip.3.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
<span class="computeroutput">'gzip'</span>
<tt class="prompt">>>> </tt><span class="userinput">data = gzip.GzipFile(fileobj=f).read()</span> <a name="oa.gzip.3.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12">
<span class="traceback">Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in ?
  File "c:\python23\lib\gzip.py", line 217, in read
    self._read(readsize)
  File "c:\python23\lib\gzip.py", line 252, in _read
    pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'</span>
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.3.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Continuing from the previous example, you already have a <tt class="classname">Request</tt> object set up with an <tt class="literal">Accept-encoding: gzip</tt> header.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.3.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Simply opening the request will get you the headers (though it won't download any data yet). As you can see from the returned
<tt class="literal">Content-Encoding</tt> header, this data has been sent gzip-compressed.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#oa.gzip.3.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Since <tt class="methodname">opener.open</tt> returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data,
why not simply pass that file-like object directly to <tt class="classname">GzipFile</tt>? As you “<span class="quote">read</span>” from the <tt class="classname">GzipFile</tt> instance, it will “<span class="quote">read</span>” compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't
work. <tt class="classname">GzipFile</tt> needs to save its position and move forwards and backwards through the compressed file; as the traceback shows, it calls <tt class="methodname">tell()</tt> on the underlying file object. This doesn't work when the “<span class="quote">file</span>” is a stream of bytes coming from a remote server; all you can do with it is read the bytes in order as they arrive, not move back
and forth through the data stream. So the inelegant hack of using <tt class="filename">StringIO</tt> is the best solution: download the compressed data, create a file-like object out of it with <tt class="filename">StringIO</tt>, and then decompress the data from that.
</td>
</tr>
</table>
</div>
</div>
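<p>Putting the pieces together, the buffer-first approach might look like the sketch below. It is written for <span class="application">Python</span> 3 and rests on several assumptions: <tt class="classname">ForwardOnlyStream</tt> is a hypothetical stand-in for a network response that offers only <tt class="methodname">read()</tt>, <tt class="function">read_body</tt> is a hypothetical helper, and the payload is made up. Newer versions of the <tt class="filename">gzip</tt> module may treat such streams differently from the version in the traceback above, so the sketch simply buffers everything first and decompresses only when the <tt class="literal">Content-Encoding</tt> value says it should.</p>

```python
import gzip
import io


class ForwardOnlyStream:
    """Hypothetical stand-in for a network response: read() only, no tell() or seek()."""

    def __init__(self, payload):
        self._buffer = io.BytesIO(payload)

    def read(self, size=-1):
        # Hand back the next bytes in order; there is no way to move backwards.
        return self._buffer.read(size)


def read_body(stream, content_encoding):
    """Download the whole body first, then decompress it only if it is gzip-encoded."""
    raw = stream.read()
    if content_encoding == 'gzip':
        # Wrap the buffered bytes in a seekable, in-memory file-like object.
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gzipper:
            return gzipper.read()
    # Without a gzip Content-Encoding, the body is used exactly as received.
    return raw


original = b'<feed><title>dive into mark</title></feed>'

compressed_response = ForwardOnlyStream(gzip.compress(original))
print(hasattr(compressed_response, 'tell'))
print(read_body(compressed_response, 'gzip') == original)

plain_response = ForwardOnlyStream(original)
print(read_body(plain_response, None) == original)
```

<p>Checking the header before decompressing matters because a server is free to ignore your <tt class="literal">Accept-encoding</tt> request and send the data uncompressed.</p>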
</div>
<table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td width="35%" align="left"><br><a class="NavigationArrow" href="redirects.html">&lt;&lt; Handling redirects</a></td>
<td width="30%" align="center"><br> <span class="divider">|</span> <a href="index.html#oa.divein" title="11.1. Diving in">1</a> <span class="divider">|</span> <a href="review.html" title="11.2. How not to fetch data over HTTP">2</a> <span class="divider">|</span> <a href="http_features.html" title="11.3. Features of HTTP">3</a> <span class="divider">|</span> <a href="debugging.html" title="11.4. Debugging HTTP web services">4</a> <span class="divider">|</span> <a href="user_agent.html" title="11.5. Setting the User-Agent">5</a> <span class="divider">|</span> <a href="etags.html" title="11.6. Handling Last-Modified and ETag">6</a> <span class="divider">|</span> <a href="redirects.html" title="11.7. Handling redirects">7</a> <span class="divider">|</span> <span class="thispage">8</span> <span class="divider">|</span> <a href="alltogether.html" title="11.9. Putting it all together">9</a> <span class="divider">|</span> <a href="summary.html" title="11.10. Summary">10</a> <span class="divider">|</span>
</td>
<td width="35%" align="right"><br><a class="NavigationArrow" href="alltogether.html">Putting it all together &gt;&gt;</a></td>
</tr>
<tr>
<td colspan="3"><br></td>
</tr>
</table>
<div class="Footer">
<p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython.org">Mark Pilgrim</a></p>
</div>
</body>
</html>