1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Character Encoding Detection [Universal Feed Parser]</title>
<link rel="stylesheet" href="feedparser.css" type="text/css">
<link rev="made" href="mailto:mark@diveintomark.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.65.1">
<meta name="keywords" content="RSS, Atom, CDF, XML, feed, parser, Python">
<link rel="start" href="index.html" title="Documentation">
<link rel="up" href="advanced.html" title="Advanced Features">
<link rel="prev" href="version-detection.html" title="Feed Type and Version Detection">
<link rel="next" href="bozo.html" title="Bozo Detection">
</head>
<body id="feedparser-org" class="docs">
<div class="z" id="intro"><div class="sectionInner"><div class="sectionInner2">
<div class="s" id="pageHeader">
<h1><a href="/"><span>Universal Feed Parser</span></a></h1>
<p><span>Parse RSS and Atom feeds in Python. 3000 unit tests. Open source.</span></p>
</div>
<div class="s" id="quickSummary"><ul>
<li class="li1">
<a href="http://sourceforge.net/projects/feedparser/"><span>Download</span></a> ·</li>
<li class="li2">
<a href="http://feedparser.org/docs/"><span>Documentation</span></a> ·</li>
<li class="li3">
<a href="http://feedparser.org/tests/"><span>Unit tests</span></a> ·</li>
<li class="li4"><a href="http://sourceforge.net/tracker/?func=browse&group_id=112328&atid=661937"><span>Report a bug</span></a></li>
</ul></div>
</div></div></div>
<div id="main"><div id="mainInner">
<p id="breadcrumb">You are here: <a href="index.html">Documentation</a> → <a href="advanced.html">Advanced Features</a> → <span class="thispage">Character Encoding Detection</span></p>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h2 class="title">
<a name="advanced.encoding" class="skip" href="#advanced.encoding" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Character Encoding Detection</h2></div></div>
<div></div>
</div>
<a name="id4960499"></a><table class="tip" border="0" summary="">
<tr><td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"></td></tr>
<tr><td colspan="2" align="left" valign="top" width="99%">Feeds may be published in any character encoding. <span class="application">Python</span> supports only a few character encodings by default. To support the maximum number of character encodings (and be able to parse the maximum number of feeds), you should install <tt class="filename">cjkcodecs</tt> and <tt class="filename">iconv_codec</tt>. Both are available at <a href="http://cjkpython.i18n.org/">http://cjkpython.i18n.org/</a>.</td></tr>
</table>
<div class="abstract"><p><a href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a> defines the interaction between <acronym title="Extensible Markup Language">XML</acronym> and <acronym title="Hypertext Transfer Protocol">HTTP</acronym> as it relates to character encoding. <acronym title="Extensible Markup Language">XML</acronym> and <acronym title="Hypertext Transfer Protocol">HTTP</acronym> have different ways of specifying character encoding and different defaults in case no encoding is specified, and determining which value takes precedence depends on a variety of factors.</p></div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="advanced.encoding.intro" class="skip" href="#advanced.encoding.intro" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Introduction to Character Encoding</h3></div></div>
<div></div>
</div>
<p>In <acronym title="Extensible Markup Language">XML</acronym>, the character encoding is optional and may be given in the <acronym title="Extensible Markup Language">XML</acronym> declaration in the first line of the document, like this:</p>
<div class="informalexample"><pre class="programlisting "><xml version="1.0" encoding="utf-8"?></pre></div>
<p>If no encoding is given, XML supports the use of a Byte Order Mark to identify the document as some flavor of UTF-32, UTF-16, or UTF-8. <a href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Section F of the <acronym title="Extensible Markup Language">XML</acronym> specification</a> outlines the process for determining the character encoding based on unique properties of the Byte Order Mark in the first two to four bytes of the document.</p>
<p>If no encoding is specified and no Byte Order Mark is present, <acronym title="Extensible Markup Language">XML</acronym> defaults to UTF-8.</p>
<p><acronym title="Hypertext Transfer Protocol">HTTP</acronym> uses <acronym>MIME</acronym> to define a method of specifying the character encoding, as part of the <tt class="literal">Content-Type</tt> <acronym title="Hypertext Transfer Protocol">HTTP</acronym> header, which looks like this:</p>
<div class="informalexample"><pre class="programlisting ">Content-Type: text/html; charset="utf-8"</pre></div>
<p>If no charset is specified, <acronym title="Hypertext Transfer Protocol">HTTP</acronym> defaults to <tt class="literal">iso-8859-1</tt>, but only for <tt class="literal">text/*</tt> media types. For other media types, the default encoding is undefined, which is where <acronym title="Request For Comments">RFC</acronym> 3023 comes in.</p>
<p>According to <acronym title="Request For Comments">RFC</acronym> 3023, if the media type given in the <tt class="literal">Content-Type</tt> <acronym title="Hypertext Transfer Protocol">HTTP</acronym> header is <tt class="literal">application/xml</tt>, <tt class="literal">application/xml-dtd</tt>, <tt class="literal">application/xml-external-parsed-entity</tt>, or any one of the subtypes of <tt class="literal">application/xml</tt> such as <tt class="literal">application/atom+xml</tt> or <tt class="literal">application/rss+xml</tt> or even <tt class="literal">application/rdf+xml</tt>, then the encoding is</p>
<div class="orderedlist"><ol type="1">
<li>the encoding given in the <tt class="varname">charset</tt> parameter of the <tt class="literal">Content-Type</tt> <acronym title="Hypertext Transfer Protocol">HTTP</acronym> header, or</li>
<li>the encoding given in the <tt class="sgmltag-attribute">encoding</tt> attribute of the <acronym title="Extensible Markup Language">XML</acronym> declaration within the document, or</li>
<li>
<tt class="literal">utf-8</tt>.</li>
</ol></div>
<p>On the other hand, if the media type given in the <tt class="literal">Content-Type</tt> <acronym title="Hypertext Transfer Protocol">HTTP</acronym> header is <tt class="literal">text/xml</tt>, <tt class="literal">text/xml-external-parsed-entity</tt>, or a subtype like <tt class="literal">text/AnythingAtAll+xml</tt>, then the encoding attribute of the <acronym title="Extensible Markup Language">XML</acronym> declaration within the document is ignored completely, and the encoding is</p>
<div class="orderedlist"><ol type="1">
<li>the encoding given in the charset parameter of the <tt class="literal">Content-Type</tt> <acronym title="Hypertext Transfer Protocol">HTTP</acronym> header, or</li>
<li>
<tt class="literal">us-ascii</tt>.</li>
</ol></div>
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="advanced.encoding.override" class="skip" href="#advanced.encoding.override" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Handling Incorrectly-Declared Encodings</h3></div></div>
<div></div>
</div>
<p><span class="application">Universal Feed Parser</span> initially uses the rules specified in <acronym title="Request For Comments">RFC</acronym> 3023 to determine the character encoding of the feed. If parsing succeeds, then that's that. If parsing fails, <span class="application">Universal Feed Parser</span> sets the <tt class="varname">bozo</tt> bit to <tt class="constant">1</tt> and sets <tt class="varname">bozo_exception</tt> to <tt class="constant">feedparser.CharacterEncodingOverride</tt>. Then it tries to reparse the feed with the following character encodings:</p>
<div class="orderedlist"><ol type="1">
<li>the encoding specified in the <acronym title="Extensible Markup Language">XML</acronym> declaration</li>
<li>the encoding sniffed from the first four bytes of the document (as per <a href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Section F</a>)</li>
<li>the encoding auto-detected by the <a href="http://chardet.feedparser.org/"><span class="application">Universal Encoding Detector</span></a>, if installed</li>
<li><tt class="literal">utf-8</tt></li>
<li><tt class="literal">windows-1252</tt></li>
</ol></div>
<p>If the character encoding can not be determined, <span class="application">Universal Feed Parser</span> sets the <tt class="varname">bozo</tt> bit to <tt class="constant">1</tt> and sets <tt class="varname">bozo_exception</tt> to <tt class="constant">feedparser.CharacterEncodingUnknown</tt>. In this case, parsed values will be strings, not Unicode strings.</p>
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="advanced.encoding.nonxml" class="skip" href="#advanced.encoding.nonxml" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Handling Incorrectly-Declared Media Types</h3></div></div>
<div></div>
</div>
<p><acronym title="Request For Comments">RFC</acronym> 3023 only applies when the feed is served over <acronym title="Hypertext Transfer Protocol">HTTP</acronym> with a <tt class="literal">Content-Type</tt> that declares the feed to be some kind of <acronym title="Extensible Markup Language">XML</acronym>. However, some web servers are severely misconfigured and serve feeds with a <tt class="literal">Content-Type</tt> of <tt class="literal">text/plain</tt>, <tt class="literal">application/octet-stream</tt>, or some completely bogus media type.</p>
<p><span class="application">Universal Feed Parser</span> will attempt to parse such feeds, but it will set the <tt class="varname">bozo</tt> bit to <tt class="constant">1</tt> and set <tt class="varname">bozo_exception</tt> to <tt class="constant">feedparser.NonXMLContentType</tt>.</p>
<div class="furtherreading">
<h3>Elsewhere</h3>
<ul>
<li><a href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023</a></li>
<li><a href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info">Section F of the XML specification</a></li>
<li><a href="http://www.imc.org/atom-syntax/mail-archive/msg05575.html">On the well-formedness of <acronym title="Extensible Markup Language">XML</acronym> documents served as <tt class="literal">text/plain</tt></a></li>
<li><a href="http://cjkpython.i18n.org/">CJKCodecs and iconv_codec</a></li>
</ul>
</div>
</div>
</div>
<div style="float: left">← <a class="NavigationArrow" href="version-detection.html">Feed Type and Version Detection</a>
</div>
<div style="text-align: right">
<a class="NavigationArrow" href="bozo.html">Bozo Detection</a> →</div>
<hr style="clear:both">
<div class="footer"><p class="copyright">Copyright © 2004, 2005, 2006 Mark Pilgrim</p></div>
</div></div>
</body>
</html>
|