<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>HTML Sanitization [Universal Feed Parser]</title>
<link rel="stylesheet" href="feedparser.css" type="text/css">
<link rev="made" href="mailto:mark@diveintomark.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.65.1">
<meta name="keywords" content="RSS, Atom, CDF, XML, feed, parser, Python">
<link rel="start" href="index.html" title="Documentation">
<link rel="up" href="advanced.html" title="Advanced Features">
<link rel="prev" href="date-parsing.html" title="Date Parsing">
<link rel="next" href="content-normalization.html" title="Content Normalization">
</head>
<body id="feedparser-org" class="docs">
<div class="z" id="intro"><div class="sectionInner"><div class="sectionInner2">
<div class="s" id="pageHeader">
<h1><a href="/"><span>Universal Feed Parser</span></a></h1>
<p><span>Parse RSS and Atom feeds in Python. 3000 unit tests. Open source.</span></p>
</div>
<div class="s" id="quickSummary"><ul>
<li class="li1">
<a href="http://sourceforge.net/projects/feedparser/"><span>Download</span></a> ·</li>
<li class="li2">
<a href="http://feedparser.org/docs/"><span>Documentation</span></a> ·</li>
<li class="li3">
<a href="http://feedparser.org/tests/"><span>Unit tests</span></a> ·</li>
<li class="li4"><a href="http://sourceforge.net/tracker/?func=browse&amp;group_id=112328&amp;atid=661937"><span>Report a bug</span></a></li>
</ul></div>
</div></div></div>
<div id="main"><div id="mainInner">
<div class="section" lang="en">
<div class="titlepage">
<div><div><h2 class="title">
<a name="advanced.sanitization" class="skip" href="#advanced.sanitization" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> <acronym title="HyperText Markup Language">HTML</acronym> Sanitization</h2></div></div>
<div></div>
</div>
<div class="abstract"><p>Many feed elements may contain <acronym title="HyperText Markup Language">HTML</acronym> markup, and many feed aggregators use a web browser (or browser component) to display content. By default, <span class="application">Universal Feed Parser</span> sanitizes <acronym title="HyperText Markup Language">HTML</acronym> markup in several elements, removing <acronym title="HyperText Markup Language">HTML</acronym> tags and attributes that could introduce Javascript or other security risks.</p></div>
<p>These elements are sanitized by default:</p>
<div class="itemizedlist"><ul>
<li><a href="reference-feed-title.html" title="feed.title">feed.title</a></li>
<li><a href="reference-feed-subtitle.html" title="feed.subtitle">feed.subtitle</a></li>
<li><a href="reference-feed-info.html" title="feed.info">feed.info</a></li>
<li><a href="reference-feed-rights.html" title="feed.rights">feed.rights</a></li>
<li><a href="reference-entry-title.html" title="entries[i].title">entries[i].title</a></li>
<li><a href="reference-entry-summary.html" title="entries[i].summary">entries[i].summary</a></li>
<li><a href="reference-entry-content.html" title="entries[i].content">entries[i].content</a></li>
</ul></div>
<p>The following <acronym title="HyperText Markup Language">HTML</acronym> tags are allowed by default (all others are stripped):
<span class="simplelist"><tt class="sgmltag-element">a</tt>, <tt class="sgmltag-element">abbr</tt>, <tt class="sgmltag-element">acronym</tt>, <tt class="sgmltag-element">address</tt>, <tt class="sgmltag-element">area</tt>, <tt class="sgmltag-element">b</tt>, <tt class="sgmltag-element">big</tt>, <tt class="sgmltag-element">blockquote</tt>, <tt class="sgmltag-element">br</tt>, <tt class="sgmltag-element">button</tt>, <tt class="sgmltag-element">caption</tt>, <tt class="sgmltag-element">center</tt>, <tt class="sgmltag-element">cite</tt>, <tt class="sgmltag-element">code</tt>, <tt class="sgmltag-element">col</tt>, <tt class="sgmltag-element">colgroup</tt>, <tt class="sgmltag-element">dd</tt>, <tt class="sgmltag-element">del</tt>, <tt class="sgmltag-element">dfn</tt>, <tt class="sgmltag-element">dir</tt>, <tt class="sgmltag-element">div</tt>, <tt class="sgmltag-element">dl</tt>, <tt class="sgmltag-element">dt</tt>, <tt class="sgmltag-element">em</tt>, <tt class="sgmltag-element">fieldset</tt>, <tt class="sgmltag-element">font</tt>, <tt class="sgmltag-element">form</tt>, <tt class="sgmltag-element">h1</tt>, <tt class="sgmltag-element">h2</tt>, <tt class="sgmltag-element">h3</tt>, <tt class="sgmltag-element">h4</tt>, <tt class="sgmltag-element">h5</tt>, <tt class="sgmltag-element">h6</tt>, <tt class="sgmltag-element">hr</tt>, <tt class="sgmltag-element">i</tt>, <tt class="sgmltag-element">img</tt>, <tt class="sgmltag-element">input</tt>, <tt class="sgmltag-element">ins</tt>, <tt class="sgmltag-element">kbd</tt>, <tt class="sgmltag-element">label</tt>, <tt class="sgmltag-element">legend</tt>, <tt class="sgmltag-element">li</tt>, <tt class="sgmltag-element">map</tt>, <tt class="sgmltag-element">menu</tt>, <tt class="sgmltag-element">ol</tt>, <tt class="sgmltag-element">optgroup</tt>, <tt class="sgmltag-element">option</tt>, <tt class="sgmltag-element">p</tt>, <tt class="sgmltag-element">pre</tt>, <tt class="sgmltag-element">q</tt>, <tt class="sgmltag-element">s</tt>, <tt 
class="sgmltag-element">samp</tt>, <tt class="sgmltag-element">select</tt>, <tt class="sgmltag-element">small</tt>, <tt class="sgmltag-element">span</tt>, <tt class="sgmltag-element">strike</tt>, <tt class="sgmltag-element">strong</tt>, <tt class="sgmltag-element">sub</tt>, <tt class="sgmltag-element">sup</tt>, <tt class="sgmltag-element">table</tt>, <tt class="sgmltag-element">tbody</tt>, <tt class="sgmltag-element">td</tt>, <tt class="sgmltag-element">textarea</tt>, <tt class="sgmltag-element">tfoot</tt>, <tt class="sgmltag-element">th</tt>, <tt class="sgmltag-element">thead</tt>, <tt class="sgmltag-element">tr</tt>, <tt class="sgmltag-element">tt</tt>, <tt class="sgmltag-element">u</tt>, <tt class="sgmltag-element">ul</tt>, <tt class="sgmltag-element">var</tt></span>
</p>
<p>The following <acronym title="HyperText Markup Language">HTML</acronym> attributes are allowed by default (all others are stripped):
<span class="simplelist"><tt class="sgmltag-attribute">abbr</tt>, <tt class="sgmltag-attribute">accept</tt>, <tt class="sgmltag-attribute">accept-charset</tt>, <tt class="sgmltag-attribute">accesskey</tt>, <tt class="sgmltag-attribute">action</tt>, <tt class="sgmltag-attribute">align</tt>, <tt class="sgmltag-attribute">alt</tt>, <tt class="sgmltag-attribute">axis</tt>, <tt class="sgmltag-attribute">border</tt>, <tt class="sgmltag-attribute">cellpadding</tt>, <tt class="sgmltag-attribute">cellspacing</tt>, <tt class="sgmltag-attribute">char</tt>, <tt class="sgmltag-attribute">charoff</tt>, <tt class="sgmltag-attribute">charset</tt>, <tt class="sgmltag-attribute">checked</tt>, <tt class="sgmltag-attribute">cite</tt>, <tt class="sgmltag-attribute">class</tt>, <tt class="sgmltag-attribute">clear</tt>, <tt class="sgmltag-attribute">cols</tt>, <tt class="sgmltag-attribute">colspan</tt>, <tt class="sgmltag-attribute">color</tt>, <tt class="sgmltag-attribute">compact</tt>, <tt class="sgmltag-attribute">coords</tt>, <tt class="sgmltag-attribute">datetime</tt>, <tt class="sgmltag-attribute">dir</tt>, <tt class="sgmltag-attribute">disabled</tt>, <tt class="sgmltag-attribute">enctype</tt>, <tt class="sgmltag-attribute">for</tt>, <tt class="sgmltag-attribute">frame</tt>, <tt class="sgmltag-attribute">headers</tt>, <tt class="sgmltag-attribute">height</tt>, <tt class="sgmltag-attribute">href</tt>, <tt class="sgmltag-attribute">hreflang</tt>, <tt class="sgmltag-attribute">hspace</tt>, <tt class="sgmltag-attribute">id</tt>, <tt class="sgmltag-attribute">ismap</tt>, <tt class="sgmltag-attribute">label</tt>, <tt class="sgmltag-attribute">lang</tt>, <tt class="sgmltag-attribute">longdesc</tt>, <tt class="sgmltag-attribute">maxlength</tt>, <tt class="sgmltag-attribute">media</tt>, <tt class="sgmltag-attribute">method</tt>, <tt class="sgmltag-attribute">multiple</tt>, <tt class="sgmltag-attribute">name</tt>, <tt class="sgmltag-attribute">nohref</tt>, <tt 
class="sgmltag-attribute">noshade</tt>, <tt class="sgmltag-attribute">nowrap</tt>, <tt class="sgmltag-attribute">prompt</tt>, <tt class="sgmltag-attribute">readonly</tt>, <tt class="sgmltag-attribute">rel</tt>, <tt class="sgmltag-attribute">rev</tt>, <tt class="sgmltag-attribute">rows</tt>, <tt class="sgmltag-attribute">rowspan</tt>, <tt class="sgmltag-attribute">rules</tt>, <tt class="sgmltag-attribute">scope</tt>, <tt class="sgmltag-attribute">selected</tt>, <tt class="sgmltag-attribute">shape</tt>, <tt class="sgmltag-attribute">size</tt>, <tt class="sgmltag-attribute">span</tt>, <tt class="sgmltag-attribute">src</tt>, <tt class="sgmltag-attribute">start</tt>, <tt class="sgmltag-attribute">summary</tt>, <tt class="sgmltag-attribute">tabindex</tt>, <tt class="sgmltag-attribute">target</tt>, <tt class="sgmltag-attribute">title</tt>, <tt class="sgmltag-attribute">type</tt>, <tt class="sgmltag-attribute">usemap</tt>, <tt class="sgmltag-attribute">valign</tt>, <tt class="sgmltag-attribute">value</tt>, <tt class="sgmltag-attribute">vspace</tt>, <tt class="sgmltag-attribute">width</tt></span>
</p>
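<p>The whitelist approach described above can be sketched in a few lines of Python. This is a minimal illustration built on the standard library's <tt class="literal">html.parser</tt>, not <span class="application">Universal Feed Parser</span>'s own implementation. The names <tt class="literal">WhitelistSanitizer</tt>, <tt class="literal">sanitize</tt>, and <tt class="literal">DROP_CONTENT</tt> are hypothetical, the allow-lists are deliberately tiny subsets of the lists above, and the sketch assumes that the contents of <tt class="sgmltag-element">script</tt> and <tt class="sgmltag-element">style</tt> elements should be discarded along with the tags. It does not inspect attribute values such as URLs.</p>

```python
from html import escape
from html.parser import HTMLParser

# Deliberately tiny subsets of the allow-lists above, for illustration only.
ALLOWED_TAGS = {"a", "b", "em", "p", "span", "strong"}
ALLOWED_ATTRS = {"class", "href", "rel", "title"}
# Assumption: text inside these elements is dropped together with the tag.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    """Hypothetical whitelist filter; not Universal Feed Parser's own code."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are stripped; their text content is kept
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup with every non-whitelisted tag and attribute removed."""
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


dirty = '<p style="color:red" class="x" onclick="evil()">Hi <script>alert(1)</script><b>there</b></p>'
print(sanitize(dirty))
# <p class="x">Hi <b>there</b></p>
```

<p>Note that the filter never tries to recognize dangerous markup. Anything it has not been told is safe is dropped, which is why the <tt class="sgmltag-attribute">style</tt> and <tt class="sgmltag-attribute">onclick</tt> attributes disappear without any special handling.</p>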
<a name="id4956096"></a><table class="note" border="0" summary="">
<tr><td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td></tr>
<tr><td colspan="2" align="left" valign="top" width="99%">The <a href="http://feedparser.org/tests/wellformed/sanitize/">unit tests for <acronym title="HyperText Markup Language">HTML</acronym> sanitizing</a> show many different examples of dangerous markup that <span class="application">Universal Feed Parser</span> sanitizes by default.</td></tr>
</table>
<p>One emerging technology that affects feed parsing is the inclusion of <a href="http://microformats.org/">microformats</a> within syndicated content. Briefly, publishers can add additional semantics to their <acronym title="HyperText Markup Language">HTML</acronym> content using <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> attributes. <span class="application">Universal Feed Parser</span> does not currently parse microformat content within embedded <acronym title="HyperText Markup Language">HTML</acronym> markup, but it doesn't destroy it either. Both the <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> attributes survive <acronym title="HyperText Markup Language">HTML</acronym> sanitizing, so applications built on <span class="application">Universal Feed Parser</span> that wish to parse microformat content are free to do so.</p>
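<p>Because <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> survive sanitizing, an application could look for them in already-sanitized content along these lines. This is only a sketch using the standard library's <tt class="literal">html.parser</tt>. The names <tt class="literal">RelClassCollector</tt> and <tt class="literal">collect_rel_and_class</tt> are hypothetical, the sample markup is invented, and real microformat parsing involves considerably more than collecting attribute tokens.</p>

```python
from html.parser import HTMLParser


class RelClassCollector(HTMLParser):
    """Records (tag, rel tokens, class tokens) for elements carrying either attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").split()
        classes = (attrs.get("class") or "").split()
        if rel or classes:
            self.found.append((tag, rel, classes))


def collect_rel_and_class(markup):
    """Return a list of (tag, rel tokens, class tokens) tuples found in markup."""
    collector = RelClassCollector()
    collector.feed(markup)
    collector.close()
    return collector.found


# Invented sample standing in for sanitized content such as entries[i].summary.
sample = (
    '<a rel="tag" href="http://example.org/tags/python">Python</a> '
    '<span class="vcard fn">Mark</span>'
)
print(collect_rel_and_class(sample))
# [('a', ['tag'], []), ('span', [], ['vcard', 'fn'])]
```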
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="advanced.sanitization.why" class="skip" href="#advanced.sanitization.why" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Whitelist, Don't Blacklist</h3></div></div>
<div></div>
</div>
<p>I am often asked why <span class="application">Universal Feed Parser</span> is so hard-assed about <acronym title="HyperText Markup Language">HTML</acronym> sanitizing. This topic usually comes up when someone notices that <span class="application">Universal Feed Parser</span> strips all <tt class="sgmltag-attribute">style</tt> attributes by default.</p>
<p>Here is an incomplete list of potentially dangerous <acronym title="HyperText Markup Language">HTML</acronym> tags and attributes:</p>
<div class="itemizedlist"><ul>
<li>
<tt class="sgmltag-element">script</tt>, which can contain malicious script</li>
<li>
<tt class="sgmltag-element">applet</tt>, <tt class="sgmltag-element">embed</tt>, and <tt class="sgmltag-element">object</tt>, which can automatically download and execute malicious code</li>
<li>
<tt class="sgmltag-element">meta</tt>, which can contain malicious redirects</li>
<li>
<tt class="sgmltag-attribute">onload</tt>, <tt class="sgmltag-attribute">onunload</tt>, and all other <tt class="sgmltag-attribute">on*</tt> attributes, which can contain malicious script</li>
<li>
<tt class="sgmltag-element">style</tt>, <tt class="sgmltag-element">link</tt>, and the <tt class="sgmltag-attribute">style</tt> attribute, which can contain malicious script</li>
</ul></div>
<p><span class="emphasis"><em><tt class="sgmltag-attribute">style</tt>?</em></span> Yes, <tt class="sgmltag-attribute">style</tt>. <acronym title="Cascading Style Sheets">CSS</acronym> definitions can contain executable code.</p>
<div class="example">
<a name="example.javascript" class="skip" href="#example.javascript" title="link to this example"><img src="images/permalink.gif" alt="[link]" title="link to this example" width="8" height="9"></a> <h3 class="title">Example: Embedding Javascript in <acronym title="Cascading Style Sheets">CSS</acronym></h3>
<p>This sample is taken from <a href="http://feedparser.org/docs/examples/rss20.xml">http://feedparser.org/docs/examples/rss20.xml</a>:</p>
<pre class="programlisting ">
&lt;description&gt;Watch out for
&amp;lt;span style="background: url(javascript:window.location='http://example.org/')"&amp;gt;
nasty tricks&amp;lt;/span&amp;gt;&lt;/description&gt;</pre>
<p>This sample is more advanced, and does not contain the keyword <tt class="literal">javascript:</tt> that many naive <acronym title="HyperText Markup Language">HTML</acronym> sanitizers scan for:</p>
<pre class="programlisting ">&lt;description&gt;Watch out for
&amp;lt;span style="any: expression(window.location='http://example.org/')"&amp;gt;
nasty tricks&amp;lt;/span&amp;gt;&lt;/description&gt;</pre>
<p>Internet Explorer for Windows will execute the Javascript in both of these examples.</p>
</div>
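<p>The difference between the two samples above is easy to demonstrate. The following sketch applies a deliberately weak blacklist check, of the kind a naive sanitizer might use, to both <tt class="sgmltag-attribute">style</tt> values, and then applies a whitelist check to the attribute name instead. The function names and the small <tt class="literal">ALLOWED_ATTRS</tt> subset are hypothetical and exist only for this illustration.</p>

```python
# Small subset of the attribute whitelist, for illustration only.
ALLOWED_ATTRS = {"class", "href", "rel", "title"}


def naive_blacklist_allows(style_value):
    """A deliberately weak check: reject only values containing 'javascript:'."""
    return "javascript:" not in style_value.lower()


def whitelist_allows(attr_name):
    """Accept an attribute only if its name is explicitly on the whitelist."""
    return attr_name.lower() in ALLOWED_ATTRS


url_trick = "background: url(javascript:window.location='http://example.org/')"
expression_trick = "any: expression(window.location='http://example.org/')"

print(naive_blacklist_allows(url_trick))         # False: the keyword is caught
print(naive_blacklist_allows(expression_trick))  # True: the value slips through
print(whitelist_allows("style"))                 # False: the value is never examined
```

<p>The whitelist check never has to decide whether a particular value is dangerous, because the <tt class="sgmltag-attribute">style</tt> attribute is rejected by name.</p>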
<p>Now consider that in <acronym title="HyperText Markup Language">HTML</acronym>, attribute values may be entity-encoded in several different ways.</p>
<div class="example">
<a name="example.javascript.encoded" class="skip" href="#example.javascript.encoded" title="link to this example"><img src="images/permalink.gif" alt="[link]" title="link to this example" width="8" height="9"></a> <h3 class="title">Example: Embedding encoded Javascript in <acronym title="Cascading Style Sheets">CSS</acronym></h3>
<p>To a browser, this:</p>
<pre class="programlisting ">&lt;span style="any: expression(window.location='http://example.org/')"&gt;</pre>
<p>is the same as this (without the line breaks):</p>
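<pre class="programlisting ">&lt;span style="&amp;#97;&amp;#110;&amp;#121;&amp;#58;&amp;#32;&amp;#101;&amp;#120;&amp;#112;&amp;#114;&amp;#101;
&amp;#115;&amp;#115;&amp;#105;&amp;#111;&amp;#110;&amp;#40;&amp;#119;&amp;#105;&amp;#110;&amp;#100;&amp;#111;&amp;#119;
&amp;#46;&amp;#108;&amp;#111;&amp;#99;&amp;#97;&amp;#116;&amp;#105;&amp;#111;&amp;#110;&amp;#61;&amp;#39;&amp;#104;
&amp;#116;&amp;#116;&amp;#112;&amp;#58;&amp;#47;&amp;#47;&amp;#101;&amp;#120;&amp;#97;&amp;#109;&amp;#112;&amp;#108;
&amp;#101;&amp;#46;&amp;#111;&amp;#114;&amp;#103;&amp;#47;&amp;#39;&amp;#41;"&gt;</pre>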
<p>which is the same as this (without the line breaks):</p>
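<pre class="programlisting ">&lt;span style="&amp;#x61;&amp;#x6e;&amp;#x79;&amp;#x3a;&amp;#x20;&amp;#x65;&amp;#x78;&amp;#x70;&amp;#x72;
&amp;#x65;&amp;#x73;&amp;#x73;&amp;#x69;&amp;#x6f;&amp;#x6e;&amp;#x28;&amp;#x77;&amp;#x69;&amp;#x6e;
&amp;#x64;&amp;#x6f;&amp;#x77;&amp;#x2e;&amp;#x6c;&amp;#x6f;&amp;#x63;&amp;#x61;&amp;#x74;&amp;#x69;
&amp;#x6f;&amp;#x6e;&amp;#x3d;&amp;#x27;&amp;#x68;&amp;#x74;&amp;#x74;&amp;#x70;&amp;#x3a;&amp;#x2f;
&amp;#x2f;&amp;#x65;&amp;#x78;&amp;#x61;&amp;#x6d;&amp;#x70;&amp;#x6c;&amp;#x65;&amp;#x2e;&amp;#x6f;
&amp;#x72;&amp;#x67;&amp;#x2f;&amp;#x27;&amp;#x29;"&gt;</pre>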
<p>And so on, plus several other variations, plus every combination of every variation.</p>
</div>
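<p>The equivalence of these encodings can be checked with Python's standard <tt class="literal">html.unescape</tt> function. In this sketch the helpers <tt class="literal">encode_decimal</tt> and <tt class="literal">encode_hex</tt> are hypothetical names introduced only to generate the two encoded forms shown above. A filter that scans the raw attribute text for a keyword such as <tt class="literal">expression</tt> finds nothing in either encoded form, even though both decode to the same value.</p>

```python
from html import unescape

plain = "any: expression(window.location='http://example.org/')"


def encode_decimal(text):
    """Encode every character as a decimal character reference."""
    return "".join("&#%d;" % ord(ch) for ch in text)


def encode_hex(text):
    """Encode every character as a hexadecimal character reference."""
    return "".join("&#x%x;" % ord(ch) for ch in text)


decimal_form = encode_decimal(plain)
hex_form = encode_hex(plain)

# A keyword scan over the raw text misses both encoded forms.
print("expression" in decimal_form)  # False
print("expression" in hex_form)      # False

# After decoding, all three spellings are the same attribute value.
print(unescape(decimal_form) == plain)  # True
print(unescape(hex_form) == plain)      # True
```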
<p>The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why <span class="application">Universal Feed Parser</span> uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code. I will not attempt to preserve “<span class="quote">just the good styles</span>”. All styles are stripped.</p>
<div class="furtherreading">
<h3>Elsewhere</h3>
<ul><li><a href="http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely">How to consume RSS safely</a></li></ul>
</div>
</div>
</div>
<div style="float: left">← <a class="NavigationArrow" href="date-parsing.html">Date Parsing</a>
</div>
<div style="text-align: right">
<a class="NavigationArrow" href="content-normalization.html">Content Normalization</a> →</div>
<hr style="clear:both">
<div class="footer"><p class="copyright">Copyright © 2004, 2005, 2006 Mark Pilgrim</p></div>
</div></div>
</body>
</html>