<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>HTML Sanitization [Universal Feed Parser]</title>
<link rel="stylesheet" href="feedparser.css" type="text/css">
<link rev="made" href="mailto:mark@diveintomark.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.65.1">
<meta name="keywords" content="RSS, Atom, CDF, XML, feed, parser, Python">
<link rel="start" href="index.html" title="Documentation">
<link rel="up" href="advanced.html" title="Advanced Features">
<link rel="prev" href="date-parsing.html" title="Date Parsing">
<link rel="next" href="content-normalization.html" title="Content Normalization">
</head>
<body id="feedparser-org" class="docs">
<div class="z" id="intro"><div class="sectionInner"><div class="sectionInner2">
<div class="s" id="pageHeader">
<h1><a href="/"><span>Universal Feed Parser</span></a></h1>
<p><span>Parse RSS and Atom feeds in Python.  3000 unit tests.  Open source.</span></p>
</div>
<div class="s" id="quickSummary"><ul>
<li class="li1">
<a href="http://sourceforge.net/projects/feedparser/"><span>Download</span></a> ·</li>
<li class="li2">
<a href="http://feedparser.org/docs/"><span>Documentation</span></a> ·</li>
<li class="li3">
<a href="http://feedparser.org/tests/"><span>Unit tests</span></a> ·</li>
<li class="li4"><a href="http://sourceforge.net/tracker/?func=browse&amp;group_id=112328&amp;atid=661937"><span>Report a bug</span></a></li>
</ul></div>
</div></div></div>
<div id="main"><div id="mainInner">
<div class="section" lang="en">
<div class="titlepage">
<div><div><h2 class="title">
<a name="advanced.sanitization" class="skip" href="#advanced.sanitization" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> <acronym title="HyperText Markup Language">HTML</acronym> Sanitization</h2></div></div>
<div></div>
</div>
<div class="abstract"><p>Many feed elements may contain <acronym title="HyperText Markup Language">HTML</acronym> markup, and many feed aggregators use a web browser (or browser component) to display content.  By default, <span class="application">Universal Feed Parser</span> sanitizes <acronym title="HyperText Markup Language">HTML</acronym> markup in several elements, removing <acronym title="HyperText Markup Language">HTML</acronym> tags and attributes that could introduce JavaScript or other security risks.</p></div>
<p>These elements are sanitized by default:</p>
<div class="itemizedlist"><ul>
<li><a href="reference-feed-title.html" title="feed.title">feed.title</a></li>
<li><a href="reference-feed-subtitle.html" title="feed.subtitle">feed.subtitle</a></li>
<li><a href="reference-feed-info.html" title="feed.info">feed.info</a></li>
<li><a href="reference-feed-rights.html" title="feed.rights">feed.rights</a></li>
<li><a href="reference-entry-title.html" title="entries[i].title">entries[i].title</a></li>
<li><a href="reference-entry-summary.html" title="entries[i].summary">entries[i].summary</a></li>
<li><a href="reference-entry-content.html" title="entries[i].content">entries[i].content</a></li>
</ul></div>
<p>The following <acronym title="HyperText Markup Language">HTML</acronym> tags are allowed by default (all others are stripped):
<span class="simplelist"><tt class="sgmltag-element">a</tt>, <tt class="sgmltag-element">abbr</tt>, <tt class="sgmltag-element">acronym</tt>, <tt class="sgmltag-element">address</tt>, <tt class="sgmltag-element">area</tt>, <tt class="sgmltag-element">b</tt>, <tt class="sgmltag-element">big</tt>, <tt class="sgmltag-element">blockquote</tt>, <tt class="sgmltag-element">br</tt>, <tt class="sgmltag-element">button</tt>, <tt class="sgmltag-element">caption</tt>, <tt class="sgmltag-element">center</tt>, <tt class="sgmltag-element">cite</tt>, <tt class="sgmltag-element">code</tt>, <tt class="sgmltag-element">col</tt>, <tt class="sgmltag-element">colgroup</tt>, <tt class="sgmltag-element">dd</tt>, <tt class="sgmltag-element">del</tt>, <tt class="sgmltag-element">dfn</tt>, <tt class="sgmltag-element">dir</tt>, <tt class="sgmltag-element">div</tt>, <tt class="sgmltag-element">dl</tt>, <tt class="sgmltag-element">dt</tt>, <tt class="sgmltag-element">em</tt>, <tt class="sgmltag-element">fieldset</tt>, <tt class="sgmltag-element">font</tt>, <tt class="sgmltag-element">form</tt>, <tt class="sgmltag-element">h1</tt>, <tt class="sgmltag-element">h2</tt>, <tt class="sgmltag-element">h3</tt>, <tt class="sgmltag-element">h4</tt>, <tt class="sgmltag-element">h5</tt>, <tt class="sgmltag-element">h6</tt>, <tt class="sgmltag-element">hr</tt>, <tt class="sgmltag-element">i</tt>, <tt class="sgmltag-element">img</tt>, <tt class="sgmltag-element">input</tt>, <tt class="sgmltag-element">ins</tt>, <tt class="sgmltag-element">kbd</tt>, <tt class="sgmltag-element">label</tt>, <tt class="sgmltag-element">legend</tt>, <tt class="sgmltag-element">li</tt>, <tt class="sgmltag-element">map</tt>, <tt class="sgmltag-element">menu</tt>, <tt class="sgmltag-element">ol</tt>, <tt class="sgmltag-element">optgroup</tt>, <tt class="sgmltag-element">option</tt>, <tt class="sgmltag-element">p</tt>, <tt class="sgmltag-element">pre</tt>, <tt class="sgmltag-element">q</tt>, <tt class="sgmltag-element">s</tt>, <tt 
class="sgmltag-element">samp</tt>, <tt class="sgmltag-element">select</tt>, <tt class="sgmltag-element">small</tt>, <tt class="sgmltag-element">span</tt>, <tt class="sgmltag-element">strike</tt>, <tt class="sgmltag-element">strong</tt>, <tt class="sgmltag-element">sub</tt>, <tt class="sgmltag-element">sup</tt>, <tt class="sgmltag-element">table</tt>, <tt class="sgmltag-element">tbody</tt>, <tt class="sgmltag-element">td</tt>, <tt class="sgmltag-element">textarea</tt>, <tt class="sgmltag-element">tfoot</tt>, <tt class="sgmltag-element">th</tt>, <tt class="sgmltag-element">thead</tt>, <tt class="sgmltag-element">tr</tt>, <tt class="sgmltag-element">tt</tt>, <tt class="sgmltag-element">u</tt>, <tt class="sgmltag-element">ul</tt>, <tt class="sgmltag-element">var</tt></span>
</p>
<p>The following <acronym title="HyperText Markup Language">HTML</acronym> attributes are allowed by default (all others are stripped):
<span class="simplelist"><tt class="sgmltag-attribute">abbr</tt>, <tt class="sgmltag-attribute">accept</tt>, <tt class="sgmltag-attribute">accept-charset</tt>, <tt class="sgmltag-attribute">accesskey</tt>, <tt class="sgmltag-attribute">action</tt>, <tt class="sgmltag-attribute">align</tt>, <tt class="sgmltag-attribute">alt</tt>, <tt class="sgmltag-attribute">axis</tt>, <tt class="sgmltag-attribute">border</tt>, <tt class="sgmltag-attribute">cellpadding</tt>, <tt class="sgmltag-attribute">cellspacing</tt>, <tt class="sgmltag-attribute">char</tt>, <tt class="sgmltag-attribute">charoff</tt>, <tt class="sgmltag-attribute">charset</tt>, <tt class="sgmltag-attribute">checked</tt>, <tt class="sgmltag-attribute">cite</tt>, <tt class="sgmltag-attribute">class</tt>, <tt class="sgmltag-attribute">clear</tt>, <tt class="sgmltag-attribute">cols</tt>, <tt class="sgmltag-attribute">colspan</tt>, <tt class="sgmltag-attribute">color</tt>, <tt class="sgmltag-attribute">compact</tt>, <tt class="sgmltag-attribute">coords</tt>, <tt class="sgmltag-attribute">datetime</tt>, <tt class="sgmltag-attribute">dir</tt>, <tt class="sgmltag-attribute">disabled</tt>, <tt class="sgmltag-attribute">enctype</tt>, <tt class="sgmltag-attribute">for</tt>, <tt class="sgmltag-attribute">frame</tt>, <tt class="sgmltag-attribute">headers</tt>, <tt class="sgmltag-attribute">height</tt>, <tt class="sgmltag-attribute">href</tt>, <tt class="sgmltag-attribute">hreflang</tt>, <tt class="sgmltag-attribute">hspace</tt>, <tt class="sgmltag-attribute">id</tt>, <tt class="sgmltag-attribute">ismap</tt>, <tt class="sgmltag-attribute">label</tt>, <tt class="sgmltag-attribute">lang</tt>, <tt class="sgmltag-attribute">longdesc</tt>, <tt class="sgmltag-attribute">maxlength</tt>, <tt class="sgmltag-attribute">media</tt>, <tt class="sgmltag-attribute">method</tt>, <tt class="sgmltag-attribute">multiple</tt>, <tt class="sgmltag-attribute">name</tt>, <tt class="sgmltag-attribute">nohref</tt>, <tt 
class="sgmltag-attribute">noshade</tt>, <tt class="sgmltag-attribute">nowrap</tt>, <tt class="sgmltag-attribute">prompt</tt>, <tt class="sgmltag-attribute">readonly</tt>, <tt class="sgmltag-attribute">rel</tt>, <tt class="sgmltag-attribute">rev</tt>, <tt class="sgmltag-attribute">rows</tt>, <tt class="sgmltag-attribute">rowspan</tt>, <tt class="sgmltag-attribute">rules</tt>, <tt class="sgmltag-attribute">scope</tt>, <tt class="sgmltag-attribute">selected</tt>, <tt class="sgmltag-attribute">shape</tt>, <tt class="sgmltag-attribute">size</tt>, <tt class="sgmltag-attribute">span</tt>, <tt class="sgmltag-attribute">src</tt>, <tt class="sgmltag-attribute">start</tt>, <tt class="sgmltag-attribute">summary</tt>, <tt class="sgmltag-attribute">tabindex</tt>, <tt class="sgmltag-attribute">target</tt>, <tt class="sgmltag-attribute">title</tt>, <tt class="sgmltag-attribute">type</tt>, <tt class="sgmltag-attribute">usemap</tt>, <tt class="sgmltag-attribute">valign</tt>, <tt class="sgmltag-attribute">value</tt>, <tt class="sgmltag-attribute">vspace</tt>, <tt class="sgmltag-attribute">width</tt></span>
</p>
<a name="id4956096"></a><table class="note" border="0" summary="">
<tr><td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"></td></tr>
<tr><td colspan="2" align="left" valign="top" width="99%">The <a href="http://feedparser.org/tests/wellformed/sanitize/">unit tests for <acronym title="HyperText Markup Language">HTML</acronym> sanitizing</a> show many different examples of dangerous markup that <span class="application">Universal Feed Parser</span> sanitizes by default.</td></tr>
</table>
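<p>To make the whitelist idea concrete, here is a minimal sketch built on Python's standard <tt class="literal">html.parser</tt> module.  It is an illustration only, not <span class="application">Universal Feed Parser</span>'s actual implementation: the names <tt class="literal">WhitelistSanitizer</tt> and <tt class="literal">sanitize</tt> are hypothetical, the allowed sets are small subsets of the lists above, and the sketch makes no attempt to validate attribute values such as <acronym title="Uniform Resource Locator">URL</acronym>s.</p>

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subsets only; the full default lists appear above.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "span", "strong"}
ALLOWED_ATTRS = {"class", "href", "rel", "title"}
# Assumption for this sketch: drop the text inside these elements too.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    """Keep whitelisted tags and attributes; strip everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        # Attribute values arrive already decoded, so re-escape them.
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup with only whitelisted tags and attributes kept."""
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

<p>The point of the design is that nothing is kept unless it was explicitly listed, so an attribute such as <tt class="sgmltag-attribute">style</tt> or <tt class="sgmltag-attribute">onclick</tt> disappears without the code ever having to recognize it as dangerous.</p>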
<p>One emerging technology that affects feed parsing is the inclusion of <a href="http://microformats.org/">microformats</a> within syndicated content.  Briefly, publishers can add semantics to their <acronym title="HyperText Markup Language">HTML</acronym> content using <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> attributes.  <span class="application">Universal Feed Parser</span> does not currently parse microformat content within embedded <acronym title="HyperText Markup Language">HTML</acronym> markup, but it doesn't destroy it either.  Both the <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> attributes survive <acronym title="HyperText Markup Language">HTML</acronym> sanitizing, so applications built on <span class="application">Universal Feed Parser</span> that wish to parse microformat content are free to do so.</p>
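<p>As a rough sketch of what such an application might do, the following Python code walks already-sanitized markup and collects the <tt class="sgmltag-attribute">rel</tt> and <tt class="sgmltag-attribute">class</tt> values it finds.  The class <tt class="literal">RelClassCollector</tt> and the function <tt class="literal">collect_microformat_hints</tt> are hypothetical names, not part of <span class="application">Universal Feed Parser</span>, and a real microformat parser would need to do considerably more than this.</p>

```python
from html.parser import HTMLParser


class RelClassCollector(HTMLParser):
    """Record (tag, rel tokens, class tokens) for elements that carry either attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Both attributes hold space-separated token lists.
        rel = (attrs.get("rel") or "").split()
        classes = (attrs.get("class") or "").split()
        if rel or classes:
            self.found.append((tag, rel, classes))


def collect_microformat_hints(markup):
    """Return a list of (tag, rel tokens, class tokens) tuples found in markup."""
    collector = RelClassCollector()
    collector.feed(markup)
    collector.close()
    return collector.found
```

<p>Given a sanitized fragment such as <tt class="literal">&lt;a rel="tag" href="http://example.org/tags/python"&gt;python&lt;/a&gt;</tt>, the function returns one tuple for the link, with <tt class="literal">tag</tt> in its list of <tt class="sgmltag-attribute">rel</tt> tokens.</p>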
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="advanced.sanitization.why" class="skip" href="#advanced.sanitization.why" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Whitelist, Don't Blacklist</h3></div></div>
<div></div>
</div>
<p>I am often asked why <span class="application">Universal Feed Parser</span> is so hard-assed about <acronym title="HyperText Markup Language">HTML</acronym> sanitizing.  This topic usually comes up when someone notices that <span class="application">Universal Feed Parser</span> strips all <tt class="sgmltag-attribute">style</tt> attributes by default.</p>
<p>Here is an incomplete list of potentially dangerous <acronym title="HyperText Markup Language">HTML</acronym> tags and attributes:</p>
<div class="itemizedlist"><ul>
<li>
<tt class="sgmltag-element">script</tt>, which can contain malicious script</li>
<li>
<tt class="sgmltag-element">applet</tt>, <tt class="sgmltag-element">embed</tt>, and <tt class="sgmltag-element">object</tt>, which can automatically download and execute malicious code</li>
<li>
<tt class="sgmltag-element">meta</tt>, which can contain malicious redirects</li>
<li>
<tt class="sgmltag-attribute">onload</tt>, <tt class="sgmltag-attribute">onunload</tt>, and all other <tt class="sgmltag-attribute">on*</tt> attributes, which can contain malicious script</li>
<li>
<tt class="sgmltag-element">style</tt>, <tt class="sgmltag-element">link</tt>, and the <tt class="sgmltag-attribute">style</tt> attribute, which can contain malicious script</li>
</ul></div>
<p><span class="emphasis"><em><tt class="sgmltag-attribute">style</tt>?</em></span>  Yes, <tt class="sgmltag-attribute">style</tt>.  <acronym title="Cascading Style Sheets">CSS</acronym> definitions can contain executable code.</p>
<div class="example">
<a name="example.javascript" class="skip" href="#example.javascript" title="link to this example"><img src="images/permalink.gif" alt="[link]" title="link to this example" width="8" height="9"></a> <h3 class="title">Example: Embedding JavaScript in <acronym title="Cascading Style Sheets">CSS</acronym></h3>
<p>This sample is taken from <a href="http://feedparser.org/docs/examples/rss20.xml">http://feedparser.org/docs/examples/rss20.xml</a>:</p>
<pre class="programlisting ">
&lt;description&gt;Watch out for
&amp;lt;span style="background: url(javascript:window.location='http://example.org/')"&amp;gt;
nasty tricks&amp;lt;/span&amp;gt;&lt;/description&gt;</pre>
<p>This sample is more advanced, and does not contain the keyword <tt class="literal">javascript:</tt> that many naive <acronym title="HyperText Markup Language">HTML</acronym> sanitizers scan for:</p>
<pre class="programlisting ">&lt;description&gt;Watch out for
&amp;lt;span style="any: expression(window.location='http://example.org/')"&amp;gt;
nasty tricks&amp;lt;/span&amp;gt;&lt;/description&gt;</pre>
<p>Internet Explorer for Windows will execute the JavaScript in both of these examples.</p>
</div>
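<p>The difference between the two samples above can be sketched with a deliberately naive blacklist check in Python.  The function <tt class="literal">naive_blacklist_ok</tt> is a hypothetical stand-in for the kind of sanitizer that only scans for the <tt class="literal">javascript:</tt> keyword; it is not taken from any real library.</p>

```python
def naive_blacklist_ok(markup):
    """Deliberately naive: accept anything that does not mention 'javascript:'."""
    return "javascript:" not in markup.lower()


# The two style tricks from the samples above, as they appear after
# the feed's own XML escaping has been decoded.
url_trick = (
    "<span style=\"background: "
    "url(javascript:window.location='http://example.org/')\">"
)
expression_trick = (
    "<span style=\"any: "
    "expression(window.location='http://example.org/')\">"
)

caught = not naive_blacklist_ok(url_trick)        # True: keyword is present
missed = naive_blacklist_ok(expression_trick)     # True: check passes, trick survives
```

<p>The first sample is rejected, but the second passes the check untouched, because nothing in it matches the one string the blacklist knows about.</p>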
<p>Now consider that in <acronym title="HyperText Markup Language">HTML</acronym>, attribute values may be entity-encoded in several different ways.</p>
<div class="example">
<a name="example.javascript.encoded" class="skip" href="#example.javascript.encoded" title="link to this example"><img src="images/permalink.gif" alt="[link]" title="link to this example" width="8" height="9"></a> <h3 class="title">Example: Embedding encoded JavaScript in <acronym title="Cascading Style Sheets">CSS</acronym></h3>
<p>To a browser, this:</p>
<pre class="programlisting ">&lt;span style="any: expression(window.location='http://example.org/')"&gt;</pre>
<p>is the same as this (without the line breaks):</p>
<pre class="programlisting ">&lt;span style="&amp;#97;&amp;#110;&amp;#121;&amp;#58;&amp;#32;&amp;#101;&amp;#120;&amp;#112;&amp;#114;&amp;#101;
&amp;#115;&amp;#115;&amp;#105;&amp;#111;&amp;#110;&amp;#40;&amp;#119;&amp;#105;&amp;#110;&amp;#100;&amp;#111;&amp;#119;
&amp;#46;&amp;#108;&amp;#111;&amp;#99;&amp;#97;&amp;#116;&amp;#105;&amp;#111;&amp;#110;&amp;#61;&amp;#39;&amp;#104;
&amp;#116;&amp;#116;&amp;#112;&amp;#58;&amp;#47;&amp;#47;&amp;#101;&amp;#120;&amp;#97;&amp;#109;&amp;#112;&amp;#108;
&amp;#101;&amp;#46;&amp;#111;&amp;#114;&amp;#103;&amp;#47;&amp;#39;&amp;#41;"&gt;</pre>
<p>which is the same as this (without the line breaks):</p>
<pre class="programlisting ">&lt;span style="&amp;#x61;&amp;#x6e;&amp;#x79;&amp;#x3a;&amp;#x20;&amp;#x65;&amp;#x78;&amp;#x70;&amp;#x72;
&amp;#x65;&amp;#x73;&amp;#x73;&amp;#x69;&amp;#x6f;&amp;#x6e;&amp;#x28;&amp;#x77;&amp;#x69;&amp;#x6e;
&amp;#x64;&amp;#x6f;&amp;#x77;&amp;#x2e;&amp;#x6c;&amp;#x6f;&amp;#x63;&amp;#x61;&amp;#x74;&amp;#x69;
&amp;#x6f;&amp;#x6e;&amp;#x3d;&amp;#x27;&amp;#x68;&amp;#x74;&amp;#x74;&amp;#x70;&amp;#x3a;&amp;#x2f;
&amp;#x2f;&amp;#x65;&amp;#x78;&amp;#x61;&amp;#x6d;&amp;#x70;&amp;#x6c;&amp;#x65;&amp;#x2e;&amp;#x6f;
&amp;#x72;&amp;#x67;&amp;#x2f;&amp;#x27;&amp;#x29;"&gt;</pre>
<p>And so on, plus several other variations, plus every combination of every variation.</p>
</div>
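<p>The equivalence of these encodings can be checked with Python's standard <tt class="literal">html.unescape</tt> function.  This is a small sketch, and the helper <tt class="literal">contains_keyword</tt> is a hypothetical name used only for illustration: it builds the decimal and hexadecimal forms of the attribute value shown above, confirms that both decode to the same string, and shows that a keyword scan over the raw text finds nothing.</p>

```python
from html import unescape

plain = "any: expression(window.location='http://example.org/')"

# Encode every character as a decimal or hexadecimal character reference.
decimal = "".join("&#%d;" % ord(ch) for ch in plain)
hexadecimal = "".join("&#x%x;" % ord(ch) for ch in plain)


def contains_keyword(value, keyword="expression"):
    """Naive scan for a suspicious keyword in an attribute value."""
    return keyword in value.lower()


# Both encoded forms decode to the same attribute value...
same_after_decoding = unescape(decimal) == plain == unescape(hexadecimal)
# ...but a scan of the raw, still-encoded text sees no keyword at all.
raw_scan_misses = not contains_keyword(decimal) and not contains_keyword(hexadecimal)
```

<p>Decoding before scanning closes this particular gap, but it only helps if the scanner already knows every string a browser might treat as code, which is the assumption a blacklist cannot meet.</p>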
<p>The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it.  This is why <span class="application">Universal Feed Parser</span> uses a whitelist and not a blacklist.   I am reasonably confident that none of the elements or attributes on the whitelist are security risks.  I am not at all confident about elements or attributes that I have not explicitly investigated.  And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code.  I will not attempt to preserve “<span class="quote">just the good styles</span>”.  All styles are stripped.</p>
<div class="furtherreading">
<h3>Elsewhere</h3>
<ul><li><a href="http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely">How to consume RSS safely</a></li></ul>
</div>
</div>
</div>
<div style="float: left">← <a class="NavigationArrow" href="date-parsing.html">Date Parsing</a>
</div>
<div style="text-align: right">
<a class="NavigationArrow" href="content-normalization.html">Content Normalization</a> →</div>
<hr style="clear:both">
<div class="footer"><p class="copyright">Copyright © 2004, 2005, 2006 Mark Pilgrim</p></div>
</div></div>
</body>
</html>