1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241
|
From: Ryan Grove <ryan@wonko.com>
Date: Wed, 25 Jan 2023 17:42:24 -0800
Subject: Forcibly escape content in "unescaped text" elements inside math or
svg namespaces
Origin: https://github.com/rgrove/sanitize/commit/b4ee521df0d0616340c9648444be488381c238b1
This fixes an edge case in which the contents of an "unescaped text"
element (such as `<noembed>` or `<xmp>`) were not properly escaped if
that element was allowlisted and was also inside an allowlisted `<math>`
or `<svg>` element.
The only way to encounter this situation was to ignore multiple warnings
in the readme and create a custom config that allowlisted all the
elements involved, including `<math>` or `<svg>`.
Please let this be a reminder that Sanitize cannot safely sanitize
MathML or SVG content and does not support this use case. The default
configs don't allow MathML or SVG elements, and allowlisting MathML or
SVG elements in a custom config may create a security vulnerability in
your application.
Documentation has been updated to add more warnings and to make the
existing warnings about this more prominent.
---
HISTORY.md | 25 +++++++
README.md | 19 ++---
lib/sanitize/config/default.rb | 5 ++
lib/sanitize/transformers/clean_element.rb | 35 +++++++++
test/test_malicious_html.rb | 86 ++++++++++++++++++++++
5 files changed, 161 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
index 0f1f3fbfb352..0f6efb56fde0 100644
--- a/README.md
+++ b/README.md
@@ -72,10 +72,11 @@ Sanitize can sanitize the following types of input:
* Standalone CSS stylesheets
* Standalone CSS properties
-However, please note that Sanitize _cannot_ fully sanitize the contents of
-`<math>` or `<svg>` elements, since these elements don't follow the same parsing
-rules as the rest of HTML. If this is something you need, you may want to look
-for another solution.
+> **Warning**
+>
+> Sanitize cannot fully sanitize the contents of `<math>` or `<svg>` elements. MathML and SVG elements are [foreign elements](https://html.spec.whatwg.org/multipage/syntax.html#foreign-elements) that don't follow normal HTML parsing rules.
+>
+> By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you may create a security vulnerability in your application.
### HTML Fragments
@@ -420,11 +421,11 @@ elements not in this array will be removed.
]
```
-**Warning:** Sanitize cannot fully sanitize the contents of `<math>` or `<svg>`
-elements, since these elements don't follow the same parsing rules as the rest
-of HTML. If you add `math` or `svg` to the allowlist, you must assume that any
-content inside them will be allowed, even if that content would otherwise be
-removed by Sanitize.
+> **Warning**
+>
+> Sanitize cannot fully sanitize the contents of `<math>` or `<svg>` elements. MathML and SVG elements are [foreign elements](https://html.spec.whatwg.org/multipage/syntax.html#foreign-elements) that don't follow normal HTML parsing rules.
+>
+> By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you must assume that any content inside them will be allowed, even if that content would otherwise be removed or escaped by Sanitize. This may create a security vulnerability in your application.
#### :parser_options (Hash)
diff --git a/lib/sanitize/config/default.rb b/lib/sanitize/config/default.rb
index 4dd1d22d01f7..d22c1e84c326 100644
--- a/lib/sanitize/config/default.rb
+++ b/lib/sanitize/config/default.rb
@@ -54,6 +54,11 @@ class Sanitize
# HTML elements to allow. By default, no elements are allowed (which means
# that all HTML will be stripped).
+ #
+ # Warning: Sanitize cannot safely sanitize the contents of foreign
+ # elements (elements in the MathML or SVG namespaces). Do not add `math`
+ # or `svg` to this list! If you do, you may create a security
+ # vulnerability in your application.
:elements => [],
# HTML parsing options to pass to Nokogumbo.
diff --git a/lib/sanitize/transformers/clean_element.rb b/lib/sanitize/transformers/clean_element.rb
index a73994b0f0a8..a2bacd8198f7 100644
--- a/lib/sanitize/transformers/clean_element.rb
+++ b/lib/sanitize/transformers/clean_element.rb
@@ -1,5 +1,6 @@
# encoding: utf-8
+require 'cgi'
require 'set'
class Sanitize; module Transformers; class CleanElement
@@ -18,6 +19,18 @@ class Sanitize; module Transformers; class CleanElement
# http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#embedding-custom-non-visible-data-with-the-data-*-attributes
REGEX_DATA_ATTR = /\Adata-(?!xml)[a-z_][\w.\u00E0-\u00F6\u00F8-\u017F\u01DD-\u02AF-]*\z/u
+ # Elements whose content is treated as unescaped text by HTML parsers.
+ UNESCAPED_TEXT_ELEMENTS = Set.new(%w[
+ iframe
+ noembed
+ noframes
+ noscript
+ plaintext
+ script
+ style
+ xmp
+ ])
+
# Attributes that need additional escaping on `<a>` elements due to unsafe
# libxml2 behavior.
UNSAFE_LIBXML_ATTRS_A = Set.new(%w[
@@ -185,6 +198,28 @@ class Sanitize; module Transformers; class CleanElement
@add_attributes[name].each {|key, val| node[key] = val }
end
+ # Make a best effort to ensure that text nodes in invalid "unescaped text"
+ # elements that are inside a math or svg namespace are properly escaped so
+ # that they don't get parsed as HTML.
+ #
+ # Sanitize is explicitly documented as not supporting MathML or SVG, but
+ # people sometimes allow `<math>` and `<svg>` elements in their custom
+ # configs without realizing that it's not safe. This workaround makes it
+ # slightly less unsafe, but you still shouldn't allow `<math>` or `<svg>`
+ # because Nokogiri doesn't parse them the same way browsers do and Sanitize
+ # can't guarantee that their contents are safe.
+ unless node.namespace.nil?
+ prefix = node.namespace.prefix
+
+ if (prefix == 'math' || prefix == 'svg') && UNESCAPED_TEXT_ELEMENTS.include?(name)
+ node.children.each do |child|
+ if child.type == Nokogiri::XML::Node::TEXT_NODE
+ child.content = CGI.escapeHTML(child.content)
+ end
+ end
+ end
+ end
+
# Element-specific special cases.
case name
diff --git a/test/test_malicious_html.rb b/test/test_malicious_html.rb
index d0fc57bbc2ad..377e6deb0fb5 100644
--- a/test/test_malicious_html.rb
+++ b/test/test_malicious_html.rb
@@ -232,4 +232,90 @@ describe 'Malicious HTML' do
end
end
end
+
+ # These tests cover an unsupported and unsafe custom config that allows MathML
+ # and SVG elements, which Sanitize's docs specifically say multiple times in
+ # big prominent warnings that you SHOULD NOT DO because Sanitize doesn't
+ # support MathML or SVG.
+ #
+ # Do not use the custom configs you see in these tests! If you do, you may be
+ # creating XSS vulnerabilities in your application.
+ describe 'foreign content bypass in unsafe custom config that allows MathML or SVG' do
+ unescaped_content_elements = %w[
+ noembed
+ noframes
+ noscript
+ plaintext
+ script
+ xmp
+ ]
+
+ removed_content_elements = %w[
+ iframe
+ ]
+
+ removed_elements = %w[
+ style
+ ]
+
+ before do
+ @s = Sanitize.new(
+ Sanitize::Config.merge(
+ Sanitize::Config::RELAXED,
+ elements: Sanitize::Config::RELAXED[:elements] +
+ unescaped_content_elements +
+ removed_content_elements +
+ %w[math svg]
+ )
+ )
+ end
+
+ unescaped_content_elements.each do |name|
+ it "forcibly escapes text content inside `<#{name}>` in a MathML namespace" do
+ assert_equal(
+ "<math><#{name}><img src=x onerror=alert(1)></#{name}></math>",
+ @s.fragment("<math><#{name}><img src=x onerror=alert(1)></#{name}>")
+ )
+ end
+
+ it "forcibly escapes text content inside `<#{name}>` in an SVG namespace" do
+ assert_equal(
+ "<svg><#{name}><img src=x onerror=alert(1)></#{name}></svg>",
+ @s.fragment("<svg><#{name}><img src=x onerror=alert(1)></#{name}>")
+ )
+ end
+ end
+
+ removed_content_elements.each do |name|
+ it "removes text content inside `<#{name}>` in a MathML namespace" do
+ assert_equal(
+ "<math><#{name}></#{name}></math>",
+ @s.fragment("<math><#{name}><img src=x onerror=alert(1)></#{name}>")
+ )
+ end
+
+ it "removes text content inside `<#{name}>` in an SVG namespace" do
+ assert_equal(
+ "<svg><#{name}></#{name}></svg>",
+ @s.fragment("<svg><#{name}><img src=x onerror=alert(1)></#{name}>")
+ )
+ end
+ end
+
+ removed_elements.each do |name|
+ it "removes `<#{name}>` elements in a MathML namespace" do
+ assert_equal(
+ '<math></math>',
+ @s.fragment("<math><#{name}><img src=x onerror=alert(1)></#{name}>")
+ )
+ end
+
+ it "removes `<#{name}>` elements in an SVG namespace" do
+ assert_equal(
+ '<svg></svg>',
+ @s.fragment("<svg><#{name}><img src=x onerror=alert(1)></#{name}>")
+ )
+ end
+ end
+ end
end
--
2.39.1
|