Cleaning up HTML
================
The module ``lxml_html_clean`` provides a ``Cleaner`` class for cleaning up
HTML pages. It supports removing embedded or script content, special tags,
CSS style annotations and much more.
Note: the HTML Cleaner in ``lxml_html_clean`` is **not** considered
appropriate **for security sensitive environments**.
See e.g. `bleach <https://pypi.org/project/bleach/>`_ for an alternative.
Say, you have an overburdened web page from a hideous source which contains
lots of content that upsets browsers and tries to run unnecessary code on the
client side:
.. sourcecode:: pycon
>>> html = '''\
... <html>
... <head>
... <script type="text/javascript" src="evil-site"></script>
... <link rel="alternate" type="text/rss" src="evil-rss">
... <style>
... body {background-image: url(javascript:do_evil)};
... div {color: expression(evil)};
... </style>
... </head>
... <body onload="evil_function()">
... <!-- I am interpreted for EVIL! -->
... <a href="javascript:evil_function()">a link</a>
... <a href="#" onclick="evil_function()">another link</a>
... <p onclick="evil_function()">a paragraph</p>
... <div style="display: none">secret EVIL!</div>
... <object> of EVIL! </object>
... <iframe src="evil-site"></iframe>
... <form action="evil-site">
... Password: <input type="password" name="password">
... </form>
... <blink>annoying EVIL!</blink>
... <a href="evil-site">spam spam SPAM!</a>
... <image src="evil!">
... </body>
... </html>'''
To remove all the superfluous content from this unparsed document, use the
``clean_html`` function:
.. sourcecode:: pycon
>>> from lxml_html_clean import clean_html
>>> print(clean_html(html))
<div><style>/* deleted */</style><body>
<a href="">a link</a>
<a href="#">another link</a>
<p>a paragraph</p>
<div>secret EVIL!</div>
of EVIL!
Password:
annoying EVIL!<a href="evil-site">spam spam SPAM!</a>
<img src="evil!"></body></div>
The ``Cleaner`` class supports several keyword arguments to control exactly
which content is removed:
.. sourcecode:: pycon
>>> from lxml_html_clean import Cleaner
>>> cleaner = Cleaner(page_structure=False, links=False)
>>> print(cleaner.clean_html(html))
<html>
<head>
<link rel="alternate" src="evil-rss" type="text/rss">
<style>/* deleted */</style>
</head>
<body>
<a href="">a link</a>
<a href="#">another link</a>
<p>a paragraph</p>
<div>secret EVIL!</div>
of EVIL!
Password:
annoying EVIL!
<a href="evil-site">spam spam SPAM!</a>
<img src="evil!">
</body>
</html>
>>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
... page_structure=False, safe_attrs_only=False)
>>> print(cleaner.clean_html(html))
<html>
<head>
</head>
<body>
<a href="">a link</a>
<a href="#">another link</a>
<p>a paragraph</p>
<div>secret EVIL!</div>
of EVIL!
Password:
annoying EVIL!
<a href="evil-site" rel="nofollow">spam spam SPAM!</a>
<img src="evil!">
</body>
</html>
To control the removal of CSS styles, set the ``style`` and/or ``inline_style``
keyword arguments to ``True`` when creating a ``Cleaner`` instance.
If neither option is enabled, only ``@import`` rules are automatically removed
from CSS content.
You can also whitelist some otherwise dangerous content with
``Cleaner(host_whitelist=['www.youtube.com'])``, which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.
See the docstring of ``Cleaner`` for the details of what can be
cleaned.
autolink
--------
In addition to cleaning up malicious HTML, ``lxml_html_clean``
contains functions to do other things to your HTML. This includes
autolinking::
autolink(doc, ...)
autolink_html(html, ...)
This finds anything that looks like a link (e.g.,
``http://example.com``) in the *text* of an HTML document, and
turns it into an anchor. It avoids making bad links.
Links are not created inside the elements ``<textarea>``, ``<pre>``,
``<code>``, or anything in the head of the document. You can pass in a list of
elements to avoid with ``avoid_elements=['textarea', ...]``.
Links to some hosts can be avoided. By default links to
``localhost*``, ``example.*`` and ``127.0.0.1`` are not
autolinked. Pass in ``avoid_hosts=[list_of_regexes]`` to control
this.
Elements with the ``nolink`` CSS class are not autolinked. Pass
in ``avoid_classes=['code', ...]`` to control this.
The ``autolink_html()`` version of the function parses the HTML
string first, and returns a string.
wordwrap
--------
You can also wrap long words in your HTML::
word_break(doc, max_width=40, ...)
word_break_html(html, ...)
This finds any long words in the text of the document and inserts
``&#8203;`` in the document (which is the Unicode zero-width space).
This avoids the elements ``<pre>``, ``<textarea>``, and ``<code>``.
You can control this with ``avoid_elements=['textarea', ...]``.
It also avoids elements with the CSS class ``nobreak``. You can
control this with ``avoid_classes=['code', ...]``.
Lastly you can control the character that is inserted with
``break_character=u'\u200b'``. However, you cannot insert markup,
only text.
``word_break_html(html)`` parses the HTML document and returns a
string.