Cleaning up HTML
================

The module ``lxml_html_clean`` provides a ``Cleaner`` class for cleaning up
HTML pages.  It supports removing embedded or script content, special tags,
CSS style annotations and much more.

Note: the HTML Cleaner in ``lxml_html_clean`` is **not** considered
appropriate **for security sensitive environments**.
See e.g. `bleach <https://pypi.org/project/bleach/>`_ for an alternative.

Say, you have an overburdened web page from a hideous source which contains
lots of content that upsets browsers and tries to run unnecessary code on the
client side:

.. sourcecode:: pycon

    >>> html = '''\
    ... <html>
    ...  <head>
    ...    <script type="text/javascript" src="evil-site"></script>
    ...    <link rel="alternate" type="text/rss" src="evil-rss">
    ...    <style>
    ...      body {background-image: url(javascript:do_evil)};
    ...      div {color: expression(evil)};
    ...    </style>
    ...  </head>
    ...  <body onload="evil_function()">
    ...    <!-- I am interpreted for EVIL! -->
    ...    <a href="javascript:evil_function()">a link</a>
    ...    <a href="#" onclick="evil_function()">another link</a>
    ...    <p onclick="evil_function()">a paragraph</p>
    ...    <div style="display: none">secret EVIL!</div>
    ...    <object> of EVIL! </object>
    ...    <iframe src="evil-site"></iframe>
    ...    <form action="evil-site">
    ...      Password: <input type="password" name="password">
    ...    </form>
    ...    <blink>annoying EVIL!</blink>
    ...    <a href="evil-site">spam spam SPAM!</a>
    ...    <image src="evil!">
    ...  </body>
    ... </html>'''

To remove all of the superfluous content from this unparsed document, use the
``clean_html`` function:

.. sourcecode:: pycon

    >>> from lxml_html_clean import clean_html
    >>> print(clean_html(html))
    <div><style>/* deleted */</style><body>

       <a href="">a link</a>
       <a href="#">another link</a>
       <p>a paragraph</p>
       <div>secret EVIL!</div>
        of EVIL!

         Password:
       annoying EVIL!<a href="evil-site">spam spam SPAM!</a>
       <img src="evil!"></body></div>

The ``Cleaner`` class supports several keyword arguments to control exactly
which content is removed:

.. sourcecode:: pycon

    >>> from lxml_html_clean import Cleaner

    >>> cleaner = Cleaner(page_structure=False, links=False)
    >>> print(cleaner.clean_html(html))
    <html>
      <head>
        <link rel="alternate" src="evil-rss" type="text/rss">
        <style>/* deleted */</style>
      </head>
      <body>
        <a href="">a link</a>
        <a href="#">another link</a>
        <p>a paragraph</p>
        <div>secret EVIL!</div>
        of EVIL!
        Password:
        annoying EVIL!
        <a href="evil-site">spam spam SPAM!</a>
        <img src="evil!">
      </body>
    </html>

    >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
    ...                   page_structure=False, safe_attrs_only=False)
    
    >>> print(cleaner.clean_html(html))
    <html>
      <head>
      </head>
      <body>
        <a href="">a link</a>
        <a href="#">another link</a>
        <p>a paragraph</p>
        <div>secret EVIL!</div>
        of EVIL!
        Password:
        annoying EVIL!
        <a href="evil-site" rel="nofollow">spam spam SPAM!</a>
        <img src="evil!">
      </body>
    </html>

To control the removal of CSS styles, set the ``style`` and/or ``inline_style``
keyword arguments to ``True`` when creating a ``Cleaner`` instance.
If neither option is enabled, only ``@import`` rules are automatically removed
from CSS content.
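
For instance, a minimal sketch of the two options (the sample markup is
illustrative, and the exact serialization may vary between versions):

.. sourcecode:: python

    from lxml_html_clean import Cleaner

    html = '<div style="color: red"><style>p { color: blue; }</style><p>text</p></div>'

    # inline_style=True strips style attributes but keeps <style> elements.
    print(Cleaner(inline_style=True, style=False).clean_html(html))

    # style=True removes <style> elements and style attributes entirely.
    print(Cleaner(style=True).clean_html(html))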

You can also whitelist some otherwise dangerous content with
``Cleaner(host_whitelist=['www.youtube.com'])``, which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.
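
A short sketch of that behaviour (the URLs are illustrative):

.. sourcecode:: python

    from lxml_html_clean import Cleaner

    cleaner = Cleaner(host_whitelist=['www.youtube.com'])
    html = ('<div><iframe src="https://www.youtube.com/embed/abc"></iframe>'
            '<iframe src="https://ads.invalid/banner"></iframe></div>')

    # The frame from the whitelisted host survives; the other is dropped.
    print(cleaner.clean_html(html))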

See the docstring of ``Cleaner`` for the details of what can be
cleaned.


autolink
--------

In addition to cleaning up malicious HTML, ``lxml_html_clean``
contains functions to do other things to your HTML.  This includes
autolinking::

   autolink(doc, ...)

   autolink_html(html, ...)

This finds anything that looks like a link (e.g.,
``http://example.com``) in the *text* of an HTML document, and
turns it into an anchor.  It avoids making bad links.

Links are not made in the elements ``<textarea>``, ``<pre>``, ``<code>``,
or anywhere in the head of the document.  You can pass in a list of
elements to avoid with ``avoid_elements=['textarea', ...]``.

Links to some hosts can be avoided.  By default links to
``localhost*``, ``example.*`` and ``127.0.0.1`` are not
autolinked.  Pass in ``avoid_hosts=[list_of_regexes]`` to control
this.

Elements with the ``nolink`` CSS class are not autolinked.  Pass
in ``avoid_classes=['code', ...]`` to control this.

The ``autolink_html()`` version of the function parses the HTML
string first, and returns a string.
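
A minimal sketch of ``autolink_html()`` (the hosts are illustrative; note
that links to ``example.*`` hosts are skipped by default, so a different
host is used for the link that should be made):

.. sourcecode:: python

    from lxml_html_clean import autolink_html

    html = '<p>Docs live at http://lxml.de/ but http://example.com/ stays plain</p>'

    # The first URL becomes an anchor; the example.com URL is left as text.
    print(autolink_html(html))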


wordwrap
--------

You can also wrap long words in your HTML::

   word_break(doc, max_width=40, ...)

   word_break_html(html, ...)

This finds any long words in the text of the document and inserts
``&#8203;`` (the Unicode zero-width space) into them.

This avoids the elements ``<pre>``, ``<textarea>``, and ``<code>``.
You can control this with ``avoid_elements=['textarea', ...]``.

It also avoids elements with the CSS class ``nobreak``.  You can
control this with ``avoid_classes=['code', ...]``.

Lastly you can control the character that is inserted with
``break_character=u'\u200b'``.  However, you cannot insert markup,
only text.

``word_break_html(html)`` parses the HTML document and returns a
string.
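
A minimal sketch of ``word_break_html()`` (the sample word and the small
``max_width`` are illustrative, chosen only to force a break):

.. sourcecode:: python

    from lxml_html_clean import word_break_html

    html = '<p>' + 'abcdefghij' * 6 + '</p>'

    # Words longer than max_width gain zero-width spaces (U+200B), which
    # may appear as the character reference &#8203; in the serialized output.
    print(word_break_html(html, max_width=10))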