1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
|
The moving parts
================
html5lib consists of a number of components, which are responsible for
handling its features.
Tree builders
-------------
The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:
* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
which can be found in the standard library. Whenever possible, the
accelerated ``ElementTree`` implementation (i.e.
``xml.etree.cElementTree`` on Python 2.x) is used.
* ``dom`` - builds a tree based on ``xml.dom.minidom``.
* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
API. The performance gains are relatively small compared to using the
accelerated ``ElementTree`` module.
You can specify the builder by name when using the shorthand API:
.. code-block:: python
import html5lib
with open("mydocument.html", "rb") as f:
lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword attribute:
.. code-block:: python
import html5lib
parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
document = parser.parse("<p>Hello World!")
To get a builder class by name, use the ``getTreeBuilder`` function:
.. code-block:: python
import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")
The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.
Tree walkers
------------
Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree. html5lib
provides walkers for all three supported types of trees (``etree``,
``dom`` and ``lxml``).
The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
Walkers make consuming HTML easier. html5lib uses them to provide you
with has a couple of handy tools.
HTMLSerializer
~~~~~~~~~~~~~~
The serializer lets you write HTML back as a stream of bytes.
.. code-block:: pycon
>>> import html5lib
>>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
>>> walker = html5lib.getTreeWalker("etree")
>>> stream = walker(element)
>>> s = html5lib.serializer.HTMLSerializer()
>>> output = s.serialize(stream)
>>> for item in output:
... print("%r" % item)
'<p'
' '
'xml:lang'
'='
'pl'
'>'
'Witam wszystkich'
You can customize the serializer behaviour in a variety of ways, consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation.
Filters
~~~~~~~
You can alter the stream content with filters provided by html5lib:
* :class:`alphabeticalattributes.Filter
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
tags to be in alphabetical order
* :class:`inject_meta_charset.Filter
<html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
encoding in the correct ``<meta>`` tag in the ``<head>`` section of
the document
* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
``LintError`` exceptions on invalid tag and attribute names, invalid
PCDATA, etc.
* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
removes tags from the stream which are not necessary to produce valid
HTML
* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
unsafe markup and CSS. Elements that are known to be safe are passed
through and the rest is converted to visible text. The default
configuration of the sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
collapses all whitespace characters to single spaces unless they're in
``<pre/>`` or ``textarea`` tags.
To use a filter, simply wrap it around a stream:
.. code-block:: python
>>> import html5lib
>>> from html5lib.filters import sanitizer
>>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
>>> walker = html5lib.getTreeWalker("dom")
>>> stream = walker(dom)
>>> sane_stream = sanitizer.Filter(stream) clean_stream = sanitizer.Filter(stream)
Tree adapters
-------------
Used to translate one type of tree to another. More documentation
pending, sorry.
Encoding discovery
------------------
Parsed trees are always Unicode. However a large variety of input
encodings are supported. The encoding of the document is determined in
the following way:
* The encoding may be explicitly specified by passing the name of the
encoding as the encoding parameter to the
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
``HTMLParser`` objects.
* If no encoding is specified, the parser will attempt to detect the
encoding from a ``<meta>`` element in the first 512 bytes of the
document (this is only a partial implementation of the current HTML
5 specification).
* If no encoding can be found and the chardet library is available, an
attempt will be made to sniff the encoding from the byte pattern.
* If all else fails, the default encoding will be used. This is usually
`Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
a common fallback used by Web browsers.
Tokenizers
----------
The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.
To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:
.. code-block:: python
import html5lib
from html5lib import sanitizer
p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
p.parse("<p>Surprise!<script>alert('Boo!');</script>")
HTMLTokenizer
~~~~~~~~~~~~~
This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
HTMLSanitizer
~~~~~~~~~~~~~
This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
|