html5rdf
========

.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
    :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml

html5rdf is a pure-Python library for parsing HTML into DOMFragment objects for use in RDFLib.
It is a fork of html5lib-python; the html5lib README follows below.

----

It is designed to conform to the WHATWG HTML specification, as implemented by all major
web browsers.

html5lib-modern is designed as a drop-in replacement for ``html5lib`` that exposes a new
``html5lib`` module without Python 2 support and without the legacy dependencies on
``six`` and ``webencodings``. Note that you should not have the old, deprecated ``html5lib``
and ``html5lib-modern`` in your dependency tree at the same time, because both provide a
module named ``html5lib``.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5rdf
  with open("mydocument.html", "rb") as f:
      document = html5rdf.parse(f)

or:

.. code-block:: python

  import html5rdf
  document = html5rdf.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation.
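
As an illustration of working with the default tree, the sketch below
collects the text of every ``<p>`` element. It is not taken from the
upstream documentation: ``local_name`` is a hypothetical helper, and the
code assumes that html5rdf, like html5lib, returns the root element and
may qualify tag names with the XHTML namespace. The helper strips any
such prefix, so the lookup does not depend on that assumption:

.. code-block:: python

  import html5rdf


  def local_name(tag):
      """Strip a leading ``{namespace}`` from an ElementTree tag name."""
      return tag.rsplit("}", 1)[-1]


  document = html5rdf.parse("<p>Hello World!")

  # Comment nodes have a non-string tag in ElementTree, so skip them.
  paragraphs = [
      element.text
      for element in document.iter()
      if isinstance(element.tag, str) and local_name(element.tag) == "p"
  ]
  print(paragraphs)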

Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5rdf
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5rdf.parse(f, treebuilder="lxml")

When using ``urllib.request`` (Python 3), the charset from the HTTP
headers should be passed to html5rdf as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5rdf

  with urlopen("http://example.com/") as f:
      document = html5rdf.parse(f, transport_encoding=f.info().get_content_charset())
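
To see what ``get_content_charset()`` hands to the parser, the following
standard-library sketch reproduces the header lookup without a network
request. ``charset_from_content_type`` is a hypothetical helper written
for this illustration only. It relies on the fact that the object
returned by ``f.info()`` is a subclass of ``email.message.Message``:

.. code-block:: python

  from email.message import Message


  def charset_from_content_type(content_type):
      """Return the charset named in a Content-Type header value, or None."""
      # A plain Message parses the header the same way as the HTTPMessage
      # subclass that urlopen() responses expose through info().
      message = Message()
      message["Content-Type"] = content_type
      return message.get_content_charset()


  with_charset = charset_from_content_type("text/html; charset=UTF-8")
  without_charset = charset_from_content_type("text/html")
  print(with_charset)     # utf-8 (the value is lower-cased)
  print(without_charset)  # None

When the server declares no charset, the value is ``None``, so
``transport_encoding`` carries no hint from the HTTP layer.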

To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5rdf
  with open("mydocument.html", "rb") as f:
      parser = html5rdf.HTMLParser(strict=True)
      document = parser.parse(f)

When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5rdf
  parser = html5rdf.HTMLParser(tree=html5rdf.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.


Installation
------------

html5rdf works on CPython 3.8+ and PyPy. To install:

.. code-block:: bash

    $ pip install html5rdf

The goal is to support a (non-strict) superset of the versions that `pip
supports
<https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.


Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not builder); and

- ``chardet`` can be used as a fallback when character encoding cannot
  be determined.


Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.


Tests
-----

Unit tests require the ``pytest`` and ``mock`` libraries and can be
run using the ``pytest`` command in the root directory.

Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository, which is included
as a submodule, so for git checkouts the submodule must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.