File: __init__.py

# -*- coding: iso-8859-1 -*-
# Copyright (C) 2000-2009 Bastian Kleineidam
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
"""
Fast HTML parser module written in C with the following features:

- Reentrant
  As soon as any HTML string data is available, we try to feed it
  to the HTML parser. This means that the parser has to scan possible
  incomplete data, recognizing as much as it can. Incomplete trailing
  data is saved for subsequent calls, or it is just flushed into the
  output buffer with the flush() function.
  A reset() brings the parser back to its initial state, throwing away all
  buffered data.

- Coping with HTML syntax errors
  The parser recognizes as much as it can and passes the rest
  of the data as TEXT tokens.
  The scanner only passes complete recognized HTML syntax elements to
  the parser. Invalid syntax elements are passed as TEXT. This way we do
  not need the bison error recovery.
  Incomplete data is rescanned the next time the parser calls yylex() or
  when it is being flush()ed.

  The following syntax errors will be recognized correctly:

    - Unquoted attribute values.
    - Missing beginning quote of attribute values.
    - Invalid "</...>" end tags in script mode.
    - Missing ">" in tags.
    - Invalid characters in tag or attribute names.

  The following syntax errors will not be recognized:

    - Missing end quote of attribute values. On the TODO list.
    - Unknown HTML tag or attribute names.
    - Invalid nesting of tags.

  Additionally the parser has the following features:

    - NULL bytes are changed into spaces
    - <!-- ... --> inside a <script> or <style> element is not treated
       as a comment but as DATA
    - Rewrites all tag and attribute names to lowercase for easier
       matching.

- Speed
  The FLEX code is configured to generate a large but fast scanner.
  The parser ignores forbidden or unnecessary HTML end tags.
  The parser converts tag and attribute names to lower case for easier
  matching.
  The parser quotes all attribute values.
  Python memory management interface is used.

- Character encoding aware
  The parser itself is not encoding aware, but output strings are
  always Python Unicode strings.

- Retain HTML attribute order
  The parser keeps the order in which HTML tag attributes are parsed.
  The attributes are stored in a custom dictionary class ListDict which
  iterates over the dictionary keys in insertion order.

USAGE

First make an HTML SAX handler object. Missing callback functions are
ignored, and the return values of callbacks are ignored as well.
Note that a missing attribute value is stored as the value None
in the ListDict (i.e. "<a href>" will lead to a {href: None} dict entry).

The following handler callbacks are used:

- Comments: <!--data-->
  def comment (data)
  @param data: comment data
  @type data: Unicode string

- Start tag: <tag {attr1:value1, attr2:value2, ..}>
  def start_element (tag, attrs)
  @param tag: tag name
  @type tag: Unicode string
  @param attrs: tag attributes
  @type attrs: ListDict

- Start-end tag: <tag {attr1:value1, attr2:value2, ..}/>
  def start_end_element (tag, attrs)
  @param tag: tag name
  @type tag: Unicode string
  @param attrs: tag attributes
  @type attrs: ListDict

- End tag: </tag>
  def end_element (tag)
  @param tag: tag name
  @type tag: Unicode string

- Document type: <!DOCTYPE data>
  def doctype (data)
  @param data: doctype string data
  @type data: Unicode string

- Processing instruction (PI): <?name data?>
  def pi (name, data=None)
  @param name: instruction name
  @type name: Unicode string
  @param data: instruction data
  @type data: Unicode string

- Character data: <![CDATA[data]]>
  def cdata (data)
  @param data: character data
  @type data: Unicode string

- Characters: data
  def characters (data)
  @param data: data
  @type data: Unicode string

Additionally, there are error and warning callbacks:

- Parser warning.
  def warning (msg)
  @param msg: warning message
  @type msg: Unicode string

- Parser error.
  def error (msg)
  @param msg: error message
  @type msg: Unicode string

- Fatal parser error.
  def fatal_error (msg)
  @param msg: error message
  @type msg: Unicode string
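
For illustration, a minimal handler that only collects link targets might
look like the sketch below. The class name is made up here, ListDict is
assumed to support the usual dict lookup, and any callbacks the handler
does not define are simply ignored:

 class LinkCollector (object):
     def __init__ (self):
         self.links = []

     def start_element (self, tag, attrs):
         # tag names are already lowercased by the parser; a missing
         # attribute value (e.g. "<a href>") is stored as None
         if tag == u"a" and attrs.get(u"href") is not None:
             self.links.append(attrs[u"href"])

     def error (self, msg):
         print "parse error:", msg

Such a handler is passed to HtmlParser.htmlsax.parser() exactly as in the
EXAMPLE below.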

EXAMPLE

 # This handler prints out the parsed HTML.
 handler = HtmlParser.htmllib.HtmlPrettyPrinter()
 # Create a new HTML parser object with the handler as parameter.
 parser = HtmlParser.htmlsax.parser(handler)
 # Feed data.
 parser.feed("<html><body>Blubb</body></html>")
 # Flush for finishing things up.
 parser.flush()
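
 # Because the parser is reentrant, data may also arrive in chunks; the
 # chunk boundaries below are made up purely for illustration.
 parser.feed("<html><bo")
 parser.feed("dy>Blubb</body></html>")
 # Incomplete trailing data is buffered until the next feed() or the
 # final flush().
 parser.flush()
 # reset() discards all buffered data and returns the parser to its
 # initial state, so the same parser object can be reused.
 parser.reset()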

"""

import re
import codecs
import htmlentitydefs


def _resolve_entity (mo):
    """
    Resolve an HTML entity.

    @param mo: matched _entity_re object with an "entity" match group
    @type mo: MatchObject instance
    @return: resolved entity char, or empty string on error
    @rtype: unicode string
    """
    ent = mo.group("entity")
    s = mo.group()
    if s.startswith('&#'):
        if s[2] in 'xX':
            radix = 16
        else:
            radix = 10
        try:
            num = int(ent, radix)
        except (ValueError, OverflowError):
            return u''
    else:
        num = htmlentitydefs.name2codepoint.get(ent)
    if num is None or num < 0:
        # unknown entity -> ignore
        return u''
    try:
        return unichr(num)
    except ValueError:
        return u''


_entity_re = re.compile(ur'(?i)&(#x?)?(?P<entity>[0-9a-z]+);')

def resolve_entities (s):
    """
    Resolve HTML entities in s.

    @param s: string with entities
    @type s: string
    @return: string with resolved entities
    @rtype: string
    """
    return _entity_re.sub(_resolve_entity, s)
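
# Illustrative examples for resolve_entities(), derived from the regular
# expression and resolver above (not executed anywhere in this module):
#   resolve_entities(u"a &lt; b &amp;&amp; c")  ->  u"a < b && c"
#   resolve_entities(u"&#65;&#x42;")            ->  u"AB"
#   resolve_entities(u"&bogus;")                ->  u""  (unknown entities vanish)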

SUPPORTED_CHARSETS = ["utf-8", "iso-8859-1", "iso-8859-15"]

_encoding_ro = re.compile(r"charset=(?P<encoding>[-0-9a-zA-Z]+)")

def set_encoding (parsobj, attrs):
    """
    Set document encoding for the HTML parser according to the <meta>
    tag attribute information.

    @param parsobj: the parser object whose encoding is set
    @param attrs: attributes of a <meta> HTML tag
    @type attrs: dict
    @return: None
    """
    if attrs.get_true('http-equiv', u'').lower() == u"content-type":
        charset = attrs.get_true('content', u'')
        charset = get_ctype_charset(charset.encode('ascii', 'ignore'))
        if charset and charset.lower() in SUPPORTED_CHARSETS:
            parsobj.encoding = charset
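
# A sketch of the intended use (the caller is not part of this module):
# for a tag such as
#   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
# calling set_encoding(parser, attrs) sets parser.encoding to "utf-8",
# since "utf-8" is listed in SUPPORTED_CHARSETS above.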


def get_ctype_charset (text):
    """
    Extract charset information from a MIME content type string, e.g.
    "text/html; charset=iso8859-1". Return the charset name if it names a
    known codec, else None.
    """
    for param in text.lower().split(';'):
        param = param.strip()
        if param.startswith('charset='):
            charset = param[8:]
            try:
                codecs.lookup(charset)
                return charset
            except (LookupError, ValueError):
                pass
    return None
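
# Illustrative examples for get_ctype_charset() (not executed in this module):
#   get_ctype_charset("text/html; charset=iso-8859-1")  ->  "iso-8859-1"
#   get_ctype_charset("text/html; charset=UTF-8")       ->  "utf-8" (lowercased)
#   get_ctype_charset("text/plain")                      ->  None
#   get_ctype_charset("text/html; charset=bogus")        ->  None (unknown codec)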


def set_doctype (parsobj, doctype):
    """
    Set document type of the HTML parser according to the given
    document type string.

    @param parsobj: the parser object whose doctype is set
    @param doctype: document type
    @type doctype: string
    @return: None
    """
    if u"XHTML" in doctype:
        parsobj.doctype = "XHTML"