File: test_autolink.txt

Package: lxml-html-clean 0.4.2-1

This tests autolink::

    >>> from lxml.html import usedoctest
    >>> from lxml_html_clean import autolink_html
    >>> print(autolink_html('''
    ... <div>Link here: http://test.com/foo.html.</div>
    ... '''))
    <div>Link here: <a href="http://test.com/foo.html">http://test.com/foo.html</a>.</div>
    >>> print(autolink_html('''
    ... <div>Mail me at mailto:ianb@test.com or http://myhome.com</div>
    ... '''))
    <div>Mail me at <a href="mailto:ianb@test.com">ianb@test.com</a>
    or <a href="http://myhome.com">http://myhome.com</a></div>
    >>> print(autolink_html('''
    ... <div>The <b>great</b> thing is the http://link.com links <i>and</i>
    ... the http://foobar.com links.</div>'''))
    <div>The <b>great</b> thing is the <a href="http://link.com">http://link.com</a> links <i>and</i>
    the <a href="http://foobar.com">http://foobar.com</a> links.</div>
    >>> print(autolink_html('''
    ... <div>Link: &lt;http://foobar.com&gt;</div>'''))
    <div>Link: &lt;<a href="http://foobar.com">http://foobar.com</a>&gt;</div>
    >>> print(autolink_html('''
    ... <div>Link: (http://foobar.com)</div>'''))
    <div>Link: (<a href="http://foobar.com">http://foobar.com</a>)</div>
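
The examples above all follow one pattern: find a URL in a run of text, drop trailing punctuation, and wrap what is left in an anchor. Below is a minimal sketch of that idea on plain text only. It is not the code used by ``autolink_html``; the names ``LINK_RE``, ``TRAILING`` and ``link_plain_text`` are made up for illustration, and the regular expression is a simplifying assumption.

```python
import re
from html import escape

# Assumed, simplified pattern: a scheme followed by anything that is not
# whitespace, an angle bracket, or a parenthesis.
LINK_RE = re.compile(r"(?P<scheme>https?://|mailto:)(?P<rest>[^\s<>()]+)", re.I)
TRAILING = ".,;:!?"  # punctuation that usually ends a sentence, not a URL


def link_plain_text(text):
    """Return HTML for plain text, with URLs wrapped in <a> elements."""
    out = []
    pos = 0
    for match in LINK_RE.finditer(text):
        rest = match.group("rest").rstrip(TRAILING)
        if not rest:
            continue  # nothing left after trimming, so leave the text alone
        url = match.group("scheme") + rest
        # Copy the text before the link, escaped for HTML.
        out.append(escape(text[pos:match.start()], quote=False))
        # A mailto: link shows only the address as its label.
        label = rest if match.group("scheme").lower() == "mailto:" else url
        out.append('<a href="%s">%s</a>' % (escape(url), escape(label, quote=False)))
        pos = match.start() + len(url)
    out.append(escape(text[pos:], quote=False))
    return "".join(out)


print(link_plain_text("Link here: http://test.com/foo.html."))
```

The sketch works on a single string; a real implementation has to walk the element tree and handle text and tail nodes separately.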

Parentheses are tricky; we'll do our best::

    >>> print(autolink_html('''
    ... <div>(Link: http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))</div>
    ... '''))
    <div>(Link: <a href="http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)">http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)</a>)</div>
    >>> print(autolink_html('''
    ... <div>... a link: http://foo.com)</div>
    ... '''))
    <div>... a link: <a href="http://foo.com">http://foo.com</a>)</div>
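
One common way to get the two results above is to keep a closing parenthesis only when it is balanced by an opening one inside the URL. The sketch below shows that heuristic. It is an assumption about how such trimming can be done, not a description of what ``autolink_html`` does internally, and ``trim_url`` is a hypothetical helper.

```python
def trim_url(candidate):
    """Strip trailing punctuation and any unmatched closing parentheses."""
    url = candidate
    while url:
        last = url[-1]
        if last in ".,;:!?":
            # Sentence punctuation is almost never the end of a URL.
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More closers than openers: this ")" belongs to the prose.
            url = url[:-1]
        else:
            break
    return url


print(trim_url("http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))"))
print(trim_url("http://foo.com)"))
```

The first call keeps the balanced ``(Central_Point_Software)`` and drops only the extra closer; the second drops the lone ``)``.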

Some cases that won't be caught (on purpose)::

    >>> print(autolink_html('''
    ... <div>A link to http://localhost/foo/bar won't, but a link to
    ...  http://test.com will</div>'''))
    <div>A link to http://localhost/foo/bar won't, but a link to
    <a href="http://test.com">http://test.com</a> will</div>
    >>> print(autolink_html('''
    ... <div>A link in <textarea>http://test.com</textarea></div>'''))
    <div>A link in <textarea>http://test.com</textarea></div>
    >>> print(autolink_html('''
    ... <div>A link in <a href="http://foo.com">http://bar.com</a></div>'''))
    <div>A link in <a href="http://foo.com">http://bar.com</a></div>
    >>> print(autolink_html('''
    ... <div>A link in <code>http://foo.com</code> or
    ... <span class="nolink">http://bar.com</span></div>'''))
    <div>A link in <code>http://foo.com</code> or
    <span class="nolink">http://bar.com</span></div>
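
The skipped cases above fall into three groups: hosts that should not be linked (localhost), elements whose content is left alone (textarea, a, code), and elements marked with the nolink class. A rough sketch of such a check follows. The function ``should_autolink`` and the three constants are hypothetical and list only the cases shown in this file; the library's real defaults may differ.

```python
import re
from urllib.parse import urlsplit

# Illustrative values taken from the examples above, not the library defaults.
AVOID_ELEMENTS = {"textarea", "a", "code"}
AVOID_CLASSES = {"nolink"}
AVOID_HOST_PATTERNS = [re.compile(r"^localhost", re.I)]


def should_autolink(url, ancestor_tags, ancestor_classes):
    """Decide whether a URL found in text should be turned into a link.

    ancestor_tags and ancestor_classes describe the elements enclosing
    the text in which the URL was found.
    """
    if AVOID_ELEMENTS.intersection(ancestor_tags):
        return False
    if AVOID_CLASSES.intersection(ancestor_classes):
        return False
    host = urlsplit(url).hostname or ""
    return not any(pattern.search(host) for pattern in AVOID_HOST_PATTERNS)


print(should_autolink("http://localhost/foo/bar", ["div"], []))
print(should_autolink("http://test.com", ["div"], []))
```

Keeping the three rules as separate collections makes each one easy to override on its own.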

There's also a word-wrapping function, which should probably be run
after autolink::

    >>> from lxml_html_clean import word_break_html
    >>> def pascii(s):
    ...     print(s.encode('ascii', 'xmlcharrefreplace').decode('ascii'))
    >>> pascii(word_break_html('''
    ... <div>Hey you
    ... 12345678901234567890123456789012345678901234567890</div>'''))
    <div>Hey you
    1234567890123456789012345678901234567890&#8203;1234567890</div>
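
The ``&#8203;`` in the output is U+200B, a zero-width space, which lets a browser wrap the long run of digits. A minimal sketch of the idea, assuming a fixed width of 40 characters as in the example: ``break_long_words`` is a hypothetical helper that cuts at fixed positions and works on plain text only, whereas the real function works on HTML and may choose its break points differently.

```python
import re

ZWSP = "\u200b"  # zero-width space, written as &#8203; in the output above


def break_long_words(text, max_width=40, break_character=ZWSP):
    """Insert break_character into whitespace-free runs longer than max_width."""

    def split_word(match):
        word = match.group(0)
        # Cut the word into pieces of at most max_width characters.
        chunks = [word[i:i + max_width] for i in range(0, len(word), max_width)]
        return break_character.join(chunks)

    # Only runs of max_width + 1 or more non-space characters are touched.
    pattern = r"\S{%d,}" % (max_width + 1)
    return re.sub(pattern, split_word, text)


broken = break_long_words("Hey you\n" + "1234567890" * 5)
print(broken.encode("ascii", "xmlcharrefreplace").decode("ascii"))
```

Short words such as "Hey" and "you" fall under the width limit and pass through unchanged.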

Not everything is broken; text inside <code> and attribute values such
as href are left alone::

    >>> pascii(word_break_html('''
    ... <div>Hey you
    ... <code>12345678901234567890123456789012345678901234567890</code></div>'''))
    <div>Hey you
    <code>12345678901234567890123456789012345678901234567890</code></div>
    >>> pascii(word_break_html('''
    ... <a href="12345678901234567890123456789012345678901234567890">text</a>'''))
    <a href="12345678901234567890123456789012345678901234567890">text</a>