This tests autolink:: >>> from lxml.html import usedoctest >>> from lxml_html_clean import autolink_html >>> print(autolink_html(''' ...
Link here: http://test.com/foo.html.
... '''))
Link here: http://test.com/foo.html.
>>> print(autolink_html(''' ...
Mail me at mailto:ianb@test.com or http://myhome.com
... '''))
Mail me at ianb@test.com or http://myhome.com
>>> print(autolink_html(''' ...
The great thing is the http://link.com links and ... the http://foobar.com links.
'''))
The great thing is the http://link.com links and the http://foobar.com links.
>>> print(autolink_html(''' ...
Link: <http://foobar.com>
'''))
Link: <http://foobar.com>
>>> print(autolink_html(''' ...
Link: (http://foobar.com)
'''))
Link: (http://foobar.com)
Parenthesis are tricky, we'll do our best:: >>> print(autolink_html(''' ...
(Link: http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))
... '''))
(Link: http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))
>>> print(autolink_html(''' ...
... a link: http://foo.com)
... '''))
... a link: http://foo.com)
Some cases that won't be caught (on purpose):: >>> print(autolink_html(''' ...
A link to http://localhost/foo/bar won't, but a link to ... http://test.com will
'''))
A link to http://localhost/foo/bar won't, but a link to http://test.com will
>>> print(autolink_html(''' ...
A link in
'''))
A link in
>>> print(autolink_html(''' ...
A link in http://bar.com
'''))
A link in http://bar.com
>>> print(autolink_html(''' ...
A link in http://foo.com or ... http://bar.com
'''))
A link in http://foo.com or http://bar.com
There's also a word wrapping function, that should probably be run after autolink:: >>> from lxml_html_clean import word_break_html >>> def pascii(s): ... print(s.encode('ascii', 'xmlcharrefreplace').decode('ascii')) >>> pascii(word_break_html( u''' ...
Hey you ... 12345678901234567890123456789012345678901234567890
'''))
Hey you 1234567890123456789012345678901234567890​1234567890
Not everything is broken: >>> pascii(word_break_html(''' ...
Hey you ... 12345678901234567890123456789012345678901234567890
'''))
Hey you 12345678901234567890123456789012345678901234567890
>>> pascii(word_break_html(''' ... text''')) text