This tests autolink::
>>> from lxml.html import usedoctest
>>> from lxml_html_clean import autolink_html
>>> print(autolink_html('''
... Link here: http://test.com/foo.html.
... '''))
>>> print(autolink_html('''
... Mail me at mailto:ianb@test.com or http://myhome.com
... '''))
>>> print(autolink_html('''
... The great thing is the http://link.com links and
... the http://foobar.com links.
... '''))
>>> print(autolink_html('''
... Link: <http://foobar.com>
... '''))
>>> print(autolink_html('''
... Link: (http://foobar.com)
... '''))
Parentheses are tricky; we'll do our best::
>>> print(autolink_html('''
... (Link: http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))
... '''))
>>> print(autolink_html('''
... ... a link: http://foo.com)
... '''))
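The parenthesis cases above can be illustrated with a small balance heuristic. This is only a sketch of one plausible rule, not the matching logic that ``autolink_html`` actually uses; the helper name ``trim_trailing`` is hypothetical.

```python
def trim_trailing(url):
    """Split a matched URL into (link, leftover) with a simple balance rule.

    Hypothetical helper for illustration; not part of lxml_html_clean.
    """
    leftover = ""
    while url:
        last = url[-1]
        if last in ".,;:!?":
            # Assumption: sentence punctuation never ends a URL.
            pass
        elif last == ")" and url.count(")") > url.count("("):
            # An unmatched closing parenthesis belongs to the prose.
            pass
        else:
            break
        # Move the trailing character from the link to the leftover text.
        leftover = last + leftover
        url = url[:-1]
    return url, leftover


link, rest = trim_trailing(
    "http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))")
print(link)  # the balanced "(Central_Point_Software)" stays in the link
print(rest)  # only the extra ")" is left over
```

Under this rule a balanced pair inside the URL is kept, while a closing parenthesis without a partner, as in ``http://foo.com)``, is treated as surrounding text.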
Some cases that won't be caught (on purpose)::
>>> print(autolink_html('''
... A link to http://localhost/foo/bar won't, but a link to
... http://test.com will
... '''))
A link to http://localhost/foo/bar won't, but a link to
http://test.com will
>>> print(autolink_html('''
... A link in
... '''))
A link in
>>> print(autolink_html('''
... '''))
>>> print(autolink_html('''
... A link in http://foo.com
... or
... http://bar.com
... '''))
A link in http://foo.com
or
http://bar.com
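The localhost case above suggests a host filter applied before a URL is turned into a link. The following is a minimal sketch of such a filter, assuming a fixed set of hosts to skip; ``AVOID_HOSTS`` and ``should_link`` are hypothetical names, and the library's own rules may differ.

```python
from urllib.parse import urlparse

# Assumption: hosts that should never be auto-linked.
AVOID_HOSTS = {"localhost", "127.0.0.1"}


def should_link(url):
    """Return True when the URL's host is not in the avoided set."""
    host = urlparse(url).hostname or ""
    return host not in AVOID_HOSTS


print(should_link("http://localhost/foo/bar"))
print(should_link("http://test.com"))
```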
There's also a word-wrapping function, which should probably be run
after autolink::
>>> from lxml_html_clean import word_break_html
>>> def pascii(s):
...     print(s.encode('ascii', 'xmlcharrefreplace').decode('ascii'))
>>> pascii(word_break_html('''
... Hey you
... 12345678901234567890123456789012345678901234567890
... '''))
Hey you
1234567890123456789012345678901234567890&#8203;1234567890
Not everything is broken::
>>> pascii(word_break_html('''
... Hey you
... 12345678901234567890123456789012345678901234567890
... '''))
Hey you
12345678901234567890123456789012345678901234567890
>>> pascii(word_break_html('''
... text'''))
text
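The idea behind word breaking can be sketched on plain strings: insert a zero-width space into any run of non-whitespace characters longer than a maximum width. This is an illustration only, assuming a width of 40 and U+200B as the break character; the hypothetical ``break_long_words`` works on text, whereas ``word_break_html`` operates on parsed HTML and leaves some content unbroken.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"


def break_long_words(text, max_width=40, break_character=ZERO_WIDTH_SPACE):
    """Insert break_character into words longer than max_width characters."""

    def split_word(match):
        word = match.group(0)
        # Cut the word into chunks of at most max_width characters.
        pieces = [word[i:i + max_width]
                  for i in range(0, len(word), max_width)]
        return break_character.join(pieces)

    # Only runs of non-whitespace longer than max_width are rewritten.
    pattern = re.compile(r"\S{%d,}" % (max_width + 1))
    return pattern.sub(split_word, text)


broken = break_long_words("Hey you\n" + "1234567890" * 5)
# Encode the same way as pascii() so the break character is visible.
print(broken.encode("ascii", "xmlcharrefreplace").decode("ascii"))
```

Short words such as "Hey" and "you" pass through unchanged, since only runs longer than ``max_width`` match the pattern.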