The two fundamental use cases handled by the parser are extraction and transformation (the synthesis use case, where HTML pages are created from scratch, is better handled by other tools closer to the source of the data). While prior versions concentrated on extracting data from web pages, Version 1.4 of the HTMLParser substantially improves the transformation of web pages, with simplified tag creation and editing, and verbatim toHtml() method output.
In general, to use the HTMLParser you will need to be able to write code in the Java programming language. Although some example programs are provided that may be useful as they stand, it's more than likely you will need (or want) to create your own programs or modify the ones provided to match your intended application.
To use the library, you will need to add either the htmllexer.jar or
htmlparser.jar to your classpath when compiling and running. The
htmllexer.jar provides low-level access to generic string, remark and tag nodes on
the page in a linear, flat, sequential manner. The htmlparser.jar, which
includes the classes found in htmllexer.jar, provides access to a page as a
sequence of nested differentiated tags containing string, remark and other
tag nodes. So where the output from successive calls to the lexer
might be:

<html>
<head>
<title>
"Welcome"
</title>
</head>
<body>
etc...

the output from the parser NodeIterator would
nest the tags as children of the <html>, <head> and other nodes
(here represented by indentation):

<html>
    <head>
        <title>
            "Welcome"
        </title>
    </head>
    <body>
        etc...

The parser attempts to balance opening tags with ending tags to present the structure of the page, while the lexer simply spits out nodes in the order they appear. If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer. But if your application requires knowledge of the nested structure of the page, for example when processing tables, you will probably want to use the full parser.
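To make the difference between the flat and nested views concrete, here is a minimal, self-contained Java sketch of the balancing idea. It does not use the HTMLParser library and is not how the library is implemented; the class NestingSketch, its TreeNode type, and the nest and render methods are hypothetical names introduced only for illustration. It assumes the page has already been split into a flat list of node strings, in the order a lexer would deliver them, and it ignores complications such as self-closing or unclosed tags.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative only: hypothetical names, not part of the HTMLParser API.
public class NestingSketch {

    /** One node of the nested view: its text, an optional end tag, and children. */
    static final class TreeNode {
        final String text;
        String endTag; // set once a matching end tag has been seen
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String text) { this.text = text; }
    }

    static boolean isEndTag(String node) {
        return node.startsWith("</");
    }

    static boolean isStartTag(String node) {
        // Remarks and declarations such as "<!-- ... -->" are treated as leaves.
        return node.startsWith("<") && !isEndTag(node) && !node.startsWith("<!");
    }

    /** Extracts the lower-cased name from "<name ...>" or "</name>". */
    static String tagName(String tag) {
        String inner = tag.substring(isEndTag(tag) ? 2 : 1, tag.length() - 1).trim();
        int space = inner.indexOf(' ');
        return (space < 0 ? inner : inner.substring(0, space)).toLowerCase();
    }

    /** Nests a flat node sequence by pairing each end tag with the innermost open tag. */
    static TreeNode nest(List<String> flat) {
        TreeNode root = new TreeNode("(root)");
        Deque<TreeNode> open = new ArrayDeque<>();
        open.push(root);
        for (String node : flat) {
            TreeNode current = open.peek();
            if (isStartTag(node)) {
                // A start tag becomes a child of the innermost open tag and opens a new level.
                TreeNode child = new TreeNode(node);
                current.children.add(child);
                open.push(child);
            } else if (isEndTag(node) && current != root
                    && tagName(current.text).equals(tagName(node))) {
                // A matching end tag closes the innermost open tag.
                current.endTag = node;
                open.pop();
            } else {
                // Text, remarks, and unmatched end tags are kept as leaves.
                current.children.add(new TreeNode(node));
            }
        }
        return root;
    }

    /** Renders the children of a node, one per line, indented four spaces per level. */
    static void render(TreeNode node, int depth, StringBuilder out) {
        String indent = "    ".repeat(depth);
        for (TreeNode child : node.children) {
            out.append(indent).append(child.text).append('\n');
            render(child, depth + 1, out);
            if (child.endTag != null) {
                out.append(indent).append(child.endTag).append('\n');
            }
        }
    }

    public static void main(String[] args) {
        List<String> flat = Arrays.asList(
                "<html>", "<head>", "<title>", "\"Welcome\"", "</title>", "</head>",
                "<body>", "</body>", "</html>");
        StringBuilder out = new StringBuilder();
        render(nest(flat), 0, out);
        System.out.print(out);
    }
}
```

Run on the flat sequence in main, the sketch prints the nodes with one level of indentation per enclosing tag, which mirrors the nested view described above. A stack of open tags is the simplest way to express the balancing step; a real parser has to cope with malformed markup that this sketch does not attempt to handle.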
The HTML Parser is an open source library released under the GNU Lesser General Public License (LGPL), which basically says you are free to use the library "as is" in other (even proprietary) products, as long as due credit is given to the authors and the source code for the HTMLParser is included or available with the other product. For modified or embedded use, please consult the LGPL.