1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210
|
$Id: Readme,v 1.4 2004/06/02 14:33:38 hfuecks Exp $
++Introduction
XML_HTMLSax3 is a SAX based XML parser for badly formed XML documents,
such as HTML.
The original code base was developed by Alexander Zhukov and published at
http://sourceforge.net/projects/phpshelve/. Alexander kindly gave permission
to modify the code and license for inclusion in PEAR.
PEAR::XML_HTMLSax3 provides an API very similar to the native PHP SAX
extension (http://www.php.net/xml), allowing handlers using one to be
easily adapted to the other. The key difference is HTMLSax will not
break on badly formed XML, allowing it to be used for parsing HTML
documents. Otherwise HTMLSax supports all the handlers available from
Expat except namespace and external entity handlers. Provides methods
for handling XML escapes as well as JSP/ASP opening and close tags.
Version 1.x introduced an API similar to the native SAX extension but
used a slow character by character approach to parsing.
Version 2.x has had it's internals completely overhauled to use a Lexer,
delivering performance *approaching* that of the native XML extension,
as well as a radically improved, modular design that makes adding
further functionality easy.
Version 3.x is about fine tuning the API, behaviour and providing a
mechanism to distinguish HTML "quirks" from badly formed HTML
A big thanks to Jeff Moore (lead developer of WACT:
http://wact.sourceforge.net) who's largely responsible for new design, as well
input from members at Sitepoint's Advanced PHP forums:
http://www.sitepointforums.com/showthread.php?threadid=121246.
Thanks also to Marcus Baker (lead developer of SimpleTest:
http://www.lastcraft.com/simple_test.php) for sorting out the unit tests.
++Uses
Some particular situations where XML_HTMLSax3 can be useful include;
- Template Engines (see WACT for example: http://wact.sf.net)
- Parsing XML documents (such as those online) where the source is
out of your control and Expat is choking because it's badly formed.
- Converting HTML to XHTML
- Reading HTML based content from a database and converting to PDF (with
help from a PDF generation library and probably PEAR::XML_SaxFilters as
well)
- Parsing ASP(.NET) and JSP pages.
- Creating a PHP-GTK based web browser? A PHP CSS Parser exists:
http://www.phpclasses.org/browse.html/package/1081.html
++Features
- Won't "break" on badly formed XML. May in some instances get it "wrong"
(see Limitations) but will continue parsing.
- Provides an API similar to the native PHP XML extension so switching code
from one to the other is typically minimal effort.
- Can be instructed to behave in more or less the same manner as SAX,
when dealing with linefeeds, tabs and XML entities
- In addition to handling basic XML elements attributes and data also
capable of dealing with;
- Processing instructions e.g. <?php ?> / <?xml ?> etc. Within PI's
XML entities are not parsed (i.e. ignore < and > )
- XML Escape markup such as <! >, <!-- --> and <![CDATA[ ]]>. Within
this XML entities are not parsed (useful for JavaScript, for example)
- JSP / ASP (JASP) marked up with <% %>. Note: You will need to
deal with <%@ %> and <%= %> yourself. With JASP markup XML entities
are not parsed
++Usage Notes
- Performance-wise, it runs faster on PHP 4.3.0 thanks to strspn() and
strcspn() supporting position arguments. For older PHP versions while
loops are used to achieve the same effect, meaning a slightly higher
overhead. Note also that setting XML options with XML_HTMLSax3::set_option()
also slows down the parser, the options being handled by "decorators"
which perform some further formatting on XML events which have already
been parsed.
- By default, no parser options are set
- Regarding the XML_OPTION_ENTITIES_PARSED, this uses the html_entity_decode()
function which is only available in PHP 4.3.0+. To get round this, HTMLSax
checks your PHP version and for the function name html_entity_decode. If not
found, it defines a function which mirrors the behavior of the native PHP
html_entity_decode().
Both XML_OPTION_ENTITIES_PARSED and XML_OPTION_ENTITIES_UNPARSED can be used
down to PHP version 4.0.5, due to the regular expression used to find entities.
- For attributes which have just a name but no value e.g.
<option value="bar" selected>
HTMLSax will return a NULL value for that attribute name, when
calling the opening tag handler;
function myOpenHandler($parser,$name,$attrs) {
print_r ( $attrs );
}
This would produce;
Array
(
[value] => bar
[selected] => NULL
)
- JASP directives like <%@ and <%= will be not be regarded as special.
I.e. you will get back the @ or % from the contents of the JASP block
and have to deal with these yourself.
++ Limitations
- XML_HTMLSax3 only supports use of PHP classes as callback handlers;
there is no support for using PHP functions as handlers.
- The only weird behaviour is for attributes which only have a left quote
or apostrophe e.g. <tag foo="bar>Some Text</tag> will give you an
attribute like $attrs['foo']="bar>Some Text"; This is a trade off against
allowing XML entities like < and > to appear inside attributes.
- Although the package name might suggest otherwise, XML_HTMLSax3 currently
has no special knowledge of HTML (i.e. there is no understanding of whether
a given HTML document is well formed or not, according to HTML's rules).
XML_HTMLSax3 is primarily intended as a SAX based parser that will not
complain about structure of a document it is asked to parse. It even does
a fair job of parsing http://static.php.net/www.php.net/images/php.gif ...
Version 3.x will introduce some basic knowledge of HTML grammar to help with
identifying HTML "quirks".
- <script /> elements containing < or > characters; these will be treated as new
elements triggering the listeners. Make sure that any JavaScript inside is marked
either with an XML comment <!-- --> or a CDATA block <[CDATA[ ]]>
</script>
Alternatively define open / close handlers which watch for <script /> elements
as a special case, so that any further events triggered within them are handled
as part of the <script /> element.
- If you change the handlers once parsing has started, you will need to re-set
and parser options you have defined
++ Example Use
Further examples are available in the examples directory of this package.
<?php
// Include HTMLSax
require_once('XML/HTMLSax3.php');
// Define a customer handler class
class MyHandler {
function MyHandler(){}
// Opening tags
function openHandler(& $parser,$name,$attrs) {
echo ( 'Open Tag Handler: '.$name );
echo ( 'Attrs:' );
print_r($attrs);
}
// Closing tags
function closeHandler(& $parser,$name) {
echo ( 'Close Tag Handler: '.$name );
}
// Text node handler
function dataHandler(& $parser,$data) {
echo ( 'Data Handler: '.$data );
}
// XML escape handler (e.g. HTML comments)
function escapeHandler(& $parser,$data) {
echo ( 'Escape Handler: '.$data );
}
// Processing instruction handler
function piHandler(& $parser,$target,$data) {
echo ( 'PI Handler: '.$target.' - '.$data );
}
// JSP / ASP markup handler
function jaspHandler(& $parser,$data) {
echo ( 'Jasp Handler: '.$data );
}
}
// Get some HTML document
$doc = file_get_contents('http://www.php.net');
// Instantiate the handler
$handler=new MyHandler();
// Instantiate the parser
$parser=& new XML_HTMLSax3();
// Register the handler with the parser
$parser->set_object($handler);
// Set a parser option
$parser->set_option('XML_OPTION_TRIM_DATA_NODES');
// Set the callback handlers (MyHandler methods)
$parser->set_element_handler('openHandler','closeHandler');
$parser->set_data_handler('dataHandler');
$parser->set_escape_handler('escapeHandler');
$parser->set_pi_handler('piHandler');
$parser->set_jasp_handler('jaspHandler');
// Parse the document
$parser->parse($doc);
?>
|