<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>REXML</title>
<style media="all" type="text/css">@import "http://www.germane-software.com/~ser/Software/style.css";</style>
</head>
<body>
<div id="banner">
<IMG SRC="img/rexml.png"></div>
<h4 align="center">1.2.5</h4>
<div align="right">
<A href="../archives/rexml_1.2.5.tgz">Current release (1.2.5, unix)</A>
</div>
<div align="right">
<A href="../archives/rexml_1.2.5.zip">Current release (1.2.5, dos)</A>
</div>
<div id="centercontent">
<h2>Overview</h2>
<h3>Abstract</h3>
<p>REXML stands for "Ruby Electric XML". Sorry. I'm not
very creative when it comes to working names for my software, and I
invariably use the working names as the final product names. The
"Ruby" comes from the Ruby language, obviously. The
"Electric XML" comes from the inspiration for this project,
the Electric XML Java processing library.</p>
<p>This software is distributed under the <a href="LICENSE.txt">Ruby license</a>.</p>
<p>1.2.5: Bug fixes: doctypes that had spaces between the closing ]
and > generated errors. There was a small bug that caused too
many newlines to be generated in some output. Eelis van der Weegen
(what a great name!) pointed out one of the numerous API errors.
Julian requested that add_attributes take both Hash (original) and
array of arrays (as produced by StreamListener). I killed the
mailing list, accidentally, and fixed it again. Fixed a bug in
next_sibling, caused by the combination of overriding
&lt;=&gt;() and using Array.index().</p>
<p>1.2.4: Changes since 1.1b: 100% OASIS valid tests passed.
UTF-8/16 support. Many bug fixes. to_a() added to Parent and
Element.elements. Updated tutorial. Added variable IOSource
buffer size, for stream parsing. delete() now fails silently
rather than throwing an exception if it can't find the element to
delete. Added a patch to support REXMLBuilder. Reorganized file
layout in distribution; added a repackaging program; added the
logo.</p>
<p>1.1b: Changes since 1.1a: Stream parsing added. Bug fixes in entity
parsing. New XPath implementation, fixing many bugs and making
it feature complete. Completed whitespace handling, adding much
functionality and fixing several bugs. Added convenience methods
for inserting elements. Improved error reporting. Fixed
attribute content to correctly handle quotes and apostrophes. Added
mechanisms for handling raw text. Cleaned up utility programs
(profile.rb, comparison.rb, etc.). Improved speed a little.
Brought REXML up to 98.9% OASIS valid source compliance.</p>
<h3>Introduction</h3>
<p>Why REXML? There are, at the time of this writing, already two XML
parsers for Ruby. The first is a Ruby binding to a native XML
parser. This is a fast parser, using proven technology. However,
it isn't very portable. The second is a native Ruby
implementation, and as useful as it is, it has (IMO) a difficult
API.</p>
<p>I have this problem: I dislike obfuscated APIs. There are
several XML parser APIs for Java. Most of them follow DOM or SAX,
and are very similar in philosophy to an increasing number of
Java APIs. Namely, they look like they were designed by theorists
who never had to use their own APIs. The extant XML APIs, in
general, suck. They take a markup language which was specifically
designed to be very simple, elegant, and powerful, and wrap an
obnoxious, bloated, and large API around it. I was always having
to refer to the API documentation to do even the most basic XML
tree manipulations; nothing was intuitive, and almost every
operation was complex. </p>
<p>Then along came Electric XML. </p>
<p>Ah, bliss. Look at the Electric XML API. First, the library is
small; less than 500K. Next, the API is intuitive. You want to
parse a document? doc = new Document( some_file ). Create and add
a new element? element = parent.addElement( tag_name ). Write out
a subtree? element.write( writer ). Now how about DOM? To parse
some file: parser = new DOMParser(); parser.parse( new
InputSource( new FileInputStream( some_file ) ) ). Create a new
element? First you have to know the owning document of the
to-be-created node (can anyone say "global variables, or
obtuse, multi-argument methods"?) and call element =
doc.createElement( tag_name ). Then you get to call
parent.appendChild( element ). "appendChild"? Where did
they get that from? How many different methods do we have in Java
in how many different classes for adding children to parents?
addElement()? add()? put()? appendChild()? Heaven forbid that you
want to create an Element elsewhere in the code without having
access to the owning document. I'm not even going to go into
what travesty of code you have to go through to write out an XML
sub-tree in DOM. </p>
<p>So, I use Electric XML extensively. It is small, fast, and
intuitive. I.e., the API doesn't add a bunch of work to the task
of writing software. When I started to write more software in
Ruby, I needed an XML parser. I wasn't keen on the native
library binding, "XMLParser", because I try to avoid
complex library dependencies in my software, when I can. For a
long time, I used NQXML, because it was the only other parser out
there. However, the NQXML API can be even more painful than the
Java DOM API. Almost all element operations require going through
some indirect node access... you have to do something like
element.node.attr['key'], and it is never obvious to me
when you access the element directly, or the node.. or, really,
why they're two different objects, anyway. This is even more
unfortunate since Ruby is so elegant and intuitive, and bad APIs
really stand out. I'm not, by the way, trying to insult NQXML; I
just don't like the API.</p>
<p>I wrote the people at TheMind (Electric XML... get it?) and asked
them if I could do a translation to Ruby. They said yes. After a
few weeks of hacking on it for a couple of hours each week, and
after having gone down a few blind alleys in the translation, I
had a working beta. I.e., it parsed, but hadn't gone through a
lot of strenuous testing. Along the way, I had made a few changes
to the API, and a lot of changes to the code. First off, Ruby does
iterators differently than Java. Java uses a lot of helper
classes. Helper classes are exactly the kinds of things that
theorists come up with... they look good on paper, but using them
is like chewing glass. You find that you spend 50% of your time
writing helper classes just to support the other 50% of the code
that actually does the job you were trying to do in the first
place. In this case, the Java helper classes are either
Enumerations or Iterators. Ruby, on the other hand, uses blocks,
which is much more elegant. Rather than:</p>
<div class="example">
<pre>for (Enumeration e=parent.getChildren(); e.hasMoreElements(); ) {
Element child = (Element)e.nextElement();
// Do something with child
}</pre>
</div>
<p>you get:</p>
<div class="example">
<pre>parent.each_child { |child|
  # Do something with child
}</pre>
</div>
<p>Can't you feel the peace and contentment in this block of
code? Ruby is the language Buddha would have programmed in.</p>
<p>Anyhoo, I chose to use blocks in REXML directly, since this is
more common to Ruby code than <TT>for x in y ... end</TT>, which
is as orthogonal to the original Java as possible.</p>
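<p>To make the comparison concrete, here is a minimal sketch in plain Ruby. It
does not use REXML at all: the <TT>Node</TT> class and its <TT>add</TT> and
<TT>each_child</TT> methods are hypothetical stand-ins, meant only to show how a
parent can offer block iteration, and how <TT>for x in y</TT> comes along for
free once <TT>each</TT> is defined.</p>

```ruby
# Hypothetical stand-in for a tree node; not REXML's actual Element class.
class Node
  include Enumerable

  attr_reader :name, :children

  def initialize(name)
    @name = name
    @children = []
  end

  # Append a child and return it, so calls can be chained.
  def add(child)
    @children.push(child)
    child
  end

  # Block-style iteration: the caller supplies the loop body.
  def each_child(&block)
    @children.each(&block)
  end

  # Defining each is what makes "for x in y" (and Enumerable) work.
  alias each each_child
end

parent = Node.new("parent")
parent.add(Node.new("a"))
parent.add(Node.new("b"))

# The block form.
block_names = []
parent.each_child { |child| block_names.push(child.name) }

# The for form; Ruby translates it into a call to parent.each.
for_names = []
for child in parent
  for_names.push(child.name)
end
```

<p>Both loops visit the same children in the same order; the block form just
needs no helper object to carry the iteration state.</p>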
<p>Also, I changed the naming conventions to more Ruby-esque method
names. For example, the Java method <TT>getAttributeValue()</TT>
becomes in Ruby <TT>get_attribute_value()</TT>. This is a
toss-up. I actually like the Java naming convention more, but the
latter is more common in Ruby code, and I'm trying to make things
easy for Ruby programmers, not Java programmers.</p>
<p>The biggest change was in the code. The Java version of Electric
XML did a lot of efficient String-array parsing, character by
character. Ruby, however, has ubiquitous, efficient, and powerful
regular expression support. All regex functions are done in native
code, so they are very fast, and the power of Ruby regexes rivals that of
Perl. Therefore, a direct conversion of the Java code to Ruby would
have been more difficult, and much slower, than using Ruby regexes. I
therefore used regexes. In doing so, I cut the number of lines of
source code by half<SUP><A href="#N73">1</A></SUP>.</p>
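<p>As an illustration of the regex-driven style, here is a toy tokenizer built
on Ruby's standard <TT>StringScanner</TT>. This is a sketch under loud
assumptions, not REXML's parser: the patterns, the <TT>tokenize</TT> method, and
the token format are all invented for this example, and comments, CDATA,
processing instructions, and entities are ignored.</p>

```ruby
require "strscan"

# Illustrative patterns only; a real XML parser needs many more cases.
TAG_START = /<([A-Za-z_][\w:.\-]*)((?:\s+[\w:.\-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(\/?)>/
TAG_END   = /<\/([A-Za-z_][\w:.\-]*)\s*>/
TEXT      = /[^<]+/
ATTRIBUTE = /([\w:.\-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Turn a small XML string into a flat list of [:start, ...], [:text, ...]
# and [:end, ...] tokens, one regex match at a time.
def tokenize(xml)
  scanner = StringScanner.new(xml)
  tokens = []
  until scanner.eos?
    if scanner.scan(TAG_END)
      tokens.push([:end, scanner[1]])
    elsif scanner.scan(TAG_START)
      name = scanner[1]
      self_closing = scanner[3] == "/"
      attrs = {}
      # Exactly one of the two quote groups matches for each attribute.
      scanner[2].scan(ATTRIBUTE) { |key, dq, sq| attrs[key] = dq || sq }
      tokens.push([:start, name, attrs])
      tokens.push([:end, name]) if self_closing
    elsif scanner.scan(TEXT)
      tokens.push([:text, scanner[0]])
    else
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
  end
  tokens
end

tokens = tokenize(%q{<a id="1">hi<b/></a>})
```

<p>Each branch is a single pattern match rather than a character-by-character
loop, which is where the savings in lines of code come from.</p>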
<p>Finally, by this point the API looks almost nothing like the
original Electric XML API, and practically none of the code is even
vaguely similar. However, even though the actual code is completely
different, I did borrow Electric's approach to processing XML,
and am deeply indebted to the Electric XML code for
inspiration.</p>
<h3>Features</h3>
<ul>
<li>Simple API</li>
<li>Both stream (SAX) and tree (DOM) parsing<SUP><A href="#N83">2</A></SUP>
</li>
<li>Small</li>
<li>Native Ruby</li>
<li>Documentation(!)</li>
</ul>
<h2>Operation</h2>
<h3>Installation</h3>
<p>Run 'ruby install.rb'. By the way, you really should
look at these sorts of files before you run them as root. They
could contain anything, and since (in Ruby, at least) they tend to
be mercifully short, it doesn't hurt to glance over them.</p>
<h3>General Usage</h3>
<p>Please see <a href="tutorial.html">the Tutorial</a>
</p>
<p>The API documentation is <a href="api/rexml/index.html">here</a>. Some examples using
REXML are included in the distribution archive, and the Tutorial
provides examples with commentary.</p>
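<p>For a first taste before opening the Tutorial, the following sketch parses a
document, adds an element, iterates over children, and writes out a subtree.
It assumes the API of the REXML that ships with current Ruby releases; method
names and behavior in 1.2.5 may differ, and the document content is made up.</p>

```ruby
require "rexml/document"

# Parse a document from a String (a File or other IO would also work).
doc = REXML::Document.new("<inventory><item name='lamp'/></inventory>")

# Create and add a new element to a parent, with attributes from a Hash.
item = doc.root.add_element("item", { "name" => "chair" })
item.add_text("four legs")

# Walk the child elements with a block.
names = []
doc.root.each_element { |child| names.push(child.attributes["name"]) }

# Write out a subtree as a String.
subtree = item.to_s
```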
<h2>Status</h2>
<h3>Speed and Completeness</h3>
<p>Unfortunately, NQXML is the only package REXML can be compared
against; XMLParser uses expat, which is a native library, and
really is a different beast altogether. So in comparing NQXML and
REXML you can look at three things: speed, size, and API.</p>
<p>
<a href="benchmarks/index.html">Benchmarks</a>
</p>
<p>REXML is faster than NQXML in some things, and
slower than NQXML in a couple of things. You can see this for
yourself by running the supplied benchmarks, although it may not
be clear from these which operations are slower. Most of the
places where REXML is slower are because of the convenience
methods<SUP><A href="#NBC">3</A></SUP>. On the
positive side,
most of the convenience methods can be bypassed if you know what
you are doing. Check the <a href="benchmarks/index.html">
benchmark comparison page</a> for a general
comparison. You can look at the benchmark code yourself to decide
how much salt to take with them.</p>
<p>The sizes of the distributions are very close. NQXML has about
1400 non-blank, non-comment lines of code; REXML has
1823<SUP><A href="#NCE">4</A></SUP>.
</p>
<p>The last thing is the API, and this is where I think REXML wins,
hands down. The core API is clean and intuitive, and things
work the way you would expect them to. Convenience methods
abound, and you can code for either convenience or speed.
REXML code is terse, and readable, like Ruby code should be.
The best way to decide which you like more is to write a couple
of small applications in each, then use the one you're more
comfortable with.</p>
<p>It should be noted that NQXML does not support XPath searches.</p>
<h3>XPath</h3>
<p>Here is the status of the XPath implementation.</p>
<div class="example">
<div class="exampletitle">Implemented</div>
<pre>/                root
.                self
..               parent
*                all element children
//               all elements in document
//child          all "child" elements in document
parent//child    all "child" descendants of child element "parent"
parent/child     all "child" elements of "parent"
[...]            all predicates (attribute, index, text)
[...][...]       compound predicates
element          child element "element"
function()       (partially)
axis::           (partially)</pre>
</div>
<p>Some of this API (the API dealing with function() handling, in particular) is subject to change.</p>
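<p>A few of the path forms listed above, tried against a small made-up
document. This is a sketch that assumes the <TT>REXML::XPath.match</TT> and
<TT>REXML::XPath.first</TT> calls of the REXML bundled with current Ruby
releases; the XPath entry points in 1.2.5 may not be the same.</p>

```ruby
require "rexml/document"

doc = REXML::Document.new(<<~XML)
  <parent>
    <child id="1">one</child>
    <group>
      <child id="2">two</child>
    </group>
  </parent>
XML

# //child : all "child" elements in the document
all_children = REXML::XPath.match(doc, "//child").map { |e| e.text }

# /parent/child : "child" elements directly under the root "parent"
direct = REXML::XPath.match(doc, "/parent/child").map { |e| e.text }

# [...] : an attribute predicate
second = REXML::XPath.first(doc, "//child[@id='2']")

# .. : the parent of the element the search starts from
group_name = REXML::XPath.first(second, "..").name
```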
<h3>Namespaces</h3>
<p>Namespace support is now fairly stable. One thing to be aware
of is that REXML is not (yet) a validating parser. This means
that some invalid namespace declarations are not caught.</p>
<h3>Known Bugs</h3>
<ul>
<li>Tobias has once again done the unmentionable, and completely
overturned my comfortable little world. In this case, he's shown
XPath to be broken, in a way I hadn't anticipated. He's got an
XPath that does some really gnarly things, from an evaluation point
of view, such as having predicates containing functions which
themselves have arguments defined by xpaths containing predicates
and functions. This is going to take some work.</li>
<li>There may be a problem with over-escaping characters in
attribute values.</li>
<li>Sometimes the test suite hangs or segfaults the Ruby
interpreter. If this is something that I can fix, then it is a
bug, and I will fix it.</li>
</ul>
<h3>To Do</h3>
<ul>
<li>Markus Jais would like to know if REXML should indent output
by default, as it does, or whether it wouldn't be better if the
default behavior would be to not indent output.</li>
<li>Make a REXML mailing list</li>
<li>What should the XPath "/" return? What should "item/ancestor::"
return? According to the XPath spec, "/" should return the root
element... however, "/root" should also return the root element
(assuming the root element is "root"). This is stupid. The
XPath spec appears to be ambiguous on this point.</li>
<li>I put up an RFC about Element.elements.each("xpath") { ...
}, saying that I'd like to change it. So far, the response has
been mixed, so maybe I'll leave it. The jury is still out.</li>
<li>Add a default listener that constructs trees based on an
event map. NQXML does something like this:
<div class="example">
<pre>nd = NQXML::Dispatcher.new(file)
nd.handle(:start_element, %w(root level1 level2)) { | e |
  # do something with e
}
nd.handle(:text, %w(root level1 level2 level3)) { | e |
  # reads text inside &lt;level3&gt; tag
}
nd.start()</pre>
</div>
I'd want it to look similar; basically, the user
passes the parser a set of tags for which the Stream
Listener should build sub-trees. When a sub-tree is finished
being built, some event is triggered.</li>
<li>Allow the user to add entity conversions</li>
<li>I'd like to hack the nascent SVG tool and XMLRPC4R to use REXML, for my own purposes.</li>
</ul>
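<p>The event-map listener from the To Do list could look roughly like the
following. This is only a sketch of the idea, not a planned design:
<TT>PathDispatcher</TT> and its <TT>handle</TT> method are invented here, it
dispatches events by tag path rather than building sub-trees, and it assumes
the <TT>REXML::StreamListener</TT> callbacks and
<TT>REXML::Document.parse_stream</TT> of the REXML bundled with current Ruby
releases.</p>

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical sketch: PathDispatcher and handle are invented for
# illustration; they are not part of REXML.
class PathDispatcher
  include REXML::StreamListener

  def initialize
    @path = []       # names of the currently open elements
    @handlers = {}   # [event, path] => block
  end

  # Register a block for an event (:start_element or :text) at a tag path.
  def handle(event, path, &block)
    @handlers[[event, path]] = block
  end

  def tag_start(name, attrs)
    @path.push(name)
    fire(:start_element, name, attrs)
  end

  def tag_end(_name)
    @path.pop
  end

  def text(content)
    fire(:text, content)
  end

  private

  # Call the handler registered for this event at the current path, if any.
  def fire(event, *args)
    handler = @handlers[[event, @path]]
    handler.call(*args) if handler
  end
end

seen = []
dispatcher = PathDispatcher.new
dispatcher.handle(:start_element, %w(root level1)) { |name, attrs| seen.push([name, attrs["id"]]) }
dispatcher.handle(:text, %w(root level1 level2)) { |content| seen.push(content) }

source = "<root><level1 id='x'><level2>hello</level2></level1></root>"
REXML::Document.parse_stream(source, dispatcher)
```

<p>Building a sub-tree for each matching path, and firing an event when it is
complete, would go on top of the same path bookkeeping.</p>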
<h3>Requested features</h3>
<ul>
<li>Optionally not process character entities</li>
<li>Process entity declarations in DocType.</li>
<li>Overload Element constructor to allow passing
a hash list of attributes. This will slow down REXML, probably
significantly.</li>
</ul>
<h2>Credits</h2>
<p>I've had help from a number of resources; if I haven't listed you
here, it means that I just haven't gotten around to adding you, or
that I'm a dork and have forgotten. In either case, feel free to
write me and complain. I may ignore you, but at least you
tried. (Actually, I don't consciously ignore anybody except spammers.)</p>
<ul>
<li>
<a href="mailto:erik@solidcode.net">Erik Terpstra</a>
heard my pleas and submitted several logos for REXML. After sagely
avoiding choosing one for several weeks, I finally forced my poor
slave of a wife to pick one (this is what we call "delegation").
She did, with caveats; Erik quickly made the changes, and the
result is what you now see at the top of this page. He also
supplied a <a href="img/rexml_50p.png">smaller version</a>
that you can include with your projects that use REXML, if you'd
like.
</li>
<li>Bug fixes provided by: <a href="mailto:ukai@debian.or.jp">Fumitoshi UKAI</a> (CData
metacharacter quoting bug)</li>
<li>
<a href="mailto:oliver@debian.org">Oliver M. Bolzer</a>
is maintaining a Debian package distribution of REXML. He also has
provided good feedback and bug reports about namespace support.</li>
<li>
<a href="mailto:erne@powernav.com">Ernest Ellingson</a>
contributed the source code for turning UTF16 and UNILE encodings
into UTF8, which allowed REXML to get the 100% OASIS valid tests
rating.</li>
<li>
<a href="mailto:maki@inac.co.jp">TAKAHASHI Masayoshi</a>,
for information on UTF</li>
<li>
<a href="mailto:james@rubyxml.com">James Britt</a> contributed
code that makes Document.parse_stream easier to use by allowing
it to be passed either a Source, File, or String.
</li>
<li>
<a href="http://www.themindelectric.com/products/xml/xml.html">Electric
XML</a>: This was, after all, the inspiration for REXML.
Originally, I was just going to do a straight port, and although
REXML doesn't in any way, shape or form resemble Electric XML,
still the basic framework and philosophy was inspired by E-XML.
And I still use E-XML in my Java projects.</li>
<li>
<a href="mailto:tobiasreif@pinkjuice.com">Tobias
Reif</a>: Numerous bug reports, and suggestions for
improvement.</li>
<li>
<a href="http://www.io.com/~jimm/downloads/nqxml/index.html">NQXML</a>:
While I may complain about the NQXML API, I wrote a few
applications using it that wouldn't have been written otherwise,
and it was very useful to me. It also encouraged me to write
REXML. Never complain about free software *slap*.</li>
<li>
<a href="mailto:feldt@ce.chalmers.se">Robert
Feldt</a>: Bug reports and suggestions/recommendations about
improving REXML. Testing is one of the most important aspects of
software development.</li>
</ul>
</div>
<div class="footnotes">
<div class="footnote">
<A name="N73">1) </A>It might interest you to know that at
last count, Electric XML had ~3,700 non-comment, non-empty lines of
code. REXML had ~1,550. This illustrates the marvelous efficiency
and power of Ruby.</div>
<div class="footnote">
<A name="N83">2) </A>Be aware, however, that REXML is neither DOM nor SAX compliant, and never will be. The DOM and SAX APIs are unwieldy.</div>
<div class="footnote">
<A name="NBC">3) </A>For example, <TT>element.elements[index]</TT>
isn't
really an array operation; index can be an Integer or an XPath,
and this feature is relatively time expensive.</div>
<div class="footnote">
<A name="NCE">4) </A>REXML started out with about 1200, but that
number has been steadily increasing as features are added.
XPath and the helper class StreamListener account for about
320 lines of that code.</div>
</div>
<div id="footer">
<div style="float:left;">
<a href="http://www.germane-software.com/~ser">[ Home ]</a>
</div>
<div style="float:right;">
<a href="mailto:ser@germane-software.com">[ EMail ]</a>
</div>
<a href="http://www.germane-software.com/~ser/Software">[ Software ]</a>
</div>
</body>
</html>