<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="http://www.germane-software.com/~ser/Software/documentation.xsl"?>
<!DOCTYPE documentation>
<documentation>
<head>
<title>REXML</title>
<banner href="img/rexml.png"/>
<version>1.2.5</version>
<date>*2002-11</date>
<home>http://www.germane-software.com/~ser/software/rexml</home>
<base>rexml</base>
<archive type="unix">Current release</archive>
<archive type="dos">Current release</archive>
<language>ruby</language>
<author email="ser@germane-software.com" href="http://www.germane-software.com/~ser">Sean Russell</author>
</head>
<overview>
<purpose lang="en">
<p>REXML stands for "Ruby Electric XML". Sorry. I'm not
very creative when it comes to working names for my software, and I
invariably use the working names as the final product names. The
"Ruby" comes from the Ruby language, obviously. The
"Electric XML" comes from the inspiration for this project,
the Electric XML Java processing library.</p>
<p>This software is distributed under the <link
href="LICENSE.txt">Ruby license</link>.</p>
<p>1.2.5: Bug fixes: doctypes that had spaces between the closing ]
and > generated errors. There was a small bug that caused too
many newlines to be generated in some output. Eelis van der Weegen
(what a great name!) pointed out one of the numerous API errors.
Julian requested that add_attributes take both Hash (original) and
array of arrays (as produced by StreamListener). I killed the
mailing list, accidentally, and fixed it again. Fixed a bug in
next_sibling, caused by the combination of overriding
&lt;=&gt;() and using Array.index().</p>
<p>1.2.4: Changes since 1.1b: 100% OASIS valid tests passed.
UTF-8/16 support. Many bug fixes. to_a() added to Parent and
Element.elements. Updated tutorial. Added variable IOSource
buffer size, for stream parsing. delete() now fails silently
rather than throwing an exception if it can't find the element to
delete. Added a patch to support REXMLBuilder. Reorganized file
layout in distribution; added a repackaging program; added the
logo.</p>
<p>1.1b: Changes since 1.1a: Stream parsing added. Bug fixes in entity
parsing. New XPath implementation, fixing many bugs and making
it feature complete. Completed whitespace handling, adding much
functionality and fixing several bugs. Added convenience methods
for inserting elements. Improved error reporting. Fixed
attribute content to correctly handle quotes and apostrophes. Added
mechanisms for handling raw text. Cleaned up utility programs
(profile.rb, comparison.rb, etc.). Improved speed a little.
Brought REXML up to 98.9% OASIS valid source compliance.</p>
</purpose>
<general>
<p>Why REXML? There are, at the time of this writing, already two XML
parsers for Ruby. The first is a Ruby binding to a native XML
parser. This is a fast parser, using proven technology. However,
it isn't very portable. The second is a native Ruby
implementation, and as useful as it is, it has (IMO) a difficult
API.</p>
<p>I have this problem: I dislike obfuscated APIs. There are
several XML parser APIs for Java. Most of them follow DOM or SAX,
and are very similar in philosophy to an increasing number of
Java APIs. Namely, they look like they were designed by theorists
who never had to use their own APIs. The extant XML APIs, in
general, suck. They take a markup language which was specifically
designed to be very simple, elegant, and powerful, and wrap an
obnoxious, bloated, and large API around it. I was always having
to refer to the API documentation to do even the most basic XML
tree manipulations; nothing was intuitive, and almost every
operation was complex. </p>
<p>Then along came Electric XML. </p>
<p>Ah, bliss. Look at the Electric XML API. First, the library is
small; less than 500K. Next, the API is intuitive. You want to
parse a document? doc = new Document( some_file ). Create and add
a new element? element = parent.addElement( tag_name ). Write out
a subtree? element.write( writer ). Now how about DOM? To parse
some file: parser = new DOMParser(); parser.parse( new
InputSource( new FileInputStream( some_file ) ) ). Create a new
element? First you have to know the owning document of the
to-be-created node (can anyone say "global variables, or
obtuse, multi-argument methods"?) and call element =
doc.createElement( tag_name ). Then you get to call
parent.appendChild( element ). "appendChild"? Where did
they get that from? How many different methods do we have in Java
in how many different classes for adding children to parents?
addElement()? add()? put()? appendChild()? Heaven forbid that you
want to create an Element elsewhere in the code without having
access to the owning document. I'm not even going to go into
what travesty of code you have to go through to write out an XML
sub-tree in DOM. </p>
<p>So, I use Electric XML extensively. It is small, fast, and
intuitive. I.e., the API doesn't add a bunch of work to the task
of writing software. When I started to write more software in
Ruby, I needed an XML parser. I wasn't keen on the native
library binding, "XMLParser", because I try to avoid
complex library dependencies in my software, when I can. For a
long time, I used NQXML, because it was the only other parser out
there. However, the NQXML API can be even more painful than the
Java DOM API. Almost all element operations require some
indirect node access... you had to do something like
element.node.attr['key'], and it is never obvious to me
when you access the element directly, or the node... or, really,
why they're two different objects, anyway. This is even more
unfortunate since Ruby is so elegant and intuitive, and bad APIs
really stand out. I'm not, by the way, trying to insult NQXML; I
just don't like the API.</p>
<p>I wrote the people at TheMind (Electric XML... get it?) and asked
them if I could do a translation to Ruby. They said yes. After a
few weeks of hacking on it for a couple of hours each week, and
after having gone down a few blind alleys in the translation, I
had a working beta. I.e., it parsed, but hadn't gone through a
lot of strenuous testing. Along the way, I had made a few changes
to the API, and a lot of changes to the code. First off, Ruby does
iterators differently than Java. Java uses a lot of helper
classes. Helper classes are exactly the kinds of things that
theorists come up with... they look good on paper, but using them
is like chewing glass. You find that you spend 50% of your time
writing helper classes just to support the other 50% of the code
that actually solves the problem you were trying to solve in the first
place. In this case, the Java helper classes are either
Enumerations or Iterators. Ruby, on the other hand, uses blocks,
which is much more elegant. Rather than:</p>
<example>for (Enumeration e=parent.getChildren(); e.hasMoreElements(); ) {
Element child = (Element)e.nextElement();
// Do something with child
}</example>
<p>you get:</p>
<example>parent.each_child { |child|
  # Do something with child
}</example>
<p>Can't you feel the peace and contentment in this block of
code? Ruby is the language Buddha would have programmed in.</p>
<p>Anyhoo, I chose to use blocks in REXML directly, since this is
more common to Ruby code than <code>for x in y ... end</code>, which
is as orthogonal to the original Java as possible.</p>
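<p>As a minimal runnable sketch of the block style, assuming a
current REXML release where <code>each_element</code> is available
(the sample document and the <code>names</code> array are made up for
illustration):</p>
<example><![CDATA[
```ruby
require 'rexml/document'

# Hypothetical sample document, used only for illustration.
doc = REXML::Document.new('<parent><a/><b/><c/></parent>')

# Block-style iteration: each child element is yielded to the block.
names = []
doc.root.each_element { |child| names << child.name }

puts names.join(', ')
```
]]></example>
<p>No Enumeration or Iterator helper object is needed; the block is
the loop body.</p>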
<p>Also, I changed the naming conventions to more Ruby-esque method
names. For example, the Java method <code>getAttributeValue()</code>
becomes in Ruby <code>get_attribute_value()</code>. This is a
toss-up. I actually like the Java naming convention more, but the
latter is more common in Ruby code, and I'm trying to make things
easy for Ruby programmers, not Java programmers.</p>
<p>The biggest change was in the code. The Java version of Electric
XML did a lot of efficient String-array parsing, character by
character. Ruby, however, has ubiquitous, efficient, and powerful
regular expression support. All regex functions are done in native
code, so it is very fast, and the power of Ruby regex rivals that of
Perl. Therefore, a direct conversion of the Java code to Ruby would
have been more difficult, and much slower, than using Ruby regexps. I
therefore used regexps. In doing so, I cut the number of lines of
source code by half<footnote>It might interest you to know that at
last count, Electric XML had ~3,700 non-comment, non-empty lines of
code. REXML had ~1,550. This illustrates the marvelous efficiency
and power of Ruby.</footnote>.</p>
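<p>To give a feel for the regexp-driven style, here is a toy
tokenizer. It is only a sketch and is not REXML's actual parsing
code: the <code>TOKEN</code> pattern and the <code>tokenize</code>
method are hypothetical, and they ignore comments, CDATA, entities,
and everything else a real parser has to handle.</p>
<example><![CDATA[
```ruby
# Illustrative only: a toy regexp tokenizer, not REXML's real parser.
# Group 1: "/" for an end tag; group 2: tag name;
# group 3: "/" for an empty-element tag; group 4: text between tags.
TOKEN = %r{<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>|([^<]+)}

def tokenize(xml)
  xml.scan(TOKEN).map do |closing, name, empty, text|
    if text
      [:text, text]
    elsif closing == '/'
      [:end_tag, name]
    elsif empty == '/'
      [:empty_tag, name]
    else
      [:start_tag, name]
    end
  end
end

tokens = tokenize('<a><b/>hi</a>')
p tokens
```
]]></example>
<p>One pattern and one <code>scan</code> call replace what would be a
character-by-character loop in a direct port.</p>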
<p>Finally, by this point the API looks almost nothing like the
original Electric XML API, and practically none of the code is even
vaguely similar. However, even though the actual code is completely
different, I did borrow Electric XML's overall approach to
processing XML, and am deeply indebted to the Electric XML code for
inspiration.</p>
</general>
<features lang="en">
<item>Simple API</item>
<item>Both stream (SAX) and tree (DOM) parsing<footnote>Be aware, however, that REXML is neither DOM nor SAX compliant, and will never be. The DOM and SAX APIs are unwieldy.</footnote></item>
<item>Small</item>
<item>Native Ruby</item>
<item>Documentation(!)</item>
</features>
</overview>
<operation lang="en">
<subsection title="Installation">
<p>Run 'ruby install.rb'. By the way, you really should
look at these sorts of files before you run them as root. They
could contain anything, and since (in Ruby, at least) they tend to
be mercifully short, it doesn't hurt to glance over them.</p>
</subsection>
<subsection title="General Usage">
<p>Please see <link href="tutorial.html">the Tutorial</link>.</p>
<p>The API documentation is <link
href="api/rexml/index.html">here</link>. Some examples using
REXML are included in the distribution archive, and the Tutorial
provides examples with commentary.</p>
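<p>For a quick taste before you open the Tutorial, here is a minimal
sketch of parsing, adding an element, and writing the tree back out.
It assumes a current REXML release; the sample document and the
variable names are made up, and details may differ in 1.2.5.</p>
<example><![CDATA[
```ruby
require 'rexml/document'

# Hypothetical input; any well-formed XML string would do.
doc = REXML::Document.new('<inventory><item id="1"/></inventory>')

# Create and add a new element in one call, with attributes from a Hash.
item = doc.root.add_element('item', { 'id' => '2' })
item.text = 'widget'

# Serialize the whole tree into a String.
output = +''
doc.write(output)
puts output
```
]]></example>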
</subsection>
</operation>
<status>
<subsection title="Speed and Completeness">
<p>Unfortunately, NQXML is the only package REXML can be compared
against; XMLParser uses expat, which is a native library, and
really is a different beast altogether. So in comparing NQXML and
REXML you can look at three things: speed, size, and API.</p>
<p><link href="benchmarks/index.html">Benchmarks</link></p>
<p>REXML is faster than NQXML in some things, and
slower than NQXML in a couple of things. You can see this for
yourself by running the supplied benchmarks, although it may not
be clear from these which operations are slower. Most of the
places where REXML is slower are because of the convenience
methods<footnote>For example, <code>element.elements[index]</code>
isn't
really an array operation; index can be an Integer or an XPath,
and this feature is relatively time expensive.</footnote>. On the
positive side,
most of the convenience methods can be bypassed if you know what
you are doing. Check the <link href="benchmarks/index.html">
benchmark comparison page</link> for a <em>general</em>
comparison. You can look at the benchmark code yourself to decide
how much salt to take with them.</p>
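<p>The two faces of <code>element.elements[index]</code> can be
sketched as follows, assuming a current REXML release. The sample
document is made up, and the last line is just one possible way of
sidestepping the convenience layer, not a recommendation from the
benchmarks.</p>
<example><![CDATA[
```ruby
require 'rexml/document'

# Hypothetical sample document.
doc = REXML::Document.new('<list><item id="a">one</item><item id="b">two</item></list>')
items = doc.root.elements

by_index = items[2]               # Integer index, 1-based as in XPath
by_xpath = items["item[@id='b']"] # XPath string, relative to the root element

puts by_index.text
puts by_xpath.equal?(by_index)

# Working on the plain child array instead (0-based, like any Ruby Array).
direct = doc.root.children.select { |node| node.is_a?(REXML::Element) }[1]
puts direct.equal?(by_index)
```
]]></example>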
<!-- ruby -nle 'print unless /^\s*(#.*|)$/' rexml/*.rb | wc -l -->
<p>The sizes of the distributions are very close. NQXML has about
1400 non-blank, non-comment lines of code; REXML
1823<footnote>REXML started out with about 1200, but that
number has been steadily increasing as features are added.
XPath and the helper class StreamListener account for about
320 lines of that code.</footnote></p>
<p>The last thing is the API, and this is where I think REXML wins,
hands down. The core API is clean and intuitive, and things
work the way you would expect them to. Convenience methods
abound, and you can code for either convenience or speed.
REXML code is terse, and readable, like Ruby code should be.
The best way to decide which you like more is to write a couple
of small applications in each, then use the one you're more
comfortable with.</p>
<p>It should be noted that NQXML does not support XPath searches.</p>
</subsection>
<subsection title="XPath">
<p>Here is the status of the XPath implementation.</p>
<example title="Implemented"><![CDATA[/              root
.              self
..             parent
*              all element children
//             all elements in document
//child        all "child" elements in document
parent//child  all "child" descendants of child element "parent"
parent/child   all "child" elements of "parent"
[...]          all predicates (attribute, index, text)
[...][...]     compound predicates
element        child element "element"
function()     (partially)
axis::         (partially)]]></example>
<p>Some of this API (the API dealing with function() handling, in particular) is subject to change.</p>
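<p>A few of the expressions above, exercised through
<code>REXML::XPath</code>. This is a sketch that assumes a current
REXML release; the library document is made up, and given the
warning above, the exact calls may not match 1.2.5.</p>
<example><![CDATA[
```ruby
require 'rexml/document'

# Hypothetical sample document.
doc = REXML::Document.new(<<~XML)
  <library>
    <shelf id="a"><book lang="en">Ruby</book><book lang="de">XML</book></shelf>
    <shelf id="b"><book lang="en">XPath</book></shelf>
  </library>
XML

# "//child": all "book" elements in the document.
all_books = REXML::XPath.match(doc, '//book').map(&:text)

# "parent/child" with an attribute predicate.
english = REXML::XPath.match(doc, "/library/shelf/book[@lang='en']").map(&:text)

# Index predicates are 1-based.
second = REXML::XPath.first(doc, '/library/shelf[1]/book[2]').text

# Text predicate, then ".." to step up to the parent element.
xpath_book = REXML::XPath.first(doc, "//book[text()='XPath']")
shelf_id = REXML::XPath.first(xpath_book, '..').attributes['id']

puts all_books.inspect, english.inspect, second, shelf_id
```
]]></example>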
</subsection>
<subsection title="Namespaces">
<p>Namespace support is now fairly stable. One thing to be aware
of is that REXML is not (yet) a validating parser. This means
that some invalid namespace declarations are not caught.</p>
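<p>A minimal sketch of reading namespace information from a parsed
element, assuming a current REXML release. The document and the
<code>urn:example:x</code> URI are placeholders.</p>
<example><![CDATA[
```ruby
require 'rexml/document'

# Hypothetical document; the namespace URI is a placeholder.
doc = REXML::Document.new('<a xmlns:x="urn:example:x"><x:b/></a>')
b = doc.root.elements[1]

puts b.prefix     # prefix used on the element
puts b.name       # local name, without the prefix
puts b.namespace  # URI bound to that prefix
```
]]></example>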
</subsection>
<bugs lang="en">
<item>Tobias has once again done the unmentionable, and completely
overturned my comfortable little world. In this case, he's shown
XPath to be broken, in a way I hadn't anticipated. He's got an
XPath that does some really gnarly things, from an evaluation point
of view, such as having predicates containing functions which
themselves have arguments defined by xpaths containing predicates
and functions. This is going to take some work.</item>
<item>There may be a problem with over-escaping characters in
attribute values.</item>
<item>Sometimes the test suite hangs or segfaults the Ruby
interpreter. If this is something that I can fix, then it is a
bug, and I will fix it.</item>
<!--
<item status="fixed">Is "." a valid element name character? If it is,
there is a bug in the element name regexp. (Tobias Reif)</item>
<item status="fixed">There seems to be a bug in the line reporting
code.</item>
<item status="fixed">Have trouble dealing with Attribute values that
contain apostrophes. (<link href="mailto:murphybryanp@yahoo.com">Bryan
Murphy</link>)</item>
<item status="fixed">Michael Neumann pointed out that in some cases
the close tags were not expanded.</item>
<item status="fixed">Entities such as &#233; are not handled
properly. (Thanks to Tobias for noticing this one.)</item>
<item status="fixed">Namespaces are not fully tested, and if they
work at all, they'll be buggy.</item>
<item status="fixed">Only the most primative DocType declarations
are tested; if you declare entities in your doctypes, your mileage
may vary.</item>
<item status="fixed">I'm pretty sure that the Node .*_sibling
methods don't work in all cases, because I know that some classes
that extend node aren't maintaining the node lists.</item>
<item status="fixed">I don't think the XPath "..." is working
properly ('cause I don't know what it <em>should</em> do), and "*"
might be incorrectly implemented.</item>
-->
</bugs>
<todo lang="en">
<!-- http://www.oasis-open.org/committees/xml-conformance/ -->
<item>Markus Jais would like to know if REXML should indent output
by default, as it does, or whether it would be better not to
indent output by default.</item>
<item>Make a REXML mailing list</item>
<item>What should the XPath "/" return? What should "item/ancestor::"
return? According to the XPath spec, "/" should return the root
element... however, "/root" should also return the root element
(assuming the root element is "root"). This is stupid. The
XPath spec appears to be ambiguous on this point.</item>
<item>I put up an RFC about Element.elements.each("xpath") { ...
}, saying that I'd like to change it. So far, the response has
been mixed, so maybe I'll leave it. The jury is still out.</item>
<item>Add a default listener that constructs trees based on an
event map. NQXML does something like this:<code><![CDATA[
nd = NQXML::Dispatcher.new(file)
nd.handle(:start_element, %w(root level1 level2)) { | e |
# do something with e
}
nd.handle(:text, %w(root level1 level2 level3)) { | e |
# reads text inside <level3> tag
}
nd.start() ]]></code> I'd want it to look similar; basically, the user
passes the parser a set of tags, which tells the Stream Listener
which sub-trees to build. When a sub-tree is finished being built,
some event is triggered.</item>
<item>Allow the user to add entity conversions</item>
<item>I'd like to hack the nascent SVG tool and XMLRPC4R to use REXML, for my own purposes.</item>
<!--
<item status='fixed'>Should insert_after insert an element anywhere in
the tree, or just in the children of the current element?</item>
<item status='fixed'>Add to_a to Parent and Element.Elements.
(Requested by
<link href="mailto:jesusluv@tampabay.rr.com">Jonothon Ortiz</link>)
</item>
<item status='fixed'>It looks like XPath is going to require yet
<em>another</em>
rewrite to take it to the next level; either that, or I'm going to
have to do some ugly character parsing. Any way it happens, it
isn't going to be pleasant. The good thing about a rewrite is that
I might be able to get the speed of XPath up significantly, and
simplify the XPath code at the same time.</item>
<item status="fixed">UTF support. This probably won't happen until the Ruby core
classes themselves support UTF, or until I find an extension that
makes supporting UTF with the Ruby core classes easy.</item>
<item status="fixed">Complete XPath</item>
<item status="fixed">Complete functions</item>
<item status="fixed">Logo. I make terrible logos. (Erik Terpstra has
donated several. Thanks!)</item>
<item status="fixed">Add an :all feature to :respect_whitespace</item>
<item status="fixed">Make sure that whitespace is respected during
programmatic document creation. (add a unit test)</item>
<item status="fixed">There is no way for the user to specify when
text is RAW.</item>
<item status="fixed">Improve whitespace handling, to be more flexible.
This will
require allowing the user to specify which elements to ignore
whitespace in. How should this look? Somehow, the user has
too be able to tell the parser which tags to process raw. I'm
thinking of something like Document.new( source, *tags ).</item>
<item status="fixed">Better error reporting (such as at which line in
the parsed
document the error occurs).</item>
<item status="fixed">Inserting elements should be easier. I'm partial to
<code>b_element.next_sibling = c_element</code>, but another
good suggestion was <code>parent.insert_after("xpath",
element)</code>. <note>I implemented both.</note></item>
<item status="fixed">Streamed document parsing</item>
<item status="fixed">Improve the benchmark</item>
<item status="fixed">Better test suite</item>
<item status="fixed">Finish Namespace support and testing</item>
<item status="fixed">Finish DocType support and testing</item>
<item status="fixed">Comparison benchmarks from Electric XML</item>
-->
<item status="request">Optionally not process character entities</item>
<item status="request">Process entity declarations in DocType.</item>
<item status="request">Overload Element constructor to allow passing
a hash list of attributes. This will slow down REXML, probably
significantly.</item>
</todo>
</status>
<credits>
<p>I've had help from a number of resources; if I haven't listed you
here, it means that I just haven't gotten around to adding you, or
that I'm a dork and have forgotten. In either case, feel free to
write me and complain. I may ignore you, but at least you
tried. (Actually, I don't consciously ignore anybody except spammers.)</p>
<list>
<item><link href="mailto:erik@solidcode.net">Erik Terpstra</link>
heard my pleas and submitted several logos for REXML. After sagely
avoiding choosing one for several weeks, I finally forced my poor
slave of a wife to pick one (this is what we call "delegation").
She did, with caveats; Erik quickly made the changes, and the
result is what you now see at the top of this page. He also
supplied a <link href="img/rexml_50p.png">smaller version</link>
that you can include with your projects that use REXML, if you'd
like.
</item>
<item>Bug fixes provided by: <link
href="mailto:ukai@debian.or.jp">Fumitoshi UKAI</link> (CData
metacharacter quoting bug)</item>
<item><link href="mailto:oliver@debian.org">Oliver M. Bolzer</link>
is maintaining a Debian package distribution of REXML. He also has
provided good feedback and bug reports about namespace support.</item>
<item><link href="mailto:erne@powernav.com">Ernest Ellingson</link>
contributed the sourcecode for turning UTF16 and UNILE encodings
into UTF8, which allowed REXML to get the 100% OASIS valid tests
rating.</item>
<item><link href="mailto:maki@inac.co.jp">TAKAHASHI Masayoshi</link>,
for information on UTF</item>
<item><link href="mailto:james@rubyxml.com">James Britt</link> contributed
code that makes Document.parse_stream easier to use by allowing
it to be passed either a Source, a File, or a String.
</item>
<item><link
href="http://www.themindelectric.com/products/xml/xml.html">Electric
XML</link>: This was, after all, the inspiration for REXML.
Originally, I was just going to do a straight port, and although
REXML doesn't in any way, shape or form resemble Electric XML,
still the basic framework and philosophy was inspired by E-XML.
And I still use E-XML in my Java projects.</item>
<item><link href="mailto:tobiasreif@pinkjuice.com">Tobias
Reif</link>: Numerous bug reports, and suggestions for
improvement.</item>
<item><link href="http://www.io.com/~jimm/downloads/nqxml/index.html">NQXML</link>:
While I may complain about the NQXML API, I wrote a few
applications using it that wouldn't have been written otherwise,
and it was very useful to me. It also encouraged me to write
REXML. Never complain about free software *slap*.</item>
<item><link href="mailto:feldt@ce.chalmers.se">Robert
Feldt</link>: Bug reports and suggestions/recommendations about
improving REXML. Testing is one of the most important aspects of
software development.</item>
</list>
</credits>
</documentation>