File: documentation.xml

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="http://www.germane-software.com/~ser/Software/documentation.xsl"?>
<!DOCTYPE documentation>
<documentation> 
	<head> 
		<title>REXML</title>
		<banner href="img/rexml.png"/>
		<version>1.2.5</version> 
		<date>*2002-11</date>
		<home>http://www.germane-software.com/~ser/software/rexml</home>
		<base>rexml</base>
		<archive type="unix">Current release</archive> 
		<archive type="dos">Current release</archive>
		<language>ruby</language>
		<author email="ser@germane-software.com" href="http://www.germane-software.com/~ser">Sean Russell</author>
	</head>


	<overview> 
		<purpose lang="en"> 
			<p>REXML stands for &#34;Ruby Electric XML&#34;. Sorry. I&#39;m not
				very creative when it comes to working names for my software, and I
				invariably use the working names as the final product names. The
				&#34;Ruby&#34; comes from the Ruby language, obviously. The
				&#34;Electric XML&#34; comes from the inspiration for this project,
				the Electric XML Java processing library.</p>
			<p>This software is distributed under the <link
					href="LICENSE.txt">Ruby license</link>.</p>
			<p>1.2.5: Bug fixes: doctypes that had spaces between the closing ]
				and &gt; generated errors.  There was a small bug that caused too
				many newlines to be generated in some output.  Eelis van der Weegen
				(what a great name!) pointed out one of the numerous API errors.
				Julian requested that add_attributes take both Hash (original) and
				array of arrays (as produced by StreamListener).  I killed the
				mailing list, accidentally, and fixed it again.  Fixed a bug in
				next_sibling, caused by the combination of overriding
				&lt;=&gt;() and using Array.index().</p>
			<p>1.2.4: Changes since 1.1b: 100% OASIS valid tests passed.
				UTF-8/16 support.  Many bug fixes. to_a() added to Parent and
				Element.elements.  Updated tutorial.  Added variable IOSource
				buffer size, for stream parsing.  delete() now fails silently
				rather than throwing an exception if it can't find the element to
				delete.  Added a patch to support REXMLBuilder. Reorganized file
				layout in distribution; added a repackaging program; added the
				logo.</p>
			<p>1.1b: Changes since 1.1a: Stream parsing added. Bug fixes in entity 
				parsing.  New XPath implementation, fixing many bugs and making
				it feature complete.  Completed whitespace handling, adding much
				functionality and fixing several bugs.  Added convenience methods
				for inserting elements.  Improved error reporting.  Fixed
				attribute content to correctly handle quotes and apostrophes. Added
				mechanisms for handling raw text.  Cleaned up utility programs
				(profile.rb, comparison.rb, etc.).  Improved speed a little.
				Brought REXML up to 98.9% OASIS valid source compliance.</p>
	</purpose> 


	<general> 
		<p>Why REXML? There are, at the time of this writing, already two XML
			parsers for Ruby. The first is a Ruby binding to a native XML
			parser. This is a fast parser, using proven technology. However,
			it isn&#39;t very portable. The second is a native Ruby
			implementation, and as useful as it is, it has (IMO) a difficult
			API.</p>
		<p>I have this problem: I dislike obfuscated APIs. There are
			several XML parser APIs for Java. Most of them follow DOM or SAX,
			and are very similar in philosophy to an increasing number of
			Java APIs. Namely, they look like they were designed by theorists
			who never had to use their own APIs. The extant XML APIs, in
			general, suck. They take a markup language which was specifically
			designed to be very simple, elegant, and powerful, and wrap an
			obnoxious, bloated, and large API around it. I was always having
			to refer to the API documentation to do even the most basic XML
			tree manipulations; nothing was intuitive, and almost every
			operation was complex. </p>
		<p>Then along came Electric XML. </p> 
		<p>Ah, bliss. Look at the Electric XML API. First, the library is
			small; less than 500K. Next, the API is intuitive. You want to
			parse a document? doc = new Document( some_file ). Create and add
			a new element? element = parent.addElement( tag_name ). Write out
			a subtree? element.write( writer ). Now how about DOM? To parse
			some file: parser = new DOMParser(); parser.parse( new
			InputSource( new FileInputStream( some_file ) ) ). Create a new
			element? First you have to know the owning document of the
			to-be-created node (can anyone say &#34;global variables, or
			obtuse, multi-argument methods&#34;?) and call element =
			doc.createElement( tag_name ). Then you get to call
			parent.appendChild( element ). &#34;appendChild&#34;? Where did
			they get that from? How many different methods do we have in Java
			in how many different classes for adding children to parents?
			addElement()? add()? put()? appendChild()? Heaven forbid that you
			want to create an Element elsewhere in the code without having
			access to the owning document. I&#39;m not even going to go into
			what travesty of code you have to go through to write out an XML
			sub-tree in DOM. </p>
		<p>So, I use Electric XML extensively. It is small, fast, and
			intuitive. I.e., the API doesn&#39;t add a bunch of work to the task
			of writing software. When I started to write more software in
			Ruby, I needed an XML parser. I wasn&#39;t keen on the native
			library binding, &#34;XMLParser&#34;, because I try to avoid
			complex library dependencies in my software, when I can. For a
			long time, I used NQXML, because it was the only other parser out
			there. However, the NQXML API can be even more painful than the
			Java DOM API. Almost all element operations require some
			indirect node access... you had to do something like
			element.node.attr[&#39;key&#39;], and it is never obvious to me
			when you access the element directly, or the node... or, really,
			why they&#39;re two different objects, anyway. This is even more
			unfortunate since Ruby is so elegant and intuitive, and bad APIs
			really stand out.  I'm not, by the way, trying to insult NQXML; I
			just don't like the API.</p>
		<p>I wrote the people at TheMind (Electric XML... get it?) and asked
			them if I could do a translation to Ruby. They said yes. After a
			few weeks of hacking on it for a couple of hours each week, and
			after having gone down a few blind alleys in the translation, I
			had a working beta. I.e., it parsed, but hadn&#39;t gone through a
			lot of strenuous testing. Along the way, I had made a few changes
			to the API, and a lot of changes to the code. First off, Ruby does
			iterators differently than Java. Java uses a lot of helper
			classes. Helper classes are exactly the kinds of things that
			theorists come up with... they look good on paper, but using them
			is like chewing glass. You find that you spend 50% of your time
			writing helper classes just to support the other 50% of the code
			that actually does the job you were trying to do in the first
			place. In this case, the Java helper classes are either
			Enumerations or Iterators.  Ruby, on the other hand, uses blocks,
			which is much more elegant. Rather than:</p>
<example>for (Enumeration e=parent.getChildren(); e.hasMoreElements(); ) {
   Element child = (Element)e.nextElement();
   // Do something with child
}</example><p>you get:</p>
<example>parent.each_child { |child|
  # Do something with child
}</example>
		<p>Can&#39;t you feel the peace and contentment in this block of
		code? Ruby is the language Buddha would have programmed in.</p>
		<p>Anyhoo, I chose to use blocks in REXML directly, since this is
		more common to Ruby code than <code>for x in y ... end</code>, which
		is as orthogonal to the original Java as possible.</p>
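		<p>To make the contrast concrete, here is a minimal sketch of both
		iteration styles side by side. The <code>Node</code> class and its
		<code>add</code> and <code>each_child</code> methods are hypothetical
		stand-ins written for this illustration; they are not REXML's own
		classes.</p>
<example><![CDATA[
```ruby
# Hypothetical stand-in for a tree node; not REXML's actual Parent class.
class Node
  attr_reader :name, :children

  def initialize(name)
    @name = name
    @children = []
  end

  # Append a child and return it.
  def add(child)
    @children << child
    child
  end

  # Block-style iteration: hand each child to the caller's block.
  def each_child
    @children.each { |child| yield child }
    self
  end
end

parent = Node.new("parent")
parent.add(Node.new("a"))
parent.add(Node.new("b"))

# Block style, as favored in Ruby code.
block_names = []
parent.each_child { |child| block_names << child.name }

# The for ... in ... end form visits the same children.
for_names = []
for child in parent.children
  for_names << child.name
end
```
]]></example>
		<p>Both loops collect the same names; the block form simply needs no
		helper object on the caller's side.</p>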
		<p>Also, I changed the naming conventions to more Ruby-esque method
		names. For example, the Java method <code>getAttributeValue()</code>
		becomes in Ruby <code>get_attribute_value()</code>. This is a
		toss-up. I actually like the Java naming convention more, but the
		latter is more common in Ruby code, and I&#39;m trying to make things
		easy for Ruby programmers, not Java programmers.</p>
		<p>The biggest change was in the code. The Java version of Electric
		XML did a lot of efficient String-array parsing, character by
		character. Ruby, however, has ubiquitous, efficient, and powerful
		regular expression support. All regex functions are implemented in
		native code, so they are very fast, and the power of Ruby regexps
		rivals that of Perl. Therefore, a direct conversion of the Java code
		to Ruby would have been more difficult, and much slower, than using
		Ruby regexps. I therefore used regexps. In doing so, I cut the number of lines of
		sourcecode by half<footnote>It might interest you to know that at
		last count, Electric XML had ~3,700 non-comment, non-empty lines of
		code.  REXML had ~1,550. This illustrates the marvelous efficiency
		and power of Ruby.</footnote>.</p>
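		<p>A rough sketch of what regexp-driven tokenizing can look like
		follows. It is a toy written for this page, not REXML's actual
		parser: the <code>TOKEN</code> pattern and the <code>tokenize</code>
		method are made-up names, and the sketch ignores attributes,
		comments, entities, and self-closing tags.</p>
<example><![CDATA[
```ruby
# Illustrative only: one pattern with three alternatives, tried in order:
#   1. a close tag, capturing its name
#   2. an open tag, capturing its name (any attributes are skipped)
#   3. a run of text up to the next tag
TOKEN = /<\/(\w+)>|<(\w+)[^>]*>|([^<]+)/

# Turn a string of simple markup into [type, value] pairs.
def tokenize(xml)
  xml.scan(TOKEN).map do |close, open, text|
    if close
      [:end, close]
    elsif open
      [:start, open]
    else
      [:text, text]
    end
  end
end

tokens = tokenize("<a><b>hi</b></a>")
```
]]></example>
		<p>The regexp engine does the character-level work here, so no
		hand-written character loop appears in the Ruby code.</p>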
		<p>Finally, by this point the API looks almost nothing like the
		original Electric XML API, and practically none of the code is even
		vaguely similar.  However, even though the actual code is completely
		different, I did borrow Electric's approach to processing XML,
		and am deeply indebted to the Electric XML code for
		inspiration.</p>
	  	</general>

		<features lang="en">
			<item>Simple API</item>
			<item>Both stream (SAX) and tree (DOM) parsing<footnote>Be aware, however, that REXML is neither DOM nor SAX compliant, and never will be.  The DOM and SAX APIs are unwieldy.</footnote></item>
			<item>Small</item>
			<item>Native Ruby</item>
			<item>Documentation(!)</item>
		</features>
	</overview>

	<operation lang="en">
		<subsection title="Installation">
			<p>Run &#39;ruby install.rb&#39;.  By the way, you really should
			look at these sorts of files before you run them as root.  They
			could contain anything, and since (in Ruby, at least) they tend to
			be mercifully short, it doesn't hurt to glance over them.</p>
		</subsection>

		<subsection title="General Usage">
			<p>Please see <link href="tutorial.html">the Tutorial</link>.</p>
			<p>The API documentation is <link
			href="api/rexml/index.html">here</link>.  Some examples using
			REXML are included in the distribution archive, and the Tutorial
			provides examples with commentary.</p>
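			<p>As a quick taste, the sketch below parses a document, adds an
			element, and serializes the result. It assumes a current REXML
			release loaded with <code>require "rexml/document"</code>; method
			names in older releases may differ, and the sample document is
			invented for illustration.</p>
<example><![CDATA[
```ruby
require "rexml/document"

# Parse a document from a string (sample data made up for this sketch).
doc = REXML::Document.new("<inventory><item name='pen'/></inventory>")

# Create and add a new element, with attributes, in one call.
added = doc.root.add_element("item", { "name" => "ink" })

# Serialize the subtree rooted at the root element.
xml = doc.root.to_s

# Collect an attribute from every child element.
names = doc.root.elements.to_a.map { |e| e.attributes["name"] }
```
]]></example>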
		</subsection>
	</operation>

	<status>
		<subsection title="Speed and Completeness">
			<p>Unfortunately, NQXML is the only package REXML can be compared
				against; XMLParser uses expat, which is a native library, and
				really is a different beast altogether.  So in comparing NQXML and
				REXML you can look at three things: speed, size, and API.</p>
			<p><link href="benchmarks/index.html">Benchmarks</link></p>
			<p>REXML is faster than NQXML in some things, and 
				slower than NQXML in a couple of things.  You can see this for
				yourself by running the supplied benchmarks, although it may not
				be clear which operations are slower from these.  Most of the
				places where REXML is slower are due to the convenience
				methods<footnote>For example, <code>element.elements[index]</code>
				isn't really an array operation; index can be an Integer or an
				XPath, and this feature is relatively expensive in
				time.</footnote>.  On the positive side,
				most of the convenience methods can be bypassed if you know what
				you are doing.  Check the <link href="benchmarks/index.html">
				benchmark comparison page</link> for a <em>general</em>
				comparison.  You can look at the benchmark code yourself to decide
				how much salt to take with them.</p>
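			<p>The sketch below illustrates the trade-off, assuming a current
			REXML release; the sample document is invented. Reading the raw
			child list is shown as one possible way to sidestep the
			convenience layer, not as the recommended one.</p>
<example><![CDATA[
```ruby
require "rexml/document"

doc = REXML::Document.new("<root><a id='1'/><b id='2'/><a id='3'/></root>")
elements = doc.root.elements

# Integer index: counts child elements starting from 1, as XPath does.
first_child = elements[1]

# String index: treated as an XPath; the first match is returned.
first_b = elements["b"]
by_predicate = elements["a[@id='3']"]

# Plain array access on the child list (0-based, and it would also
# include text nodes if the document contained any).
raw_second = doc.root.children[1]
```
]]></example>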
			<!-- ruby -nle 'print unless /^\s*(#.*|)$/' rexml/*.rb | wc -l -->
			<p>The sizes of the distributions are very close.  NQXML has about
				1400 non-blank, non-comment lines of code; REXML
				1823<footnote>REXML started out with about 1200, but that
				number has been steadily increasing as features are added.
				XPath and the helper class StreamListener account for about
				320 lines of that code.</footnote>.</p>
			<p>The last thing is the API, and this is where I think REXML wins,
				hands down.  The core API is clean and intuitive, and things
				work the way you would expect them to.  Convenience methods
				abound, and you can code for either convenience or speed.
				REXML code is terse, and readable, like Ruby code should be.
				The best way to decide which you like more is to write a couple
				of small applications in each, then use the one you're more
				comfortable with.</p>
			<p>It should be noted that NQXML does not support XPath searches.</p>
		</subsection>


		<subsection title="XPath">
			<p>Here is the status of the XPath implementation.</p>
			<example title="Implemented"><![CDATA[/                    root
.                    self
..                   parent
*                    all element children
//                   all elements in document
//child              all "child" elements in document
parent//child        all "child" descendants of child element "parent"
parent/child         all "child" elements of "parent"
[...]                all predicates (attribute, index, text)
[...][...]           compound predicates
element              child element "element"
function()           (partially)
axis::               (partially)]]></example>
			<p>Some of this API (the API dealing with function() handling, in particular) is subject to change.</p>
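			<p>A few of the expressions above, tried against a small invented
			document. This sketch assumes the <code>REXML::XPath.match</code>
			and <code>REXML::XPath.first</code> class methods of a current
			REXML release; given the caveat above, treat it as illustrative
			rather than definitive.</p>
<example><![CDATA[
```ruby
require "rexml/document"

doc = REXML::Document.new(<<~XML)
  <library>
    <shelf id="s1">
      <book lang="en"><title>Ruby</title></book>
      <book lang="de"><title>XML</title></book>
    </shelf>
  </library>
XML

# //child: all "book" elements anywhere in the document.
all_books = REXML::XPath.match(doc, "//book")

# parent/child steps, starting from the root.
shelf_books = REXML::XPath.match(doc, "/library/shelf/book")

# *: all element children of the context node.
root_children = REXML::XPath.match(doc.root, "*")

# [...]: an attribute predicate, followed by another child step.
german_title = REXML::XPath.first(doc, "//book[@lang='de']/title")

# ..: the parent of the context node.
title_parent = REXML::XPath.first(german_title, "..")
```
]]></example>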
		</subsection>


		<subsection title="Namespaces">
			<p>Namespace support is now fairly stable.  One thing to be aware
			of is that REXML is not (yet) a validating parser.  This means
			that some invalid namespace declarations are not caught.</p>
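			<p>For orientation, a minimal sketch of reading namespace
			information from parsed elements, assuming the
			<code>prefix</code>, <code>name</code>, and <code>namespace</code>
			accessors of a current REXML release. The URI
			<code>urn:example:x</code> is a placeholder.</p>
<example><![CDATA[
```ruby
require "rexml/document"

doc = REXML::Document.new(
  "<x:root xmlns:x='urn:example:x'><x:item/><item/></x:root>"
)
root = doc.root
prefixed = root.elements[1]   # the x:item child
plain = root.elements[2]      # the unprefixed item child

root_prefix = root.prefix           # prefix part of the tag name
root_name = root.name               # local name, without the prefix
root_uri = root.namespace           # URI bound to the element's prefix
child_uri = prefixed.namespace      # the declaration on root is inherited
plain_prefix = plain.prefix         # empty for an unprefixed element
```
]]></example>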
		</subsection>


		<bugs lang="en">
			<item>Tobias has once again done the unmentionable, and completely
				overturned my comfortable little world.  In this case, he's shown
				XPath to be broken, in a way I hadn't anticipated.  He's got an
				XPath that does some really gnarly things, from an evaluation point
				of view, such as having predicates containing functions which
				themselves have arguments defined by xpaths containing predicates
				and functions.  This is going to take some work.</item>
			<item>There may be a problem with over-escaping characters in
				attribute values.</item>
			<item>Sometimes the test suite hangs or segfaults the Ruby
				interpreter.  If this is something that I can fix, then it is a
				bug, and I will fix it.</item>
			<!--
			<item status="fixed">Is "." a valid element name character?  If it is, 
				there is a bug in the element name regexp. (Tobias Reif)</item>
			<item status="fixed">There seems to be a bug in the line reporting 
				code.</item>
			<item status="fixed">Have trouble dealing with Attribute values that 
				contain apostrophes. (<link href="mailto:murphybryanp@yahoo.com">Bryan
				Murphy</link>)</item>
			<item status="fixed">Michael Neumann pointed out that in some cases
				the close tags were not expanded.</item>
			<item status="fixed">Entities such as &amp;#233; are not handled
				properly.  (Thanks to Tobias for noticing this one.)</item>
			<item status="fixed">Namespaces are not fully tested, and if they
			work at all, they'll be buggy.</item>
			<item status="fixed">Only the most primative DocType declarations
			are tested; if you declare entities in your doctypes, your mileage
			may vary.</item>
			<item status="fixed">I'm pretty sure that the Node .*_sibling
			methods don't work in all cases, because I know that some classes
			that extend node aren't maintaining the node lists.</item>
			<item status="fixed">I don't think the XPath "..." is working
			properly ('cause I don't know what it <em>should</em> do), and "*"
			might be incorrectly implemented.</item>
			-->
		</bugs>

		<todo lang="en">
			<!-- http://www.oasis-open.org/committees/xml-conformance/ -->
			<item>Markus Jais would like to know if REXML should indent output
				by default, as it does, or whether it wouldn't be better if the
				default behavior would be to not indent output.</item>
			<item>Make a REXML mailing list</item>
			<item>What should the XPath "/" return?  What should "item/ancestor::" 
				return?  According to the XPath spec, "/" should return the root
				element... however, "/root" should also return the root element
				(assuming the root element is "root").  This is stupid.  The
				XPath spec appears to be ambiguous on this point.</item>
			<item>I put up an RFC about Element.elements.each("xpath") { ...
				}, saying that I'd like to change it.  So far, the response has
				been mixed, so maybe I'll leave it.  The jury is still out.</item>
			<item>Add a default listener that constructs trees based on an
				event map.  NQXML does something like this:<code><![CDATA[
nd = NQXML::Dispatcher.new(file)
nd.handle(:start_element, %w(root level1 level2)) { | e |
	# do something with e
}
nd.handle(:text, %w(root level1 level2 level3)) { | e |
	# reads text inside <level3> tag
}
nd.start() ]]></code>  I'd want it to look similar; basically, the user
				passes the parser a set of tags for which the Stream Listener
				should build sub-trees.  When a sub-tree has finished being
				built, some event is triggered.</item>
			<item>Allow the user to add entity conversions</item>
			<item>I'd like to hack the nascent SVG tool and XMLRPC4R to use REXML, for my own purposes.</item>
			<!--
			<item status='fixed'>Should insert_after insert an element anywhere in 
				the tree, or just in the children of the current element?</item>
			<item status='fixed'>Add to_a to Parent and Element.Elements.  
				(Requested by 
				<link href="mailto:jesusluv@tampabay.rr.com">Jonothon Ortiz</link>)
			</item>
			<item status='fixed'>It looks like XPath is going to require yet 
				<em>another</em>
				rewrite to take it to the next level; either that, or I'm going to
				have to do some ugly character parsing.  Any way it happens, it
				isn't going to be pleasant.  The good thing about a rewrite is that
				I might be able to get the speed of XPath up significantly, and
				simplify the XPath code at the same time.</item>
			<item status="fixed">UTF support.  This probably won't happen until the Ruby core
				classes themselves support UTF, or until I find an extension that
				makes supporting UTF with the Ruby core classes easy.</item>
			<item status="fixed">Complete XPath</item>
			<item status="fixed">Complete functions</item>
			<item status="fixed">Logo.  I make terrible logos. (Erik Terpstra has
				donated several.  Thanks!)</item>
			<item status="fixed">Add an :all feature to :respect_whitespace</item>
			<item status="fixed">Make sure that whitespace is respected during 
				programmatic document creation. (add a unit test)</item>
			<item status="fixed">There is no way for the user to specify when 
				text is RAW.</item>
			<item status="fixed">Improve whitespace handling, to be more flexible.  
				This will
				require allowing the user to specify which elements to ignore
				whitespace in.  How should this look?  Somehow, the user has
				too be able to tell the parser which tags to process raw.  I'm
				thinking of something like Document.new( source, *tags ).</item>
			<item status="fixed">Better error reporting (such as at which line in 
				the parsed
				document the error occurs).</item>
			<item status="fixed">Inserting elements should be easier.  I'm partial to
				<code>b_element.next_sibling = c_element</code>, but another
				good suggestion was <code>parent.insert_after("xpath",
					element)</code>.  <note>I implemented both.</note></item>
			<item status="fixed">Streamed document parsing</item>
			<item status="fixed">Improve the benchmark</item>
			<item status="fixed">Better test suite</item>
			<item status="fixed">Finish Namespace support and testing</item>
			<item status="fixed">Finish DocType support and testing</item>
			<item status="fixed">Comparison benchmarks from Electric XML</item>
			-->
			<item status="request">Optionally not process character entities</item>
			<item status="request">Process entity declarations in DocType.</item>
			<item status="request">Overload Element constructor to allow passing
				a hash list of attributes.  This will slow down REXML, probably
				significantly.</item>
		</todo>
	</status>


	<credits>
		<p>I've had help from a number of sources; if I haven't listed you
		here, it means that I just haven't gotten around to adding you, or
		that I'm a dork and have forgotten.  In either case, feel free to
		write me and complain.  I may ignore you, but at least you
		tried. (Actually, I don't consciously ignore anybody except spammers.)</p>
		<list>
			<item><link href="mailto:erik@solidcode.net">Erik Terpstra</link>
				heard my pleas and submitted several logos for REXML.  After sagely
				avoiding choosing one for several weeks, I finally forced my poor
				slave of a wife to pick one (this is what we call "delegation").
				She did, with caveats; Erik quickly made the changes, and the
				result is what you now see at the top of this page.  He also
				supplied a <link href="img/rexml_50p.png">smaller version</link>
				that you can include with your projects that use REXML, if you'd
				like.
			</item>
			<item>Bug fixes provided by: <link
					href="mailto:ukai@debian.or.jp">Fumitoshi UKAI</link> (CData
				metacharacter quoting bug)</item>
			<item><link href="mailto:oliver@debian.org">Oliver M. Bolzer</link>
				is maintaining a Debian package distribution of REXML.  He also has
				provided good feedback and bug reports about namespace support.</item>
			<item><link href="mailto:erne@powernav.com">Ernest Ellingson</link>
				contributed the sourcecode for turning UTF16 and UNILE encodings
				into UTF8, which allowed REXML to get the 100% OASIS valid tests
				rating.</item>
			<item><link href="mailto:maki@inac.co.jp">TAKAHASHI Masayoshi</link>, 
				for information on UTF</item>
			<item><link href="mailto:james@rubyxml.com">James Britt</link> contributed
				code that makes Document.parse_stream easier to use by allowing
				it to be passed either a Source, File, or String.
			</item>
			<item><link
			href="http://www.themindelectric.com/products/xml/xml.html">Electric
			XML</link>: This was, after all, the inspiration for REXML.
			Originally, I was just going to do a straight port, and although
			REXML doesn't in any way, shape or form resemble Electric XML,
			still the basic framework and philosophy were inspired by E-XML.
			And I still use E-XML in my Java projects.</item>
			<item><link href="mailto:tobiasreif@pinkjuice.com">Tobias
					Reif</link>: Numerous bug reports, and suggestions for
				improvement.</item>
			<item><link href="http://www.io.com/~jimm/downloads/nqxml/index.html">NQXML</link>:
			While I may complain about the NQXML API, I wrote a few
			applications using it that wouldn't have been written otherwise,
			and it was very useful to me.  It also encouraged me to write
			REXML.  Never complain about free software *slap*.</item>
			<item><link href="mailto:feldt@ce.chalmers.se">Robert
			Feldt</link>: Bug reports and suggestions/recommendations about
			improving REXML.  Testing is one of the most important aspects of
			software development.</item>
		</list>
	</credits>
</documentation>