FleXML - XML Processor Generator

See also the manual page and a short white paper. Or peek into the master source archive.

FleXML reads a DTD (Document Type Definition) describing the format of XML (Extensible Markup Language) documents; the DTD may also be specified as a URI pointing to it on the web. From this FleXML produces a validating XML processor with an interface to support XML applications. Complete applications can optionally be generated from special action files, either for linking with the processor or for textual combination with it.

FleXML is specifically developed for XML applications where a fixed data format suffices, in the sense that a single DTD is used, without individual extensions, for a large number of documents. Specifically, it restricts XML rule [28] to

  [28r] doctypedecl ::= '<!DOCTYPE' S Name S ExternalID S? '>'

where the ExternalID denotes the DTD used. One might say, in fact, that FleXML implements ``non-extensible'' markup. :)

With this restriction we can do much better because the possible tags and attributes are static: FleXML-generated XML processors read the XML document character by character and can immediately dispatch the actions associated with each element (or reject the document as invalid). Technically this is done by using the Flex scanner generator to produce a deterministic finite automaton with an element context stack for the DTD, which means that there is almost no overhead for XML processing.
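
The element context stack mentioned above can be pictured with a small hand-written sketch. This is not the code FleXML generates: the three-element DTD (doc contains sections, sections contain paragraphs) and every name below are hypothetical and chosen only for illustration. Because the set of elements is fixed, each one can be a plain C enum constant, and checking a tag is a switch plus a stack operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element set; a real set would come from the DTD. */
enum element { EL_DOC, EL_SECTION, EL_PARA };

#define MAX_DEPTH 64

/* Stack of the elements that are currently open. */
struct context {
    enum element stack[MAX_DEPTH];
    size_t depth;
};

/* Content model of the hypothetical DTD:
   doc holds sections, section holds paragraphs, para holds only text. */
static int allowed_child(enum element parent, enum element child)
{
    switch (parent) {
    case EL_DOC:     return child == EL_SECTION;
    case EL_SECTION: return child == EL_PARA;
    default:         return 0;
    }
}

/* Handle a start tag: returns 1 and pushes the element if it is valid
   in the current context, otherwise returns 0 (document is invalid). */
static int start_tag(struct context *c, enum element e)
{
    assert(c != NULL);
    if (c->depth == MAX_DEPTH)
        return 0;
    if (c->depth == 0) {
        if (e != EL_DOC)          /* the root must be doc */
            return 0;
    } else if (!allowed_child(c->stack[c->depth - 1], e)) {
        return 0;
    }
    c->stack[c->depth++] = e;
    return 1;
}

/* Handle an end tag: it must close the innermost open element. */
static int end_tag(struct context *c, enum element e)
{
    assert(c != NULL);
    if (c->depth == 0 || c->stack[c->depth - 1] != e)
        return 0;
    c->depth--;
    return 1;
}
```

An element action would run at the point where start_tag or end_tag succeeds; a return value of 0 corresponds to rejecting the document as invalid.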

Furthermore we have devised a simple extension of the C programming language that facilitates the writing of `element actions' in C, making it easy to write really fast XML applications. In particular we represent XML attribute values efficiently in C when this is possible, thus avoiding the otherwise ubiquitous conversions between strings and data values.

Compared to SAX and its XSL-based friends, FleXML immediately produces efficient code: because extension is forbidden, data can be encoded efficiently. For example, FleXML uses native C `enum' types to implement enumerated attribute types. However, the above limitation does prevent use in more complex settings.

As an example: the following is all that is needed to produce a fast program that prints all href-attributes in <a...> tags in XHTML documents (and rejects invalid XHTML documents).

  <!DOCTYPE actions SYSTEM "flexml-act.dtd">
  <actions>
  <top><![CDATA[           #include <stdio.h>                  ]]></top>
  <start tag='a'><![CDATA[ if ({href}) printf("%s\n", {href}); ]]></start>
  </actions>

In general, action files are themselves XML documents conforming to the DTD

   <!ELEMENT actions ((top|start|end)*,main?)>
   <!ENTITY % C-code "(#PCDATA)">
   <!ELEMENT top   %C-code;>
   <!ELEMENT start %C-code;>  <!ATTLIST start tag NMTOKEN #REQUIRED>
   <!ELEMENT end   %C-code;>  <!ATTLIST end   tag NMTOKEN #REQUIRED>
   <!ELEMENT main  %C-code;>

with %C-code; segments being in C enriched as described below. The elements are used as follows:


top
Use for top-level C code such as global declarations, utility functions, etc.


start
Attaches the code as an action to the start tag of the element named by the required ``tag'' attribute. The ``%C-code;'' component should be C code suitable for inclusion in a C block (i.e., within {...}, so it may contain local variables); furthermore the following extensions are available:

{attribute}
Can be used to access the value of the attribute as set with attribute=value in the start tag. In C, {attribute} will be interpreted depending on the declaration of the attribute. If the attribute is declared as an enumerated type like

  <!ATTLIST element attrib (alt1 | alt2 | ...) ...>

then the C attribute value is of an enumerated type with the elements written {attrib=alt1}, {attrib=alt2}, etc.; furthermore an unset attribute has the ``value'' {!attrib}. If the attribute is not an enumeration then {attrib} is a null-terminated C string (of type char*) and {!attrib} is NULL.
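
To make the enumerated case concrete, here is a minimal sketch in plain C of what such a representation could look like. It is an illustration only and not FleXML's actual output: the attribute `align' with alternatives left, center and right, the enum constants, and both function names are hypothetical. In an action, {align=left} would play the role of ALIGN_LEFT and {!align} the role of ALIGN_UNSET.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical declaration:
   <!ATTLIST cell align (left | center | right) #IMPLIED> */
enum align { ALIGN_UNSET = 0, ALIGN_LEFT, ALIGN_CENTER, ALIGN_RIGHT };

/* Convert the attribute text once, when the start tag is read. */
static enum align align_from_text(const char *text)
{
    if (text == NULL)
        return ALIGN_UNSET;
    if (strcmp(text, "left") == 0)
        return ALIGN_LEFT;
    if (strcmp(text, "center") == 0)
        return ALIGN_CENTER;
    if (strcmp(text, "right") == 0)
        return ALIGN_RIGHT;
    /* A validating processor would reject the document here;
       this sketch simply reports the attribute as unset. */
    return ALIGN_UNSET;
}

/* Application code then works on the enum and needs no string compares:
   compute the left padding for a cell with free_columns spare columns. */
static int left_padding(enum align a, int free_columns)
{
    assert(free_columns >= 0);
    switch (a) {
    case ALIGN_CENTER: return free_columns / 2;
    case ALIGN_RIGHT:  return free_columns;
    case ALIGN_LEFT:
    case ALIGN_UNSET:  return 0;
    }
    return 0;
}
```

The design point is that the string comparison happens at most once per attribute, at parse time, and everything afterwards is an integer switch.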


end
Similarly attaches the code as an action to the end tag of the element named by the required ``tag'' attribute; here, too, the ``%C-code;'' component should be C code suitable for inclusion in a C block. If the element has ``Mixed'' contents, i.e., was declared to permit #PCDATA, then the special variable {#PCDATA} contains the text (#PCDATA) of the element as a null-terminated C string (of type char*). If the Mixed contents element actually mixed text and child elements, then {#PCDATA} contains the plain concatenation of the text fragments as one string.
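
The ``plain concatenation'' of text fragments can be sketched as follows. This is an assumption-laden illustration, not FleXML's implementation: the buffer type, its fixed size, and the function names are hypothetical, and the sketch assumes that text inside a child element belongs to the child rather than to the parent. For an element written as <p>Hello <b>big</b> world</p>, the fragments seen for p would then be "Hello " and " world".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PCDATA_MAX 256

/* Accumulates the text fragments of one element as a C string. */
struct pcdata {
    char text[PCDATA_MAX];
    size_t len;
};

/* Start collecting text for a newly opened element. */
static void pcdata_reset(struct pcdata *p)
{
    assert(p != NULL);
    p->len = 0;
    p->text[0] = '\0';
}

/* Append one text fragment; returns 1 on success, or 0 (leaving the
   buffer unchanged) if the fragment would not fit. */
static int pcdata_append(struct pcdata *p, const char *fragment)
{
    size_t n;

    assert(p != NULL && fragment != NULL);
    n = strlen(fragment);
    if (n >= PCDATA_MAX - p->len)
        return 0;
    memcpy(p->text + p->len, fragment, n + 1); /* copies the NUL too */
    p->len += n;
    return 1;
}
```

After both fragments have been appended, the buffer holds one null-terminated string with the child element's markup simply left out, which is the shape of value an end action would read through {#PCDATA}.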


main
Finally, the optional ``main'' element can contain the C main function of the XML application. Normally the main function should include (at least) one call of the XML processor, yylex.

The program is freely redistributable and modifiable (under GNU `copyleft').

Copyright (C) Kristoffer Rose. Last modified: Tue Feb 11 18:06:44 EST 2003

$Id: FleXML.html,v 1.5 2005/04/06 10:05:15 mquinson Exp $