File: paper.html

package info (click to toggle)
flexml 1.9.6-1
  • links: PTS, VCS
  • area: main
  • in suites: jessie, jessie-kfreebsd
  • size: 640 kB
  • ctags: 225
  • sloc: perl: 1,304; makefile: 261; xml: 188; ansic: 117
file content (455 lines) | stat: -rw-r--r-- 18,118 bytes parent folder | download | duplicates (6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
        "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
  <head>
    <title>Generating Fast Validating XML Processors</title>
    <!--$Id: paper.html,v 1.13 2006/07/18 18:21:13 mquinson Exp $-->
    <style type="text/css">
<!--
 BODY { BACKGROUND-COLOR: #ffffff; FONT-FAMILY: arial, times new roman, sans-serif; }
 A:link, A:visited, A:active { TEXT-DECORATION: none; FONT-WEIGHT: bold; COLOR: #0000FF}
  
 H1, H2 { TEXT-ALIGN: center; FONT-WEIGHT: bold; }
 H3, H4, H5 { TEXT-ALIGN: left; FONT-WEIGHT: bold; }
 H6 { TEXT-ALIGN: center; FONT-WEIGHT: bold; FONT-SIZE: small; }
 H6.CAPTION { TEXT-ALIGN: center; FONT-WEIGHT: bold; FONT-SIZE: small; FONT-STYLE: italic; }
  
 P { TEXT-INDENT: 1em; }
 P.CODE { TEXT-INDENT: 0; COLOR: #FF0000; }
  
 UL, OL, DL { FONT-SIZE: small; }
 UL { list-style: square; }
 LI { FONT-SIZE: small; FONT-WEIGHT: bold; }
 UL EM, OL EM { FONT-WEIGHT: bold; }
 CODE, CITE { FONT-WEIGHT: bold; }
  
 BLOCKQUOTE { MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em }
 IMG { VERTICAL-ALIGN: top; ALIGN: center; }
 SUP { COLOR: #0000FF; FONT-SIZE: small; }
-->
    </style>
  </head>
  <body>

    <h2>
      Generating Fast Validating XML Processors
    </h2>

    <h6>
      (Extended Abstract)
    </h6>

    <h6>
      Kristoffer Rose<br/>
      LIP, ENS-Lyon<br/>
      <a href="mailto:krisrose@debian.org">krisrose@debian.org</a>
    </h6>

    <h6> Abstract </h6>

    <p>We present <em>FleXML</em>, a program that generates fast
    validating XML processors from `self-contained' XML DTDs.  It uses
    the <em>flex</em> (lexical analyser generator) program to
    translate the DTD into a <em>finite automaton</em> enriched with a
    stack with the `element context'.  This means that the XML
    processor will act directly on each character received.  The
    program is freely redistributable and modifyable (under GNU
    `copyleft').</p>

    <h6> Keywords </h6>

    <p>Validating XML, DTD, lexical analysis, finite automata.</p>


    <h4> Overview </h4>

    <p>The `X' of XML stands for <em>Extensible</em> [<cite><a
    href="#XML">XML</a></cite>].  This signifies that each and every
    XML document specifies in its header the details of the format
    that it will use and <em>may</em> change its format a bit relative
    to the used base format.</p>
    <p>
      This has influenced the tools available to write validating XML
      processors: they use a <em>call-back</em> model where the XML
      processor passes strings with the tags and attributes names and
      values to the application.  These call-backs must be generic
      because one cannot know whether a document will start by
      extending its own notation with more tags and attributes.  For
      <em>well-formed</em> but non-validated XML documents this makes
      a lot of sense, of course, but we would in general like to
      exploit the information in the DTD for optimizations.</p>
    <p>
      In particular, for many applications a <em>fixed</em> format
      suffices in the sense that a single DTD is used without
      individual extensions for a large number of documents.  In that
      case we can do much better because the possible tags and
      attributes are static.</p>
    <p>
      We have implemented an XML processor <em>generator</em> using
      the <cite><a href="#Flex">Flex</a></cite> scanner generator that
      produces deterministic finite automata [<cite><a
      href="#ASU">ASU</a></cite>].  Which means that there is almost
      no overhead for XML processing: the generated XML processors
      read the XML document character by character and can immediately
      dispatch the actions associated with each element (or reject the
      document as invalid).</p>
    <p>
      Furthermore we have devised a simple extension of the C
      programming language that facilitates the writing of `element
      actions' in C, making it easy to write really fast XML
      applications.  In particular we represent XML attribute values
      efficiently in C when this is possible, thus avoiding the
      otherwise ubiquitous conversions between strings and data
      values.</p>
    <p>
      FleXML is available for free
      (from <a href="http://flexml.sourceforge.net">SourceForge</a>).
      In this paper we present FleXML through an
      elaborated <a href="#what">example</a> and discuss some of the
      <a href="#how">technical issues</a>.</p>


    <h4><a name="what">What can it do?</a></h4>

      <p>Assume that we have an XML document <code>my-joke.xml</code>
      containing the classical joke

<blockquote><code><pre>&lt;!DOCTYPE joke SYSTEM &quot;my.dtd&quot;&gt;
&lt;joke&gt;&lt;line&gt;My appartment is so small&lt;/line&gt; &lt;suspense/&gt;
&lt;line type='punch-line'&gt;the mice are round-shouldered&lt;/line&gt;&lt;/joke&gt;
</pre></code></blockquote>

      (and many others like it, of course).  We wish to create an XML
      processor to validate it with respect to the DTD in the file
      <code>my.dtd</code> containing

<blockquote><code><pre>&lt;!-- my.dtd: Small DTD for jokes (just for fun). --&gt;
&lt;!ELEMENT joke (line,(line|suspense)*)&gt;
&lt;!ELEMENT line (#PCDATA)&gt;
&lt;!ATTLIST line type (normal|question|punch-line) 'normal'&gt;
&lt;!ELEMENT suspense EMPTY&gt;
</pre></code></blockquote>

      and, furthermore, we wish to write an XML application for
      displaying such messages in an amusing way.</p>
    <p>
      With FleXML this can be done by creating an `actions file'
      <code>my-show.act</code> which implements the desired actions
      for each element.  The remainder of this section explains the
      contents of such an actions file.</p>
    <p>
      An actions file is itself an XML document which must begin with

<blockquote><code><pre>&lt;!DOCTYPE actions SYSTEM &quot;flexml-act.dtd&quot;&gt;
&lt;actions&gt;
</pre></code></blockquote>

      (the <code>flexml-act.dtd</code> DTD is part of the FleXML
      system and is reproduced in the manual page.</p>
    <p>
      We decide that our application should react to a
      <code>line</code> element by printing the text inside it, and
      that it should differentiate between the three possible `type'
      attribute values by printing corresponding trailing punctuation.</p>
    <p>
      This introduces a slight complication, because the attribute
      values are available when parsing the start tag whereas the
      element text is not available until we parse the end tag (where
      it has been read).</p>
    <p>
      This means that we must declare a top-level variable.

<blockquote><code><pre>&lt;top&gt;&lt;![CDATA[
char* terminator = &quot;.&quot;;
]]&gt;&lt;/top&gt;
</pre></code></blockquote>

      Notice how we use <code>CDATA</code> sections to make sure that
      all characters (including white-space) are passed unmodified to
      the C compiler.</p>
    <p>
      With this we can write the action to set it when reading the
      <code>line</code> start tag as

<blockquote><code><pre>&lt;start tag='line'&gt;&lt;![CDATA[
  switch ( {type} ) {
    case {!type}: terminator = &quot;...&quot;; break;
    case {type=normal}: terminator = &quot;.&quot;; break;
    case {type=question}: terminator = &quot;??&quot;; break;
    case {type=punch-line}: terminator = &quot;!!&quot;; break;
  }
]]&gt;&lt;/start&gt;
</pre></code></blockquote></p>

    <p>
      The idea is that the enumeration attribute <code>type</code> is
      implemented in C as if it had been declared by

<blockquote><code><pre>enum { {!type}, {type=normal}, {type=question}, {type=punch-line} } {type};
</pre></code></blockquote>

      (understanding the <code>{...}</code> units as C identifiers),
      hence the possibility of using the fast C <code>switch</code>
      statement to pick the right choice directly.  Note that the
      first choice, <code>{!<em>type</em>}</code>, corresponds to not
      setting the attribute; in this example the attribute has a
      default value so this can never happen, however, we include the
      choice anyway to prevent the C compiler from issuing warnings
      about missing choices in <code>switch</code> statements.</p>
    <p>
      With this in place we can write the action for
      <code>&lt;/line&gt;</code>.  Since it prints something, however,
      we first need to add the appropriate header inclusion.

<blockquote><code><pre>&lt;top&gt;&lt;![CDATA[
#include &lt;stdio.h&gt;
]]&gt;&lt;/top&gt;

&lt;end tag='line'&gt;&lt;![CDATA[
  printf(&quot;%s%s\n&quot;, pcdata, terminator);
]]&gt;&lt;/end&gt;
</pre></code></blockquote></p>

    <p>
      Finally, we will make the application amusing by `displaying'
      the <code>&lt;suspense/&gt;</code> empty tag as a short delay;
      this also involves a header inclusion.

<blockquote><code><pre>&lt;top&gt;&lt;![CDATA[
#include &lt;unistd.h&gt;
]]&gt;&lt;/top&gt;

&lt;start tag='suspense'&gt;&lt;![CDATA[
  sleep(2);
]]&gt;&lt;/start&gt;
</pre></code></blockquote></p>

    <p>
      That is all; the only remaining thing is to terminate the action
      file properly.

<blockquote><code><pre>&lt;/actions&gt;
</pre></code></blockquote>

    <p>
      We can now run FleXML with the DTD and the actions file as input
      and will get an XML application as output that, when run (after
      processing by flex and a C compiler), will indeed print

<blockquote><code><pre>My appartment is so small.
the mice are round-shouldered!!
</pre></code></blockquote>

      as expected, pausing duly for two seconds between the lines.  On
      the authors system the above output was achieved with the
      command sequence

<blockquote><code><pre>flexml -A -a my-show.act my.dtd 
flex -omy-show.c my-show.l
cc -omy-show my-show.c
./my-show <./my-joke.xml
</pre></code></blockquote>

    (see the manual page for the exact meaning of the FleXML options).

    <p>
      An important aspect of the design of FleXML is that the only
      thing that should matter to the programmer should be the
      complexity of the <em>application</em>, not of the used DTD.  As
      an example the following action file prints the
      <code>href</code> attribute of all hyperlinks in an XHTML
      document:

<blockquote><code><pre>&lt;!DOCTYPE actions SYSTEM &quot;flexml-act.dtd&quot;&gt;
&lt;actions&gt;

&lt;top&gt;&lt;![CDATA[
#include &lt;stdio.h&gt;
]]&gt;&lt;/top&gt;

&lt;start tag='a'&gt;&lt;![CDATA[
if ({href}) printf(&quot;%s\n&quot;, {href});
]]&gt;&lt;/start&gt;

&lt;/actions&gt;
</pre></code></blockquote>

      which was compiled into a running application on the author's
      system with the commands

<blockquote><code><pre>flexml $(FLEXDEBUG) -rhtml -p'-//IETF//DTD XHTML 1.0 Transitional//EN' \
  'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
gcc -Wall -ansi -pedantic -O2 -g -c xhtml-href.c -o xhtml-href.o
flex -Bsv -Ca -oxhtml1-transitional.c xhtml1-transitional.l
gcc -Wall -ansi -pedantic -O2 -g   -c xhtml1-transitional.c -o xhtml1-transitional.o
gcc -Wall -ansi -pedantic   xhtml-href.o xhtml1-transitional.o   -o xhtml-href
</pre></code></blockquote>

      generating the XML processor directly from the official DTD on
      the web (which in fact required a patch to flex to enlarge the
      possible table size).


    <h4><a name="how">How does it work?</a></h4>

    <p>
      FleXML is a perl script [<cite><a href="#Perl">Perl</a></cite>]
      which reads and interprets a DTD and subsequently produces an
      <em>XML processor</em> source file for the <em>flex</em> scanner
      and optionally an <em>XML application</em> with the C source of
      the element actions from an actions file.  The DTD is used to
      construct the rules used by flex to match the individual XML
      components in such a way that only valid documents match.
    <p>
      For example, the flex code for scanning an attribute declaration
      of the <code>line</code> tag is the following:

<blockquote><code><pre>&lt;AL_line&gt;{
 "type"{Eq}"'normal'"  |
 "type"{Eq}"\"normal\""  A_line_type = A_line_type_normal;
 "type"{Eq}"'question'"  |
 "type"{Eq}"\"question\""  A_line_type = A_line_type_question;
 "type"{Eq}"'punch-line'"  |
 "type"{Eq}"\"punch-line\""  A_line_type = A_line_type_punch_d_line;

 "&gt;" {
  LEAVE; STag_line(); pcdata = BUFFERSET(pcdata); ENTER(IN_line);
 }
 "/&gt;" {
  LEAVE; STag_line(); pcdata = ""; ETag_line();
  switch (YY_START) {
   case ROOT_line: SET(EPILOG); break;
   case S_joke: SET(S_joke_1); break;
   case S_joke_1: case S_joke_2: case S_joke_3: SET(S_joke_3); break;
}}}
</pre></code></blockquote>

    (with <code>{Eq}</code> an alias for the regular expression
    matching an equal sign (corresponding to production `[25] Eq' of
    the XML specification).
    <p>
      This reads as follows: when the XML processor is reading the
      attribute list of the <code>line</code> tag, i.e., when it is in
      the <code>&lt;AL_line&gt;</code> state, a `<code>t</code>' will
      enter an internal state where a `<code>y</code>' proceeds to
      another internal state but other characters makes the document
      invalid (because no rule permits it).  Once the equal sign has
      been scanned, the next characters determine the attribute value,
      and at the end one of the flex actions is performed, setting the
      attribute value (<code>A_line_type</code> is the real C for what
      we wrote as <code>{type}</code>, etc.).  The important thing is
      that one can ensure by careful tuning of the flex rules that a
      valid document will proceed only by looking each character up in
      a table and determining the subsequent `state' and `action'.
      One must avoid pairs of rules such as

<blockquote><code><pre>"-->"		LEAVE(COMMENT);
.		SKIP;
</pre></code></blockquote>

    (a single `<code>.</code>' matches any character) because they
    mean that the scanner will not be sure after having read a
    `<code>-</code>' character whether it is part of a comment
    terminator or `just' a dash.  In such cases an extra rule must be
    inserted because for the set

<blockquote><code><pre>"-->"		LEAVE(COMMENT);
"--"            |
.		SKIP;
</pre></code></blockquote>

    the problem goes away.
    <p>
      After the actual attribute rules, two rules handle termination
      of the attribute list.  There are two cases corresponding to
      whether we just read a start tag or an empty element.  In case
      it was a start tag then we must enter the `inner' mode of the
      element called <code>IN_line</code> for the <code>line</code>
      element.  The <code>switch</code> handles the state changes
      needed for the line element resulting from the fact that the
      element can appear in different contexts.  This is always
      possible to construct because of the requirement that an XML DTD
      must be <em>deterministic</em>: we just need an element content
      stack (this is what the <code>LEAVE</code> and
      <code>ENTER</code> macros are for).

    <h4><a name="why">Why is it useful?</a></h4>

    <p>
      In comparison with the forthcoming XML Style-sheet Language
      [<cite><a href="#XSL">XSL</a></cite>] our approach is much more
      primitive for better and worse: only a limited range of
      applications can be produced with FleXML but those will be very
      fast.
    <p>
      This is useful for XML applications that are meant to process a
      large number of documents in a fixed format.  One such
      application is the NTSys u-payment transaction server which is
      implemented as an Apache module where performance is of premium
      importance.  Using FleXML permits a highly modular development
      of modules for the various transaction types based on a common
      DTD and a collection of applications that are generated
      separately and all linked together with the common processor.
    <p>
      FleXML is still under development: please try it out (either
      from <a "http://flexml.sourceforge.net">SourceForge</a>
      or from the <a href="www.debian.org">Debian
      GNU/Linux</a> distribution where FleXML is include from release
      2.2.  The author would welcome comments as to how the system can
      best evolve.  Two things that are definitely on the agenda is a
      limited `context condition' language for expressing constraints
      on the position of an element in a document (corresponding to
      the Xpath subset of XSL), and an efficient way to combine
      several DTDs into one large t facilitate general XML servers
      (that can even dispatch to a generic XML interface in cases
      where the FleXML restrictions are not respected by a document).


    <h4> Acknowledgements </h4>

    <p>
      I am grateful to <a href="http://www.ntsys.fr">NTSys</a> for
      supporting the development of FleXML. Finally extend my sincere
      thanks to Jef Poskanzer, Vern Paxson, and the rest of the
      <em>flex</em> maintainers for a great tool.

    <h4> References </h4>

    <dl compact>

      <dt><a name="ASU"><cite>ASU</cite></a>
      <dd>
	Alfred Aho, Ravi Sethi and Jeffrey Ullman: <em>Compilers:
	Principles, Techniques and Tools</em>, Addison-Wesley
	(1986).<p>

      <dt><a name="Flex"><cite>Flex</cite></a>
      <dd>
	Jef Poskanzer, Vern Paxson, <em>et. al.</em>: <em>Flex - fast
	lexical analyzer generator</em>.<p>

      <dt><a name="Perl"><cite>Perl</cite></a>
      <dd>
	Larry Wall, <em>Perl - Practical Extraction and Report
	  Language</em>.<p>

      <dt><a name="XML"><cite>XML</cite></a>
      <dd>
	Extensible Markup Language (XML) 1.0 (W3C
	Recommendation REC-xml-1998-0210).<p>

      <dt><a name="XSL"><cite>XSL</cite></a>
      <dd>
	Extensible Stylesheet Language (XSL) (W3C Working Draft).<p>

    </dl>
    <hr>
    <address>Copyright (c)
      <a href="mailto:krisrose@debian.org">Kristoffer Rose</a>.
<!-- Created: Thu Dec  9 08:09:41 CET 1999 -->
<!-- hhmts start -->Last modified: Tue Feb 11 13:56:40 EST 2003 <!-- hhmts end -->
    </address>

  </body>
</html>