File: faq-grammars.xml

package info (click to toggle)
libxerces2-java 2.12.1-1
  • links: PTS, VCS
  • area: main
  • in suites: bullseye, sid
  • size: 12,596 kB
  • sloc: java: 127,777; xml: 13,728; sh: 39; makefile: 10
file content (399 lines) | stat: -rw-r--r-- 19,592 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
<?xml version='1.0' encoding='UTF-8'?>
<!--
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 * 
 *      http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
-->
<!DOCTYPE faqs SYSTEM 'dtd/faqs.dtd'>
<faqs title='Caching and Preparsing Grammars'>
 <faq title='Caching Grammars'>
  <q>I have a set of (DTD or XML Schema) grammars that I
    use a lot.  How can I make Xerces
    reuse the representations it builds for these grammars,
    instead of parsing them anew with every new document?
  </q>
  <a>
    <p>
        Before answering this question, it will greatly help to
        understand how Xerces handles grammars internally.  To do
        this, here are some terms:
    </p>
    <anchor name="grammar-terms"/>
    <ul>
        <li><code>Grammar</code>:  defined in the
        <code>org.apache.xerces.xni.grammars.Grammar</code>
        interface; simply differentiates objects that are Xerces
        grammars from other objects, as well as providing a means
        to get at the location information (<code>XMLGrammarDescription</code>) for the grammar represented..</li>
        <li><code>XMLGrammarDescription</code>:  defined by the
        <code>org.apache.xerces.xni.grammars.XMLGrammarDescription</code>
        interface, holds some basic location information common to all grammars.
        This can be used to distinguish one
        <code>Grammar</code> object from another, and also
        contains information about the type of the grammar.</li>
        <li>Validator:  A generic term used in Xerces to denote
        an object which compares the structure of an XML document
        with the expectations of a certain type of grammar.
        Currently, we have DTD and XML Schema validators.</li>
        <li><code>XMLGrammarPool</code>:  Defined by the
        <code>org.apache.xerces.xni.grammars.XMLGrammarPool</code>
        interface, this object is owned by the application and it
        is the means by which the application and Xerces pass
        complex grammars to one another.</li>
        <li>Grammar bucket:  An internal data structure owned by
        a Xerces validator in which grammars--and information
        related to grammars--to be used in a given validation
        episode is stored.</li>
        <li><code>XMLGrammarLoader</code>:  defined in the
        <code>org.apache.xerces.xni.grammars.XMLGrammarLoader</code>
        interface, this defines an object that "knows how" to
        read the XML representation of a particular kind of
        grammar and construct a Xerces-internal representation (a
        <code>Grammar</code> object) out of it.  These objects
        may interact with validators during parsing of instance
        documents, or with external code during grammar
        preparsing.</li>
    </ul>
    <p>Now that the terminology is out of the way, it's possible
        to relate all these objects together.  At the commencement of
        a validation episode, a validator will call the
        <code>retrieveInitialGrammarSet(String grammarType)</code> method of the
        <code>XMLGrammarPool</code> instance to which it has access.  It
        will use the <code>Grammar</code> objects it procures in this
        way to seed its grammar bucket.
    </p>
    <p>
        When the validator determines that it needs a grammar, it
        will consult its grammar bucket.  If it finds a matching
        grammar, it will attempt to use it.  Otherwise, if it has
        access to an <code>XMLGrammarPool</code> instance, it
        will request a grammar from that object with the
        <code>retrieveGrammar(XMLGrammarDescription desc)</code>
        method.  Only if both of these steps fail will it fall
        back to attempting to resolve the grammar entity and
        calling the appropriate <code>XMLGrammarLoader</code> 
        to actually create a new Grammar object.
    </p>
    <p>
        At the end of the validation episode, the validator will
        call the <code>cacheGrammars(String grammarType,
        Grammar[] grammars)</code> method of the
        <code>XMLGrammarPool</code> (if any) to which it has
        access.  There is no guarantee grammars that the grammar
        pool itself supplied to the validator will not be
        included in this set, so a grammar pool implementation
        cannot rely only on new grammars to be passed back in
        this situation.
    </p>
    <p>
        At long last, it's now possible to answer the original
        question--how can one cache grammars?  Assuming one has a
        reasonable <code>XMLGrammarPool</code>
        implementation--such as that provided with Xerces--there are two
        answers:
    </p>
    <ol>
        <li><anchor name="passive"/>The "passive" approach:  Don't do any preparsing,
        just register the grammar pool implementation with the
        parser, and as new grammars are requested by instance
        documents, simply let the validators add them to the
        pool.  This is very unobtrusive to the application, but
        doesn't provide that much control over what grammars are
        added; even if a custom EntityResolver is registered,
        it's still possible that unwanted grammars will make it
        into the pool.</li>
        <li>The "active" approach:  Preload a grammar pool
        implementation with all the grammars you'll need, then
        lock it so that no new grammars will be added.  Then
        registering this on the configuration will allow
        validators to make use of this set; registering a
        do-nothing EntityResolver will allow the application to
        deny validators from using any but the "approved" grammar
        set.  This will oblige the application to use more Xerces
        code, but provides a far more fine-grained approach to
        controlling what grammars may be used.</li>
    </ol>
    <p>
        We discuss both these approaches in a bit more detail
        below, complete with some (broad) examples.  
        As a starting point, though, the
        <code>XMLGrammarBuilder</code> sample, from the
        <code>xni</code> package, should provide a starting-point
        for implementing either the active or passive approach.
    </p>
  </a>
 </faq>
 <faq title="Xerces Default Grammar Caching Implementation">
  <q>Exactly how does Xerces default implementation of things
  like the grammar pool work?</q>
  <a>
    <p>
        Before proceeding further, let there be no doubt that, by default, Xerces
        does not cache grammars at all.  In order to trigger Xerces grammar caching, an <code>XMLGrammarPool</code>
        must be set, using the <code>setProperty</code> method,
        on a Xerces configuration that supports grammar pools.  On the other hand,
        you could simply use the <code>XMLGrammarCachingConfiguration</code> as
        discussed briefly <jump href="#caching-w-standards">below</jump>.
    </p>
    <p>
        When enabled, by default, Xerces's grammar pool implementation stores
        any grammar offered to it (provided it does not already
        have a reference matching that grammar).  It also makes
        available all grammars it has, of a particular type, on
        calls to <code>retrieveInitialGrammarSet</code>.  It will
        also try and retrieve a matching grammar on calls to
        <code>retrieveGrammar</code>.
    </p>
    <p>
        Xerces uses hashing to distinguish different grammar
        objects, by hashing on the
        <code>XMLGrammarDescription</code> objects that those
        grammars contain.  Thus, both of Xerces implementations
        of XMLGrammarDescription--for DTD's and XML
        Schemas--provide implementations of <code>hashCode():
        int</code> and <code>equals(Object):boolean</code> that
        are used by the hashing algorithm.
    </p>
    <p>
        In XML Schemas, hashing is simply carried out on the
        target namespace of the schema.  Thus, two grammars are
        considered equal (by our default implementation) if and
        only if their XMLGrammarDescriptions are instances of
        <code>org.apache.xerces.impl.xs.XSDDescription</code> (our schema implementation of
        XMLGrammarDescription) and the targetNamespace fields of
        those objects are identical.
    </p>
    <p>
        The case in DTD's is much more difficult.  Here is the
        algorithm, which describes the conditions under which two
        DTD grammars will be considered equal:
    </p>
    <ul>
        <li>Both grammars must have XMLGrammarDescriptions that
        are instances of
        <code>org.apache.xerces.impl.dtd.XMLDTDDescription</code>.</li>
        <li>If their publicId or expandedSystemId fields are
        non-null they must be identical;</li>
        <li>If one of the descriptions has a root element
        defined, it must be the same as the root element defined
        in the other description, or be in the list of global
        elements stored in that description;</li>
        <li>If neither has a root element defined, then they must
        share at least one global element declaration in
        common.</li>
    </ul>
    <p>
        The DTD grammar caching also assumes that the entirety of
        the cached grammar will lie in an external subset.  i.e.,
        in the example below, Xerces will happily cache--or use a
        cached version of--the DTD in "my.dtd".  If the document
        contained an internal subset, the declarations would be
        ignored.  
    </p>
    <source>&lt;!DOCTYPE myDoc SYSTEM "my.dtd"&gt;
&lt;myDoc ...&gt;...&lt;/myDoc&gt;</source>
    <p>
        Using these heuristics, Xerces's default grammar caching
        implementation appears to do a reasonable job at matching
        grammars up with appropriate instance documents.  This
        functionality is very new, so in addition to bug reports
        we'd very much appreciate, especially on the DTD front,
        feedback on whether this form of caching is indeed useful or
        whether--for instance--it would be better if internal
        declarations were somehow incorporated into the grammar
        that's been cached.
    </p>
  </a>
 </faq>
 <faq title="Preparsing Grammars">
  <q>I like the idea of "active" caching (or I want the grammar
  object for some purpose); how do I go about parsing a grammar
  independent of an instance document?</q>
  <a>
    <p>
        First, if you haven't read <jump href="#grammar-terms">the first FAQ on this page</jump> and
        have trouble with terminology, hopefully answers
        lie there.  
      </p>
      <p>
        Preparsing of grammars in Xerces is accomplished with
        implementations of the <code>XMLGrammarLoader</code>
        interface.  Each implementation needs to know how to
        parse a particular type of grammar and how to build a
        data structure representing that grammar that Xerces can
        efficiently make use of in validation.  Since most
        application programs won't want to deal with Xerces
        implementations per se, we have provided a handy utility
        class to handle grammar preparsing generally:
        <code>org.apache.xerces.parsers.XMLGrammarPreparser</code>.
        This FAQ describes the use of this class.
        For a live example, check out the
        <code>XMLGrammarBuilder</code> sample in the
        <code>samples/xni</code> directory of the binary
        distribution.
    </p>
    <p>
        <code>XMLGrammarPreparser</code> has methods for
        installing XNI error handlers, entity resolvers, setting
        the Locale, and generally doing similar things as an XNI
        configuration.  Any object passed to XMLGrammarPreparser
        by any of these methods will be passed on to all
        <code>XMLGrammarLoader</code>s registered with
        XMLGrammarPreparser.
    </p>
    <p>
        Before <code>XMLGrammarPreparser</code> can be used, its
        <code>registerPreparser(String, XMLGrammarLoader):
        boolean</code> method must be called.  This allows a
        String identifying an arbitrary grammar type to be
        associated with a loader for that type.  To make peoples'
        lives easier, if you want DTD grammars or XML Schema
        grammar support, you can pass <code>null</code> for the
        second parameter and <code>XMLGrammarPreparser</code>
        will try and instantiate the appropriate default grammar
        loader.  For DTD's, for instance, just call
        <code>registerPreparser</code> like:
    </p>
    <source>grammarPreparser("http://www.w3.org/TR/REC-xml", null)</source>
    <p>
        Schema grammars correspond to the URI
        "http://www.w3.org/2001/XMLSchema"; both these constants
        can be found in the
        <code>org.apache.xerces.xni.grammars.XMLGrammarDescription</code>
        interface.  The method returns true if an
        XMLGrammarLoader was successfully associated with the
        given grammar String, false otherwise.
    </p>
    <p>
        XMLGrammarPreparser also contains methods for setting
        features and properties on particular loaders--keyed on
        with the same string that was used to register the
        loader.  It also allows features and properties the
        application believes to be general to all loaders to be
        set; it transmits such features and properties to each
        loader that is registered.  These methods also silently consume any
        notRecognized/notSupported exceptions that the loaders throw.  Particularly useful here is
        registering an <code>XMLGrammarPool</code>
        implementation, such as that found in
        <code>org.apache.xerces.util.XMLGrammarPoolImpl</code>.
    </p>
    <p>
        To actually parse a grammar, one simply calls the
        <code>preparseGrammar(String grammarType, XMLInputSource
        source):  Grammar</code> method.  As above, the String
        represents the type of the grammar to be parsed, and the
        XMLInputSource is the location of the grammar to be
        parsed; this will not be subjected to entity expansion.
    </p>
    <p>
        It's worth noting that Xerces default grammar loaders
        will attempt to cache the resulting grammar(s) if a
        grammar pool implementation is registered with them.
        This is particularly useful in the case of schema
        grammars:  If a schema grammar imports another grammar,
        the Grammar object returned will be the schema doing the
        importing, not the one being imported.  For caching,
        this means that if this grammar is cached by itself, the grammars
        that it imports won't be available to the grammar pool
        implementation.  Since our Schema Loader knows about this
        idiosyncrasy, if a grammar pool is registered with it,
        it will cache all schema grammars it encounters,
        including the one which it was specifically called to
        parse.  In general, it is probably advisable to register
        grammar pool implementations with grammar loaders for
        this reason; generally, one would want to cache--and make
        available to the grammar pool implementation--imported
        grammars as well as specific schema grammars, since the
        specific schemas cannot be used without those that they
        import.
    </p>
  </a>
 </faq>
 <faq title="Grammar caching with Standard APIs">
  <q>All right, I've (somehow) got a grammar pool full of
  grammars.  How do I use this with my application that uses
  standard (SAX|DOM|JAXP) parsers?</q>
  <a><anchor name="caching-w-standards"/>
    <p>
        For SAX and DOM the case is simple.  Just do:
    </p>
    <source>XMLParserConfiguration config = new &DefaultConfig;();
config.setProperty("http://apache.org/xml/properties/internal/grammar-pool",
    myFullGrammarPool);
(SAX|DOM)Parser parser = new (SAX|DOM)Parser(config);</source>
    <p>
        Now your grammar pool instance will be used by all
        validators created by this parser to validate your
        instance documents.
    </p>
    <p>
        If you have an application that uses pure JAXP, your task
        is a bit trickier.  You'll need to do something like
        this:
    </p>
    <source>System.setProperty("org.apache.xerces.xni.parser.XMLParserConfiguration",
    "org.apache.xerces.parsers.XMLGrammarCachingConfiguration");
DocumentBuilder builder = // JAXP factory invocation
// parse documents and store grammars</source>
    <p>
        Note that this only supports the "passive" caching
        approach discussed in <jump href="#passive">above</jump>.  The
        <code>org.apache.xerces.parsers.XMLGrammarCachingConfiguration</code>
        represents experimental code; feedback on whether it is
        useful would be greatly appreciated.
    </p>
  </a>
 </faq>
 <faq title="Examining Grammars">
  <q>But I don't want to "preparse" grammars for efficiency; I
  want to parse them in order to look at their contents using
  some API!  Can I do this?</q>
  <a>
    <p>
        Yes, for grammar types for which such an API is defined.
        No such API exists at the current moment for DTD's.  For
        XML Schemas, Xerces implements the 
        <jump href="http://www.w3.org/Submission/2004/SUBM-xmlschema-api-20040309/">XML Schema API</jump>.
        For details, it's best to look at the 
        <link idref="api" anchor="xml-schema-api-documentation">API</link> docs for
        the <code>org.apache.xerces.xs</code>
        package.  Assuming you have produced a Grammar object from an XML Schema
        document by some means, to turn that object
        into an object usable in this API, do the following:
    </p>
    <ol>
        <li>
            Cast the Grammar object to <code>org.apache.xerces.xni.grammars.XSGrammar</code>;
        </li>
        <li>
            Call the <code>toXSModel()</code> method on the casted object;
        </li>
        <li>
            Use the methods in the <code>org.apache.xerces.xs.XSModel</code>
            interface to examine the new object; methods on this
            interface and others in the same package should allow you to access
            all aspects of the schema.
        </li>
    </ol>
  </a>
 </faq>
 <faq title="Alternative method for getting an XSModel">
  <q>Is there an alternative method for getting an XSModel?</q>
  <a>
   <p>
    Yes, for more information see the <jump href="faq-xs.html">XML Schema FAQ</jump>.
   </p>
  </a>
 </faq>
</faqs>