.. _The_Schema_module:

*****************
The Schema module
*****************

XML Grammars
============

There are several steps that applications must go through when they have to use
XML files:

* Make sure the XML file is well-formed.

  This is a basic step where we ensure that XML tags are correctly nested, that
  closing tags have the same names as the matching opening tags, that attribute
  values are quoted... This corresponds to a syntactic parser in a compiler.

  This step does not depend on the application domain. One file that is
  well-formed will always be so, no matter in what context you use it.

* Make sure the contents of the XML file are semantically valid.

  Depending on the application domain, we must ensure that the content of the
  file makes sense. This step is highly application dependent, and a file that
  is usable in one application might not be usable in another one.

  This is the phase in which the application needs to check whether a given XML
  file has all its required attributes, whether the children of an XML tag are
  the expected ones, whether the type of the attributes is valid...

* Use the XML file in the application.

  This is done through the already-described SAX or DOM parsers.

The first phase is mandatory, and necessarily enforced by XML/Ada. You will not
be able to access the contents of the XML file if it isn't well-formed.

The second phase is provided by the Schema module in XML/Ada. Although such
constraints can be checked at the application level with ad hoc code, it is
generally easier to maintain a separate file that describes the valid semantic
contents of the file than to maintain specific code whenever the semantics
change. It is also easy to forget special cases when validating through a set
of `if` statements in the Ada code.

XML provides two ways to describe additional constraints that a file must
satisfy in order to be considered as valid.

* DTD

  The Document Type Definition is the original way to do this. DTDs come
  directly from the ancestor of XML, SGML. All XML parsers must parse the DTD,
  and report events if the user is using SAX. However, not all parsers are able
  to validate the document against a DTD (XML/Ada doesn't).

  Their use is steadily declining. Among their limitations are a limited
  capability to express constraints on the order of tag children, and the fact
  that DTDs are written in a separate language totally different from XML,
  which users must learn in addition.

* XML Schema

  XML schemas are replacing DTDs. They are written in XML, and provide
  an extensive capability to describe what the XML document should look like.
  In fact, almost all Ada types can be described in an XML schema, including
  range constraints, arrays, records, type inheritance, abstract types...

  It is for instance possible to indicate that the value of a preference, in
  our example, must be a string of length 6. Any other length will result in a
  validation error.
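
As an illustration of such a length constraint, here is a minimal sketch of a
schema fragment. The element name `pref` is hypothetical and is not taken from
the earlier example, and the `xsd` prefix is assumed to be bound to the XML
Schema namespace; only the `<xsd:length>` facet matters here.

.. code-block:: xml

   <!--  Hypothetical element whose value must be exactly 6 characters  -->
   <xsd:element name="pref">
     <xsd:simpleType>
       <xsd:restriction base="xsd:string">
         <xsd:length value="6" />
       </xsd:restriction>
     </xsd:simpleType>
   </xsd:element>

With this declaration, ``<pref>abcdef</pref>`` would be accepted, whereas
``<pref>abc</pref>`` would be reported as a validation error.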

XML Schema Syntax
=================

The Schema module provides subprograms and types to parse an XML schema and
validate an XML document with this schema.

This document does not provide full documentation on the format of XML
Schemas. The format is extensive and has several obscure features which,
although supported by XML/Ada, are of little use in most practical cases. We
refer the reader to the first part of the XML Schema specification, which is
designed as a tutorial (`http://www.w3.org/TR/xmlschema-0/
<http://www.w3.org/TR/xmlschema-0/>`_).

The typical extension for a schema file is :file:`.xsd`.

A schema file must be a valid XML file, and thus start with the usual
``<?xml version="1.0" ?>`` line. The root node must be named `schema`, and belong to the
namespace `http://www.w3.org/2001/XMLSchema
<http://www.w3.org/2001/XMLSchema>`_. The handling of namespaces is fairly
powerful, but also complex. A given XML document might have nodes belonging to
several namespaces, and thus several schema files might have to be loaded, each
defining one of the namespaces.

.. highlight:: xml

In the following simple example, we do not define our schema for a specific
namespace, so no special attribute is needed for the root node. Our document
is therefore organized as::

  <?xml version="1.0" ?>
  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      ... rest of the description goes here ...
  </xsd:schema>

An XML schema does not enforce a specific root node in the XML documents it
validates. However, it must define all the valid elements that can be used in
the XML file. This is done through the `<element>` tag, which takes
one mandatory attribute, the name of the element we are defining.

The contents of the element are then defined in one of two ways:

* Through a `type` attribute.

  Schemas come with a number of predefined simple types. An element of a
  simple type accepts no child node, and its contents must satisfy
  additional constraints (be an integer, a date...).

  Among the predefined simple types (which are all defined in the namespace
  `http://www.w3.org/2001/XMLSchema <http://www.w3.org/2001/XMLSchema>`_),
  one can find: `string`, `integer`, `byte`, `date`, `time`, `dateTime`,
  `boolean`...

  If no additional constraint should be enforced on this simple type when
  applied to the element, the type of the element is given through a `type`
  attribute, as in::

     <xsd:element name="tag1" type="xsd:string"  />
     <xsd:element name="tag2" type="xsd:boolean" />
    
  which would accept the following XML files::

     <tag1>Any string is valid here</tag1>

  and::

     <tag2>true</tag2>
    
  but not::

     <tag2>String</tag2>
    
  As will be described later, it is possible to create new named types in an
  XML schema. Such new types can also be associated with the element through
  the `type` attribute.

* Through an inline type definition

  If the element must accept child elements, or if a further constraint needs
  to be enforced on the list of valid values, one must create the type. As
  mentioned above, this can be done by creating a type separately and
  referencing it by name, or through an inline type definition.

  The syntax is mostly the same in both cases. Schemas distinguish between the
  notion of simple types (that accept no child element) and complex types (that
  accept child elements, and possibly text value).

  To define a simple type, based on string, but that only allows a limited set
  of values (similar to an Ada enumeration), one would create a restriction of
  the standard string type, as in::

     <xsd:element name="tag3">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="value1" />
          <xsd:enumeration value="value2" />
        </xsd:restriction>
      </xsd:simpleType>
     </xsd:element>

  Similarly, we could create an integer type whose valid range of values
  is between 10 and 20, as in::

     <xsd:element name="tag4">
      <xsd:simpleType>
        <xsd:restriction base="xsd:byte">
          <xsd:minInclusive value="10" />
          <xsd:maxInclusive value="20" />
        </xsd:restriction>
      </xsd:simpleType>
     </xsd:element>
    
  Complex types allow elements to have child nodes, as well as attributes. The
  list of valid attributes is created by a set of `<xsd:attribute>` tags, and
  the list of valid child nodes is generally defined either through a
  `<xsd:choice>` or a `<xsd:sequence>` node (although it is possible to
  indicate that any child node is authorized, among other things).

  `<xsd:choice>` indicates that any one of the listed children can appear
  (so that, when the choice is repeated, children can appear in any order),
  whereas `<xsd:sequence>` enforces a specific order on children.

  In both cases, extra attributes can be specified to indicate the number of
  times the sequence or choice itself can be repeated, or the number of times
  each child node can appear.

  For instance, we can indicate that `tag5` accepts either between 1 and 3
  `tag6` child nodes, or a single `tag7` child node. In addition, `tag5`
  accepts one optional attribute. Note that the type of `tag6` and `tag7` is
  here specified through a `type` attribute, although it could in turn be
  defined inline::

     <xsd:element name="tag5">
       <xsd:complexType>
         <xsd:choice>
           <xsd:element name="tag6" type="xsd:string"
                        minOccurs="1" maxOccurs="3" />
           <xsd:element name="tag7" type="xsd:string" maxOccurs="1" />
         </xsd:choice>
         <xsd:attribute name="attr" type="xsd:boolean" use="optional" />
       </xsd:complexType>
     </xsd:element>
    
  In the example above, if `tag6` was defined elsewhere in the
  schema, we could use a reference to it, instead of duplicating its
  type definition, as in::

      <xsd:element ref="tag6" />

  If you need an element with no child element (just a string value),
  but that accepts attributes, this also must be defined through a
  complex type, as in::

     <xsd:element name="tag8">
       <xsd:complexType>
         <xsd:simpleContent>
           <xsd:extension base="xsd:string">
              <xsd:attribute name="attr" type="xsd:boolean" />
           </xsd:extension>
         </xsd:simpleContent>
       </xsd:complexType>
     </xsd:element>

As mentioned before, instead of defining inline types, we could explicitly
declare them, and reference them in the element declaration later on::

   <xsd:simpleType name="string_of_length_10">
     <xsd:restriction base="xsd:string">
       <xsd:length value="10"/>
     </xsd:restriction>
   </xsd:simpleType>
   <xsd:element name="tag9" type="string_of_length_10" />
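
Putting these pieces together, here is a small sketch of a complete schema
that uses a named type, a `<xsd:sequence>` and occurrence attributes. The
names `order`, `item`, `note`, `urgent` and `item_code` are hypothetical and
only serve the illustration::

   <?xml version="1.0" ?>
   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
     <!--  Named simple type, referenced below through "type"  -->
     <xsd:simpleType name="item_code">
       <xsd:restriction base="xsd:string">
         <xsd:length value="4" />
       </xsd:restriction>
     </xsd:simpleType>

     <xsd:element name="order">
       <xsd:complexType>
         <!--  Children must appear in this order  -->
         <xsd:sequence>
           <xsd:element name="item" type="item_code"
                        minOccurs="1" maxOccurs="unbounded" />
           <xsd:element name="note" type="xsd:string" minOccurs="0" />
         </xsd:sequence>
         <xsd:attribute name="urgent" type="xsd:boolean" use="optional" />
       </xsd:complexType>
     </xsd:element>
   </xsd:schema>

Under these assumptions, a document such as the following should be
accepted, whereas placing `<note>` before `<item>` would not be::

   <order urgent="true">
     <item>A001</item>
     <item>B002</item>
     <note>Deliver before noon</note>
   </order>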
  
Connecting XML documents and schemas
====================================

XML/Ada can find which schema to use when validating a file in several
ways.

* Manually creating the grammar.

  The schema module contains the package `Schema.Validators` which
  allows you to create a grammar by hand. It is very low-level, and it is
  likely that you will never need to use it. It is mostly used internally,
  and to create the schema that is used to validate schema files
  themselves.

* Explicitly parsing a schema file

  Parsing a schema file can be done through a call to `Parse` for a reader
  derived from `Schema.Schema_Readers.Schema_Reader`.  As usual, you
  call `Parse`, and pass it an input source. As output, you get
  access to a grammar, which can then be given to an instance of
  `Schema.Readers.Validating_Reader`.

  .. highlight:: ada
     :linenothreshold: 3

  This technique will generally be used when you need to validate several
  XML files with the same grammar: you parse the grammar only once, and
  then reuse its instance, instead of reparsing the :file:`.xsd` file every
  time::

    with Ada.Text_IO;   use Ada.Text_IO;
    with Schema.Schema_Readers, Schema.Validators, Input_Sources.File;
    use  Schema.Schema_Readers, Schema.Validators, Input_Sources.File;

    procedure SchemaExample2 is
       Grammar : XML_Grammar;
       Schema  : Schema_Reader;
       Read    : File_Input;
    begin
       Open ("file.xsd", Read);
       Parse (Schema, Read);
       Close (Read);

       Grammar := Get_Grammar (Schema);

    exception
       when XML_Validation_Error | XML_Not_Implemented =>
          Put_Line ("ERROR: " & Get_Error_Message (Schema));
    end SchemaExample2;

  In the example above, the schema file itself is validated against the
  official schema for schema files.

  The resulting grammar object is in fact a collection of parsed schema
  files, each associated with its own namespace. It can be kept as long as
  you need it in your application. Memory will automatically be reclaimed
  when no longer needed.

  Every time you parse an XML file later on, you must associate the
  Grammar with the parser::

     declare
        Read      : File_Input;
        My_Reader : Validating_Reader;
     begin
        Set_Grammar (My_Reader, Grammar);
        Set_Feature (My_Reader, Schema_Validation_Feature, True);
        Open (Xml_File.all, Read);
        Parse (My_Reader, Read);
        Close (Read);

     exception
        when XML_Validation_Error | XML_Not_Implemented =>
           Put_Line ("ERROR: " & Get_Error_Message (My_Reader));
     end;

* Implicitly parsing the schema

  Two special attributes, defined in the Schema standard, can be used to
  indicate, in an XML document itself, that it should be validated with
  a specific schema.

  These attributes are both defined in a special namespace,
  `http://www.w3.org/2001/XMLSchema-instance
  <http://www.w3.org/2001/XMLSchema-instance>`_.

  * `xsi:noNamespaceSchemaLocation`

    The value of this attribute is the name of a file that contains
    the schema to use for elements that are not associated with a
    specific namespace.

  * `xsi:schemaLocation`

    This attribute is a list of strings, alternating between the URI of
    a namespace and the name of an XSD file to use for that
    namespace. For instance, ``"ns1 file1.xsd ns2 file2.xsd"``.

  When it encounters either of these two attributes, XML/Ada will
  automatically parse the corresponding schema files, and use the result
  to validate the file.

  See the section below on optimizing the parsing of the grammars, as a
  way to avoid parsing the same grammar multiple times.
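
As a sketch of what such a document might look like, the following uses
`xsi:schemaLocation` to associate one namespace with an XSD file. The
namespace URI ``http://example.org/ns1`` and the element names are
hypothetical; the first string of the pair is the namespace URI and the
second is the schema file.

.. code-block:: xml

   <?xml version="1.0" ?>
   <!--  Hypothetical document validated against file1.xsd  -->
   <ns1:root xmlns:ns1="http://example.org/ns1"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://example.org/ns1 file1.xsd">
      <ns1:child>102</ns1:child>
   </ns1:root>

Here :file:`file1.xsd` is assumed to declare ``http://example.org/ns1`` as
its `targetNamespace`.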

Validating documents with SAX
=============================

XML/Ada is unusual among XML parsers in that it can validate XML files while
you are using an event-based SAX parser. Most other XML parsers only
validate DOM trees.

Basing the validation on SAX is more efficient, since there is no need to read
the whole XML stream (or even the grammar) in memory before starting the
validation, and errors can be reported immediately.

It also requires less memory to run, and thus can validate large XML
documents.

It also means that even if you are using SAX, and not DOM, you still have
access to the validation features.

Validating an XML document while parsing it works much the same as using SAX
itself. Instead of inheriting from `Sax.Readers.Reader`, your tagged
type must inherit from `Schema.Readers.Validating_Reader`.

As usual, you can still override the predefined primitive operations like
`Start_Element`, `End_Element`...

Note the activation of the `Schema_Validation_Feature` feature, without which
no validation takes place:

.. literalinclude:: schema/schemaexample.adb
   :linenos:

Validating documents with DOM
=============================

This is very similar to using DOM itself, except the base class of your
reader should be `Schema.Dom_Readers.Tree_Reader`. Going back to the
example described in :ref:`Using_DOM`, you would use the following to
validate XML streams before generating the DOM tree.

.. literalinclude:: dom/domschemaexample.adb
   :language: ada
   :linenos:

Unsupported schema elements
===========================

Not all aspects of XML schemas are supported by XML/Ada.
In particular, it does not currently support XPath, so any part of the
schema that relies on XPath expressions (for instance `<xsd:key>`
and `<xsd:unique>`) is not supported.
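
To make this concrete, here is a sketch of the kind of declaration that
relies on XPath. The element and attribute names (`catalog`, `item`, `id`)
and the constraint name are hypothetical; the `xpath` attributes of
`<xsd:selector>` and `<xsd:field>` are what make it depend on XPath support.

.. code-block:: xml

   <xsd:element name="catalog">
     <xsd:complexType>
       <xsd:sequence>
         <xsd:element name="item" maxOccurs="unbounded">
           <xsd:complexType>
             <xsd:attribute name="id" type="xsd:string" use="required" />
           </xsd:complexType>
         </xsd:element>
       </xsd:sequence>
     </xsd:complexType>
     <!--  Identity constraint: each item must have a distinct id  -->
     <xsd:unique name="uniqueItemId">
       <xsd:selector xpath="item" />
       <xsd:field xpath="@id" />
     </xsd:unique>
   </xsd:element>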

Optimizing the parsing of grammars
==================================

It is often the case that a given :file:`.xsd` file will be reused multiple
times to validate XML documents. In such a case, you do not want to parse the
file multiple times, but instead reuse an already existing `XML_Grammar`
object. Of course, this is a tradeoff between the memory used to keep the
grammar around and the time it would take to reparse the grammar.

This is easily done when you have a single :file:`.xsd` file to reuse
for all the XML files. Simply call `Set_Grammar` on the parser before
you parse the file, as in::

   declare
      G : constant XML_Grammar := ...;  --  parsed earlier
      R : Validating_Reader;
      F : File_Input;
   begin
      R.Set_Grammar (G);
      Open ("file.xml", F);
      R.Parse (F);
      Close (F);
      ...;  --  Do something with the resulting tree
   end;
      
The second use case is a bit more complex: you have several XSD files to
parse, and each XML file needs one or another of them. If you are using
namespaces, there is nothing special to do, and the same code as above
applies: you can simply parse each of the XSD files into the same
`XML_Grammar`, and then use that grammar to parse all the XML files,
as in::

   declare
      G : XML_Grammar;
      S : Schema_Reader;
      F : File_Input;
      R : Validating_Reader;
   begin
      Open ("grammar1.xsd", F);
      S.Parse (F);
      F.Close;

      Open ("grammar2.xsd", F);
      S.Parse (F);
      F.Close;

      G := S.Get_Grammar;

      R.Set_Grammar (G);
      Open ("file.xml", F);
      R.Parse (F);
      F.Close;
   end;
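
For this to work, each XSD file is expected to define its own namespace. A
minimal sketch of what the two files might contain follows; the namespace
URIs and element names are hypothetical. :file:`grammar1.xsd` could be:

.. code-block:: xml

   <?xml version="1.0" ?>
   <!--  Everything declared here belongs to the first namespace  -->
   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://example.org/ns1">
      <xsd:element name="root1" type="xsd:string" />
   </xsd:schema>

and :file:`grammar2.xsd`:

.. code-block:: xml

   <?xml version="1.0" ?>
   <!--  A different target namespace keeps the two grammars apart  -->
   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://example.org/ns2">
      <xsd:element name="root2" type="xsd:integer" />
   </xsd:schema>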

If however you are not using namespaces, you cannot use this technique, since
the grammars from the various XSD files would end up mixed together, and
validation would most likely fail. Instead, you need one `XML_Grammar` per
XSD file, and must set the grammar on the reader dynamically. A full
example is given in the XML/Ada source distribution, in
:file:`tests/schema/multiple_xsd`. Here is an overview.

We first need to parse each of the XSD files into its own grammar::

   declare
      Symbols : Symbol_Table;
      G1, G2  : XML_Grammar;
      S       : Schema_Reader;
      F       : File_Input;
   begin
      --  Since we are going to reuse grammars, we need to share their
      --  symbol table (where internal strings are stored) across all
      --  involved parsers.
      Symbols := Allocate;
      S.Set_Symbol_Table (Symbols);

      --  Now we can parse each of the XSD files
      Open ("algo1.xsd", F);
      S.Parse (F);
      F.Close;
      G1 := S.Get_Grammar;

      S.Set_Grammar (No_Grammar);  --  reset
      Open ("algo2.xsd", F);
      S.Parse (F);
      F.Close;
      G2 := S.Get_Grammar;
   end;

We then need to create a custom validating reader, which knows how to set
the grammar based on the name of the XSD file. This is done by overriding one
of the primitive operations of the parser::

   declare
      type My_Reader is new Validating_Reader with null record;
      overriding procedure Parse_Grammar
         (Self          : not null access My_Reader;
          URI, Xsd_File : Sax.Symbols.Symbol;
          Do_Create_NFA : Boolean := True) is
      begin
         if Xsd_File = "algo1.xsd" then
            Self.Set_Grammar (G1);
         elsif Xsd_File = "algo2.xsd" then
            Self.Set_Grammar (G2);
         end if;
      end Parse_Grammar;

      R : My_Reader;
   begin
      --  Also share the same symbol table
      R.Set_Symbol_Table (Symbols);
      R.Set_Feature (Schema_Validation_Feature, True);

      Open ("test1.xml", F);
      R.Parse (F);
      F.Close;
   end;

Where for instance :file:`test1.xml` contains::

   <?xml version="1.0" encoding="UTF-8"?>
   <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="algo1.xsd">
      <child>102</child>
   </root>
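
The actual contents of :file:`algo1.xsd` are part of the source distribution
and are not reproduced here. As a purely hypothetical sketch, a schema without
a target namespace that would accept the document above could look like:

.. code-block:: xml

   <?xml version="1.0" ?>
   <!--  Hypothetical stand-in for algo1.xsd: no targetNamespace  -->
   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="root">
         <xsd:complexType>
            <xsd:sequence>
               <xsd:element name="child" type="xsd:integer" />
            </xsd:sequence>
         </xsd:complexType>
      </xsd:element>
   </xsd:schema>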