******************************************************************************
Notes on the XML specification
******************************************************************************
==============================================================================
This document
==============================================================================
There are some points in the XML specification which are ambiguous. The
following notes discuss these points, and describe how this parser behaves.
==============================================================================
Conditional sections and the token ]]>
==============================================================================
It is unclear what happens if an ignored section contains the token ]]> at
places where it is normally allowed, i.e. within string literals and comments,
e.g.
<![IGNORE[ <!-- ]]> --> ]]>
On the one hand, the production rule of the XML grammar does not treat such
tokens specially. Following the grammar, the first ]]> already ends the
conditional section
<![IGNORE[ <!-- ]]>
and the remaining tokens are included in the DTD.
On the other hand, we can read: "Like the internal and external DTD subsets, a
conditional section may contain one or more complete declarations, comments,
processing instructions, or nested conditional sections, intermingled with
white space" (XML 1.0 spec, section 3.4). Complete declarations and comments
may contain ]]>, so this contradicts the grammar.
The intention of conditional sections is to include or exclude the section
depending on the current replacement text of a parameter entity. Almost always
such sections are used as in
<!ENTITY % want.a.feature.or.not "INCLUDE"> (or "IGNORE")
<![ %want.a.feature.or.not; [ ... ]]>
This means that if it is possible to include a section it must also be legal to
ignore the same section. This is a strong indication that the token ]]> must
not count as section terminator if it occurs in a string literal or comment.
This parser implements the latter interpretation: a ]]> inside a string literal
or comment does not terminate the section.
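The difference between the two readings can be sketched as two small scanners.
This is only an illustration in Python, not this parser's implementation: the
function names end_by_grammar and end_skipping_literals are hypothetical, and
the lexical rules are simplified (a comment is <!-- ... -->, a string literal
is anything between two matching quote characters, a nested section opens with
<![).

```python
def end_by_grammar(text, start=0):
    """Return the index just past the ']]>' that closes the section when only
    the bare grammar is followed. `start` points just after the opening
    '<![IGNORE['. Nested '<![' openers are counted; comments and string
    literals get no special treatment. Returns -1 if the section never ends."""
    depth, i = 1, start
    while i < len(text):
        if text.startswith("<![", i):
            depth += 1
            i += 3
        elif text.startswith("]]>", i):
            depth -= 1
            i += 3
            if depth == 0:
                return i
        else:
            i += 1
    return -1


def end_skipping_literals(text, start=0):
    """Like end_by_grammar, but a ']]>' inside a comment or a string literal
    is not treated as a terminator (simplified lexical rules, see above)."""
    depth, i = 1, start
    while i < len(text):
        if text.startswith("<!--", i):
            close = text.find("-->", i + 4)
            if close < 0:
                return -1
            i = close + 3              # jump past the whole comment
        elif text[i] in "\"'":
            close = text.find(text[i], i + 1)
            if close < 0:
                return -1
            i = close + 1              # jump past the whole literal
        elif text.startswith("<![", i):
            depth += 1
            i += 3
        elif text.startswith("]]>", i):
            depth -= 1
            i += 3
            if depth == 0:
                return i
        else:
            i += 1
    return -1


# Body of the example section, i.e. everything after '<![IGNORE['
body = " <!-- ]]> --> ]]>"
grammar_end = end_by_grammar(body)         # stops inside the comment
literal_end = end_skipping_literals(body)  # stops at the final ]]>
```

For the example above, the first scanner stops at the ]]> inside the comment,
while the second one consumes the whole text.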
==============================================================================
Conditional sections and the inclusion of parameter entities
==============================================================================
It is unclear what happens if an ignored section contains a reference to a
parameter entity. In most cases, this is not problematic because nesting of
parameter entities must respect declaration braces. The replacement text of
parameter entities must either contain a whole number of declarations or only
inner material of one declaration. Almost always it does not matter whether
these references are resolved or not (the section is ignored).
But there is one case which is not explicitly specified: May the replacement
text of an entity contain the end marker ]]> of an ignored conditional section?
Example:
<!ENTITY % end "]]>">
<![ IGNORE [ %end;
We do not find a statement in the XML spec that the ]]> must be contained in
the same entity as the corresponding <![ (as for the tokens <! and > of
declarations). So it is possible to conclude that ]]> may be in another entity.
Of course, there are many arguments against allowing such constructs: The
resulting code is incomprehensible, and parsing takes longer (especially if the
entities are external). I think the best argument against this kind of XML is
that the XML spec is not detailed enough, as it contains no rules about where
entity references should be recognized and where not. For example:
<!ENTITY % y "]]>">
<!ENTITY % x "<!ENTITY z '<![CDATA[some text%y;'>">
<![ IGNORE [ %x; ]]>
Which token ]]> counts? From a logical point of view, the ]]> in the third line
ends the conditional section. As already pointed out, the XML spec permits the
interpretation that ]]> is recognized even in string literals, and this may
also be true if it is "imported" from a separate entity; in that case the first
]]> denotes the end of the section.
As a practical solution, this parser does not expand parameter entities in
ignored sections. Furthermore, the closing ]]> of an ignored or included
section must be contained in the same entity as the opening <![ token.
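The same-entity rule can be illustrated with a small check over a token stream.
This is a sketch under assumptions, not this parser's code: it assumes a
hypothetical lexer that tags every token with the name of the entity it was
read from, and the function name check_section_nesting is made up for this
example.

```python
def check_section_nesting(tokens):
    """tokens: iterable of (entity_name, token_text) pairs, where entity_name
    says which entity the token was read from (hypothetical lexer output).

    Raises ValueError if a ']]>' comes from a different entity than the
    '<![' it closes, or if openers and closers do not match up."""
    openers = []  # entity names of the currently open '<![' tokens
    for entity_name, token_text in tokens:
        if token_text == "<![":
            openers.append(entity_name)
        elif token_text == "]]>":
            if not openers:
                raise ValueError("']]>' without a matching '<!['")
            opened_in = openers.pop()
            if opened_in != entity_name:
                raise ValueError(
                    "section opened in entity %r but closed in entity %r"
                    % (opened_in, entity_name)
                )
    if openers:
        raise ValueError("unterminated conditional section")


# Accepted: opener and closer both come from the main entity.
check_section_nesting([("main", "<!["), ("main", "IGNORE"), ("main", "]]>")])

# Rejected: the closer would come from the replacement text of %end;.
try:
    check_section_nesting([("main", "<!["), ("main", "IGNORE"), ("end", "]]>")])
    rejected = False
except ValueError:
    rejected = True
```

Because parameter entities are not expanded inside ignored sections, a ]]> that
exists only in the replacement text of an entity is never seen there in the
first place; the check matters for included sections.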
==============================================================================
Standalone documents and attribute normalization
==============================================================================
If a document is declared as stand-alone, a restriction on the effect of
attribute normalization takes effect for attributes declared in external
entities. Normally, the parser knows the type of the attribute from the ATTLIST
declaration, and it can normalize attribute values depending on their types.
For example, an NMTOKEN attribute can be written with leading or trailing
spaces, but the parser always returns the nmtoken without such added spaces; in
contrast, a CDATA attribute is not normalized in this way. For a stand-alone
document, the type information is not available if the ATTLIST declaration is
located in an external entity. Because of this, the XML spec requires that
attribute values be written in their normal form in this case, i.e. without
additional spaces.
This parser interprets this restriction as follows. Obviously, the substitution
of character and entity references does not count as a "change of the value"
resulting from normalization, because these operations are performed
identically if the ATTLIST declaration is not available. The same applies to
the substitution of TABs, CRs, and LFs by space characters. Only the removal of
spaces depending on the type of the attribute changes the value if the ATTLIST
is not available.
In detail, this means: CDATA attributes never violate the stand-alone status.
ID, IDREF, NMTOKEN, ENTITY, NOTATION and enumerated attributes must not be
written with leading and/or trailing spaces. IDREFS, ENTITIES, and NMTOKENS
attributes must not be written with extra spaces at the beginning or at the end
of the value, or between the tokens of the list.
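The rule can be sketched as a small predicate. This is a minimal illustration
in Python, not this parser's implementation. The function name
changes_by_type_normalization and the type label "ENUMERATION" (standing for
enumerated attributes) are assumptions of the sketch, and the input is assumed
to be the value after references have been substituted and TABs, CRs and LFs
have been replaced by spaces.

```python
SINGLE_VALUE_TYPES = {"ID", "IDREF", "NMTOKEN", "ENTITY", "NOTATION",
                      "ENUMERATION"}  # "ENUMERATION" is a label made up here
LIST_TYPES = {"IDREFS", "ENTITIES", "NMTOKENS"}


def changes_by_type_normalization(att_type, value):
    """Return True if type-dependent normalization would change `value`,
    i.e. if the value would violate the stand-alone status when the ATTLIST
    declaration is located in an external entity."""
    if att_type == "CDATA":
        return False  # CDATA values are never trimmed or collapsed
    if att_type in SINGLE_VALUE_TYPES:
        # Only leading and trailing spaces are relevant for a single token.
        return value != value.strip(" ")
    if att_type in LIST_TYPES:
        # Normal form: tokens separated by exactly one space, no padding.
        tokens = [tok for tok in value.split(" ") if tok]
        return value != " ".join(tokens)
    raise ValueError("unknown attribute type: %r" % att_type)


examples = [
    ("CDATA", "  a  "),     # allowed: CDATA is left alone
    ("NMTOKEN", " a"),      # violation: leading space
    ("NMTOKENS", "a  b"),   # violation: two spaces between tokens
    ("NMTOKENS", "a b"),    # allowed: already in normal form
]
results = [changes_by_type_normalization(t, v) for t, v in examples]
```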
The whole check is dubious, because the attribute type also expresses a
semantic constraint, not only a syntactic one. At least this parser
distinguishes strictly between single-value and list types, and returns the
attribute values differently; the former are represented as Value s (where s is
a string), the latter are represented as Valuelist [s1; s2; ...; sN]. The
internal representation of the value depends on the attribute type, too,
so that even normalized values are processed differently depending on whether
the attribute has list type or not. For this parser, it still makes a
difference whether a value is normalized and processed as if it were CDATA, or
whether the value is processed according to its declared type.
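The following Python sketch mimics the two representations to show why an
already normalized value is still treated differently depending on its type.
It is an analogy only: the classes Value and Valuelist imitate the
constructors named above, and the function represent is a hypothetical helper,
not part of this parser.

```python
from dataclasses import dataclass
from typing import Tuple

LIST_TYPES = {"IDREFS", "ENTITIES", "NMTOKENS"}


@dataclass(frozen=True)
class Value:
    s: str                    # a single string value


@dataclass(frozen=True)
class Valuelist:
    items: Tuple[str, ...]    # the tokens of a list-typed value


def represent(att_type, value):
    """Build the representation of `value` for the given attribute type.
    `value` is assumed to contain spaces as its only whitespace."""
    if att_type == "CDATA":
        return Value(value)               # kept exactly as written
    tokens = tuple(tok for tok in value.split(" ") if tok)
    if att_type in LIST_TYPES:
        return Valuelist(tokens)          # one entry per token
    return Value(" ".join(tokens))        # single token, padding removed


as_cdata = represent("CDATA", "a b")      # one string containing a space
as_list = represent("NMTOKENS", "a b")    # two separate tokens
```

The text "a b" is in normal form in both cases, yet the results differ in
shape, which is the difference the paragraph above describes.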
The stand-alone check is included so that one can state whether other parsers,
which only check well-formedness, can process the document. Of course, these
parsers always process attributes as CDATA, and the stand-alone check
guarantees that they will always see the normalized values.
==============================================================================
Standalone documents and the restrictions on entity references
==============================================================================
Stand-alone documents must not refer to entities which are declared in an
external entity. This parser applies this rule only to general and NDATA
entities occurring in the document body (i.e. not in the DTD), and to general
and NDATA entities occurring in default attribute values declared in the
internal subset of the DTD.
Parameter entities are irrelevant to the stand-alone property. If the internal
subset contains a reference to a parameter entity that was declared in an
external entity, that entity is unavailable in the same way as the external
entity containing its declaration is unavailable. Because of this
"equivalence", parameter entity references are not checked for violations of
the stand-alone declaration. It simply does not matter. Illustration:
Main document:
<!ENTITY % ext SYSTEM "ext">
%ext;
%ent;
"ext" contains:
<!ENTITY % ent "<!ELEMENT el (other*)>">
Here, the reference %ent; would be illegal if the standalone declaration were
interpreted strictly. This parser handles the references %ent; and %ext;
equivalently, which means that %ent; is allowed, but the element type "el" is
treated as externally declared.
General entities can occur within the DTD, but only in the default values of
attributes or in the definitions of other general entities. The latter case can
be ignored, because the check will be repeated when the entities are expanded.
General entities occurring in default attribute values, however, are checked at
the moment when the default is used in an element instance.
General entities occurring in the document body are always checked.
NDATA entities can occur in ENTITY attribute values; either in the element
instance or in the default declaration. Both cases are checked.
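The rules of this section can be summarized as a small decision function. This
is a sketch of the policy as described here, not this parser's code; the
function name and the string labels for entity kinds and contexts are
assumptions of the example.

```python
def reference_violates_standalone(kind, context, declared_externally):
    """Return True if an entity reference counts as a violation of the
    stand-alone declaration under the policy described above.

    kind:    "parameter", "general" or "ndata"
    context: "body"               reference in the document body
             "default_in_use"     reference in a default attribute value, at
                                  the moment the default is applied to an
                                  element instance
             "entity_definition"  reference inside the definition of another
                                  general entity
    declared_externally: True if the referenced entity is declared in an
                         external entity
    """
    if kind == "parameter":
        return False   # parameter entities are never checked
    if kind not in ("general", "ndata"):
        raise ValueError("unknown entity kind: %r" % kind)
    if context == "entity_definition":
        return False   # deferred: checked again when the entity is expanded
    if context in ("body", "default_in_use"):
        return declared_externally
    raise ValueError("unknown context: %r" % context)


# %ent; from the illustration: declared externally, but never a violation.
pe_result = reference_violates_standalone("parameter", "body", True)

# A general entity declared externally and used in the document body.
ge_result = reference_violates_standalone("general", "body", True)
```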