NOTE: The following text is partly out of date, but not fully.
------------------------------------------------ -*- indented-text -*-
Some Notes About the Design:
----------------------------------------------------------------------
----------------------------------------------------------------------
Compilation
----------------------------------------------------------------------
Compilation is non-trivial because:
- The lexer and parser generators, ocamllex and ocamlyacc, normally
create code such that the parser module precedes the lexer module.
This design requires that the lexer layer precedes the entity layer,
which precedes the parser layer, because the parsing results modify
the behaviour of the lexer and entity layers. There is no way to get
around this because of the nature of XML.
So the dependency relation between the lexer and the parser is
inverted; in particular, the "token" type that is normally defined
by the generated parser is moved to a common predecessor of both
the lexer and the parser.
- Another deviation from the standard way of handling parsers is that
the parser is turned into an object. This is necessary because the
whole parser is polymorphic, i.e. there is a type parameter (the
type of the node extension).
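The dependency inversion can be sketched as follows. This is illustrative
code, not the generated modules; all names (Lexer_types, Lexer, Parser)
are hypothetical stand-ins for Markup_lexer_types, Markup_lexers, and
Markup_yacc:

```ocaml
(* Sketch: the "token" type lives in a module that precedes both the
   lexer and the parser, so the parser can receive its token source
   as a function instead of depending on the lexer directly. *)

module Lexer_types = struct
  (* Normally ocamlyacc would define this type in the generated
     parser module; here it is "moved" before both layers. *)
  type token =
    | Name of string
    | Eof
end

module Lexer = struct
  open Lexer_types
  (* stub: the real lexer would scan a lexbuf *)
  let next_token () = Eof
end

module Parser = struct
  open Lexer_types
  (* The parser only depends on Lexer_types; the token source is
     passed in as a function. *)
  let parse (next : unit -> token) =
    match next () with
    | Eof -> "empty"
    | Name n -> n
end
```

Because only `Lexer_types` is shared, the lexer can stay an antecedent
of the parser even though ocamlyacc assumes the opposite ordering.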
......................................................................
First some modules are generated as illustrated by the following
diagram:
  markup_yacc.mly
     |       |
     |       |                 [ocamlyacc, 1]
     V       V
  markup_yacc.mli        markup_yacc.ml
     |                       --> renamed into markup_yacc.ml0
     | [awk, 2]                |
     V                         | [sed, 3]
  markup_yacc_token.mlf        V
                          markup_yacc.ml

  markup_lexer_types_       markup_lexer_types_
  shadow.mli                shadow.ml
     |                         |
     | [sed, 4]                | [sed, 4]
     V                         V
  markup_lexer_types.mli    markup_lexer_types.ml

  markup_yacc_shadow.mli
     |
     | [replaces, 5]
     V
  markup_yacc.mli

  markup_lexers.mll
     |
     | [ocamllex, 6]
     V
  markup_lexers.ml
Notes:
(1) ocamlyacc generates both a module and a module interface.
The module is postprocessed in step (3). The interface cannot be
used as it is, but it contains the definition of the "token" type.
This definition is extracted in step (2), and the interface is
completely replaced in step (5) by a different file.
(2) An "awk" script extracts the definition of the type "token".
"token" is generated by ocamlyacc from the %token directives in
markup_yacc.mly, and is normally defined in the module generated
by ocamlyacc. This is not usable here, because the module
dependency must be that the lexer is an antecedent of the parser,
and not vice versa (as is usual); so the "token" type is "moved"
to the module Markup_lexer_types, which is an antecedent of both
the lexer and the parser.
(3) A "sed" script turns the generated parser into an object.
This is rather simple: some "let" definitions must be rewritten
as "val" definitions, the remaining "let" definitions as
"method" definitions. The parser object is needed because the
whole parser has a polymorphic type parameter.
(4) The implementation and the interface of Markup_lexer_types are
both generated by inserting the "token" type definition
(in markup_lexer_types.mlf) into two pattern files,
markup_lexer_types_shadow.ml and markup_lexer_types_shadow.mli.
The point of insertion is marked by the string INCLUDE_HERE.
(5) The generated interface of the Markup_yacc module is replaced
by a hand-written file.
(6) ocamllex generates the lexer; this process is not patched in any
way.
......................................................................
After the additional modules have been generated, compilation proceeds
in the usual manner.
----------------------------------------------------------------------
Hierarchy of parsing layers:
----------------------------------------------------------------------
From top to bottom:
- Parser: Markup_yacc
+ gets input stream from the main entity object
+ checks most of the grammar
+ creates the DTD object as side-effect
+ creates the element tree as side-effect
+ creates further entity objects that are entered into the DTD
- Entity layer: Markup_entity
+ gets input stream from the lexers, or another entity object
+ handles entity references: if a reference is encountered the
input stream is redirected such that the tokens come from the
referenced entity object
+ handles conditional sections
- Lexer layer: Markup_lexers
+ gets input from lexbuffers created by resolvers
+ different lexers for different lexical contexts
+ a lexer returns pairs (token,lexid), where token is the scanned
token, and lexid is the name of the lexer that must be used for
the next token
- Resolver layer: Markup_entity
+ a resolver creates the lexbuf from some character source
+ a resolver recodes the input and handles the encoding scheme
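The pull chain between the layers can be sketched like this. The class
and constructor names are illustrative, not the actual code:

```ocaml
(* Sketch: the parser always asks the main entity object for the next
   token; the entity object in turn pulls tokens from the lexers.
   Here the lexer side is modelled by a plain token list. *)
type token = Begin_entity | End_entity | Chunk of string | Eof

class entity (tokens : token list) = object
  val mutable rest = tokens
  (* the parser is driven by:  fun () -> en#next_token *)
  method next_token =
    match rest with
    | [] -> Eof
    | t :: tl -> rest <- tl; t
end

let main = new entity [ Begin_entity; Chunk "doc"; End_entity ]
```

Each layer only needs the `next_token` interface of the layer below,
which is what makes the redirection tricks of the entity layer possible.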
----------------------------------------------------------------------
The YACC based parser
----------------------------------------------------------------------
ocamlyacc allows passing an arbitrary 'next_token' function to the
parsing functions. We always use 'en # next_token()' where 'en' is the
main entity object representing the main file to be parsed.
The parser is not purely functional, but mainly uses side-effects to
accumulate the structures that have been recognized. This is very
important for the entity definitions, because once an entity definition
has been found there may be a reference to it, which is handled by the
entity layer (which is below the yacc layer). This means that such a
definition modifies the token source of the parser, and this can only
be handled by side-effects (at least in a sensible manner; a purely
functional parser would have to pass unresolved entity references to
its caller, which would have to resolve the reference and re-parse the
whole document!).
Element definitions also profit from the imperative style of the
parser: an element instance can be validated directly once its end
tag has been read.
----------------------------------------------------------------------
The entity layer
----------------------------------------------------------------------
The parser gets the tokens from the main entity object. This object
controls the underlying lexing mechanism (see below), and already
interprets the following:
- Conditional sections (if they are allowed in this entity):
The structures <![ INCLUDE [ ... ]]> and <![ IGNORE [ ... ]]> are
recognized and interpreted.
This would be hard to realize in the yacc parser, because:
- INCLUDE and IGNORE are not recognized as lexical keywords but as
names. This means that the parser cannot select different rules
for them.
- The text after IGNORE requires different lexical handling.
- Entity references: &name; and %name;
The named entity is looked up and the input source is redirected to it, i.e.
if the main entity object gets the message 'next_token' this message is
forwarded to the referenced entity. (This entity may choose to forward the
message again to a third entity, and so on.)
There are some fine points:
- It is okay that redirection happens at token level, not at character level:
+ General entities must always match the 'content' production, and because
of this they must always consist of a whole number of tokens.
+ If parameter entities are resolved, the XML specification states that
a space character is inserted before and after the replacement text.
This also means that such entities always consist of a whole number
of tokens.
- There are some "nesting constraints":
+ General entities must match the 'content' production. Because of this,
the special token Begin_entity is inserted before the first token of
the entity, and End_entity is inserted just before the Eof token. The
brace Begin_entity...End_entity is recognized by the yacc parser, but
only in the 'content' production.
+ External parameter entities must match 'extSubsetDecl'. Again,
Begin_entity and End_entity tokens embrace the inner token stream.
The brace Begin_entity...End_entity is recognized by the yacc parser
at the appropriate position.
(As general and parameter entities are used in different contexts
(document vs. DTD), both kinds of entities can use the same brace
Begin_entity...End_entity.)
+ TODO:
The constraints for internal parameter entities are not yet checked.
- Recursive references can be detected because entities must be opened
before the 'next_token' method can be invoked.
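The redirection of the token source can be sketched as follows. This is
a toy model of the mechanism only; the class name and token constructors
are illustrative:

```ocaml
(* Sketch: when a reference to another entity is encountered, the
   entity object forwards next_token to the referenced entity until
   that entity is exhausted, then resumes its own token stream. *)
type token = Begin_entity | End_entity | Data of string | Eof

class entity (body : token list) = object (self : 'a)
  val mutable rest = body
  (* the current redirection target, as a token thunk *)
  val mutable redirect : (unit -> token) option = None
  method next_token : token =
    match redirect with
    | Some next ->
        (match next () with
         | Eof -> redirect <- None; self#next_token
         | t -> t)
    | None ->
        (match rest with
         | [] -> Eof
         | t :: tl -> rest <- tl; t)
  (* called when a reference &name; resolving to [en] is seen *)
  method refer_to (en : 'a) =
    redirect <- Some (fun () -> en#next_token)
end
```

A chain of redirections (an entity referring to a second entity that
refers to a third) falls out of this scheme for free, since each
`next_token` call is simply forwarded down the chain.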
----------------------------------------------------------------------
The lexer layer
----------------------------------------------------------------------
There are five main lexers, and a number of auxiliary lexers. The five
main lexers are:
- Document (function scan_document):
Scans an XML document outside the DTD and outside the element instance.
- Content (function scan_content):
Scans an element instance, but not within tags.
- Within_tag (function scan_within_tag):
Scans within <...>, i.e. a tag denoting an element instance.
- Document_type (function scan_document_type):
Scans after <!DOCTYPE until the corresponding >.
- Declaration (function scan_declaration):
Scans sequences of declarations.
Why several lexers? Because there are different lexical rules in these
five regions of an XML document.
Every lexer not only produces tokens, but also the name of the next lexer
to use. For example, if the Document lexer scans "<!DOCTYPE", it also
outputs that the next token must be scanned by Document_type.
It is interesting that this really works. The beginning of every lexical
context can be recognized by the lexer of the previous context, and there
is always a token that unambiguously indicates that the context ends.
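The (token, lexid) scheme can be sketched as follows. The real lexers
are generated by ocamllex; this only models the dispatch, and the names
are illustrative:

```ocaml
(* Sketch: every lexer returns the scanned token together with the
   name (lexid) of the lexer that must scan the next token. *)
type lexid = Document | Document_type | Content
type token = Doctype | Text of string

(* toy version of the Document lexer: it recognizes the start of the
   DOCTYPE context and requests a lexer switch *)
let scan_document s =
  if s = "<!DOCTYPE" then (Doctype, Document_type)
  else (Text s, Document)
```

The driver simply keeps the returned lexid and uses it to choose the
scanning function for the next token, so lexical context switches never
need look-ahead.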
----------------------------------------------------------------------
The DTD object
----------------------------------------------------------------------
There is usually one object that collects DTD declarations. All kinds of
declarations are entered here:
- element and attribute list declarations
- entity declarations
- notation declarations
Some properties are validated directly after a declaration has been
added to the DTD, but most validation is done by a 'validate' method.
The result of 'validate' is stored such that another invocation is cheap.
A DTD becomes 'unchecked' again if another declaration is added.
TODO: We need a special DTD object that allows every content.
The DTD object is known to more or less every other object, i.e.
entities know the DTD, element declarations and instances know the
DTD, and so on.
TODO: We need a method that deletes all entity declarations once the DTD
is complete (to free memory).
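The "validate once, cache the result" behaviour can be sketched like
this. The real DTD object stores element, attribute list, entity, and
notation declarations; the names below are illustrative:

```ocaml
(* Sketch: the DTD caches the result of 'validate' and becomes
   'unchecked' again whenever a declaration is added. *)
class dtd = object
  val mutable elements : string list = []
  val mutable checked = false
  method add_element name =
    elements <- name :: elements;
    checked <- false              (* adding invalidates the cache *)
  method validate =
    if not checked then begin
      (* ... the expensive consistency checks would run here ... *)
      checked <- true
    end
  method is_checked = checked
end
```

With this scheme a second call to `validate` on an unchanged DTD is
essentially free.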
----------------------------------------------------------------------
Element and Document objects
----------------------------------------------------------------------
The 'element' objects form the tree of the element instances.
The 'document' object is a derivative of 'element' in which properties
of the whole document can be stored.
New element objects are NOT created by the "new class" mechanism, but
by an exemplar/instance scheme: a new instance is a duplicate of an
exemplar. This has the advantage that the user can provide their own
classes for the element instances. A hashtable contains the exemplar
for every element type (tag name), and there is a default exemplar.
The user can configure this hashtable such that objects of class
element_a are used for elements A, objects of class element_b for
elements B, and so on.
The object for the root element must already be created before parsing
starts, and the parser returns the (filled) root object. Because of
this, the user determines the *static* type of the object without
needing a coercion back to the subclass (downward coercions are not
possible in OCaml).
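The exemplar/instance scheme maps naturally onto OCaml's `Oo.copy`, as
in the following sketch (all class names are illustrative):

```ocaml
(* Sketch: new element objects are duplicates (Oo.copy) of exemplars
   stored in a hashtable, with a default exemplar for element types
   that have no specialized class. *)
class element = object
  val mutable name = ""
  method set_name n = name <- n
  method name = name
  method kind = "default"
end

class element_a = object
  inherit element
  method! kind = "a"              (* user-provided specialization *)
end

let exemplars : (string, element) Hashtbl.t = Hashtbl.create 17
let default = new element
let () = Hashtbl.add exemplars "a" (new element_a :> element)

let create_element tag =
  let ex = try Hashtbl.find exemplars tag with Not_found -> default in
  let e = Oo.copy ex in           (* duplicate the exemplar *)
  e#set_name tag;
  e
```

`Oo.copy` preserves the runtime class of the exemplar, so the user's
specialized methods survive the duplication while the exemplar itself
stays untouched.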
----------------------------------------------------------------------
Newline normalization
----------------------------------------------------------------------
The XML spec states that all of \n, \r, and \r\n must be recognized
as newline characters/character sequences. Notes:
- The replacement text of entities always contains the original text,
i.e. \r and \r\n are NOT converted to \n.
It is unclear if this is a violation of the standard or not.
- Content of elements: Newline characters are converted to \n.
- Attribute values: Newline characters are converted to spaces.
- Processing instructions: Newline characters are not converted.
It is unclear if this is a violation of the standard or not.
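The conversion applied to element content can be sketched as a small
helper; this is an illustrative function, not the actual implementation:

```ocaml
(* Sketch: normalize newlines in element content, i.e. convert both
   \r\n and bare \r to \n, leaving everything else untouched. *)
let normalize_newlines s =
  let buf = Buffer.create (String.length s) in
  let n = String.length s in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
         Buffer.add_char buf '\n';
         (* swallow the \n of a \r\n pair *)
         if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | c -> Buffer.add_char buf c);
    incr i
  done;
  Buffer.contents buf
```

For attribute values the same loop would emit a space instead of '\n',
per the rule above.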
----------------------------------------------------------------------
Empty entities
----------------------------------------------------------------------
Many entities are artificially surrounded by a Begin_entity/End_entity pair.
This is sometimes not done if the entity is empty:
- External parameter entities are parsed entities, i.e. they must match
the markupdecl* production. If they are not empty, the Begin_entity/End_entity
trick guarantees that they match markupdecl+, and that they are only
referred to at positions where markupdecl+ is allowed.
If they are empty, they are allowed everywhere just like internal
parameter entities. Because of this, the Begin_entity/End_entity pair
is dropped.
- This does not apply to parameter entities (either external or
internal) that are referred to in the internal subset, nor to
internal parameter entities, nor to general entities:
+ References in the internal subset are only allowed at positions where
markupdecl can occur, so Begin_entity/End_entity is added even if the
entity is empty.
+ References to internal parameter entities are allowed anywhere, so
Begin_entity/End_entity is never added.
+ References to general entities: An empty Begin_entity/End_entity pair
is recognized by the yacc parser, so special handling is not required.
Moreover, there is the situation that an empty entity is referred to
after the toplevel element:
<!DOCTYPE doc ...[
<!ENTITY empty "">
]>
<doc></doc>&empty;
- This is illegal, and the presence of an empty Begin_entity/End_entity pair
helps to recognize this.