<!-- -*-jfw-sgml-*- -->
<!doctype book public "-//JFW//DTD Book//EN" [
<!ENTITY sdc "<code/sdc/" >
<!ENTITY notation SYSTEM "notations.sgml" SUBDOC >
<!ENTITY bib SYSTEM "intro-bib.sgml" subdoc>
<!ENTITY % f.text SYSTEM "common.ent" > %f.text;
<!ENTITY loutcnv SYSTEM "../target/lout/preparse.scm" >
<!ENTITY % LaTeX "IGNORE" >
<![ %LaTeX [ <!ENTITY LaTeX SDATA "\LaTeX" > ]]>
<!ENTITY LaTeX "LaTeX" >
]>
<book
author="J&oe;rg Wittenberger"
inst="University of Technology Dresden"
date="26. 6. 1996, unfinished"
>&sdc; -- developers documentation
<intro>
This document describes <em/some/ internals of the &sdc;.
Unfortunately some code is inherited from older versions and some new
code still needs to be documented, which is a lot of work. I must assume
that only a few people will ever read this document, therefore I decided
not to put too much effort into it.
If you encounter problems or can't understand what I was
talking about, don't hesitate to mail me.
</intro>
<chapt>Overview
&bib;
First of all, after parsing the command line, &sdc; checks whether it can
derive the desired target format. It either uses the argument given to
the <code/-O/ switch or, if there is none, tries to guess the target
format from the extension of the output file. With this name it
looks for a <code>target/<var>name<code>/preparse.scm</code> in the
installation directory. If the file is readable, it is loaded as a
scheme file
<footnote>
Throughout this paper the term "loaded" refers to the
process of loading a file as scheme does. This means the file must
contain valid scheme source, which is evaluated in the interpreter.
</footnote>
and the target format is treated as valid. Loading this file
makes the settings that depend on the target format. There are some
hooks which can be set. Pre- and postprocessors may be hooked in (but
postprocessors may also be hooked in at a later stage).
<note>
In fact this <code/preparse.scm/ file may be empty; whether it is
readable is only used to decide whether a given name is a valid
(implemented) target format. This check is no longer strictly
necessary but is there for historical reasons. It might be reduced to
just checking for the directory <code>target/<var>name</var></code>.
</note>
Now &sdc; converts from SGML into the target format by first
invoking sgmls or nsgmls on the SGML source, keeping the <code/esis/
output in a temporary file. Before sgmls/nsgmls is started, the
environment variable DOCPATH is consulted. If it doesn't exist, the
current directory is assumed. This value is extended by the
library directories and passed as <code/SGML_SEARCH_PATH/ to nsgmls
and, with each element of the path extended by <code>/%S</code>, as
<code/SGML_PATH/ to sgmls.
<bf/A note about history and this manual/
The 0.7 version of &sdc; is a major reimplementation of the former
version.
At the moment (version 0.7) all old code is still left in, as the rtf
target is not yet reimplemented...
There is also still some code which comes from even older times.
About those I wrote ages ago:
&sdc; has developed from its predecessor, which in turn was similar
to the <em/format/ from the QWERTZ project. Its predecessor tried hard
to do its job using shell scripts, m4, sed and other things. But it
turned out that it is almost impossible to do everything &sdc; does
with these tools. Therefore I chose to reimplement it using a
language with excellent extension capabilities, scheme.
On the other hand I was new to scheme at that time -- &sdc; is
in fact my first real scheme program. Also there was some more or
less working code left. For these reasons some things are implemented
in a pretty straightforward way and will be changed some day. On the
other hand some things are done in a way that is hard to understand. This is
mostly because the code is left over from the shell script time.
I'll point it out if things are still to be changed.
<chapt>The conversion process
<code/preparse.scm/ can be used to make settings that depend on the target
format. By convention it adds an option to the sgml preprocessor
defining a system entity to identify the target format, like this:
<verb>
(set! sgml-opts (string-append "-i Lout " sgml-opts))
</verb>
Please note the naming convention: all system entities introduced
by &sdc; or its DTDs begin with a capital letter. The
namespace for user defined entities consists of the names starting with
lower case letters.
<note>
Also the <code/sgml-subdir/ is set here to point to the subdirectory
<code/sgml/.
This is because older versions didn't have it, and those are still the
default.
This will change some day, but it won't hurt to set it anyway.
The old version also used to set the variable
<code/doc-postprocess-hook/ here if necessary.
</note>
As the old way is still the default, the variable
<code/compile-function/ is set to the one defined within the file.
The conversions for the targets <code/ascii/ and <code/ps/ use this
file to redirect themselves to <code/lout/, adding the call to
lout as a postprocessing step.
<sect>SGML parsing
The value of the environment variable <code/DOCPATH/ is extended so
that the front end parser will also search below the
subdirectory <code/sgml-subdir/ of the installation directory for
system ids. It is passed to the parser as <code/SGML_PATH/ (for
sgmls) and <code/SGML_SEARCH_PATH/ (for nsgmls).
Directories given to &sdc; via the <code/-D/ option are prepended to
that value.
Directories mentioned in the environment variable
<code/TYPESETLIB/ or introduced by the command line switch <code/-L/
are searched as well.
These are appended to the path.
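<p>
The search order described above can be sketched as follows. This is
only an illustration and not the code of &sdc;: the procedure names
<code/join-path/, <code/parser-search-path/, <code/sgmls-path/ and
<code/nsgmls-path/ are made up, the directories are passed in as lists
of strings, and both the colon as separator and the position of the
installation's sgml directory within the path are assumptions.
<verb>
;; Illustration only; all names here are hypothetical.
;; Join a list of directory names with ":" (assumed separator).
(define (join-path dirs)
  (cond ((null? dirs) "")
        ((null? (cdr dirs)) (car dirs))
        (else (string-append (car dirs) ":" (join-path (cdr dirs))))))

;; -D directories first, then DOCPATH (or "." if it is unset), then
;; the sgml directory of the installation (position assumed), then the
;; TYPESETLIB and -L directories.
(define (parser-search-path d-dirs docpath-dirs sgml-dir lib-dirs)
  (append d-dirs
          (if (null? docpath-dirs) (list ".") docpath-dirs)
          (list sgml-dir)
          lib-dirs))

;; sgmls wants every element extended by "/%S"; nsgmls takes the
;; plain directories.
(define (sgmls-path dirs)
  (join-path (map (lambda (d) (string-append d "/%S")) dirs)))

(define (nsgmls-path dirs)
  (join-path dirs))

;; With made-up directories:
;; (sgmls-path (parser-search-path '("doc") '() "/usr/lib/sdc/sgml" '("/opt/lib")))
;;   => "doc/%S:./%S:/usr/lib/sdc/sgml/%S:/opt/lib/%S"
</verb>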
<quote>
There is a HACK to work around a bug in sgmls: the parsing of
SUBDOCs starts with an empty table of entities. Therefore a
temporary file holding those definitions explicitly is created and
included from the file <code>dtd/targets</code>. As we are forced to
use a well known name for it (to be able to include it), you can't
have more than one &sdc; run in the same directory at the same
time.
(This file is written into the cwd.)
For nsgmls this hack is removed at compile time (depending on a
makefile config setting).
</quote>
Then the input files are parsed by sgmls and converted into the
<code/esis/ format. The output of this process is read by &sdc; into
the list <code/sgmls-output/. This list consists of pairs where the
car is the command character and the cdr is the rest of the line. The
variable is not supposed to be touched until the end of compilation.
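<p>
To illustrate the shape of such a pair: sgmls writes one item per
line, and the first character of the line tells what kind of item it
is (for example an opening parenthesis for a start tag, a closing
parenthesis for an end tag, a minus sign for data and a question mark
for a processing instruction). The following sketch splits one line
accordingly. The name <code>esis-line->pair</code> is made up for this
example, and the line is assumed not to be empty.
<verb>
;; Illustration only; esis-line->pair is a hypothetical name.
;; Split one non-empty line of esis output into
;; (command-character . rest-of-line).
(define (esis-line->pair line)
  (cons (string-ref line 0)
        (substring line 1 (string-length line))))

;; (esis-line->pair "(SECT")      => (#\( . "SECT")
;; (esis-line->pair "-Some text") => (#\- . "Some text")
</verb>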
The executable for <code/sgmls/ is searched along
the <code/bin/ subdirectories of the library path
and is called <code/sdc-sgmls/.
This is to avoid confusion with other software
and
because I can't keep up to date
with the frequent changes made to sgmls.
<note>
Most systems suffer from having sgmls version 1.1 installed
under the name "sgmls".
As the 1.1.91 version behaves the same as 1.1 there is
<em/no/ need to keep the 1.1 version.
</note>
<sect id=Convertion>Converting the esis representation
<index id="compile-function"
>
Next a function <code/compile-function/ is called from &sdc;. It
receives three arguments, one of which is a list of the esis lines.
The format (contents) of this argument is going to change.
Its only purpose is to be fed to the function <code/token-stream/,
which converts it into a stream suitable for further processing.
The stream now consists of tokens of the form:
<desc>
<dt/<code>#(STARTTAG <var>GI AttributeList<code>)</dt>
This represents start tags.
The <var/GI/ is a symbol of the same name as the tag's name in the DTD.
The <var/AttributeList/ is a list of vectors of the form:
<code>#(<var>AttributeName
<meta>[ <code>TOKEN <meta>| <code>CDATA <meta>]
<var>AttributeValue
<code/)/
<dt/<code>#(ENDTAG <var>GI<code>)</dt>
End tags
<dt/<code/#(PI/ <var/text/)</dt>
Processing instructions
<dt/<code/#(DATA/ <var/text/)</dt>
Data.
</desc>
There are some more token types not mentioned here. These are used when dealing
with NDATA external entities.
There are some test and extract functions for the elements of these
tokens.
Always use those, as the internal representation might change for the
sake of speed.
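<p>
What such test and extract functions look like can be sketched for
the vector representation shown above. The names below are made up
for this example and need not match the functions that come with
&sdc;; the sketch also assumes that attribute names are symbols and
attribute values are strings.
<verb>
;; Illustration only; these names are hypothetical.
(define (token-type tok) (vector-ref tok 0))
(define (start-tag? tok) (eq? (token-type tok) 'STARTTAG))
(define (end-tag? tok) (eq? (token-type tok) 'ENDTAG))
(define (token-gi tok) (vector-ref tok 1))
(define (token-attributes tok) (vector-ref tok 2))

;; Look up an attribute value by name; #f if the attribute is missing.
(define (attribute-value tok name)
  (let loop ((attrs (token-attributes tok)))
    (cond ((null? attrs) #f)
          ((eq? (vector-ref (car attrs) 0) name)
           (vector-ref (car attrs) 2))
          (else (loop (cdr attrs))))))

(define example-token
  (vector 'STARTTAG 'SECT (list (vector 'ID 'TOKEN "intro"))))

;; (start-tag? example-token)          => #t
;; (token-gi example-token)            => SECT
;; (attribute-value example-token 'ID) => "intro"
</verb>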
<sect>"Normalizing"
Most targets (in fact currently all but the literate one) pass the
stream of tokens through a "normalizer" pass.
This pass adds nothing for the formatting, but it adds information of
common value. For instance, all divisions get a hierarchical
name. (That is, if you used &lt;division>'s to partition your document,
you end up here with &lt;sect>'s again.)
The following things are done:
<list>
<o>Remove "side effects" of the use of a mixed content model and short
references.
That is, <code/DATA/ tokens are removed where they appear in places where no data
would be allowed if we used an element-only model.
Also <code/&lt;p>/ tags which the short reference mechanism inserts
for empty lines are removed at those positions.
<o>Change to the use of hierarchical names for all divisions.
<o>Add a <code/NO/ attribute to both the division's token and its
head token.
<o>Add a <code/NO/ attribute with a running number to list items, and a
<code/LEVEL/ attribute with the nesting level as well.
<o>Add a <code/NO/ attribute to figures.
<o>Add a <code/NO/ attribute to the A/Q pairs of FAQs.
See the comments in <code>include/faq.scm</code> for future plans
about changing their form.
<o>Convert <code/&lt;inline>/ elements to look like external
<code/NDATA/ entities and process both.
<o>Convert <code/SUBDOC/ entities to appear as a <code/&lt;division>/
element of the outer document.
<o>If requested by load options, change some document types (e.g.,
manpage) to look like a simple document if the target doesn't
implement a special formatting for them.
<o>If requested by load options and the <code/FACE/ attribute, add
extra sections for the index and the bibliography.
</list>
<sect>Main processing
Next a similar process is launched to convert all the tokens into
appropriate commands for the target format.
This is somewhat similar between the various formats.
In fact most formats do not care about the document's structure anymore
at this stage.
This is because a) the structure is correct (plus or minus a bug) and b)
it should simplify the program.
In fact this slows down the execution, as large lists have to be
searched for the proper action.
Only a few tags per target format need special handling.
Those are still handled in the way one can learn from the
<code>include/normal.scm</code> file.
All the others are defined by a replacement mechanism.
The entries of these lists look like:
<verb>
(SPLIT #f #"\n")
</verb>
The first element of each entry is the name of the GI.
The second is the value substituted for the open tag.
If <code/#f/ is given, the open tag is removed.
The third is the value substituted for the closing tag.
The substituted value can be a list. In that case all listed
values are inserted.
If the value (or an element of the list) is a symbol, the value of the
attribute with the same name is inserted.
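<p>
The rules above can be sketched in a few lines. This is only an
illustration under some assumptions and not the code used by the
targets: the names <code/example-table/, <code/expand/,
<code/replace-open/ and <code/replace-close/ are made up, the entries
for <code/EM/ and <code/REF/ are invented, attributes are passed as
an association list of strings, and a missing entry or attribute
simply yields the empty string.
<verb>
;; Illustration only; all names and table entries are hypothetical.
;; Each entry: (GI open-replacement close-replacement).
(define example-table
  (list (list 'SPLIT #f (string #\newline))
        (list 'EM "*" "*")
        (list 'REF (list "see " 'ID) "")))

;; attrs is an association list from attribute names to strings.
(define (expand value attrs)
  (cond ((not value) "")                  ; #f: the tag is removed
        ((string? value) value)           ; string: inserted as it is
        ((symbol? value)                  ; symbol: value of that attribute
         (let ((a (assq value attrs)))
           (if a (cdr a) "")))
        ((pair? value)                    ; list: every element in turn
         (apply string-append
                (map (lambda (v) (expand v attrs)) value)))
        (else "")))

(define (replace-open gi attrs)
  (let ((entry (assq gi example-table)))
    (if entry (expand (cadr entry) attrs) "")))

(define (replace-close gi attrs)
  (let ((entry (assq gi example-table)))
    (if entry (expand (caddr entry) attrs) "")))

;; (replace-open 'REF '((ID . "intro"))) => "see intro"
;; (replace-open 'SPLIT '())             => ""
</verb>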
As said, most conversions are quite similar.
Only the info target is somewhat different,
as it needs to know the
tree structure of the document
all the time.
If a target needs to know <em/a lot/ about
this tree structure,
have a look at the <code>info/preparse.scm</code> file.
The code:
<code>(hook 'rdpl 'add (lambda f (list rdpl-accu)))</code>
configures the <code>include/normal.scm</code>
to parse the whole document
into just one token,
the tree structure of the document.
Unfortunately the rest of the processing
(particularly the part below the info-old-tbl stuff)
is not too clean
and might be a little hard to read.
For all the <em/other/ targets
it's a good idea
to read through the code of
<code>target/lout/preparse.scm</code>.
I'm going to comment this code some more.
<sect>Post processing
<p>
Some target formats invoke a formatting tool after the conversion.
Others, like info, have one implemented within their own code body.
&sdc; has no idea about the use of temporary files
by post processing tools.
This might be added at some time.
(E. g., a notation processing step <ref id=notation// might have
produced a temporary file
to be used by the post processing step.)
Therefore it sometimes leaves temp files around.
Also some error conditions are only caught as Scheme runtime errors.
The handling of those doesn't include deleting temp files.
To make sure those are found by the user
they are always left in the current directory.
Besides this practical reason,
having temporary files in some (pseudo) random directory
would be fatal if the user
was using &sdc; to handle private documents.
This could be solved by changing the file permissions,
but see the source to understand why this would
add to the runtime in an undesirable way.
I'll fix this some day.
</chapt>
<InclDiv>&notation;
<chapt>Character handling
<index id="Files" sub="chrproc.scm" <index id="Files" sub="procchar.c"
<index id="process-char"
<index id="Character handling"
>
Character handling is the treatment applied to plain text
(<code/#PCDATA/) in the DTD and to the attributes of tags.
This treatment ensures that only the special characters of SGML
have a special meaning in the source. All characters special to the
backend have to be escaped according to the rules of the backend.
A second job of this handling is the support of <code/SDATA/ declared
entities. <code/Sgmls/ will put them into text lines enclosed by
"<code/\|/"-sequences. Those <code/SDATA/ entities contain stuff which
might have backend specials (e.g., &amp;alpha; is declared for &LaTeX;
as <code/&lt;!ENTITY alpha SDATA "$\alpha$">/). For those entities the
escaping has to be turned off.
Third, <code/sgmls/ assumes systems which are not 8-bit
clean. Therefore it converts all characters with the eighth bit set
into octal sequences. These are converted back here.
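<p>
The third job can be sketched like this. The sketch is not the code
from <code/chrproc.scm/ or <code/procchar.c/; the name
<code/unescape-octal/ is made up, an octal sequence is assumed to be
a backslash followed by exactly three octal digits, and a doubled
backslash is copied unchanged so that it is not mistaken for the
start of such a sequence.
<verb>
;; Illustration only; unescape-octal is a hypothetical name.
(define (octal-digit? c)
  (memv c '(#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7)))

;; Replace every backslash followed by three octal digits by the
;; character with that code; everything else is copied unchanged.
(define (unescape-octal s)
  (let loop ((i 0) (acc '()))
    (cond ((= i (string-length s))
           (list->string (reverse acc)))
          ;; a doubled backslash stays as it is
          ((and (char=? (string-ref s i) #\\)
                (>= (- (string-length s) i) 2)
                (char=? (string-ref s (+ i 1)) #\\))
           (loop (+ i 2) (cons #\\ (cons #\\ acc))))
          ;; backslash plus three octal digits: one character
          ((and (char=? (string-ref s i) #\\)
                (>= (- (string-length s) i) 4)
                (octal-digit? (string-ref s (+ i 1)))
                (octal-digit? (string-ref s (+ i 2)))
                (octal-digit? (string-ref s (+ i 3))))
           (loop (+ i 4)
                 (cons (integer->char
                        (string->number (substring s (+ i 1) (+ i 4)) 8))
                       acc)))
          (else
           (loop (+ i 1) (cons (string-ref s i) acc))))))

;; Octal 101 is the code of the letter A:
;; (unescape-octal "a\\101b") => "aAb"
</verb>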
<chapt>Hooks
There are some hooks defined.
Most hooks are reached via the dynamic hook management. To run a hook
type:
<quote/<code>(hook 'run '<var>Name Arguments<code/)//
In general the <code/hook/ function takes a symbol, the command to
perform, a second symbol, the name of the (dynamic) hook, and a rest
argument passed to the command.
Commands available:
<desc>
<dt/run/ The first function is applied to the rest argument. Then the next
function is applied to the return value of the first and so on. The
last return value is returned.
Hooks run without being defined just return their arguments, i.e.,
they run the identity function. No error is produced.
<dt/add/ add the function in the rest argument in front of the hook
<dt/append/ append it
<dt/doc/ return the documentation string, if available
<dt/set-doc!/ set the documentation string
</desc>
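<p>
The behaviour of <code/run/, <code/add/ and <code/append/ can be
imitated with a few lines of scheme. The following <code/toy-hook/ is
a made-up stand-in and not the implementation used by &sdc;. It
assumes that every hooked function returns a list, which is then used
as the argument list of the next function.
<verb>
;; Illustration only; toy-hook is not the hook function of sdc.
(define toy-hooks '())          ; association list: name -> functions

(define (toy-hook-functions name)
  (let ((entry (assq name toy-hooks)))
    (if entry (cdr entry) '())))

;; A new entry in front shadows any older entry of the same name.
(define (toy-hook-set! name functions)
  (set! toy-hooks (cons (cons name functions) toy-hooks)))

(define (toy-hook command name . rest)
  (case command
    ((run)                      ; chain the functions
     (let loop ((fs (toy-hook-functions name)) (args rest))
       (if (null? fs)
           args                 ; undefined hook: arguments unchanged
           (loop (cdr fs) (apply (car fs) args)))))
    ((add)                      ; put the function in front
     (toy-hook-set! name (cons (car rest) (toy-hook-functions name))))
    ((append)                   ; put the function at the end
     (toy-hook-set! name
                    (append (toy-hook-functions name) (list (car rest)))))
    (else #f)))

(toy-hook 'add 'demo (lambda (x) (list (+ x 1))))
(toy-hook 'append 'demo (lambda (x) (list (* x 2))))
;; (toy-hook 'run 'demo 3)        => (8)
;; (toy-hook 'run 'undefined 1 2) => (1 2)
</verb>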
At the moment there are probably too few hooks defined.
Hooks in &sdc;:
<desc>
<dt/face/ The <code/face/ attribute of the top level element gets
fed through it prior to any use.
<dt/rdpl/ From <code>include/normal.scm</code>. Merely for internal
use. Wrapped around elements wherever either the (normal) stream or a
tree representation is to be returned from the parser.
<dt/p-body*-frame-body*/ From <code>include/normal.scm</code>.
Defines which style (%Body)* within a virtual &lt;!element body o o
(%Body)* > is used.
<dt/p-body*/ Defines which virtual contents definition to use for
entity %Body.
<dt/normalize-raw, normalize-cooked/ Called with the raw/cooked stream
fed to/delivered by normalize. Supposed to include debugging
statements. See <code>rc/watch</code> for how it can be set.
<dt/external-messages/ Called with some of the tokens with external
impact. By default errors are displayed.
<dt/doc-preprocess-hook, doc-postprocess-hook/ Old, not dynamic hooks
invoked without any arguments before and after processing the
document. Used similarly to the dynamic hooks, like
<code/(doc-process-hook 'add (lambda () ...))/. Better forget about
them, they might disappear.
</desc>
<chapt>Toolbox
Under the toolbox topic, procedures and concepts are discussed
which have proven to be widely useful for implementing converters, or
which implement simple things that might come in handy some time.
Things which are found in the <code/include/ directory are either there
for easy and centralized configuration or are candidates for
generalization and implementation in the compiled part.
<sect>rdp.scm
An implementation of functions to
implement recursive descent parsers
in a descriptive way.
It is based on streams.
The way it's implemented,
it shouldn't be too hard to create a parser
by analysing a DTD.
That's the way future versions are supposed to overcome the need for
sgmls.
<sect>Stream.scm
An implementation of streams. It's close to <ref t=B id="SICP"//.
<sect>Strings.scm
<sect>Control.scm
<sect1>Indexing
<sect1>Memoization for arbitrary functions
<code/(memoize function . ac)/
The function must take as many arguments as there are elements in the
<code/ac/ list.
<code/ac/ is a list of comparators.
Each gets invoked on an argument of the current call
and the corresponding argument of a memoized call.
If it returns <code/#t/,
the arguments are considered to be equal.
If all arguments are found to be equal,
the memoized result is used;
otherwise it's computed and remembered.
Example:
<verb>
(define m-func
  (memoize
   (lambda (a1 a2) ; the function of some (2) args
     ...)
   eq? equal?)) ; an equivalence predicate for each arg
</verb>
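<p>
How such a procedure can work is sketched below.
<code/memoize-sketch/ is a made-up stand-in written for this
explanation and not the <code/memoize/ that comes with &sdc;. It
keeps the remembered calls in a simple list and assumes that every
call supplies as many arguments as there are comparators.
<verb>
;; Illustration only; memoize-sketch and all-equal? are hypothetical.
;; Compare two argument lists, one comparator per argument.
(define (all-equal? comparators args1 args2)
  (cond ((null? comparators) #t)
        (((car comparators) (car args1) (car args2))
         (all-equal? (cdr comparators) (cdr args1) (cdr args2)))
        (else #f)))

(define (memoize-sketch function . ac)
  (let ((cache '()))            ; list of (arguments . result)
    (lambda args
      (let loop ((entries cache))
        (cond ((null? entries)  ; not seen before: compute and remember
               (let ((result (apply function args)))
                 (set! cache (cons (cons args result) cache))
                 result))
              ((all-equal? ac args (car (car entries)))
               (cdr (car entries)))
              (else (loop (cdr entries))))))))

;; The counter shows that the second call is answered from the cache.
(define calls 0)
(define m-pair
  (memoize-sketch (lambda (a b) (set! calls (+ calls 1)) (list a b))
                  eq? equal?))
(m-pair 'key "text")
(m-pair 'key "text")
;; calls => 1
</verb>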
<sect1>Lists
There are some procedures which operate on lists and are especially useful with
&sdc;.
<desc>
<dt/<code/append-to//<dd>
<dt/<code/list-flat-once//<dd>
<dt/<code/(rm1 obj list)//<dd>
<dt/<code/(remove-all objs list)//<dd>
</desc>
<sect1>Queues and Stacks