<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>3. 
The Document Processing Model
</title><meta name="generator" content="DocBook XSL Stylesheets V1.69.1"><link rel="start" href="index.html" title="
The xmlformat XML Document Formatter
"><link rel="up" href="index.html" title="
The xmlformat XML Document Formatter
"><link rel="prev" href="how-to-use.html" title="2. 
How to Use xmlformat
"><link rel="next" href="using-config-files.html" title="4. 
Using Configuration Files
"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">3. 
The Document Processing Model
</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="how-to-use.html">Prev</a> </td><th width="60%" align="center"> </th><td width="20%" align="right"> <a accesskey="n" href="using-config-files.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="doc-processing-model"></a>3. 
The Document Processing Model
</h2></div></div></div><div class="toc"><dl><dt><span class="sect2"><a href="doc-processing-model.html#document-components">3.1. 
Document Components
</a></span></dt><dt><span class="sect2"><a href="doc-processing-model.html#line-breaks-and-indentation">3.2. 
Line Breaks and Indentation
</a></span></dt><dt><span class="sect2"><a href="doc-processing-model.html#text-handling">3.3. 
Text Handling
</a></span></dt></dl></div><p>
XML documents consist primarily of elements arranged in nested fashion.
Elements may also contain text. <span><strong class="command">xmlformat</strong></span> acts to
rearrange elements by removing or adding line breaks and indentation,
and to reformat text.
</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="document-components"></a>3.1. 
Document Components
</h3></div></div></div><p>
XML elements within input documents may be of three types:
</p><div class="itemizedlist"><ul type="disc"><li><p>
block elements
</p><p>
This is the default element type. The DocBook
<code class="literal">&lt;chapter&gt;</code>, <code class="literal">&lt;sect1&gt;</code>,
and <code class="literal">&lt;para&gt;</code> elements are examples of block
elements.
</p><p>
Typically a block element will begin a new line. (That is the default
formatting behavior, although <span><strong class="command">xmlformat</strong></span> allows you to
override it.)
</p><p>
Spacing between sub-elements can be controlled, and sub-elements can be
indented. Whitespace in block element text may be normalized. If
normalization is in effect, line-wrapping may be applied as well.
Normalization and line-wrapping may be appropriate for a block element
with mixed content (such as <code class="literal">&lt;para&gt;</code>).
</p></li><li><p>
inline elements
</p><p>
These are elements that are contained within a block or within other
inlines. The DocBook <code class="literal">&lt;emphasis&gt;</code> and
<code class="literal">&lt;literal&gt;</code> elements are examples of inline
elements.
</p><p>
Normalization and line-wrapping of inline element tags and content are
handled the same way as for the enclosing block element. In essence, an
inline element is treated as part of its parent's "text" content.
</p></li><li><p>
verbatim elements
</p><p>
No formatting is done for verbatim elements. The DocBook
<code class="literal">&lt;programlisting&gt;</code> and
<code class="literal">&lt;screen&gt;</code> elements are examples of verbatim
elements.
</p><p>
Verbatim element content is written out exactly as it appears in the
input document. This also applies to child elements. Any formatting that
would otherwise be performed on them is suppressed when they occur
within a verbatim element.
</p></li></ul></div><p>
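As a minimal sketch, the three types could be declared in a configuration
file as shown here, using DocBook elements named above. The choice of
elements is only illustrative, and because <code class="literal">block</code>
is the default type, the first section is redundant and appears just for
completeness:
</p><pre class="screen">
para
  format          block

emphasis
  format          inline

programlisting
  format          verbatim
</pre><p>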
<span><strong class="command">xmlformat</strong></span> never reformats element tags. In
particular, it does not change whitespace between attributes or within
attribute values. This is true even for inline tags within line-wrapped
block elements.
</p><p>
<span><strong class="command">xmlformat</strong></span> handles empty elements as follows:
</p><div class="itemizedlist"><ul type="disc"><li><p>
If an element appears as <code class="literal">&lt;abc/&gt;</code> in the input
document, it is written as <code class="literal">&lt;abc/&gt;</code>.
</p></li><li><p>
If an element appears as <code class="literal">&lt;abc&gt;&lt;/abc&gt;</code>, it
is written as <code class="literal">&lt;abc&gt;&lt;/abc&gt;</code>. No line break
is placed between the two tags.
</p></li></ul></div><p>
XML documents may contain other constructs besides elements and text:
</p><div class="itemizedlist"><ul type="disc"><li><p>
Processing instructions
</p></li><li><p>
Comments
</p></li><li><p>
<code class="literal">DOCTYPE</code> declaration
</p></li><li><p>
<code class="literal">CDATA</code> sections
</p></li></ul></div><p>
<span><strong class="command">xmlformat</strong></span> handles these constructs much the same way
as verbatim elements. It does not reformat them.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="line-breaks-and-indentation"></a>3.2. 
Line Breaks and Indentation
</h3></div></div></div><p>
Line breaks within block elements are controlled by the
<code class="literal">entry-break</code>, <code class="literal">element-break</code>, and
<code class="literal">exit-break</code> formatting options. A break value of
<em class="replaceable"><code>n</code></em> means <em class="replaceable"><code>n</code></em>
newlines. (This produces <em class="replaceable"><code>n</code></em>-1 blank lines.)
</p><p>
Example. Suppose input text looks like this:
</p><pre class="screen">
&lt;elt&gt;
&lt;subelt/&gt; &lt;subelt/&gt; &lt;subelt/&gt;
&lt;/elt&gt;
</pre><p>
Here, an <code class="literal">&lt;elt&gt;</code> element contains three nested
<code class="literal">&lt;subelt&gt;</code> elements, which for simplicity are
empty.
</p><p>
This input can be formatted several ways, depending on the configuration
options. The following examples show how to do this.
</p><div class="orderedlist"><ol type="1"><li><p>
To produce output with all sub-elements on the same line as the
<code class="literal">&lt;elt&gt;</code> element, add a section to the
configuration file that defines <code class="literal">&lt;elt&gt;</code> as a
block element and sets all its break values to 0:
</p><pre class="screen">
elt
  format          block
  entry-break     0
  exit-break      0
  element-break   0
</pre><p>
Result:
</p><pre class="screen">
&lt;elt&gt;&lt;subelt/&gt;&lt;subelt/&gt;&lt;subelt/&gt;&lt;/elt&gt;
</pre></li><li><p>
To leave the sub-elements together on the same line, but on a separate
line between the <code class="literal">&lt;elt&gt;</code> tags, leave the
<code class="literal">element-break</code> value set to 0, but set the
<code class="literal">entry-break</code> and <code class="literal">exit-break</code> values
to 1. To suppress sub-element indentation, set
<code class="literal">subindent</code> to 0.
</p><pre class="screen">
elt
  format          block
  entry-break     1
  exit-break      1
  element-break   0
  subindent       0
</pre><p>
Result:
</p><pre class="screen">
&lt;elt&gt;
&lt;subelt/&gt;&lt;subelt/&gt;&lt;subelt/&gt;
&lt;/elt&gt;
</pre></li><li><p>
To indent the sub-elements, make the <code class="literal">subindent</code> value
greater than zero.
</p><pre class="screen">
elt
  format          block
  entry-break     1
  exit-break      1
  element-break   0
  subindent       2
</pre><p>
Result:
</p><pre class="screen">
&lt;elt&gt;
  &lt;subelt/&gt;&lt;subelt/&gt;&lt;subelt/&gt;
&lt;/elt&gt;
</pre></li><li><p>
To cause each sub-element to begin a new line, change the
<code class="literal">element-break</code> value to 1.
</p><pre class="screen">
elt
  format          block
  entry-break     1
  exit-break      1
  element-break   1
  subindent       2
</pre><p>
Result:
</p><pre class="screen">
&lt;elt&gt;
  &lt;subelt/&gt;
  &lt;subelt/&gt;
  &lt;subelt/&gt;
&lt;/elt&gt;
</pre></li><li><p>
To add a blank line between sub-elements, increase the
<code class="literal">element-break</code> from 1 to 2.
</p><pre class="screen">
elt
  format          block
  entry-break     1
  exit-break      1
  element-break   2
  subindent       2
</pre><p>
Result:
</p><pre class="screen">
&lt;elt&gt;
  &lt;subelt/&gt;

  &lt;subelt/&gt;

  &lt;subelt/&gt;
&lt;/elt&gt;
</pre></li><li><p>
To also produce a blank line after the <code class="literal">&lt;elt&gt;</code>
opening tag and before the closing tag, increase the
<code class="literal">entry-break</code> and <code class="literal">exit-break</code> values
from 1 to 2.
</p><pre class="screen">
elt
  format          block
  entry-break     2
  exit-break      2
  element-break   2
  subindent       2
</pre><p>
Result:
</p><pre class="screen">
&lt;elt&gt;

  &lt;subelt/&gt;

  &lt;subelt/&gt;

  &lt;subelt/&gt;

&lt;/elt&gt;
</pre></li><li><p>
To have blank lines only after the opening tag and before the closing
tag, but not have blank lines between the sub-elements, decrease the
<code class="literal">element-break</code> from 2 to 1.
</p><pre class="screen">
elt
  format          block
  entry-break     2
  exit-break      2
  element-break   1
  subindent       2
</pre><p>
Result:
</p><pre class="screen">
&lt;elt&gt;

  &lt;subelt/&gt;
  &lt;subelt/&gt;
  &lt;subelt/&gt;

&lt;/elt&gt;
</pre></li></ol></div><p>
Breaks within block elements are suppressed in certain cases:
</p><div class="itemizedlist"><ul type="disc"><li><p>
Breaks apply to nested block or verbatim elements, but not to inline
elements, which are, after all, inline. (If you really want an inline to
begin a new line, define it as a block element.)
</p></li><li><p>
Breaks are not applied to text within non-normalized blocks.
Non-normalized text should not be changed, and adding line breaks
changes the text.
</p><p>
For example, if <code class="literal">&lt;x&gt;</code> elements are normalized, you
might elect to format this:
</p><pre class="screen">
&lt;x&gt;This is a sentence.&lt;/x&gt;
</pre><p>
Like this:
</p><pre class="screen">
&lt;x&gt;
This is a sentence.
&lt;/x&gt;
</pre><p>
Here, breaks are added before and after the text to place it on a
separate line. But if <code class="literal">&lt;x&gt;</code> is not normalized,
the text content will be written as it appears in the input, to avoid
changing it.
</p></li></ul></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="text-handling"></a>3.3. 
Text Handling
</h3></div></div></div><p>
The XML standard considers whitespace nodes insignificant in elements
that contain only other elements. In other words, for elements that have
element content, sub-elements may optionally be separated by whitespace,
but that whitespace is insignificant and may be ignored.
</p><p>
An element that has mixed content may have text
(<code class="literal">#PCDATA</code>) content, optionally interspersed with
sub-elements. In this case, whitespace-only nodes may be significant.
</p><p>
<span><strong class="command">xmlformat</strong></span> treats only literal whitespace as
whitespace. This includes the space, tab, newline (linefeed), and
carriage return characters. <span><strong class="command">xmlformat</strong></span> does not
resolve entity references, so entities such as
<code class="literal">&amp;#32;</code> or <code class="literal">&amp;#x20;</code> that
represent whitespace characters are seen as non-whitespace text, not as
whitespace.
</p><p>
<span><strong class="command">xmlformat</strong></span> doesn't know whether a block element has
element content or mixed content. It handles text content as follows:
</p><div class="itemizedlist"><ul type="disc"><li><p>
If an element has element content, it will have only sub-elements and
possibly all-whitespace text nodes. In this case, it is assumed that
you'll want to control line-break behavior between sub-elements, so that
the (all-whitespace) text nodes can be discarded and replaced with the
proper number of newlines, and possibly indentation.
</p></li><li><p>
If an element has mixed content, you may want to leave text nodes alone,
or you may want to normalize (and possibly line-wrap) them. In
<span><strong class="command">xmlformat</strong></span>, normalization converts runs of whitespace
characters to single spaces, and discards leading and trailing
whitespace.
</p></li></ul></div><p>
To achieve this kind of formatting, <span><strong class="command">xmlformat</strong></span>
recognizes <code class="literal">normalize</code> and
<code class="literal">wrap-length</code> configuration options for block elements.
They affect text formatting as follows:
</p><div class="itemizedlist"><ul type="disc"><li><p>
You can enable or disable text normalization by setting the
<code class="literal">normalize</code> option to <code class="literal">yes</code> or
<code class="literal">no</code>.
</p></li><li><p>
Within a normalized block, runs of whitespace are converted to single
spaces. Leading and trailing whitespace is discarded. Line-wrapping and
indenting may be applied.
</p></li><li><p>
In a non-normalized block, text nodes are not changed as long as they
contain any non-whitespace characters. No line-wrapping or indenting is
applied. However, if a text node contains only whitespace (for example,
a space or newline between sub-elements), it is assumed to be
insignificant and is discarded. It may be replaced by line breaks and
indentation when output formatting occurs.
</p></li></ul></div><p>
Consider the following input:
</p><pre class="screen">
&lt;row&gt; &lt;cell&gt; A &lt;/cell&gt; &lt;cell&gt; B &lt;/cell&gt; &lt;/row&gt;
</pre><p>
Suppose that the <code class="literal">&lt;row&gt;</code> and
<code class="literal">&lt;cell&gt;</code> elements both are to be treated as
non-normalized. The contents of the <code class="literal">&lt;cell&gt;</code>
elements are text nodes that contain non-whitespace characters, so they
would not be reformatted. On the other hand, the spaces between tags are
all-whitespace text nodes and are not significant. This means that you
could reformat the input like this:
</p><pre class="screen">
&lt;row&gt;&lt;cell&gt; A &lt;/cell&gt;&lt;cell&gt; B &lt;/cell&gt;&lt;/row&gt;
</pre><p>
Or like this:
</p><pre class="screen">
&lt;row&gt;
&lt;cell&gt; A &lt;/cell&gt;&lt;cell&gt; B &lt;/cell&gt;
&lt;/row&gt;
</pre><p>
Or like this:
</p><pre class="screen">
&lt;row&gt;
  &lt;cell&gt; A &lt;/cell&gt;
  &lt;cell&gt; B &lt;/cell&gt;
&lt;/row&gt;
</pre><p>
In each of those cases, the whitespace between tags was subject to
reformatting, but the text content of the
<code class="literal">&lt;cell&gt;</code> elements was not.
</p><p>
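As an illustration, the third of those forms might be obtained with
settings along the following lines. This is a sketch rather than a tested
configuration: the <code class="literal">subindent</code> value of 2 is
chosen only to match the indentation shown, and the break values are
spelled out even though they may coincide with the defaults.
</p><pre class="screen">
row
  format          block
  normalize       no
  entry-break     1
  element-break   1
  exit-break      1
  subindent       2

cell
  format          block
  normalize       no
</pre><p>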
The input would <span class="emphasis"><em>not</em></span> be formatted like this:
</p><pre class="screen">
&lt;row&gt;&lt;cell&gt;A&lt;/cell&gt;&lt;cell&gt;B&lt;/cell&gt;&lt;/row&gt;
</pre><p>
Or like this:
</p><pre class="screen">
&lt;row&gt;
  &lt;cell&gt;
    A
  &lt;/cell&gt;
  &lt;cell&gt;
    B
  &lt;/cell&gt;
&lt;/row&gt;
</pre><p>
In both of those cases, the text content of the
<code class="literal">&lt;cell&gt;</code> elements has been modified, which is not
allowed within non-normalized blocks. You would have to declare
<code class="literal">&lt;cell&gt;</code> to have a <code class="literal">normalize</code>
value of <code class="literal">yes</code> to achieve either of those output
styles.
</p><p>
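A hedged sketch of such a declaration follows. With normalization enabled
and the entry and exit breaks set to 0, the first of the two styles
(<code class="literal">&lt;cell&gt;A&lt;/cell&gt;</code>) would be the
expected result. The break values here are assumptions chosen for that
style, not settings taken from a tested configuration.
</p><pre class="screen">
cell
  format          block
  normalize       yes
  entry-break     0
  exit-break      0
</pre><p>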
Now consider the following input:
</p><pre class="screen">
&lt;para&gt; This is a        sentence. &lt;/para&gt;
</pre><p>
Suppose that <code class="literal">&lt;para&gt;</code> is to be treated as a
normalized element. It could be reformatted like this:
</p><pre class="screen">
&lt;para&gt;This is a sentence.&lt;/para&gt;
</pre><p>
Or like this:
</p><pre class="screen">
&lt;para&gt;
This is a sentence.
&lt;/para&gt;
</pre><p>
Or like this:
</p><pre class="screen">
&lt;para&gt;
  This is a sentence.
&lt;/para&gt;
</pre><p>
Or even (with line-wrapping) like this:
</p><pre class="screen">
&lt;para&gt;
  This is a
  sentence.
&lt;/para&gt;
</pre><p>
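The last of those forms, for example, might result from a configuration
like the one below. It is a sketch under assumptions: the
<code class="literal">wrap-length</code> value of 12 is picked only because
it is too short to hold the whole sentence on one line, and the exact
wrapping depends on how <span><strong class="command">xmlformat</strong></span>
accounts for indentation.
</p><pre class="screen">
para
  format          block
  normalize       yes
  wrap-length     12
  entry-break     1
  exit-break      1
  subindent       2
</pre><p>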
The preceding description of normalization is a bit oversimplified.
Normalization is complicated by the possibility that non-normalized
elements may occur as sub-elements of a normalized block. In the
following example, a verbatim block occurs in the middle of a normalized
block:
</p><pre class="screen">
&lt;para&gt;This is a paragraph that contains
&lt;programlisting&gt;
a code listing
&lt;/programlisting&gt;
in the middle.
&lt;/para&gt;
</pre><p>
In general, when this occurs, any whitespace in text nodes adjacent to
non-reformatted nodes is discarded.
</p><p>
There is no "preserve all whitespace as is" mode for block elements.
Even if normalization is disabled for a block, any all-whitespace text
nodes are considered dispensable. If you really want all text within an
element to be preserved intact, you should declare it as a verbatim
element. (Within verbatim elements, nothing is ever reformatted, so
whitespace is significant.)
</p><p>
If you want to see how <span><strong class="command">xmlformat</strong></span> handles whitespace
nodes and text normalization, invoke it with the
<code class="option">--canonized-output</code> option. This option causes
<span><strong class="command">xmlformat</strong></span> to display the document after it has been
canonized by removing whitespace nodes and performing text
normalization, but before it has been reformatted in final form. By
examining the canonized document, you can see what effect your
configuration options have on treatment of the document before
line-wrapping and indentation are performed and line breaks are added.
</p></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="how-to-use.html">Prev</a> </td><td width="20%" align="center"> </td><td width="40%" align="right"> <a accesskey="n" href="using-config-files.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">2. 
How to Use <span><strong class="command">xmlformat</strong></span>
 </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 4. 
Using Configuration Files
</td></tr></table></div></body></html>