<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>3.
The Document Processing Model
</title><meta name="generator" content="DocBook XSL Stylesheets V1.69.1"><link rel="start" href="index.html" title="
The xmlformat XML Document Formatter
"><link rel="up" href="index.html" title="
The xmlformat XML Document Formatter
"><link rel="prev" href="how-to-use.html" title="2.
How to Use xmlformat
"><link rel="next" href="using-config-files.html" title="4.
Using Configuration Files
"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">3.
The Document Processing Model
</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="how-to-use.html">Prev</a> </td><th width="60%" align="center"> </th><td width="20%" align="right"> <a accesskey="n" href="using-config-files.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="doc-processing-model"></a>3.
The Document Processing Model
</h2></div></div></div><div class="toc"><dl><dt><span class="sect2"><a href="doc-processing-model.html#document-components">3.1.
Document Components
</a></span></dt><dt><span class="sect2"><a href="doc-processing-model.html#line-breaks-and-indentation">3.2.
Line Breaks and Indentation
</a></span></dt><dt><span class="sect2"><a href="doc-processing-model.html#text-handling">3.3.
Text Handling
</a></span></dt></dl></div><p>
XML documents consist primarily of elements arranged in nested fashion.
Elements may also contain text. <span><strong class="command">xmlformat</strong></span> acts to
rearrange elements by removing or adding line breaks and indentation,
and to reformat text.
</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="document-components"></a>3.1.
Document Components
</h3></div></div></div><p>
XML elements within input documents may be of three types:
</p><div class="itemizedlist"><ul type="disc"><li><p>
block elements
</p><p>
This is the default element type. The DocBook
<code class="literal"><chapter></code>, <code class="literal"><sect1></code>,
and <code class="literal"><para></code> elements are examples of block
elements.
</p><p>
Typically a block element will begin a new line. (That is the default
formatting behavior, although <span><strong class="command">xmlformat</strong></span> allows you to
override it.)
</p><p>
Spacing between sub-elements can be controlled, and sub-elements can be
indented. Whitespace in block element text may be normalized. If
normalization is in effect, line-wrapping may be applied as well.
Normalization and line-wrapping may be appropriate for a block element
with mixed content (such as <code class="literal"><para></code>).
</p></li><li><p>
inline elements
</p><p>
These are elements that are contained within a block or within other
inlines. The DocBook <code class="literal"><emphasis></code> and
<code class="literal"><literal></code> elements are examples of inline
elements.
</p><p>
Normalization and line-wrapping of inline element tags and content are
handled the same way as for the enclosing block element. In essence, an
inline element is treated as part of its parent's "text" content.
</p></li><li><p>
verbatim elements
</p><p>
No formatting is done for verbatim elements. The DocBook
<code class="literal"><programlisting></code> and
<code class="literal"><screen></code> elements are examples of verbatim
elements.
</p><p>
Verbatim element content is written out exactly as it appears in the
input document. This also applies to child elements. Any formatting that
would otherwise be performed on them is suppressed when they occur
within a verbatim element.
</p></li></ul></div><p>
<span><strong class="command">xmlformat</strong></span> never reformats element tags. In
particular, it does not change whitespace between attributes or within
attribute values. This is true even for inline tags within line-wrapped
block elements.
</p><p>
<span><strong class="command">xmlformat</strong></span> handles empty elements as follows:
</p><div class="itemizedlist"><ul type="disc"><li><p>
If an element appears as <code class="literal"><abc/></code> in the input
document, it is written as <code class="literal"><abc/></code>.
</p></li><li><p>
If an element appears as <code class="literal"><abc></abc></code>, it
is written as <code class="literal"><abc></abc></code>. No line break
is placed between the two tags.
</p></li></ul></div><p>
XML documents may contain other constructs besides elements and text:
</p><div class="itemizedlist"><ul type="disc"><li><p>
Processing instructions
</p></li><li><p>
Comments
</p></li><li><p>
<code class="literal">DOCTYPE</code> declaration
</p></li><li><p>
<code class="literal">CDATA</code> sections
</p></li></ul></div><p>
<span><strong class="command">xmlformat</strong></span> handles these constructs much the same way
as verbatim elements. It does not reformat them.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="line-breaks-and-indentation"></a>3.2.
Line Breaks and Indentation
</h3></div></div></div><p>
Line breaks within block elements are controlled by the
<code class="literal">entry-break</code>, <code class="literal">element-break</code>, and
<code class="literal">exit-break</code> formatting options. A break value of
<em class="replaceable"><code>n</code></em> means <em class="replaceable"><code>n</code></em>
newlines. (This produces <em class="replaceable"><code>n</code></em>-1 blank lines.)
</p><p>
Example. Suppose input text looks like this:
</p><pre class="screen">
<elt>
<subelt/> <subelt/> <subelt/>
</elt>
</pre><p>
Here, an <code class="literal"><elt></code> element contains three nested
<code class="literal"><subelt></code> elements, which for simplicity are
empty.
</p><p>
This input can be formatted several ways, depending on the configuration
options. The following examples show how to do this.
</p><div class="orderedlist"><ol type="1"><li><p>
To produce output with all sub-elements on the same line as the
<code class="literal"><elt></code> element, add a section to the
configuration file that defines <code class="literal"><elt></code> as a
block element and sets all its break values to 0:
</p><pre class="screen">
elt
  format        block
  entry-break   0
  exit-break    0
  element-break 0
</pre><p>
Result:
</p><pre class="screen">
<elt><subelt/><subelt/><subelt/></elt>
</pre></li><li><p>
To leave the sub-elements together on the same line, but on a separate
line between the <code class="literal"><elt></code> tags, leave the
<code class="literal">element-break</code> value set to 0, but set the
<code class="literal">entry-break</code> and <code class="literal">exit-break</code> values
to 1. To suppress sub-element indentation, set
<code class="literal">subindent</code> to 0.
</p><pre class="screen">
elt
  format        block
  entry-break   1
  exit-break    1
  element-break 0
  subindent     0
</pre><p>
Result:
</p><pre class="screen">
<elt>
<subelt/><subelt/><subelt/>
</elt>
</pre></li><li><p>
To indent the sub-elements, make the <code class="literal">subindent</code> value
greater than zero.
</p><pre class="screen">
elt
  format        block
  entry-break   1
  exit-break    1
  element-break 0
  subindent     2
</pre><p>
Result:
</p><pre class="screen">
<elt>
  <subelt/><subelt/><subelt/>
</elt>
</pre></li><li><p>
To cause each sub-element to begin a new line, change the
<code class="literal">element-break</code> to 1.
</p><pre class="screen">
elt
  format        block
  entry-break   1
  exit-break    1
  element-break 1
  subindent     2
</pre><p>
Result:
</p><pre class="screen">
<elt>
  <subelt/>
  <subelt/>
  <subelt/>
</elt>
</pre></li><li><p>
To add a blank line between sub-elements, increase the
<code class="literal">element-break</code> from 1 to 2.
</p><pre class="screen">
elt
  format        block
  entry-break   1
  exit-break    1
  element-break 2
  subindent     2
</pre><p>
Result:
</p><pre class="screen">
<elt>
  <subelt/>

  <subelt/>

  <subelt/>
</elt>
</pre></li><li><p>
To also produce a blank line after the <code class="literal"><elt></code>
opening tag and before the closing tag, increase the
<code class="literal">entry-break</code> and <code class="literal">exit-break</code> values
from 1 to 2.
</p><pre class="screen">
elt
  format        block
  entry-break   2
  exit-break    2
  element-break 2
  subindent     2
</pre><p>
Result:
</p><pre class="screen">
<elt>

  <subelt/>

  <subelt/>

  <subelt/>

</elt>
</pre></li><li><p>
To have blank lines only after the opening tag and before the closing
tag, but not have blank lines between the sub-elements, decrease the
<code class="literal">element-break</code> from 2 to 1.
</p><pre class="screen">
elt
  format        block
  entry-break   2
  exit-break    2
  element-break 1
  subindent     2
</pre><p>
Result:
</p><pre class="screen">
<elt>

  <subelt/>
  <subelt/>
  <subelt/>

</elt>
</pre></li></ol></div><p>
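The effect of the break and indentation options in the preceding
examples can be approximated with a few lines of Python. This is a
minimal sketch under simplifying assumptions, not the actual
implementation of <span><strong class="command">xmlformat</strong></span>: the
<code class="literal">format_block</code> function and its parameter names are
hypothetical, and it handles only a block element whose children are
empty sub-elements.
</p><pre class="screen">
```python
def format_block(name, children, entry_break=1, element_break=1,
                 exit_break=1, subindent=0):
    """Render a block element that contains only empty sub-elements."""

    def brk(n, indent):
        # A break value of n means n newlines (n-1 blank lines).
        # Indentation is written only when a new line was started.
        return "\n" * n + (" " * indent if n > 0 else "")

    out = "<" + name + ">"
    for i, child in enumerate(children):
        # entry-break precedes the first child, element-break the others.
        n = entry_break if i == 0 else element_break
        out += brk(n, subindent) + "<" + child + "/>"
    # exit-break precedes the closing tag, which is not indented here.
    out += brk(exit_break, 0) + "</" + name + ">"
    return out


# Hypothetical usage: blank lines between children, indented by 2.
print(format_block("elt", ["subelt"] * 3, entry_break=1,
                   element_break=2, exit_break=1, subindent=2))
```
</pre><p>
Passing 0 for all three break values in this sketch yields the
single-line form shown in the first example.
</p><p>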
Breaks within block elements are suppressed in certain cases:
</p><div class="itemizedlist"><ul type="disc"><li><p>
Breaks apply to nested block or verbatim elements, but not to inline
elements, which are, after all, inline. (If you really want an inline to
begin a new line, define it as a block element.)
</p></li><li><p>
Breaks are not applied to text within non-normalized blocks.
Non-normalized text should not be changed, and adding line breaks
changes the text.
</p><p>
For example, if <code class="literal"><x></code> elements are normalized, you
might elect to format this:
</p><pre class="screen">
<x>This is a sentence.</x>
</pre><p>
Like this:
</p><pre class="screen">
<x>
This is a sentence.
</x>
</pre><p>
Here, breaks are added before and after the text to place it on a
separate line. But if <code class="literal"><x></code> is not normalized,
the text content will be written as it appears in the input, to avoid
changing it.
</p></li></ul></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="text-handling"></a>3.3.
Text Handling
</h3></div></div></div><p>
The XML standard considers whitespace nodes insignificant in elements
that contain only other elements. In other words, for elements that have
element content, sub-elements may optionally be separated by whitespace,
but that whitespace is insignificant and may be ignored.
</p><p>
An element that has mixed content may have text
(<code class="literal">#PCDATA</code>) content, optionally interspersed with
sub-elements. In this case, whitespace-only nodes may be significant.
</p><p>
<span><strong class="command">xmlformat</strong></span> treats only literal whitespace as
whitespace. This includes the space, tab, newline (linefeed), and
carriage return characters. <span><strong class="command">xmlformat</strong></span> does not
resolve entity references, so entities such as
<code class="literal">&#32;</code> or <code class="literal">&#x20;</code> that
represent whitespace characters are seen as non-whitespace text, not as
whitespace.
</p><p>
<span><strong class="command">xmlformat</strong></span> doesn't know whether a block element has
element content or mixed content. It handles text content as follows:
</p><div class="itemizedlist"><ul type="disc"><li><p>
If an element has element content, it will have only sub-elements and
possibly all-whitespace text nodes. In this case, it is assumed that
you'll want to control line-break behavior between sub-elements, so that
the (all-whitespace) text nodes can be discarded and replaced with the
proper number of newlines, and possibly indentation.
</p></li><li><p>
If an element has mixed content, you may want to leave text nodes alone,
or you may want to normalize (and possibly line-wrap) them. In
<span><strong class="command">xmlformat</strong></span>, normalization converts runs of whitespace
characters to single spaces, and discards leading and trailing
whitespace.
</p></li></ul></div><p>
To achieve this kind of formatting, <span><strong class="command">xmlformat</strong></span>
recognizes <code class="literal">normalize</code> and
<code class="literal">wrap-length</code> configuration options for block elements.
They affect text formatting as follows:
</p><div class="itemizedlist"><ul type="disc"><li><p>
You can enable or disable text normalization by setting the
<code class="literal">normalize</code> option to <code class="literal">yes</code> or
<code class="literal">no</code>.
</p></li><li><p>
Within a normalized block, runs of whitespace are converted to single
spaces. Leading and trailing whitespace is discarded. Line-wrapping and
indenting may be applied.
</p></li><li><p>
In a non-normalized block, text nodes are not changed as long as they
contain any non-whitespace characters. No line-wrapping or indenting is
applied. However, if a text node contains only whitespace (for example,
a space or newline between sub-elements), it is assumed to be
insignificant and is discarded. It may be replaced by line breaks and
indentation when output formatting occurs.
</p></li></ul></div><p>
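The rules in the list above can be summarized in a short Python sketch.
It is an illustration only, not the code that
<span><strong class="command">xmlformat</strong></span> runs: the names
<code class="literal">normalize</code> and <code class="literal">handle_text_node</code>
are hypothetical, and the sketch ignores inline and verbatim
sub-elements. It assumes that only the four literal whitespace
characters count as whitespace, so an entity reference such as
<code class="literal">&#32;</code> is treated as ordinary text.
</p><pre class="screen">
```python
import re

# Only these four literal characters count as whitespace in this sketch.
WHITESPACE = " \t\n\r"


def normalize(text):
    """Collapse whitespace runs to single spaces and trim both ends."""
    return re.sub("[" + WHITESPACE + "]+", " ", text).strip(" ")


def handle_text_node(text, normalized):
    """Return the text to keep, or None if the node is discarded."""
    if text.strip(WHITESPACE) == "":
        # All-whitespace nodes are discarded whether or not the
        # enclosing block is normalized.
        return None
    # Non-normalized blocks keep text that has any non-whitespace
    # character exactly as it appeared in the input.
    return normalize(text) if normalized else text
```
</pre><p>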
Consider the following input:
</p><pre class="screen">
<row> <cell> A </cell> <cell> B </cell> </row>
</pre><p>
Suppose that the <code class="literal"><row></code> and
<code class="literal"><cell></code> elements both are to be treated as
non-normalized. The contents of the <code class="literal"><cell></code>
elements are text nodes that contain non-whitespace characters, so they
would not be reformatted. On the other hand, the spaces between tags are
all-whitespace text nodes and are not significant. This means that you
could reformat the input like this:
</p><pre class="screen">
<row><cell> A </cell><cell> B </cell></row>
</pre><p>
Or like this:
</p><pre class="screen">
<row>
<cell> A </cell><cell> B </cell>
</row>
</pre><p>
Or like this:
</p><pre class="screen">
<row>
<cell> A </cell>
<cell> B </cell>
</row>
</pre><p>
In each of those cases, the whitespace between tags was subject to
reformatting, but the text content of the
<code class="literal"><cell></code> elements was not.
</p><p>
The input would <span class="emphasis"><em>not</em></span> be formatted like this:
</p><pre class="screen">
<row><cell>A</cell><cell>B</cell></row>
</pre><p>
Or like this:
</p><pre class="screen">
<row>
<cell>
A
</cell>
<cell>
B
</cell>
</row>
</pre><p>
In both of those cases, the text content of the
<code class="literal"><cell></code> elements has been modified, which is not
allowed within non-normalized blocks. You would have to declare
<code class="literal"><cell></code> to have a <code class="literal">normalize</code>
value of <code class="literal">yes</code> to achieve either of those output
styles.
</p><p>
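The treatment of the <code class="literal"><row></code> input above can be
reproduced with a small Python sketch. The regular-expression
tokenizer and the function name
<code class="literal">drop_whitespace_nodes</code> are assumptions made for
illustration; a pattern this simple does not cope with comments,
<code class="literal">CDATA</code> sections, or a <code class="literal">></code>
character inside an attribute value.
</p><pre class="screen">
```python
import re


def drop_whitespace_nodes(xml):
    """Split markup into tags and text nodes, dropping all-whitespace text."""
    # The capturing group makes re.split keep the tags in the result.
    nodes = re.split(r"(<[^>]+>)", xml)
    # Text nodes with any non-whitespace character are kept unchanged.
    return [n for n in nodes if n.strip(" \t\n\r") != ""]


nodes = drop_whitespace_nodes("<row> <cell> A </cell> <cell> B </cell> </row>")
print("".join(nodes))
```
</pre><p>
Joining the remaining nodes without separators gives the first of the
permitted outputs: the spaces between tags are gone, but the spaces
around <code class="literal">A</code> and <code class="literal">B</code> are still
present.
</p><p>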
Now consider the following input:
</p><pre class="screen">
<para> This is a sentence. </para>
</pre><p>
Suppose that <code class="literal"><para></code> is to be treated as a
normalized element. It could be reformatted like this:
</p><pre class="screen">
<para>This is a sentence.</para>
</pre><p>
Or like this:
</p><pre class="screen">
<para>
This is a sentence.
</para>
</pre><p>
Or like this:
</p><pre class="screen">
<para>
  This is a sentence.
</para>
</pre><p>
Or even (with line-wrapping) like this:
</p><pre class="screen">
<para>
  This is a
  sentence.
</para>
</pre><p>
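Normalization followed by line-wrapping can be imitated with the
<code class="literal">textwrap</code> module from the Python standard library.
This is a rough sketch, not the wrapping algorithm of
<span><strong class="command">xmlformat</strong></span>: the function name
<code class="literal">normalize_and_wrap</code> is hypothetical, the treatment of
a non-positive length as "no wrapping" is an assumption of the sketch,
and indentation and inline tags are ignored.
</p><pre class="screen">
```python
import re
import textwrap


def normalize_and_wrap(text, wrap_length):
    """Normalize whitespace, then wrap the result at word boundaries."""
    normalized = re.sub(r"[ \t\n\r]+", " ", text).strip(" ")
    if wrap_length <= 0:
        # Assumption in this sketch: no wrapping is requested.
        return normalized
    # Words are never split, even if one is longer than wrap_length.
    return textwrap.fill(normalized, width=wrap_length,
                         break_long_words=False, break_on_hyphens=False)


print(normalize_and_wrap(" This is a sentence. ", 10))
```
</pre><p>
With a length of 10, the sentence is split after
<code class="literal">a</code>, as in the wrapped example above.
</p><p>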
The preceding description of normalization is a bit oversimplified.
Normalization is complicated by the possibility that non-normalized
elements may occur as sub-elements of a normalized block. In the
following example, a verbatim block occurs in the middle of a normalized
block:
</p><pre class="screen">
<para>This is a paragraph that contains
<programlisting>
a code listing
</programlisting>
in the middle.
</para>
</pre><p>
In general, when this occurs, any whitespace in text nodes adjacent to
non-reformatted nodes is discarded.
</p><p>
There is no "preserve all whitespace as is" mode for block elements.
Even if normalization is disabled for a block, any all-whitespace text
nodes are considered dispensable. If you really want all text within an
element to be preserved intact, you should declare it as a verbatim
element. (Within verbatim elements, nothing is ever reformatted, so all
whitespace is significant.)
</p><p>
If you want to see how <span><strong class="command">xmlformat</strong></span> handles whitespace
nodes and text normalization, invoke it with the
<code class="option">--canonized-output</code> option. This option causes
<span><strong class="command">xmlformat</strong></span> to display the document after it has been
canonized by removing whitespace nodes and performing text
normalization, but before it has been reformatted in final form. By
examining the canonized document, you can see what effect your
configuration options have on treatment of the document before
line-wrapping and indentation are performed and line breaks are added.
</p></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="how-to-use.html">Prev</a> </td><td width="20%" align="center"> </td><td width="40%" align="right"> <a accesskey="n" href="using-config-files.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">2.
How to Use <span><strong class="command">xmlformat</strong></span>
</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 4.
Using Configuration Files
</td></tr></table></div></body></html>