<?xml version="1.0" encoding="UTF-8"?>
<refentry version="5.0" xmlns="http://docbook.org/ns/docbook"
          xmlns:xlink="http://www.w3.org/1999/xlink"
          xmlns:xi="http://www.w3.org/2001/XInclude"
          xmlns:svg="http://www.w3.org/2000/svg"
          xmlns:m="http://www.w3.org/1998/Math/MathML"
          xmlns:html="http://www.w3.org/1999/xhtml"
          xmlns:db="http://docbook.org/ns/docbook">
  <info>
    <author>
      <personname>Michael Fuchs</personname>
      <personblurb>
        <para>Software Engineer</para>
      </personblurb>
    </author>
    <productname>herold</productname>
  </info>

  <refmeta>
    <refentrytitle>herold</refentrytitle>
    <manvolnum>1</manvolnum>
    <refmiscinfo class="manual">User Commands</refmiscinfo>
  </refmeta>

  <refnamediv>
    <refname>herold</refname>
    <refpurpose>HTML to DocBook converter</refpurpose>
  </refnamediv>

  <refsynopsisdiv>
    <cmdsynopsis>
      <command>herold</command>
      <arg choice="opt">OPTIONS</arg>
    </cmdsynopsis>
  </refsynopsisdiv>

  <refsection>
    <title>Description</title>

    <para>The reuse of HTML content in presentation-neutral form is a frequent
    problem. One possible solution is to convert HTML to DocBook XML, because
    DocBook is a semantic markup language for documentation, which enables its
    users to create document content that captures the logical structure of
    the content.</para>

    <para>The command line tool <productname>herold</productname> can be used
    to convert HTML to DocBook. Because HTML elements are often not used as
    intended, the possibilities for such a transformation are somewhat
    limited. herold is part of the dbdoclet suite of tools. For more
    information visit <link
    xlink:href="http://www.dbdoclet.org">http://www.dbdoclet.org</link>.</para>
  </refsection>

  <refsection>
    <title>Options</title>

    <variablelist>
      <varlistentry>
        <term>--docbook-add-index, -x</term>
        <listitem>
          <para>Automatically add an index element at the end of the
          document.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-decompose-tables, -T</term>
        <listitem>
          <para>Decomposes the tables in the HTML code into single
          paragraphs. This can be useful if a document contains a lot of
          tables for formatting reasons.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-encoding, -d</term>
        <listitem>
          <para>Specifies the encoding of the generated DocBook XML
          files.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-root-element, -r</term>
        <listitem>
          <para>The root element of the document. Possible values are: book,
          article, reference, part, chapter or section. The default value for
          this option is 'article'.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-title, -t</term>
        <listitem>
          <para>The title for the resulting document.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--in, -i</term>
        <listitem>
          <para>Specifies the HTML input file.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--help, -h</term>
        <listitem>
          <para>Prints a help page on the console.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--html-encoding, -s</term>
        <listitem>
          <para>Specifies the encoding of the HTML source files, such as
          ISO-8859-1.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--out, -o</term>
        <listitem>
          <para>Specifies the DocBook XML destination file.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--profile, -p</term>
        <listitem>
          <para>A profile file with predefined settings.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--verbose, -v</term>
        <listitem>
          <para>Enables verbose console output.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--version, -V</term>
        <listitem>
          <para>Displays the version of herold.</para>
        </listitem>
      </varlistentry>
    </variablelist>
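
    <para>As a sketch, a call that combines several of the options above
    might look like the following. The file names
    <filename>input.html</filename> and <filename>output.xml</filename> are
    placeholders, and passing each value as a separate argument after its
    option is an assumption.</para>

    <programlisting>herold --in input.html --out output.xml \
       --docbook-root-element book --docbook-title "Tutorial"</programlisting>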
  </refsection>

  <refsection>
    <title>Configuration</title>

    <para>The details of a transformation are controlled by a profile file. A
    profile file offers more possibilities to influence the transformation
    than the command line arguments. The following example shows a typical
    profile file.</para>

    <programlisting linenumbering="numbered">transformation html2docbook;

section section-detection  {
    attribute-class = ["^MsoHeading(\d+)$"];
    section-numbering-pattern = "((\d+\.)+)?\d*\.?\p{Z}*";
}

section list-detection {
    itemized-attribute-class = ["^MsoListBullet(\w*)$", "Aufzhlung(\w+)$"];
    itemized-strip-prefix = [ "-", "o", "\u00b7" ];
    ordered-attribute-class = ["^MsoListNumbered(\w*)$"];
    ordered-strip-prefix = [ "\d+\.\s+" ];
}

section HTML {
    encoding = "windows-1252";
    exclude = [ "//p[starts-with(@class, 'MsoToc')]", "" ];
}

section DocBook {
    abstract = """&lt;title&gt;Lorem ipsum&lt;/title&gt;
&lt;para&gt;Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed 
do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 
enim ad minim veniam, quis nostrud exercitation ullamco laboris 
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in 
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.sed, dolor 
amet.&lt;/para&gt;""";
    add-index = true;
    author-email = "me@somewhere.de";
    author-firstname = "Michael";
    author-surname = "Fuchs";
    chunk-elements = [ "chapter", "section", "appendix" ];
// Syntax: chunk-&lt;CHUNK-ELEMENT&gt;-depth = &lt;INT&gt;;
    chunk-section-depth = 3;  
    collapse-protected-space = "true";
    copyright-holder = "Ingenieurbüro Michael Fuchs";
    copyright-year = "2015";
    corporation = "";
    create-condition-attribute = false;
    create-prolog = true;
    create-remap-attribute = false;
    create-xref-label = false;
    decompose-tables = false;
    detect-trapped-br = true;
    documentation-id = "doc01";
    document-element = "book";
    encoding = "UTF-8";
    hyphenation-char = "soft-hyphen";
    image-data-formats = [ "gif", "base64" ];
    image-path = "./figures";
    language = "de";
    release-info = "Version 3.1";
    table-style = "all";
    title = "Tutorial";
    title-normalize-space = true;
    use-absolute-image-path = false;
}
</programlisting>

    <refsection>
      <title>Syntax</title>
      <para>A profile file consists mainly of sections. Sections are used to
      group parameters which share the same context. Every section must start
      with the keyword <varname>section</varname> followed by the name of the
      section. After the name comes the block of parameters, which is
      surrounded by curly braces. Parameters can be of type String, Number,
      Boolean or Array. Strings must be enclosed in double quotes. If the
      String contains newlines, use three double quotes instead of one. Arrays
      are enclosed in square brackets. Inside an array, the elements must be
      comma separated. Every assignment must be terminated by a semicolon.
      Multi-line comments have the form <varname>/* my comment */</varname>;
      single-line comments look like <varname>// my comment\n</varname>.</para>
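
      <para>The rules above can be illustrated with a short fragment. It is
      only a sketch: the parameter names are taken from the example profile,
      and the values are arbitrary.</para>

      <programlisting>/* Multi-line comment:
   parameters that share a context are grouped in one section. */
section DocBook {
    // String
    title = "Tutorial";
    // String containing newlines, delimited by three double quotes
    abstract = """First line
second line""";
    // Number
    chunk-section-depth = 3;
    // Boolean
    add-index = true;
    // Array
    chunk-elements = [ "chapter", "section" ];
}</programlisting>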
    </refsection>

    <refsection>
      <title>Mandatory Elements</title>
      <para>A profile for herold must start with the line
      <literal>transformation html2docbook;</literal>.</para>
    </refsection>

    <refsection>
      <title>Section HTML</title>
      <para>The section HTML defines parameters that control the loading and
      parsing of the HTML input data.</para>

      <para><variablelist>
          <varlistentry>
            <term>encoding</term>

            <listitem>
              <para>The character set used to read the input stream.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>exclude</term>

            <listitem>
              <para>Defines an array of XPath expressions. All matches are
              removed from the HTML DOM tree before transformation.</para>
            </listitem>
          </varlistentry>
        </variablelist></para>
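
      <para>A minimal sketch of an HTML section follows. The encoding value
      and both XPath expressions are hypothetical and only illustrate the
      form of the parameters.</para>

      <programlisting>section HTML {
    // Read the input stream as ISO-8859-1.
    encoding = "ISO-8859-1";
    // Remove all script elements and one navigation block
    // from the DOM tree before the transformation starts.
    exclude = [ "//script", "//div[@id='navigation']" ];
}</programlisting>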
    </refsection>

    <refsection>
      <title>Section DocBook</title>

      <para><variablelist>
          <varlistentry>
            <term>abstract</term>

            <listitem>
              <para>The text for the abstract element of the info section. If
              the text is structured with newlines, use three double quotes as
              delimiters. If the text starts with a "&lt;" character, it is
              embedded into an abstract element, otherwise the text is
              embedded into a para element inside of an abstract element. The
              text will be parsed and can contain DocBook elements.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>add-index</term>

            <listitem>
              <para>If set to true, an index element is inserted at the end of
              the DocBook XML.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>author-email</term>

            <listitem>
              <para>The email address of the author. If this parameter is set,
              it is used to create an info section at the beginning of the
              document.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>author-firstname</term>

            <listitem>
              <para>The first name of the author. If this parameter is set, it
              is used to create an info section at the beginning of the
              document.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>author-surname</term>

            <listitem>
              <para>The surname of the author. If this parameter is set, it is
              used to create an info section at the beginning of the
              document.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>chunk-elements</term>

            <listitem>
              <para>Defines an array of element names. If an element of this
              list is detected while writing the output, the element and all
              child nodes will be written to a separate file. This new file
              will be included into the parent file with an
              <markup>xi:include</markup> tag. Recursive structures result in
              recursive includes. You might want to use this if you are
              transforming big HTML files and the resulting DocBook XML file
              becomes uncomfortably large.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>chunk-&lt;CHUNK-ELEMENT&gt;-depth</term>

            <listitem>
              <para>Defines the nesting depth down to which chunking is
              applied for a chunk element, e.g. <property>chunk-section-depth
              = 3</property>. If an element defined for chunking is nested
              recursively, you might want to control the depth to which the
              chunking should be done. The default depth is 1, which means
              only the topmost element is separated.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>create-xref-label</term>

            <listitem>
              <para>If set to false, anchor elements do not get an xreflabel
              attribute.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>decompose-tables</term>

            <listitem>
              <para>If set to true, table structures will be ignored. The
              content of the table cells will be inserted into the DocBook XML
              as a sequence of paragraphs. This parameter can be useful if
              your HTML contains tables for formatting purposes. Normally you
              want to get rid of them, because they distort the logical
              structure.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>document-element</term>
            <listitem>
              <para>The document element you want to use. Must be one of
              article, book, part or reference.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>encoding</term>
            <listitem>
              <para>The character set which will be used for writing the
              output file.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>image-data-formats</term>

            <listitem>
              <para>An array of image formats. These formats will be inserted
              as imageobject elements, in addition to the format found in the
              src attribute of the corresponding img element. The original
              format is inserted twice, with the roles "html" and "fo". The
              other formats are inserted as "html-&lt;FORMAT&gt;" and
              "fo-&lt;FORMAT&gt;".</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>title</term>

            <listitem>
              <para>The title of the resulting document. If this parameter is
              undefined, herold tries to detect the title from the head
              section of the HTML data.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>use-absolute-image-path</term>

            <listitem>
              <para>If you want absolute image paths in the fileref attribute
              of the imagedata element, set this parameter to true.</para>
            </listitem>
          </varlistentry>
        </variablelist></para>
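
      <para>As a sketch of how the chunking parameters described above
      combine, the following fragment uses values taken from the example
      profile. The comments restate the documented behaviour; the names of
      the generated files are not specified here.</para>

      <programlisting>section DocBook {
    document-element = "book";
    // Write every chapter and section to a separate file, which is
    // included into its parent file with an xi:include tag.
    chunk-elements = [ "chapter", "section" ];
    // Split off nested sections down to the third level. Without this
    // line the default depth of 1 applies, and only the topmost
    // section is separated.
    chunk-section-depth = 3;
}</programlisting>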
    </refsection>

    <refsection>
      <title>Section node</title>
      <para>The mapping of HTML elements to DocBook elements can be fine-tuned
      by using node sections. If you have HTML code which looks like the following fragment:
      <programlisting>&lt;ol class="procedure"&gt;
  &lt;li&gt;Step 1&lt;/li&gt;
  &lt;li&gt;Step 2&lt;/li&gt;
  &lt;li&gt;Step 3&lt;/li&gt;
&lt;/ol&gt;</programlisting>
      The resulting DocBook XML after the transformation would normally look
      like:
      <programlisting>&lt;orderedlist&gt;
  &lt;listitem&gt;Step 1&lt;/listitem&gt;
  &lt;listitem&gt;Step 2&lt;/listitem&gt;
  &lt;listitem&gt;Step 3&lt;/listitem&gt;
&lt;/orderedlist&gt;</programlisting>
      But what you would like to have is something like:
      <programlisting>&lt;procedure&gt;
  &lt;step&gt;Step 1&lt;/step&gt;
  &lt;step&gt;Step 2&lt;/step&gt;
  &lt;step&gt;Step 3&lt;/step&gt;
&lt;/procedure&gt;</programlisting>
      To achieve this, you can use the following
      rules in your profile:
      <programlisting>node "//ol[@class='procedure']" {
  map-to = "procedure";
}

node "//ol[@class='procedure']/li" {
  map-to = "step";
}</programlisting>
      The keyword <code>node</code> is followed by an XPath expression, which is
      matched against the document element of the HTML file (typically &lt;html&gt;).
      The parameter <code>map-to</code> defines the DocBook element that is used instead
      of the default mapping element.</para>
    </refsection>

    <refsection>
      <title>Section attribute</title>
      <para>An attribute section is more or less the same as a node section. Instead of redefining the mapping of an HTML element to a DocBook element, it changes the mapping of an attribute. The following section maps an attribute <code>class='procedure'</code> to <code>role='procedure'</code>.
      <programlisting>
attribute "//@class[contains(., 'procedure')]" {
	map-to = "role";
}
      </programlisting>
      </para>
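
      <para>Applied to the ordered list from the node example, and assuming
      that no node section remaps the elements, this rule would be expected
      to produce roughly the following result. The listing is derived from
      the description above and is not verified herold output.</para>

      <programlisting>&lt;orderedlist role="procedure"&gt;
  &lt;listitem&gt;Step 1&lt;/listitem&gt;
  &lt;listitem&gt;Step 2&lt;/listitem&gt;
  &lt;listitem&gt;Step 3&lt;/listitem&gt;
&lt;/orderedlist&gt;</programlisting>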
    </refsection>

    <refsection>
      <title>Section section-detection</title>

      <para>The section <varname>section-detection</varname> is used to detect
      section elements in HTML code and to strip off any numbering prefix from
      the titles.</para>

      <para>Many authoring tools allow deeply nested sections. When exporting
      HTML, the nesting can become deeper than six levels. HTML provides
      header elements for only six levels, h1 to h6. Beyond that, the
      formatting is normally done with the help of CSS and div or p elements.
      herold is able to detect the header elements of HTML, but it cannot
      know about the export format of a specific tool. To solve this problem
      at least in some cases, you can specify the parameter
      <varname>attribute-class</varname>. It consists of a list of regular
      expressions, which are matched against the class attribute of each HTML
      element. If a match is found, the element is treated as a section
      element. The regular expression can have a group, which is interpreted
      as the level indicator. The group must be the first group and it must
      match a number, e.g. <literal>^heading(\d+)$</literal>. If the level
      cannot be detected, a level of seven is assumed.</para>

      <para>Because DocBook XSL stylesheets take care of the section numbering
      while transforming the DocBook XML to a specific output, it is often
      necessary to strip the numbering already defined in the HTML page.
      Otherwise you end up with two numbering texts in front of your titles.
      To help herold with the detection of numbering patterns, use the
      parameter <varname>section-numbering-pattern</varname>.</para>

      <para><variablelist>
          <varlistentry>
            <term>attribute-class</term>
            <listitem>
              <para>A list of regular expressions, which are applied to every
              p and div element. If an expression matches, the current element
              is handled as a section element. If the regular expression has
              groups, the first group is used as the nesting level; otherwise
              level seven is assumed.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>section-numbering-pattern</term>
            <listitem>
              <para>Normally you want to get rid of the section numbering that
              comes with the HTML data, because it becomes part of the title
              text in DocBook. The section numbers would then appear twice in
              your target media: once from the HTML and once from the DocBook
              XSL processing. The parameter section-numbering-pattern defines
              a regular expression, which is matched against the beginning of
              every section title. If it matches, the matching part is
              removed.</para>
            </listitem>
          </varlistentry>
        </variablelist></para>
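
      <para>The following sketch shows both parameters at work. The class
      name <literal>MsoHeading7</literal> and the title text are made-up
      input; the profile lines are taken from the example profile.</para>

      <programlisting>&lt;p class="MsoHeading7"&gt;1.2.3 Installation&lt;/p&gt;</programlisting>

      <programlisting>section section-detection {
    attribute-class = ["^MsoHeading(\d+)$"];
    section-numbering-pattern = "((\d+\.)+)?\d*\.?\p{Z}*";
}</programlisting>

      <para>Here the class attribute matches the expression and its first
      group captures the number 7, so the paragraph should be treated as a
      section title at level seven. The numbering pattern matches the leading
      text <literal>1.2.3</literal> together with the following space, so the
      remaining title text should be <literal>Installation</literal>.</para>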
    </refsection>

    <refsection>
      <title>Section list-detection</title>
      <para>Sometimes lists are not represented with ul, ol or dl tags, but
      as p tags with additional CSS formatting. If you use a tool that
      creates or exports HTML with such a construct, the conversion will end
      up with para elements instead of the corresponding list elements in
      DocBook. To recreate the lists at least in some cases, you can use the
      section <varname>list-detection</varname>. The parameters
      <varname>itemized-attribute-class</varname> and
      <varname>ordered-attribute-class</varname> let you define lists of
      regular expressions, which are matched against the class attribute of
      the list item elements in the HTML. herold tries to rebuild the proper
      list structure from this information, even for nested lists.</para>
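
      <para>As a sketch, consider the following made-up HTML input together
      with a profile section whose parameter names are taken from the example
      profile. The effect of <varname>itemized-strip-prefix</varname> is not
      documented here; that it removes the leading bullet character from each
      item is an assumption based on its name.</para>

      <programlisting>&lt;p class="MsoListBullet"&gt;- First item&lt;/p&gt;
&lt;p class="MsoListBullet"&gt;- Second item&lt;/p&gt;</programlisting>

      <programlisting>section list-detection {
    // Paragraphs with a matching class attribute are treated as
    // items of an itemized list.
    itemized-attribute-class = ["^MsoListBullet(\w*)$"];
    // Assumed: a leading "-" is stripped from the item text.
    itemized-strip-prefix = [ "-" ];
}</programlisting>

      <para>With these settings the two paragraphs would be expected to end
      up as list items of one itemized list rather than as two para
      elements.</para>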
    </refsection>
  </refsection>

  <refsection>
    <title>Copyright</title>
    <para>Copyright 2001-2015 Michael Fuchs. License GPLv3+: GNU GPL version 3
    or later <link
    xlink:href="http://gnu.org/licenses/gpl.html">http://gnu.org/licenses/gpl.html</link>.
    This is free software: you are free to change and redistribute it. There
    is NO WARRANTY, to the extent permitted by law.</para>
  </refsection>
</refentry>