File: manpage.xml

package info (click to toggle)
herold 8.0.1-2
links: PTS, VCS
area: main
in suites: forky, sid
size: 5,304 kB
sloc: java: 40,460; xml: 513; makefile: 37; sh: 29
file content (551 lines) | stat: -rw-r--r-- 20,873 bytes
parent folder | download | duplicates (3)
<?xml version="1.0" encoding="UTF-8"?>
<refentry version="5.0" xmlns="http://docbook.org/ns/docbook"
          xmlns:xlink="http://www.w3.org/1999/xlink"
          xmlns:xi="http://www.w3.org/2001/XInclude"
          xmlns:svg="http://www.w3.org/2000/svg"
          xmlns:m="http://www.w3.org/1998/Math/MathML"
          xmlns:html="http://www.w3.org/1999/xhtml"
          xmlns:db="http://docbook.org/ns/docbook">
  <info>
    <author>
      <personname>Michael Fuchs</personname>
      <personblurb>
        <para>Software Engineer</para>
      </personblurb>
    </author>
    <productname>herold</productname>
  </info>

  <refmeta>
    <refentrytitle>herold</refentrytitle>
    <manvolnum>1</manvolnum>
    <refmiscinfo class="manual">User Commands</refmiscinfo>
  </refmeta>

  <refnamediv>
    <refname>herold</refname>
    <refpurpose>HTML to DocBook converter</refpurpose>
  </refnamediv>

  <refsynopsisdiv>
    <cmdsynopsis>
      <command>herold</command>
      <arg choice="opt">OPTIONS</arg>
    </cmdsynopsis>
  </refsynopsisdiv>

  <refsection>
    <title>Description</title>

    <para>The reuse of HTML content in presentation-neutral form is a frequent
    problem. One possible solution is to convert HTML to DocBook XML, because
    DocBook is a semantic markup language for documentation, which enables its
    users to create document content that captures the logical structure of
    the content.</para>

    <para>The command line tool <productname>herold</productname> can be used
    to convert HTML to DocBook. Because HTML elements are often used not as
    intended, the possibilities for such a transformation are somewhat
    limited. herold is part of the dbdoclet suite of tools. For more
    information visit <link
    xlink:href="http://www.dbdoclet.org">http://www.dbdoclet.org</link>.</para>
  </refsection>

  <refsection>
    <title>Options</title>

    <variablelist>
      <varlistentry>
        <term>--docbook-add-index, -x</term>
        <listitem>
          <para>Automatically add an index element at the end of the
          document.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-decompose-tables, -T</term>
        <listitem>
          <para>Decomposes the tables from the HTML code into single
          paragraphs. This can be useful, if a document contains a lot of
          tables for formatting reasons.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-encoding, -d</term>
        <listitem>
          <para>Specifies the encoding of the generated DocBook XML
          files.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-root-element, -r</term>
        <listitem>
          <para>The root element of the document. Possible values are: book,
          article, reference, part, chapter or section. The default value for
          this option is 'article'</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--docbook-title, -t</term>
        <listitem>
          <para>The title for the resulting document.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--in, -i</term>
        <listitem>
          <para>Specifies the HTML input file.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--help, -h</term>
        <listitem>
          <para>Prints a help page on the console.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--html-encoding, -s</term>
        <listitem>
          <para>Specifies the encoding of the HTML source files, such as
          ISO-8859-1.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--out, -o</term>
        <listitem>
          <para>Specifies the DocBook XML destination file.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--profile, -p</term>
        <listitem>
          <para>A profile file with predefined settings.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--verbose, v</term>
        <listitem>
          <para>Enables the verbosity for the console output.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>--version, -V</term>
        <listitem>
          <para>Displays the version of herold.</para>
        </listitem>
      </varlistentry>
    </variablelist>
  </refsection>

  <refsection>
    <title>Configuration</title>

    <para>The details of a transformation are controlled by a profile file. A
    profile file offers more possibilities to influence the transformation
    than the command line arguments. The following example shows a typical
    profile file.</para>

    <programlisting linenumbering="numbered">transformation html2docbook;

section section-detection  {
    attribute-class = ["^MsoHeading(\d+)$"];
    section-numbering-pattern = "((\d+\.)+)?\d*\.?\p{Z}*";
}

section list-detection {
    itemized-attribute-class = ["^MsoListBullet(\w*)$", "Aufzhlung(\w+)$];
    itemized-strip-prefix = [ "-", "o", "\u00b7" ];
    ordered-attribute-class = ["^MsoListNumbered(\w*)$"];
    ordered-strip-prefix = [ "\d+\.\s+" ];
}

section HTML {
    encoding = "windows-1252";
    exclude = [ "//p[starts-with(@class, 'MsoToc')]", "" ];
}

section DocBook {
    abstract = """&lt;title&gt;Lorem ipsum&lt;/title&gt;
&lt;para&gt;Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed 
do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 
enim ad minim veniam, quis nostrud exercitation ullamco laboris 
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in 
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.sed, dolor 
amet.&lt;/para&gt;""";
    add-index = true;
    author-email = "me@somewhere.de";
    author-firstname = "Michael";
    author-surname = "Fuchs";
    chunk-elements = [ "chapter", "section", "appendix" ];
// Syntax: chunk-&lt;CHUNK-ELEMENT&gt;-depth = &lt;INT&gt;;
    chunk-section-depth = 3;  
    collapse-protected-space = "true";
    copyright-holder = "Ingenieurbüro Michael Fuchs";
    copyright-year = "2015";
    corporation = "";
    create-condition-attribute = false;
    create-prolog = true;
    create-remap-attribute = false;
    create-xref-label = false;
    decompose-tables = false;
    detect-trapped-br = true;
    documentation-id = "doc01";
    document-element = "book";
    encoding = "UTF-8";
    hyphenation-char = "soft-hyphen";
    image-data-formats = [ "gif", "base64" ];
    image-path = "./figures";
    language = "de";
    release-info = "Version 3.1";
    table-style = "all";
    title = "Tutorial";
    title-normalize-space = true;
    use-absolute-image-path = false;
}
</programlisting>

    <refsection>
      <title>Syntax</title>
      <para>A profile file consists mainly of sections. Sections are used to
      group parameters which share the same context. Every section must start
      with the keyword <varname>section</varname> followed by the name of the
      section. After the name comes the block of parameters, which is
      surrounded by curly braces. Parameters can be of type String, Number,
      Boolean or Array. Strings must be framed with double quotes. If the
      String contains newlines, use three double quotes instead of one. Arrays
      are framed with square brackets. Inside an array, the elements must be
      comma separated. Every assignment must be finished by a semicolon. Multi
      line comments have the form <varname>/* my comment */</varname> , single
      line comments look like <varname>// my comment\n</varname>.</para>
    </refsection>

    <refsection>
      <title>Mandatory Elements</title>
      <para>A profile for herold must start with the line
      <literal>transformation html2docbook;</literal>.</para>
    </refsection>

    <refsection>
      <title>Section HTML</title>
      <para>The section HTML defines parameters, which control the loading and
      parsing of the HTML input data.</para>

      <para><variablelist>
          <varlistentry>
            <term>encoding</term>

            <listitem>
              <para>The character set used to read the input stream.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>exclude</term>

            <listitem>
              <para>Defines an array of xpath expressions. All matches are
              removed from the HTML DOM tree before transformation.</para>
            </listitem>
          </varlistentry>
        </variablelist></para>
    </refsection>

    <refsection>
      <title>Section DocBook</title>

      <para><variablelist>
          <varlistentry>
            <term>abstract</term>

            <listitem>
              <para>The text for the abstract element of the info section. If
              the text is structured with newlines, use three double quotes as
              delimiters. If the text starts with a "&lt;" character, it is
              embedded into an abstract element, otherwise the text is
              embedded into an para element inside of an abstract element. The
              text will parsed and can contain DocBook elements.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>add-index</term>

            <listitem>
              <para>If set to true, an index element is inserted at the end of
              the DocBook XML.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>author-email</term>

            <listitem>
              <para>The email address of the author. If this parameter is set,
              it is used to create an info section at the beginning of the
              document.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>author-firstname</term>

            <listitem>
              <para>The firstname of the author. If this parameter is set, it
              is used to create an info section at the beginning of the
              document.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>author-surname</term>

            <listitem>
              <para>The surname of the author. If this parameter is set, it is
              used to create an info section at the beginning of the
              document.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>chunk-elements</term>

            <listitem>
              <para>Defines an array of element names. If an element of this
              list is detected while writing the output, the element and all
              child nodes will be written to a separate file. This new file
              will be included into the parent file with an
              <markup>xi:include</markup> tag. Recursive structures result in
              recursive includes. You might want to use this, if you are
              transforming big HTML files and the resulting DocBook XML file
              becomes uncomfortable large.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>chunk-&lt;CHUNK-ELEMENT&gt;-depth</term>

            <listitem>
              <para>Defines the depth for a chunk element, until the chunking
              should be executed, eg <property>chunk-section-depth =
              3</property>. If an element defined for chunking is nested
              recursivley, you might want to control the depth to which the
              chunking should be done. The default depth is 1, which means
              only the topmost element is separated.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>create-xref-label</term>

            <listitem>
              <para>if set to false, anchor elements doesn't get a xreflabel
              attribute.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>decompose-tables</term>

            <listitem>
              <para>If set to true, tables structures will be ignored. The
              content of the table cells will be inserted into the DocBook XML
              as a sequence of paragraphs. This parameter can be useful if
              your HTML contains tables for formatting purposes. Normally you
              want to get rid of them, because they tamper the logical
              structure.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>document-element</term>
            <listitem>
              <para>The document element you want to use. Must be one of
              article, book, part or reference.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>encoding</term>
            <listitem>
              <para>The character set which will be used for writing the
              output file.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>image-data-formats</term>

            <listitem>
              <para>An array of image formats. These formats will be inserted
              as imageobject elements, additionally to the format found in the
              src attribute of the corresponding img element. The original
              format is inserted twice with the roles "html" and "fo". The
              other formats are inserted as "html-&lt;FORMAT&gt;" and
              "fo-&lt;FORMAT&gt;".</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>title</term>

            <listitem>
              <para>The title of the resulting document. If this parameter is
              undefined, herold tries to dected the title from the head
              section of the HTML data.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>use-absolute-image-path</term>

            <listitem>
              <para>If you want absolute image paths in the fileref attribute
              of the imagedata element, set this parameter to true.</para>
            </listitem>
          </varlistentry>
        </variablelist></para>
    </refsection>

    <refsection>
      <title>Section node</title>
      <para>The mapping of HTML elements to DocBook element can be fine tuned
      by using node sections. If you have HTML code which looks like the following fragment:
      <programlisting>&lt;ol class="procedure"&gt;
  &lt;li&gt;Step 1&lt;/li&gt;
  &lt;li&gt;Step 2&lt;/li&gt;
  &lt;li&gt;Step 3&lt;/li&gt;
&lt;/ol&gt;</programlisting>
      The resulting DocBook XML after the transformation would normally look
      like:
      <programlisting>&lt;orderedlist&gt;
  &lt;listitem&gt;Step 1&lt;/listitem&gt;
  &lt;listitem&gt;Step 2&lt;/listitem&gt;
  &lt;listitem&gt;Step 3&lt;/listitem&gt;
&lt;/orderlist&gt;</programlisting>
      But what you would like to have is something like:
      <programlisting>&lt;procedure&gt;
  &lt;step&gt;Step 1&lt;/step&gt;
  &lt;step&gt;Step 2&lt;/step&gt;
  &lt;step&gt;Step 3&lt;/step&gt;
&lt;/procedure&gt;</programlisting>
      To achieve this, you can use the following
      rules in our profile:
      <programlisting>node "//ol[@class='procedure']" {
  map-to = "procedure";
}

node "//ol[@class='procedure']/li" {
  map-to = "step";
}</programlisting>
      After the keyword <code>node</code> follows a xpath expression which is
      matched against the document element of the HTML file (typically &lt;html&gt;).
      The parameter <code>map-to</code> defines the DocBook element, which is used instead
      of the default mapping element.</para>
    </refsection>

    <refsection>
      <title>Section attribute</title>
      <para>An attribute section is more or less the same as a node section. Instead of redefining the mapping of a HTML element to a DocBook element, the mapping for an attribute is changed. The following section maps an attribute <code>class='procedure'</code> to <code>role='procedure'</code>.
      <programlisting>
attribute "//@class[contains(., 'procedure')]" {
	map-to = "role";
}
      </programlisting>
      </para>
    </refsection>

    <refsection>
      <title>Section section-detection</title>

      <para>The section <varname>section-detection</varname> is used to detect
      section elements in HTML code and to strip off any numbering prefix from
      the titles.</para>

      <para>Many authoring tools allow deeply nested sections. While exporting
      HTML, it happens, that the nesting becomes deeper than six levels. HTML
      provides header elements for up to six levels, h1-h6, but no h7 or even
      more. At this point, the formatting is normally done with the help of
      CSS and div or p elements. herold is able to detect the header element
      of HTML, but it can not know about the export format of a specific tool.
      To solve this problem even for some cases, you can specify the parameter
      <varname>attribute-class</varname>. It consists of a list of regular
      expressions, which are matched against the class attribute of each HTML
      element. If a match is found, the element is considered as a section
      element. The regular expression can have group, which is interpreted as
      level indicator. The group must be the first group and it must match
      against a number, e.g. <literal>^heading(\d+)$</literal>. If the level
      can not be detected, a level of seven is assumed.</para>

      <para>Because DocBook XSL stylesheets take care of the section numbering
      while transforming the DocBook XML to a specific output, it is often
      necessary to strip the numbering already defined in the HTML page.
      Otherwise you end up with two numbering texts in front of your titles.
      To help herold with the detection of numbering patterns, use the
      parameter <varname>section-numbering-pattern</varname>.</para>

      <para><variablelist>
          <varlistentry>
            <term>attribute-class</term>
            <listitem>
              <para>A regular expression, which is applied to every p and div
              element. If the expression matches, the current element is
              handled as a section element. If the regular expression has
              groups, the first group will be used as nesting level, otherwise
              level seven is assumed.</para>
            </listitem>
          </varlistentry>

          <varlistentry>
            <term>section-numbering-pattern</term>
            <listitem>
              <para>Normally you want to get rid of the section numbering that
              comes with the HTML data, because it becomes part of the title
              text in DocBook. The section numbers will the appear twice in
              your target media. One from HTML and one from the DocBook XSL
              processing. The parameter section-numbering-pattern defines a
              regular expression, which is matched against the beginning of
              every section title. If it matches, the matching part is
              removed.</para>
            </listitem>
          </varlistentry>
        </variablelist></para>
    </refsection>

    <refsection>
      <title>Section list-detection</title>
      <para>Sometimes lists are not represented with ul, ol or dl tags, but
      they are represented as p tags with additional css formatting. If you
      use a tool, which creates or exports HTML with such a construct, the
      conversion will end up with para elements, instead of the corresponding
      list elements in DocBook. To recreate the lists in some cases, you can
      use the section <varname>list-detection</varname>. The parameters
      <varname>itemized-attribute-class</varname> and
      <varname>ordered-attribute-class</varname> let you define lists of
      regular expression, which should match against the class attribute of
      listitem elements in the HTML. herold tries to rebuild the proper list
      structure from this information, even for nested lists.</para>
    </refsection>
  </refsection>

  <refsection>
    <title>Copyright</title>
    <para>Copyright 2001-2015 Michael Fuchs. License GPLv3+: GNU GPL version 3
    or later <link
    xlink:href="http://gnu.org/licenses/gpl.html">http://gnu.org/licenses/gpl.html</link>.
    This is free software: you are free to change and redistribute it. There
    is NO WARRANTY, to the extent permitted by law.</para>
  </refsection>
</refentry>