File: architecture.xml

package info (click to toggle)
idzebra 2.2.8-2
links: PTS, VCS
area: main
in suites: forky, sid
size: 10,572 kB
sloc: ansic: 54,389; xml: 27,058; sh: 5,892; makefile: 1,102; perl: 210; tcl: 64
file content (629 lines) | stat: -rw-r--r-- 22,806 bytes
parent folder | download | duplicates (5)
 <chapter id="architecture">
  <title>Overview of &zebra; Architecture</title>

  <section id="architecture-representation">
   <title>Local Representation</title>

   <para>
    As mentioned earlier, &zebra; places few restrictions on the type of
    data that you can index and manage. Generally, whatever the form of
    the data, it is parsed by an input filter specific to that format, and
    turned into an internal structure that &zebra; knows how to handle. This
    process takes place whenever the record is accessed - for indexing and
    retrieval.
   </para>

   <para>
    The RecordType parameter in the <literal>zebra.cfg</literal> file, or
    the <literal>-t</literal> option to the indexer tells &zebra; how to
    process input records.
    Two basic types of processing are available - raw text and structured
    data. Raw text is just that, and it is selected by providing the
    argument <emphasis>text</emphasis> to &zebra;. Structured records are
    all handled internally using the basic mechanisms described in the
    subsequent sections.
    &zebra; can read structured records in many different formats.
    <!--
    How this is done is governed by additional parameters after the
    "grs" keyword, separated by "." characters.
    -->
   </para>
  </section>

  <section id="architecture-maincomponents">
   <title>Main Components</title>
   <para>
    The &zebra; system is designed to support a wide range of data management
    applications. The system can be configured to handle virtually any
    kind of structured data. Each record in the system is associated with
    a <emphasis>record schema</emphasis> which lends context to the data
    elements of the record.
    Any number of record schemas can coexist in the system.
    Although it may be wise to use only a single schema within
    one database, the system poses no such restrictions.
   </para>
   <para>
    The &zebra; indexer and information retrieval server consists of the
    following main applications: the <command>zebraidx</command>
    indexing maintenance utility, and the <command>zebrasrv</command>
    information query and retrieval server. Both are using some of the
    same main components, which are presented here.
   </para>
   <para>
    The virtual Debian package <literal>idzebra-2.0</literal>
    installs all the necessary packages to start
    working with &zebra; - including utility programs, development libraries,
    documentation and modules.
  </para>

   <section id="componentcore">
    <title>Core &zebra; Libraries Containing Common Functionality</title>
    <para>
     The core &zebra; module is the meat of the <command>zebraidx</command>
    indexing maintenance utility, and the <command>zebrasrv</command>
    information query and retrieval server binaries. Shortly, the core
    libraries are responsible for
     <variablelist>
      <varlistentry>
       <term>Dynamic Loading</term>
       <listitem>
        <para>of external filter modules, in case the application is
        not compiled statically. These filter modules define indexing,
        search and retrieval capabilities of the various input formats.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Index Maintenance</term>
       <listitem>
        <para> &zebra; maintains Term Dictionaries and ISAM index
        entries in inverted index structures kept on disk. These are
        optimized for fast inset, update and delete, as well as good
        search performance.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Search Evaluation</term>
       <listitem>
        <para>by execution of search requests expressed in &acro.pqf;/&acro.rpn;
         data structures, which are handed over from
         the &yaz; server frontend &acro.api;. Search evaluation includes
         construction of hit lists according to boolean combinations
         of simpler searches. Fast performance is achieved by careful
         use of index structures, and by evaluation specific index hit
         lists in correct order.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Ranking and Sorting</term>
       <listitem>
        <para>
         components call resorting/re-ranking algorithms on the hit
         sets. These might also be pre-sorted not only using the
         assigned document ID's, but also using assigned static rank
         information.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Record Presentation</term>
       <listitem>
        <para>returns - possibly ranked - result sets, hit
         numbers, and the like internal data to the &yaz; server backend &acro.api;
         for shipping to the client. Each individual filter module
         implements it's own specific presentation formats.
        </para>
       </listitem>
      </varlistentry>
     </variablelist>
     </para>
    <para>
     The Debian package <literal>libidzebra-2.0</literal>
     contains all run-time libraries for &zebra;, the
     documentation in PDF and HTML is found in
     <literal>idzebra-2.0-doc</literal>, and
     <literal>idzebra-2.0-common</literal>
     includes common essential &zebra; configuration files.
    </para>
   </section>


   <section id="componentindexer">
    <title>&zebra; Indexer</title>
    <para>
     The  <command>zebraidx</command>
     indexing maintenance utility
     loads external filter modules used for indexing data records of
     different type, and creates, updates and drops databases and
     indexes according to the rules defined in the filter modules.
    </para>
    <para>
     The Debian  package <literal>idzebra-2.0-utils</literal> contains
     the  <command>zebraidx</command> utility.
    </para>
   </section>

   <section id="componentsearcher">
    <title>&zebra; Searcher/Retriever</title>
    <para>
     This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
     glues together the core libraries and the filter modules to one
     great Information Retrieval server application.
    </para>
    <para>
     The Debian  package <literal>idzebra-2.0-utils</literal> contains
     the  <command>zebrasrv</command> utility.
    </para>
   </section>

   <section id="componentyazserver">
    <title>&yaz; Server Frontend</title>
    <para>
     The &yaz; server frontend is
     a full fledged stateful &acro.z3950; server taking client
     connections, and forwarding search and scan requests to the
     &zebra; core indexer.
    </para>
    <para>
     In addition to &acro.z3950; requests, the &yaz; server frontend acts
     as HTTP server, honoring
      <ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
     requests, and
     &acro.sru; &acro.rest;
     requests. Moreover, it can
     translate incoming
     <ulink url="&url.cql;">&acro.cql;</ulink>
     queries to
     <ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
      queries, if
     correctly configured.
    </para>
    <para>
     <ulink url="&url.yaz;">&yaz;</ulink>
     is an Open Source
     toolkit that allows you to develop software using the
     &acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
     It is packaged in the Debian packages
     <literal>yaz</literal> and <literal>libyaz</literal>.
    </para>
   </section>

   <section id="componentmodules">
    <title>Record Models and Filter Modules</title>
    <para>
     The hard work of knowing <emphasis>what</emphasis> to index,
     <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
     part of the records to send in a search/retrieve response is
     implemented in
     various filter modules. It is their responsibility to define the
     exact indexing and record display filtering rules.
     </para>
     <para>
     The virtual Debian package
     <literal>libidzebra-2.0-modules</literal> installs all base filter
     modules.
    </para>

   <section id="componentmodulesdom">
    <title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
     <para>
      The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
      internal data model, and can thus parse, index, and display
      any &acro.xml; document.
    </para>
    <para>
      A parser for binary &acro.marc; records based on the ISO2709 library
      standard is provided, it transforms these to the internal
      &acro.marcxml; &acro.dom; representation.
    </para>
    <para>
      The internal &acro.dom; &acro.xml; representation can be fed into four
      different pipelines, consisting of arbitrarily many successive
      &acro.xslt; transformations; these are for
     <itemizedlist>
       <listitem><para>input parsing and initial
          transformations,</para></listitem>
       <listitem><para>indexing term extraction
          transformations</para></listitem>
       <listitem><para>transformations before internal document
          storage, and </para></listitem>
       <listitem><para>retrieve transformations from storage to output
          format</para></listitem>
      </itemizedlist>
    </para>
    <para>
      The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and if  supported on
      your platform, even &acro.exslt;), it brings thus full &acro.xpath;
      support to the indexing, storage and display rules of not only
      &acro.xml; documents, but also binary &acro.marc; records.
    </para>
    <para>
      Finally, the &acro.dom; &acro.xml; filter allows for static ranking at index
      time, and to to sort hit lists according to predefined
      static ranks.
    </para>
    <para>
      Details on the experimental &acro.dom; &acro.xml; filter are found in
      <xref linkend="record-model-domxml"/>.
      </para>
     <para>
      The Debian package <literal>libidzebra-2.0-mod-dom</literal>
      contains the &acro.dom; filter module.
     </para>
    </section>

   <section id="componentmodulesalvis">
    <title>ALVIS &acro.xml; Record Model and Filter Module</title>
     <note>
      <para>
        The functionality of this record model has been improved and
        replaced by the &acro.dom; &acro.xml; record model. See
        <xref linkend="componentmodulesdom"/>.
      </para>
     </note>

     <para>
      The Alvis filter for &acro.xml; files is an &acro.xslt; based input
      filter.
      It indexes element and attribute content of any thinkable &acro.xml; format
      using full &acro.xpath; support, a feature which the standard &zebra;
      &acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
      parsed into a standard &acro.xml; &acro.dom; tree, which restricts record size
      according to availability of memory.
    </para>
    <para>
      The Alvis filter
      uses &acro.xslt; display stylesheets, which let
      the &zebra; DB administrator associate multiple, different views on
      the same &acro.xml; document type. These views are chosen on-the-fly in
      search time.
     </para>
    <para>
      In addition, the Alvis filter configuration is not bound to the
      arcane  &acro.bib1; &acro.z3950; library catalogue indexing traditions and
      folklore, and is therefore easier to understand.
    </para>
    <para>
      Finally, the Alvis  filter allows for static ranking at index
      time, and to to sort hit lists according to predefined
      static ranks. This imposes no overhead at all, both
      search and indexing perform still
      <emphasis>O(1)</emphasis> irrespectively of document
      collection size. This feature resembles Google's pre-ranking using
      their PageRank algorithm.
    </para>
    <para>
      Details on the experimental Alvis &acro.xslt; filter are found in
      <xref linkend="record-model-alvisxslt"/>.
      </para>
     <para>
      The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
      contains the Alvis filter module.
     </para>
    </section>

   <section id="componentmodulesgrs">
    <title>&acro.grs1; Record Model and Filter Modules</title>
     <note>
      <para>
        The functionality of this record model has been improved and
        replaced by the &acro.dom; &acro.xml; record model. See
        <xref linkend="componentmodulesdom"/>.
      </para>
     </note>
    <para>
    The &acro.grs1; filter modules described in
    <xref linkend="grs"/>
    are all based on the &acro.z3950; specifications, and it is absolutely
    mandatory to have the reference pages on &acro.bib1; attribute sets on
    you hand when configuring &acro.grs1; filters. The GRS filters come in
    different flavors, and a short introduction is needed here.
    &acro.grs1; filters of various kind have also been called ABS filters due
    to the <filename>*.abs</filename> configuration file suffix.
    </para>
    <para>
      The <emphasis>grs.marc</emphasis> and
      <emphasis>grs.marcxml</emphasis> filters are suited to parse and
      index binary and &acro.xml; versions of traditional library &acro.marc; records
      based on the ISO2709 standard. The Debian package for both
      filters is
     <literal>libidzebra-2.0-mod-grs-marc</literal>.
    </para>
    <para>
      &acro.grs1; TCL scriptable filters for extensive user configuration come
     in two flavors: a regular expression filter
     <emphasis>grs.regx</emphasis> using TCL regular expressions, and
     a general scriptable TCL filter called
     <emphasis>grs.tcl</emphasis>
     are both included in the
     <literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
    </para>
    <para>
      A general purpose &acro.sgml; filter is called
     <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
     but planned to be in the
     <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
    </para>
    <para>
      The Debian  package
      <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
      <emphasis>grs.xml</emphasis> filter which uses <ulink
      url="&url.expat;">Expat</ulink> to
      parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
      trees. Have also a look at the Alvis &acro.xml;/&acro.xslt; filter described in
      the next session.
    </para>
   </section>

   <section id="componentmodulestext">
    <title>TEXT Record Model and Filter Module</title>
    <para>
      Plain ASCII text filter. TODO: add information here.
    </para>
   </section>

    <!--
   <section id="componentmodulessafari">
    <title>SAFARI Record Model and Filter Module</title>
    <para>
     SAFARI filter module TODO: add information here.
    </para>
   </section>
    -->

   </section>

  </section>


  <section id="architecture-workflow">
   <title>Indexing and Retrieval Workflow</title>

  <para>
   Records pass through three different states during processing in the
   system.
  </para>

  <para>

   <itemizedlist>
    <listitem>

     <para>
      When records are accessed by the system, they are represented
      in their local, or native format. This might be &acro.sgml; or HTML files,
      News or Mail archives, &acro.marc; records. If the system doesn't already
      know how to read the type of data you need to store, you can set up an
      input filter by preparing conversion rules based on regular
      expressions and possibly augmented by a flexible scripting language
      (Tcl).
      The input filter produces as output an internal representation,
      a tree structure.

     </para>
    </listitem>
    <listitem>

     <para>
      When records are processed by the system, they are represented
      in a tree-structure, constructed by tagged data elements hanging off a
      root node. The tagged elements may contain data or yet more tagged
      elements in a recursive structure. The system performs various
      actions on this tree structure (indexing, element selection, schema
      mapping, etc.),

     </para>
    </listitem>
    <listitem>

     <para>
      Before transmitting records to the client, they are first
      converted from the internal structure to a form suitable for exchange
      over the network - according to the &acro.z3950; standard.
     </para>
    </listitem>

   </itemizedlist>

  </para>
  </section>

  <section id="special-retrieval">
   <title>Retrieval of &zebra; internal record data</title>
   <para>
    Starting with <literal>&zebra;</literal> version 2.0.5 or newer, it is
    possible to use a special element set which has the prefix
    <literal>zebra::</literal>.
   </para>
   <para>
    Using this element will, regardless of record type, return
    &zebra;'s internal index structure/data for a record.
    In particular, the regular record filters are not invoked when
    these are in use.
    This can in some cases make the retrieval faster than regular
    retrieval operations (for &acro.marc;, &acro.xml; etc).
   </para>
   <table id="special-retrieval-types">
    <title>Special Retrieval Elements</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Element Set</entry>
       <entry>Description</entry>
       <entry>Syntax</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry><literal>zebra::meta::sysno</literal></entry>
       <entry>Get &zebra; record system ID</entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry><literal>zebra::data</literal></entry>
       <entry>Get raw record</entry>
       <entry>all</entry>
      </row>
      <row>
       <entry><literal>zebra::meta</literal></entry>
       <entry>Get &zebra; record internal metadata</entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry><literal>zebra::index</literal></entry>
       <entry>Get all indexed keys for record</entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::index::</literal><replaceable>f</replaceable>
       </entry>
       <entry>
	Get indexed keys for field <replaceable>f</replaceable>	for record
       </entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
       </entry>
       <entry>
	Get indexed keys for field <replaceable>f</replaceable>
	  and type <replaceable>t</replaceable> for record
       </entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::snippet</literal>
       </entry>
       <entry>
	Get snippet for record for one or more indexes (f1,f2,..).
	This includes a phrase from the original
	record at the point where a match occurs (for a query). By default
	give terms before - and after are included in the snippet. The
	matching terms are enclosed within element
	<literal>&lt;s&gt;</literal>. The snippet facility requires
	Zebra 2.0.16 or later.
       </entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::facet::</literal><replaceable>f1</replaceable>:<replaceable>t1</replaceable>,<replaceable>f2</replaceable>:<replaceable>t2</replaceable>,..
       </entry>
       <entry>
	Get facet of a result set. The facet result is returned
	as if it was a normal record, while in reality is a
	recap of most "important" terms in a result set for the fields
	given.
	The facet facility first appeared in Zebra 2.0.20.
       </entry>
       <entry>&acro.xml;</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
   <para>
    For example, to fetch the raw binary record data stored in the
    zebra internal storage, or on the filesystem, the following
    commands can be issued:
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::data
      Z> s 1+1
      Z> format sutrs
      Z> s 1+1
      Z> format usmarc
      Z> s 1+1
    </screen>
    </para>
   <para>
    The special
    <literal>zebra::data</literal> element set name is
    defined for any record syntax, but will always fetch
    the raw record data in exactly the original form. No record syntax
    specific transformations will be applied to the raw record data.
   </para>
   <para>
    Also, &zebra; internal metadata about the record can be accessed:
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::meta::sysno
      Z> s 1+1
    </screen>
    displays in <literal>&acro.xml;</literal> record syntax only internal
    record system number, whereas
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::meta
      Z> s 1+1
    </screen>
    displays all available metadata on the record. These include system
    number, database name,  indexed filename,  filter used for indexing,
    score and static ranking information and finally bytesize of record.
   </para>
   <para>
    Sometimes, it is very hard to figure out what exactly has been
    indexed how and in which indexes. Using the indexing stylesheet of
    the Alvis filter, one can at least see which portion of the record
    went into which index, but a similar aid does not exist for all
    other indexing filters.
   </para>
   <para>
    The special
    <literal>zebra::index</literal> element set names are provided to
    access information on per record indexed fields. For example, the
    queries
    <screen>
      Z> f @attr 1=title my
      Z> format sutrs
      Z> elements zebra::index
      Z> s 1+1
    </screen>
    will display all indexed tokens from all indexed fields of the
    first record, and it will display in <literal>&acro.sutrs;</literal>
    record syntax, whereas
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::index::title
      Z> s 1+1
      Z> elements zebra::index::title:p
      Z> s 1+1
    </screen>
    displays in <literal>&acro.xml;</literal> record syntax only the content
      of the zebra string index <literal>title</literal>, or
      even only the type <literal>p</literal> phrase indexed part of it.
   </para>
   <note>
    <para>
     Trying to access numeric <literal>&acro.bib1;</literal> use
     attributes or trying to access non-existent zebra intern string
     access points will result in a Diagnostic 25: Specified element set
     'name not valid for specified database.
    </para>
   </note>
  </section>

 </chapter>

 <!-- Keep this comment at the end of the file
 Local variables:
 mode: sgml
 sgml-omittag:t
 sgml-shorttag:t
 sgml-minimize-attributes:nil
 sgml-always-quote-attributes:t
 sgml-indent-step:1
 sgml-indent-data:t
 sgml-parent-document: "idzebra.xml"
 sgml-local-catalogs: nil
 sgml-namecase-general:t
 End:
 -->