<html><head><meta charset="ISO-8859-1"><title>2.Main Components</title><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot"><link rel="home" href="index.html" title="Zebra - User's Guide and Reference"><link rel="up" href="architecture.html" title="Chapter4.Overview of Zebra Architecture"><link rel="prev" href="architecture.html" title="Chapter4.Overview of Zebra Architecture"><link rel="next" href="architecture-workflow.html" title="3.Indexing and Retrieval Workflow"></head><body><link rel="stylesheet" type="text/css" href="common/style1.css"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">2.Main Components</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="architecture.html">Prev</a></td><th width="60%" align="center">Chapter4.Overview of <span class="application">Zebra</span> Architecture</th><td width="20%" align="right"><a accesskey="n" href="architecture-workflow.html">Next</a></td></tr></table><hr></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="architecture-maincomponents"></a>2.Main Components</h2></div></div></div><p>
The <span class="application">Zebra</span> system is designed to support a wide range of data management
applications. The system can be configured to handle virtually any
kind of structured data. Each record in the system is associated with
a <span class="emphasis"><em>record schema</em></span> which lends context to the data
elements of the record.
Any number of record schemas can coexist in the system.
Although it may be wise to use only a single schema within
one database, the system imposes no such restriction.
</p><p>
The <span class="application">Zebra</span> indexer and information retrieval server consists of the
following main applications: the <span class="command"><strong>zebraidx</strong></span>
indexing maintenance utility, and the <span class="command"><strong>zebrasrv</strong></span>
information query and retrieval server. Both share some of the
same main components, which are presented here.
</p><p>
The virtual Debian package <code class="literal">idzebra-2.0</code>
installs all the necessary packages to start
working with <span class="application">Zebra</span> - including utility programs, development libraries,
documentation and modules.
</p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentcore"></a>2.1.Core <span class="application">Zebra</span> Libraries Containing Common Functionality</h3></div></div></div><p>
The core <span class="application">Zebra</span> module is the meat of the <span class="command"><strong>zebraidx</strong></span>
indexing maintenance utility, and the <span class="command"><strong>zebrasrv</strong></span>
information query and retrieval server binaries. In short, the core
libraries are responsible for
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">Dynamic Loading</span></dt><dd><p>of external filter modules, in case the application is
not compiled statically. These filter modules define indexing,
search and retrieval capabilities of the various input formats.
</p></dd><dt><span class="term">Index Maintenance</span></dt><dd><p> <span class="application">Zebra</span> maintains Term Dictionaries and ISAM index
entries in inverted index structures kept on disk. These are
optimized for fast insert, update and delete, as well as good
search performance.
</p></dd><dt><span class="term">Search Evaluation</span></dt><dd><p>by execution of search requests expressed in <acronym class="acronym">PQF</acronym>/<acronym class="acronym">RPN</acronym>
data structures, which are handed over from
the <span class="application">YAZ</span> server frontend <acronym class="acronym">API</acronym>. Search evaluation includes
construction of hit lists according to boolean combinations
of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
lists in the correct order.
</p></dd><dt><span class="term">Ranking and Sorting</span></dt><dd><p>
components call re-sorting/re-ranking algorithms on the hit
sets. These might also be pre-sorted using not only the
assigned document IDs, but also assigned static rank
information.
</p></dd><dt><span class="term">Record Presentation</span></dt><dd><p>returns - possibly ranked - result sets, hit
numbers, and similar internal data to the <span class="application">YAZ</span> server backend <acronym class="acronym">API</acronym>
for shipping to the client. Each individual filter module
implements its own specific presentation formats.
</p></dd></dl></div><p>
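The search evaluation and static ranking responsibilities listed above can be
pictured with a minimal Python sketch. It is not <span class="application">Zebra</span>'s implementation:
the index contents, the names <code class="literal">INDEX</code>,
<code class="literal">STATIC_RANK</code>, <code class="literal">evaluate</code> and
<code class="literal">ranked</code>, and the convention that a lower static rank sorts
first are all assumptions made for illustration.
</p><pre class="programlisting">
# Toy inverted index mapping each term to a sorted list of document IDs
# (hypothetical data, not taken from Zebra).
INDEX = {
    "water": [1, 3, 4, 7],
    "quality": [3, 4, 9],
    "marine": [4, 7],
}

# Hypothetical static ranks assigned at index time. In this sketch a lower
# number is assumed to mean a better rank.
STATIC_RANK = {1: 5, 3: 2, 4: 9, 7: 1, 9: 3}


def evaluate(node):
    """Evaluate a query tree and return the set of matching document IDs.

    A node is either a term (str) or a tuple (operator, left, right),
    where operator is "and", "or" or "not" (meaning "left and not right").
    """
    if isinstance(node, str):
        return set(INDEX.get(node, []))
    operator, left, right = node
    left_hits, right_hits = evaluate(left), evaluate(right)
    if operator == "and":
        return left_hits.intersection(right_hits)
    if operator == "or":
        return left_hits.union(right_hits)
    if operator == "not":
        return left_hits.difference(right_hits)
    raise ValueError("unknown operator: " + operator)


def ranked(hits):
    """Order a hit set by static rank first, then by document ID."""
    return sorted(hits, key=lambda doc_id: (STATIC_RANK[doc_id], doc_id))


query = ("and", "water", ("or", "quality", "marine"))
hits = evaluate(query)
print(sorted(hits))   # hit list in document ID order
print(ranked(hits))   # same hits, ordered by static rank
</pre><p>
The sketch uses in-memory sets for brevity; an on-disk inverted index would
typically merge sorted hit lists instead of materializing sets.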
</p><p>
The Debian package <code class="literal">libidzebra-2.0</code>
contains all run-time libraries for <span class="application">Zebra</span>, the
documentation in PDF and HTML is found in
<code class="literal">idzebra-2.0-doc</code>, and
<code class="literal">idzebra-2.0-common</code>
includes common essential <span class="application">Zebra</span> configuration files.
</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentindexer"></a>2.2.<span class="application">Zebra</span> Indexer</h3></div></div></div><p>
The <span class="command"><strong>zebraidx</strong></span>
indexing maintenance utility
loads external filter modules used for indexing data records of
different types, and creates, updates and drops databases and
indexes according to the rules defined in the filter modules.
</p><p>
The Debian package <code class="literal">idzebra-2.0-utils</code> contains
the <span class="command"><strong>zebraidx</strong></span> utility.
</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentsearcher"></a>2.3.<span class="application">Zebra</span> Searcher/Retriever</h3></div></div></div><p>
This is the executable which runs the <acronym class="acronym">Z39.50</acronym>/<acronym class="acronym">SRU</acronym>/<acronym class="acronym">SRW</acronym> server and
glues together the core libraries and the filter modules into one
Information Retrieval server application.
</p><p>
The Debian package <code class="literal">idzebra-2.0-utils</code> contains
the <span class="command"><strong>zebrasrv</strong></span> utility.
</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentyazserver"></a>2.4.<span class="application">YAZ</span> Server Frontend</h3></div></div></div><p>
The <span class="application">YAZ</span> server frontend is
a full-fledged stateful <acronym class="acronym">Z39.50</acronym> server taking client
connections, and forwarding search and scan requests to the
<span class="application">Zebra</span> core indexer.
</p><p>
In addition to <acronym class="acronym">Z39.50</acronym> requests, the <span class="application">YAZ</span> server frontend acts
as an HTTP server, honoring
<a class="ulink" href="https://www.loc.gov/standards/sru/" target="_top"><acronym class="acronym">SRU</acronym> <acronym class="acronym">SOAP</acronym></a>
requests, and
<acronym class="acronym">SRU</acronym> <acronym class="acronym">REST</acronym>
requests. Moreover, it can
translate incoming
<a class="ulink" href="https://www.loc.gov/standards/sru/cql/" target="_top"><acronym class="acronym">CQL</acronym></a>
queries to
<a class="ulink" href="https://www.indexdata.com/yaz/doc/tools.html#PQF" target="_top"><acronym class="acronym">PQF</acronym></a>
queries, if
correctly configured.
</p><p>
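To make the idea of translating <acronym class="acronym">CQL</acronym> to <acronym class="acronym">PQF</acronym> concrete, here is a toy
Python sketch. It is not the <span class="application">YAZ</span> implementation and handles only
single-word terms joined by <code class="literal">and</code>. The function names and the
mapping table are hypothetical; the attribute numbers are the usual <acronym class="acronym">BIB-1</acronym>
use attributes for title, author and subject heading.
</p><pre class="programlisting">
# Hypothetical mapping from CQL index names to BIB-1 use attributes.
# The numbers are standard BIB-1 use attributes (4 = title, 1003 = author,
# 21 = subject heading); the table itself is made up for this sketch.
USE_ATTRIBUTES = {
    "dc.title": 4,
    "dc.creator": 1003,
    "dc.subject": 21,
}


def clause_to_pqf(clause):
    """Translate one 'index = term' clause into a PQF attribute/term pair."""
    index, term = (part.strip() for part in clause.split("=", 1))
    return "@attr 1={} {}".format(USE_ATTRIBUTES[index], term)


def cql_to_pqf(query):
    """Translate 'clause and clause and ...' into prefix-notation PQF.

    Only single-word terms and the 'and' operator are handled.
    """
    clauses = [clause_to_pqf(c) for c in query.split(" and ")]
    pqf = clauses[0]
    for clause in clauses[1:]:
        # PQF is prefix notation: the operator precedes both operands.
        pqf = "@and {} {}".format(pqf, clause)
    return pqf


print(cql_to_pqf("dc.title = water and dc.creator = smith"))
</pre><p>
The call above prints <code class="literal">@and @attr 1=4 water @attr 1=1003 smith</code>.
A real deployment relies on the server's own configuration for this mapping
rather than on hand-written code.
</p><p>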
<a class="ulink" href="https://www.indexdata.com/yaz" target="_top"><span class="application">YAZ</span></a>
is an Open Source
toolkit that allows you to develop software using the
<acronym class="acronym">ANSI</acronym> <acronym class="acronym">Z39.50</acronym>/ISO23950 standard for information retrieval.
It is packaged in the Debian packages
<code class="literal">yaz</code> and <code class="literal">libyaz</code>.
</p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentmodules"></a>2.5.Record Models and Filter Modules</h3></div></div></div><p>
The hard work of knowing <span class="emphasis"><em>what</em></span> to index,
<span class="emphasis"><em>how</em></span> to do it, and <span class="emphasis"><em>which</em></span>
part of the records to send in a search/retrieve response is
implemented in
various filter modules. It is their responsibility to define the
exact indexing and record display filtering rules.
</p><p>
The virtual Debian package
<code class="literal">libidzebra-2.0-modules</code> installs all base filter
modules.
</p><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulesdom"></a>2.5.1.<acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module</h4></div></div></div><p>
The <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter uses a standard <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> structure as
internal data model, and can thus parse, index, and display
any <acronym class="acronym">XML</acronym> document.
</p><p>
A parser for binary <acronym class="acronym">MARC</acronym> records based on the ISO2709 library
standard is provided; it transforms these to the internal
<acronym class="acronym">MARCXML</acronym> <acronym class="acronym">DOM</acronym> representation.
</p><p>
The internal <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> representation can be fed into four
different pipelines, consisting of arbitrarily many successive
<acronym class="acronym">XSLT</acronym> transformations; these are for
</p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>input parsing and initial
transformations,</p></li><li class="listitem"><p>indexing term extraction
transformations</p></li><li class="listitem"><p>transformations before internal document
storage, and </p></li><li class="listitem"><p>retrieve transformations from storage to output
format</p></li></ul></div><p>
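The four pipelines listed above can be pictured as ordered lists of
transformations applied one after another. The following Python sketch is
only an analogy: plain functions stand in for <acronym class="acronym">XSLT</acronym> stylesheets, and the
pipeline labels, function names and sample record are all hypothetical.
</p><pre class="programlisting">
import xml.etree.ElementTree as ET
from functools import reduce


def run_pipeline(steps, document):
    """Apply each transformation in order, feeding its output to the next."""
    return reduce(lambda doc, step: step(doc), steps, document)


# Hypothetical transformations; each takes and returns an Element.
def strip_whitespace(root):
    """Trim leading and trailing whitespace from all element text."""
    for element in root.iter():
        if element.text:
            element.text = element.text.strip()
    return root


def extract_terms(root):
    """Build an index document holding one term element per title word."""
    index = ET.Element("index")
    for title in root.iter("title"):
        for word in (title.text or "").lower().split():
            ET.SubElement(index, "term", name="title").text = word
    return index


def to_brief(root):
    """Build a brief display record containing only the title."""
    brief = ET.Element("brief")
    brief.text = root.findtext("title")
    return brief


# One list of steps per pipeline; an empty list leaves the record unchanged.
PIPELINES = {
    "input": [strip_whitespace],
    "extract": [extract_terms],
    "store": [],
    "retrieve": [to_brief],
}

record = ET.Element("record")
ET.SubElement(record, "title").text = "  Water Quality  "

parsed = run_pipeline(PIPELINES["input"], record)
stored = run_pipeline(PIPELINES["store"], parsed)
terms = run_pipeline(PIPELINES["extract"], parsed)
shown = run_pipeline(PIPELINES["retrieve"], stored)

print([term.text for term in terms.iter("term")])
print(shown.text)
</pre><p>
Each pipeline may hold any number of steps, which mirrors the
"arbitrarily many successive transformations" of the real filter.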
</p><p>
The <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter pipelines use <acronym class="acronym">XSLT</acronym> (and, if supported on
your platform, even <acronym class="acronym">EXSLT</acronym>), and thus bring full <acronym class="acronym">XPATH</acronym>
support to the indexing, storage and display rules of not only
<acronym class="acronym">XML</acronym> documents, but also binary <acronym class="acronym">MARC</acronym> records.
</p><p>
Finally, the <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks.
</p><p>
Details on the experimental <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter are found in
<a class="xref" href="record-model-domxml.html" title="Chapter7.DOM XML Record Model and Filter Module">Chapter7, <i><acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module</i></a>.
</p><p>
The Debian package <code class="literal">libidzebra-2.0-mod-dom</code>
contains the <acronym class="acronym">DOM</acronym> filter module.
</p></div><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulesalvis"></a>2.5.2.ALVIS <acronym class="acronym">XML</acronym> Record Model and Filter Module</h4></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>
The functionality of this record model has been improved and
replaced by the <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> record model. See
<a class="xref" href="architecture-maincomponents.html#componentmodulesdom" title="2.5.1.DOM XML Record Model and Filter Module">Section2.5.1, “<acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module”</a>.
</p></div><p>
The Alvis filter for <acronym class="acronym">XML</acronym> files is an <acronym class="acronym">XSLT</acronym> based input
filter.
It indexes element and attribute content of any conceivable <acronym class="acronym">XML</acronym> format
using full <acronym class="acronym">XPATH</acronym> support, a feature which the standard <span class="application">Zebra</span>
<acronym class="acronym">GRS-1</acronym> <acronym class="acronym">SGML</acronym> and <acronym class="acronym">XML</acronym> filters lacked. The indexed documents are
parsed into a standard <acronym class="acronym">XML</acronym> <acronym class="acronym">DOM</acronym> tree, which restricts record size
according to availability of memory.
</p><p>
The Alvis filter
uses <acronym class="acronym">XSLT</acronym> display stylesheets, which let
the <span class="application">Zebra</span> DB administrator associate multiple, different views on
the same <acronym class="acronym">XML</acronym> document type. These views are chosen on the fly at
search time.
</p><p>
In addition, the Alvis filter configuration is not bound to the
arcane <acronym class="acronym">BIB-1</acronym> <acronym class="acronym">Z39.50</acronym> library catalogue indexing traditions and
folklore, and is therefore easier to understand.
</p><p>
Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all; both
search and indexing still perform
<span class="emphasis"><em>O(1)</em></span> irrespective of document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
</p><p>
Details on the experimental Alvis <acronym class="acronym">XSLT</acronym> filter are found in
<a class="xref" href="record-model-alvisxslt.html" title="Chapter8.ALVIS XML Record Model and Filter Module">Chapter8, <i>ALVIS <acronym class="acronym">XML</acronym> Record Model and Filter Module</i></a>.
</p><p>
The Debian package <code class="literal">libidzebra-2.0-mod-alvis</code>
contains the Alvis filter module.
</p></div><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulesgrs"></a>2.5.3.<acronym class="acronym">GRS-1</acronym> Record Model and Filter Modules</h4></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>
The functionality of this record model has been improved and
replaced by the <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> record model. See
<a class="xref" href="architecture-maincomponents.html#componentmodulesdom" title="2.5.1.DOM XML Record Model and Filter Module">Section2.5.1, “<acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module”</a>.
</p></div><p>
The <acronym class="acronym">GRS-1</acronym> filter modules described in
<a class="xref" href="grs.html" title="Chapter9.GRS-1 Record Model and Filter Modules">Chapter9, <i><acronym class="acronym">GRS-1</acronym> Record Model and Filter Modules</i></a>
are all based on the <acronym class="acronym">Z39.50</acronym> specifications, and it is absolutely
mandatory to have the reference pages on <acronym class="acronym">BIB-1</acronym> attribute sets
at hand when configuring <acronym class="acronym">GRS-1</acronym> filters. The GRS filters come in
different flavors, and a short introduction is needed here.
<acronym class="acronym">GRS-1</acronym> filters of various kinds have also been called ABS filters due
to the <code class="filename">*.abs</code> configuration file suffix.
</p><p>
The <span class="emphasis"><em>grs.marc</em></span> and
<span class="emphasis"><em>grs.marcxml</em></span> filters are suited to parse and
index binary and <acronym class="acronym">XML</acronym> versions of traditional library <acronym class="acronym">MARC</acronym> records
based on the ISO2709 standard. The Debian package for both
filters is
<code class="literal">libidzebra-2.0-mod-grs-marc</code>.
</p><p>
<acronym class="acronym">GRS-1</acronym> TCL scriptable filters for extensive user configuration come
in two flavors: a regular expression filter
<span class="emphasis"><em>grs.regx</em></span> using TCL regular expressions, and
a general scriptable TCL filter called
<span class="emphasis"><em>grs.tcl</em></span>.
Both are included in the
<code class="literal">libidzebra-2.0-mod-grs-regx</code> Debian package.
</p><p>
A general purpose <acronym class="acronym">SGML</acronym> filter is called
<span class="emphasis"><em>grs.sgml</em></span>. This filter is not yet packaged,
but is planned for the
<code class="literal">libidzebra-2.0-mod-grs-sgml</code> Debian package.
</p><p>
The Debian package
<code class="literal">libidzebra-2.0-mod-grs-xml</code> includes the
<span class="emphasis"><em>grs.xml</em></span> filter which uses <a class="ulink" href="https://libexpat.github.io" target="_top">Expat</a> to
parse records in <acronym class="acronym">XML</acronym> and turn them into ID<span class="application">Zebra</span>'s internal <acronym class="acronym">GRS-1</acronym> node
trees. See also the Alvis <acronym class="acronym">XML</acronym>/<acronym class="acronym">XSLT</acronym> filter described in
the previous section.
</p></div><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulestext"></a>2.5.4.TEXT Record Model and Filter Module</h4></div></div></div><p>
Plain ASCII text filter. TODO: add information here.
</p></div></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="architecture.html">Prev</a></td><td width="20%" align="center"><a accesskey="u" href="architecture.html">Up</a></td><td width="40%" align="right"><a accesskey="n" href="architecture-workflow.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter4.Overview of <span class="application">Zebra</span> Architecture</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top">3.Indexing and Retrieval Workflow</td></tr></table></div></body></html>