File: architecture-maincomponents.html

idzebra 2.2.10-1
<html><head><meta charset="ISO-8859-1"><title>2. Main Components</title><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot"><link rel="home" href="index.html" title="Zebra - User's Guide and Reference"><link rel="up" href="architecture.html" title="Chapter 4. Overview of Zebra Architecture"><link rel="prev" href="architecture.html" title="Chapter 4. Overview of Zebra Architecture"><link rel="next" href="architecture-workflow.html" title="3. Indexing and Retrieval Workflow"></head><body><link rel="stylesheet" type="text/css" href="common/style1.css"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">2. Main Components</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="architecture.html">Prev</a></td><th width="60%" align="center">Chapter 4. Overview of <span class="application">Zebra</span> Architecture</th><td width="20%" align="right"><a accesskey="n" href="architecture-workflow.html">Next</a></td></tr></table><hr></div><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="architecture-maincomponents"></a>2. Main Components</h2></div></div></div><p>
    The <span class="application">Zebra</span> system is designed to support a wide range of data management
    applications. The system can be configured to handle virtually any
    kind of structured data. Each record in the system is associated with
    a <span class="emphasis"><em>record schema</em></span> which lends context to the data
    elements of the record.
    Any number of record schemas can coexist in the system.
    Although it may be wise to use only a single schema within
    one database, the system poses no such restrictions.
   </p><p>
    The <span class="application">Zebra</span> indexer and information retrieval server consists of the
    following main applications: the <span class="command"><strong>zebraidx</strong></span>
    indexing maintenance utility, and the <span class="command"><strong>zebrasrv</strong></span>
    information query and retrieval server. Both share some of the
    same main components, which are presented here.
   </p><p>
    The virtual Debian package <code class="literal">idzebra-2.0</code>
    installs all the necessary packages to start
    working with <span class="application">Zebra</span> - including utility programs, development libraries,
    documentation and modules.
  </p><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentcore"></a>2.1. Core <span class="application">Zebra</span> Libraries Containing Common Functionality</h3></div></div></div><p>
     The core <span class="application">Zebra</span> module forms the heart of both the <span class="command"><strong>zebraidx</strong></span>
    indexing maintenance utility and the <span class="command"><strong>zebrasrv</strong></span>
    information query and retrieval server binaries. In short, the core
    libraries are responsible for
     </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">Dynamic Loading</span></dt><dd><p>of external filter modules, in case the application is
        not compiled statically. These filter modules define indexing,
        search and retrieval capabilities of the various input formats.
        </p></dd><dt><span class="term">Index Maintenance</span></dt><dd><p> <span class="application">Zebra</span> maintains Term Dictionaries and ISAM index
        entries in inverted index structures kept on disk. These are
        optimized for fast insert, update and delete, as well as good
        search performance.
        </p></dd><dt><span class="term">Search Evaluation</span></dt><dd><p>by execution of search requests expressed in <acronym class="acronym">PQF</acronym>/<acronym class="acronym">RPN</acronym>
         data structures, which are handed over from
         the <span class="application">YAZ</span> server frontend <acronym class="acronym">API</acronym>. Search evaluation includes
         construction of hit lists according to boolean combinations
         of simpler searches. Fast performance is achieved by careful
         use of index structures, and by evaluating specific index hit
         lists in the correct order.
        </p></dd><dt><span class="term">Ranking and Sorting</span></dt><dd><p>
         components call resorting/re-ranking algorithms on the hit
         sets. The hit sets may be pre-sorted not only by the
         assigned document IDs, but also by assigned static rank
         information.
        </p></dd><dt><span class="term">Record Presentation</span></dt><dd><p>returns result sets (possibly ranked), hit
         counts, and similar internal data to the <span class="application">YAZ</span> server backend <acronym class="acronym">API</acronym>
         for shipping to the client. Each individual filter module
         implements its own specific presentation formats.
        </p></dd></dl></div><p>
     </p><p>
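      As a rough illustration of the index maintenance and search
      evaluation responsibilities listed above, the following Python sketch
      keeps a toy inverted index in memory and evaluates a small boolean
      query tree against it. It is only a sketch under stated assumptions:
      the class <code class="literal">ToyIndex</code>, the function
      <code class="literal">evaluate</code> and the tuple-based query tree are
      hypothetical names invented for this example, and nothing here
      reflects the on-disk term dictionary and ISAM structures that
      <span class="application">Zebra</span> actually uses.
     </p><pre class="programlisting">
```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for a term dictionary with posting lists."""

    def __init__(self):
        # term -> set of IDs of the documents containing that term
        self.postings = defaultdict(set)

    def insert(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        for ids in self.postings.values():
            ids.discard(doc_id)

    def hits(self, term):
        # A hit list is returned sorted by document ID.
        return sorted(self.postings.get(term.lower(), ()))


def evaluate(index, node):
    """Evaluate a query tree: a term string or (operator, left, right)."""
    if isinstance(node, str):
        return index.hits(node)
    operator, left, right = node
    a = set(evaluate(index, left))
    b = set(evaluate(index, right))
    if operator == "and":
        return sorted(a.intersection(b))
    if operator == "or":
        return sorted(a.union(b))
    if operator == "not":  # "and-not": hits in the left but not the right
        return sorted(a.difference(b))
    raise ValueError("unknown operator: " + operator)


index = ToyIndex()
index.insert(1, "Zebra indexing maintenance utility")
index.insert(2, "Zebra retrieval server")
index.insert(3, "YAZ server frontend")

query = ("and", "zebra", ("or", "server", "utility"))
result = evaluate(index, query)
print(result)  # [1, 2]
```
</pre><p>
      The hit list of a compound query is built from the hit lists of its
      simpler parts, which is the idea behind the boolean combinations
      mentioned under Search Evaluation.
     </p><p>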
     The Debian package <code class="literal">libidzebra-2.0</code>
     contains all run-time libraries for <span class="application">Zebra</span>, the
     documentation in PDF and HTML is found in
     <code class="literal">idzebra-2.0-doc</code>, and
     <code class="literal">idzebra-2.0-common</code>
     includes common essential <span class="application">Zebra</span> configuration files.
     </p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentindexer"></a>2.2. <span class="application">Zebra</span> Indexer</h3></div></div></div><p>
      The <span class="command"><strong>zebraidx</strong></span>
      indexing maintenance utility
      loads external filter modules used for indexing data records of
      different types, and creates, updates and drops databases and
      indexes according to the rules defined in the filter modules.
    </p><p>
     The Debian  package <code class="literal">idzebra-2.0-utils</code> contains
     the  <span class="command"><strong>zebraidx</strong></span> utility.
    </p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentsearcher"></a>2.3. <span class="application">Zebra</span> Searcher/Retriever</h3></div></div></div><p>
      This is the executable which runs the <acronym class="acronym">Z39.50</acronym>/<acronym class="acronym">SRU</acronym>/<acronym class="acronym">SRW</acronym> server and
      glues together the core libraries and the filter modules into a
      single Information Retrieval server application.
    </p><p>
     The Debian  package <code class="literal">idzebra-2.0-utils</code> contains
     the  <span class="command"><strong>zebrasrv</strong></span> utility.
    </p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentyazserver"></a>2.4. <span class="application">YAZ</span> Server Frontend</h3></div></div></div><p>
      The <span class="application">YAZ</span> server frontend is
      a full-fledged, stateful <acronym class="acronym">Z39.50</acronym> server that accepts client
      connections and forwards search and scan requests to the
      <span class="application">Zebra</span> core indexer.
    </p><p>
     In addition to <acronym class="acronym">Z39.50</acronym> requests, the <span class="application">YAZ</span> server frontend acts
     as an HTTP server, honoring
      <a class="ulink" href="https://www.loc.gov/standards/sru/" target="_top"><acronym class="acronym">SRU</acronym> <acronym class="acronym">SOAP</acronym></a>
     requests, and
     <acronym class="acronym">SRU</acronym> <acronym class="acronym">REST</acronym>
     requests. Moreover, it can
     translate incoming
     <a class="ulink" href="https://www.loc.gov/standards/sru/cql/" target="_top"><acronym class="acronym">CQL</acronym></a>
     queries to
     <a class="ulink" href="https://www.indexdata.com/yaz/doc/tools.html#PQF" target="_top"><acronym class="acronym">PQF</acronym></a>
      queries, if
     correctly configured.
    </p><p>
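      To make the idea of such a query translation concrete, here is a
      minimal Python sketch that rewrites a flat <acronym class="acronym">CQL</acronym> query into
      <acronym class="acronym">PQF</acronym> prefix notation. It is not the translation code used by the
      <span class="application">YAZ</span> frontend, which is driven by its own configuration. The
      mapping table <code class="literal">CQL_TO_BIB1</code> and the function names are
      hypothetical, and the sketch assumes that every clause has the
      form <code class="literal">index = term</code> with spaces around the equals
      sign, joined by <code class="literal">and</code>, <code class="literal">or</code> or
      <code class="literal">not</code> and applied from left to right.
     </p><pre class="programlisting">
```python
import shlex

# Hypothetical mapping from CQL index names to BIB-1 use attributes.
CQL_TO_BIB1 = {
    "dc.title": "1=4",
    "dc.creator": "1=1003",
    "dc.subject": "1=21",
}
BOOLEANS = {"and": "@and", "or": "@or", "not": "@not"}


def clause_to_pqf(index_name, term):
    """Translate one `index = term` clause to a PQF attribute/term pair."""
    attr = CQL_TO_BIB1[index_name]
    if " " in term:
        term = '"' + term + '"'  # multi-word terms are quoted in PQF
    return "@attr " + attr + " " + term


def cql_to_pqf(cql):
    """Translate a flat CQL query; booleans are applied left to right."""
    tokens = shlex.split(cql)
    # Tokens come in groups: index, "=", term, then optionally a boolean.
    pqf = clause_to_pqf(tokens[0], tokens[2])
    position = 3
    while len(tokens) > position:
        operator = BOOLEANS[tokens[position].lower()]
        clause = clause_to_pqf(tokens[position + 1], tokens[position + 3])
        # PQF is prefix notation: the operator precedes both operands.
        pqf = operator + " " + pqf + " " + clause
        position += 4
    return pqf


translated = cql_to_pqf('dc.title = zebra and dc.creator = "index data"')
print(translated)  # @and @attr 1=4 zebra @attr 1=1003 "index data"
```
</pre><p>
      Parentheses, relations other than equality and unknown index names
      are deliberately left out of this sketch.
     </p><p>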
     <a class="ulink" href="https://www.indexdata.com/yaz" target="_top"><span class="application">YAZ</span></a>
     is an Open Source
     toolkit that allows you to develop software using the
     <acronym class="acronym">ANSI</acronym> <acronym class="acronym">Z39.50</acronym>/ISO 23950 standard for information retrieval.
     It is packaged in the Debian packages
     <code class="literal">yaz</code> and <code class="literal">libyaz</code>.
    </p></div><div class="section"><div class="titlepage"><div><div><h3 class="title"><a name="componentmodules"></a>2.5. Record Models and Filter Modules</h3></div></div></div><p>
     The hard work of knowing <span class="emphasis"><em>what</em></span> to index,
     <span class="emphasis"><em>how</em></span> to do it, and <span class="emphasis"><em>which</em></span>
     part of the records to send in a search/retrieve response is
     implemented in
     various filter modules. It is their responsibility to define the
     exact indexing and record display filtering rules.
     </p><p>
     The virtual Debian package
     <code class="literal">libidzebra-2.0-modules</code> installs all base filter
     modules.
    </p><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulesdom"></a>2.5.1. <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module</h4></div></div></div><p>
      The <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter uses a standard <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> structure as
      its internal data model, and can thus parse, index, and display
      any <acronym class="acronym">XML</acronym> document.
    </p><p>
      A parser for binary <acronym class="acronym">MARC</acronym> records based on the ISO 2709 library
      standard is provided; it transforms them into the internal
      <acronym class="acronym">MARCXML</acronym> <acronym class="acronym">DOM</acronym> representation.
    </p><p>
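      For readers unfamiliar with the binary format, the following Python
      sketch shows the general layout of an ISO 2709 record: a 24-byte
      leader, a directory of 12-byte entries, and the field data. It is
      only an illustration and not the parser shipped with
      <span class="application">Zebra</span>. The helper names
      <code class="literal">build_record</code> and <code class="literal">parse_record</code>
      are hypothetical, and the leader values used in the demo record are
      typical placeholder values chosen for this example.
     </p><pre class="programlisting">
```python
FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator


def build_record(fields):
    """Build a tiny ISO 2709 record from (tag, data) pairs, for the demo."""
    directory = b""
    body = b""
    for tag, data in fields:
        chunk = data + FIELD_END
        # Directory entry: 3-byte tag, 4-digit length, 5-digit start offset.
        directory += b"%3s%04d%05d" % (tag, len(chunk), len(body))
        body += chunk
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(body) + 1     # plus the record terminator
    leader = b"%05dnam  22%05d   4500" % (total, base)
    return leader + directory + FIELD_END + body + RECORD_END


def parse_record(raw):
    """Return {tag: [field data, ...]} using the leader and the directory."""
    record_length = int(raw[0:5])    # leader positions 0-4
    base = int(raw[12:17])           # leader positions 12-16
    directory = raw[24:base - 1]     # drop the directory terminator
    fields = {}
    for offset in range(0, len(directory), 12):
        entry = directory[offset:offset + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # The stored length includes the trailing field terminator.
        data = raw[base + start:base + start + length - 1]
        fields.setdefault(tag, []).append(data)
    assert raw[record_length - 1:record_length] == RECORD_END
    return fields


raw = build_record([(b"001", b"rec-1"), (b"245", b"10\x1faZebra guide")])
parsed = parse_record(raw)
print(parsed["001"])  # [b'rec-1']
```
</pre><p>
      A real converter would go on to split each field into indicators and
      subfields and emit the corresponding <acronym class="acronym">MARCXML</acronym> elements.
     </p><p>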
      The internal <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> representation can be fed into four
      different pipelines, consisting of arbitrarily many successive
      <acronym class="acronym">XSLT</acronym> transformations; these are for
      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>input parsing and initial
          transformations,</p></li><li class="listitem"><p>indexing term extraction
          transformations,</p></li><li class="listitem"><p>transformations before internal document
          storage, and </p></li><li class="listitem"><p>retrieve transformations from storage to output
          format.</p></li></ul></div><p>
    </p><p>
      The <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter pipelines use <acronym class="acronym">XSLT</acronym> (and, if supported on
      your platform, even <acronym class="acronym">EXSLT</acronym>), and thus bring full <acronym class="acronym">XPATH</acronym>
      support to the indexing, storage and display rules of not only
      <acronym class="acronym">XML</acronym> documents, but also binary <acronym class="acronym">MARC</acronym> records.
    </p><p>
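      The four pipelines described above can be pictured as named lists
      of successive transformations, each step receiving the output of
      the previous one. The Python sketch below illustrates only that
      chaining idea. Plain Python functions stand in for the
      <acronym class="acronym">XSLT</acronym> stylesheets, since the Python standard library has no
      <acronym class="acronym">XSLT</acronym> processor, and all names in it
      (<code class="literal">PIPELINES</code>, <code class="literal">run_pipeline</code> and the
      step functions) are hypothetical and not part of any
      <span class="application">Zebra</span> configuration.
     </p><pre class="programlisting">
```python
import xml.etree.ElementTree as ET
from functools import reduce


def run_pipeline(steps, document):
    """Apply each transformation in order, feeding one into the next."""
    return reduce(lambda doc, step: step(doc), steps, document)


def strip_whitespace(doc):
    # Input-stage stand-in: normalise the text content of every element.
    for element in doc.iter():
        if element.text:
            element.text = element.text.strip()
    return doc


def extract_terms(doc):
    # Extract-stage stand-in: emit (index name, term) pairs.
    return [(element.tag, word.lower())
            for element in doc.iter() if element.text
            for word in element.text.split()]


def to_brief(doc):
    # Retrieve-stage stand-in: build a reduced view of the stored record.
    brief = ET.Element("brief")
    brief.text = doc.findtext("title")
    return brief


# Four named pipelines, each a list of successive steps.
PIPELINES = {
    "input": [strip_whitespace],
    "extract": [extract_terms],
    "store": [],            # empty pipeline: store the DOM unchanged
    "retrieve": [to_brief],
}

record = ET.Element("record")
ET.SubElement(record, "title").text = "  Zebra Guide "
ET.SubElement(record, "author").text = "Index Data"

dom = run_pipeline(PIPELINES["input"], record)
terms = run_pipeline(PIPELINES["extract"], dom)
stored = run_pipeline(PIPELINES["store"], dom)
shown = run_pipeline(PIPELINES["retrieve"], stored)
print(terms)
print(shown.text)  # Zebra Guide
```
</pre><p>
      Each list may hold any number of steps, which mirrors the
      "arbitrarily many successive transformations" of the filter.
     </p><p>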
      Finally, the <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter allows for static ranking at index
      time, and for sorting hit lists according to predefined
      static ranks.
    </p><p>
      Details on the experimental <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> filter are found in
      <a class="xref" href="record-model-domxml.html" title="Chapter 7. DOM XML Record Model and Filter Module">Chapter 7, <i><acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module</i></a>.
      </p><p>
      The Debian package <code class="literal">libidzebra-2.0-mod-dom</code>
      contains the <acronym class="acronym">DOM</acronym> filter module.
     </p></div><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulesalvis"></a>2.5.2. ALVIS <acronym class="acronym">XML</acronym> Record Model and Filter Module</h4></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>
        The functionality of this record model has been improved and
        replaced by the <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> record model. See
        <a class="xref" href="architecture-maincomponents.html#componentmodulesdom" title="2.5.1. DOM XML Record Model and Filter Module">Section 2.5.1, &#8220;<acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module&#8221;</a>.
      </p></div><p>
      The Alvis filter for <acronym class="acronym">XML</acronym> files is an <acronym class="acronym">XSLT</acronym>-based input
      filter.
      It indexes element and attribute content of any conceivable <acronym class="acronym">XML</acronym> format
      using full <acronym class="acronym">XPATH</acronym> support, a feature which the standard <span class="application">Zebra</span>
      <acronym class="acronym">GRS-1</acronym> <acronym class="acronym">SGML</acronym> and <acronym class="acronym">XML</acronym> filters lacked. The indexed documents are
      parsed into a standard <acronym class="acronym">XML</acronym> <acronym class="acronym">DOM</acronym> tree, which restricts record size
      according to availability of memory.
    </p><p>
      The Alvis filter
      uses <acronym class="acronym">XSLT</acronym> display stylesheets, which let
      the <span class="application">Zebra</span> DB administrator associate multiple different views with
      the same <acronym class="acronym">XML</acronym> document type. These views are chosen on the fly at
      search time.
     </p><p>
      In addition, the Alvis filter configuration is not bound to the
      arcane  <acronym class="acronym">BIB-1</acronym> <acronym class="acronym">Z39.50</acronym> library catalogue indexing traditions and
      folklore, and is therefore easier to understand.
    </p><p>
      Finally, the Alvis filter allows for static ranking at index
      time, and for sorting hit lists according to predefined
      static ranks. This imposes no overhead at all: both
      search and indexing still perform in
      <span class="emphasis"><em>O(1)</em></span>, irrespective of document
      collection size. This feature resembles Google's pre-ranking using
      their PageRank algorithm.
    </p><p>
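      One way to picture static ranking is to store every hit list
      already ordered by a rank assigned at index time, so that
      combining hit lists at search time is a plain merge with no
      separate sorting pass. The Python sketch below illustrates that
      idea only and makes no claim about how the Alvis filter implements
      it. The rank values, the document IDs and the function names are
      hypothetical.
     </p><pre class="programlisting">
```python
import heapq

# Static ranks assigned at index time (made-up values; higher is better).
STATIC_RANK = {"doc-a": 10, "doc-b": 70, "doc-c": 40, "doc-d": 70}


def sort_key(doc_id):
    # Best rank first; ties are broken by document ID.
    return (-STATIC_RANK[doc_id], doc_id)


def build_postings(documents):
    """Index time: store every hit list already ordered by static rank."""
    postings = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    for hit_list in postings.values():
        hit_list.sort(key=sort_key)  # the only sort, done while indexing
    return postings


def search_or(postings, first, second):
    """Search time: merge two pre-sorted hit lists without re-sorting."""
    merged = heapq.merge(postings.get(first, []), postings.get(second, []),
                         key=sort_key)
    result = []
    for doc_id in merged:
        if not result or result[-1] != doc_id:  # drop adjacent duplicates
            result.append(doc_id)
    return result


documents = {
    "doc-a": "zebra manual",
    "doc-b": "zebra server",
    "doc-c": "yaz server",
    "doc-d": "yaz toolkit",
}
postings = build_postings(documents)
ranked = search_or(postings, "zebra", "server")
print(ranked)  # ['doc-b', 'doc-c', 'doc-a']
```
</pre><p>
      Because both inputs share the same ordering, the merged hit list
      comes out in rank order as a side effect of the merge.
     </p><p>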
      Details on the experimental Alvis <acronym class="acronym">XSLT</acronym> filter are found in
      <a class="xref" href="record-model-alvisxslt.html" title="Chapter 8. ALVIS XML Record Model and Filter Module">Chapter 8, <i>ALVIS <acronym class="acronym">XML</acronym> Record Model and Filter Module</i></a>.
      </p><p>
      The Debian package <code class="literal">libidzebra-2.0-mod-alvis</code>
      contains the Alvis filter module.
     </p></div><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulesgrs"></a>2.5.3. <acronym class="acronym">GRS-1</acronym> Record Model and Filter Modules</h4></div></div></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>
        The functionality of this record model has been improved and
        replaced by the <acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> record model. See
        <a class="xref" href="architecture-maincomponents.html#componentmodulesdom" title="2.5.1. DOM XML Record Model and Filter Module">Section 2.5.1, &#8220;<acronym class="acronym">DOM</acronym> <acronym class="acronym">XML</acronym> Record Model and Filter Module&#8221;</a>.
      </p></div><p>
    The <acronym class="acronym">GRS-1</acronym> filter modules described in
    <a class="xref" href="grs.html" title="Chapter 9. GRS-1 Record Model and Filter Modules">Chapter 9, <i><acronym class="acronym">GRS-1</acronym> Record Model and Filter Modules</i></a>
    are all based on the <acronym class="acronym">Z39.50</acronym> specifications, and it is absolutely
    mandatory to have the reference pages on <acronym class="acronym">BIB-1</acronym> attribute sets
    at hand when configuring <acronym class="acronym">GRS-1</acronym> filters. The <acronym class="acronym">GRS-1</acronym> filters come in
    different flavors, which are briefly introduced here.
    <acronym class="acronym">GRS-1</acronym> filters of various kinds have also been called ABS filters due
    to the <code class="filename">*.abs</code> configuration file suffix.
    </p><p>
      The <span class="emphasis"><em>grs.marc</em></span> and
      <span class="emphasis"><em>grs.marcxml</em></span> filters are suited to parse and
      index binary and <acronym class="acronym">XML</acronym> versions of traditional library <acronym class="acronym">MARC</acronym> records
      based on the ISO2709 standard. The Debian package for both
      filters is
     <code class="literal">libidzebra-2.0-mod-grs-marc</code>.
    </p><p>
      <acronym class="acronym">GRS-1</acronym> TCL scriptable filters for extensive user configuration come
     in two flavors: a regular expression filter
     <span class="emphasis"><em>grs.regx</em></span> using TCL regular expressions, and
     a general scriptable TCL filter called
     <span class="emphasis"><em>grs.tcl</em></span>.
     Both are included in the
     <code class="literal">libidzebra-2.0-mod-grs-regx</code> Debian package.
    </p><p>
      A general-purpose <acronym class="acronym">SGML</acronym> filter is called
     <span class="emphasis"><em>grs.sgml</em></span>. This filter is not yet packaged,
     but is planned for the
     <code class="literal">libidzebra-2.0-mod-grs-sgml</code> Debian package.
    </p><p>
      The Debian  package
      <code class="literal">libidzebra-2.0-mod-grs-xml</code> includes the
      <span class="emphasis"><em>grs.xml</em></span> filter which uses <a class="ulink" href="https://libexpat.github.io" target="_top">Expat</a> to
      parse records in <acronym class="acronym">XML</acronym> and turn them into ID<span class="application">Zebra</span>'s internal <acronym class="acronym">GRS-1</acronym> node
      trees. See also the Alvis <acronym class="acronym">XML</acronym>/<acronym class="acronym">XSLT</acronym> filter described in
      the previous section.
    </p></div><div class="section"><div class="titlepage"><div><div><h4 class="title"><a name="componentmodulestext"></a>2.5.4. TEXT Record Model and Filter Module</h4></div></div></div><p>
      Plain ASCII text filter. TODO: add information here.
    </p></div></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="architecture.html">Prev</a></td><td width="20%" align="center"><a accesskey="u" href="architecture.html">Up</a></td><td width="40%" align="right"><a accesskey="n" href="architecture-workflow.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 4. Overview of <span class="application">Zebra</span> Architecture</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top">3. Indexing and Retrieval Workflow</td></tr></table></div></body></html>