 <chapter id="architecture">
  <title>Overview of &zebra; Architecture</title>

  <section id="architecture-representation">
   <title>Local Representation</title>

   <para>
    As mentioned earlier, &zebra; places few restrictions on the type of
    data that you can index and manage. Generally, whatever the form of
    the data, it is parsed by an input filter specific to that format, and
    turned into an internal structure that &zebra; knows how to handle. This
    process takes place whenever the record is accessed, for both indexing
    and retrieval.
   </para>

   <para>
    The RecordType parameter in the <literal>zebra.cfg</literal> file, or
    the <literal>-t</literal> option to the indexer, tells &zebra; how to
    process input records.
    Two basic types of processing are available: raw text and structured
    data. Raw text is just that, and it is selected by providing the
    argument <emphasis>text</emphasis> to &zebra;. Structured records are
    all handled internally using the basic mechanisms described in the
    subsequent sections.
    &zebra; can read structured records in many different formats.
    <!--
    How this is done is governed by additional parameters after the
    "grs" keyword, separated by "." characters.
    -->
   </para>
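   <para>
    As a minimal sketch, assuming a directory of plain-text files named
    <filename>records/</filename> (a hypothetical path), raw text
    processing could be selected either in <literal>zebra.cfg</literal>:
    <screen>
      recordType: text
    </screen>
    or on the indexer command line:
    <screen>
      zebraidx -t text update records/
    </screen>
   </para>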
  </section>

  <section id="architecture-maincomponents">
   <title>Main Components</title>
   <para>
    The &zebra; system is designed to support a wide range of data management
    applications. The system can be configured to handle virtually any
    kind of structured data. Each record in the system is associated with
    a <emphasis>record schema</emphasis> which lends context to the data
    elements of the record.
    Any number of record schemas can coexist in the system.
    Although it may be wise to use only a single schema within
    one database, the system poses no such restrictions.
   </para>
   <para>
    The &zebra; indexer and information retrieval server consists of the
    following main applications: the <command>zebraidx</command>
    indexing maintenance utility, and the <command>zebrasrv</command>
    information query and retrieval server. Both use some of the
    same main components, which are presented here.
   </para>
   <para>
    The virtual Debian package <literal>idzebra-2.0</literal>
    installs all the necessary packages to start
    working with &zebra; - including utility programs, development libraries,
    documentation and modules.
  </para>

   <section id="componentcore">
    <title>Core &zebra; Libraries Containing Common Functionality</title>
    <para>
     The core &zebra; module is the meat of the <command>zebraidx</command>
    indexing maintenance utility, and the <command>zebrasrv</command>
    information query and retrieval server binaries. In short, the core
    libraries are responsible for
     <variablelist>
      <varlistentry>
       <term>Dynamic Loading</term>
       <listitem>
        <para>of external filter modules, in case the application is
        not compiled statically. These filter modules define indexing,
        search and retrieval capabilities of the various input formats.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Index Maintenance</term>
       <listitem>
        <para> &zebra; maintains Term Dictionaries and ISAM index
        entries in inverted index structures kept on disk. These are
        optimized for fast insert, update and delete, as well as good
        search performance.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Search Evaluation</term>
       <listitem>
        <para>by execution of search requests expressed in &acro.pqf;/&acro.rpn;
         data structures, which are handed over from
         the &yaz; server frontend &acro.api;. Search evaluation includes
         construction of hit lists according to boolean combinations
         of simpler searches. Fast performance is achieved by careful
         use of index structures, and by evaluating specific index hit
         lists in the correct order.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Ranking and Sorting</term>
       <listitem>
        <para>
         components call resorting/re-ranking algorithms on the hit
         sets. The hit sets can also be pre-sorted, using not only the
         assigned document IDs but also assigned static rank
         information.
        </para>
       </listitem>
      </varlistentry>
      <varlistentry>
       <term>Record Presentation</term>
       <listitem>
        <para>returns (possibly ranked) result sets, hit
         counts, and similar internal data to the &yaz; server backend &acro.api;
         for shipping to the client. Each individual filter module
         implements its own specific presentation formats.
        </para>
       </listitem>
      </varlistentry>
     </variablelist>
     </para>
    <para>
     The Debian package <literal>libidzebra-2.0</literal>
     contains all run-time libraries for &zebra;, the
     documentation in PDF and HTML is found in
     <literal>idzebra-2.0-doc</literal>, and
     <literal>idzebra-2.0-common</literal>
     includes common essential &zebra; configuration files.
    </para>
   </section>


   <section id="componentindexer">
    <title>&zebra; Indexer</title>
    <para>
     The  <command>zebraidx</command>
     indexing maintenance utility
     loads external filter modules used for indexing data records of
     different type, and creates, updates and drops databases and
     indexes according to the rules defined in the filter modules.
    </para>
    <para>
     The Debian  package <literal>idzebra-2.0-utils</literal> contains
     the  <command>zebraidx</command> utility.
    </para>
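    <para>
     A typical maintenance session might look like the following sketch.
     The configuration file name, the directory
     <filename>records/</filename> and the database name
     <literal>mydb</literal> are assumptions made for illustration only:
     <screen>
      zebraidx -c zebra.cfg init
      zebraidx -c zebra.cfg -d mydb update records/
      zebraidx -c zebra.cfg drop mydb
     </screen>
     Here <literal>init</literal> initializes an empty register,
     <literal>update</literal> indexes the files found below the given
     directory, and <literal>drop</literal> removes the named database.
    </para>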
   </section>

   <section id="componentsearcher">
    <title>&zebra; Searcher/Retriever</title>
    <para>
     This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
     glues together the core libraries and the filter modules to one
     great Information Retrieval server application.
    </para>
    <para>
     The Debian  package <literal>idzebra-2.0-utils</literal> contains
     the  <command>zebrasrv</command> utility.
    </para>
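    <para>
     For instance, assuming a configuration file
     <filename>zebra.cfg</filename> in the current directory and a free
     port 9999 (both chosen for illustration), the server could be
     started and then contacted with the &yaz; command line client:
     <screen>
      zebrasrv -c zebra.cfg @:9999
      yaz-client localhost:9999
     </screen>
    </para>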
   </section>

   <section id="componentyazserver">
    <title>&yaz; Server Frontend</title>
    <para>
     The &yaz; server frontend is
     a full-fledged stateful &acro.z3950; server taking client
     connections, and forwarding search and scan requests to the
     &zebra; core indexer.
    </para>
    <para>
     In addition to &acro.z3950; requests, the &yaz; server frontend acts
     as an HTTP server, honoring
      <ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
     requests, and
     &acro.sru; &acro.rest;
     requests. Moreover, it can
     translate incoming
     <ulink url="&url.cql;">&acro.cql;</ulink>
     queries to
     <ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
      queries, if
     correctly configured.
    </para>
    <para>
     <ulink url="&url.yaz;">&yaz;</ulink>
     is an Open Source
     toolkit that allows you to develop software using the
     &acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
     It is packaged in the Debian packages
     <literal>yaz</literal> and <literal>libyaz</literal>.
    </para>
   </section>

   <section id="componentmodules">
    <title>Record Models and Filter Modules</title>
    <para>
     The hard work of knowing <emphasis>what</emphasis> to index,
     <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
     part of the records to send in a search/retrieve response is
     implemented in
     various filter modules. It is their responsibility to define the
     exact indexing and record display filtering rules.
     </para>
     <para>
     The virtual Debian package
     <literal>libidzebra-2.0-modules</literal> installs all base filter
     modules.
    </para>

   <section id="componentmodulesdom">
    <title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
     <para>
      The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
      internal data model, and can thus parse, index, and display
      any &acro.xml; document.
    </para>
    <para>
      A parser for binary &acro.marc; records based on the ISO2709 library
      standard is provided. It transforms these records to the internal
      &acro.marcxml; &acro.dom; representation.
    </para>
    <para>
      The internal &acro.dom; &acro.xml; representation can be fed into four
      different pipelines, each consisting of arbitrarily many successive
      &acro.xslt; transformations. These are for
     <itemizedlist>
       <listitem><para>input parsing and initial
          transformations,</para></listitem>
       <listitem><para>indexing term extraction
          transformations,</para></listitem>
       <listitem><para>transformations before internal document
          storage, and</para></listitem>
       <listitem><para>retrieve transformations from storage to output
          format.</para></listitem>
      </itemizedlist>
    </para>
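    <para>
      As a rough sketch of how these four pipelines could be declared,
      a &acro.dom; filter configuration might look like the following.
      The stylesheet file names and the retrieve name
      <literal>dc</literal> are hypothetical, and the element names are
      given from memory of the filter's configuration format; see
      <xref linkend="record-model-domxml"/> for the authoritative syntax.
      <programlisting><![CDATA[
<dom xmlns="http://indexdata.com/zebra-2.0">
  <!-- input parsing and initial transformations -->
  <input>
    <xmlreader level="1"/>
  </input>
  <!-- indexing term extraction -->
  <extract>
    <xslt stylesheet="index.xsl"/>
  </extract>
  <!-- transformations before internal storage -->
  <store>
    <xslt stylesheet="store.xsl"/>
  </store>
  <!-- retrieval from storage to an output format -->
  <retrieve name="dc">
    <xslt stylesheet="dc.xsl"/>
  </retrieve>
</dom>
]]></programlisting>
    </para>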
    <para>
      The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on
      your platform, even &acro.exslt;). This brings full &acro.xpath;
      support to the indexing, storage and display rules of not only
      &acro.xml; documents, but also binary &acro.marc; records.
    </para>
    <para>
      Finally, the &acro.dom; &acro.xml; filter allows for static ranking at index
      time, and for sorting hit lists according to predefined
      static ranks.
    </para>
    <para>
      Details on the experimental &acro.dom; &acro.xml; filter are found in
      <xref linkend="record-model-domxml"/>.
      </para>
     <para>
      The Debian package <literal>libidzebra-2.0-mod-dom</literal>
      contains the &acro.dom; filter module.
     </para>
    </section>

   <section id="componentmodulesalvis">
    <title>ALVIS &acro.xml; Record Model and Filter Module</title>
     <note>
      <para>
        The functionality of this record model has been improved and
        replaced by the &acro.dom; &acro.xml; record model. See
        <xref linkend="componentmodulesdom"/>.
      </para>
     </note>

     <para>
      The Alvis filter for &acro.xml; files is an &acro.xslt; based input
      filter.
      It indexes element and attribute content of any conceivable &acro.xml; format
      using full &acro.xpath; support, a feature which the standard &zebra;
      &acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
      parsed into a standard &acro.xml; &acro.dom; tree, which restricts record size
      according to availability of memory.
    </para>
    <para>
      The Alvis filter
      uses &acro.xslt; display stylesheets, which let
      the &zebra; DB administrator associate multiple, different views with
      the same &acro.xml; document type. These views are chosen on the fly at
      search time.
     </para>
    <para>
      In addition, the Alvis filter configuration is not bound to the
      arcane  &acro.bib1; &acro.z3950; library catalogue indexing traditions and
      folklore, and is therefore easier to understand.
    </para>
    <para>
      Finally, the Alvis filter allows for static ranking at index
      time, and for sorting hit lists according to predefined
      static ranks. This imposes no overhead at all: both
      search and indexing still perform in
      <emphasis>O(1)</emphasis>, irrespective of document
      collection size. This feature resembles Google's pre-ranking using
      their PageRank algorithm.
    </para>
    <para>
      Details on the experimental Alvis &acro.xslt; filter are found in
      <xref linkend="record-model-alvisxslt"/>.
      </para>
     <para>
      The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
      contains the Alvis filter module.
     </para>
    </section>

   <section id="componentmodulesgrs">
    <title>&acro.grs1; Record Model and Filter Modules</title>
     <note>
      <para>
        The functionality of this record model has been improved and
        replaced by the &acro.dom; &acro.xml; record model. See
        <xref linkend="componentmodulesdom"/>.
      </para>
     </note>
    <para>
    The &acro.grs1; filter modules described in
    <xref linkend="grs"/>
    are all based on the &acro.z3950; specifications, and it is essential
    to have the reference pages on &acro.bib1; attribute sets at
    hand when configuring &acro.grs1; filters. The &acro.grs1; filters come in
    different flavors, so a short introduction is given here.
    &acro.grs1; filters of various kinds have also been called ABS filters due
    to the <filename>*.abs</filename> configuration file suffix.
    </para>
    <para>
      The <emphasis>grs.marc</emphasis> and
      <emphasis>grs.marcxml</emphasis> filters are suited to parse and
      index binary and &acro.xml; versions of traditional library &acro.marc; records
      based on the ISO2709 standard. The Debian package for both
      filters is
     <literal>libidzebra-2.0-mod-grs-marc</literal>.
    </para>
    <para>
      &acro.grs1; TCL scriptable filters for extensive user configuration come
     in two flavors: a regular expression filter,
     <emphasis>grs.regx</emphasis>, using TCL regular expressions, and
     a general scriptable TCL filter called
     <emphasis>grs.tcl</emphasis>.
     Both are included in the
     <literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
    </para>
    <para>
      A general-purpose &acro.sgml; filter is called
     <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
     but is planned to be included in the
     <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
    </para>
    <para>
      The Debian  package
      <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
      <emphasis>grs.xml</emphasis> filter which uses <ulink
      url="&url.expat;">Expat</ulink> to
      parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
      trees. See also the Alvis &acro.xml;/&acro.xslt; filter described in
      <xref linkend="componentmodulesalvis"/>.
    </para>
   </section>

   <section id="componentmodulestext">
    <title>TEXT Record Model and Filter Module</title>
    <para>
      Plain ASCII text filter. TODO: add information here.
    </para>
   </section>

    <!--
   <section id="componentmodulessafari">
    <title>SAFARI Record Model and Filter Module</title>
    <para>
     SAFARI filter module TODO: add information here.
    </para>
   </section>
    -->

   </section>

  </section>


  <section id="architecture-workflow">
   <title>Indexing and Retrieval Workflow</title>

  <para>
   Records pass through three different states during processing in the
   system.
  </para>

  <para>

   <itemizedlist>
    <listitem>

     <para>
      When records are accessed by the system, they are represented
      in their local, or native, format. This might be &acro.sgml; or HTML files,
      News or Mail archives, or &acro.marc; records. If the system doesn't already
      know how to read the type of data you need to store, you can set up an
      input filter by preparing conversion rules based on regular
      expressions and possibly augmented by a flexible scripting language
      (Tcl).
      The input filter produces as output an internal representation,
      a tree structure.

     </para>
    </listitem>
    <listitem>

     <para>
      When records are processed by the system, they are represented
      in a tree-structure, constructed by tagged data elements hanging off a
      root node. The tagged elements may contain data or yet more tagged
      elements in a recursive structure. The system performs various
      actions on this tree structure (indexing, element selection, schema
      mapping, etc.).

     </para>
    </listitem>
    <listitem>

     <para>
      Before transmitting records to the client, they are first
      converted from the internal structure to a form suitable for exchange
      over the network - according to the &acro.z3950; standard.
     </para>
    </listitem>

   </itemizedlist>

  </para>
  </section>

  <section id="special-retrieval">
   <title>Retrieval of &zebra; internal record data</title>
   <para>
    Starting with <literal>&zebra;</literal> version 2.0.5, it is
    possible to use special element sets which have the prefix
    <literal>zebra::</literal>.
   </para>
   <para>
    Using such an element set will, regardless of record type, return
    &zebra;'s internal index structure/data for a record.
    In particular, the regular record filters are not invoked when
    these element sets are in use.
    This can in some cases make the retrieval faster than regular
    retrieval operations (for &acro.marc;, &acro.xml; etc).
   </para>
   <table id="special-retrieval-types">
    <title>Special Retrieval Elements</title>
    <tgroup cols="3">
     <thead>
      <row>
       <entry>Element Set</entry>
       <entry>Description</entry>
       <entry>Syntax</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry><literal>zebra::meta::sysno</literal></entry>
       <entry>Get &zebra; record system ID</entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry><literal>zebra::data</literal></entry>
       <entry>Get raw record</entry>
       <entry>all</entry>
      </row>
      <row>
       <entry><literal>zebra::meta</literal></entry>
       <entry>Get &zebra; record internal metadata</entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry><literal>zebra::index</literal></entry>
       <entry>Get all indexed keys for record</entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::index::</literal><replaceable>f</replaceable>
       </entry>
       <entry>
	Get indexed keys for field <replaceable>f</replaceable>	for record
       </entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
       </entry>
       <entry>
	Get indexed keys for field <replaceable>f</replaceable>
	  and type <replaceable>t</replaceable> for record
       </entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::snippet</literal>
       </entry>
       <entry>
	Get snippet for record for one or more indexes (f1,f2,..).
	This includes a phrase from the original
	record at the point where a match occurs (for a query). By default,
	some terms before and after the match are included in the snippet. The
	matching terms are enclosed within the element
	<literal>&lt;s&gt;</literal>. The snippet facility requires
	Zebra 2.0.16 or later.
       </entry>
       <entry>&acro.xml; and &acro.sutrs;</entry>
      </row>
      <row>
       <entry>
	<literal>zebra::facet::</literal><replaceable>f1</replaceable>:<replaceable>t1</replaceable>,<replaceable>f2</replaceable>:<replaceable>t2</replaceable>,..
       </entry>
       <entry>
	Get facet of a result set. The facet result is returned
	as if it were a normal record, while in reality it is a
	recap of the most "important" terms in a result set for the fields
	given.
	The facet facility first appeared in Zebra 2.0.20.
       </entry>
       <entry>&acro.xml;</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
   <para>
    For example, to fetch the raw binary record data stored in
    &zebra;'s internal storage, or on the filesystem, the following
    commands can be issued:
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::data
      Z> s 1+1
      Z> format sutrs
      Z> s 1+1
      Z> format usmarc
      Z> s 1+1
    </screen>
    </para>
   <para>
    The special
    <literal>zebra::data</literal> element set name is
    defined for any record syntax, but will always fetch
    the raw record data in exactly the original form. No record syntax
    specific transformations will be applied to the raw record data.
   </para>
   <para>
    Also, &zebra; internal metadata about the record can be accessed:
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::meta::sysno
      Z> s 1+1
    </screen>
    displays only the internal record system number, in
    <literal>&acro.xml;</literal> record syntax, whereas
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::meta
      Z> s 1+1
    </screen>
    displays all available metadata on the record. These include the system
    number, database name, indexed filename, filter used for indexing,
    score and static ranking information, and finally the size of the
    record in bytes.
   </para>
   <para>
    Sometimes, it is very hard to figure out exactly what has been
    indexed, how, and in which indexes. Using the indexing stylesheet of
    the Alvis filter, one can at least see which portion of the record
    went into which index, but no similar aid exists for the
    other indexing filters.
   </para>
   <para>
    The special
    <literal>zebra::index</literal> element set names are provided to
    access information on per record indexed fields. For example, the
    queries
    <screen>
      Z> f @attr 1=title my
      Z> format sutrs
      Z> elements zebra::index
      Z> s 1+1
    </screen>
    will display all indexed tokens from all indexed fields of the
    first record, in <literal>&acro.sutrs;</literal>
    record syntax, whereas
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::index::title
      Z> s 1+1
      Z> elements zebra::index::title:p
      Z> s 1+1
    </screen>
    displays, in <literal>&acro.xml;</literal> record syntax, only the content
      of the &zebra; string index <literal>title</literal>, or
      even only the part of it indexed as type <literal>p</literal> (phrase).
   </para>
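   <para>
    The snippet and facet element sets listed in the table can be
    requested in the same way. The following sketch assumes that a
    <literal>title</literal> index of type <literal>p</literal> exists,
    as in the example above; the actual output depends on the indexed
    data and is therefore not shown:
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::snippet
      Z> s 1+1
      Z> elements zebra::facet::title:p
      Z> s 1+1
    </screen>
   </para>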
   <note>
    <para>
     Trying to access numeric <literal>&acro.bib1;</literal> use
     attributes or trying to access non-existent &zebra; internal string
     access points will result in a Diagnostic 25: Specified element set
     name not valid for specified database.
    </para>
   </note>
  </section>

 </chapter>

 <!-- Keep this comment at the end of the file
 Local variables:
 mode: sgml
 sgml-omittag:t
 sgml-shorttag:t
 sgml-minimize-attributes:nil
 sgml-always-quote-attributes:t
 sgml-indent-step:1
 sgml-indent-data:t
 sgml-parent-document: "idzebra.xml"
 sgml-local-catalogs: nil
 sgml-namecase-general:t
 End:
 -->