1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629
|
<chapter id="architecture">
<title>Overview of &zebra; Architecture</title>
<section id="architecture-representation">
<title>Local Representation</title>
<para>
As mentioned earlier, &zebra; places few restrictions on the type of
data that you can index and manage. Generally, whatever the form of
the data, it is parsed by an input filter specific to that format, and
turned into an internal structure that &zebra; knows how to handle. This
process takes place whenever the record is accessed - for indexing and
retrieval.
</para>
<para>
The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer tells &zebra; how to
process input records.
Two basic types of processing are available - raw text and structured
data. Raw text is just that, and it is selected by providing the
argument <emphasis>text</emphasis> to &zebra;. Structured records are
all handled internally using the basic mechanisms described in the
subsequent sections.
&zebra; can read structured records in many different formats.
<!--
How this is done is governed by additional parameters after the
"grs" keyword, separated by "." characters.
-->
</para>
</section>
<section id="architecture-maincomponents">
<title>Main Components</title>
<para>
The &zebra; system is designed to support a wide range of data management
applications. The system can be configured to handle virtually any
kind of structured data. Each record in the system is associated with
a <emphasis>record schema</emphasis> which lends context to the data
elements of the record.
Any number of record schemas can coexist in the system.
Although it may be wise to use only a single schema within
one database, the system poses no such restrictions.
</para>
<para>
The &zebra; indexer and information retrieval server consists of the
following main applications: the <command>zebraidx</command>
indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both are using some of the
same main components, which are presented here.
</para>
<para>
The virtual Debian package <literal>idzebra-2.0</literal>
installs all the necessary packages to start
working with &zebra; - including utility programs, development libraries,
documentation and modules.
</para>
<section id="componentcore">
<title>Core &zebra; Libraries Containing Common Functionality</title>
<para>
The core &zebra; module is the meat of the <command>zebraidx</command>
indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server binaries. Shortly, the core
libraries are responsible for
<variablelist>
<varlistentry>
<term>Dynamic Loading</term>
<listitem>
<para>of external filter modules, in case the application is
not compiled statically. These filter modules define indexing,
search and retrieval capabilities of the various input formats.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Index Maintenance</term>
<listitem>
<para> &zebra; maintains Term Dictionaries and ISAM index
entries in inverted index structures kept on disk. These are
optimized for fast inset, update and delete, as well as good
search performance.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Search Evaluation</term>
<listitem>
<para>by execution of search requests expressed in &acro.pqf;/&acro.rpn;
data structures, which are handed over from
the &yaz; server frontend &acro.api;. Search evaluation includes
construction of hit lists according to boolean combinations
of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluation specific index hit
lists in correct order.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Ranking and Sorting</term>
<listitem>
<para>
components call resorting/re-ranking algorithms on the hit
sets. These might also be pre-sorted not only using the
assigned document ID's, but also using assigned static rank
information.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Record Presentation</term>
<listitem>
<para>returns - possibly ranked - result sets, hit
numbers, and the like internal data to the &yaz; server backend &acro.api;
for shipping to the client. Each individual filter module
implements it's own specific presentation formats.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
<para>
The Debian package <literal>libidzebra-2.0</literal>
contains all run-time libraries for &zebra;, the
documentation in PDF and HTML is found in
<literal>idzebra-2.0-doc</literal>, and
<literal>idzebra-2.0-common</literal>
includes common essential &zebra; configuration files.
</para>
</section>
<section id="componentindexer">
<title>&zebra; Indexer</title>
<para>
The <command>zebraidx</command>
indexing maintenance utility
loads external filter modules used for indexing data records of
different type, and creates, updates and drops databases and
indexes according to the rules defined in the filter modules.
</para>
<para>
The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebraidx</command> utility.
</para>
</section>
<section id="componentsearcher">
<title>&zebra; Searcher/Retriever</title>
<para>
This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
glues together the core libraries and the filter modules to one
great Information Retrieval server application.
</para>
<para>
The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebrasrv</command> utility.
</para>
</section>
<section id="componentyazserver">
<title>&yaz; Server Frontend</title>
<para>
The &yaz; server frontend is
a full fledged stateful &acro.z3950; server taking client
connections, and forwarding search and scan requests to the
&zebra; core indexer.
</para>
<para>
In addition to &acro.z3950; requests, the &yaz; server frontend acts
as HTTP server, honoring
<ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
requests, and
&acro.sru; &acro.rest;
requests. Moreover, it can
translate incoming
<ulink url="&url.cql;">&acro.cql;</ulink>
queries to
<ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
queries, if
correctly configured.
</para>
<para>
<ulink url="&url.yaz;">&yaz;</ulink>
is an Open Source
toolkit that allows you to develop software using the
&acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
It is packaged in the Debian packages
<literal>yaz</literal> and <literal>libyaz</literal>.
</para>
</section>
<section id="componentmodules">
<title>Record Models and Filter Modules</title>
<para>
The hard work of knowing <emphasis>what</emphasis> to index,
<emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
part of the records to send in a search/retrieve response is
implemented in
various filter modules. It is their responsibility to define the
exact indexing and record display filtering rules.
</para>
<para>
The virtual Debian package
<literal>libidzebra-2.0-modules</literal> installs all base filter
modules.
</para>
<section id="componentmodulesdom">
<title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
<para>
The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
internal data model, and can thus parse, index, and display
any &acro.xml; document.
</para>
<para>
A parser for binary &acro.marc; records based on the ISO2709 library
standard is provided, it transforms these to the internal
&acro.marcxml; &acro.dom; representation.
</para>
<para>
The internal &acro.dom; &acro.xml; representation can be fed into four
different pipelines, consisting of arbitrarily many successive
&acro.xslt; transformations; these are for
<itemizedlist>
<listitem><para>input parsing and initial
transformations,</para></listitem>
<listitem><para>indexing term extraction
transformations</para></listitem>
<listitem><para>transformations before internal document
storage, and </para></listitem>
<listitem><para>retrieve transformations from storage to output
format</para></listitem>
</itemizedlist>
</para>
<para>
The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and if supported on
your platform, even &acro.exslt;), it brings thus full &acro.xpath;
support to the indexing, storage and display rules of not only
&acro.xml; documents, but also binary &acro.marc; records.
</para>
<para>
Finally, the &acro.dom; &acro.xml; filter allows for static ranking at index
time, and to to sort hit lists according to predefined
static ranks.
</para>
<para>
Details on the experimental &acro.dom; &acro.xml; filter are found in
<xref linkend="record-model-domxml"/>.
</para>
<para>
The Debian package <literal>libidzebra-2.0-mod-dom</literal>
contains the &acro.dom; filter module.
</para>
</section>
<section id="componentmodulesalvis">
<title>ALVIS &acro.xml; Record Model and Filter Module</title>
<note>
<para>
The functionality of this record model has been improved and
replaced by the &acro.dom; &acro.xml; record model. See
<xref linkend="componentmodulesdom"/>.
</para>
</note>
<para>
The Alvis filter for &acro.xml; files is an &acro.xslt; based input
filter.
It indexes element and attribute content of any thinkable &acro.xml; format
using full &acro.xpath; support, a feature which the standard &zebra;
&acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
parsed into a standard &acro.xml; &acro.dom; tree, which restricts record size
according to availability of memory.
</para>
<para>
The Alvis filter
uses &acro.xslt; display stylesheets, which let
the &zebra; DB administrator associate multiple, different views on
the same &acro.xml; document type. These views are chosen on-the-fly in
search time.
</para>
<para>
In addition, the Alvis filter configuration is not bound to the
arcane &acro.bib1; &acro.z3950; library catalogue indexing traditions and
folklore, and is therefore easier to understand.
</para>
<para>
Finally, the Alvis filter allows for static ranking at index
time, and to to sort hit lists according to predefined
static ranks. This imposes no overhead at all, both
search and indexing perform still
<emphasis>O(1)</emphasis> irrespectively of document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
</para>
<para>
Details on the experimental Alvis &acro.xslt; filter are found in
<xref linkend="record-model-alvisxslt"/>.
</para>
<para>
The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
contains the Alvis filter module.
</para>
</section>
<section id="componentmodulesgrs">
<title>&acro.grs1; Record Model and Filter Modules</title>
<note>
<para>
The functionality of this record model has been improved and
replaced by the &acro.dom; &acro.xml; record model. See
<xref linkend="componentmodulesdom"/>.
</para>
</note>
<para>
The &acro.grs1; filter modules described in
<xref linkend="grs"/>
are all based on the &acro.z3950; specifications, and it is absolutely
mandatory to have the reference pages on &acro.bib1; attribute sets on
you hand when configuring &acro.grs1; filters. The GRS filters come in
different flavors, and a short introduction is needed here.
&acro.grs1; filters of various kind have also been called ABS filters due
to the <filename>*.abs</filename> configuration file suffix.
</para>
<para>
The <emphasis>grs.marc</emphasis> and
<emphasis>grs.marcxml</emphasis> filters are suited to parse and
index binary and &acro.xml; versions of traditional library &acro.marc; records
based on the ISO2709 standard. The Debian package for both
filters is
<literal>libidzebra-2.0-mod-grs-marc</literal>.
</para>
<para>
&acro.grs1; TCL scriptable filters for extensive user configuration come
in two flavors: a regular expression filter
<emphasis>grs.regx</emphasis> using TCL regular expressions, and
a general scriptable TCL filter called
<emphasis>grs.tcl</emphasis>
are both included in the
<literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
</para>
<para>
A general purpose &acro.sgml; filter is called
<emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
but planned to be in the
<literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
</para>
<para>
The Debian package
<literal>libidzebra-2.0-mod-grs-xml</literal> includes the
<emphasis>grs.xml</emphasis> filter which uses <ulink
url="&url.expat;">Expat</ulink> to
parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
trees. Have also a look at the Alvis &acro.xml;/&acro.xslt; filter described in
the next session.
</para>
</section>
<section id="componentmodulestext">
<title>TEXT Record Model and Filter Module</title>
<para>
Plain ASCII text filter. TODO: add information here.
</para>
</section>
<!--
<section id="componentmodulessafari">
<title>SAFARI Record Model and Filter Module</title>
<para>
SAFARI filter module TODO: add information here.
</para>
</section>
-->
</section>
</section>
<section id="architecture-workflow">
<title>Indexing and Retrieval Workflow</title>
<para>
Records pass through three different states during processing in the
system.
</para>
<para>
<itemizedlist>
<listitem>
<para>
When records are accessed by the system, they are represented
in their local, or native format. This might be &acro.sgml; or HTML files,
News or Mail archives, &acro.marc; records. If the system doesn't already
know how to read the type of data you need to store, you can set up an
input filter by preparing conversion rules based on regular
expressions and possibly augmented by a flexible scripting language
(Tcl).
The input filter produces as output an internal representation,
a tree structure.
</para>
</listitem>
<listitem>
<para>
When records are processed by the system, they are represented
in a tree-structure, constructed by tagged data elements hanging off a
root node. The tagged elements may contain data or yet more tagged
elements in a recursive structure. The system performs various
actions on this tree structure (indexing, element selection, schema
mapping, etc.),
</para>
</listitem>
<listitem>
<para>
Before transmitting records to the client, they are first
converted from the internal structure to a form suitable for exchange
over the network - according to the &acro.z3950; standard.
</para>
</listitem>
</itemizedlist>
</para>
</section>
<section id="special-retrieval">
<title>Retrieval of &zebra; internal record data</title>
<para>
Starting with <literal>&zebra;</literal> version 2.0.5 or newer, it is
possible to use a special element set which has the prefix
<literal>zebra::</literal>.
</para>
<para>
Using this element will, regardless of record type, return
&zebra;'s internal index structure/data for a record.
In particular, the regular record filters are not invoked when
these are in use.
This can in some cases make the retrieval faster than regular
retrieval operations (for &acro.marc;, &acro.xml; etc).
</para>
<table id="special-retrieval-types">
<title>Special Retrieval Elements</title>
<tgroup cols="2">
<thead>
<row>
<entry>Element Set</entry>
<entry>Description</entry>
<entry>Syntax</entry>
</row>
</thead>
<tbody>
<row>
<entry><literal>zebra::meta::sysno</literal></entry>
<entry>Get &zebra; record system ID</entry>
<entry>&acro.xml; and &acro.sutrs;</entry>
</row>
<row>
<entry><literal>zebra::data</literal></entry>
<entry>Get raw record</entry>
<entry>all</entry>
</row>
<row>
<entry><literal>zebra::meta</literal></entry>
<entry>Get &zebra; record internal metadata</entry>
<entry>&acro.xml; and &acro.sutrs;</entry>
</row>
<row>
<entry><literal>zebra::index</literal></entry>
<entry>Get all indexed keys for record</entry>
<entry>&acro.xml; and &acro.sutrs;</entry>
</row>
<row>
<entry>
<literal>zebra::index::</literal><replaceable>f</replaceable>
</entry>
<entry>
Get indexed keys for field <replaceable>f</replaceable> for record
</entry>
<entry>&acro.xml; and &acro.sutrs;</entry>
</row>
<row>
<entry>
<literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
</entry>
<entry>
Get indexed keys for field <replaceable>f</replaceable>
and type <replaceable>t</replaceable> for record
</entry>
<entry>&acro.xml; and &acro.sutrs;</entry>
</row>
<row>
<entry>
<literal>zebra::snippet</literal>
</entry>
<entry>
Get snippet for record for one or more indexes (f1,f2,..).
This includes a phrase from the original
record at the point where a match occurs (for a query). By default
give terms before - and after are included in the snippet. The
matching terms are enclosed within element
<literal><s></literal>. The snippet facility requires
Zebra 2.0.16 or later.
</entry>
<entry>&acro.xml; and &acro.sutrs;</entry>
</row>
<row>
<entry>
<literal>zebra::facet::</literal><replaceable>f1</replaceable>:<replaceable>t1</replaceable>,<replaceable>f2</replaceable>:<replaceable>t2</replaceable>,..
</entry>
<entry>
Get facet of a result set. The facet result is returned
as if it was a normal record, while in reality is a
recap of most "important" terms in a result set for the fields
given.
The facet facility first appeared in Zebra 2.0.20.
</entry>
<entry>&acro.xml;</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
For example, to fetch the raw binary record data stored in the
zebra internal storage, or on the filesystem, the following
commands can be issued:
<screen>
Z> f @attr 1=title my
Z> format xml
Z> elements zebra::data
Z> s 1+1
Z> format sutrs
Z> s 1+1
Z> format usmarc
Z> s 1+1
</screen>
</para>
<para>
The special
<literal>zebra::data</literal> element set name is
defined for any record syntax, but will always fetch
the raw record data in exactly the original form. No record syntax
specific transformations will be applied to the raw record data.
</para>
<para>
Also, &zebra; internal metadata about the record can be accessed:
<screen>
Z> f @attr 1=title my
Z> format xml
Z> elements zebra::meta::sysno
Z> s 1+1
</screen>
displays in <literal>&acro.xml;</literal> record syntax only internal
record system number, whereas
<screen>
Z> f @attr 1=title my
Z> format xml
Z> elements zebra::meta
Z> s 1+1
</screen>
displays all available metadata on the record. These include system
number, database name, indexed filename, filter used for indexing,
score and static ranking information and finally bytesize of record.
</para>
<para>
Sometimes, it is very hard to figure out what exactly has been
indexed how and in which indexes. Using the indexing stylesheet of
the Alvis filter, one can at least see which portion of the record
went into which index, but a similar aid does not exist for all
other indexing filters.
</para>
<para>
The special
<literal>zebra::index</literal> element set names are provided to
access information on per record indexed fields. For example, the
queries
<screen>
Z> f @attr 1=title my
Z> format sutrs
Z> elements zebra::index
Z> s 1+1
</screen>
will display all indexed tokens from all indexed fields of the
first record, and it will display in <literal>&acro.sutrs;</literal>
record syntax, whereas
<screen>
Z> f @attr 1=title my
Z> format xml
Z> elements zebra::index::title
Z> s 1+1
Z> elements zebra::index::title:p
Z> s 1+1
</screen>
displays in <literal>&acro.xml;</literal> record syntax only the content
of the zebra string index <literal>title</literal>, or
even only the type <literal>p</literal> phrase indexed part of it.
</para>
<note>
<para>
Trying to access numeric <literal>&acro.bib1;</literal> use
attributes or trying to access non-existent zebra intern string
access points will result in a Diagnostic 25: Specified element set
'name not valid for specified database.
</para>
</note>
</section>
</chapter>
<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-omittag:t
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End:
-->
|