<html><head><meta charset="ISO-8859-1"><title>Chapter 10. Field Structure and Character Sets</title><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot"><link rel="home" href="index.html" title="Zebra - User's Guide and Reference"><link rel="up" href="index.html" title="Zebra - User's Guide and Reference"><link rel="prev" href="grs-extended-marc-indexing.html" title="5. Extended indexing of MARC records"><link rel="next" href="character-map-files.html" title="2. Charmap Files"></head><body><link rel="stylesheet" type="text/css" href="common/style1.css"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">Chapter 10. Field Structure and Character Sets
  </th></tr><tr><td width="20%" align="left"><a accesskey="p" href="grs-extended-marc-indexing.html">Prev</a></td><th width="60%" align="center"></th><td width="20%" align="right"><a accesskey="n" href="character-map-files.html">Next</a></td></tr></table><hr></div><div class="chapter"><div class="titlepage"><div><div><h1 class="title"><a name="fields-and-charsets"></a>Chapter 10. Field Structure and Character Sets
  </h1></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl class="toc"><dt><span class="section"><a href="fields-and-charsets.html#default-idx-file">1. The default.idx file</a></span></dt><dt><span class="section"><a href="character-map-files.html">2. Charmap Files</a></span></dt><dt><span class="section"><a href="icuchain-files.html">3. ICU Chain Files</a></span></dt></dl></div><p>
   To provide flexible handling of national character sets,
   <span class="application">Zebra</span> lets the administrator configure the
   system to handle any 8-bit character set, including sets that
   require multi-octet diacritics or other multi-octet characters. The
   definition of a character set specifies the
   permissible values, their sort order (which affects the display in the
   SCAN function), and the relationships between upper- and lowercase
   characters. Finally, the definition specifies the
   space characters for the set.
  </p><p>
   The operator can define different character sets for different fields,
   typical examples being standard text fields, numerical fields, and
   special-purpose fields such as WWW-style linkages (URx).
  </p><p>
   Zebra 1.3 and Zebra versions 2.0.18 and earlier required the field
   type to be a single character, e.g. <code class="literal">w</code> (for word) or
   <code class="literal">p</code> (for phrase). Zebra 2.0.20 and later allow field types
   to be any string. This gives greater flexibility; in particular,
   per-locale (language) fields can be defined.
  </p><p>
   As of version 2.0.20, Zebra can also be configured, per field, to use the
   <a class="ulink" href="https://github.com/unicode-org/icu" target="_top">ICU</a> library to perform tokenization and
   normalization of strings. This is an alternative to the "charmap"
   files, which have been part of Zebra since its first release.
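   </p><p>
    As an illustration of a string-valued field type, the following
    <code class="filename">default.idx</code> fragment sketches a per-locale
    field type that uses ICU instead of a charmap. The field type name
    <code class="literal">en</code> and the file name
    <code class="filename">words-en.xml</code> are hypothetical and chosen
    only for this example; the directives themselves are described in
    the section on <code class="filename">default.idx</code>. An indexing
    rule such as <code class="literal">title:en</code> would then select
    this field type.
   </p><pre class="screen">
     # Hypothetical per-locale word index
     # (the name "en" and the file words-en.xml are assumptions)
     index en
     completeness 0
     icuchain words-en.xml
     </pre><p>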
  </p><div class="section"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="default-idx-file"></a>1. The default.idx file</h2></div></div></div><p>
    The field types, and hence character sets, are associated with data
    elements by the indexing rules (for example, <code class="literal">title:w</code>) in the
    various filters. Fields are defined in a field definition file which,
    by default, is called <code class="filename">default.idx</code>.
    This file associates field type codes
    with the character map files (with the .chr suffix). The format
    of the .idx file is as follows:
   </p><p>
    </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">index <em class="replaceable"><code>field type code</code></em></span></dt><dd><p>
	This directive introduces a new search index code.
	The argument is the field type code used in the
	.abs files to select this particular index type: a single character in
	Zebra 2.0.18 and earlier, any string in Zebra 2.0.20 and later. An index
	corresponds roughly to a particular structure attribute during search. Refer
	to <a class="xref" href="zebrasrv.html#zebrasrv-search" title="Z39.50 Search">the section called &#8220;<acronym class="acronym">Z39.50</acronym> Search&#8221;</a>.
       </p></dd><dt><span class="term">sort <em class="replaceable"><code>field type code</code></em></span></dt><dd><p>
	This directive introduces a
	sort index. The argument is the field type code used in the
	.abs file to select this particular index type. The corresponding
	use attribute must be used in the sort request to refer to this
	particular sort index. The corresponding character map (see below)
	is used in the sort process.
       </p></dd><dt><span class="term">completeness <em class="replaceable"><code>boolean</code></em></span></dt><dd><p>
	This directive enables or disables complete field indexing.
	The value of the <em class="replaceable"><code>boolean</code></em> should be 0
	(disable) or 1 (enable). If completeness is enabled, the index entry will
	contain the complete contents of the field (up to a limit), with words
	(non-space characters) separated by single space characters
	(normalized to " " on display). When completeness is
	disabled, each word is indexed as a separate entry. Complete subfield
	indexing is most useful for fields which are typically browsed (e.g.,
	titles, authors, or subjects), or instances where a match on a
	complete subfield is essential (e.g., exact title searching). For fields
	where completeness is disabled, the search engine will interpret a
	search containing space characters as a word proximity search.
       </p></dd><dt><a name="default.idx.firstinfield"></a><span class="term">firstinfield <em class="replaceable"><code>boolean</code></em></span></dt><dd><p>
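	This directive enables or disables first-in-field indexing, which
	Zebra needs in order to support first-in-field position searches
	(<code class="literal">@attr 3=1</code>) on this index.
	The value of the <em class="replaceable"><code>boolean</code></em> should be 0
	(disable) or 1 (enable).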
       </p></dd><dt><a name="default.idx.alwaysmatches"></a><span class="term">alwaysmatches <em class="replaceable"><code>boolean</code></em></span></dt><dd><p>
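	This directive enables or disables alwaysmatches indexing, which
	Zebra needs in order to support the always-matches relation
	(<code class="literal">@attr 2=103</code>) on this index.
	The value of the <em class="replaceable"><code>boolean</code></em> should be 0
	(disable) or 1 (enable).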
       </p></dd><dt><span class="term">charmap <em class="replaceable"><code>filename</code></em></span></dt><dd><p>
	This is the filename of the character
	map to be used for this field type.
        See <a class="xref" href="character-map-files.html" title="2. Charmap Files">Section 2, &#8220;Charmap Files&#8221;</a> for details.
       </p></dd><dt><span class="term">icuchain <em class="replaceable"><code>filename</code></em></span></dt><dd><p>
	Specifies the name of the file that contains the ICU tokenization and
	normalization rules.
	See <a class="xref" href="icuchain-files.html" title="3. ICU Chain Files">Section 3, &#8220;ICU Chain Files&#8221;</a> for details.
	Using icuchain for a field type is an alternative to
	charmap. It does not make sense to define both
	icuchain and charmap for the same field type.
       </p></dd></dl></div><p>
   </p><div class="example"><a name="field-types"></a><p class="title"><b>Example 10.1. Field types</b></p><div class="example-contents"><p>
     The following are three excerpts from the standard
     <code class="filename">tab/default.idx</code> configuration file. Note
     that <code class="literal">index</code> and <code class="literal">sort</code>
     are grouping directives; all directives that follow one of them
     are bound to it:
     </p><pre class="screen">
     # Traditional word index
     # Used if completeness is 'incomplete field' (@attr 6=1) and
     # structure is word/phrase/word-list/free-form-text/document-text
     index w
     completeness 0
     position 1
     alwaysmatches 1
     firstinfield 1
     charmap string.chr

     ...

     # Null map index (no mapping at all)
     # Used if structure=key (@attr 4=3)
     index 0
     completeness 0
     position 1
     charmap @

     ...

     # Sort register
     sort s
     completeness 1
     charmap string.chr
     </pre><p>
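    </p><p>
     For contrast with the word index, a complete-field (phrase) index
     could be declared along the following lines. This is a sketch built
     only from the directives described in this section, using the
     <code class="literal">p</code> field type mentioned earlier; the
     corresponding entry in the distributed
     <code class="filename">tab/default.idx</code> may differ. With
     completeness enabled, an indexing rule such as
     <code class="literal">title:p</code> would store each title as a single
     index entry, which suits browsing and exact title searching.
    </p><pre class="screen">
     # Phrase index sketch: the whole field becomes one index entry
     index p
     completeness 1
     charmap string.chr
     </pre><p>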
    </p></div></div><br class="example-break"></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="grs-extended-marc-indexing.html">Prev</a></td><td width="20%" align="center"></td><td width="40%" align="right"><a accesskey="n" href="character-map-files.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">5. Extended indexing of <acronym class="acronym">MARC</acronym> records</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top">2. Charmap Files</td></tr></table></div></body></html>