File: Zend_Search_Lucene-Charset.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.charset">
    <title>Character Set</title>

    <sect2 id="zend.search.lucene.charset.description">
        <title>UTF-8 and single-byte character set support</title>

        <para>
            <classname>Zend_Search_Lucene</classname> uses the UTF-8 character set internally.
            Index files store Unicode data in Java's "modified UTF-8" encoding. The
            <classname>Zend_Search_Lucene</classname> core fully supports this encoding, with
            one exception.

            <footnote>
               <para>
                   <classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
                   (BMP) characters (0x0000 to 0xFFFF) and does not support
                   "supplementary characters" (characters whose code points are
                   greater than 0xFFFF).
               </para>

               <para>
                   Java 2 represents each of these characters as a pair of 16-bit char
                   values: the first from the high-surrogates range (0xD800-0xDBFF),
                   the second from the low-surrogates range (0xDC00-0xDFFF). Each
                   surrogate is then encoded as an ordinary three-byte UTF-8 sequence,
                   giving six bytes in total. Standard UTF-8 instead uses four bytes
                   for a supplementary character.
               </para>
            </footnote>
        </para>
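
        <para>
            Because supplementary characters are not supported, it may be useful to detect them
            in input before indexing. The following is a minimal sketch, not part of
            <classname>Zend_Search_Lucene</classname>: the helper name
            containsSupplementaryChars() is hypothetical, and the check assumes that PCRE was
            built with UTF-8 support.
        </para>

        <programlisting language="php"><![CDATA[
// Hypothetical helper (an illustration, not a Zend_Search_Lucene API):
// reports whether a UTF-8 string contains code points above U+FFFF.
function containsSupplementaryChars($utf8Text)
{
    // U+10000..U+10FFFF covers every supplementary plane.
    // The /u modifier makes PCRE treat subject and pattern as UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8Text) === 1;
}

// "Z\xC3\xBCrich" is "Zürich" in UTF-8: BMP characters only.
var_dump(containsSupplementaryChars("Z\xC3\xBCrich"));          // bool(false)

// "\xF0\x9D\x84\x9E" is U+1D11E (musical G clef), a four-byte sequence.
var_dump(containsSupplementaryChars("clef: \xF0\x9D\x84\x9E")); // bool(true)
]]></programlisting>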

        <para>
            The encoding of the actual input data may be specified through the
            <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>. The data is then
            converted to UTF-8 automatically.
        </para>
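
        <para>
            As an illustration, the sketch below passes the source encoding as the optional last
            argument of a field factory method and of the query parser. The variables
            <varname>$title</varname>, <varname>$queryStr</varname> and
            <varname>$index</varname> are placeholders: the first two are assumed to hold
            ISO-8859-1 encoded strings, and the third an already opened index.
        </para>

        <programlisting language="php"><![CDATA[
// Assumption: $title holds ISO-8859-1 bytes; the third argument names that
// encoding so the value can be converted to UTF-8 when the field is indexed.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));

// Assumption: $queryStr is user input in the same encoding.
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-1');

// Assumption: $index is an index opened or created elsewhere.
$hits = $index->find($query);
]]></programlisting>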
    </sect2>

    <sect2 id="zend.search.lucene.charset.default_analyzer">
        <title>Default text analyzer</title>

        <para>
            The default text analyzer (which the query parser also uses) tokenizes text and
            queries with ctype_alpha().
        </para>

        <para>
            Because ctype_alpha() is not UTF-8 compatible, the analyzer converts text to
            'ASCII//TRANSLIT' encoding before indexing. The same conversion is applied
            transparently during query parsing.

            <footnote>
               <para>
                   The result of conversion to 'ASCII//TRANSLIT' may depend on the current
                   locale and operating system.
               </para>
            </footnote>
        </para>
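
        <para>
            The effect can be observed with plain <acronym>PHP</acronym> functions. The sketch
            below only illustrates the two calls named above; it is not the analyzer's actual
            implementation, and the transliterated output is deliberately not shown because it
            varies between locales and iconv implementations.
        </para>

        <programlisting language="php"><![CDATA[
// "Z\xC3\xBCrich" is the UTF-8 byte sequence for "Zürich".
$text = "Z\xC3\xBCrich";

// ctype_alpha() examines the string byte by byte, so it cannot treat the
// two-byte UTF-8 sequence for "ü" as a single letter.
var_dump(ctype_alpha('Zurich'));   // plain ASCII letters: bool(true)
var_dump(ctype_alpha($text));      // result depends on the LC_CTYPE locale

// Transliterate to ASCII. Depending on locale and platform, "ü" may become
// "u", "?", or another replacement, or the call may fail and return false.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
var_dump($ascii);
]]></programlisting>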

        <note>
            <title/>
            <para>
                The default analyzer does not treat numbers as parts of terms. Use the
                corresponding 'Num' analyzer if you do not want words to be broken by numbers.
            </para>
        </note>
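
        <para>
            For example, with the default analyzer a term such as "mp3player" would be expected
            to split at the digit. A minimal sketch of switching to a 'Num' analyzer follows; it
            assumes the single-byte class
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive</classname>
            is the appropriate counterpart for your data.
        </para>

        <programlisting language="php"><![CDATA[
// Assumption: the single-byte, case-insensitive analyzer that keeps digits
// inside terms is the right choice for the indexed data.
// Set it before both indexing and searching so tokens match.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
]]></programlisting>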
    </sect2>

    <sect2 id="zend.search.lucene.charset.utf_analyzer">
        <title>UTF-8 compatible text analyzers</title>

        <para>
            <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
            analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
        </para>

        <para>
            Any of these analyzers can be enabled with code like this:
        </para>

        <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>

        <warning>
            <title/>
            <para>
                The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier
                versions of the analyzers assumed that all non-ASCII characters are letters. The
                new analyzer implementation behaves more accurately.
            </para>

            <para>
                You may need to rebuild existing indexes so that data and search queries are
                tokenized in the same way; otherwise the search engine may return wrong result
                sets.
            </para>
        </warning>

        <para>
            All of these analyzers require the PCRE (Perl-compatible regular expressions) library
            to be compiled with UTF-8 support turned on. UTF-8 support is turned on in the PCRE
            library sources bundled with the <acronym>PHP</acronym> source code distribution, but
            if a shared library is used instead of the sources bundled with
            <acronym>PHP</acronym>, the state of UTF-8 support may depend on your operating system.
        </para>

        <para>
            Use the following code to check whether PCRE UTF-8 support is enabled:
        </para>

        <programlisting language="php"><![CDATA[
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>

        <para>
            The case-insensitive versions of the UTF-8 compatible analyzers also require the <ulink
                url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to
            be enabled.
        </para>
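
        <para>
            Whether mbstring is available can be checked at run time. The following sketch picks
            an analyzer accordingly; the fallback to the case-sensitive analyzer is an
            illustrative choice, not a recommendation from the framework.
        </para>

        <programlisting language="php"><![CDATA[
if (extension_loaded('mbstring')) {
    // mbstring is available: the case-insensitive UTF-8 analyzer can be used.
    $analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive();
} else {
    // Illustrative fallback: case-sensitive UTF-8 analyzer without mbstring.
    $analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8();
}

// Use the same analyzer for indexing and for searching; an index built with
// one analyzer may not match queries tokenized by the other.
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
]]></programlisting>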

        <para>
            If you do not want to enable the mbstring extension but still need case-insensitive
            search, you may use the following approach: normalize the source data before indexing
            and the query string before searching by converting both to lowercase:
        </para>

        <programlisting language="php"><![CDATA[
// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$doc = new Zend_Search_Lucene_Document();

$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));

// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>

        <programlisting language="php"><![CDATA[
// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$hits = $index->find(strtolower($query));
]]></programlisting>
    </sect2>
</sect1>