File: Zend_Search_Lucene-Charset.xml

package info (click to toggle)
zendframework 1.12.9%2Bdfsg-2
links: PTS, VCS
area: main
in suites: jessie-kfreebsd
size: 133,584 kB
sloc: xml: 1,311,829; php: 570,173; sh: 170; makefile: 125; sql: 121
file content (174 lines) | stat: -rw-r--r-- 6,609 bytes
parent folder | download | duplicates (2)
<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.charset">
    <title>Character Set</title>

    <sect2 id="zend.search.lucene.charset.description">
        <title>UTF-8 and single-byte character set support</title>

        <para>
            <classname>Zend_Search_Lucene</classname> works with the UTF-8 charset internally. Index
            files store unicode data in Java's "modified UTF-8 encoding".
            <classname>Zend_Search_Lucene</classname> core completely supports this encoding with
            one exception.

            <footnote>
               <para>
                   <classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
                   (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support
                   "supplementary characters" (characters whose code points are
                   greater than 0xFFFF)
               </para>

               <para>
                   Java 2 represents these characters as a pair of char (16-bit)
                   values, the first from the high-surrogates range (0xD800-0xDBFF),
                   the second from the low-surrogates range (0xDC00-0xDFFF). Then
                   they are encoded as usual UTF-8 characters in six bytes.
                   Standard UTF-8 representation uses four bytes for supplementary
                   characters.
               </para>
            </footnote>
        </para>

        <para>
            Actual input data encoding may be specified through
            <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>. Data will be
            automatically converted into UTF-8 encoding.
        </para>
    </sect2>

    <sect2 id="zend.search.lucene.charset.default_analyzer">
        <title>Default text analyzer</title>

        <para>
            However, the default text analyzer (which is also used within query parser) uses
            ctype_alpha() for tokenizing text and queries.
        </para>

        <para>
            ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to
            'ASCII//TRANSLIT' encoding before indexing. The same processing is transparently
            performed during query parsing.

            <footnote>
               <para>
                   Conversion to 'ASCII//TRANSLIT' may depend on current locale and OS.
               </para>
            </footnote>
        </para>

        <note>
            <title/>
            <para>
                Default analyzer doesn't treats numbers as parts of terms. Use corresponding 'Num'
                analyzer if you don't want words to be broken by numbers.
            </para>
        </note>
    </sect2>

    <sect2 id="zend.search.lucene.charset.utf_analyzer">
        <title>UTF-8 compatible text analyzers</title>

        <para>
            <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
            analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
        </para>

        <para>
            Any of this analyzers can be enabled with the code like this:
        </para>

        <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>

        <warning>
            <title/>
            <para>
                UTF-8 compatible analyzers were improved in Zend Framework 1.5. Early versions of
                analyzers assumed all non-ascii characters are letters. New analyzers implementation
                has more accurate behavior.
            </para>

            <para>
                This may need you to re-build index to have data and search queries tokenized in the
                same way, otherwise search engine may return wrong result sets.
            </para>
        </warning>

        <para>
            All of these analyzers need PCRE (Perl-compatible regular expressions) library to be
            compiled with UTF-8 support turned on. PCRE UTF-8 support is turned on for the PCRE
            library sources bundled with <acronym>PHP</acronym> source code distribution, but if
            shared library is used instead of bundled with <acronym>PHP</acronym> sources, then
            UTF-8 support state may depend on you operating system.
        </para>

        <para>
            Use the following code to check, if PCRE UTF-8 support is enabled:
        </para>

        <programlisting language="php"><![CDATA[
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>

        <para>
            Case insensitive versions of UTF-8 compatible analyzers also need <ulink
                url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to
            be enabled.
        </para>

        <para>
            If you don't want mbstring extension to be turned on, but need case insensitive search,
            you may use the following approach: normalize source data before indexing and query
            string before searching by converting them to lowercase:
        </para>

        <programlisting language="php"><![CDATA[
// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$doc = new Zend_Search_Lucene_Document();

$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));

// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>

            <programlisting language="php"><![CDATA[
// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$hits = $index->find(strtolower($query));
]]></programlisting>
    </sect2>
</sect1>