File: Zend_Search_Lucene-BestPractice.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.best-practice">
    <title>Best Practices</title>

    <sect2 id="zend.search.lucene.best-practice.field-names">
        <title>Field names</title>

        <para>
            <classname>Zend_Search_Lucene</classname> places no restrictions on field names.
        </para>

        <para>
            Nevertheless, it's a good idea to avoid the names '<emphasis>id</emphasis>' and
            '<emphasis>score</emphasis>', because they clash with the
            <classname>QueryHit</classname> property names.
        </para>

        <para>
            The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <property>id</property>
            and <property>score</property> properties always refer to the internal Lucene document
            id and the hit <link linkend="zend.search.lucene.searching.results-scoring">score</link>.
            If the indexed document has stored fields with the same names, you have to use the
            <methodname>getDocument()</methodname> method to access them:
        </para>

        <programlisting language="php"><![CDATA[
$hits = $index->find($query);

foreach ($hits as $hit) {
    // Get 'title' document field
    $title = $hit->title;

    // Get 'contents' document field
    $contents = $hit->contents;

    // Get internal Lucene document id
    $id = $hit->id;

    // Get query hit score
    $score = $hit->score;

    // Get 'id' document field
    $docId = $hit->getDocument()->id;

    // Get 'score' document field
    $docScore = $hit->getDocument()->score;

    // Another way to get 'title' document field
    $title = $hit->getDocument()->title;
}
]]></programlisting>
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.indexing-performance">
        <title>Indexing performance</title>

        <para>
            Indexing performance is a trade-off between resource usage, indexing time and index
            quality.
        </para>

        <para>
            Index quality is completely determined by the number of index segments.
        </para>

        <para>
            Each index segment is an entirely independent portion of data, so indexes containing
            more segments need more memory and time for searching.
        </para>

        <para>
            Index optimization is a process of merging several segments into a new one. A fully
            optimized index contains only one segment.
        </para>

        <para>
            Full index optimization may be performed with the <methodname>optimize()</methodname>
            method:
        </para>

        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();
]]></programlisting>

        <para>
            Index optimization works with data streams and doesn't take a lot of memory but does
            require processor resources and time.
        </para>

        <para>
            Lucene index segments are not updatable by their nature (the update operation requires
            the segment file to be completely rewritten). So adding new document(s) to an index
            always generates a new segment. This, in turn, decreases index quality.
        </para>

        <para>
            An index auto-optimization process is performed after each segment generation and
            consists of merging partial segments.
        </para>

        <para>
            There are three options to control the behavior of auto-optimization (see the <link
                linkend="zend.search.lucene.index-creation.optimization">Index optimization</link>
            section):

            <itemizedlist>
                <listitem>
                    <para>
                        <emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be
                        buffered in memory before a new segment is generated and written to the hard
                        drive.
                    </para>
                </listitem>

                <listitem>
                    <para>
                        <emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged
                        by the auto-optimization process into a new segment.
                    </para>
                </listitem>

                <listitem>
                    <para>
                        <emphasis>MergeFactor</emphasis> determines how often auto-optimization is
                        performed.
                    </para>
                </listitem>
            </itemizedlist>

            <note>
                <para>
                    All these options are <classname>Zend_Search_Lucene</classname> object
                    properties, not index properties. They affect only the behavior of the current
                    <classname>Zend_Search_Lucene</classname> object and may vary between
                    scripts.
                </para>
            </note>
        </para>
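
        <para>
            As a minimal sketch, the three options can be set through the corresponding setter
            methods on the index object before documents are added. The values below are purely
            illustrative assumptions, not recommendations, and $indexPath and $doc are assumed to
            be defined elsewhere:
        </para>

        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

// Illustrative values only; tune them for your own data and memory limit
$index->setMaxBufferedDocs(100);  // buffer up to 100 documents before writing a segment
$index->setMaxMergeDocs(50000);   // auto-optimization merges at most 50000 documents per segment
$index->setMergeFactor(10);       // determines how often auto-optimization is performed

// The settings apply to this object only, not to the index on disk
$index->addDocument($doc);
]]></programlisting>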

        <para>
            <emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one
            document per script execution. On the other hand, it's very important for batch
            indexing. Greater values increase indexing performance, but also require more memory.
        </para>

        <para>
            There is simply no way to calculate the best value for the
            <emphasis>MaxBufferedDocs</emphasis> parameter because it depends on average document
            size, the analyzer in use and allowed memory.
        </para>

        <para>
            A good way to find the right value is to perform several tests with the largest document
            you expect to be added to the index<footnote>
                <para>
                    <methodname>memory_get_usage()</methodname> and
                    <methodname>memory_get_peak_usage()</methodname> may be used to monitor memory
                    usage.
                </para>
            </footnote>. It's a best practice not to use more than half of the allowed memory.
        </para>
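
        <para>
            The "half of the allowed memory" rule of thumb could be checked after a test run with
            a sketch like the following. The helper names memoryLimitToBytes() and
            staysWithinHalfOfLimit() are hypothetical and not part of
            <classname>Zend_Search_Lucene</classname>; the sketch only assumes that the limit comes
            from the <acronym>PHP</acronym> memory_limit setting:
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Converts a php.ini style size such as "128M" into bytes.
 * Returns null when the limit is empty or unlimited ("-1").
 */
function memoryLimitToBytes($limit)
{
    $limit = trim((string) $limit);
    if ($limit === '' || $limit === '-1') {
        return null;
    }

    $value = (int) $limit;          // numeric prefix, e.g. 128 for "128M"
    switch (strtolower(substr($limit, -1))) {
        case 'g':
            $value *= 1024;         // fall through
        case 'm':
            $value *= 1024;         // fall through
        case 'k':
            $value *= 1024;
    }

    return $value;
}

/**
 * Tells whether the observed peak stays within half of the allowed memory.
 */
function staysWithinHalfOfLimit($peakBytes, $limitBytes)
{
    return $limitBytes === null || $peakBytes <= $limitBytes / 2;
}

// Run a test batch with a candidate MaxBufferedDocs value first, then check:
$limit = memoryLimitToBytes(ini_get('memory_limit'));
$ok    = staysWithinHalfOfLimit(memory_get_peak_usage(), $limit);

echo $ok
    ? "Peak memory is within half of the limit\n"
    : "Peak memory exceeds half of the limit; consider lowering MaxBufferedDocs\n";
]]></programlisting>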

        <para>
            <emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It
            therefore also limits auto-optimization time by guaranteeing that a single
            <methodname>addDocument()</methodname> call does not run longer than a certain amount
            of time. This is very important for interactive applications.
        </para>

        <para>
            Lowering the <emphasis>MaxMergeDocs</emphasis> parameter may also improve batch indexing
            performance. Index auto-optimization is an iterative process and is performed from the
            bottom up. Small segments are merged into larger segments, which are in turn merged into
            even larger segments and so on. Full index optimization is achieved when only one large
            segment file remains.
        </para>

        <para>
            Small segments generally decrease index quality. Many small segments may also trigger
            the "Too many open files" error, which is governed by operating system limits<footnote>
                <para>
                    <classname>Zend_Search_Lucene</classname> keeps each segment file open to
                    improve search performance.
                </para>
            </footnote>.
        </para>

        <para>
            In general, background index optimization should be performed for interactive indexing
            mode and <emphasis>MaxMergeDocs</emphasis> shouldn't be too low for batch indexing.
        </para>

        <para>
            <emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values
            increase the quality of unoptimized indexes. Larger values increase indexing
            performance, but also increase the number of merged segments. This again may trigger the
            "Too many open files" error.
        </para>

        <para>
            <emphasis>MergeFactor</emphasis> groups index segments by their size:

            <orderedlist>
                <listitem>
                    <para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para>
                </listitem>

                <listitem>
                    <para>
                        Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
                        <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.
                    </para>
                </listitem>

                <listitem>
                    <para>
                        Greater than
                        <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but
                        not greater than
                        <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.
                    </para>
                </listitem>

                <listitem><para>...</para></listitem>
            </orderedlist>
        </para>
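
        <para>
            The grouping above can be sketched as a small function. The name segmentSizeGroup() is
            hypothetical and only illustrates the size boundaries listed here; it is not how
            <classname>Zend_Search_Lucene</classname> itself is implemented:
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Returns the 1-based size group of a segment holding $docCount documents.
 */
function segmentSizeGroup($docCount, $maxBufferedDocs, $mergeFactor)
{
    if ($maxBufferedDocs < 1 || $mergeFactor < 2) {
        // Guard added for this sketch so the loop below always terminates
        throw new InvalidArgumentException('Expected MaxBufferedDocs >= 1 and MergeFactor >= 2');
    }

    $group      = 1;
    $upperBound = $maxBufferedDocs;

    // The upper bound of each following group is MergeFactor times larger
    while ($docCount > $upperBound) {
        $upperBound *= $mergeFactor;
        $group++;
    }

    return $group;
}

// With MaxBufferedDocs = 10 and MergeFactor = 10:
echo segmentSizeGroup(7, 10, 10), "\n";    // 1 (not greater than 10)
echo segmentSizeGroup(11, 10, 10), "\n";   // 2 (greater than 10, not greater than 100)
echo segmentSizeGroup(450, 10, 10), "\n";  // 3 (greater than 100, not greater than 1000)
]]></programlisting>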

        <para>
            <classname>Zend_Search_Lucene</classname> checks during each
            <methodname>addDocument()</methodname> call to see if merging any segments may move the
            newly created segment into the next group. If yes, then merging is performed.
        </para>

        <para>
            So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> +
            (N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
            <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript>
            documents.
        </para>

        <para>
            This gives a good approximation for the number of segments in the index:
        </para>
        <para>
            <emphasis>NumberOfSegments</emphasis> &lt;= <emphasis>MaxBufferedDocs</emphasis> +
            <emphasis>MergeFactor</emphasis>*log
            <subscript><emphasis>MergeFactor</emphasis></subscript>
            (<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
        </para>
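
        <para>
            As a worked example of this estimate, the following sketch evaluates the upper bound
            for a given set of parameters. The function name estimateMaxSegments() is hypothetical,
            and clamping the ratio at 1 for very small indexes is an assumption of this sketch
            rather than part of the formula:
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Evaluates MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs).
 */
function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
{
    // Fewer documents than one buffer would make the logarithm negative, so clamp at 1
    $ratio  = max(1, $numberOfDocuments / $maxBufferedDocs);
    $levels = log($ratio, $mergeFactor);

    return $maxBufferedDocs + $mergeFactor * $levels;
}

// 100000 documents with MaxBufferedDocs = 10 and MergeFactor = 10:
// 10 + 10 * log10(10000) = 10 + 10 * 4, so roughly 50 segments at most
printf("%.1f\n", estimateMaxSegments(100000, 10, 10));
]]></programlisting>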

        <para>
            <emphasis>MaxBufferedDocs</emphasis> is determined by allowed memory. With that value
            fixed, the merge factor can then be chosen to give a reasonable number of segments.
        </para>

        <para>
            Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch
            indexing performance than <emphasis>MaxMergeDocs</emphasis>. But it's also more
            coarse-grained. So use the estimation above for tuning <emphasis>MergeFactor</emphasis>,
            then experiment with <emphasis>MaxMergeDocs</emphasis> to get the best batch indexing
            performance.
        </para>
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.shutting-down">
        <title>Index during Shut Down</title>

        <para>
            The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time
            if any documents were added to the index but not written to a new segment.
        </para>

        <para>
            It also may trigger an auto-optimization process.
        </para>

        <para>
            The index object is automatically closed when it, and all returned QueryHit objects, go
            out of scope.
        </para>

        <para>
            If the index object is stored in a global variable, then it is closed only at the end
            of script execution<footnote>
                <para>
                    This may also occur if the index or QueryHit instances are referenced in
                    cyclical data structures, because <acronym>PHP</acronym> garbage collects
                    objects with cyclic references only at the end of script execution.
                </para>
            </footnote>.
        </para>

        <para>
            <acronym>PHP</acronym> exception processing is also shut down at this moment.
        </para>

        <para>
            This doesn't prevent the normal index shutdown process, but it may prevent accurate
            error diagnostics if any error occurs during shutdown.
        </para>

        <para>
            There are two ways to avoid this problem.
        </para>

        <para>
            The first is to force going out of scope:
        </para>

        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

...

unset($index);
]]></programlisting>

        <para>
            And the second is to perform a commit operation before the end of script execution:
        </para>

        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

$index->commit();
]]></programlisting>

        <para>
            This possibility is also described in the "<link
                linkend="zend.search.lucene.advanced.static">Advanced. Using index as static
                property</link>" section.
        </para>
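
        <para>
            Because the explicit commit runs while exception processing is still available, it can
            be wrapped in ordinary error handling. The following is only a sketch: $indexPath and
            $doc are assumed to be defined elsewhere, and the error_log() call stands in for
            whatever logging your application uses:
        </para>

        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

try {
    $index->addDocument($doc);

    // Write buffered documents now, while exceptions can still be reported normally
    $index->commit();
} catch (Zend_Search_Lucene_Exception $e) {
    // Placeholder error handling; replace with your own logging or recovery
    error_log('Index update failed: ' . $e->getMessage());
}
]]></programlisting>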
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.unique-id">
        <title>Retrieving documents by unique id</title>

        <para>
            It's a common practice to store some unique document id in the index. Examples include
            a <acronym>URL</acronym>, a file path, or a database id.
        </para>

        <para>
            <classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname>
            method for retrieving documents containing specified terms.
        </para>

        <para>
            This is more efficient than using the <methodname>find()</methodname> method:
        </para>

        <programlisting language="php"><![CDATA[
// Retrieving documents with find() method using a query string
$query = $idFieldName . ':' . $docId;
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}
...

// Retrieving documents with find() method using the query API
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}

...

// Retrieving documents with termDocs() method
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$docIds  = $index->termDocs($term);
foreach ($docIds as $id) {
    $doc = $index->getDocument($id);
    $title    = $doc->title;
    $contents = $doc->contents;
    ...
}
]]></programlisting>
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.memory-usage">
        <title>Memory Usage</title>

        <para>
            <classname>Zend_Search_Lucene</classname> is a relatively memory-intensive module.
        </para>

        <para>
            It uses memory to cache some information and optimize searching and indexing
            performance.
        </para>

        <para>
            The memory required differs for different modes.
        </para>

        <para>
            The terms dictionary index is loaded during the search. It actually contains every
            128<superscript>th</superscript><footnote>
                <para>
                    The Lucene file format allows you to configure this number, but
                    <classname>Zend_Search_Lucene</classname> doesn't expose this in its
                    <acronym>API</acronym>. Nevertheless you still have the ability to configure
                    this value if the index is prepared with another Lucene implementation.
                </para>
            </footnote> term of the full dictionary.
        </para>

        <para>
            Thus memory usage increases if you have a high number of unique terms. This may
            happen if you use untokenized phrases as field values or index a large volume of
            non-text information.
        </para>

        <para>
            An unoptimized index consists of several segments, which also increases memory usage.
            Segments are independent, so each segment contains its own terms dictionary and terms
            dictionary index. If an index consists of <emphasis>N</emphasis> segments, memory usage
            may increase by a factor of <emphasis>N</emphasis> in the worst case. Perform index
            optimization to merge all segments into one to avoid such memory consumption.
        </para>

        <para>
            Indexing uses the same memory as searching plus memory for buffering documents. The
            amount of memory used may be managed with the <emphasis>MaxBufferedDocs</emphasis>
            parameter.
        </para>

        <para>
            Index optimization (full or partial) uses stream-style data processing and doesn't
            require a lot of memory.
        </para>
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.encoding">
        <title>Encoding</title>

        <para>
            <classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all
            strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
        </para>

        <para>
            You shouldn't be concerned with encoding if you work with pure <acronym>ASCII</acronym>
            data, but you should be careful if this is not the case.
        </para>

        <para>
            A wrong encoding may cause error notices during encoding conversion or loss of data.
        </para>
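
        <para>
            One way to catch such problems before indexing is to convert the text yourself and fail
            loudly if the conversion is rejected. The helper toUtf8OrFail() below is a hypothetical
            sketch built on the standard iconv() function. Note that it cannot detect text that is
            merely mislabeled with another single-byte encoding, since any byte sequence is valid
            in, for example, ISO-8859-1:
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Converts $text from $sourceEncoding to UTF-8 or throws if the input is invalid.
 */
function toUtf8OrFail($text, $sourceEncoding)
{
    // @ silences the notice iconv() raises on invalid input; the false result is checked instead
    $converted = @iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException(
            "Text is not valid $sourceEncoding and cannot be converted to UTF-8"
        );
    }

    return $converted;
}

// "caf\xE9" is "café" in ISO-8859-1; the result holds the UTF-8 bytes "caf\xC3\xA9"
$title = toUtf8OrFail("caf\xE9", 'ISO-8859-1');
]]></programlisting>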

        <para>
            <classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities
            for indexed documents and parsed queries.
        </para>

        <para>
            Encoding may be explicitly specified as an optional parameter of field creation methods:
        </para>

        <programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title',
                                              $title,
                                              'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  $contents,
                                                  'utf-8'));
]]></programlisting>

        <para>
            This is the best way to avoid ambiguity in the encoding used.
        </para>

        <para>
            If the optional encoding parameter is omitted, then the current locale is used. The
            current locale may contain character encoding data in addition to the language
            specification:
        </para>

        <programlisting language="php"><![CDATA[
setlocale(LC_ALL, 'fr_FR');
...

setlocale(LC_ALL, 'de_DE.iso-8859-1');
...

setlocale(LC_ALL, 'ru_RU.UTF-8');
...
]]></programlisting>

        <para>
            The same approach is used to set query string encoding.
        </para>

        <para>
            If encoding is not specified, then the current locale is used to determine the encoding.
        </para>

        <para>
            Encoding may be passed as an optional parameter, if the query is parsed explicitly
            before search:
        </para>

        <programlisting language="php"><![CDATA[
$query =
    Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
$hits = $index->find($query);
...
]]></programlisting>

        <para>
            The default encoding may also be specified with the
            <methodname>setDefaultEncoding()</methodname> method:
        </para>

        <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
$hits = $index->find($queryStr);
...
]]></programlisting>

        <para>
            The empty string implies 'current locale'.
        </para>

        <para>
            If the correct encoding is specified, the text can be processed correctly by the
            analyzer. The actual behavior depends on which analyzer is used. See the <link
                linkend="zend.search.lucene.charset">Character Set</link> documentation section for
            details.
        </para>
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.maintenance">
        <title>Index maintenance</title>

        <para>
            It should be clear that <classname>Zend_Search_Lucene</classname>, like any other
            Lucene implementation, is not a "database".
        </para>

        <para>
            Indexes should not be used for data storage. They do not provide partial backup/restore
            functionality, journaling, logging, transactions and many other features associated with
            database management systems.
        </para>

        <para>
            Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a
            consistent state at all times.
        </para>

        <para>
            Index backup and restoration should be performed by copying the contents of the index
            folder.
        </para>
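
        <para>
            A backup along these lines could be sketched as follows. The function name
            backupIndexDirectory() is hypothetical, and the sketch assumes that no other process is
            writing to the index while the files are copied:
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Copies every regular file from the index folder into the backup folder.
 * Returns the number of files copied.
 */
function backupIndexDirectory($indexPath, $backupPath)
{
    if (!is_dir($indexPath)) {
        throw new InvalidArgumentException("Index folder not found: $indexPath");
    }
    if (!is_dir($backupPath) && !mkdir($backupPath, 0775, true)) {
        throw new RuntimeException("Cannot create backup folder: $backupPath");
    }

    $copied = 0;
    foreach (scandir($indexPath) as $name) {
        $source = $indexPath . DIRECTORY_SEPARATOR . $name;
        if (!is_file($source)) {
            continue; // skips "." and ".." as well as any subfolders
        }
        if (!copy($source, $backupPath . DIRECTORY_SEPARATOR . $name)) {
            throw new RuntimeException("Cannot copy index file: $name");
        }
        $copied++;
    }

    return $copied;
}

// Usage with placeholder paths:
// $count = backupIndexDirectory('/path/to/index', '/path/to/backup');
]]></programlisting>

        <para>
            Restoration is the same operation in the opposite direction, copying the backup folder
            contents back into an empty index folder.
        </para>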

        <para>
            If index corruption occurs for any reason, the corrupted index should be restored or
            completely rebuilt.
        </para>

        <para>
            So it's a good idea to back up large indexes and store changelogs to perform manual
            restoration and roll-forward operations if necessary. This practice dramatically reduces
            index restoration time.
        </para>
    </sect2>
</sect1>