1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589
|
<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.best-practice">
<title>Best Practices</title>
<sect2 id="zend.search.lucene.best-practice.field-names">
<title>Field names</title>
<para>
There are no limitations for field names in <classname>Zend_Search_Lucene</classname>.
</para>
<para>
Nevertheless it's a good idea not to use '<emphasis>id</emphasis>' and
'<emphasis>score</emphasis>' names to avoid ambiguity in <classname>QueryHit</classname>
properties names.
</para>
<para>
The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <property>id</property>
and <property>score</property> properties always refer to internal Lucene document id
and hit <link linkend="zend.search.lucene.searching.results-scoring">score</link>. If
the indexed document has the same stored fields, you have to use the
<methodname>getDocument()</methodname> method to access them:
</para>
<programlisting language="php"><![CDATA[
$hits = $index->find($query);
foreach ($hits as $hit) {
// Get 'title' document field
$title = $hit->title;
// Get 'contents' document field
$contents = $hit->contents;
// Get internal Lucene document id
$id = $hit->id;
// Get query hit score
$score = $hit->score;
// Get 'id' document field
$docId = $hit->getDocument()->id;
// Get 'score' document field
$docId = $hit->getDocument()->score;
// Another way to get 'title' document field
$title = $hit->getDocument()->title;
}
]]></programlisting>
</sect2>
<sect2 id="zend.search.lucene.best-practice.indexing-performance">
<title>Indexing performance</title>
<para>
Indexing performance is a compromise between used resources, indexing time and index
quality.
</para>
<para>
Index quality is completely determined by number of index segments.
</para>
<para>
Each index segment is entirely independent portion of data. So indexes containing more
segments need more memory and time for searching.
</para>
<para>
Index optimization is a process of merging several segments into a new one. A fully
optimized index contains only one segment.
</para>
<para>
Full index optimization may be performed with the <methodname>optimize()</methodname>
method:
</para>
<programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);
$index->optimize();
]]></programlisting>
<para>
Index optimization works with data streams and doesn't take a lot of memory but does
require processor resources and time.
</para>
<para>
Lucene index segments are not updatable by their nature (the update operation requires
the segment file to be completely rewritten). So adding new document(s) to an index
always generates a new segment. This, in turn, decreases index quality.
</para>
<para>
An index auto-optimization process is performed after each segment generation and
consists of merging partial segments.
</para>
<para>
There are three options to control the behavior of auto-optimization (see <link
linkend="zend.search.lucene.index-creation.optimization">Index optimization</link>
section):
<itemizedlist>
<listitem>
<para>
<emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be
buffered in memory before a new segment is generated and written to the hard
drive.
</para>
</listitem>
<listitem>
<para>
<emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged
by auto-optimization process into a new segment.
</para>
</listitem>
<listitem>
<para>
<emphasis>MergeFactor</emphasis> determines how often auto-optimization is
performed.
</para>
</listitem>
</itemizedlist>
<note>
<para>
All these options are <classname>Zend_Search_Lucene</classname> object
properties- not index properties. They affect only current
<classname>Zend_Search_Lucene</classname> object behavior and may vary for
different scripts.
</para>
</note>
</para>
<para>
<emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one
document per script execution. On the other hand, it's very important for batch
indexing. Greater values increase indexing performance, but also require more memory.
</para>
<para>
There is simply no way to calculate the best value for the
<emphasis>MaxBufferedDocs</emphasis> parameter because it depends on average document
size, the analyzer in use and allowed memory.
</para>
<para>
A good way to find the right value is to perform several tests with the largest document
you expect to be added to the index
<footnote>
<para>
<methodname>memory_get_usage()</methodname> and
<methodname>memory_get_peak_usage()</methodname> may be used to control memory
usage.
</para>
</footnote>
. It's a best practice not to use more than a half of the allowed memory.
</para>
<para>
<emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It
therefore also limits auto-optimization time by guaranteeing that the
<methodname>addDocument()</methodname> method is not executed more than a certain number
of times. This is very important for interactive applications.
</para>
<para>
Lowering the <emphasis>MaxMergeDocs</emphasis> parameter also may improve batch indexing
performance. Index auto-optimization is an iterative process and is performed from
bottom up. Small segments are merged into larger segment, which are in turn merged into
even larger segments and so on. Full index optimization is achieved when only one large
segment file remains.
</para>
<para>
Small segments generally decrease index quality. Many small segments may also trigger
the "Too many open files" error determined by OS limitations
<footnote>
<para>
<classname>Zend_Search_Lucene</classname> keeps each segment file opened to
improve search performance.
</para>
</footnote>.
</para>
<para>
in general, background index optimization should be performed for interactive indexing
mode and <emphasis>MaxMergeDocs</emphasis> shouldn't be too low for batch indexing.
</para>
<para>
<emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values
increase the quality of unoptimized indexes. Larger values increase indexing
performance, but also increase the number of merged segments. This again may trigger the
"Too many open files" error.
</para>
<para>
<emphasis>MergeFactor</emphasis> groups index segments by their size:
<orderedlist>
<listitem>
<para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para>
</listitem>
<listitem>
<para>
Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.
</para>
</listitem>
<listitem>
<para>
Greater than
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but
not greater than
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.
</para>
</listitem>
<listitem><para>...</para></listitem>
</orderedlist>
</para>
<para>
<classname>Zend_Search_Lucene</classname> checks during each
<methodname>addDocument()</methodname> call to see if merging any segments may move the
newly created segment into the next group. If yes, then merging is performed.
</para>
<para>
So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> +
(N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript>
documents.
</para>
<para>
This gives good approximation for the number of segments in the index:
</para>
<para>
<emphasis>NumberOfSegments</emphasis> <= <emphasis>MaxBufferedDocs</emphasis> +
<emphasis>MergeFactor</emphasis>*log
<subscript><emphasis>MergeFactor</emphasis></subscript>
(<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
</para>
<para>
<emphasis>MaxBufferedDocs</emphasis> is determined by allowed memory. This allows for
the appropriate merge factor to get a reasonable number of segments.
</para>
<para>
Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch
indexing performance than <emphasis>MaxMergeDocs</emphasis>. But it's also more
course-grained. So use the estimation above for tuning <emphasis>MergeFactor</emphasis>,
then play with <emphasis>MaxMergeDocs</emphasis> to get best batch indexing performance.
</para>
</sect2>
<sect2 id="zend.search.lucene.best-practice.shutting-down">
<title>Index during Shut Down</title>
<para>
The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time
if any documents were added to the index but not written to a new segment.
</para>
<para>
It also may trigger an auto-optimization process.
</para>
<para>
The index object is automatically closed when it, and all returned QueryHit objects, go
out of scope.
</para>
<para>
If index object is stored in global variable than it's closed only at the end of script
execution
<footnote>
<para>
This also may occur if the index or QueryHit instances are referred to in some
cyclical data structures, because <acronym>PHP</acronym> garbage collects
objects with cyclic references only at the end of script execution.
</para>
</footnote>.
</para>
<para>
<acronym>PHP</acronym> exception processing is also shut down at this moment.
</para>
<para>
It doesn't prevent normal index shutdown process, but may prevent accurate error
diagnostic if any error occurs during shutdown.
</para>
<para>
There are two ways with which you may avoid this problem.
</para>
<para>
The first is to force going out of scope:
</para>
<programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);
...
unset($index);
]]></programlisting>
<para>
And the second is to perform a commit operation before the end of script execution:
</para>
<programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);
$index->commit();
]]></programlisting>
<para>
This possibility is also described in the "<link
linkend="zend.search.lucene.advanced.static">Advanced. Using index as static
property</link>" section.
</para>
</sect2>
<sect2 id="zend.search.lucene.best-practice.unique-id">
<title>Retrieving documents by unique id</title>
<para>
It's a common practice to store some unique document id in the index. Examples include
url, path, or database id.
</para>
<para>
<classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname>
method for retrieving documents containing specified terms.
</para>
<para>
This is more efficient than using the <methodname>find()</methodname> method:
</para>
<programlisting language="php"><![CDATA[
// Retrieving documents with find() method using a query string
$query = $idFieldName . ':' . $docId;
$hits = $index->find($query);
foreach ($hits as $hit) {
$title = $hit->title;
$contents = $hit->contents;
...
}
...
// Retrieving documents with find() method using the query API
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits = $index->find($query);
foreach ($hits as $hit) {
$title = $hit->title;
$contents = $hit->contents;
...
}
...
// Retrieving documents with termDocs() method
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$docIds = $index->termDocs($term);
foreach ($docIds as $id) {
$doc = $index->getDocument($id);
$title = $doc->title;
$contents = $doc->contents;
...
}
]]></programlisting>
</sect2>
<sect2 id="zend.search.lucene.best-practice.memory-usage">
<title>Memory Usage</title>
<para>
<classname>Zend_Search_Lucene</classname> is a relatively memory-intensive module.
</para>
<para>
It uses memory to cache some information and optimize searching and indexing
performance.
</para>
<para>
The memory required differs for different modes.
</para>
<para>
The terms dictionary index is loaded during the search. It's actually each
128<superscript>th</superscript>
<footnote>
<para>
The Lucene file format allows you to configure this number, but
<classname>Zend_Search_Lucene</classname> doesn't expose this in its
<acronym>API</acronym>. Nevertheless you still have the ability to configure
this value if the index is prepared with another Lucene implementation.
</para>
</footnote>
term of the full dictionary.
</para>
<para>
Thus memory usage is increased if you have a high number of unique terms. This may
happen if you use untokenized phrases as a field values or index a large volume of
non-text information.
</para>
<para>
An unoptimized index consists of several segments. It also increases memory usage.
Segments are independent, so each segment contains its own terms dictionary and terms
dictionary index. If an index consists of <emphasis>N</emphasis> segments it may
increase memory usage by <emphasis>N</emphasis> times in worst case. Perform index
optimization to merge all segments into one to avoid such memory consumption.
</para>
<para>
Indexing uses the same memory as searching plus memory for buffering documents. The
amount of memory used may be managed with <emphasis>MaxBufferedDocs</emphasis>
parameter.
</para>
<para>
Index optimization (full or partial) uses stream-style data processing and doesn't
require a lot of memory.
</para>
</sect2>
<sect2 id="zend.search.lucene.best-practice.encoding">
<title>Encoding</title>
<para>
<classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all
strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
</para>
<para>
You shouldn't be concerned with encoding if you work with pure <acronym>ASCII</acronym>
data, but you should be careful if this is not the case.
</para>
<para>
Wrong encoding may cause error notices at the encoding conversion time or loss of data.
</para>
<para>
<classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities
for indexed documents and parsed queries.
</para>
<para>
Encoding may be explicitly specified as an optional parameter of field creation methods:
</para>
<programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title',
$title,
'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
$contents,
'utf-8'));
]]></programlisting>
<para>
This is the best way to avoid ambiguity in the encoding used.
</para>
<para>
If optional encoding parameter is omitted, then the current locale is used. The current
locale may contain character encoding data in addition to the language specification:
</para>
<programlisting language="php"><![CDATA[
setlocale(LC_ALL, 'fr_FR');
...
setlocale(LC_ALL, 'de_DE.iso-8859-1');
...
setlocale(LC_ALL, 'ru_RU.UTF-8');
...
]]></programlisting>
<para>
The same approach is used to set query string encoding.
</para>
<para>
If encoding is not specified, then the current locale is used to determine the encoding.
</para>
<para>
Encoding may be passed as an optional parameter, if the query is parsed explicitly
before search:
</para>
<programlisting language="php"><![CDATA[
$query =
Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
$hits = $index->find($query);
...
]]></programlisting>
<para>
The default encoding may also be specified with
<methodname>setDefaultEncoding()</methodname> method:
</para>
<programlisting language="php"><![CDATA[
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
$hits = $index->find($queryStr);
...
]]></programlisting>
<para>
The empty string implies 'current locale'.
</para>
<para>
If the correct encoding is specified it can be correctly processed by analyzer. The
actual behavior depends on which analyzer is used. See the <link
linkend="zend.search.lucene.charset">Character Set</link> documentation section for
details.
</para>
</sect2>
<sect2 id="zend.search.lucene.best-practice.maintenance">
<title>Index maintenance</title>
<para>
It should be clear that <classname>Zend_Search_Lucene</classname> as well as any other
Lucene implementation does not comprise a "database".
</para>
<para>
Indexes should not be used for data storage. They do not provide partial backup/restore
functionality, journaling, logging, transactions and many other features associated with
database management systems.
</para>
<para>
Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a
consistent state at all times.
</para>
<para>
Index backup and restoration should be performed by copying the contents of the index
folder.
</para>
<para>
If index corruption occurs for any reason, the corrupted index should be restored or
completely rebuilt.
</para>
<para>
So it's a good idea to backup large indexes and store changelogs to perform manual
restoration and roll-forward operations if necessary. This practice dramatically reduces
index restoration time.
</para>
</sect2>
</sect1>
|