======================
How to index documents
======================

Creating an Index object
========================

To create an index in a directory, use ``index.create_in``::

    import os, os.path
    from whoosh import index

    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")

    ix = index.create_in("indexdir", schema)

To open an existing index in a directory, use ``index.open_dir``::

    import whoosh.index as index

    ix = index.open_dir("indexdir")

These convenience functions are shorthand for::

    from whoosh.filedb.filestore import FileStorage
    storage = FileStorage("indexdir")

    # Create an index
    ix = storage.create_index(schema)

    # Open an existing index
    ix = storage.open_index()

The schema you created the index with is pickled and stored with the index.

You can keep multiple indexes in the same directory using the ``indexname``
keyword argument::

    # Using the convenience functions
    ix = index.create_in("indexdir", schema=schema, indexname="usages")
    ix = index.open_dir("indexdir", indexname="usages")

    # Using the Storage object
    ix = storage.create_index(schema, indexname="usages")
    ix = storage.open_index(indexname="usages")


Clearing the index
==================

Calling ``index.create_in`` on a directory with an existing index will clear the
current contents of the index.

To test whether a directory currently contains a valid index, use
``index.exists_in``::

    exists = index.exists_in("indexdir")
    usages_exists = index.exists_in("indexdir", indexname="usages")

(Alternatively you can simply delete the index's files from the directory, e.g.
if you only have one index in the directory, use ``shutil.rmtree`` to remove the
directory and then recreate it.)


Indexing documents
==================

Once you've created an ``Index`` object, you can add documents to the index with an
``IndexWriter`` object. The easiest way to get the ``IndexWriter`` is to call
``Index.writer()``::

    ix = index.open_dir("index")
    writer = ix.writer()

Creating a writer locks the index for writing, so only one thread/process at
a time can have a writer open.

.. note::

    Because opening a writer locks the index for writing, in a multi-threaded
    or multi-process environment your code needs to be aware that opening a
    writer may raise an exception (``whoosh.store.LockError``) if a writer is
    already open. Whoosh includes a couple of example implementations
    (:class:`whoosh.writing.AsyncWriter` and
    :class:`whoosh.writing.BufferedWriter`) of ways to work around the write
    lock.

.. note::

    While the writer is open and during the commit, the index is still
    available for reading. Existing readers are unaffected and new readers can
    open the current index normally. Once the commit is finished, existing
    readers continue to see the previous version of the index (that is, they
    do not automatically see the newly committed changes). New readers will see
    the updated index.

The IndexWriter's ``add_document(**kwargs)`` method accepts keyword arguments
where the field name is mapped to a value::

    writer = ix.writer()
    writer.add_document(title=u"My document", content=u"This is my document!",
                        path=u"/a", tags=u"first short", icon=u"/icons/star.png")
    writer.add_document(title=u"Second try", content=u"This is the second example.",
                        path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
    writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                        path=u"/c", tags=u"short", icon=u"/icons/book.png")
    writer.commit()

You don't have to fill in a value for every field. Whoosh doesn't care if you
leave out a field from a document.

Indexed fields must be passed a unicode value. Fields that are stored but not
indexed (i.e. the ``STORED`` field type) can be passed any pickle-able object.

Whoosh will happily allow you to add documents with identical values, which can
be useful or annoying depending on what you're using the library for::

    writer.add_document(path=u"/a", title=u"A", content=u"Hello there")
    writer.add_document(path=u"/a", title=u"A", content=u"Deja vu!")

This adds two documents to the index with identical path and title fields. See
"updating documents" below for information on the ``update_document`` method, which
uses "unique" fields to replace old documents instead of appending.


Indexing and storing different values for the same field
--------------------------------------------------------

If you have a field that is both indexed and stored, you can index a unicode
value but store a different object if necessary (it's usually not, but sometimes
this is really useful) using a "special" keyword argument ``_stored_<fieldname>``.
The normal value will be analyzed and indexed, but the "stored" value will show
up in the results::

    writer.add_document(title=u"Title to be indexed", _stored_title=u"Stored title")


Finishing adding documents
--------------------------

An ``IndexWriter`` object is kind of like a database transaction. You specify a
bunch of changes to the index, and then "commit" them all at once.

Calling ``commit()`` on the ``IndexWriter`` saves the added documents to the
index::

    writer.commit()

Once your documents are in the index, you can search for them.

If you want to close the writer without committing the changes, call
``cancel()`` instead of ``commit()``::

    writer.cancel()

Keep in mind that while you have a writer open (including a writer you opened
earlier that is still in scope), no other thread or process can get a writer or
modify the index. A writer also keeps several files open. So you should always
remember to call either ``commit()`` or ``cancel()`` when you're done with a
writer object.


Merging segments
================

A Whoosh ``filedb`` index is really a container for one or more "sub-indexes"
called segments. When you add documents to an index, instead of integrating the
new documents with the existing documents (which could potentially be very
expensive, since it involves re-sorting all the indexed terms on disk), Whoosh
creates a new segment next to the existing segment. Then when you search the
index, Whoosh searches both segments individually and merges the results so the
segments appear to be one unified index. (This smart design is copied from
Lucene.)

So, having a few segments is more efficient than rewriting the entire index
every time you add some documents. But searching multiple segments does slow
down searching somewhat, and the more segments you have, the slower it gets. So
Whoosh has an algorithm that runs when you call ``commit()`` that looks for small
segments it can merge together to make fewer, bigger segments.
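
To make the idea concrete, here is a toy model in plain Python. It is not
Whoosh's implementation or its actual merge algorithm: each "segment" is just a
dictionary from term to document IDs, and ``merge_small`` with its ``max_terms``
threshold is an invented policy used only to show that merging changes the
number of segments but not the search results:

```python
def add_batch(segments, docs):
    """Index a batch of {doc_id: text} as a NEW segment; old ones are untouched."""
    segment = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            segment.setdefault(term, set()).add(doc_id)
    segments.append(segment)


def search(segments, term):
    """Search every segment in turn and combine the hits into one result list."""
    hits = []
    for segment in segments:
        hits.extend(sorted(segment.get(term, ())))
    return hits


def merge_small(segments, max_terms):
    """Toy policy: fold all segments with at most max_terms terms into one."""
    small = [s for s in segments if len(s) <= max_terms]
    large = [s for s in segments if len(s) > max_terms]
    if len(small) < 2:
        return segments  # nothing worth merging
    merged = {}
    for segment in small:
        for term, doc_ids in segment.items():
            merged.setdefault(term, set()).update(doc_ids)
    return large + [merged]


segments = []
add_batch(segments, {1: "hello world"})
add_batch(segments, {2: "hello again"})
add_batch(segments, {3: "goodbye world"})

hits_before = search(segments, "hello")
segments_before = len(segments)

segments = merge_small(segments, max_terms=10)

hits_after = search(segments, "hello")
segments_after = len(segments)
```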

To prevent Whoosh from merging segments during a commit, use the ``merge``
keyword argument::

    writer.commit(merge=False)

To merge all segments together, optimizing the index into a single segment,
use the ``optimize`` keyword argument::

    writer.commit(optimize=True)

Since optimizing rewrites all the information in the index, it can be slow on
a large index. It's generally better to rely on Whoosh's merging algorithm than
to optimize all the time.

(The ``Index`` object also has an ``optimize()`` method that lets you optimize the
index (merge all the segments together). It simply creates a writer and calls
``commit(optimize=True)`` on it.)

For more control over segment merging, you can write your own merge policy
function and use it as an argument to the ``commit()`` method. See the
implementation of the ``NO_MERGE``, ``MERGE_SMALL``, and ``OPTIMIZE`` functions
in the ``whoosh.writing`` module.


Deleting documents
==================

You can delete documents using the following methods on an ``IndexWriter``
object. You then need to call ``commit()`` on the writer to save the deletions
to disk.

``delete_document(docnum)``

    Low-level method to delete a document by its internal document number.

``is_deleted(docnum)``

    Low-level method, returns ``True`` if the document with the given internal
    number is deleted.

``delete_by_term(fieldname, termtext)``

    Deletes any documents where the given (indexed) field contains the given
    term. This is mostly useful for ``ID`` or ``KEYWORD`` fields.

``delete_by_query(query)``

    Deletes any documents that match the given query.

::

    # Delete document by its path -- this field must be indexed
    writer = ix.writer()
    writer.delete_by_term('path', u'/a/b/c')
    # Save the deletion to disk
    writer.commit()

In the ``filedb`` backend, "deleting" a document simply adds the document number
to a list of deleted documents stored with the index. When you search the index,
it knows not to return deleted documents in the results. However, the document's
contents are still stored in the index, and certain statistics (such as term
document frequencies) are not updated, until you merge the segments containing
deleted documents (see merging above). (This is because removing the information
immediately from the index would essentially involve rewriting the entire
index on disk, which would be very inefficient.)
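
The following toy model illustrates the idea in plain Python. ``ToySegment`` and
its methods are hypothetical names, not Whoosh classes, and unlike a real merge
the sketch keeps the original document numbers:

```python
class ToySegment:
    """Toy stand-in for a segment: documents plus a set of deleted numbers."""

    def __init__(self, docs):
        self.docs = dict(docs)   # docnum -> text
        self.deleted = set()     # docnums marked as deleted

    def delete_document(self, docnum):
        # "Deleting" only records the number; the contents stay where they are
        self.deleted.add(docnum)

    def is_deleted(self, docnum):
        return docnum in self.deleted

    def search(self, word):
        # Searches skip documents that are marked as deleted
        return [n for n, text in sorted(self.docs.items())
                if word in text.split() and n not in self.deleted]

    def doc_frequency(self, word):
        # Statistics are computed over everything stored, deleted or not
        return sum(1 for text in self.docs.values() if word in text.split())

    def merged(self):
        # Merging rewrites the segment and finally drops the deleted documents
        live = {n: t for n, t in self.docs.items() if n not in self.deleted}
        return ToySegment(live)


segment = ToySegment({0: "hello world", 1: "hello there"})
segment.delete_document(0)

hits = segment.search("hello")
stale_frequency = segment.doc_frequency("hello")
fresh_frequency = segment.merged().doc_frequency("hello")
```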


Updating documents
==================

If you want to "replace" (re-index) a document, you can delete the old document
using one of the ``delete_*`` methods on ``IndexWriter``, then use
``IndexWriter.add_document`` to add the new version. Or, you can use
``IndexWriter.update_document`` to do this in one step.

For ``update_document`` to work, you must have marked at least one of the fields
in the schema as "unique". Whoosh will then use the contents of the "unique"
field(s) to search for documents to delete::

    from whoosh import index
    from whoosh.fields import Schema, ID, TEXT

    schema = Schema(path=ID(unique=True), content=TEXT)

    # The "index" directory must already exist
    ix = index.create_in("index", schema)
    writer = ix.writer()
    writer.add_document(path=u"/a", content=u"The first document")
    writer.add_document(path=u"/b", content=u"The second document")
    writer.commit()

    writer = ix.writer()
    # Because "path" is marked as unique, calling update_document with path="/a"
    # will delete any existing documents where the "path" field contains "/a".
    writer.update_document(path=u"/a", content=u"Replacement for the first document")
    writer.commit()

The "unique" field(s) must be indexed.

If no existing document matches the unique fields of the document you're
updating, ``update_document`` acts just like ``add_document``.

"Unique" fields and ``update_document`` are simply convenient shortcuts for deleting
and adding. Whoosh has no inherent concept of a unique identifier, and in no way
enforces uniqueness when you use ``add_document``.


Incremental indexing
====================

When you're indexing a collection of documents, you'll often want two code
paths: one to index all the documents from scratch, and one to only update the
documents that have changed (leaving aside web applications where you need to
add/update documents according to user actions).

Indexing everything from scratch is pretty easy. Here's a simple example::

    import io
    import os.path
    from whoosh import index
    from whoosh.fields import Schema, ID, TEXT

    def clean_index(dirname):
        # Always create the index from scratch
        ix = index.create_in(dirname, schema=get_schema())
        writer = ix.writer()

        # Assume we have a function that gathers the filenames of the
        # documents to be indexed
        for path in my_docs():
            add_doc(writer, path)

        writer.commit()


    def get_schema():
        return Schema(path=ID(unique=True, stored=True), content=TEXT)


    def add_doc(writer, path):
        # Indexed fields need unicode text, so decode the file while reading
        # (UTF-8 is an assumption here; use your documents' actual encoding)
        with io.open(path, encoding="utf-8") as fileobj:
            content = fileobj.read()
        writer.add_document(path=path, content=content)

Now, for a small collection of documents, indexing from scratch every time might
actually be fast enough. But for large collections, you'll want to have the
script only re-index the documents that have changed.

To start we'll need to store each document's last-modified time, so we can check
if the file has changed. In this example, we'll just use the mtime for
simplicity::

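    import io
    import os.path
    from whoosh.fields import Schema, ID, STORED, TEXT

    def get_schema():
        return Schema(path=ID(unique=True, stored=True), time=STORED, content=TEXT)

    def add_doc(writer, path):
        # UTF-8 is an assumption here; use your documents' actual encoding
        with io.open(path, encoding="utf-8") as fileobj:
            content = fileobj.read()
        modtime = os.path.getmtime(path)
        writer.add_document(path=path, content=content, time=modtime)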

Now we can modify the script to allow either "clean" (from scratch) or
incremental indexing::

    def index_my_docs(dirname, clean=False):
        if clean:
            clean_index(dirname)
        else:
            incremental_index(dirname)


    def incremental_index(dirname):
        ix = index.open_dir(dirname)

        # The set of all paths in the index
        indexed_paths = set()
        # The set of all paths we need to re-index
        to_index = set()

        with ix.searcher() as searcher:
            writer = ix.writer()

            # Loop over the stored fields in the index
            for fields in searcher.all_stored_fields():
                indexed_path = fields['path']
                indexed_paths.add(indexed_path)

                if not os.path.exists(indexed_path):
                    # This file was deleted since it was indexed
                    writer.delete_by_term('path', indexed_path)

                else:
                    # Check if this file was changed since it
                    # was indexed
                    indexed_time = fields['time']
                    mtime = os.path.getmtime(indexed_path)
                    if mtime > indexed_time:
                        # The file has changed, delete it and add it to the
                        # list of files to reindex
                        writer.delete_by_term('path', indexed_path)
                        to_index.add(indexed_path)

            # Loop over the files in the filesystem
            # Assume we have a function that gathers the filenames of the
            # documents to be indexed
            for path in my_docs():
                if path in to_index or path not in indexed_paths:
                    # This is either a file that's changed, or a new file
                    # that wasn't indexed before. So index it!
                    add_doc(writer, path)

            writer.commit()

The ``incremental_index`` function:

* Loops through all the paths that are currently indexed.

  * If any of the files no longer exist, delete the corresponding document from
    the index.

  * If the file still exists, but has been modified, delete the stale document
    and add the path to the set of paths to be re-indexed.

  * If the file exists, whether it's been modified or not, add it to the list of
    all indexed paths.

* Loops through all the paths of the files on disk.

  * If a path is not in the set of all indexed paths, the file is new and we
    need to index it.

  * If a path is in the set of paths to re-index, we need to index it.

  * Otherwise, we can skip indexing the file.


Clearing the index
==================

In some cases you may want to re-index from scratch. To clear the index without
disrupting any existing readers::

    from whoosh import writing

    with myindex.writer() as mywriter:
        # You can optionally add documents to the writer here
        # e.g. mywriter.add_document(...)

        # Using mergetype=CLEAR clears all existing segments, so the index
        # will only contain the documents you've added to this writer
        mywriter.mergetype = writing.CLEAR

Or, if you don't use the writer as a context manager and call ``commit()``
directly, do it like this::

    mywriter = myindex.writer()
    # ...
    mywriter.commit(mergetype=writing.CLEAR)

.. note::
    If you don't need to worry about existing readers, a more efficient method
    is to simply delete the contents of the index directory and start over.