File: nested.rst

package info (click to toggle)
python-whoosh 2.7.4%2Bgit6-g9134ad92-4
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 3,648 kB
  • sloc: python: 38,517; makefile: 118
file content (238 lines) | stat: -rw-r--r-- 10,031 bytes parent folder | download | duplicates (7)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
===========================================
Indexing and searching document hierarchies
===========================================

Overview
========

Whoosh's full-text index is essentially a flat database of documents. However,
Whoosh supports two techniques for simulating the indexing and querying of
hierarchical documents, that is, sets of documents that form a parent-child
hierarchy, such as "Chapter - Section - Paragraph" or
"Module - Class - Method".

You can specify parent-child relationships *at indexing time*, by grouping
documents in the same hierarchy, and then use the
:class:`whoosh.query.NestedParent` and/or :class:`whoosh.query.NestedChildren`
to find parents based on their children or vice-versa.

Alternatively, you can use *query time joins*, essentially like external key
joins in a database, where you perform one search to find a relevant document,
then use a stored value on that document (for example, a ``parent`` field) to
look up another document.

Both methods have pros and cons.


Using nested document indexing
==============================

Indexing
--------

This method works by indexing a "parent" document and all its "child" documents
*as a "group"* so they are guaranteed to end up in the same segment. You can
use the context manager returned by ``IndexWriter.group()`` to group
documents::

    with ix.writer() as w:
        with w.group():
            w.add_document(kind="class", name="Index")
            w.add_document(kind="method", name="add document")
            w.add_document(kind="method", name="add reader")
            w.add_document(kind="method", name="close")
        with w.group():
            w.add_document(kind="class", name="Accumulator")
            w.add_document(kind="method", name="add")
            w.add_document(kind="method", name="get result")
        with w.group():
            w.add_document(kind="class", name="Calculator")
            w.add_document(kind="method", name="add")
            w.add_document(kind="method", name="add all")
            w.add_document(kind="method", name="add some")
            w.add_document(kind="method", name="multiply")
            w.add_document(kind="method", name="close")
        with w.group():
            w.add_document(kind="class", name="Deleter")
            w.add_document(kind="method", name="add")
            w.add_document(kind="method", name="delete")

Alternatively you can use the ``start_group()`` and ``end_group()`` methods::

    with ix.writer() as w:
        w.start_group()
        w.add_document(kind="class", name="Index")
        w.add_document(kind="method", name="add document")
        w.add_document(kind="method", name="add reader")
        w.add_document(kind="method", name="close")
        w.end_group()

Each level of the hierarchy should have a query that distinguishes it from
other levels (for example, in the above index, you can use ``kind:class`` or
``kind:method`` to match different levels of the hierarchy).

Once you've indexed the hierarchy of documents, you can use two query types to
find parents based on children or vice-versa.

(There is currently no support in the default query parser for nested queries.)


NestedParent query
------------------

The :class:`whoosh.query.NestedParent` query type lets you specify a query for
child documents, but have the query return an "ancestor" document from higher
in the hierarchy::

    # First, we need a query that matches all the documents in the "parent"
    # level we want of the hierarchy
    all_parents = query.Term("kind", "class")

    # Then, we need a query that matches the children we want to find
    wanted_kids = query.Term("name", "close")

    # Now we can make a query that will match documents where "name" is
    # "close", but the query will return the "parent" documents of the matching
    # children
    q = query.NestedParent(all_parents, wanted_kids)
    # results = Index, Calculator

Note that in a hierarchy with more than two levels, you can specify a "parents"
query that matches any level of the hierarchy, so you can return the top-level
ancestors of the matching children, or the second level, third level, etc.

The query works by first building a bit vector representing which documents are
"parents"::

     Index
     |      Calculator
     |      |
     1000100100000100
         |        |
         |        Deleter
         Accumulator

Then for each match of the "child" query, it calculates the previous parent
from the bit vector and returns it as a match (it only returns each parent once
no matter how many children match). This parent lookup is very efficient::

     1000100100000100
        |
     |<-+ close


NestedChildren query
--------------------

The opposite of ``NestedParent`` is :class:`whoosh.query.NestedChildren`. This
query lets you match parents but return their children. This is useful, for
example, to search for an album title and return the songs in the album::

    # Query that matches all documents in the "parent" level we want to match
    # at
    all_parents = query.Term("kind", "album")

    # Parent documents we want to match
    wanted_parents = query.Term("album_title", "heaven")

    # Now we can make a query that will match parent documents where "album_title"
    # contains "heaven", but the query will return the "child" documents of the
    # matching parents
    q1 = query.NestedChildren(all_parents, wanted_parents)

You can then combine that query with an ``AND`` clause, for example to find
songs with "hell" in the song title that occur on albums with "heaven" in the
album title::

    q2 = query.And([q1, query.Term("song_title", "hell")])


Deleting and updating hierarchical documents
--------------------------------------------

The drawback of the index-time method is *updating and deleting*. Because the
implementation of the queries depends on the parent and child documents being
contiguous in the segment, you can't update/delete just one child document.
You can only update/delete an entire top-level document at once (for example,
if your hierarchy is "Chapter - Section - Paragraph", you can only update or
delete entire chapters, not a section or paragraph). If the top-level of the
hierarchy represents very large blocks of text, this can involve a lot of
deleting and reindexing.

Currently ``Writer.update_document()`` does not automatically work with nested
documents. You must manually delete and re-add document groups to update them.

To delete nested document groups, use the ``Writer.delete_by_query()``
method with a ``NestedParent`` query::

    # Delete the "Accumulator" class
    all_parents = query.Term("kind", "class")
    to_delete = query.Term("name", "Accumulator")
    q = query.NestedParent(all_parents, to_delete)
    with myindex.writer() as w:
        w.delete_by_query(q)


Using query-time joins
======================

A second technique for simulating hierarchical documents in Whoosh involves
using a stored field on each document to point to its parent, and then using
the value of that field at query time to find parents and children.

For example, if we index a hierarchy of classes and methods using pointers
to parents instead of nesting::

    # Store a pointer to the parent on each "method" document
    with ix.writer() as w:
        w.add_document(kind="class", c_name="Index", docstring="...")
        w.add_document(kind="method", m_name="add document", parent="Index")
        w.add_document(kind="method", m_name="add reader", parent="Index")
        w.add_document(kind="method", m_name="close", parent="Index")

        w.add_document(kind="class", c_name="Accumulator", docstring="...")
        w.add_document(kind="method", m_name="add", parent="Accumulator")
        w.add_document(kind="method", m_name="get result", parent="Accumulator")

        w.add_document(kind="class", c_name="Calculator", docstring="...")
        w.add_document(kind="method", m_name="add", parent="Calculator")
        w.add_document(kind="method", m_name="add all", parent="Calculator")
        w.add_document(kind="method", m_name="add some", parent="Calculator")
        w.add_document(kind="method", m_name="multiply", parent="Calculator")
        w.add_document(kind="method", m_name="close", parent="Calculator")

        w.add_document(kind="class", c_name="Deleter", docstring="...")
        w.add_document(kind="method", m_name="add", parent="Deleter")
        w.add_document(kind="method", m_name="delete", parent="Deleter")

    # Now do manual joins at query time
    with ix.searcher() as s:
        # Tip: Searcher.document() and Searcher.documents() let you look up
        # documents by field values more easily than using Searcher.search()

        # Children to parents:
        # Print the docstrings of classes on which "close" methods occur
        for child_doc in s.documents(m_name="close"):
            # Use the stored value of the "parent" field to look up the parent
            # document
            parent_doc = s.document(c_name=child_doc["parent"])
            # Print the parent document's stored docstring field
            print(parent_doc["docstring"])

        # Parents to children:
        # Find classes with "big" in the docstring and print their methods
        q = query.Term("kind", "class") & query.Term("docstring", "big")
        for hit in s.search(q, limit=None):
            print("Class name=", hit["c_name"], "methods:")
            for child_doc in s.documents(parent=hit["c_name"]):
                print("  Method name=", child_doc["m_name"])

This technique is more flexible than index-time nesting in that you can
delete/update individual documents in the hierarchy piece by piece, although it
doesn't support finding different parent levels as easily. It is also slower
than index-time nesting (potentially much slower), since you must perform
additional searches for each found document.

Future versions of Whoosh may include "join" queries to make this process more
efficient (or at least more automatic).