.. Copyright (C) 2003 James Aylett
.. Copyright (C) 2004-2022 Olly Betts
.. Copyright (C) 2007,2008,2010 Richard Boulton

===========================
Python3 bindings for Xapian
===========================

.. contents:: Table of contents

The Python3 bindings for Xapian are packaged in the ``xapian`` module,
so to use them you need to add this to your code::

  import xapian

Since Xapian 1.4.22 these bindings require Python >= 3.3.  If you still need
support for older Python versions, Xapian <= 1.4.21 supports Python 3.2.
If you still need Python2 support, there are separate bindings for that.

The Python API largely follows the C++ API - the differences and
additions are noted below.

Strings
=======

The Xapian C++ API is largely agnostic about character encoding, and uses the
`std::string` type as an opaque container for a sequence of bytes.
In places where the bytes represent text (for example, in the
`Stem`, `QueryParser` and `TermGenerator` classes), UTF-8 encoding is used.  In
order to wrap this for Python, `std::string` is mapped to/from the Python
`bytes` type.

As a convenience, you can also pass Python
`str` objects as parameters where this is appropriate, which will be
converted to UTF-8 encoded text.  Where `std::string` is
returned, it's always mapped to `bytes` in Python, which you can
convert to a Python `str` by calling `.decode('utf-8')`
on it like so::

  for i in doc.termlist():
    print(i.term.decode('utf-8'))

Therefore, in order to avoid issues with character encodings, you should
always pass text data to Xapian as Python `str` objects, or as UTF-8 encoded
`bytes` objects.

There is, however, no requirement for byte strings passed into
Xapian to be valid UTF-8 encoded strings, unless they are being passed to a
text processing routine (such as the query parser, or the stemming
algorithms).  For example, it is perfectly valid to pass arbitrary binary
data to the ``xapian.Document.set_data()`` method.
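
As a rough illustration of the conversions described above, the sketch below
uses only plain Python.  The helper name `term_to_str` is hypothetical and is
not part of the `xapian` module.

```python
def term_to_str(term):
    """Return ``term`` as a str, decoding UTF-8 if it is a bytes object.

    Hypothetical helper for illustration; not part of the xapian module.
    """
    if isinstance(term, bytes):
        return term.decode('utf-8')
    return term


# Text passed as str is converted to UTF-8, so 'cafe' with an acute accent
# on the final letter becomes five bytes.
encoded = 'caf\u00e9'.encode('utf-8')
print(encoded)

# Values returned by Xapian are bytes, so decode them before treating
# them as text.
decoded = term_to_str(encoded)
print(decoded == 'caf\u00e9')

# Arbitrary binary data need not be valid UTF-8, so keep it as bytes.
binary = b'\xff\xfe\x00\x01'
try:
    binary.decode('utf-8')
    is_text = True
except UnicodeDecodeError:
    is_text = False
print(is_text)
```

Data such as `binary` above is still acceptable to methods like
``xapian.Document.set_data()``; it just should not be decoded as text.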

Unicode
=======

Unicode text is most often in NFC already, but if you need to normalise text
before passing it to Xapian, the standard Python module ``unicodedata``
provides support for normalising Unicode.  You probably want the ``NFKC``
normalisation scheme, so for example normalising a query string prior to
parsing it would look something like this:

::

    import unicodedata
    import xapian

    def parse_query(query_string):
        query_string = unicodedata.normalize('NFKC', query_string)
        qp = xapian.QueryParser()
        return qp.parse_query(query_string)
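
To make the difference between the normalisation forms concrete, here is a
small sketch using only the standard ``unicodedata`` module.  The sample
strings are chosen purely for illustration.

```python
import unicodedata

# 'e' followed by a combining acute accent: two code points.
decomposed = 'e\u0301'
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize('NFC', decomposed)

# NFKC additionally folds compatibility characters, such as the
# "fi" ligature U+FB01, into their plain equivalents.
ligature = '\ufb01le'
folded = unicodedata.normalize('NFKC', ligature)

print(len(decomposed), len(composed))
print(folded == 'file')
```

NFC alone would leave the ligature untouched, which is one reason NFKC tends
to suit search queries better.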

Exceptions
==========

Xapian-specific exceptions are subclasses of the :xapian-class:`Error`
class, so you can trap all Xapian-specific exceptions like so::

    try:
        do_something_with_xapian()
    except xapian.Error as e:
        print(str(e))

`xapian.Error` is a subclass of the standard Python `Exception` class, so it
will also be caught by `except Exception`.

Iterators
=========

The iterator classes in the Xapian C++ API are wrapped in a pythonic style.
The following are supported (where marked as "default iterator", it means
`__iter__()` does the right thing so you can for instance use
`for term in document` to iterate over terms in a Document object):

.. table:: Python iterators

+----------------------+------------------------------------------+---------------------------------------+----------------------+
| Class                | Python Method                            | Equivalent C++ Method                 | Python iterator type |
+======================+==========================================+=======================================+======================+
|``MSet``              | default iterator                         | ``begin()``                           | ``MSetIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``ESet``              | default iterator                         | ``begin()``                           | ``ESetIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Enquire``           | ``matching_terms()``                     | ``get_matching_terms_begin()``        | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Query``             | default iterator                         | ``get_terms_begin()``                 | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``allterms()`` (also as default iterator)| ``allterms_begin()``                  | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``postlist(term)``                       | ``postlist_begin(term)``              | ``PostingIter``      |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``termlist(docid)``                      | ``termlist_begin(docid)``             | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``positionlist(docid, term)``            | ``positionlist_begin(docid, term)``   | ``PositionIter``     |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``metadata_keys(prefix)``                | ``metadata_keys_begin(prefix)``       | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``spellings()``                          | ``spellings_begin()``                 | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``synonyms(term)``                       | ``synonyms_begin(term)``              | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``synonym_keys(prefix)``                 | ``synonym_keys_begin(prefix)``        | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Document``          | ``values()``                             | ``values_begin()``                    | ``ValueIter``        |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Document``          | ``termlist()`` (also as default iterator)| ``termlist_begin()``                  | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``QueryParser``       | ``stoplist()``                           | ``stoplist_begin()``                  | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``QueryParser``       | ``unstemlist(term)``                     | ``unstem_begin(term)``                | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``ValueCountMatchSpy``| ``values()``                             | ``values_begin()``                    | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``ValueCountMatchSpy``| ``top_values()``                         | ``top_values_begin()``                | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+


The pythonic iterators generally return Python objects, with properties
available as attribute values, with lazy evaluation where appropriate.  An
exception is `PositionIter` (as returned by `Database.positionlist` for
example), which returns an integer.

The lazy evaluation is mainly transparent, but does become visible in one
situation: if you keep an object returned by an iterator, without evaluating
its properties to force the lazy evaluation to happen, and then move the
iterator forward, the object may no longer be able to efficiently perform the
lazy evaluation.  In this situation, an exception will be raised indicating
that the information requested wasn't available.  This will only happen for a
few of the properties - most are either not evaluated lazily (because the
underlying Xapian implementation doesn't evaluate them lazily, so there's no
advantage in lazy evaluation), or can be accessed even after the iterator has
moved.  The simplest work around is to evaluate any properties you wish to use
which are affected by this before moving the iterator.  The complete set of
iterator properties affected by this is:

 * `Database.allterms` (also accessible as `Database.__iter__`): `termfreq`
 * `Database.termlist`: `termfreq` and `positer`
 * `Document.termlist` (also accessible as `Document.__iter__`): `termfreq` and `positer`
 * `Database.postlist`: `positer`
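
The workaround described above might be sketched as follows.  The helper
name `snapshot_terms` is hypothetical, and the stand-in items exist only so
that the sketch runs without a database; real items would come from an
iterator such as `Database.allterms()`.

```python
from types import SimpleNamespace


def snapshot_terms(term_iter):
    """Eagerly copy (term, termfreq) pairs out of a term iterator.

    Hypothetical helper: reading ``termfreq`` inside the loop forces the
    lazy evaluation while the iterator still points at the item.
    """
    return [(item.term, item.termfreq) for item in term_iter]


# Stand-in items with the same attribute names as the real iterator items.
fake_items = iter([SimpleNamespace(term=b'apple', termfreq=3),
                   SimpleNamespace(term=b'pear', termfreq=1)])
snapshot = snapshot_terms(fake_items)
print(snapshot)
```

The resulting list holds plain Python values, so it remains usable after the
iterator has moved on.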

MSet
====

MSet objects have some additional methods to simplify access (these
work using the C++ array dereferencing):

.. table:: MSet additional methods

+--------------------------------+--------------------------------------+
| Method name                    |            Explanation               |
+================================+======================================+
| ``get_hit(i)``                 |  returns ``MSetItem`` at index ``i`` |
+--------------------------------+--------------------------------------+
| ``get_document_percentage(i)`` | ``convert_to_percent(get_hit(i))``   |
+--------------------------------+--------------------------------------+
| ``get_document(i)``            | ``get_hit(i).get_document()``        |
+--------------------------------+--------------------------------------+
| ``get_docid(i)``               | ``get_hit(i).get_docid()``           |
+--------------------------------+--------------------------------------+
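
As a minimal sketch of how these methods might be combined, the hypothetical
helper `summarise_hits` below collects a docid and percentage for each hit.
It assumes only `size()` from the C++ API plus the methods in the table; the
`FakeMSet` class is a stand-in so that the sketch runs without a database.

```python
def summarise_hits(mset):
    """Return (docid, percentage) pairs for every hit in ``mset``.

    Hypothetical helper for illustration; not part of the xapian module.
    """
    return [(mset.get_docid(i), mset.get_document_percentage(i))
            for i in range(mset.size())]


class FakeMSet:
    """Stand-in exposing the same method names as a real MSet."""

    def __init__(self, hits):
        self._hits = hits  # list of (docid, percentage) pairs

    def size(self):
        return len(self._hits)

    def get_docid(self, i):
        return self._hits[i][0]

    def get_document_percentage(self, i):
        return self._hits[i][1]


summary = summarise_hits(FakeMSet([(12, 100), (7, 64)]))
print(summary)
```

With a real `MSet` returned by an enquiry, the same helper would be called
as `summarise_hits(mset)`.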


Two MSet objects are equal if they have the same number and maximum possible
number of members, and if every document member of the first MSet exists at the
same index in the second MSet, with the same weight.

Non-Class Functions
===================

The C++ API contains a few non-class functions (the Database factory
functions, and some functions reporting version information), which are
wrapped like so for Python 3:

 * `Xapian::version_string()` is wrapped as `xapian.version_string()`
 * `Xapian::major_version()` is wrapped as `xapian.major_version()`
 * `Xapian::minor_version()` is wrapped as `xapian.minor_version()`
 * `Xapian::revision()` is wrapped as `xapian.revision()`

 * `Xapian::Remote::open()` is wrapped as `xapian.remote_open()` (both
   the TCP and "program" versions are wrapped - the SWIG wrapper checks the parameter list to
   decide which to call).
 * `Xapian::Remote::open_writable()` is wrapped as `xapian.remote_open_writable()` (both
   the TCP and "program" versions are wrapped - the SWIG wrapper checks the parameter list to
   decide which to call).

The following were deprecated in the C++ API before the Python 3 bindings saw
a stable release, so are not wrapped for Python 3:

 * `Xapian::Auto::open_stub()`
 * `Xapian::Chert::open()`
 * `Xapian::InMemory::open()`

The version of the bindings in use is available as `xapian.__version__` (as
recommended by PEP 396).  This may not be the same as `xapian.version_string()`
as the latter is the version of xapian-core (the C++ library) in use.

Query
=====

In C++ there's a Xapian::Query constructor which takes a query operator and
start/end iterators specifying a number of terms or queries, plus an optional
parameter.  In Python, this is wrapped to accept any Python sequence (for
example a list or tuple) of terms or queries (or even a mixture of terms
and queries).  For example:


::

  subq = xapian.Query(xapian.Query.OP_AND, "hello", "world")
  q = xapian.Query(xapian.Query.OP_AND, [subq, "foo", xapian.Query("bar", 2)])


MatchAll and MatchNothing
-------------------------

These are wrapped as `xapian.Query.MatchAll` and
`xapian.Query.MatchNothing`.


MatchDecider
============

Custom MatchDeciders can be created in Python - subclass
`xapian.MatchDecider`, ensure you call the super-constructor from your
constructor, and define a `__call__` method that will do the work. The
simplest example (which does nothing useful) would be as follows:

::

  class mymatchdecider(xapian.MatchDecider):
    def __init__(self):
      xapian.MatchDecider.__init__(self)

    def __call__(self, doc):
      # Accept all documents.
      return True

ValueRangeProcessor
===================

The `ValueRangeProcessor` class is deprecated and will be removed in Xapian
2.0.0.  The replacement is `RangeProcessor` (added in Xapian 1.3.6).  Use
`RangeProcessor` instead in new code - it's more flexible because it
can return an arbitrary `Query` object.  This section documenting
`ValueRangeProcessor` is here to aid migrating existing uses.

The ValueRangeProcessor class (and its subclasses) provide an operator() method
(which is exposed in Python as a __call__() method, making the class instances
into callables).  This method checks whether the beginning and end of a range are
in a format understood by the ValueRangeProcessor, and if so, converts the
beginning and end into strings which sort appropriately.  ValueRangeProcessors
can be defined in Python (and then passed to the QueryParser), or there are
several default built-in ones which can be used.

In C++ the operator() method takes two std::string arguments by reference,
which the subclassed method can modify, and returns a value slot number.
In Python, we wrap this by passing two `bytes` objects to
__call__ and having it return a tuple of (value_slot, modified_begin,
modified_end).  For example::

  vrp = xapian.NumberValueRangeProcessor(0, '$', True)
  a = '$10'
  b = '20'
  slot, a, b = vrp(a, b)

You can implement your own ValueRangeProcessor in Python.  The Python
implementation should override the __call__() method with its own
implementation, which returns a tuple as above.  For example::

  class MyVRP(xapian.ValueRangeProcessor):
    def __init__(self):
      xapian.ValueRangeProcessor.__init__(self)
    def __call__(self, begin, end):
      return (7, "A"+begin, "B"+end)

The equivalent `RangeProcessor` subclass to `MyVRP` would look like this:

::

  class MyRP(xapian.RangeProcessor):
      def __init__(self):
          xapian.RangeProcessor.__init__(self)
      def __call__(self, begin, end):
          return xapian.Query(xapian.Query.OP_VALUE_RANGE, "A"+begin, "B"+end)

Return `xapian.Query(xapian.Query.OP_INVALID)` to signal that you don't want to
handle an offered range.

Apache and mod_python/mod_wsgi
==============================

Prior to Xapian 1.3.0, you had to tell mod_python and mod_wsgi to run
applications which use Xapian in the main interpreter.  Xapian 1.3.0 no
longer uses the simplified GIL state API, and so this restriction no
longer applies.

Test Suite
==========

The Python bindings come with a test suite, consisting of two test files:
`smoketest.py` and `pythontest.py`. These are run by the `make check` command,
or may be run manually.  By default, they will display the names of any tests
which failed, and then display a count of the tests which were run and of
those which failed.
The verbosity may be increased by setting the `VERBOSE` environment variable,
for example::

 make check VERBOSE=1

Setting VERBOSE to 1 will display detailed information about failures, and a
value of 2 will display further information about the progress of tests.