.. Copyright (C) 2003 James Aylett
.. Copyright (C) 2004-2022 Olly Betts
.. Copyright (C) 2007,2008,2010 Richard Boulton
===========================
Python3 bindings for Xapian
===========================
.. contents:: Table of contents
The Python3 bindings for Xapian are packaged in the ``xapian`` module,
so to use them you need to add this to your code::

    import xapian

Since Xapian 1.4.22 these bindings require Python >= 3.3. If you still need
support for older Python versions, Xapian <= 1.4.21 supports Python 3.2.
If you still need Python 2 support, there are separate bindings for that.
The Python API largely follows the C++ API - the differences and
additions are noted below.
Strings
=======
The Xapian C++ API is largely agnostic about character encoding, and uses the
`std::string` type as an opaque container for a sequence of bytes.
In places where the bytes represent text (for example, in the
`Stem`, `QueryParser` and `TermGenerator` classes), UTF-8 encoding is used. In
order to wrap this for Python, `std::string` is mapped to/from the Python
`bytes` type.
As a convenience, you can also pass Python
`str` objects as parameters where this is appropriate, which will be
converted to UTF-8 encoded text. Where `std::string` is
returned, it's always mapped to `bytes` in Python, which you can
convert to a Python `str` by calling `.decode('utf-8')`
on it like so::

    for i in doc.termlist():
        print(i.term.decode('utf-8'))

Therefore, in order to avoid issues with character encodings, you should
always pass text data to Xapian as Python `str` objects, or as UTF-8 encoded
`bytes`.
There is, however, no requirement for byte strings passed into
Xapian to be valid UTF-8 encoded strings, unless they are being passed to a
text processing routine (such as the query parser, or the stemming
algorithms). For example, it is perfectly valid to pass arbitrary binary
data to the ``xapian.Document.set_data()`` method.
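
The conversion rules above can be illustrated without touching a database.
The following is a minimal sketch in plain Python: `to_text` is a hypothetical
helper (it is not part of the bindings) which decodes `bytes` values such as
those returned by Xapian, and the `encode` call mirrors what happens when a
`str` is passed in.

```python
def to_text(value, errors="strict"):
    """Decode a bytes value (as returned by the bindings) to str.

    Hypothetical helper for illustration; str values pass through unchanged.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8", errors)
    return value


# Passing a str to Xapian is equivalent to passing its UTF-8 encoding.
term = "caf\u00e9"
encoded = term.encode("utf-8")

print(encoded)           # the bytes form, as Xapian would hand it back
print(to_text(encoded))  # decoded to a Python str again
```

Binary payloads, such as data stored with ``xapian.Document.set_data()``, need
not be valid UTF-8, so for those either skip the decoding step or pass
`errors="replace"`.
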
Unicode
=======
Unicode text is most often in NFC already, but if you need to normalise text
before passing it to Xapian, the standard Python module ``unicodedata``
provides support for normalising Unicode: you probably want the ``NFKC``
normalisation scheme, so for example normalising a query string prior to
parsing it would look something like this:

::

    import unicodedata

    import xapian

    def parse_query(query_string):
        query_string = unicodedata.normalize('NFKC', query_string)
        qp = xapian.QueryParser()
        query_obj = qp.parse_query(query_string)
        return query_obj

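
To see what NFKC normalisation does to a string before it reaches the query
parser, here is a small sketch which uses only the standard library.
`normalise_query` is an illustrative name and is not part of Xapian.

```python
import unicodedata


def normalise_query(query_string):
    """Return query_string in NFKC form (illustrative helper)."""
    return unicodedata.normalize("NFKC", query_string)


# Compatibility characters fold to their plain equivalents.
ligature = normalise_query("\ufb01le")       # U+FB01 LATIN SMALL LIGATURE FI
fullwidth = normalise_query("\uff38\uff41")  # fullwidth "X" and "a"
# Decomposed sequences are composed into a single code point.
composed = normalise_query("e\u0301")        # "e" + COMBINING ACUTE ACCENT

print(ligature, fullwidth, composed)
```

If you normalise queries this way, you probably also want to apply the same
normalisation to text at index time, so that indexed terms and query terms
match.
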
Exceptions
==========
Xapian-specific exceptions are subclasses of the :xapian-class:`Error`
class, so you can trap all Xapian-specific exceptions like so::

    try:
        do_something_with_xapian()
    except xapian.Error as e:
        print(str(e))

`xapian.Error` is a subclass of the built-in Python `Exception` class, so it
will also be caught by `except Exception`.
Iterators
=========
The iterator classes in the Xapian C++ API are wrapped in a pythonic style.
The following are supported (where marked as "default iterator", it means
`__iter__()` does the right thing so you can for instance use
`for term in document` to iterate over terms in a Document object):

.. table:: Python iterators

  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  | Class                | Python Method                            | Equivalent C++ Method                 | Python iterator type |
  +======================+==========================================+=======================================+======================+
  |``MSet``              | default iterator                         | ``begin()``                           | ``MSetIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``ESet``              | default iterator                         | ``begin()``                           | ``ESetIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Enquire``           | ``matching_terms()``                     | ``get_matching_terms_begin()``        | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Query``             | default iterator                         | ``get_terms_begin()``                 | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``allterms()`` (also as default iterator)| ``allterms_begin()``                  | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``postlist(term)``                       | ``postlist_begin(term)``              | ``PostingIter``      |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``termlist(docid)``                      | ``termlist_begin(docid)``             | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``positionlist(docid, term)``            | ``positionlist_begin(docid, term)``   | ``PositionIter``     |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``metadata_keys(prefix)``                | ``metadata_keys_begin(prefix)``       | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``spellings()``                          | ``spellings_begin()``                 | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``synonyms(term)``                       | ``synonyms_begin(term)``              | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Database``          | ``synonym_keys(prefix)``                 | ``synonym_keys_begin(prefix)``        | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Document``          | ``values()``                             | ``values_begin()``                    | ``ValueIter``        |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``Document``          | ``termlist()`` (also as default iterator)| ``termlist_begin()``                  | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``QueryParser``       | ``stoplist()``                           | ``stoplist_begin()``                  | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``QueryParser``       | ``unstemlist(term)``                     | ``unstem_begin(term)``                | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``ValueCountMatchSpy``| ``values()``                             | ``values_begin()``                    | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+
  |``ValueCountMatchSpy``| ``top_values()``                         | ``top_values_begin()``                | ``TermIter``         |
  +----------------------+------------------------------------------+---------------------------------------+----------------------+

The pythonic iterators generally return Python objects, with properties
available as attribute values, with lazy evaluation where appropriate. An
exception is `PositionIter` (as returned by `Database.positionlist` for
example), which returns an integer.
The lazy evaluation is mainly transparent, but does become visible in one
situation: if you keep an object returned by an iterator, without evaluating
its properties to force the lazy evaluation to happen, and then move the
iterator forward, the object may no longer be able to efficiently perform the
lazy evaluation. In this situation, an exception will be raised indicating
that the information requested wasn't available. This will only happen for a
few of the properties - most are either not evaluated lazily (because the
underlying Xapian implementation doesn't evaluate them lazily, so there's no
advantage in lazy evaluation), or can be accessed even after the iterator has
moved. The simplest workaround is to evaluate any affected properties you
wish to use before moving the iterator. The complete set of iterator
properties affected by this is:

* `Database.allterms` (also accessible as `Database.__iter__`): `termfreq`
* `Database.termlist`: `termfreq` and `positer`
* `Document.termlist` (also accessible as `Document.__iter__`): `termfreq` and `positer`
* `Database.postlist`: `positer`
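
The workaround described above amounts to copying the affected properties
while each item is still the iterator's current item. A minimal sketch
follows; `snapshot_terms` is a hypothetical helper, and the `SimpleNamespace`
objects are stand-ins for the items yielded by `Database.allterms()` so that
the example runs without a database.

```python
from types import SimpleNamespace


def snapshot_terms(term_iter):
    """Copy term and termfreq out of each item before the iterator moves on."""
    snapshot = []
    for item in term_iter:
        # Reading termfreq here forces the lazy evaluation while it is
        # still cheap, i.e. before the loop advances the iterator.
        snapshot.append((item.term.decode("utf-8"), item.termfreq))
    return snapshot


# Stand-in items (assumption); with Xapian you would pass db.allterms().
stand_ins = [
    SimpleNamespace(term=b"hello", termfreq=3),
    SimpleNamespace(term=b"world", termfreq=1),
]

print(snapshot_terms(stand_ins))
```

The returned tuples are plain Python values, so they remain usable however far
the iterator has moved.
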
MSet
====
MSet objects have some additional methods to simplify access (these
work using the C++ array dereferencing):

.. table:: MSet additional methods

  +--------------------------------+--------------------------------------+
  | Method name                    | Explanation                          |
  +================================+======================================+
  | ``get_hit(i)``                 | returns ``MSetItem`` at index ``i``  |
  +--------------------------------+--------------------------------------+
  | ``get_document_percentage(i)`` | ``convert_to_percent(get_hit(i))``   |
  +--------------------------------+--------------------------------------+
  | ``get_document(i)``            | ``get_hit(i).get_document()``        |
  +--------------------------------+--------------------------------------+
  | ``get_docid(i)``               | ``get_hit(i).get_docid()``           |
  +--------------------------------+--------------------------------------+

Two MSet objects are equal if they have the same number and maximum possible
number of members, and if every document member of the first MSet exists at the
same index in the second MSet, with the same weight.
Non-Class Functions
===================
The C++ API contains a few non-class functions (the Database factory
functions, and some functions reporting version information), which are
wrapped like so for Python 3:
* `Xapian::version_string()` is wrapped as `xapian.version_string()`
* `Xapian::major_version()` is wrapped as `xapian.major_version()`
* `Xapian::minor_version()` is wrapped as `xapian.minor_version()`
* `Xapian::revision()` is wrapped as `xapian.revision()`
* `Xapian::Remote::open()` is wrapped as `xapian.remote_open()` (both
the TCP and "program" versions are wrapped - the SWIG wrapper checks the parameter list to
decide which to call).
* `Xapian::Remote::open_writable()` is wrapped as `xapian.remote_open_writable()` (both
the TCP and "program" versions are wrapped - the SWIG wrapper checks the parameter list to
decide which to call).
The following were deprecated in the C++ API before the Python 3 bindings saw
a stable release, so are not wrapped for Python 3:
* `Xapian::Auto::open_stub()`
* `Xapian::Chert::open()`
* `Xapian::InMemory::open()`
The version of the bindings in use is available as `xapian.__version__` (as
recommended by PEP 396). This may not be the same as `xapian.version_string()`
as the latter is the version of xapian-core (the C++ library) in use.
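
When code depends on behaviour introduced in a particular release, it can be
convenient to compare versions as tuples. The sketch below parses a version
string of the form returned by `xapian.version_string()`. `version_tuple` is a
hypothetical helper, the literal string stands in for a real return value, and
the parsing assumes a purely numeric ``MAJOR.MINOR.REVISION`` string.

```python
def version_tuple(version_string):
    """Convert 'MAJOR.MINOR.REVISION' to a tuple of ints (illustrative helper)."""
    return tuple(int(part) for part in version_string.split("."))


# Stand-in for xapian.version_string(); 1.4.22 is the release from which
# these bindings require Python >= 3.3.
core_version = version_tuple("1.4.22")

# Tuples compare element by element, so 1.4.9 sorts before 1.4.22.
print(core_version >= (1, 4, 22))
print(version_tuple("1.4.9") < core_version)
```

With the bindings loaded, the same tuple can be built without any parsing from
`xapian.major_version()`, `xapian.minor_version()` and `xapian.revision()`.
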
Query
=====
In C++ there's a Xapian::Query constructor which takes a query operator and
start/end iterators specifying a number of terms or queries, plus an optional
parameter. In Python, this is wrapped to accept any Python sequence (for
example a list or tuple) of terms or queries (or even a mixture of terms
and queries). For example:

::

    subq = xapian.Query(xapian.Query.OP_AND, "hello", "world")
    q = xapian.Query(xapian.Query.OP_AND, [subq, "foo", xapian.Query("bar", 2)])

MatchAll and MatchNothing
-------------------------
These are wrapped as `xapian.Query.MatchAll` and
`xapian.Query.MatchNothing`.
MatchDecider
============
Custom MatchDeciders can be created in Python - subclass
`xapian.MatchDecider`, ensure you call the super-constructor from your
constructor, and define a `__call__` method that will do the work. The
simplest example (which does nothing useful) would be as follows:

::

    class mymatchdecider(xapian.MatchDecider):
        def __init__(self):
            xapian.MatchDecider.__init__(self)

        def __call__(self, doc):
            # Accept all documents.
            return True

ValueRangeProcessor
===================
The `ValueRangeProcessor` class is deprecated and will be removed in Xapian
2.0.0. The replacement is `RangeProcessor` (added in Xapian 1.3.6). Use
`RangeProcessor` instead in new code - it's more flexible because it
can return an arbitrary `Query` object. This section documenting
`ValueRangeProcessor` is here to aid migrating existing uses.
The `ValueRangeProcessor` class (and its subclasses) provides an `operator()`
method (which is exposed in Python as a `__call__()` method, making the class
instances into callables). This method checks whether the beginning and end
of a range are in a format understood by the `ValueRangeProcessor`, and if so,
converts the beginning and end into strings which sort appropriately.
`ValueRangeProcessor` objects can be defined in Python (and then passed to the
`QueryParser`), or you can use one of the several built-in ones.

In C++ the `operator()` method takes two `std::string` arguments by reference,
which the subclassed method can modify, and returns a value slot number.
In Python, we wrap this by passing two `bytes` objects to `__call__()` and
having it return a tuple of `(value_slot, modified_begin, modified_end)`.
For example::

    vrp = xapian.NumberValueRangeProcessor(0, '$', True)
    a = '$10'
    b = '20'
    slot, a, b = vrp(a, b)

You can implement your own `ValueRangeProcessor` in Python. The Python
implementation should override the `__call__()` method with its own
implementation, which returns a tuple as above. For example::

    class MyVRP(xapian.ValueRangeProcessor):
        def __init__(self):
            xapian.ValueRangeProcessor.__init__(self)

        def __call__(self, begin, end):
            # begin and end arrive as bytes, so prepend bytes literals.
            return (7, b"A" + begin, b"B" + end)

The equivalent `RangeProcessor` subclass to `MyVRP` would look like this:

::

    class MyRP(xapian.RangeProcessor):
        def __init__(self):
            xapian.RangeProcessor.__init__(self)

        def __call__(self, begin, end):
            # OP_VALUE_RANGE takes the value slot (7, as in MyVRP) before the bounds.
            return xapian.Query(xapian.Query.OP_VALUE_RANGE, 7, b"A" + begin, b"B" + end)

Return `xapian.Query(xapian.Query.OP_INVALID)` to signal that you don't want to
handle an offered range.
Apache and mod_python/mod_wsgi
==============================
Prior to Xapian 1.3.0, you had to tell mod_python and mod_wsgi to run
applications which use Xapian in the main interpreter. Xapian 1.3.0 no
longer uses the simplified GIL state API, and so this restriction no
longer applies.
Test Suite
==========
The Python bindings come with a test suite, consisting of two test files:
`smoketest.py` and `pythontest.py`. These are run by the `make check` command,
or may be run manually. By default, they will display the names of any tests
which failed, and then display a count of tests which ran and which failed.
The verbosity may be increased by setting the `VERBOSE` environment variable,
for example::

    make check VERBOSE=1

Setting `VERBOSE` to 1 will display detailed information about failures, and a
value of 2 will display further information about the progress of tests.