.. Copyright (C) 2003 James Aylett
.. Copyright (C) 2004-2022 Olly Betts
.. Copyright (C) 2007,2008,2010 Richard Boulton

===========================
Python3 bindings for Xapian
===========================

.. contents:: Table of contents

The Python3 bindings for Xapian are packaged in the ``xapian`` module,
so to use them you need to add this to your code::

  import xapian

Since Xapian 1.4.22 these bindings require Python >= 3.3.  If you still need
support for older Python versions, Xapian <= 1.4.21 supports Python 3.2.
If you still need Python2 support, there are separate bindings for that.

The Python API largely follows the C++ API - the differences and
additions are noted below.

Strings
=======

The Xapian C++ API is largely agnostic about character encoding, and uses the
`std::string` type as an opaque container for a sequence of bytes.
In places where the bytes represent text (for example, in the
`Stem`, `QueryParser` and `TermGenerator` classes), UTF-8 encoding is used.  In
order to wrap this for Python, `std::string` is mapped to/from the Python
`bytes` type.

As a convenience, you can also pass Python
`str` objects as parameters where this is appropriate, which will be
converted to UTF-8 encoded text.  Where `std::string` is
returned, it's always mapped to `bytes` in Python, which you can
convert to a Python `str` by calling `.decode('utf-8')`
on it like so::

  for i in doc.termlist():
    print(i.term.decode('utf-8'))

Therefore, to avoid issues with character encodings, you should always pass
text data to Xapian either as `str` objects or as UTF-8 encoded `bytes`
objects.

There is, however, no requirement for byte strings passed into
Xapian to be valid UTF-8 encoded strings, unless they are being passed to a
text processing routine (such as the query parser, or the stemming
algorithms).  For example, it is perfectly valid to pass arbitrary binary
data to the ``xapian.Document.set_data()`` method.
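The conversions described in this section can be sketched with built-in types alone, so the example below runs without a database. The helper name `as_text` is hypothetical and not part of the bindings; its `errors` parameter is simply passed through to `bytes.decode()`.

```python
def as_text(value, errors="strict"):
    """Return ``value`` as ``str``, decoding ``bytes`` as UTF-8.

    Hypothetical helper: the bindings return ``bytes`` wherever the C++ API
    returns ``std::string``, so text has to be decoded before display.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8", errors)
    return value


# A term as the bindings would return it: UTF-8 encoded bytes.
term = "café".encode("utf-8")
print(as_text(term))

# Arbitrary binary data is acceptable for set_data(), but it is not
# necessarily valid UTF-8, so decode it leniently (or not at all).
blob = b"\xff\xfe\x00\x01"
print(ascii(as_text(blob, errors="replace")))
```

Decoding with `errors="replace"` is only suitable for display; binary document data is usually better kept as `bytes`.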

Unicode
=======

Unicode text is most often in NFC already, but if you need to normalise text
before passing it to Xapian, the standard Python module ``unicodedata``
provides support for normalising Unicode: you probably want the ``NFKC``
normalisation scheme, so for example normalising a query string prior to
parsing it would look something like this::

    import unicodedata

    import xapian

    def parse_query(query_string):
        query_string = unicodedata.normalize('NFKC', query_string)
        qp = xapian.QueryParser()
        return qp.parse_query(query_string)
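To illustrate why the choice of normalisation form matters, the following sketch compares ``NFC`` and ``NFKC`` on a few sample strings using only the standard library. The function name `normalise_for_search` and the sample strings are illustrative assumptions, not part of Xapian.

```python
import unicodedata


def normalise_for_search(text):
    """Apply NFKC, which composes characters and folds compatibility forms."""
    return unicodedata.normalize("NFKC", text)


# "e" + combining acute accent, an "fi" ligature, and a full-width "A".
samples = ["e\u0301", "\ufb01le", "\uff21BC"]
for s in samples:
    nfc = unicodedata.normalize("NFC", s)
    nfkc = normalise_for_search(s)
    # ascii() makes otherwise invisible differences visible.
    print(ascii(s), "NFC:", ascii(nfc), "NFKC:", ascii(nfkc))
```

``NFC`` composes the accented letter but leaves the ligature and the full-width letter untouched, whereas ``NFKC`` also maps them to plain ``fi`` and ``A``. Whichever form you choose, it generally needs to be applied consistently at both index time and query time.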

Exceptions
==========

Xapian-specific exceptions are subclasses of the :xapian-class:`Error`
class, so you can trap all Xapian-specific exceptions like so::

    try:
        do_something_with_xapian()
    except xapian.Error as e:
        print(str(e))

`xapian.Error` is a subclass of the standard Python `Exception` class, so it
will also be caught by `except Exception`.

Iterators
=========

The iterator classes in the Xapian C++ API are wrapped in a pythonic style.
The following are supported (where marked as "default iterator", it means
`__iter__()` does the right thing so you can for instance use
`for term in document` to iterate over terms in a Document object):

.. table:: Python iterators

+----------------------+------------------------------------------+---------------------------------------+----------------------+
| Class                | Python Method                            | Equivalent C++ Method                 | Python iterator type |
+======================+==========================================+=======================================+======================+
|``MSet``              | default iterator                         | ``begin()``                           | ``MSetIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``ESet``              | default iterator                         | ``begin()``                           | ``ESetIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Enquire``           | ``matching_terms()``                     | ``get_matching_terms_begin()``        | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Query``             | default iterator                         | ``get_terms_begin()``                 | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``allterms()`` (also as default iterator)| ``allterms_begin()``                  | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``postlist(term)``                       | ``postlist_begin(term)``              | ``PostingIter``      |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``termlist(docid)``                      | ``termlist_begin(docid)``             | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``positionlist(docid, term)``            | ``positionlist_begin(docid, term)``   | ``PositionIter``     |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``metadata_keys(prefix)``                | ``metadata_keys_begin(prefix)``       | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``spellings()``                          | ``spellings_begin()``                 | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``synonyms(term)``                       | ``synonyms_begin(term)``              | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Database``          | ``synonym_keys(prefix)``                 | ``synonym_keys_begin(prefix)``        | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Document``          | ``values()``                             | ``values_begin()``                    | ``ValueIter``        |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``Document``          | ``termlist()`` (also as default iterator)| ``termlist_begin()``                  | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``QueryParser``       | ``stoplist()``                           | ``stoplist_begin()``                  | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``QueryParser``       | ``unstemlist(term)``                     | ``unstem_begin(term)``                | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``ValueCountMatchSpy``| ``values()``                             | ``values_begin()``                    | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+
|``ValueCountMatchSpy``| ``top_values()``                         | ``top_values_begin()``                | ``TermIter``         |
+----------------------+------------------------------------------+---------------------------------------+----------------------+


The pythonic iterators generally return Python objects, with properties
available as attribute values, with lazy evaluation where appropriate.  An
exception is `PositionIter` (as returned by `Database.positionlist` for
example), which returns an integer.

The lazy evaluation is mainly transparent, but does become visible in one
situation: if you keep an object returned by an iterator, without evaluating
its properties to force the lazy evaluation to happen, and then move the
iterator forward, the object may no longer be able to efficiently perform the
lazy evaluation.  In this situation, an exception will be raised indicating
that the information requested wasn't available.  This will only happen for a
few of the properties - most are either not evaluated lazily (because the
underlying Xapian implementation doesn't evaluate them lazily, so there's no
advantage in lazy evaluation), or can be accessed even after the iterator has
moved.  The simplest workaround is to evaluate any properties you wish to use
which are affected by this before moving the iterator.  The complete set of
iterator properties affected by this is:

 * `Database.allterms` (also accessible as `Database.__iter__`): `termfreq`
 * `Database.termlist`: `termfreq` and `positer`
 * `Document.termlist` (also accessible as `Document.__iter__`): `termfreq` and `positer`
 * `Database.postlist`: `positer`
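One way to apply this workaround is to copy the lazily evaluated properties into plain Python values while the iterator still points at the item. The sketch below assumes items with `term` and `termfreq` attributes, as yielded by `Database.termlist`; the helper name `snapshot_terms` is hypothetical.

```python
def snapshot_terms(termlist):
    """Return ``(term, termfreq)`` pairs read eagerly from ``termlist``.

    Each ``termfreq`` is read before the loop advances, so nothing is
    left to be evaluated lazily once the iterator has moved on.
    """
    snapshot = []
    for item in termlist:
        # Touch the lazy property now, and decode the bytes term for display.
        snapshot.append((item.term.decode("utf-8"), item.termfreq))
    return snapshot
```

With an open database `db`, a call such as `snapshot_terms(db.termlist(docid))` should then give a list which remains usable after iteration has finished.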

MSet
====

MSet objects have some additional methods to simplify access (these
work using the C++ array dereferencing):

.. table:: MSet additional methods

+--------------------------------+--------------------------------------+
| Method name                    |            Explanation               |
+================================+======================================+
| ``get_hit(i)``                 |  returns ``MSetItem`` at index ``i`` |
+--------------------------------+--------------------------------------+
| ``get_document_percentage(i)`` | ``convert_to_percent(get_hit(i))``   |
+--------------------------------+--------------------------------------+
| ``get_document(i)``            | ``get_hit(i).get_document()``        |
+--------------------------------+--------------------------------------+
| ``get_docid(i)``               | ``get_hit(i).get_docid()``           |
+--------------------------------+--------------------------------------+


Two MSet objects are equal if they have the same number and maximum possible
number of members, and if every document member of the first MSet exists at the
same index in the second MSet, with the same weight.

Non-Class Functions
===================

The C++ API contains a few non-class functions (the Database factory
functions, and some functions reporting version information), which are
wrapped like so for Python 3:

 * `Xapian::version_string()` is wrapped as `xapian.version_string()`
 * `Xapian::major_version()` is wrapped as `xapian.major_version()`
 * `Xapian::minor_version()` is wrapped as `xapian.minor_version()`
 * `Xapian::revision()` is wrapped as `xapian.revision()`

 * `Xapian::Remote::open()` is wrapped as `xapian.remote_open()` (both
   the TCP and "program" versions are wrapped - the SWIG wrapper checks the parameter list to
   decide which to call).
 * `Xapian::Remote::open_writable()` is wrapped as `xapian.remote_open_writable()` (both
   the TCP and "program" versions are wrapped - the SWIG wrapper checks the parameter list to
   decide which to call).

The following were deprecated in the C++ API before the Python 3 bindings saw
a stable release, so are not wrapped for Python 3:

 * `Xapian::Auto::open_stub()`
 * `Xapian::Chert::open()`
 * `Xapian::InMemory::open()`

The version of the bindings in use is available as `xapian.__version__` (as
recommended by PEP 396).  This may not be the same as `xapian.version_string()`
as the latter is the version of xapian-core (the C++ library) in use.
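If you need to compare version strings, for example to check for a minimum xapian-core version, comparing them as tuples of integers avoids the pitfalls of plain string comparison. The following is a minimal sketch; the helper names `version_tuple` and `is_at_least` are hypothetical, and it assumes a purely numeric dotted version such as ``1.4.22``.

```python
def version_tuple(version):
    """Convert a dotted version string such as '1.4.22' to a tuple of ints."""
    if isinstance(version, bytes):
        # Accept bytes as well as str, in case the value was returned as bytes.
        version = version.decode("utf-8")
    return tuple(int(part) for part in version.split("."))


def is_at_least(version, minimum):
    """Return True if ``version`` is the same as or newer than ``minimum``."""
    return version_tuple(version) >= version_tuple(minimum)


print(is_at_least("1.4.31", "1.4.22"))
# String comparison gets this wrong because "9" sorts after "2".
print("1.4.9" < "1.4.22", version_tuple("1.4.9") < version_tuple("1.4.22"))
```

With the bindings installed, `is_at_least(xapian.version_string(), "1.4.22")` would perform the same check against the xapian-core library in use.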

Query
=====

In C++ there's a Xapian::Query constructor which takes a query operator and
start/end iterators specifying a number of terms or queries, plus an optional
parameter.  In Python, this is wrapped to accept any Python sequence (for
example a list or tuple) of terms or queries (or even a mixture of terms
and queries).  For example:


::

  subq = xapian.Query(xapian.Query.OP_AND, "hello", "world")
  q = xapian.Query(xapian.Query.OP_AND, [subq, "foo", xapian.Query("bar", 2)])


MatchAll and MatchNothing
-------------------------

These are wrapped as `xapian.Query.MatchAll` and
`xapian.Query.MatchNothing`.


MatchDecider
============

Custom MatchDeciders can be created in Python - subclass
`xapian.MatchDecider`, ensure you call the super-constructor from your
constructor, and define a `__call__` method that will do the work. The
simplest example (which does nothing useful) would be as follows:

::

  class mymatchdecider(xapian.MatchDecider):
    def __init__(self):
      xapian.MatchDecider.__init__(self)

    def __call__(self, doc):
      # Accept all documents.
      return True

ValueRangeProcessor
===================

The `ValueRangeProcessor` class is deprecated and will be removed in Xapian
2.0.0.  The replacement is `RangeProcessor` (added in Xapian 1.3.6).  Use
`RangeProcessor` instead in new code - it's more flexible because it
can return an arbitrary `Query` object.  This section documenting
`ValueRangeProcessor` is here to aid migrating existing uses.

The `ValueRangeProcessor` class (and its subclasses) provides an `operator()`
method (which is exposed in Python as a `__call__()` method, making the class
instances into callables).  This method checks whether the beginning and end of
a range are in a format understood by the `ValueRangeProcessor`, and if so,
converts the beginning and end into strings which sort appropriately.
ValueRangeProcessors can be defined in Python (and then passed to the
`QueryParser`), or there are several built-in ones which can be used.

In C++ the operator() method takes two std::string arguments by reference,
which the subclassed method can modify, and returns a value slot number.
In Python, we wrap this by passing two `bytes` objects to
__call__ and having it return a tuple of (value_slot, modified_begin,
modified_end).  For example::

  vrp = xapian.NumberValueRangeProcessor(0, '$', True)
  a = '$10'
  b = '20'
  slot, a, b = vrp(a, b)

You can implement your own ValueRangeProcessor in Python.  The Python
implementation should override the __call__() method with its own
implementation, which returns a tuple as above.  For example::

  class MyVRP(xapian.ValueRangeProcessor):
    def __init__(self):
      xapian.ValueRangeProcessor.__init__(self)
    def __call__(self, begin, end):
      # begin and end are passed in as bytes, so the prefixes must be bytes.
      return (7, b"A" + begin, b"B" + end)

The equivalent `RangeProcessor` subclass to `MyVRP` would look like this:

::

  class MyRP(xapian.RangeProcessor):
      def __init__(self):
          xapian.RangeProcessor.__init__(self)
      def __call__(self, begin, end):
          # Slot 7 matches MyVRP above; begin and end are passed in as bytes.
          return xapian.Query(xapian.Query.OP_VALUE_RANGE, 7, b"A" + begin, b"B" + end)

Return `xapian.Query(xapian.Query.OP_INVALID)` to signal that you don't want to
handle an offered range.

Apache and mod_python/mod_wsgi
==============================

Prior to Xapian 1.3.0, you had to tell mod_python and mod_wsgi to run
applications which use Xapian in the main interpreter.  Xapian 1.3.0 no
longer uses the simplified GIL state API, and so this restriction no
longer applies.

Test Suite
==========

The Python bindings come with a test suite, consisting of two test files:
`smoketest.py` and `pythontest.py`. These are run by the `make check` command,
or may be run manually.  By default, they will display the names of any tests
which failed, and then display a count of the tests which ran and which failed.
The verbosity may be increased by setting the `VERBOSE` environment variable,
for example::

 make check VERBOSE=1

Setting VERBOSE to 1 will display detailed information about failures, and a
value of 2 will display further information about the progress of tests.