File: replication.rst

package info (click to toggle)
xapian-core 1.2.3-2
  • links: PTS, VCS
  • area: main
  • in suites: squeeze
  • size: 17,528 kB
  • ctags: 13,287
  • sloc: cpp: 99,474; sh: 10,626; ansic: 7,551; perl: 758; makefile: 611; python: 40
file content (298 lines) | stat: -rw-r--r-- 15,292 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
.. Copyright (C) 2008 Lemur Consulting Ltd
.. Copyright (C) 2008 Olly Betts

=======================================
Xapian Database Replication Users Guide
=======================================

.. contents:: Table of contents

Introduction
============

It is often desirable to maintain multiple copies of a Xapian database, having
a "master" database which modifications are made on, and a set of secondary
(read-only, "slave") databases which these modifications propagate to.  For
example, to support a high query load there may be many search servers, each
with a local copy of the database, and a single indexing server.  In order to
allow scaling to a large number of search servers, with large databases and
frequent updates, we need an database replication implementation to have the
following characteristics:

 - Data transfer is (at most) proportional to the size of the updates, rather
   than the size of the database, to allow frequent small updates to large
   databases to be replicated efficiently.

 - Searching (on the slave databases) and indexing (on the master database) can
   continue during synchronisation.

 - Data cached (in memory) on the slave databases is not discarded (unless it's
   actually out of date) as updates arrive, to ensure that searches continue to
   be performed quickly during and after updates.

 - Synchronising each slave database involves low overhead (both IO and CPU) on
   the server holding the master database, so that many slaves can be updated
   from a single master without overloading it.

 - Database synchronisation can be recovered after network outages or server
   failures without manual intervention and without excessive data transfer.

The database replication protocol is intended to support replicating a single
writable database to multiple (read-only) search servers, while satisfying all
of the above properties.  It is not intended to support replication of multiple
writable databases - there must always be a single master database to which all
modifications are made.

This document gives an overview of how and why to use the replication protocol.
For technical details of the implementation of the replication protocol, see
the separate `Replication Protocol <replication_protocol.html>`_ document.

Backend Support
===============

Replication is supported by the chert, flint, and brass database backends,
and can cleanly handle the
master switching database type (a full copy is sent in this situation).  It
doesn't make a lot of sense to support replication for the remote backend.
Replication of inmemory databases isn't currently available.  We have a longer
term aim to replace the current inmemory backend with the current disk based
backend (e.g. chert) but storing its data in memory.  Once this is done, it
would probably be easy to support replication of inmemory databases.

Setting up replicated databases
===============================

.. FIXME - expand this section.

To replicate a database efficiently from one master machine to other machines,
there is one configuration step to be performed on the master machine, and two
servers to run.

Firstly, on the master machine, the indexer must be run with the environment
variable `XAPIAN_MAX_CHANGESETS` set to a non-zero value.  (Currently, the
actual value it is set to is irrelevant, but I suggest using a value of 10).
This will cause changeset files to be created whenever a transaction is
performed, which allow the transaction to be replaced efficiently on a replica
of the database.

Secondly, also on the master machine, run the `xapian-replicate-server` server
to serve the databases which are to be replicated.  This takes various
parameters to control the directory that databases are found in, and the
network interface to serve on.  The `--help` option will cause usage
information to be displayed.  For example, if `/var/search/dbs`` contains a
set of Xapian databases to be replicated::

  ./xapian-replicate-server /var/search/dbs -p 7010

would run a server allowing access to these databases, on port 7010.

Finally, on the client machine, run the `xapian-replicate` server to keep an
individual database up-to-date.  This will contact the server on the specified
host and port, and copy the database with the name (on the master) specified in
the `-m` option to the client.  One non-option argument is required - this is
the name that the database should be stored in on the slave machine.  For
example, contacting the above server from the same machine::

  ./xapian-replicate -h 127.0.0.1 -p 7010 -m foo foo2

would produce a database "foo2" containing a replica of the database
"/var/search/dbs/foo".

Both the server and client can be run in "one-shot" mode, by passing `-o`.
This may be particularly useful for the client, to allow a shell script to be
used to cycle through a set of databases, updating each in turn (and then
probably sleeping for a period).

Limitations
===========

It is possible to confuse the replication system in some cases, such that an
invalid database will be produced on the client.  However, this is easy to
avoid in practice.

To confuse the replication system, the following needs to happen:

 - Start with two databases, A and B.
 - Start a replication of database A.
 - While the replication is in progress, swap B in place of A (ie, by moving
   the files around, such that B is now at the path of A).
 - While the replication is still in progress, swap A back in place of B.

If this happens, the replication process will not detect the change in
database, and you are likely to end up with a database on the client which
contains parts of A and B mixed together.  You will need to delete the damaged
database on the client, and re-run the replication.

To avoid this, simply avoid swapping a database back in place of another one.
Or at least, if you must do this, wait until any replications in progress when
you were using the original database have finished.

Calling reopen
--------------

The reopen method is usually an efficient way to ensure that a database is
up-to-date with the latest changes.  Unfortunately, it does not currently work
correctly with databases which are being updated by the replication client.
The workaround is simple; don't use the reopen() method on databases created by
the replication client: instead, you should close the database and open it
again from scratch.

Briefly, the issue is that the databases created by the replication client are
created in a subdirectory of the target path supplied to the client, rather
than at that path.  A "stub database" file is then created in that directory,
pointing to the database.  This allows the database which readers open to be
switched atomically after a database copy has occurred.  The reopen() method
doesn't currently re-read the stub database file in this situation, so ends up
attempting to read the old database which has been deleted.

Ticket #434 in the Xapian Trac system gives some more details and discussion
about this issue, and will be updated when a fix is implemented.

Alternative approaches
======================

Without using the database replication protocol, there are various ways in
which the "single master, multiple slaves" setup could be implemented.

 - Copy database from master to all slaves after each update, then swap the new
   database for the old.

 - Synchronise databases from the master to the slaves using rsync.

 - Keep copy of database on master from before each update, and use a binary
   diff algorithm (e.g., xdelta) to calculate the changes, and then apply these
   same changes to the databases on each slave.

 - Serve database from master to slaves over NFS (or other remote file system).

 - Use the "remote database backend" facility of Xapian to allow slave servers
   to search the database directly on the master.

All of these could be made to work but have various drawbacks, and fail to
satisfy all the desired characteristics.  Let's examine them in detail:

Copying database after each update
----------------------------------

Databases could be pushed to the slaves after each update simply by copying the
entire database from the master (using scp, ftp, http or one of the many other
transfer options).  After the copy is completed, the new database would be made
live (perhaps by symlink switching, if symlinks are available).  After a
reasonable interval to allow searches in progress on the old database to
complete, the old database would be removed.  (On UNIX filesystems, the old
database could be unlinked immediately, and the resources used by it would be
automatically freed as soon as the current searches using it complete.)

This approach has the advantage of simplicity, and also ensures that the
databases can be correctly re-synchronised after network outages or hardware
failure.

However, this approach would involve copying a large amount of data for each
update, however small the update was.  Also, because the search server would
have to switch to access new files each time an update was pushed, the search
server will be likely to experience poor performance due to commonly accessed
pages falling out of the disk cache during the update.  In particular, although
some of the newly pushed data would be likely to be in the cache immediately
after the update, if the combination of the old and new database sizes exceeds
the size of the memory available on the search servers for caching, either some
of the live database will be dropped from the cache resulting in poor
performance during the update, or some of the new database will not initially
be present in the cache after update.

Synchronise database using rsync
--------------------------------

Rsync works by calculating hashes for the content on the client and the server,
sending the hashes from the client to the server, and then calculating (on the
server) which pieces of the file need to be sent to update the client.  This
results in a fairly low amount of network traffic, but puts a fairly high CPU
load on the server.  This would result in a large load being placed on the
master server if a large number of slaves tried to synchronise with it.

Also, rsync will not reliably update the database in a manner which allows the
database on a slave to be searched while being updated - therefore, a copy or
snapshot of the database would need to be taken first to allow searches to
continue (accessing the copy) while the database is being synchronised.

If a copy is used, the caching problems discussed in the previous section would
apply again.  If a snapshotting filesystem is used, it may be possible to take
a read-only snapshot copy cheaply (and without encountering poor caching
behaviour), but filesystems with support for this are not always available, and
may require considerable effort to set up even if they are available.

Use a binary diff algorithm
---------------------------

If a copy of the database on the master before the update was kept, a binary
diff algorithm (such as "xdelta") could be used to compare the old and new
versions of the database.  This would produce a patch file which could be
transferred to the slaves, and then applied - avoiding the need for specific
calculations to be performed for each slave.

However, this requires a copy or snapshot to be taken on the master - which has
the same problems as previously discussed.  A copy or snapshot would also need
to be taken on the slave, since a patch from xdelta couldn't safely be applied
to a live database.

Serve database from master to slaves over NFS
---------------------------------------------

NFS allows a section of a filesystem to be exported to a remote host.  Xapian
is quite capable of searching a database which is exported in such a manner,
and thus NFS can be used to quickly and easily share a database from the master
to multiple slaves.

A reasonable setup might be to use a powerful machine with a fast disk as the
master, and use that same machine as an NFS server.  Then, multiple slaves can
connect to that NFS server for searching the database. This setup is quite
convenient, because it separates the indexing workload from the search workload
to a reasonable extent, but may lead to performance problems.

There are two main problems which are likely to be encountered.  Firstly, in
order to work efficiently, NFS clients (or the OS filesystem layer above NFS)
cache information read from the remote file system in memory.  If there is
insufficient memory available to cache the whole database in memory, searches
will occasionally need to access parts of the database which are held only on
the master server.  Such searches will take a long time to complete, because
the round-trip time for an access to a disk block on the master is typically a
lot slower than the round-trip time for access to a local disk.  Additionally,
if the local network experiences problems, or the master server fails (or gets
overloaded due to all the search requests), the searches will be unable to be
completed.

Also, when a file is modified, the NFS protocol has no way of indicating that
only a small set of blocks in the file have been modified.  The caching is all
implemented by NFS clients, which can do little other than check the file
modification time periodically, and invalidate all cached blocks for the file
if the modification time has changed. For the Linux client, the time between
checks can be configured by setting the acregmin and acregmax mount options,
but whatever these are set to, the whole file will be dropped from the cache
when any modification is found.

This means that, after every update to the database on the master, searches on
the slaves will have to fetch all the blocks required for their search across
the network, which will likely result in extremely slow search times until the
cache on the slaves gets populated properly again.

Use the "remote database backend" facility
------------------------------------------

Xapian has supported a "remote" database backend since the very early days of
the project.  This allows a search to be run against a database on a remote
machine, which may seem to be exactly what we want.  However, the "remote"
database backend works by performing most of the work for a search on the
remote end - in the situation we're concerned with, this would mean that most
of the work was performed on the master, while slaves remain largely idle.

The "remote" database backend is intended to allow a large database to be
split, at the document level, between multiple hosts.  This allows systems to
be built which search a very large database with some degree of parallelism
(and thus provide faster individual searches than a system searching a single
database locally).  In contrast, the database replication protocol is intended
to allow a database to be copied to multiple machines to support a high
concurrent search load (and thus to allow a higher throughput of searches).

In some cases (i.e., a very large database and a high concurrent search load)
it may be perfectly reasonable to use both the database replication protocol in
conjunction with the "remote" database backend to get both of these advantages
- the two systems solve different problems.