<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.21.2: https://docutils.sourceforge.io/" />
<title>Theoretical Background</title>
<style type="text/css">

/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 9511 2024-01-13 09:50:07Z milde $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.
Despite the name, some widely supported CSS2 features are used.

See https://docutils.sourceforge.io/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

.subscript {
  vertical-align: sub;
  font-size: smaller }

.superscript {
  vertical-align: super;
  font-size: smaller }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] {
  overflow: hidden;
}

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title, .code .error {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left, .figure.align-left, object.align-left, table.align-left {
  clear: left ;
  float: left ;
  margin-right: 1em }

img.align-right, .figure.align-right, object.align-right, table.align-right {
  clear: right ;
  float: right ;
  margin-left: 1em }

img.align-center, .figure.align-center, object.align-center {
  display: block;
  margin-left: auto;
  margin-right: auto;
}

table.align-center {
  margin-left: auto;
  margin-right: auto;
}

.align-left {
  text-align: left }

.align-center {
  clear: both ;
  text-align: center }

.align-right {
  text-align: right }

/* reset inner alignment in figures */
div.align-right {
  text-align: inherit }

/* div.align-center * { */
/*   text-align: left } */

.align-top    {
  vertical-align: top }

.align-middle {
  vertical-align: middle }

.align-bottom {
  vertical-align: bottom }

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font: inherit }

pre.literal-block, pre.doctest-block, pre.math, pre.code {
  margin-left: 2em ;
  margin-right: 2em }

pre.code .ln { color: gray; } /* line numbers */
pre.code, code { background-color: #eeeeee }
pre.code .comment, code .comment { color: #5C6576 }
pre.code .keyword, code .keyword { color: #3B0D06; font-weight: bold }
pre.code .literal.string, code .literal.string { color: #0C5404 }
pre.code .name.builtin, code .name.builtin { color: #352B84 }
pre.code .deleted, code .deleted { background-color: #DEB0A1}
pre.code .inserted, code .inserted { background-color: #A3D289}

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic, pre.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

/* "booktabs" style (no vertical lines) */
table.docutils.booktabs {
  border: 0px;
  border-top: 2px solid;
  border-bottom: 2px solid;
  border-collapse: collapse;
}
table.docutils.booktabs * {
  border: 0px;
}
table.docutils.booktabs th {
  border-bottom: thin solid;
  text-align: left;
}

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
</head>
<body>
<div class="document" id="theoretical-background">
<h1 class="title">Theoretical Background</h1>

<p>This document aims to provide some theoretical background to Xapian.</p>
<div class="section" id="documents-and-terms">
<h1>Documents and terms</h1>
<p>In Information Retrieval (IR), the items we are trying to retrieve are
called <em>documents</em>, and each document is described by a collection of
<em>terms</em>. These two words, <cite>document</cite> and <cite>term</cite>, are now traditional
in the vocabulary of IR, and reflect its Library Science origins.
Usually a document is thought of as a piece of text, most likely in a
machine readable form, and a term as a word or phrase which helps to
describe the document, and which may indeed occur one or more times in
the document. So a document might be about dental care, and could be
described by corresponding terms <cite>tooth</cite>, <cite>teeth</cite>, <cite>toothbrush</cite>,
<cite>decay</cite>, <cite>cavity</cite>, <cite>plaque</cite>, <cite>diet</cite> and so on.</p>
<p>More generally a document can be anything we want to retrieve, and a
term any feature that helps describe the documents. So the documents
could be fossils held in a large museum collection, and
the terms could be morphological characteristics of the fossils. Or the
documents could be tunes, and the terms could then be phrases of notes
that occur in the tunes.</p>
<p>If, in an IR system, a document, D, is described by a term, t, t is said
to <em>index</em> D, and we can write,</p>
<blockquote>
<span class="formula"><i>t</i> → <i>D</i></span></blockquote>
<p>In fact an IR system consists of a set of documents, <span class="formula"><i>D</i><sub>1</sub></span>, <span class="formula"><i>D</i><sub>2</sub></span>, <span class="formula"><i>D</i><sub>3</sub></span> ...,
a set of terms <span class="formula"><i>t</i><sub>1</sub></span>, <span class="formula"><i>t</i><sub>2</sub></span>, <span class="formula"><i>t</i><sub>3</sub></span> ..., and a set of relationships,</p>
<blockquote>
<span class="formula"><i>t</i><sub><i>i</i></sub> → <i>D</i><sub><i>j</i></sub></span></blockquote>
<p>i.e. instances of terms indexing documents. A single instance of a
particular term indexing a particular document is called a <em>posting</em>.</p>
<p>For a document, D, there is a list of terms which index it. This is
called the <em>term list</em> of D.</p>
<p>For a term, t, there is a list of documents which it indexes. This is
called the <em>posting list</em> of t. (<cite>Document list</cite> would be more
consistent, but sounds a little too vague for this very important
concept.)</p>
<p>At a simple level a computerised IR system puts the terms in an <em>index</em>
file. A term can be efficiently looked up and its posting list found. In
the posting list, each document is represented by a short identifier. To
over-simplify a little, a posting list can be thought of as a list of
numbers (document ids), and a term list as a list of strings (the terms).
Some systems represent each term by a number internally, so the term
list is then also a list of numbers. Xapian doesn't - it uses the terms
themselves, and uses prefix compression to store them compactly.</p>
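<p>As a minimal sketch of how term lists and posting lists are two views of
the same postings, the Python below inverts a set of term lists into posting
lists. The three documents, their terms and the function name
<tt class="docutils literal">build_posting_lists</tt> are made up for illustration, and nothing here
reflects how Xapian lays out its index on disk.</p>
<pre class="literal-block">
```python
from collections import defaultdict

# Hypothetical toy collection: each document id maps to its term list.
term_lists = {
    1: ["cavity", "decay", "tooth"],
    2: ["diet", "plaque", "tooth"],
    3: ["diet", "toothbrush"],
}

def build_posting_lists(term_lists):
    """Invert term lists: map each term to the sorted ids of the documents it indexes."""
    postings = defaultdict(list)
    for doc_id in sorted(term_lists):
        for term in term_lists[doc_id]:
            # One (term, document) pair is one posting.
            postings[term].append(doc_id)
    return dict(postings)

posting_lists = build_posting_lists(term_lists)
print(posting_lists["tooth"])  # posting list of the term "tooth"
print(term_lists[3])           # term list of document 3
```
</pre>
<p>The total number of postings is the same whichever way it is counted:
summed over the term lists or summed over the posting lists.</p>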
<p>The terms needn't be (and often aren't) just the words from the
document. Usually they are converted to lower case, and often a stemming
algorithm is applied, so a single term <cite>connect</cite> might derive from a
number of words, <cite>connect</cite>, <cite>connects</cite>, <cite>connection</cite>, <cite>connected</cite>
and so on. A single word might also give rise to more than one term; for
example, you might index both stemmed and unstemmed forms of some or all
words. Or a stemming algorithm could conceivably produce more than one
stem in some cases (this isn't the case for any of the stemming
algorithms Xapian currently supports, but consider the <a class="reference external" href="https://en.wikipedia.org/wiki/Double_Metaphone">Double
Metaphone</a> phonetic
algorithm which can produce two codes from a single input).</p>
</div>
<div class="section" id="xapian-s-context-within-ir">
<h1>Xapian's context within IR</h1>
<p>In the beginning IR was dominated by Boolean retrieval, described in the
next section. This could be called the antediluvian period, or
generation zero. The first generation of IR research dates from the
early sixties, and was dominated by model building, experimentation, and
heuristics. The big names were <a class="reference external" href="https://en.wikipedia.org/wiki/Gerard_Salton">Gerard
Salton</a> and <a class="reference external" href="https://en.wikipedia.org/wiki/Karen_Sparck_Jones">Karen Sparck
Jones</a>. The second
period, which began in the mid-seventies, saw a big shift towards
mathematics, and a rise of the IR model based upon probability theory -
probabilistic IR. The big name here was, and continues to be, <a class="reference external" href="http://www.soi.city.ac.uk/~ser/homepage.html">Stephen
Robertson</a>. More
recently <a class="reference external" href="https://en.wikipedia.org/wiki/C._J._van_Rijsbergen">Keith van
Rijsbergen</a> has led
a group that has developed underlying logical models of IR, but
interesting as this new work is, it has not as yet led to results that
offer improvements for the IR system builder.</p>
<p>Xapian was built as a system for efficiently implementing the
probabilistic IR model (though this doesn't mean it is limited to only
implementing this model - other models can be implemented providing they
can be expressed in a suitable way). Xapian tries to implement the
probabilistic model faithfully, though in some places it can be told to
use short-cuts for efficiency.</p>
<p>The model has two striking advantages:</p>
<ol class="arabic simple">
<li>It leads to systems that give good retrieval performance. As the
model has developed over the last 25 years, this has proved so
consistently true that one is led to suspect that the probability
theory model is, in some sense, the &quot;correct&quot; model for IR. The IR
process would appear to function as the model suggests.</li>
<li>As new problems come up in IR, the probabilistic model can usually
suggest a solution. This makes it a very practical mental tool for
cutting through the jungle of possibilities when designing IR
software.</li>
</ol>
<p>In simple cases the model reduces to simple formulae in general use, so
don't be alarmed by the apparent complexity of the equations below. We
need them for a full understanding of the general case.</p>
</div>
<div class="section" id="boolean-retrieval">
<h1>Boolean retrieval</h1>
<p>A Boolean construct of terms retrieves a corresponding set of documents.
So, if:</p>
<blockquote>
<div class="line-block">
<div class="line"><span class="formula"><i>t</i><sub>1</sub></span> indexes documents  1 2 3 5 8</div>
<div class="line"><span class="formula"><i>t</i><sub>2</sub></span> indexes documents  2 3 6</div>
</div>
</blockquote>
<p>then</p>
<blockquote>
<div class="line-block">
<div class="line"><span class="formula"><i>t</i><sub>1</sub></span> <tt class="docutils literal">AND</tt> <span class="formula"><i>t</i><sub>2</sub></span>      retrieves  2 3</div>
<div class="line"><span class="formula"><i>t</i><sub>1</sub></span> <tt class="docutils literal">OR</tt> <span class="formula"><i>t</i><sub>2</sub></span>       retrieves  1 2 3 5 6 8</div>
<div class="line"><span class="formula"><i>t</i><sub>1</sub></span> <tt class="docutils literal">AND_NOT</tt> <span class="formula"><i>t</i><sub>2</sub></span>  retrieves  1 5 8</div>
<div class="line"><span class="formula"><i>t</i><sub>2</sub></span> <tt class="docutils literal">AND_NOT</tt> <span class="formula"><i>t</i><sub>1</sub></span>  retrieves  6</div>
</div>
</blockquote>
<p>The posting list of a term is a set of documents. IR becomes a matter of
constructing other sets by doing unions, intersections and differences
on posting lists.</p>
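<p>The four results above can be reproduced with ordinary set operations. This
is only a sketch using Python's built-in sets on the two posting lists from the
example; it says nothing about how a real system merges posting lists
internally.</p>
<pre class="literal-block">
```python
# Posting lists from the example above, held as sets of document ids.
t1 = {1, 2, 3, 5, 8}
t2 = {2, 3, 6}

print(sorted(t1.intersection(t2)))  # t1 AND t2      prints [2, 3]
print(sorted(t1.union(t2)))         # t1 OR t2       prints [1, 2, 3, 5, 6, 8]
print(sorted(t1.difference(t2)))    # t1 AND_NOT t2  prints [1, 5, 8]
print(sorted(t2.difference(t1)))    # t2 AND_NOT t1  prints [6]
```
</pre>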
<p>For example, in an IR system of works of literature, a Boolean query</p>
<pre class="literal-block">
(lang:en OR lang:fr OR lang:de) AND (type:novel OR type:play) AND century:19
</pre>
<p>might be used to retrieve all English, French or German novels or plays
of the 19th century.</p>
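<p>To make the evaluation of such a query concrete, here is a small sketch that
runs it against invented posting lists. The document ids, the extra terms
(<tt class="docutils literal">lang:it</tt>, <tt class="docutils literal">type:poem</tt>, <tt class="docutils literal">century:20</tt>) and the helper names
<tt class="docutils literal">any_of</tt> and <tt class="docutils literal">all_of</tt> are all assumptions made for the example.</p>
<pre class="literal-block">
```python
# Hypothetical posting lists (sets of document ids) for the prefixed terms.
postings = {
    "lang:en": {1, 2, 3},
    "lang:fr": {4, 5},
    "lang:de": {6},
    "lang:it": {7},
    "type:novel": {1, 4, 6, 7},
    "type:play": {2},
    "type:poem": {3, 5},
    "century:19": {1, 2, 3, 4, 7},
    "century:20": {5, 6},
}

def any_of(*terms):
    """OR: union of the posting lists of the given terms."""
    return set().union(*(postings[term] for term in terms))

def all_of(*doc_sets):
    """AND: intersection of the given document sets."""
    return set.intersection(*doc_sets)

result = all_of(
    any_of("lang:en", "lang:fr", "lang:de"),
    any_of("type:novel", "type:play"),
    any_of("century:19"),
)
print(sorted(result))  # prints [1, 2, 4]
```
</pre>
<p>Document 7 is a 19th century novel but is excluded by the language clause,
and document 3 is excluded because it is a poem.</p>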
<p>Boolean retrieval is often useful, but is rather inadequate on its own
as a general IR tool. Results aren't ordered by any measure of how
&quot;good&quot; they might be, and users require training to make effective use
of such a system. Despite this, purely Boolean IR systems continue to
survive.</p>
<p>By default, Xapian uses probabilistic ranking to order retrieved
documents while allowing Boolean expressions of arbitrary complexity
(some Boolean IR systems are restricted to queries in normal form) to
limit those documents retrieved, which provides the benefits of both
approaches. Pure Boolean retrieval is also supported (select the
<a class="reference external" href="apidoc/html/classXapian_1_1BoolWeight.html">BoolWeight</a> weighting
scheme using <tt class="docutils literal"><span class="pre">enquire.set_weighting_scheme(Xapian::BoolWeight());</span></tt>).</p>
</div>
<div class="section" id="relevance-and-the-idea-of-a-query">
<h1>Relevance and the idea of a query</h1>
<p><em>Relevance</em> is a central concept to the probabilistic model. Whole
academic papers have been devoted to discussing the nature of relevance
but essentially a document is relevant if it was what the user really
wanted! Retrieval is rarely perfect, so among documents retrieved there
will be non-relevant ones; among those not retrieved, relevant ones.</p>
<p>Relevance is modelled as a black or white attribute. There are no
degrees of relevance: a document either is, or is not, relevant. In the
probabilistic model there is however a probability of relevance, and
documents of low probability of relevance in the model generally
correspond to documents that, in practice, one would describe as having
low relevance.</p>
<p>What the user actually wants has to be expressed in some form, and the
expression of the user's need is the query. In the probabilistic model
the query is, usually, a list of terms, but that is the end process of a
chain of events. The user has a need; this is expressed in ordinary
language; this is then turned into a written form that the user judges
will yield good results in an IR system, and the IR system then turns
this form into a set, <em>Q</em>, of terms for processing the query. Relevance
must be judged against the user's original need, not against a later
interpretation of what <em>Q</em>, the set of terms, ought to mean.</p>
<p>Below, a query is taken to be just a set of terms, but it is important
to realise that this is a simplification. Each link in the chain that
takes us from the <em>information need</em> (&quot;what the user is looking for&quot;) to
the abstraction in <em>Q</em> is liable to error, and these errors compound to
affect IR performance. In fact the performance of IR systems as a whole
is much worse than most people generally imagine.</p>
</div>
<div class="section" id="evaluating-ir-performance">
<h1>Evaluating IR performance</h1>
<p>It is possible to set up a test to evaluate an IR system. Suppose <em>Q</em> is
a query, and out of the complete collection of documents in the IR
system, a set of documents <em>R</em> of size R is relevant to the query. So
if a document is in <em>R</em> it is relevant, and if not in <em>R</em> it is
non-relevant. Suppose the IR system is able to give us back K documents,
among which r are relevant. <em>Precision</em> and <em>recall</em> are defined as
being,</p>
<div class="formula">
<span class="text">precision = </span><span class="fraction"><span class="ignored">(</span><span class="numerator"><i>r</i></span><span class="ignored">)/(</span><span class="denominator"><i>K</i></span><span class="ignored">)</span></span><span class="text"> , recall = </span><span class="fraction"><span class="ignored">(</span><span class="numerator"><i>r</i></span><span class="ignored">)/(</span><span class="denominator"><i>R</i></span><span class="ignored">)</span></span>
</div>
<p>Precision is the density of relevant documents among those retrieved.
Recall is the proportion of relevant documents retrieved. In most IR
systems K is a parameter that can be varied, and what you find is that
when K is low you get high precision at the expense of low recall, and
when K is high you get high recall at the expense of low precision.</p>
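<p>The two definitions translate directly into code. The sketch below assumes
that both the retrieved set and the relevant set are non-empty; the function
name <tt class="docutils literal">precision_recall</tt> and the document ids are invented for the example.</p>
<pre class="literal-block">
```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) = (r/K, r/R) for a single query.

    Assumes both collections are non-empty, so neither division is by zero.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    r = len(retrieved.intersection(relevant))  # relevant documents retrieved
    K = len(retrieved)                         # documents retrieved
    R = len(relevant)                          # relevant documents in the system
    return r / K, r / R

# Hypothetical judgements: 5 documents retrieved, 4 relevant overall, 3 in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 5, 8], relevant=[2, 3, 5, 13])
print(precision, recall)  # prints 0.6 0.75
```
</pre>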
<p>The ideal value of K will depend on the use of the system. For example,
if a user wants the answer to a simple question and the system contains
many documents which would answer it, a low value of K will be best to
give a small number of relevant results. But in a system indexing legal
cases, users will often wish to make sure no potentially relevant case
is missed even if that requires they check more non-relevant cases, so a
high value of K will be best.</p>
<p>Retrieval effectiveness is often shown as a graph of precision against
recall, averaged over a number of queries, and plotted for different
values of K. Such curves typically have a shape similar to a hyperbola
(y=1/x).</p>
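<p>The trade-off can be seen by sweeping K down a single ranked list. The
following sketch uses an invented ranking and invented relevance judgements for
one query; a real curve would be averaged over many queries as described
above.</p>
<pre class="literal-block">
```python
def precision_recall_at_each_k(ranking, relevant):
    """Yield (K, precision, recall) for K = 1 up to len(ranking)."""
    relevant = set(relevant)
    r = 0  # relevant documents seen in the top K so far
    for K, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            r += 1
        yield K, r / K, r / len(relevant)

# Hypothetical ranked output and relevance judgements for one query.
ranking = [7, 2, 9, 4, 1, 8, 3, 6]
relevant = {7, 2, 4, 3}

for K, precision, recall in precision_recall_at_each_k(ranking, relevant):
    print(f"K={K}  precision={precision:.2f}  recall={recall:.2f}")
```
</pre>
<p>Recall can only stay level or rise as K grows, while precision tends to fall
as non-relevant documents enter the retrieved set.</p>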
<p>A collection like this, consisting of a set of documents, a set of
queries, and for each query, a complete set of relevance assessments, is
called a <em>test collection</em>. With a test collection you can test out
different IR ideas, and see how well one performs against another. The
controversial part of establishing any test collection is the procedure
employed for determining the sets <span class="formula"><i>R</i><sub><i>i</i></sub></span>, of relevance
assessments. Subjectivity of judgement comes in here, and people will
differ about whether a particular document is relevant to a particular
query. Even so, the averaging across queries reduces the errors that may
occasionally arise through faulty relevance judgements, and averaging
important tests across a number of test collections reduces the effects
caused by accidental features of individual collections, and the results
obtained by these tests in modern research are generally accepted as
trustworthy. Nowadays such research with test collections is organised
through <a class="reference external" href="https://trec.nist.gov/">TREC</a>.</p>
</div>
<div class="section" id="probabilistic-term-weights">
<h1>Probabilistic term weights</h1>
<p>In this section we will try to present some of the thinking behind the
formulae. This is really to give a feel for where the probabilistic
model comes from. You may want to skim through this section if you're
not too interested.</p>
<p>Suppose we have an IR system with a total of N documents. And suppose
<em>Q</em> is a query in this IR system, made up of terms <span class="formula"><i>t</i><sub>1</sub></span>,
<span class="formula"><i>t</i><sub>2</sub></span> ... <span class="formula"><i>t</i><sub><i>Q</i></sub></span>. There is a set, <em>R</em>, of documents
relevant to the query.</p>
<p>In 1976, Stephen Robertson derived a formula which gives an ideal
numeric weight to a term t of Q. Just how this weight gets used we will
see below, but essentially a high weight means an important term and a
low weight means an unimportant term. The formula is,</p>
<div class="formula">
<i>w</i>(<i>t</i>) = log  <span class="fraction"><span class="ignored">(</span><span class="numerator"><i>p</i>(1 − <i>q</i>)</span><span class="ignored">)/(</span><span class="denominator">(1 − <i>p</i>)<i>q</i></span><span class="ignored">)</span></span>
</div>
<p>(The base of the logarithm doesn't matter, but we can suppose it is e.)
p is the probability that t indexes a relevant document, and q the
probability that t indexes a non-relevant document. And of course, 1 - p
is the probability that t does not index a relevant document, and 1 - q
the probability that t does not index a non-relevant document. More
mathematically,</p>
<div class="formula">
<i>p</i> = <i>P</i>(<i>t</i> → <i>D</i> | <i>D</i> ∈ <i>R</i>)  ,   <i>q</i> = <i>P</i>(<i>t</i> → <i>D</i> | <i>D</i> ∉ <i>R</i>)
</div>
<div class="formula">
1 − <i>p</i> = <i>P</i>(<i>t</i> ↛ <i>D</i> | <i>D</i> ∈ <i>R</i>)  ,   1 − <i>q</i> = <i>P</i>(<i>t</i> ↛ <i>D</i> | <i>D</i> ∉ <i>R</i>)
</div>
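<p>As a quick numerical illustration of the formula for w(t), the sketch below
evaluates it for given p and q using the natural logarithm. The function name
<tt class="docutils literal">robertson_weight</tt> and the probabilities are chosen for the example only,
and p and q must lie strictly between 0 and 1.</p>
<pre class="literal-block">
```python
import math

def robertson_weight(p, q):
    """w(t) = log(p(1 - q) / ((1 - p)q)); p and q strictly between 0 and 1."""
    return math.log(p * (1 - q) / ((1 - p) * q))

# A term equally likely to index relevant and non-relevant documents
# carries no information.
print(robertson_weight(0.5, 0.5))  # prints 0.0

# A term far more likely to index relevant documents gets a positive weight,
# and the mirror-image term gets the same weight with the opposite sign.
print(robertson_weight(0.8, 0.1))
print(robertson_weight(0.1, 0.8))
```
</pre>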
<p>Suppose that t indexes n of the N documents in the IR system. As before,
we suppose also that there are R documents in <em>R</em>, and that there are r
documents in <em>R</em> which are indexed by t.</p>
<p>p is easily estimated by r/R, the ratio of the number of relevant
documents indexed by t to the total number of relevant documents.</p>
<p>The total number of non-relevant documents is N - R, and the number of
those indexed by t is n - r, so we can estimate q as (n - r)/(N - R).
This gives us the estimates,</p>
<div class="formula">
<i>p</i> =  <span class="fraction"><span class="ignored">(</span><span class="numerator"><i>r</i></span><span class="ignored">)/(</span><span class="denominator"><i>R</i></span><span class="ignored">)</span></span>  ,   1 − <i>q</i> =  <span class="fraction"><span class="ignored">(</span><span class="numerator"><i>N</i> − <i>R</i> − <i>n</i> + <i>r</i></span><span class="ignored">)/(</span><span class="denominator"><i>N</i> − <i>R</i></span><span class="ignored">)</span></span>
</div>
<div class="formula">
1 − <i>p</i> =  <span class="fraction"><span class="ignored">(</span><span class="numerator"><i>R</i> − <i>r</i></span><span class="ignored">)/(</span><span class="denominator"><i>R</i></span><span class="ignored">)</span></span>  ,   <i>q</i> =  <span class="fraction"><span class="ignored">(</span><span class="numerator"><i>n</i> − <i>r</i></span><span class="ignored">)/(</span><span class="denominator"><i>N</i> − <i>R</i></span><span class="ignored">)</span></span>
</div>
<p>and so substituting in the formula for w(t) we get the estimate,</p>
<div class="formula">
<i>w</i>(<i>t</i>) = log  <span class="fraction"><span class="ignored">(</span><span class="numerator"><i>r</i>(<i>N</i> − <i>R</i> − <i>n</i> + <i>r</i>)</span><span class="ignored">)/(</span><span class="denominator">(<i>R</i> − <i>r</i>)(<i>n</i> − <i>r</i>)</span><span class="ignored">)</span></span>
</div>
<p>Unfortunately, this formula is subject to violent behaviour when, say, n
= r (infinity) or r = 0 (minus infinity), and so Robertson suggests the
modified form</p>
<div class="formula">
<i>w</i>(<i>t</i>) = log  <span class="fraction"><span class="ignored">(</span><span class="numerator">(<i>r</i> + <span class="fraction"><span class="ignored">(</span><span class="numerator">1</span><span class="ignored">)/(</span><span class="denominator">2</span><span class="ignored">)</span></span>)(<i>N</i> − <i>R</i> − <i>n</i> + <i>r</i> + <span class="fraction"><span class="ignored">(</span><span class="numerator">1</span><span class="ignored">)/(</span><span class="denominator">2</span><span class="ignored">)</span></span>)</span><span class="ignored">)/(</span><span class="denominator">(<i>R</i> − <i>r</i> + <span class="fraction"><span class="ignored">(</span><span class="numerator">1</span><span class="ignored">)/(</span><span class="denominator">2</span><span class="ignored">)</span></span>)(<i>n</i> − <i>r</i> + <span class="fraction"><span class="ignored">(</span><span class="numerator">1</span><span class="ignored">)/(</span><span class="denominator">2</span><span class="ignored">)</span></span>)</span><span class="ignored">)</span></span>
</div>
<p>with the reassurance that this has &quot;some theoretical justification&quot;.
This is the form of the term weighting formula used in Xapian's
BM25Weight.</p>
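<p>The effect of adding 1/2 to each count can be checked numerically. This
sketch compares the unmodified estimate with the modified form on two sets of
invented counts, one with n = r and one with r = 0; the function names are
hypothetical and the code is not taken from Xapian.</p>
<pre class="literal-block">
```python
import math

def estimated_weight(N, n, R, r):
    """Unmodified estimate: log(r(N - R - n + r) / ((R - r)(n - r)))."""
    return math.log(r * (N - R - n + r) / ((R - r) * (n - r)))

def modified_weight(N, n, R, r):
    """Modified form with 1/2 added to each of the four counts."""
    numerator = (r + 0.5) * (N - R - n + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)

# Hypothetical (N, n, R, r): first n == r, then r == 0.
for counts in [(1000, 5, 10, 5), (1000, 5, 10, 0)]:
    try:
        raw = estimated_weight(*counts)
    except (ZeroDivisionError, ValueError) as error:
        # Division by zero when n == r; log(0) when r == 0.
        raw = f"undefined ({type(error).__name__})"
    print(counts, raw, modified_weight(*counts))
```
</pre>
<p>In both cases the unmodified estimate cannot be evaluated, while the
modified form gives a finite weight.</p>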
<p>Note that n is dependent on the term, t, and R on the query, <em>Q</em>, while
r depends both on t and <em>Q</em>. N is constant, at least until the IR system
changes.</p>
<p>At first sight this formula may appear to be quite useless. After all,
<em>R</em> is what we are trying to find. We can't evaluate w(t) until we have
<em>R</em>, and if we have <em>R</em> the retrieval process is over, and term weights
are no longer of any interest to us.</p>
<p>But the point is we can estimate p and q from a subset of <em>R</em>. As soon
as some records are found relevant by the user they can be used as a
working set for <em>R</em> from which the weights w(t) can be derived, and
these new weights can be used to improve the processing of the query.</p>
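<p>A sketch of this feedback step, under invented numbers: term weights are
computed first with no relevance information (R = r = 0), then again once three
documents have been marked relevant and serve as the working set for <em>R</em>.
The collection size, term frequencies, term lists and the function name
<tt class="docutils literal">term_weight</tt> are assumptions for illustration.</p>
<pre class="literal-block">
```python
import math

def term_weight(N, n, R=0, r=0):
    """Modified Robertson weight; R = r = 0 means no relevance information."""
    return math.log(((r + 0.5) * (N - R - n + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

# Hypothetical statistics: collection size N and term frequencies n.
N = 10000
term_frequency = {"tooth": 400, "decay": 50}

# Hypothetical working set for R: term lists of documents marked relevant.
marked_relevant = [
    {"tooth", "decay"},
    {"decay", "cavity"},
    {"decay", "tooth"},
]
R = len(marked_relevant)

weights = {}
for term, n in term_frequency.items():
    # r: how many of the marked documents are indexed by this term.
    r = sum(1 for terms in marked_relevant if term in terms)
    weights[term] = (term_weight(N, n), term_weight(N, n, R, r))
    print(term, weights[term])  # (weight before feedback, weight after)
```
</pre>
<p>With these numbers both terms gain weight after feedback, and the term that
occurs in every marked document gains the most.</p>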
<p>In fact in the Xapian software <em>R</em> tends to mean not the complete set of
relevant documents, which indeed can rarely be discovered, but a small
set of documents which have been judged as relevant.</p>
<p>Suppose we have no documents marked as relevant. Then R = r = 0, and
w(t) becomes,</p>
<div class="formula">
log <span class="fraction"><span class="ignored">(</span><span class="numerator"><i>N</i> − <i>n</i> + <span class="fraction"><span class="ignored">(</span><span class="numerator">1</span><span class="ignored">)/(</span><span class="denominator">2</span><span class="ignored">)</span></span></span><span class="ignored">)/(</span><span class="denominator"><i>n</i> + <span class="fraction"><span class="ignored">(</span><span class="numerator">1</span><span class="ignored">)/(</span><span class="denominator">2</span><span class="ignored">)</span></span></span><span class="ignored">)</span></span>
</div>
<p>This is approximately log((N - n)/n), or simply log(N/n), since n is
usually small compared with N. This is called inverse logarithmic
weighting, and has been used in IR for many decades, quite independently
of the probabilistic theory which underpins it. Weights of this form are
in fact the starting point in Xapian when no relevance information is
present.</p>
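<p>The quality of the log(N/n) approximation can be checked numerically.
The sketch below assumes a hypothetical collection of 100000 documents; the
function name is illustrative.</p>

```python
import math


def weight_no_relevance(N, n):
    """w(t) when R = r = 0: log((N - n + 1/2) / (n + 1/2))."""
    return math.log((N - n + 0.5) / (n + 0.5))


N = 100000  # hypothetical collection size
for n in (10, 1000, 50000):
    exact = weight_no_relevance(N, n)
    approx = math.log(N / n)
    print(n, round(exact, 3), round(approx, 3))
```

<p>For n = 10 the two values differ by about 0.05. The approximation
degrades as n grows: at n = N/2 the exact weight is zero while log(N/n) is
log 2, and beyond that point the exact weight is negative.</p>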
<p>The number n incidentally is often called the <em>frequency</em> of a term. We
prefer the phrase <em>term frequency</em>, to better distinguish it from wdf
and wqf introduced below.</p>
<p>In extreme cases w(t) can be negative. In Xapian, negative values are
disallowed, and simply replaced by a small positive value.</p>
</div>
<div class="section" id="wdp-wdf-ndl-and-wqf">
<h1>wdp, wdf, ndl and wqf</h1>
<p>Before we see how the weights are used there are a few more ideas to
introduce.</p>
<p>As mentioned before, a term t is said to index a document D, or <span class="formula"><i>t</i> → <i>D</i></span>.
We have emphasised that D may not be a piece of text in machine-readable
form, and that, even when it is, t may not actually occur in the text of
D. Nevertheless, it will often be the case that D is made up of a list
of words,</p>
<p><span class="formula"><i>D</i> = <i>w</i><sub>1</sub>, <i>w</i><sub>2</sub>, <i>w</i><sub>3</sub></span> ... <span class="formula"><i>w</i><sub><i>m</i></sub></span></p>
<p>and that many, if not all, of the terms which index D derive from these
words (for example, the terms are often lower-cased and stemmed forms of
these words).</p>
<p>If a term derives from words <span class="formula"><i>w</i><sub>9</sub>, <i>w</i><sub>38</sub>, <i>w</i><sub>97</sub></span> and <span class="formula"><i>w</i><sub>221</sub></span> in the indexing
process, we can say that the term &quot;occurs&quot; in D at positions 9, 38, 97 and
221, and so for each term a document may have a vector of positional
information. These are the <em>within-document positions</em> of t, or the <em>wdp</em>
information of t.</p>
<p>The <em>within-document frequency</em>, or <em>wdf</em>, of a term t in D is the
number of times it is pulled out of D in the indexing process. Usually
this is the size of the wdp vector, but in Xapian it can exceed it,
since we can apply extra wdf to some parts of the document text. For
example, often this is done for the document title and abstract to
attach extra importance to their contents compared to the rest of the
document text.</p>
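<p>A small sketch of how wdp and wdf might be collected from a word list.
It assumes terms are simply lower-cased words (no stemming), and the title
boost of 2 is an arbitrary illustrative value rather than anything Xapian
prescribes.</p>

```python
from collections import defaultdict


def index_words(words, normalise=str.lower):
    """Return {term: [positions]} for a list of words, counting from 1."""
    positions = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        positions[normalise(word)].append(pos)
    return dict(positions)


body = "The cat sat on the mat near the other cat".split()
wdp = index_words(body)

# Without any boosting, wdf is just the size of each wdp vector.
wdf = {term: len(pos) for term, pos in wdp.items()}

# Extra wdf for title terms (assumed boost); no positions are added,
# so wdf can end up larger than the size of the wdp vector.
title_terms = ["cat"]
TITLE_BOOST = 2
for term in title_terms:
    wdf[term] = wdf.get(term, 0) + TITLE_BOOST
```

<p>Here &quot;the&quot; occurs at positions 1, 5 and 8, so its wdf is 3, while
&quot;cat&quot; has two positions but a wdf of 4 after the title boost.</p>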
<p>There are various ways in which we might measure the length of a
document, but the easiest is to suppose it is made up of m words,
<span class="formula"><i>w</i><sub>1</sub></span> to <span class="formula"><i>w</i><sub><i>m</i></sub></span>, and to define its length as m.</p>
<p>The <em>normalised document length</em>, or <em>ndl</em>, is then m divided by the
average length of the documents in the IR system. So a document of
average length has ndl equal to 1, shorter documents have ndl less than
1, and longer documents have ndl greater than 1. We have found that very
small ndl values create problems, so Xapian actually allows for a
non-zero minimum value for the ndl.</p>
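<p>The ndl calculation, including a minimum value, can be sketched as
follows. The function name, the document lengths and the minimum of 0.5 are
assumptions chosen for illustration.</p>

```python
def normalised_lengths(lengths, min_ndl=0.0):
    """Map each document length m to m / average length, floored at min_ndl."""
    average = sum(lengths) / len(lengths)
    return [max(m / average, min_ndl) for m in lengths]


lengths = [50, 100, 150, 2]  # hypothetical document lengths in words
ndl = normalised_lengths(lengths, min_ndl=0.5)
```

<p>The two-word document would have an ndl of under 0.03, and is raised
to the minimum instead.</p>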
<p>In the probabilistic model the query, <em>Q</em>, is itself very much like
another document. Frequently indeed <em>Q</em> will be created from a document,
either one already in the IR system, or by an indexing process very
similar to the one used to add documents into the whole IR system. This
corresponds to a user saying &quot;give me other documents like this one&quot;.
One can therefore attach a similar meaning to within-query position
information, within-query frequency, and normalised query length, or
wqp, wqf and nql. Xapian does not currently use the concept of wqp.</p>
</div>
<div class="section" id="using-the-weights-the-mset">
<h1>Using the weights: the <em>MSet</em></h1>
<p>Now to pull everything together. From the probabilistic term weights we
can assign a weight to any document, d, as follows,</p>
<div class="formula">
<i>W</i>(<i>d</i>) = <span class="limits"><sup class="limit"> </sup><span class="limit"><span class="bigoperator">∑</span></span><sub class="limit"><i>t</i> → <i>d</i>, <i>t</i> ∈ <i>Q</i></sub></span> <span class="fraction"><span class="ignored">(</span><span class="numerator">(<i>k</i> + 1)<i>f</i><sub><i>t</i></sub></span><span class="ignored">)/(</span><span class="denominator"><i>k</i>.<i>L</i><sub><i>d</i></sub> + <i>f</i><sub><i>t</i></sub></span><span class="ignored">)</span></span> <i>w</i>(<i>t</i>)
</div>
<p>The sum extends over the terms of <em>Q</em> which index d. <span class="formula"><i>f</i><sub><i>t</i></sub></span> is
the wdf of t in d, <span class="formula"><i>L</i><sub><i>d</i></sub></span> is the ndl of d, and k is some suitably
chosen constant.</p>
<p>The factor <span class="formula"><i>k</i> + 1</span> is actually redundant, but helps with the interpretation
of the equation. In Xapian, this weighting scheme is implemented by the
<a class="reference external" href="apidoc/html/classXapian_1_1TradWeight.html">Xapian::TradWeight class</a>
and the factor <span class="formula">(<i>k</i> + 1)</span> is ignored.</p>
<p>If <span class="formula"><i>k</i></span> is set to zero the factor before <span class="formula"><i>w</i>(<i>t</i>)</span> is 1, and the wdfs are
ignored. As k tends to infinity, the factor becomes
<span class="formula"><i>f</i><sub><i>t</i></sub></span>/<span class="formula"><i>L</i><sub><i>d</i></sub></span>, and the wdfs take on their greatest
importance. Intermediate values scale the wdf contribution between these
extremes. The best <span class="formula"><i>k</i></span> actually depends on the characteristics of the IR
system as a whole, and unfortunately no rule can be given for choosing
it. By default, Xapian sets <span class="formula"><i>k</i></span> to 1 which should give reasonable results
for most systems. <span class="formula"><i>W</i>(<i>d</i>)</span> is merely tweaked a bit by the wdf values, and
users observe a simple pattern of retrieval. It is possible to tune <span class="formula"><i>k</i></span> to
provide optimal results for a specific system.</p>
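<p>The behaviour of the wdf factor at the two extremes of k can be seen in
a short sketch. The function names and the data layout (plain dictionaries
keyed by term) are assumptions for illustration, not Xapian's
implementation.</p>

```python
def wdf_factor(f_t, L_d, k=1.0):
    """The factor (k + 1) f_t / (k L_d + f_t) that scales w(t)."""
    return (k + 1) * f_t / (k * L_d + f_t)


def document_weight(query_terms, doc_wdf, L_d, weights, k=1.0):
    """W(d): sum over the query terms which index d.

    doc_wdf maps each term indexing d to its wdf; weights maps terms to w(t).
    """
    return sum(
        wdf_factor(doc_wdf[t], L_d, k) * weights[t]
        for t in query_terms
        if t in doc_wdf
    )


# k = 0 ignores the wdf entirely; a very large k approaches f_t / L_d.
factor_k0 = wdf_factor(f_t=3, L_d=1.5, k=0)
factor_big_k = wdf_factor(f_t=3, L_d=1.5, k=1e9)
```

<p>With f<sub>t</sub> = 3 and L<sub>d</sub> = 1.5, the factor is exactly 1 at k = 0 and
approaches 3/1.5 = 2 as k becomes large.</p>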
<p>Any <span class="formula"><i>d</i></span> in the IR system has a value <span class="formula"><i>W</i>(<i>d</i>)</span>, but, if no term of the query
indexes <span class="formula"><i>d</i></span>, <span class="formula"><i>W</i>(<i>d</i>)</span> will be zero. In practice only documents for which
<span class="formula"><i>W</i>(<i>d</i>) &gt; 0</span> will be of interest, and these are the documents indexed by at least
one term of <em>Q</em>. If we now take these documents and arrange them by
decreasing <span class="formula"><i>W</i>(<i>d</i>)</span> value, we get a ranked list called the <em>match set</em>, or
<em>MSet</em>, of document and weight pairs:</p>
<div class="formula">
<i>item</i>0 :  <i>D</i><sub>0</sub>, <i>W</i>(<i>D</i><sub>0</sub>)
</div>
<div class="formula">
<i>item</i>1 :  <i>D</i><sub>1</sub>, <i>W</i>(<i>D</i><sub>1</sub>)
</div>
<div class="formula">
<i>item</i>2 :  <i>D</i><sub>2</sub>, <i>W</i>(<i>D</i><sub>2</sub>)
</div>
<div class="formula">
<span class="environment align"><span class="arrayrow">
<span class="arraycell align-r">
<span class="text">. <br/>
      . <br/>
      . </span>
</span>

</span>
</span>
</div>
<div class="formula">
<i>item</i><i>K</i> :  <i>D</i><sub><i>K</i></sub>, <i>W</i>(<i>D</i><sub><i>K</i></sub>)
</div>
<p>where <span class="formula"><i>W</i>(<i>D</i><sub><i>j</i></sub>) ≤ <i>W</i>(<i>D</i><sub><i>i</i></sub>)</span> if j &gt; i.</p>
<p>And according to the probabilistic model, the documents <span class="formula"><i>D</i><sub>0</sub>, <i>D</i><sub>1</sub>, <i>D</i><sub>2</sub>...</span>
are ranked by decreasing order of probability of relevance. So <span class="formula"><i>D</i><sub>0</sub></span> has highest
probability of being relevant, then <span class="formula"><i>D</i><sub>1</sub></span> and so on.</p>
<p>Xapian creates the MSet from the posting lists of the terms of the
query. This is the central operation of any IR system, and will be
familiar to anyone who has used one of the Internet's major search
engines, where the query is what you type in the query box, and the
resulting hit list corresponds to the top few items of the MSet.</p>
<p>The cutoff point, K, is chosen when the MSet is created. The candidates
for inclusion in the MSet are all documents indexed by at least one term
of <em>Q</em>, and their number will usually exceed the choice of K (K is
typically set to be 1000 or less). So the MSet is actually the best K
documents found in the match process.</p>
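<p>Selecting the best K candidates can be sketched with a partial sort.
This assumes the W(d) values have already been computed and are held in a
dictionary; Xapian itself works from posting lists, which this sketch does
not attempt to model.</p>

```python
import heapq


def build_mset(doc_weights, K):
    """Return the best K (document, weight) pairs, highest weight first.

    doc_weights maps document ids to W(d). Documents with W(d) = 0 are not
    candidates, since no term of the query indexes them.
    """
    candidates = [(d, w) for d, w in doc_weights.items() if w > 0]
    return heapq.nlargest(K, candidates, key=lambda item: item[1])


# Hypothetical document ids and weights.
weights = {"D_a": 2.5, "D_b": 0.0, "D_c": 7.1, "D_d": 4.2}
mset = build_mset(weights, K=2)
```

<p>Using <tt>heapq.nlargest</tt> avoids sorting every candidate when K is much
smaller than the number of matching documents.</p>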
<p>A modification of this weighting scheme can be employed that takes into
account the query itself:</p>
<div class="formula">
<i>W</i>(<i>d</i>) = <span class="limits"><sup class="limit"> </sup><span class="limit"><span class="bigoperator">∑</span></span><sub class="limit"><i>t</i> → <i>d</i>,  <i>t</i> ∈ <i>Q</i></sub></span><span class="fraction"><span class="ignored">(</span><span class="numerator">(<i>k</i><sub>3</sub> + 1)<i>q</i><sub><i>t</i></sub></span><span class="ignored">)/(</span><span class="denominator">(<i>k</i><sub>3</sub><i>L</i>’ + <i>q</i><sub><i>t</i></sub>)</span><span class="ignored">)</span></span>  <span class="fraction"><span class="ignored">(</span><span class="numerator">(<i>k</i> + 1)<i>f</i><sub><i>t</i></sub></span><span class="ignored">)/(</span><span class="denominator">(<i>kL</i><sub><i>d</i></sub> + <i>f</i><sub><i>t</i></sub>)</span><span class="ignored">)</span></span> <i>w</i>(<i>t</i>)
</div>
<p>where <span class="formula"><i>q</i><sub><i>t</i></sub></span> is the wqf of t in <em>Q</em>, <span class="formula"><i>L</i>’</span> is the nql, or normalised
query length, and <span class="formula"><i>k</i><sub>3</sub></span> is a further constant. In computing <span class="formula"><i>W</i>(<i>d</i>)</span>
across the document space, this extra factor may be viewed as just a
modification to the basic term weights, <span class="formula"><i>w</i>(<i>t</i>)</span>. As with <span class="formula"><i>k</i></span> and <span class="formula"><i>k</i><sub>3</sub></span>,
we will need to make an inspired guess for <span class="formula"><i>L</i>’</span>. In fact the choices for
<span class="formula"><i>k</i><sub>3</sub></span> and <span class="formula"><i>L</i>’</span> will depend on the broader context of the use of
this formula, and more advice will be given as occasion arises.</p>
<p>Xapian's default weighting scheme is a generalised form of this
weighting scheme modification, known as <a class="reference external" href="bm25.html">BM25</a>. In BM25, <span class="formula"><i>L</i>’</span>
is always set to 1.</p>
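<p>The query-dependent factor can be sketched in the same way as the wdf
factor. The names and the default values of 1 for the two constants are
illustrative assumptions; the normalised query length defaults to 1,
matching the BM25 choice described above.</p>

```python
def wqf_factor(q_t, k3=1.0, nql=1.0):
    """The factor (k3 + 1) q_t / (k3 L' + q_t), with L' passed as nql."""
    return (k3 + 1) * q_t / (k3 * nql + q_t)


def document_weight_with_query(query_wqf, doc_wdf, L_d, weights,
                               k=1.0, k3=1.0, nql=1.0):
    """W(d) including the within-query frequency factor.

    query_wqf maps query terms to wqf, doc_wdf maps terms indexing d to wdf,
    and weights maps terms to w(t).
    """
    total = 0.0
    for t, q_t in query_wqf.items():
        if t in doc_wdf:
            f_t = doc_wdf[t]
            wdf_part = (k + 1) * f_t / (k * L_d + f_t)
            total += wqf_factor(q_t, k3, nql) * wdf_part * weights[t]
    return total
```

<p>With the normalised query length equal to 1, a term that occurs once in
the query gets a factor of exactly 1, so the extra factor only changes the
weight of terms repeated in the query.</p>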
</div>
<div class="section" id="using-the-weights-the-eset">
<h1>Using the weights: the <em>ESet</em></h1>
<p>But as well as ranking documents, Xapian can rank terms, and this is
most important. The higher up the ranking the term is, the more likely
it is to act as a good differentiator between relevant and non-relevant
documents. It is therefore a candidate for adding back into the query.
Terms from this list can therefore be used to expand the size of the
query, after which the query can be re-run to get a better MSet. Because
this list of terms is mainly used for query expansion, it is called the
<em>expand set</em> or <em>ESet</em>.</p>
<p>The term expansion weighting formula is as follows,</p>
<blockquote>
<span class="formula"><i>W</i>(<i>t</i>) = <i>r</i> <i>w</i>(<i>t</i>)</span></blockquote>
<p>in other words we multiply the term weight by the number of relevant
documents that have been indexed by the term.</p>
<p>The ESet then has this form,</p>
<div class="formula">
<i>item</i>0 :  <i>t</i><sub>0</sub>, <i>W</i>(<i>t</i><sub>0</sub>)
</div>
<div class="formula">
<i>item</i>1 :  <i>t</i><sub>1</sub>, <i>W</i>(<i>t</i><sub>1</sub>)
</div>
<div class="formula">
<i>item</i>2 :  <i>t</i><sub>2</sub>, <i>W</i>(<i>t</i><sub>2</sub>)
</div>
<div class="formula">
<span class="environment align"><span class="arrayrow">
<span class="arraycell align-r">
<span class="text">. <br/>
      . <br/>
      . </span>
</span>

</span>
</span>
</div>
<div class="formula">
<i>item</i><i>K</i> :  <i>t</i><sub><i>K</i></sub>, <i>W</i>(<i>t</i><sub><i>K</i></sub>)
</div>
<p>where <span class="formula"><i>W</i>(<i>t</i><sub><i>j</i></sub>) ≤ <i>W</i>(<i>t</i><sub><i>i</i></sub>)</span> if j &gt; i.</p>
<p>Since the main function of the ESet is to find new terms to be added to
<em>Q</em>, we usually omit from it terms already in <em>Q</em>.</p>
<p>The <span class="formula"><i>W</i>(<i>t</i>)</span> weight is applicable to any term in the IR system, but has a
value zero when t does not index a relevant document. The ESet is
therefore confined to be a ranking of the best K terms which index
relevant documents.</p>
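<p>A minimal sketch of building an ESet with the simple r w(t) formula. It
assumes each relevant document is given as a set of terms and that the term
frequencies n are available in a dictionary; all names and figures are
hypothetical. Negative w(t) values are left as computed here, so the sketch
does not reproduce Xapian's handling of them.</p>

```python
import math


def term_weight(N, n, R, r):
    """w(t) with the 1/2 corrections."""
    numerator = (r + 0.5) * (N - R - n + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


def build_eset(rset_docs, term_freq, N, query_terms, K):
    """Rank candidate expansion terms by W(t) = r * w(t), best first."""
    R = len(rset_docs)
    r_counts = {}
    for terms in rset_docs:
        for t in terms:
            r_counts[t] = r_counts.get(t, 0) + 1
    scored = [
        (t, r * term_weight(N, term_freq[t], R, r))
        for t, r in r_counts.items()
        if t not in query_terms  # terms already in Q are omitted
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:K]


rset_docs = [{"cat", "feline", "the"}, {"cat", "kitten", "the"}]
term_freq = {"cat": 40, "feline": 10, "kitten": 12, "the": 990}
eset = build_eset(rset_docs, term_freq, N=1000, query_terms={"cat"}, K=2)
```

<p>Only terms which index at least one relevant document are ever scored,
which is why the ESet is confined to such terms. The very common term
&quot;the&quot; indexes both relevant documents but still ranks last, because its
w(t) is negative.</p>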
<p>This simple form of <span class="formula"><i>W</i>(<i>t</i>)</span> is traditional in the probabilistic model, but
seems less than optimal because it does not take into account wdf
information. One can in fact try to generalise it to:</p>
<div class="formula">
<i>W</i>(<i>t</i>) = <span class="limits"><sup class="limit"> </sup><span class="limit"><span class="bigoperator">∑</span></span><sub class="limit"><i>t</i> → <i>d</i>, <i>d</i> ∈ <i>R</i></sub></span> <span class="fraction"><span class="ignored">(</span><span class="numerator">(<i>k</i> + 1)<i>f</i><sub><i>t</i></sub></span><span class="ignored">)/(</span><span class="denominator"><i>kL</i><sub><i>d</i></sub> + <i>f</i><sub><i>t</i></sub></span><span class="ignored">)</span></span><i>w</i>(<i>t</i>)
</div>
<p><span class="formula"><i>k</i></span> is again a constant, but it does not need to have the same value as
the <span class="formula"><i>k</i></span> used in the document weighting formula above. In Xapian, <span class="formula"><i>k</i></span>
defaults to 1.0 for ESet generation.</p>
<p>This reduces to <span class="formula"><i>W</i>(<i>t</i>) = <i>r</i> <i>w</i>(<i>t</i>)</span> when <span class="formula"><i>k</i> = 0</span>. Certainly this form can be
recommended in the very common case where <span class="formula"><i>r</i> = 1</span>, that is, we have a
single document marked relevant.</p>
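<p>The reduction to r w(t) at k = 0 can be checked with a short sketch. It
assumes each relevant document is represented as a pair of its wdf
dictionary and its ndl; the names and values are illustrative.</p>

```python
def expand_weight(term, rset, w_t, k=1.0):
    """Generalised W(t): sum the wdf factor over relevant documents indexed by t.

    rset is a list of (doc_wdf, ndl) pairs, one per relevant document.
    """
    total = 0.0
    for doc_wdf, L_d in rset:
        if term in doc_wdf:
            f_t = doc_wdf[term]
            total += (k + 1) * f_t / (k * L_d + f_t)
    return total * w_t


# Three hypothetical relevant documents; "cat" indexes two of them (r = 2).
rset = [({"cat": 3}, 1.2), ({"cat": 1, "dog": 2}, 0.8), ({"dog": 1}, 1.0)]
simple = expand_weight("cat", rset, w_t=2.0, k=0.0)
generalised = expand_weight("cat", rset, w_t=2.0, k=1.0)
```

<p>With k = 0 every factor in the sum is 1, so the result is r w(t) = 2 × 2.0.
With k = 1 the first document, where the term has a wdf of 3, contributes
more than the second.</p>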
</div>
<div class="section" id="the-progress-of-a-query">
<h1>The progress of a query</h1>
<p>Below we describe the general case of the IR model supported, including
use of a relevance set (<a class="reference external" href="glossary.html#rset">RSet</a>), query expansion,
improved term weights and reranking. You don't have to use any of these
for Xapian to be useful, but they are available should you need them.</p>
<p>The user enters a query. This is parsed into a form the IR system
understands, and run by the IR system, which returns two lists, a list
of captions, derived from the MSet, and a list of terms, from the ESet.
If the RSet is empty, the first few documents of the MSet can be used as
a stand-in - after all, they have a good chance of being relevant! You
can read a document by clicking on the caption. (We assume the usual
screen/mouse environment.) But you can also mark a document as relevant
(change <em>R</em>) or cause a term to be added from the ESet to the query
(change <em>Q</em>). As soon as any change is made to the query environment the
query can be rerun, although you might have a front-end where nothing
happens until you click on some &quot;Run Query&quot; button.</p>
<p>In any case rerunning the query leads to a new MSet and ESet, and so to
a new display. The IR process is then an iterative one. You can delete
terms from the query or add them in; mark or unmark documents as being
relevant. Eventually you converge on the answer to the query, or at
least, the best answer the IR system can give you.</p>
</div>
<div class="section" id="further-reading">
<h1>Further Reading</h1>
<p>If you want to find out more, then <a class="reference external" href="https://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.8337">&quot;Simple, proven approaches to text
retrieval&quot;</a>
is a worthwhile read. It's a good introduction to probabilistic
information retrieval, which is basically what Xapian provides.</p>
<p>There are also several good books on the subject of information
retrieval.</p>
<ul class="simple">
<li>&quot;<em>Information Retrieval</em>&quot; by C. J. van Rijsbergen is well worth
reading. It's out of print, but is available for free <a class="reference external" href="http://www.dcs.gla.ac.uk/Keith/Preface.html">from the
author's website</a> (in
HTML or PDF).</li>
<li>&quot;<em>Readings in Information Retrieval</em>&quot; (published by Morgan Kaufmann,
edited by Karen Sparck Jones and Peter Willett) is a collection of
published papers covering many aspects of the subject.</li>
<li>&quot;<em>Managing Gigabytes</em>&quot; (also published by Morgan Kaufmann, written by
Ian H. Witten, Alistair Moffat and Timothy C. Bell) describes
information retrieval and compression techniques.</li>
<li>&quot;<em>Modern Information Retrieval</em>&quot; (published by Addison Wesley,
written by Ricardo Baeza-Yates and Berthier Ribeiro-Neto) gives a
good overview of the field. It was published more recently than the
books above, and so covers some more recent developments.</li>
<li>&quot;<em>Introduction to Information Retrieval</em>&quot; (published by Cambridge
University Press, written by Christopher D. Manning, Prabhakar
Raghavan and Hinrich Schütze) looks to be a good introductory work
(we've not read it in detail yet). As well as the print version,
there's an online version on <a class="reference external" href="http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html">the book's companion
website</a>.</li>
</ul>
</div>
</div>
</body>
</html>