1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
|
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>9.Relevance Ranking and Sorting of Result Sets</title><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot"><link rel="home" href="index.html" title="Zebra - User's Guide and Reference"><link rel="up" href="administration.html" title="Chapter6.Administrating Zebra"><link rel="prev" href="shadow-registers.html" title="8.Safe Updating - Using Shadow Registers"><link rel="next" href="administration-extended-services.html" title="10.Extended Services: Remote Insert, Update and Delete"></head><body><link rel="stylesheet" type="text/css" href="common/style1.css"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">9.Relevance Ranking and Sorting of Result Sets</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="shadow-registers.html">Prev</a></td><th width="60%" align="center">Chapter6.Administrating <span class="application">Zebra</span></th><td width="20%" align="right"><a accesskey="n" href="administration-extended-services.html">Next</a></td></tr></table><hr></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="administration-ranking"></a>9.Relevance Ranking and Sorting of Result Sets</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-overview"></a>9.1.Overview</h3></div></div></div><p>
The default ordering of a result set is left up to the server,
which inside <span class="application">Zebra</span> means sorting in ascending document ID order.
This is not always the order humans want to browse the sometimes
quite large hit sets. Ranking and sorting comes to the rescue.
</p><p>
In cases where a good presentation ordering can be computed at
indexing time, we can use a fixed <code class="literal">static ranking</code>
scheme, which is provided for the <code class="literal">alvis</code>
indexing filter. This defines a fixed ordering of hit lists,
independently of the query issued.
</p><p>
There are cases, however, where relevance of hit set documents is
highly dependent on the query processed.
Simply put, <code class="literal">dynamic relevance ranking</code>
sorts a set of retrieved records such that those most likely to be
relevant to your request are retrieved first.
Internally, <span class="application">Zebra</span> retrieves all documents that satisfy your
query, and re-orders the hit list to arrange them based on
a measurement of similarity between your query and the content of
each record.
</p><p>
Finally, there are situations where hit sets of documents should be
<code class="literal">sorted</code> during query time according to the
lexicographical ordering of certain sort indexes created at
indexing time.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-ranking-static"></a>9.2.Static Ranking</h3></div></div></div><p>
<span class="application">Zebra</span> uses internally inverted indexes to look up term frequencies
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations <code class="literal">AND</code>,
<code class="literal">OR</code> and/or <code class="literal">NOT</code> (which
is in fact a binary <code class="literal">AND NOT</code> operation).
To ensure fast query execution
speed, all indexes have to be sorted in the same order.
</p><p>
The indexes are normally sorted according to document
<code class="literal">ID</code> in
ascending order, and any query which does not invoke a special
re-ranking function will therefore retrieve the result set in
document
<code class="literal">ID</code>
order.
</p><p>
If one defines the
</p><pre class="screen">
staticrank: 1
</pre><p>
directive in the main core <span class="application">Zebra</span> configuration file, the internal document
keys used for ordering are augmented by a preceding integer, which
contains the static rank of a given document, and the index lists
are ordered
first by ascending static rank,
then by ascending document <code class="literal">ID</code>.
Zero
is the ``best'' rank, as it occurs at the
beginning of the list; higher numbers represent worse scores.
</p><p>
The experimental <code class="literal">alvis</code> filter provides a
directive to fetch static rank information out of the indexed <acronym class="acronym">XML</acronym>
records, thus making <span class="emphasis"><em>all</em></span> hit sets ordered
after <span class="emphasis"><em>ascending</em></span> static
rank, and for those doc's which have the same static rank, ordered
after <span class="emphasis"><em>ascending</em></span> doc <code class="literal">ID</code>.
See <a class="xref" href="record-model-alvisxslt.html" title="Chapter8.ALVIS XML Record Model and Filter Module">Chapter8, <i>ALVIS <acronym class="acronym">XML</acronym> Record Model and Filter Module</i></a> for the gory details.
</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-ranking-dynamic"></a>9.3.Dynamic Ranking</h3></div></div></div><p>
In order to fiddle with the static rank order, it is necessary to
invoke additional re-ranking/re-ordering using dynamic
ranking or score functions. These functions return positive
integer scores, where <span class="emphasis"><em>highest</em></span> score is
``best'';
hit sets are sorted according to <span class="emphasis"><em>descending</em></span>
scores (in contrary
to the index lists which are sorted according to
ascending rank number and document ID).
</p><p>
Dynamic ranking is enabled by a directive like one of the
following in the zebra configuration file (use only one of these a time!):
</p><pre class="screen">
rank: rank-1 # default TDF-IDF like
rank: rank-static # dummy do-nothing
</pre><p>
</p><p>
Dynamic ranking is done at query time rather than
indexing time (this is why we
call it ``dynamic ranking'' in the first place ...)
It is invoked by adding
the <acronym class="acronym">BIB-1</acronym> relation attribute with
value ``relevance'' to the <acronym class="acronym">PQF</acronym> query (that is,
<code class="literal">@attr2=102</code>, see also
<a class="ulink" href="https://www.loc.gov/z3950/agency/bib1.html" target="_top">
The <acronym class="acronym">BIB-1</acronym> Attribute Set Semantics</a>, also in
<a class="ulink" href="https://www.loc.gov/z3950/agency/defns/bib1.html" target="_top">HTML</a>).
To find all articles with the word <code class="literal">Eoraptor</code> in
the title, and present them relevance ranked, issue the <acronym class="acronym">PQF</acronym> query:
</p><pre class="screen">
@attr 2=102 @attr 1=4 Eoraptor
</pre><p>
</p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="administration-ranking-dynamic-rank1"></a>9.3.1.Dynamically ranking using <acronym class="acronym">PQF</acronym> queries with the 'rank-1'
algorithm</h4></div></div></div><p>
The default <code class="literal">rank-1</code> ranking module implements a
TF/IDF (Term Frequecy over Inverse Document Frequency) like
algorithm. In contrast to the usual definition of TF/IDF
algorithms, which only considers searching in one full-text
index, this one works on multiple indexes at the same time.
More precisely,
<span class="application">Zebra</span> does boolean queries and searches in specific addressed
indexes (there are inverted indexes pointing from terms in the
dictionary to documents and term positions inside documents).
It works like this:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">Query Components</span></dt><dd><p>
First, the boolean query is dismantled into its principal components,
i.e. atomic queries where one term is looked up in one index.
For example, the query
</p><pre class="screen">
@attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
</pre><p>
is a boolean AND between the atomic parts
</p><pre class="screen">
@attr 2=102 @attr 1=1010 Utah
</pre><p>
and
</p><pre class="screen">
@attr 2=102 @attr 1=1018 Springer
</pre><p>
which gets processed each for itself.
</p></dd><dt><span class="term">Atomic hit lists</span></dt><dd><p>
Second, for each atomic query, the hit list of documents is
computed.
</p><p>
In this example, two hit lists for each index
<code class="literal">@attr 1=1010</code> and
<code class="literal">@attr 1=1018</code> are computed.
</p></dd><dt><span class="term">Atomic scores</span></dt><dd><p>
Third, each document in the hit list is assigned a score (_if_ ranking
is enabled and requested in the query) using a TF/IDF scheme.
</p><p>
In this example, both atomic parts of the query assign the magic
<code class="literal">@attr 2=102</code> relevance attribute, and are
to be used in the relevance ranking functions.
</p><p>
It is possible to apply dynamic ranking on only parts of the
<acronym class="acronym">PQF</acronym> query:
</p><pre class="screen">
@and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
</pre><p>
searches for all documents which have the term 'Utah' on the
body of text, and which have the term 'Springer' in the publisher
field, and sort them in the order of the relevance ranking made on
the body-of-text index only.
</p></dd><dt><span class="term">Hit list merging</span></dt><dd><p>
Fourth, the atomic hit lists are merged according to the boolean
conditions to a final hit list of documents to be returned.
</p><p>
This step is always performed, independently of the fact that
dynamic ranking is enabled or not.
</p></dd><dt><span class="term">Document score computation</span></dt><dd><p>
Fifth, the total score of a document is computed as a linear
combination of the atomic scores of the atomic hit lists
</p><p>
Ranking weights may be used to pass a value to a ranking
algorithm, using the non-standard <acronym class="acronym">BIB-1</acronym> attribute type 9.
This allows one branch of a query to use one value while
another branch uses a different one. For example, we can search
for <code class="literal">utah</code> in the
<code class="literal">@attr 1=4</code> index with weight 30, as
well as in the <code class="literal">@attr 1=1010</code> index with weight 20:
</p><pre class="screen">
@attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
</pre><p>
</p><p>
The default weight is
sqrt(1000) ~ 34 , as the <acronym class="acronym">Z39.50</acronym> standard prescribes that the top score
is 1000 and the bottom score is 0, encoded in integers.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>
The ranking-weight feature is experimental. It may change in future
releases of zebra.
</p></div></dd><dt><span class="term">Re-sorting of hit list</span></dt><dd><p>
Finally, the final hit list is re-ordered according to scores.
</p></dd></dl></div><p>
</p><p>
The <code class="literal">rank-1</code> algorithm
does not use the static rank
information in the list keys, and will produce the same ordering
with or without static ranking enabled.
</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>
<code class="literal">Dynamic ranking</code> is not compatible
with <code class="literal">estimated hit sizes</code>, as all documents in
a hit set must be accessed to compute the correct placing in a
ranking sorted list. Therefore the use attribute setting
<code class="literal">@attr2=102</code> clashes with
<code class="literal">@attr9=integer</code>.
</p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="administration-ranking-dynamic-cql"></a>9.3.2.Dynamically ranking <acronym class="acronym">CQL</acronym> queries</h4></div></div></div><p>
Dynamic ranking can be enabled during sever side <acronym class="acronym">CQL</acronym>
query expansion by adding <code class="literal">@attr2=102</code>
chunks to the <acronym class="acronym">CQL</acronym> config file. For example
</p><pre class="screen">
relationModifier.relevant = 2=102
</pre><p>
invokes dynamic ranking each time a <acronym class="acronym">CQL</acronym> query of the form
</p><pre class="screen">
Z> querytype cql
Z> f alvis.text =/relevant house
</pre><p>
is issued. Dynamic ranking can also be automatically used on
specific <acronym class="acronym">CQL</acronym> indexes by (for example) setting
</p><pre class="screen">
index.alvis.text = 1=text 2=102
</pre><p>
which then invokes dynamic ranking each time a <acronym class="acronym">CQL</acronym> query of the form
</p><pre class="screen">
Z> querytype cql
Z> f alvis.text = house
</pre><p>
is issued.
</p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-ranking-sorting"></a>9.4.Sorting</h3></div></div></div><p>
<span class="application">Zebra</span> sorts efficiently using special sorting indexes
(type=<code class="literal">s</code>; so each sortable index must be known
at indexing time, specified in the configuration of record
indexing. For example, to enable sorting according to the <acronym class="acronym">BIB-1</acronym>
<code class="literal">Date/time-added-to-db</code> field, one could add the line
</p><pre class="screen">
xelm /*/@created Date/time-added-to-db:s
</pre><p>
to any <code class="literal">.abs</code> record-indexing configuration file.
Similarly, one could add an indexing element of the form
</p><pre class="screen">
<z:index name="date-modified" type="s">
<xsl:value-of select="some/xpath"/>
</z:index>
</pre><p>
to any <code class="literal">alvis</code>-filter indexing stylesheet.
</p><p>
Indexing can be specified at searching time using a query term
carrying the non-standard
<acronym class="acronym">BIB-1</acronym> attribute-type <code class="literal">7</code>. This removes the
need to send a <acronym class="acronym">Z39.50</acronym> <code class="literal">Sort Request</code>
separately, and can dramatically improve latency when the client
and server are on separate networks.
The sorting part of the query is separate from the rest of the
query - the actual search specification - and must be combined
with it using OR.
</p><p>
A sorting subquery needs two attributes: an index (such as a
<acronym class="acronym">BIB-1</acronym> type-1 attribute) specifying which index to sort on, and a
type-7 attribute whose value is be <code class="literal">1</code> for
ascending sorting, or <code class="literal">2</code> for descending. The
term associated with the sorting attribute is the priority of
the sort key, where <code class="literal">0</code> specifies the primary
sort key, <code class="literal">1</code> the secondary sort key, and so
on.
</p><p>For example, a search for water, sort by title (ascending),
is expressed by the <acronym class="acronym">PQF</acronym> query
</p><pre class="screen">
@or @attr 1=1016 water @attr 7=1 @attr 1=4 0
</pre><p>
whereas a search for water, sort by title ascending,
then date descending would be
</p><pre class="screen">
@or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
</pre><p>
</p><p>
Notice the fundamental differences between <code class="literal">dynamic
ranking</code> and <code class="literal">sorting</code>: there can be
only one ranking function defined and configured; but multiple
sorting indexes can be specified dynamically at search
time. Ranking does not need to use specific indexes, so
dynamic ranking can be enabled and disabled without
re-indexing; whereas, sorting indexes need to be
defined before indexing.
</p></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="shadow-registers.html">Prev</a></td><td width="20%" align="center"><a accesskey="u" href="administration.html">Up</a></td><td width="40%" align="right"><a accesskey="n" href="administration-extended-services.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">8.Safe Updating - Using Shadow Registers</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top">10.Extended Services: Remote Insert, Update and Delete</td></tr></table></div></body></html>
|