File: administration-ranking.html

package info (click to toggle)
idzebra 2.2.8-2
links: PTS, VCS
area: main
in suites: forky, sid
size: 10,572 kB
sloc: ansic: 54,389; xml: 27,058; sh: 5,892; makefile: 1,102; perl: 210; tcl: 64
file content (271 lines) | stat: -rw-r--r-- 17,611 bytes
parent folder | download | duplicates (3)
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>9.Relevance Ranking and Sorting of Result Sets</title><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot"><link rel="home" href="index.html" title="Zebra - User's Guide and Reference"><link rel="up" href="administration.html" title="Chapter6.Administrating Zebra"><link rel="prev" href="shadow-registers.html" title="8.Safe Updating - Using Shadow Registers"><link rel="next" href="administration-extended-services.html" title="10.Extended Services: Remote Insert, Update and Delete"></head><body><link rel="stylesheet" type="text/css" href="common/style1.css"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">9.Relevance Ranking and Sorting of Result Sets</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="shadow-registers.html">Prev</a></td><th width="60%" align="center">Chapter6.Administrating <span class="application">Zebra</span></th><td width="20%" align="right"><a accesskey="n" href="administration-extended-services.html">Next</a></td></tr></table><hr></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="administration-ranking"></a>9.Relevance Ranking and Sorting of Result Sets</h2></div></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-overview"></a>9.1.Overview</h3></div></div></div><p>
     The default ordering of a result set is left up to the server,
     which inside <span class="application">Zebra</span> means sorting in ascending document ID order.
     This is not always the order humans want to browse the sometimes
     quite large hit sets. Ranking and sorting comes to the rescue.
    </p><p>
     In cases where a good presentation ordering can be computed at
     indexing time, we can use a fixed <code class="literal">static ranking</code>
     scheme, which is provided for the <code class="literal">alvis</code>
     indexing filter. This defines a fixed ordering of hit lists,
     independently of the query issued.
    </p><p>
     There are cases, however, where relevance of hit set documents is
     highly dependent on the query processed.
     Simply put, <code class="literal">dynamic relevance ranking</code>
     sorts a set of retrieved records such that those most likely to be
     relevant to your request are retrieved first.
     Internally, <span class="application">Zebra</span> retrieves all documents that satisfy your
     query, and re-orders the hit list to arrange them based on
     a measurement of similarity between your query and the content of
     each record.
    </p><p>
     Finally, there are situations where hit sets of documents should be
     <code class="literal">sorted</code> during query time according to the
     lexicographical ordering of certain sort indexes created at
     indexing time.
    </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-ranking-static"></a>9.2.Static Ranking</h3></div></div></div><p>
     <span class="application">Zebra</span> uses internally inverted indexes to look up term frequencies
     in documents. Multiple queries from different indexes can be
     combined by the binary boolean operations <code class="literal">AND</code>,
     <code class="literal">OR</code> and/or <code class="literal">NOT</code> (which
     is in fact a binary <code class="literal">AND NOT</code> operation).
     To ensure fast query execution
     speed, all indexes have to be sorted in the same order.
    </p><p>
     The indexes are normally sorted according to document
     <code class="literal">ID</code> in
     ascending order, and any query which does not invoke a special
     re-ranking function will therefore retrieve the result set in
     document
     <code class="literal">ID</code>
     order.
    </p><p>
     If one defines the
     </p><pre class="screen">
      staticrank: 1
     </pre><p>
     directive in the main core <span class="application">Zebra</span> configuration file, the internal document
     keys used for ordering are augmented by a preceding integer, which
     contains the static rank of a given document, and the index lists
     are ordered
     first by ascending static rank,
     then by ascending document <code class="literal">ID</code>.
     Zero
     is the ``best'' rank, as it occurs at the
     beginning of the list; higher numbers represent worse scores.
    </p><p>
     The experimental <code class="literal">alvis</code> filter provides a
     directive to fetch static rank information out of the indexed <acronym class="acronym">XML</acronym>
     records, thus making <span class="emphasis"><em>all</em></span> hit sets ordered
     after <span class="emphasis"><em>ascending</em></span> static
     rank, and for those doc's which have the same static rank, ordered
     after <span class="emphasis"><em>ascending</em></span> doc <code class="literal">ID</code>.
     See <a class="xref" href="record-model-alvisxslt.html" title="Chapter8.ALVIS XML Record Model and Filter Module">Chapter8, <i>ALVIS <acronym class="acronym">XML</acronym> Record Model and Filter Module</i></a> for the gory details.
    </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-ranking-dynamic"></a>9.3.Dynamic Ranking</h3></div></div></div><p>
     In order to fiddle with the static rank order, it is necessary to
     invoke additional re-ranking/re-ordering using dynamic
     ranking or score functions. These functions return positive
     integer scores, where <span class="emphasis"><em>highest</em></span> score is
     ``best'';
     hit sets are sorted according to <span class="emphasis"><em>descending</em></span>
     scores (in contrary
     to the index lists which are sorted according to
     ascending rank number and document ID).
    </p><p>
     Dynamic ranking is enabled by a directive like one of the
     following in the zebra configuration file (use only one of these a time!):
     </p><pre class="screen">
      rank: rank-1        # default TDF-IDF like
      rank: rank-static   # dummy do-nothing
     </pre><p>
    </p><p>
     Dynamic ranking is done at query time rather than
     indexing time (this is why we
     call it ``dynamic ranking'' in the first place ...)
     It is invoked by adding
     the <acronym class="acronym">BIB-1</acronym> relation attribute with
     value ``relevance'' to the <acronym class="acronym">PQF</acronym> query (that is,
     <code class="literal">@attr2=102</code>, see also
     <a class="ulink" href="https://www.loc.gov/z3950/agency/bib1.html" target="_top">
      The <acronym class="acronym">BIB-1</acronym> Attribute Set Semantics</a>, also in
     <a class="ulink" href="https://www.loc.gov/z3950/agency/defns/bib1.html" target="_top">HTML</a>).
     To find all articles with the word <code class="literal">Eoraptor</code> in
     the title, and present them relevance ranked, issue the <acronym class="acronym">PQF</acronym> query:
     </p><pre class="screen">
      @attr 2=102 @attr 1=4 Eoraptor
     </pre><p>
    </p><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="administration-ranking-dynamic-rank1"></a>9.3.1.Dynamically ranking using <acronym class="acronym">PQF</acronym> queries with the 'rank-1'
      algorithm</h4></div></div></div><p>
      The default <code class="literal">rank-1</code> ranking module implements a
      TF/IDF (Term Frequecy over Inverse Document Frequency) like
      algorithm. In contrast to the usual definition of TF/IDF
      algorithms, which only considers searching in one full-text
      index, this one works on multiple indexes at the same time.
      More precisely,
      <span class="application">Zebra</span> does boolean queries and searches in specific addressed
      indexes (there are inverted indexes pointing from terms in the
      dictionary to documents and term positions inside documents).
      It works like this:
      </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">Query Components</span></dt><dd><p>
	  First, the boolean query is dismantled into its principal components,
	  i.e. atomic queries where one term is looked up in one index.
	  For example, the query
	  </p><pre class="screen">
	   @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
	  </pre><p>
	  is a boolean AND between the atomic parts
	  </p><pre class="screen">
	   @attr 2=102 @attr 1=1010 Utah
	  </pre><p>
          and
	  </p><pre class="screen">
	   @attr 2=102 @attr 1=1018 Springer
	  </pre><p>
	  which gets processed each for itself.
	 </p></dd><dt><span class="term">Atomic hit lists</span></dt><dd><p>
	  Second, for each atomic query, the hit list of documents is
	  computed.
	 </p><p>
	  In this example, two hit lists for each index
	  <code class="literal">@attr 1=1010</code>  and
	  <code class="literal">@attr 1=1018</code> are computed.
	 </p></dd><dt><span class="term">Atomic scores</span></dt><dd><p>
	  Third, each document in the hit list is assigned a score (_if_ ranking
	  is enabled and requested in the query)  using a TF/IDF scheme.
	 </p><p>
	  In this example, both atomic parts of the query assign the magic
	  <code class="literal">@attr 2=102</code> relevance attribute, and are
	  to be used in the relevance ranking functions.
	 </p><p>
	  It is possible to apply dynamic ranking on only parts of the
	  <acronym class="acronym">PQF</acronym> query:
	  </p><pre class="screen">
	   @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
	  </pre><p>
	  searches for all documents which have the term 'Utah' on the
	  body of text, and which have the term 'Springer' in the publisher
	  field, and sort them in the order of the relevance ranking made on
	  the body-of-text index only.
	 </p></dd><dt><span class="term">Hit list merging</span></dt><dd><p>
	  Fourth, the atomic hit lists are merged according to the boolean
	  conditions to a final hit list of documents to be returned.
	 </p><p>
	  This step is always performed, independently of the fact that
	  dynamic ranking is enabled or not.
	 </p></dd><dt><span class="term">Document score computation</span></dt><dd><p>
	  Fifth, the total score of a document is computed as a linear
	  combination of the atomic scores of the atomic hit lists
	 </p><p>
	  Ranking weights may be used to pass a value to a ranking
	  algorithm, using the non-standard <acronym class="acronym">BIB-1</acronym> attribute type 9.
	  This allows one branch of a query to use one value while
	  another branch uses a different one.  For example, we can search
	  for <code class="literal">utah</code> in the
	  <code class="literal">@attr 1=4</code> index with weight 30, as
	  well as in the <code class="literal">@attr 1=1010</code> index with weight 20:
	  </p><pre class="screen">
	   @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
	  </pre><p>
	 </p><p>
	  The default weight is
	  sqrt(1000) ~ 34 , as the <acronym class="acronym">Z39.50</acronym> standard prescribes that the top score
	  is 1000 and the bottom score is 0, encoded in integers.
	 </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>
	   The ranking-weight feature is experimental. It may change in future
	   releases of zebra.
	  </p></div></dd><dt><span class="term">Re-sorting of hit list</span></dt><dd><p>
	  Finally, the final hit list is re-ordered according to scores.
	 </p></dd></dl></div><p>

     </p><p>
      The <code class="literal">rank-1</code> algorithm
      does not use the static rank
      information in the list keys, and will produce the same ordering
      with or without static ranking enabled.
     </p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Warning</h3><p>
       <code class="literal">Dynamic ranking</code> is not compatible
       with <code class="literal">estimated hit sizes</code>, as all documents in
       a hit set must be accessed to compute the correct placing in a
       ranking sorted list. Therefore the use attribute setting
       <code class="literal">@attr2=102</code> clashes with
       <code class="literal">@attr9=integer</code>.
      </p></div></div><div class="sect3"><div class="titlepage"><div><div><h4 class="title"><a name="administration-ranking-dynamic-cql"></a>9.3.2.Dynamically ranking <acronym class="acronym">CQL</acronym> queries</h4></div></div></div><p>
      Dynamic ranking can be enabled during sever side <acronym class="acronym">CQL</acronym>
      query expansion by adding <code class="literal">@attr2=102</code>
      chunks to the <acronym class="acronym">CQL</acronym> config file. For example
      </p><pre class="screen">
       relationModifier.relevant		= 2=102
      </pre><p>
      invokes dynamic ranking each time a <acronym class="acronym">CQL</acronym> query of the form
      </p><pre class="screen">
       Z&gt; querytype cql
       Z&gt; f alvis.text =/relevant house
      </pre><p>
      is issued. Dynamic ranking can also be automatically used on
      specific <acronym class="acronym">CQL</acronym> indexes by (for example) setting
      </p><pre class="screen">
       index.alvis.text                        = 1=text 2=102
      </pre><p>
      which then invokes dynamic ranking each time a <acronym class="acronym">CQL</acronym> query of the form
      </p><pre class="screen">
       Z&gt; querytype cql
       Z&gt; f alvis.text = house
      </pre><p>
      is issued.
     </p></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="administration-ranking-sorting"></a>9.4.Sorting</h3></div></div></div><p>
     <span class="application">Zebra</span> sorts efficiently using special sorting indexes
     (type=<code class="literal">s</code>; so each sortable index must be known
     at indexing time, specified in the configuration of record
     indexing.  For example, to enable sorting according to the <acronym class="acronym">BIB-1</acronym>
     <code class="literal">Date/time-added-to-db</code> field, one could add the line
     </p><pre class="screen">
      xelm /*/@created               Date/time-added-to-db:s
     </pre><p>
     to any <code class="literal">.abs</code> record-indexing configuration file.
     Similarly, one could add an indexing element of the form
     </p><pre class="screen">
      &lt;z:index name="date-modified" type="s"&gt;
      &lt;xsl:value-of select="some/xpath"/&gt;
     &lt;/z:index&gt;
      </pre><p>
     to any <code class="literal">alvis</code>-filter indexing stylesheet.
    </p><p>
     Indexing can be specified at searching time using a query term
     carrying the non-standard
     <acronym class="acronym">BIB-1</acronym> attribute-type <code class="literal">7</code>.  This removes the
     need to send a <acronym class="acronym">Z39.50</acronym> <code class="literal">Sort Request</code>
     separately, and can dramatically improve latency when the client
     and server are on separate networks.
     The sorting part of the query is separate from the rest of the
     query - the actual search specification - and must be combined
     with it using OR.
    </p><p>
     A sorting subquery needs two attributes: an index (such as a
     <acronym class="acronym">BIB-1</acronym> type-1 attribute) specifying which index to sort on, and a
     type-7 attribute whose value is be <code class="literal">1</code> for
     ascending sorting, or <code class="literal">2</code> for descending.  The
     term associated with the sorting attribute is the priority of
     the sort key, where <code class="literal">0</code> specifies the primary
     sort key, <code class="literal">1</code> the secondary sort key, and so
     on.
    </p><p>For example, a search for water, sort by title (ascending),
     is expressed by the <acronym class="acronym">PQF</acronym> query
     </p><pre class="screen">
      @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
     </pre><p>
     whereas a search for water, sort by title ascending,
     then date descending would be
     </p><pre class="screen">
      @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
     </pre><p>
    </p><p>
     Notice the fundamental differences between <code class="literal">dynamic
      ranking</code> and <code class="literal">sorting</code>: there can be
     only one ranking function defined and configured; but multiple
     sorting indexes can be specified dynamically at search
     time. Ranking does not need to use specific indexes, so
     dynamic ranking can be enabled and disabled without
     re-indexing; whereas, sorting indexes need to be
     defined before indexing.
    </p></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="shadow-registers.html">Prev</a></td><td width="20%" align="center"><a accesskey="u" href="administration.html">Up</a></td><td width="40%" align="right"><a accesskey="n" href="administration-extended-services.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">8.Safe Updating - Using Shadow Registers</td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top">10.Extended Services: Remote Insert, Update and Delete</td></tr></table></div></body></html>