File: tutorial.xml

package info (click to toggle)
idzebra 2.2.8-2
links: PTS, VCS
area: main
in suites: forky, sid
size: 10,572 kB
sloc: ansic: 54,389; xml: 27,058; sh: 5,892; makefile: 1,102; perl: 210; tcl: 64
file content (571 lines) | stat: -rw-r--r-- 20,758 bytes
parent folder | download | duplicates (4)
 <chapter id="tutorial">
  <title>Tutorial</title>


  <sect1 id="tutorial-oai">
   <title>A first &acro.oai; indexing example</title>

   <para>
    In this section, we will test the system by indexing a small set of
    sample &acro.oai; records that are included with the &zebra; distribution,
    running a &zebra; server against the newly created database, and
    searching the indexes with a client that connects to that server.
   </para>
   <para>
    Go to the <literal>examples/oai-pmh</literal> subdirectory of the
    distribution archive, or make a deep copy of the Debian installation
    directory
    <literal>/usr/share/idzebra-2.0-examples/oai-pmh</literal>.
    An XML file containing multiple &acro.oai;
    records is located in the  sub
    directory <literal>examples/oai-pmh/data</literal>.
   </para>
   <para>
    Additional OAI test records can be downloaded by running a shell
    script (you may want to abort the script when you have waited
    longer than your coffee brews ..).
    <screen>
     cd data
     ./fetch_OAI_data.sh
     cd ../
    </screen>
   </para>
   <para>
    To index these &acro.oai; records, type:
    <screen>
     zebraidx-2.0 -c conf/zebra.cfg init
     zebraidx-2.0 -c conf/zebra.cfg update data
     zebraidx-2.0 -c conf/zebra.cfg commit
    </screen>
    In case you have not installed zebra yet but have compiled the
    binaries from this tarball, use the following command form:
    <screen>
     ../../index/zebraidx -c conf/zebra.cfg this and that
    </screen>
    On some systems the &zebra; binaries are installed under the
    generic names, you need to use  the following command form:
    <screen>
     zebraidx -c conf/zebra.cfg this and that
    </screen>
   </para>

   <para>
    In this command, the word <literal>update</literal> is followed
    by the name of a directory: <literal>zebraidx</literal> updates all
    files in the hierarchy rooted at <literal>data</literal>.
    The command option
    <literal>-c conf/zebra.cfg</literal> points to the proper
    configuration file.
   </para>

   <para>
    You might ask yourself how &acro.xml; content is indexed using &acro.xslt;
    stylesheets: to satisfy your curiosity, you might want to run the
    indexing transformation on an example debugging &acro.oai; record.
    <screen>
     xsltproc conf/oai2index.xsl data/debug-record.xml
    </screen>
    Here you see the &acro.oai; record transformed into the indexing
    &acro.xml; format. &zebra; is creating several inverted indexes,
    and their name and type are clearly visible in the indexing
    &acro.xml; format.
   </para>

   <para>
    If your indexing command was successful, you are now ready to
    fire up a server. To start a server on port 9999, type:
    <screen>
     zebrasrv-2.0 -c conf/zebra.cfg  @:9999
    </screen>
   </para>

   <para>
    The &zebra; index that you have just created has a single database
    named <literal>Default</literal>.
    The database contains  several &acro.oai; records, and the server will
    return records in the &acro.xml; format only. The indexing machine
    did the splitting into individual records just behind the scenes.
   </para>


  </sect1>

  <sect1 id="tutorial-oai-sru-pqf">
   <title>Searching the &acro.oai; database by web service</title>

   <para>
    &zebra; has a build-in web service, which is close to the
    &acro.sru; standard web service. We use it to access our new
    database using any   &acro.xml; enabled web browser.
    This service is using the  &acro.pqf; query language.
    In a later
    section we show how to run a fully compliant  &acro.sru; server,
    including support for the query language  &acro.cql;
   </para>

   <para>
    Searching and retrieving &acro.xml; records is easy. For example,
    you can point your browser to one of the following URLs to
    search for the term <literal>the</literal>. Just point your
    browser at this link:
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the</ulink>
   </para>

   <warning>
    <para>
     These URLs won't work unless you have indexed the example data
     and started an &zebra; server as outlined in the previous section.
    </para>
   </warning>

   <para>
    In case we actually want to retrieve one record, we need to alter
    our URL to the following
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
    </ulink>
   </para>

   <para>
    This way we can page through our result set in chunks of records,
    for example, we access the 6th to the 10th record using the URL
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc
    </ulink>
   </para>

   <!--
   relation tests:

   <ulink url="">

   http://localhost:9999/?version=1.1&amp;operation=searchRetrieve
   &amp;x-pquery=title%3Cthe
   -->
  </sect1>

  <sect1 id="tutorial-oai-sru-present">
   <title>Presenting search results in different formats</title>

   <para>
    &zebra; uses &acro.xslt; stylesheets for both &acro.xml;record
    indexing and
    display retrieval. In this example installation, they are two
    retrieval schema's defined in
    <literal>conf/dom-conf.xml</literal>:
    the <literal>dc</literal> schema implemented in
    <literal>conf/oai2dc.xsl</literal>, and
    the <literal>zebra</literal> schema implemented in
    <literal>conf/oai2zebra.xsl</literal>.
    The URLs for accessing both are the same, except for the different
    value of the <literal>recordSchema</literal> parameter:
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
    </ulink>
    and
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra
    </ulink>
    For the curious, one can see that the &acro.xslt; transformations
    really do the magic.
    <screen>
     xsltproc conf/oai2dc.xsl data/debug-record.xml
     xsltproc conf/oai2zebra.xsl data/debug-record.xml
    </screen>
    Notice also that the &zebra; specific parameters are injected by
    the engine when retrieving data, therefore some of the attributes
    in the <literal>zebra</literal> retrieval schema are not filled
    when running the transformation from the command line.
   </para>


   <para>
    In addition to the user defined retrieval schema's one can  always
    choose from many  build-in schema's. In case one is only
    interested in the &zebra; internal metadata about a certain
    record, one uses the <literal>zebra::meta</literal> schema.
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta
    </ulink>
   </para>

   <para>
    The <literal>zebra::data</literal> schema is used to retrieve the
    original stored &acro.oai; &acro.xml; record.
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data
    </ulink>
   </para>

  </sect1>

  <sect1 id="tutorial-oai-sru-searches">
   <title>More interesting searches</title>

   <para>
    The &acro.oai; indexing example defines many different index
    names, a study of the <literal>conf/oai2index.xsl</literal>
    stylesheet reveals the following word type indexes (i.e. those
    with suffix <literal>:w</literal>):
    <screen>
     any:w
     title:w
     author:w
     subject-heading:w
     description:w
     contributor:w
     publisher:w
     language:w
     rights:w
    </screen>
    By default, searches do access the <literal>any:w</literal> index,
    but we can direct searches to any access point by constructing the
    correct &acro.pqf; query. For example, to search in titles only,
    we use
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr 1=title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr 1=title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
    </ulink>
   </para>

   <para>
    Similar we can direct searches to the other indexes defined. Or we
    can create boolean combinations of searches on different
    indexes. In this case we search for <literal>the</literal> in
    <literal>title</literal> and for <literal>fish</literal> in
    <literal>description</literal> using the query
    <literal>@and @attr 1=title the @attr 1=description fish</literal>.
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and @attr 1=title the @attr 1=description fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and @attr 1=title the @attr 1=description fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
    </ulink>
   </para>


  </sect1>

  <sect1 id="tutorial-oai-sru-zebra-indexes">
   <title>Investigating the content of the indexes</title>

   <para>
    How does the magic work? What is inside the indexes? Why is a certain
    record found by a search, and another not?. The answer is in the
    inverted indexes. You can easily investigate them using the
    special &zebra; schema
    <literal>zebra::index::fieldname</literal>. In this example you
    can see that the <literal>title</literal> index has both word
    (type <literal>:w</literal>) and phrase (type
    <literal>:p</literal>)
    indexed fields,
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::title">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::title
    </ulink>
   </para>

   <para>
    But where in the indexes did the term match for the query occur?
    Easily answered with the special  &zebra; schema
    <literal>zebra::snippet</literal>. The matching terms are
    encapsulated by <literal>&lt;s&gt;</literal> tags.
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
    </ulink>
   </para>

   <para>
    How can I refine my search? Which interesting search terms are
    found inside my hit set? Try the special  &zebra; schema
    <literal>zebra::facet::fieldname:type</literal>. In this case, we
    investigate additional search terms for the
    <literal>title:w</literal> index.
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::title:w">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::title:w
    </ulink>
   </para>

   <para>
    One can ask for multiple facets. Here, we want them from phrase
    indexes of type
    <literal>:p</literal>.
    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::publisher:p,title:p">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::publisher:p,title:p
    </ulink>
   </para>

  </sect1>


  <sect1 id="tutorial-oai-sru-yazfrontend">
   <title>Setting up a correct &acro.sru; web service</title>

   <para>
    The &acro.sru; specification mandates that the &acro.cql; query
    language is supported and properly configured. Also, the server
    needs to be able to emit a proper  &acro.explain; &acro.xml;
    record, which is used to determine the capabilities of the
    specific server instance.
   </para>

   <para>
    In this example configuration we exploit the similarities between
    the &acro.explain; record and the &acro.cql; query language
    configuration, we generate the later from the former using an
    &acro.xslt; transformation.
    <screen>
     xsltproc conf/explain2cqlpqftxt.xsl conf/explain.xml > conf/cql2pqf.txt
    </screen>
   </para>

   <para>
    We are all set to start the &acro.sru;/&acro.z3950; server including
    &acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend
    server configuration - just type
    <screen>
     zebrasrv -f conf/yazserver.xml
    </screen>
   </para>

   <para>
    First, we'd like to be sure that we can see the  &acro.explain;
    &acro.xml; response correctly. You might use either of these equivalent
    requests:
    <ulink
     url="http://localhost:9999">http://localhost:9999
    </ulink>
    or
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=explain">
     http://localhost:9999/?version=1.1&amp;operation=explain
    </ulink>

   </para>

   <para>
    Now we can issue true &acro.sru; requests. For example,
    <literal>dc.title=the
     and dc.description=fish</literal> results in the following page
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
    </ulink>
   </para>

   <para>
    Scan of indexes is a part of the  &acro.sru; server business. For example,
    scanning the <literal>dc.title</literal> index gives us an idea
    what search terms are found there
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish">
     http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish
    </ulink>,
    whereas
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish">
     http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish
    </ulink>
    accesses the indexed identifiers.
   </para>

   <para>
    In addition, all &zebra; internal special element sets or record
    schema's of the form
    <literal>zebra::</literal> just work right out of the box
    <ulink
     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
    </ulink>
   </para>



  </sect1>


  <sect1 id="tutorial-oai-z3950">
   <title>Searching the &acro.oai; database by &acro.z3950; protocol</title>

   <para>
    In this section we repeat the searches and presents we have done so
    far using the binary &acro.z3950; protocol, you can use any
    &acro.z3950; client.
    For instance, you can use the demo command-line client that comes
    with &yaz;.
   </para>
   <para>
    Connecting to the server is done by the command
    <screen>
     yaz-client localhost:9999
    </screen>
   </para>

   <para>
    When the client has connected, you can type:
    <screen>
     Z> format xml
     Z> querytype prefix
     Z> elements oai
     Z> find the
     Z> show 1+1
    </screen>
   </para>

   <para>
    &acro.z3950; presents using presentation stylesheets:
    <screen>
     Z> elements dc
     Z> show 2+1

     Z> elements zebra
     Z> show 3+1
    </screen>
   </para>

   <para>
    &acro.z3950; buildin Zebra presents (in this configuration only if
    started without yaz-frontendserver):

    <screen>
     Z> elements zebra::meta
     Z> show 4+1

     Z> elements zebra::meta::sysno
     Z> show 5+1

     Z> format sutrs
     Z> show 5+1
     Z> format xml

     Z> elements zebra::index
     Z> show 6+1

     Z> elements zebra::snippet
     Z> show 7+1

     Z> elements zebra::facet::any:w
     Z> show 1+1

     Z> elements zebra::facet::publisher:p,title:p
     Z> show 1+1
    </screen>
   </para>

   <para>
    &acro.z3950; searches targeted at specific indexes and boolean
    combinations of these can be issued as well.

    <screen>
     Z> elements dc
     Z> find @attr 1=oai_identifier @attr 4=3 oai:caltechcstr.library.caltech.edu:4
     Z> show 1+1

     Z> find @attr 1=oai_datestamp @attr 4=3 2001-04-20
     Z> show 1+1

     Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562
     Z> show 1+1

     Z> find @attr 1=title communication
     Z> show 1+1

     Z> find @attr 1=identifier @attr 4=3
     http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86
     Z> show 1+1
    </screen>
    etc, etc.
   </para>

   <para>
    &acro.z3950; scan:
    <screen>
     yaz-client localhost:9999
     Z> format xml
     Z> querytype prefix
     Z> scan @attr 1=oai_identifier @attr 4=3 oai
     Z> scan @attr 1=oai_datestamp @attr 4=3 1
     Z> scan @attr 1=oai_setspec @attr 4=3 2000
     Z>
     Z> scan @attr 1=title communication
     Z> scan @attr 1=identifier @attr 4=3 a
    </screen>
   </para>

   <para>
    &acro.z3950; search using server-side CQL conversion:
    <screen>
     Z> format xml
     Z> querytype cql
     Z> elements dc
     Z>
     Z> find harry
     Z>
     Z> find dc.creator = the
     Z> find dc.creator = the
     Z> find dc.title = the
     Z>
     Z> find dc.description &lt; the
     Z> find dc.title &gt; some
     Z>
     Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78"
     Z> find dc.relation = something
    </screen>
   </para>

   <!--
   etc, etc. Notice that  all indexes defined by 'type="0"' in the
   indexing style  sheet must be searched using the 'eq'
   relation.

   Z> find title <> and

   fails as well.  ???
   -->

   <tip>
    <para>
     &acro.z3950; scan using server side CQL conversion -
     unfortunately, this will _never_ work as it is not supported by the
     &acro.z3950; standard.
     If you want to use scan using server side CQL conversion, you need to
     make an SRW connection using  yaz-client, or a
     SRU connection using REST Web Services - any browser will do.
    </para>
   </tip>

   <tip>
    <para>
     All indexes defined by 'type="0"' in the
     indexing style  sheet must be searched using the '@attr 4=3'
     structure attribute instruction.
    </para>
   </tip>

   <para>
    Notice that searching and scan on indexes
    <literal>contributor</literal>,  <literal>language</literal>,
    <literal>rights</literal>, and <literal>source</literal>
    might fail, simply because none of the records in the small example set
    have these fields set, and consequently, these indexes might not
    been created.
   </para>

  </sect1>

 </chapter>


 <!-- Keep this comment at the end of the file
 Local variables:
 mode: sgml
 sgml-omittag:t
 sgml-shorttag:t
 sgml-minimize-attributes:nil
 sgml-always-quote-attributes:t
 sgml-indent-step:1
 sgml-indent-data:t
 sgml-parent-document: "idzebra.xml"
 sgml-local-catalogs: nil
 sgml-namecase-general:t
 End:
 -->