File: lucene-dev-HOWTO

#######################################################################
 LUCENE PHP/AXYL HOWTO
 Paul Waite
 October 2002
#######################################################################

1. INTRODUCTION
Lucene is a Java full-text search engine. Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

The Lucene web site is at:
  http://jakarta.apache.org/lucene

Join the Lucene-User mailing list by sending a message to:
  lucene-user-subscribe@jakarta.apache.org


Axyl locations:
Have a look in /etc/axyl/axyl.conf if you need to remind yourself
where you installed Axyl. All of the scripts etc. referred to below
will be in the path specified by AXYL_HOME in this .conf file.

.......................................................................
2. LUCENESERVER INSTALLATION

Lucene and the Catalyst Luceneserver come bundled with Axyl, ready to
go. All you have to do is add a Java Runtime Environment of your
choice and make sure the environment variable JAVA_HOME points to the
proper directory. Eg. if you have downloaded the JRE to the local
directory /usr/local/java then normally, you would be exporting
JAVA_HOME=/usr/local/java/jre to the shell.

All you need to do, once your JRE is set up, is to go to the Axyl
"install" sub-directory, and create a new website with the
installation script "create_axyl_website.sh".

This script will do everything necessary for the new website, and at
the end will create and customise a lucene domain to go with it.

Once you have done that, you are ready to go. The Luceneserver should
already be listening on Port 22222 (unless you changed it in the
"Server.config" file). All you need to do is set up your Axyl
application as described in the next section, to use it.


.......................................................................
3. MANUAL LUCENE APPLICATION SETUP

This section is for those who want to set up a website Lucene domain
by hand. Normally the "install/create_axyl_website.sh" script will do
this for you.

When you decide you want to use Lucene as a search engine for your Axyl
application, you have to set up a default properties file on the Lucene
server first. Each Axyl application (website) has an APP_PREFIX defined
for it. This is a single-word application identifier, and must
be unique. To find the APP_PREFIX of an existing Axyl website, look in
the "application.xml" file in the website doc root directory. This is
an XML-formatted file, and the value for this setting is near the top.
To define a Lucene properties file, go to the
${AXYL_DATA}/lucene/servers/axyl/etc/application directory and create a
new file APP_PREFIX.properties, where APP_PREFIX is replaced with the
APP_PREFIX of your application.

NOTE: the Axyl data directory ${AXYL_DATA} should be defined in the
/etc/axyl/axyl.conf file. It is normally set to "/var/lib/axyl".

The best idea is to simply copy an existing properties file, for example
the "axyl.properties" file which comes with Axyl out of the box. Once
you have done this, you can look at the content of the file in more
detail, and change the "axyl" entries to your APP_PREFIX.

The file will have a few crucial default settings in it, such as the
Domain and the Stopwords list. It can also hold defaults for other fields.
Meaningful defaults would be for fields like: Query, Limit, Domain, Sort,
Return and Field-Definitions.

A word about each of these:
- Domain
A domain is a logical grouping of indexed items within the Lucene server.
For example if you were working on an application which was called 'news'
and which had its own set of news articles, then you would pick a likely
unique name (probably 'news' in this case!) and use that for your domain.

That way, other Lucene users can be kept separate when they make queries
or do indexing. You can also do clever things with multiple domains. For
example your "news" application might want to query across three domains.
In that case you would set up your application default properties file
with a Domain: header followed by the names of the three domains separated
by a space. That would cause queries to be made against all three of those
domains.

- Stopwords
This is just a list of words which Lucene will ignore both for indexing and
for querying. Use the default set by copying an existing properties file.

- Limit
The maximum number of results to return from any query, even if more are
available.

- Return
A list of fields, separated by a space, which will be returned. These fields
must have been stored by the indexing operation first of course. If, in a
query, you refer to a field which doesn't exist, Lucene doesn't report any
error - it just ignores the reference.

- Sort
A list of fields, separated by a space, which will be used to sort the
returned results. Note there is a pseudo-field called 'RANK' which sorts
results by relevance to your query. This is the fastest by a long way,
since it is the 'native' order Lucene returns results in. All other sorts
require Luceneserver to acquire all results and sort them, which may
take a while if the resultset is large.

- Field-Definitions
This is a list of definitions of fields you will be using in your indexing
operations and queries. An example of such a list in the properties file
might be:

  Field-Definitions=Id Id Yes Yes, \
                    Domain Text Yes Yes, \
                    MyField Text Yes Yes, \
                    Published Date Yes Yes, \
                    Category Text Yes Yes

This example defines 5 fields. The first two (Id and Domain) are always
present. The other three are hypothetical fields. After its name, each
field has three parameters: Type, Indexed, Stored. The type can be one of:
'Text', 'Date' or 'Id'. The other two parameters take values of either
'Yes' or 'No', with meanings as follows:

    Indexed:  If 'Yes' then the field will be indexed by Lucene and
              is therefore available for use in querying. If 'No' then
              it will not be available for querying, and will only be
              available for returning (if stored).

    Stored:   If 'Yes' then Lucene will store this field value, and
              hence make it available to be returned as meta-data
              associated with the item which was found by a query.

Obviously you have to use these sensibly. For example setting both of them to
'No' would be useless. Also, storing fields should be restricted to data
which is really required. In a large index of millions of articles, field
storage will have a big impact on space requirements.
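
To make the layout of a Field-Definitions value concrete, here is a
minimal sketch in plain PHP which splits one into its parts. The helper
parse_field_definitions() is made up for this HOWTO and is not part of
Axyl or Luceneserver. It assumes the continuation lines have already
been joined into one string.

```php
<?php
// Hypothetical helper, for illustration only (not an Axyl function).
// Splits "Name Type Indexed Stored, Name Type Indexed Stored, ..."
// into an array keyed by field name.
function parse_field_definitions($value) {
  $fields = array();
  foreach (explode(",", $value) as $definition) {
    $parts = preg_split('/\s+/', trim($definition));
    if (count($parts) != 4) {
      continue; // skip empty or malformed entries
    }
    list($name, $type, $indexed, $stored) = $parts;
    $fields[$name] = array(
      "type"    => $type,
      "indexed" => ($indexed == "Yes"),
      "stored"  => ($stored == "Yes"),
    );
  }
  return $fields;
}

$defs = parse_field_definitions(
  "Id Id Yes Yes, Domain Text Yes Yes, Published Date Yes No"
);
foreach ($defs as $name => $def) {
  echo $name . ": type=" . $def["type"]
     . " indexed=" . ($def["indexed"] ? "Yes" : "No")
     . " stored=" . ($def["stored"] ? "Yes" : "No") . "\n";
}
```

Here 'Published' is indexed but not stored, so it could be used in a
query but would not come back in the results.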

That covers the contents of the application defaults file. When you make
calls to do indexing or querying from the application, you will specify your
application name (eg: 'news' in our example) and this will then pick up the
appropriate file. Note that everything in that file can be over-ridden in
your application.


.......................................................................
4. INDEXING ITEMS

We talk about 'items' rather than documents because Lucene is a generic
tool for indexing stuff, not just files containing documents. When
you index a thing, at the time you do it, you pass an identifier to
Lucene, called the 'Id' field. Then, if a subsequent query finds that
item, the Id is returned. The form that the Id takes is entirely up
to you. It can be a number, a string (eg. a path to a file), or some
weird and wonderful combination of these. As long as you can make sense
of it when it gets handed back to you, that's all that matters.

Example of Indexing a Thing
Here is a simple example of indexing a string of words, and associating
an Id of '1234' with it. Note we specify the application 'news' in the
creation of our indexing object:

  $I = new lucene_indexmsg("news", "lucene.catalyst.net.nz", "22222");
  $I->index_field("StoryDate:Date", time());
  $I->index_field("Category", "rude", NOT_STORED);
  $I->index_content("1234", "The quick brown fox humped the horny dog");
  $I->send();

  [ NB: The examples in this documentation won't work in real life
    since the application "news" is probably not set up in Lucene. ]

We are also associating two fields with this item. Looking at the first
one, 'StoryDate', we see that there is a type-specifier of 'Date' added
onto the end of the field name. The default type is 'Text', and the
only other one you need to know about at present is 'Date'. When you
have a datetime field (as here) then you need to flag that it is such
a field like this. Since it is a datetime field, we must assign a
Unix timestamp to it, and we are assigning the current time here.

The second field is a text field, so no need to specify the type. It is
named 'Category' and contains the value 'rude' in this case. Note that
it is flagged for non-storage by Lucene. We have decided that, although
we want to query by Category, we don't ever need Lucene to actually return
us the category value.

The main call is to index_content(), and the first parameter is the Id
to associate with the item, followed by the content itself. Obviously
this content might have come from anywhere - a file, a generated
webpage, whatever.
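
Often the date you want to index is the publication date of the item
rather than the current time. The sketch below converts a date string
to a Unix timestamp first, then uses the same calls as the example
above. The story data is invented, and the class_exists() check is only
there so that the snippet does nothing when the Axyl Lucene classes
have not been loaded.

```php
<?php
// Invented story data, for illustration only.
$story = array(
  "id"        => "1235",
  "published" => "2002-10-15 09:30:00",
  "category"  => "sport",
  "body"      => "Local team wins the cup after extra time",
);

// Date fields take a Unix timestamp, so convert the date string first.
// strtotime() interprets the string in the server's time zone.
$published_ts = strtotime($story["published"]);

// Only talk to Lucene if the Axyl Lucene classes are available.
if (class_exists("lucene_indexmsg")) {
  $I = new lucene_indexmsg("news", "lucene.catalyst.net.nz", "22222");
  $I->index_field("StoryDate:Date", $published_ts);
  $I->index_field("Category", $story["category"], NOT_STORED);
  $I->index_content($story["id"], $story["body"]);
  $I->send();
}
```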


.......................................................................
5. MAKING QUERIES

Before constructing queries, you would be well advised to peruse the
Lucene query syntax documentation. This is to be found at:

  http://jakarta.apache.org/lucene/docs/queryparsersyntax.html


Here's an example query based on the indexed item in Section 4 above:

  $M = new lucene_search("news", "lucene.catalyst.net.nz", "22222");
  $M->match("\"horny dog\"");
  $M->must_matchfield("Category", "rude");
  $M->set_returnfields("StoryDate:Date");
  $M->set_sortorder("StoryDate,Category");
  $M->execute();

This simple query picks the 'news' application, and specifies a phrase
match for "horny dog". We could have omitted the escaped quotes and
then we would have had all items which had either the word "horny" OR
the word "dog" in them. But we didn't.

The call to the must_matchfield() method specifies that the returned
item has to have its Category field equal to 'rude'.

The call to set_returnfields() sets the Return header to be the list
of comma-delimited fieldnames contained in the parameter string. In
our case there's only one of course, and it is a datetime, so we
have to add the Date specifier too.

The call to set_sortorder() sets the sort order of returned articles
to be by StoryDate, and then by Category.

When we execute the query we get a response back, and you can inspect
the hits with this bit of debugging:

  $hitcount = $M->hitcount();
  debugbr("We got $hitcount hits for that query.");
  // Each hit is an array of fieldname => value pairs
  foreach ($M->hit as $hitfields) {
    foreach ($hitfields as $fld => $val) {
      debugbr("$fld = $val");
    }
  }

So that little example shows you a basic way of accessing the fields
which were returned by the query.
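
In practice you will usually want the Ids out of the hits, so that you
can go and fetch the real items from wherever they live. Here is a
small sketch of that, using invented sample data in place of a live
query result. The helper collect_ids() is hypothetical and not part of
Axyl. With a real query you would pass it $M->hit.

```php
<?php
// Invented sample data in the same shape the debugging loop walks
// over: one array per hit, keyed by field name.
$hits = array(
  array("Id" => "1234", "Category" => "rude"),
  array("Id" => "1240", "Category" => "rude"),
);

// Hypothetical helper, for illustration only (not an Axyl function).
function collect_ids($hits) {
  $ids = array();
  foreach ($hits as $hitfields) {
    if (isset($hitfields["Id"])) {
      $ids[] = $hitfields["Id"];
    }
  }
  return $ids;
}

echo implode(", ", collect_ids($hits)) . "\n";
```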


.......................................................................
6. MORE ADVANCED QUERYING

There are quite a few methods for extracting data from Lucene. You can
specify the max number of 'hits' to get back, the sort order, and date
ranges. You can add multiple 'terms' to the query with any combination
of AND, OR, NOT or AND NOT (same as NOT) boolean joining operators.

Here is an example:

  $M = new lucene_search("news", "lucene.catalyst.net.nz", "22222");
  $M->set_daterange(
      displaydate_to_timestamp("Yesterday"),
      time(),
      "StoryDate"
      );
  $M->may_match("dog");
  $M->must_match("horny AND humped");
  $M->must_matchfield("Category", "rude");
  $M->set_returnfields("Category,StoryDate:Date");
  $M->set_sortorder("RANK");
  $M->set_maxresults(10);
  $M->execute();

This example shows the use of the date range term. With this we have to
give it the start and end dates (as Unix timestamps), plus the name of
the field containing the date itself.

The next term is an optional (OR) match of the word "dog", so this term
is not a 'must have' case, but items which hit this term will be ranked
higher.

The words "horny" and "humped" together are a must have term (AND) and
so only items with these two words will be returned.

Next we tie the search down to only those which have a field called
"Category" which contains "rude" in it.

The fields to be returned are specified next, and there are two, the
category, and the item date. Note that EVERY query also returns the
fields Id, Domain and RANK, so these three fields are added to our
chosen ones.

Finally we set a maximum results limit to stop the query swamping us, and
then execute it.
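
If you would rather work out the date range yourself than use the
displaydate_to_timestamp() helper, plain timestamp arithmetic does the
job. The sketch below searches the last seven days. The search word is
invented, and the class_exists() check is only there so that the
snippet does nothing when the Axyl Lucene classes have not been loaded.

```php
<?php
// Date range covering the last seven days, as Unix timestamps.
$end   = time();
$start = $end - (7 * 24 * 60 * 60);

// Only talk to Lucene if the Axyl Lucene classes are available.
if (class_exists("lucene_search")) {
  $M = new lucene_search("news", "lucene.catalyst.net.nz", "22222");
  $M->set_daterange($start, $end, "StoryDate");
  $M->must_match("fox");
  $M->set_returnfields("StoryDate:Date");
  $M->set_sortorder("RANK");
  $M->set_maxresults(20);
  $M->execute();
  debugbr("Found " . $M->hitcount() . " stories from the last week.");
}
```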


.......................................................................
7. INDEXING A FILE HIERARCHY

Here is an example of how to index a bunch of files already sitting on
disk in a directory tree. The problem with indexing a lot of files
with no operator intervention is how to provide the Unique index Id
for each one, and how to supply meta-data - ie. the fields of data
to index with each file.

For the Id we provide 4 different methods of automatically generating
a unique identifier:
  ID_FROM_INC       Use an incrementing integer, plus optional offset
  ID_FROM_NAME      Use filename, stripped of any extension
  ID_FROM_FILENAME  Use full filename as the Id
  ID_FROM_PATH      Use full path as the Id

The latter three options also accept an optional prefix, which is
stuck on the front of the varying Id part.

With ID_FROM_INC, the first Id will be '1', and later Ids will be
2, 3, 4 etc. You may specify an offset as a second parameter to the
id_generate() method, in which case the Ids will start from this
offset + 1.

In our example we have many files with names of the form 12345.html
and hence we pick ID_FROM_NAME which will extract the '12345' and
use that for the Id.
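
As a rough illustration of what the four schemes produce, here is a
sketch in plain PHP. The function example_id() and its string mode
names are made up for this HOWTO. It is not the Axyl implementation,
and the real id_generate() may treat edge cases (eg. files with more
than one extension) differently.

```php
<?php
// Hypothetical helper, for illustration only (not an Axyl function).
// $counter is the 1-based position of the file in the indexing run.
function example_id($mode, $path, $counter = 1, $offset = 0, $prefix = "") {
  switch ($mode) {
    case "inc":       // like ID_FROM_INC: offset + 1, offset + 2, ...
      return (string) ($offset + $counter);
    case "name":      // like ID_FROM_NAME: filename without extension
      return $prefix . pathinfo($path, PATHINFO_FILENAME);
    case "filename":  // like ID_FROM_FILENAME: filename with extension
      return $prefix . basename($path);
    case "path":      // like ID_FROM_PATH: the whole path
      return $prefix . $path;
  }
  return "";
}

$path = "/data/articles/archive/2001/10/02/3313.html";
echo example_id("inc", $path, 1, 1000) . "\n";             // 1001
echo example_id("name", $path) . "\n";                     // 3313
echo example_id("filename", $path, 1, 0, "news-") . "\n";  // news-3313.html
echo example_id("path", $path) . "\n";                     // the full path
```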

As far as fields are concerned, a common technique is to embed meta
data in HTML files as meta tags. The indexer will assume this without
you specifying it, and if you define fields, then it will look for
these by name (case IS important!) inside each file as <meta> tags
and use the content therein if they are found. In addition, the special
tag <title> is also scanned, and if you specify that as a field it
will use this too.
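
To show the kind of file this is aimed at, here is a small HTML sample
and a rough, plain PHP way of pulling named values out of it. This is
only an illustration: find_field() is a made-up helper, the sample
content is invented, and the name="..." content="..." attribute layout
is the usual HTML form rather than a statement of exactly what the
Axyl indexer accepts. The match is deliberately case-sensitive, to
mirror the point about field names above.

```php
<?php
// Invented sample file content.
$html = <<<HTML
<html>
<head>
<title>Local team wins the cup</title>
<meta name="category" content="sport">
<meta name="slug" content="cup-final">
</head>
<body><p>Story text goes here.</p></body>
</html>
HTML;

// Hypothetical helper, for illustration only (not an Axyl function).
// Returns the field value, or false if the field is not present.
function find_field($html, $name) {
  if ($name == "title") {
    $pattern = '/<title>(.*?)<\/title>/s';
  } else {
    $pattern = '/<meta\s+name="' . preg_quote($name, '/')
             . '"\s+content="([^"]*)"/';
  }
  return preg_match($pattern, $html, $m) ? $m[1] : false;
}

foreach (array("title", "category", "slug", "Category") as $name) {
  $value = find_field($html, $name);
  echo $name . " = " . ($value === false ? "(not found)" : $value) . "\n";
}
```

Note that "Category" with a capital C is not found, since the tag in
the file is named "category".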

In this example we have defined the title as a field we want, and also
quite a number of other fields which we know are present in our files
as HTML tags (<title> and <meta>):

  $IXR = new lucene_fileindexer("news", "lucene.catalyst.net.nz", "22222");
  $IXR->id_generate(ID_FROM_NAME);
  $IXR->define_field("title",       "Text", NOT_STORED);
  $IXR->define_field("slug",        "Text", NOT_STORED);
  $IXR->define_field("date",        "Date", STORED);
  $IXR->define_field("category",    "Text", STORED);
  $IXR->define_field("subcategory", "Text", STORED);
  $IXR->define_field("type",        "Text", NOT_STORED);
  $IXR->define_field("source",      "Text", NOT_STORED);
  $IXR->define_field("sourcetype",  "Text", NOT_STORED);
  $IXR->index_tree("/data/articles/archive/2001");

The final call to index_tree() executes the indexing process, in which
all files in the tree below the given path are scanned and indexed
recursively.

NB: We have to specify the application, the Host of the Lucene server,
and the port when we create the file indexer object.
[Note: Axyl users don't need to do this - see Section 9 below]


.......................................................................
8. INDEXING A SINGLE FILE

To index an individual file, you can call the same method that the
index_tree() method uses. The example below shows how:

  $IXR = new lucene_fileindexer("news", "lucene.catalyst.net.nz", "22222");
  $IXR->noscantags();
  $IXR->index_field("Category", "nonsense");
  $IXR->index_field("date:Date", time());
  $myfields = array("Subcategory" => "thingy", "Source" => "wherever");
  $IXR->index_file("/data/articles/archive/2001/10/02/3313.html", "9876", $myfields);

In this example we switch off the tag-scanning option by calling the
noscantags() method.

Next we have two fields we want to index, so we call index_field() with
the field name, and value. Note we use a type-specifier for the date field.
As well as repeated calls to index_field() you can also pass in an array
of fieldname/values (an associative array) and this is done in the
example to show how.

Finally a call to index_file() with the path of the file and the Id to
associate does the indexing. In our example the fields array is also
passed in, as the (optional) third parameter.


.......................................................................
9. USING LUCENE WITH THE AXYL LIBRARY

Since Axyl is an integrated framework it provides a bit more in the
way of convenience using the Lucene classes described above, even though
the classes are still completely standalone.

Mainly this concerns the specification of the Application, and the
Lucene host and port.

If your code uses the Axyl library then you need not specify any of these
three parameters in your code, provided you set your system up correctly.

First of all create your application properties file, with a name which
is the same as the APP_PREFIX which you have defined in "application.xml".
Let's say your APP_PREFIX is "news" for example. Then make sure Lucene
has a properties file called "news.properties" with all the appropriate
settings in there, such as a domain of "news", stopwords, etc.

Next, in your Axyl application, go to the System Setup maintenance page
and create two new configuration fields: "Lucene Host" and "Lucene Port".
When created, save the relevant host and port settings in there.

Axyl will then, by default, use your APP_PREFIX to set the application,
and will find the Lucene Host and Port settings as well, saving you having
to do this in every use of Lucene classes.

So creating a search object:
  $M = new lucene_search("news", "lucene.catalyst.net.nz", "22222");

  becomes..

  $M = new lucene_search();

And likewise for the other Lucene classes, such as lucene_indexmsg and
lucene_fileindexer.

You can still over-ride this behaviour, by specifying the three parameters
explicitly at any time.