.\" Automatically generated by Pod::Man v1.37, Pod::Parser v1.14
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sh \" Subsection heading
.br
.if t .Sp
.ne 5
.PP
\fB\\$1\fR
.PP
..
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  | will give a
.\" real vertical bar.  \*(C+ will give a nicer C++.  Capital omega is used to
.\" do unbreakable dashes and therefore won't be available.  \*(C` and \*(C'
.\" expand to `' in nroff, nothing in troff, for use with C<>.
.tr \(*W-|\(bv\*(Tr
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
'br\}
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.Sh), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.if \nF \{\
.    de IX
.    tm Index:\\$1\t\\n%\t"\\$2"
..
.    nr % 0
.    rr F
.\}
.\"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.hy 0
.if n .na
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "SWISH-FAQ 1"
.TH SWISH-FAQ 1 "2004-10-04" "2.5.1" "SWISH-E Documentation"
.SH "NAME"
The Swish\-e FAQ \- Answers to Common Questions
.SH "Frequently Asked Questions"
.IX Header "Frequently Asked Questions"
.Sh "General Questions"
.IX Subsection "General Questions"
\fIWhat is Swish\-e?\fR
.IX Subsection "What is Swish-e?"
.PP
Swish-e is \fBS\fRimple \fBW\fReb \fBI\fRndexing \fBS\fRystem for \fBH\fRumans \-
\&\fBE\fRnhanced.  With it, you can quickly and easily index directories of
files or remote web sites and search the generated indexes for words
and phrases.
.PP
\fISo, is Swish-e a search engine?\fR
.IX Subsection "So, is Swish-e a search engine?"
.PP
Well, yes.  Probably the most common use of Swish-e is to provide a search
engine for web sites.  The Swish-e distribution includes \s-1CGI\s0 scripts that
can be used with it to add a \fIsearch engine\fR for your web site.  The \s-1CGI\s0
scripts can be found in the \fIexample\fR directory of the distribution
package.  See the \fI\s-1README\s0\fR file for information about the scripts.
.PP
But Swish-e can also be used to index all sorts of data, such as email
messages, data stored in a relational database management system,
\&\s-1XML\s0 documents, or documents such as Word and \s-1PDF\s0 documents \*(-- or any
combination of those sources at the same time.  Searches can be limited
to fields or \fIMetaNames\fR within a document, or limited to areas within
an \s-1HTML\s0 document (e.g. body, title).  Programs other than \s-1CGI\s0 applications
can use Swish\-e, as well.
.PP
\fIShould I upgrade if I'm already running a previous version of Swish\-e?\fR
.IX Subsection "Should I upgrade if I'm already running a previous version of Swish-e?"
.PP
A large number of bug fixes, feature additions, and logic corrections were
made in version 2.2.  In addition, indexing speed has been drastically
improved (reports of indexing times changing from four hours to five
minutes), and major parts of the indexing and search parsers have been
rewritten.  There are better debugging options, enhanced output formats,
more document meta data (e.g. last modified date, document summary),
options for indexing from external data sources, and faster spidering,
just to name a few changes.  (See the \s-1CHANGES\s0 file for more information.)
.PP
Since so much effort has gone into version 2.2, support for previous
versions will probably be limited.
.PP
\fIAre there binary distributions available for Swish-e on platform foo?\fR
.IX Subsection "Are there binary distributions available for Swish-e on platform foo?"
.PP
Foo?  Well, yes there are some binary distributions available.  Please see
the Swish-e web site for a list at http://swish\-e.org/.
.PP
In general, it is recommended that you build Swish-e from source,
if possible.
.PP
\fIDo I need to reindex my site each time I upgrade to a new Swish-e version?\fR
.IX Subsection "Do I need to reindex my site each time I upgrade to a new Swish-e version?"
.PP
At times it might not strictly be necessary, but since you don't really
know if anything in the index has changed, it is a good rule to reindex.
.PP
\fIWhat's the advantage of using the libxml2 library for parsing \s-1HTML\s0?\fR
.IX Subsection "What's the advantage of using the libxml2 library for parsing HTML?"
.PP
Swish-e may be linked with libxml2, a library for working with \s-1HTML\s0 and \s-1XML\s0
documents.  Swish-e can use libxml2 for parsing \s-1HTML\s0 and \s-1XML\s0 documents.
.PP
The libxml2 parser is a better parser than Swish\-e's built-in \s-1HTML\s0
parser.  It offers more features, and it does a much better job of
extracting the text from a web page.  In addition, you can use the
\&\f(CW\*(C`ParserWarningLevel\*(C'\fR configuration setting to find structural errors
in your documents that could (and would with Swish\-e's \s-1HTML\s0 parser)
cause documents to be indexed incorrectly.
.PP
Libxml2 is not required, but is strongly recommended for parsing \s-1HTML\s0
documents.  It's also recommended for parsing \s-1XML\s0, as it offers many
more features than the internal Expat xml.c parser.
.PP
The internal \s-1HTML\s0 parser will have limited support, and does have a
number of bugs.  For example, \s-1HTML\s0 entities may not always be correctly
converted, and entities in properties are not converted at all.  The internal
parser tends to get confused when invalid \s-1HTML\s0 is parsed, whereas the libxml2
parser doesn't get confused as often.  The document structure is also better
detected with the libxml2 parser.
.PP
If you are using the Perl module (the Perl interface to the Swish-e C
library) you may wish to build two versions of Swish\-e, one with the
libxml2 library linked into the binary, and one without, and build the
Perl module against the library without the libxml2 code.  This
saves space in the library.  Hopefully, the library will someday soon be
split into indexing and searching code (volunteers welcome).
.PP
\fIDoes Swish-e include a \s-1CGI\s0 interface?\fR
.IX Subsection "Does Swish-e include a CGI interface?"
.PP
Yes.  Kind of.
.PP
There are two example \s-1CGI\s0 scripts included, swish.cgi and search.cgi.
Both are installed at \fI$prefix/lib/swish\-e\fR.
.PP
Both require a bit of work to set up and use.  Swish.cgi is probably what most
people will want to use as it contains more features.  Search.cgi is for those
who want to start with a small script and customize it to fit their needs.
.PP
An example of using swish.cgi is given in
the \s-1INSTALL\s0 man page, and in the swish.cgi documentation.
As is often the case, it will be easier to use if you first read the documentation.
.PP
Please use caution about \s-1CGI\s0 scripts found on the Internet for use with Swish\-e.
Some are not secure.
.PP
The included example \s-1CGI\s0 scripts were designed with security in mind.
Regardless, you are encouraged to have your local Perl expert review them
(and all other \s-1CGI\s0 scripts you use) before placing them into production.
This is just a good policy to follow.
.PP
\fIHow secure is Swish\-e?\fR
.IX Subsection "How secure is Swish-e?"
.PP
We know of no security issues with using Swish\-e.  Careful attention
has been paid to common security problems, such as buffer
overruns, when programming Swish\-e.
.PP
The most likely security issue with Swish-e is when it is run via
a poorly written \s-1CGI\s0 interface.  This is not limited to \s-1CGI\s0 scripts
written in Perl, as it's just as easy to write an insecure \s-1CGI\s0 script
in C, Java, \s-1PHP\s0, or Python.  A good source of information is included
with the Perl distribution.  Type \f(CW\*(C`perldoc perlsec\*(C'\fR at your local
prompt for more information.  Another must-read document is located at
\&\f(CW\*(C`http://www.w3.org/Security/faq/wwwsf4.html\*(C'\fR.
.PP
Note that there are many \fIfree\fR yet insecure and poorly written \s-1CGI\s0
scripts available \*(-- even some designed for use with Swish\-e.  Please
carefully review any \s-1CGI\s0 script you use.  Free is not such a good price
when you get your server hacked...
.PP
\fIShould I run Swish-e as the superuser (root)?\fR
.IX Subsection "Should I run Swish-e as the superuser (root)?"
.PP
No.  Never.
.PP
\fIWhat files does Swish-e write?\fR
.IX Subsection "What files does Swish-e write?"
.PP
Swish-e writes the index file, of course.  This is specified with the
\&\f(CW\*(C`IndexFile\*(C'\fR configuration directive or by the \f(CW\*(C`\-f\*(C'\fR command line switch.
.PP
The index file is actually a collection of files, but all start with
the file name specified with the \f(CW\*(C`IndexFile\*(C'\fR directive or the \f(CW\*(C`\-f\*(C'\fR
command line switch.
.PP
For example, the file ending in \fI.prop\fR contains the document properties.
.PP
When creating the index files Swish-e appends the extension \fI.temp\fR
to the index file names.  When indexing is complete Swish-e renames the
\&\fI.temp\fR files to the index files specified by \f(CW\*(C`IndexFile\*(C'\fR or \f(CW\*(C`\-f\*(C'\fR.
This is done so that existing indexes remain untouched until it completes
indexing.
.PP
Swish-e also writes temporary files in some cases during indexing
(e.g. \f(CW\*(C`\-S http\*(C'\fR, \f(CW\*(C`\-S prog\*(C'\fR with filters), when merging, and when
using \f(CW\*(C`\-e\*(C'\fR.  Temporary files are created with the \fImkstemp\fR\|(3) function
(with 0600 permission on unix-like operating systems).
.PP
The temporary files are created in the directory specified by the
environment variables \f(CW\*(C`TMPDIR\*(C'\fR and \f(CW\*(C`TMP\*(C'\fR, in that order.  If neither
is set, Swish-e uses the \f(CW\*(C`TmpDir\*(C'\fR configuration setting.  If that is
not set either, the temporary files are created in the current directory.
.PP
\fICan I index \s-1PDF\s0 and MS-Word documents?\fR
.IX Subsection "Can I index PDF and MS-Word documents?"
.PP
Yes, you can use a \fIFilter\fR to convert documents while indexing, or you
can use a program that \*(L"feeds\*(R" documents to Swish-e that have already
been converted.  See \f(CW\*(C`Indexing\*(C'\fR below.
.PP
\fICan I index documents on a web server?\fR
.IX Subsection "Can I index documents on a web server?"
.PP
Yes, Swish-e provides two ways to index (spider) documents on a web
server.  See \f(CW\*(C`Spidering\*(C'\fR below.
.PP
Swish-e can retrieve documents from a file system or from a remote web
server.  It can also execute a program that returns documents back
to it.  This program can retrieve documents from a database, filter
compressed documents files, convert \s-1PDF\s0 files, extract data from mail
archives, or spider remote web sites.
.PP
\fICan I implement keywords in my documents?\fR
.IX Subsection "Can I implement keywords in my documents?"
.PP
Yes, Swish-e can associate words with \fIMetaNames\fR while indexing,
and you can limit your searches to these MetaNames while searching.
.PP
In your \s-1HTML\s0 files you can put keywords in \s-1HTML\s0 \s-1META\s0 tags or in \s-1XML\s0 blocks.
.PP
\&\s-1META\s0 tags can have two formats in your source documents:
.PP
.Vb 1
\&    <META NAME="DC.subject" CONTENT="digital libraries">
.Ve
.PP
And in \s-1XML\s0 format (can also be used in \s-1HTML\s0 documents when using libxml2):
.PP
.Vb 3
\&    <meta2>
\&        Some Content
\&    </meta2>
.Ve
.PP
Then, to inform Swish-e about the existence of the meta name in your
documents, edit the line in your configuration file:
.PP
.Vb 1
\&    MetaNames DC.subject meta1 meta2
.Ve
.PP
When searching you can now limit some or all search terms to that
MetaName.  For example, you can look for documents that contain the word
apple and also have either fruit or cooking in the \s-1DC\s0.subject meta tag.
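.PP
A query along those lines might look like the following sketch, which assumes
the default index file in the current directory and uses the same
\&\f(CW\*(C`metaname=(words)\*(C'\fR search syntax that appears elsewhere in this \s-1FAQ\s0:
.PP
.Vb 1
\&    swish-e -w 'apple and DC.subject=(fruit or cooking)'
.Ve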
.PP
\fIWhat are document properties?\fR
.IX Subsection "What are document properties?"
.PP
A document property is typically data that describes the document.
For example, properties might include a document's path name, its last
modified date, its title, or its size.  Swish-e stores a document's
properties in the index file, and they can be reported back in search
results.
.PP
Swish-e also uses properties for sorting.  You may sort your results by
one or more properties, in ascending or descending order.
.PP
Properties can also be defined within your documents.  \s-1HTML\s0 and
\&\s-1XML\s0 files can specify tags (see previous question) as properties.
The \fIcontents\fR of these tags can then be returned with search results.
These user-defined properties can also be used for sorting search results.
.PP
For example, if you had the following in your documents
.PP
.Vb 1
\&   <meta name="creator" content="accounting department">
.Ve
.PP
and \f(CW\*(C`creator\*(C'\fR is defined as a property (see \f(CW\*(C`PropertyNames\*(C'\fR in
SWISH-CONFIG) Swish-e can return \f(CW\*(C`accounting department\*(C'\fR
with the result for that document.
.PP
.Vb 1
\&    swish-e -w foo -p creator
.Ve
.PP
Or for sorting:
.PP
.Vb 1
\&    swish-e -w foo -s creator
.Ve
.PP
\fIWhat's the difference between MetaNames and PropertyNames?\fR
.IX Subsection "What's the difference between MetaNames and PropertyNames?"
.PP
MetaNames allow keyword searches in your documents.  That is, you can
use MetaNames to restrict searches to just parts of your documents.
.PP
PropertyNames, on the other hand, define text that can be returned with
results, and can be used for sorting.
.PP
Both use \fImeta tags\fR found in your documents (as shown in the above two
questions) to define the text you wish to use as a property or meta name.
.PP
You may define a tag as \fBboth\fR a property and a meta name.  For example:
.PP
.Vb 1
\&   <meta name="creator" content="accounting department">
.Ve
.PP
placed in your documents and then using configuration settings of:
.PP
.Vb 2
\&    PropertyNames creator
\&    MetaNames creator
.Ve
.PP
will allow you to limit your searches to documents created by accounting:
.PP
.Vb 1
\&    swish-e -w 'foo and creator=(accounting)'
.Ve
.PP
That will find all documents with the word \f(CW\*(C`foo\*(C'\fR that also have a creator
meta tag that contains the word \f(CW\*(C`accounting\*(C'\fR.  This is using MetaNames.
.PP
And you can also say:
.PP
.Vb 1
\&    swish-e -w foo -p creator
.Ve
.PP
which will return all documents with the word \f(CW\*(C`foo\*(C'\fR, but the results will
also include the contents of the \f(CW\*(C`creator\*(C'\fR meta tag along with results.
This is using properties.
.PP
You can use properties and meta names at the same time, too:
.PP
.Vb 1
\&    swish-e -w 'creator=(accounting or marketing)' -p creator -s creator
.Ve
.PP
That searches only in the \f(CW\*(C`creator\*(C'\fR \fImeta name\fR for either of the words
\&\f(CW\*(C`accounting\*(C'\fR or \f(CW\*(C`marketing\*(C'\fR, prints out the contents
of the \f(CW\*(C`creator\*(C'\fR \fIproperty\fR, and sorts the results by the \f(CW\*(C`creator\*(C'\fR
\&\fIproperty name\fR.
.PP
(See also the \f(CW\*(C`\-x\*(C'\fR output format switch in SWISH-RUN.)
.PP
\fICan Swish-e index multi-byte characters?\fR
.IX Subsection "Can Swish-e index multi-byte characters?"
.PP
No.  This will require much work to change.  But Swish-e works with
eight-bit characters, so many character sets can be used.  Note that it
does call the ANSI-C \fItolower()\fR function, which depends on the current
locale setting.  See \f(CWlocale(7)\fR for more information.
.Sh "Indexing"
.IX Subsection "Indexing"
\fIHow do I pass Swish-e a list of files to index?\fR
.IX Subsection "How do I pass Swish-e a list of files to index?"
.PP
Currently, there is not a configuration directive to include a file that
contains a list of files to index.  But, there is a directive to include
another configuration file.
.PP
.Vb 1
\&    IncludeConfigFile /path/to/other/config
.Ve
.PP
And in \f(CW\*(C`/path/to/other/config\*(C'\fR you can say:
.PP
.Vb 2
\&    IndexDir file1 file2 file3 file4 file5 ...
\&    IndexDir file20 file21 file22
.Ve
.PP
You may also specify more than one configuration file on the command line:
.PP
.Vb 1
\&    ./swish-e -c config_one config_two config_three
.Ve
.PP
Another option is to create a directory with symbolic links to the files
to index, and index just that directory.
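.PP
As a rough sketch of the symbolic link approach, something like the
following could be used.  The directory name \fIto_index\fR, the file paths,
and \fIswish.conf\fR are placeholders.  Swish-e may need to be told to follow
symbolic links, for example with \f(CW\*(C`FollowSymLinks yes\*(C'\fR in the
configuration file; check SWISH-CONFIG for the setting in your version.
.PP
.Vb 4
\&    mkdir to_index
\&    ln -s /path/to/file1 to_index/file1
\&    ln -s /path/to/file2 to_index/file2
\&    ./swish-e -c swish.conf -i to_index
.Ve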
.PP
\fIHow does Swish-e know which parser to use?\fR
.IX Subsection "How does Swish-e know which parser to use?"
.PP
Swish-e can parse \s-1HTML\s0, \s-1XML\s0, and text documents.  The parser is set by
associating a file extension with a parser using the \f(CW\*(C`IndexContents\*(C'\fR
directive.  You may set the default parser with the \f(CW\*(C`DefaultContents\*(C'\fR
directive.  If a document is not assigned a parser it will default to
the \s-1HTML\s0 parser (\s-1HTML2\s0 if built with libxml2).
.PP
You may use Filters or an external program to convert documents to \s-1HTML\s0,
\&\s-1XML\s0, or text.
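.PP
For illustration only, a configuration along these lines maps file extensions
to parsers and sets a fallback parser.  The extensions chosen here are
assumptions, not defaults:
.PP
.Vb 4
\&    IndexContents HTML .htm .html
\&    IndexContents XML  .xml
\&    IndexContents TXT  .txt .log
\&    DefaultContents TXT
.Ve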
.PP
\fICan I reindex and search at the same time?\fR
.IX Subsection "Can I reindex and search at the same time?"
.PP
Yes.  Starting with version 2.2 Swish-e indexes to temporary files, and then
renames the files when indexing is complete.  On most systems renames
are atomic.  But, since Swish-e also generates more than one file during
indexing there will be a very short period of time between renaming the
various files when the index is out of sync.
.PP
Settings in \fIsrc/config.h\fR control some options related to temporary files,
and their use during indexing.
.PP
\fICan I index phrases?\fR
.IX Subsection "Can I index phrases?"
.PP
Phrases are indexed automatically.  To search for a phrase simply place
double quotes around the phrase.
.PP
For example:
.PP
.Vb 1
\&    swish-e -w 'free and "fast search engine"'
.Ve
.PP
\fIHow can I prevent phrases from matching across sentences?\fR
.IX Subsection "How can I prevent phrases from matching across sentences?"
.PP
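Use the \f(CW\*(C`BumpPositionCounterCharacters\*(C'\fR configuration directive.
When Swish-e sees one of the listed characters it bumps the word position
counter, so a phrase cannot match across that character.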
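.PP
A minimal sketch, assuming that sentences in your documents end with a period,
an exclamation mark, or a question mark (see SWISH-CONFIG for the exact
syntax in your version):
.PP
.Vb 1
\&    BumpPositionCounterCharacters .!?
.Ve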
.PP
\fISwish-e isn't indexing a certain word or phrase.\fR
.IX Subsection "Swish-e isn't indexing a certain word or phrase."
.PP
There are a number of configuration parameters that control what Swish-e
considers a \*(L"word\*(R" and it has a debugging feature to help pinpoint
any indexing problems.
.PP
Configuration file directives (SWISH-CONFIG)
\&\f(CW\*(C`WordCharacters\*(C'\fR, \f(CW\*(C`BeginCharacters\*(C'\fR, \f(CW\*(C`EndCharacters\*(C'\fR,
\&\f(CW\*(C`IgnoreFirstChar\*(C'\fR, and \f(CW\*(C`IgnoreLastChar\*(C'\fR are the main settings that
Swish-e uses to define a \*(L"word\*(R".  See SWISH-CONFIG and
SWISH-RUN for details.
.PP
Swish-e also uses compile-time defaults for many settings.  These are
located in the \fIsrc/config.h\fR file.
.PP
The command line arguments \f(CW\*(C`\-k\*(C'\fR, \f(CW\*(C`\-v\*(C'\fR and \f(CW\*(C`\-T\*(C'\fR are useful when
debugging these problems.  Using \f(CW\*(C`\-T INDEXED_WORDS\*(C'\fR while indexing will
display each word as it is indexed.  You should specify one file when
using this feature since it can generate a lot of output.
.PP
.Vb 1
\&     ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS
.Ve
.PP
You may also wish to index a single file that contains words that are or
are not being indexed as you expect, and then use \f(CW\*(C`\-T\*(C'\fR to output debugging
information about the index.  A useful command might be:
.PP
.Vb 1
\&    ./swish-e -f index.swish-e -T INDEX_FULL
.Ve
.PP
Once you see how Swish-e is parsing and indexing your words, you can
adjust the configuration settings mentioned above to control what words
are indexed.
.PP
Another useful command might be:
.PP
.Vb 1
\&     ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS
.Ve
.PP
This will show white-spaced words parsed from the document (\s-1PARSED_WORDS\s0),
and how those words are split up into separate words for indexing
(\s-1INDEXED_WORDS\s0).
.PP
\fIHow do I keep Swish-e from indexing numbers?\fR
.IX Subsection "How do I keep Swish-e from indexing numbers?"
.PP
Swish-e indexes words as defined by the \f(CW\*(C`WordCharacters\*(C'\fR setting, as
described above.  So to avoid indexing numbers you simply remove digits
from the \f(CW\*(C`WordCharacters\*(C'\fR setting.
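.PP
As a sketch, a setting limited to unaccented lowercase letters would look like
this.  The character list is only an example; in practice, start from your
current \f(CW\*(C`WordCharacters\*(C'\fR value and remove the digits from it, and review
\&\f(CW\*(C`BeginCharacters\*(C'\fR and \f(CW\*(C`EndCharacters\*(C'\fR as well:
.PP
.Vb 1
\&    WordCharacters abcdefghijklmnopqrstuvwxyz
.Ve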
.PP
There are also some settings in \fIsrc/config.h\fR that control what \*(L"words\*(R"
are indexed.  You can configure swish to never index words that are all
digits, vowels, or consonants, or that contain more than some consecutive
number of digits, vowels, or consonants.  In general, you won't need to
change these settings.
.PP
Also, there's an experimental feature called \f(CW\*(C`IgnoreNumberChars\*(C'\fR
which allows you to define a set of characters that describe a number.
If a word is made up of \fBonly\fR those characters it will not be indexed.
.PP
\fISwish-e crashes and burns on a certain file. What can I do?\fR
.IX Subsection "Swish-e crashes and burns on a certain file. What can I do?"
.PP
This shouldn't happen.  If it does, please post the details to the Swish-e
discussion list so the problem can be reproduced by the developers.
.PP
In the meantime, you can use a \f(CW\*(C`FileRules\*(C'\fR directive to exclude the
particular file by its file name, pathname, or title.  If there are serious
problems in indexing certain types of files, they may not have valid text
in them (they may be binary files, for instance).  You can use \f(CW\*(C`NoContents\*(C'\fR
to exclude that type of file.
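.PP
For example, rules of the following shape could skip one troublesome file
and keep Swish-e from indexing the contents of some binary file types.  The
file name and the extensions are placeholders, and the exact \f(CW\*(C`FileRules\*(C'\fR
syntax should be checked against SWISH-CONFIG for your version:
.PP
.Vb 2
\&    FileRules filename is problem.file
\&    NoContents .gif .jpg .exe
.Ve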
.PP
Swish-e will issue a warning if an embedded null character is found in a
document.  This warning will be an indication that you are trying to index
binary data.  If you need to index binary files try to find a program
that will extract out the text (e.g. \fIstrings\fR\|(1), \fIcatdoc\fR\|(1), \fIpdftotext\fR\|(1)).
.PP
\fIHow do I prevent indexing of some documents?\fR
.IX Subsection "How do I prevent indexing of some documents?"
.PP
When using the file system to index your files you can use the
\&\f(CW\*(C`FileRules\*(C'\fR directive.  Other than \f(CW\*(C`FileRules title\*(C'\fR, \f(CW\*(C`FileRules\*(C'\fR
only works with the file system (\f(CW\*(C`\-S fs\*(C'\fR) indexing method, not with
\&\f(CW\*(C`\-S prog\*(C'\fR or \f(CW\*(C`\-S http\*(C'\fR.
.PP
If you are spidering, use a \fIrobots.txt\fR file in your document root.
This is a standard way to exclude files from search engines, and is
fully supported by Swish\-e.  See http://www.robotstxt.org/
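.PP
A small \fIrobots.txt\fR sketch that keeps all robots out of one directory
(the \fI/private/\fR path is a placeholder):
.PP
.Vb 2
\&    User-agent: *
\&    Disallow: /private/
.Ve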
.PP
You can also modify the \fIspider.pl\fR spider perl program to skip, index
content only, or spider only listed web pages.  Type \f(CW\*(C`perldoc spider.pl\*(C'\fR
in the \f(CW\*(C`prog\-bin\*(C'\fR directory for details.
.PP
If using the libxml2 library for parsing \s-1HTML\s0, you may also use the Meta
Robots Exclusion in your documents:
.PP
.Vb 1
\&    <meta name="robots" content="noindex">
.Ve
.PP
See the obeyRobotsNoIndex directive.
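.PP
In the configuration file this might be enabled as follows.  This is a
sketch; see SWISH-CONFIG for the accepted values:
.PP
.Vb 1
\&    obeyRobotsNoIndex yes
.Ve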
.PP
\fIHow do I prevent indexing parts of a document?\fR
.IX Subsection "How do I prevent indexing parts of a document?"
.PP
To prevent Swish-e from indexing a common header, footer, or navigation
bar when you are using libxml2 for parsing \s-1HTML\s0, you may place
a fake \s-1HTML\s0 tag around the text you wish to ignore and list that tag in the
\&\f(CW\*(C`IgnoreMetaTags\*(C'\fR directive.  This will generate an error message if
the \f(CW\*(C`ParserWarningLevel\*(C'\fR is set, as it's invalid \s-1HTML\s0.
.PP
\&\f(CW\*(C`IgnoreMetaTags\*(C'\fR works with \s-1XML\s0 documents (and \s-1HTML\s0 documents when
using libxml2 as the parser), but not with documents parsed by the text
(\s-1TXT\s0) parser.
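.PP
As an illustration, with a made-up tag name of \f(CW\*(C`ignore\*(C'\fR, the document
would contain:
.PP
.Vb 3
\&    <ignore>
\&        navigation links, page footer, and so on
\&    </ignore>
.Ve
.PP
and the configuration file would name the tag:
.PP
.Vb 1
\&    IgnoreMetaTags ignore
.Ve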
.PP
If you are using the libxml2 parser (\s-1HTML2\s0 and \s-1XML2\s0) then you can use the following
comments in your documents to prevent indexing:
.PP
.Vb 2
\&       <!-- SwishCommand noindex -->
\&       <!-- SwishCommand index -->
.Ve
.PP
and/or these may be used also:
.PP
.Vb 2
\&       <!-- noindex -->
\&       <!-- index -->
.Ve
.PP
\fIHow do I modify the path or \s-1URL\s0 of the indexed documents?\fR
.IX Subsection "How do I modify the path or URL of the indexed documents?"
.PP
Use the \f(CW\*(C`ReplaceRules\*(C'\fR configuration directive to rewrite path names
and URLs.  If you are using \f(CW\*(C`\-S prog\*(C'\fR input method you may set the path
to any string.
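.PP
For instance, to turn file system paths into URLs, rules of this shape could
be used.  The directory and host name are placeholders, and the available
actions are described under \f(CW\*(C`ReplaceRules\*(C'\fR in SWISH-CONFIG:
.PP
.Vb 2
\&    ReplaceRules remove /var/www/htdocs
\&    ReplaceRules prepend http://www.example.com
.Ve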
.PP
\fIHow can I index data from a database?\fR
.IX Subsection "How can I index data from a database?"
.PP
Use the \*(L"prog\*(R" document source method of indexing.  Write a program to
extract out the data from your database, and format it as \s-1XML\s0, \s-1HTML\s0,
or text.  See the examples in the \f(CW\*(C`prog\-bin\*(C'\fR directory, and the next
question.
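.PP
A minimal sketch of such a program, written as a shell script, is shown
here.  It assumes the \f(CW\*(C`\-S prog\*(C'\fR convention of a \f(CW\*(C`Path\-Name\*(C'\fR and a
\&\f(CW\*(C`Content\-Length\*(C'\fR header, a blank line, and then the document body, as
described in SWISH-RUN.  The script name, the record paths, and the record
text are made up; a real program would read them from the database:
.PP
.Vb 12
\&    #!/bin/sh
\&    # db_export.sh: hypothetical example, emits one document per record
\&    emit_doc() {
\&        # $1 is the path to report, $2 is the text to index
\&        echo "Path-Name: $1"
\&        # ${#2} counts characters, so this assumes single-byte text
\&        echo "Content-Length: ${#2}"
\&        echo
\&        printf '%s' "$2"
\&    }
\&    emit_doc "db/record/1" "text of the first record"
\&    emit_doc "db/record/2" "text of the second record"
.Ve
.PP
It could then be run with:
.PP
.Vb 1
\&    ./swish-e -c my.conf -S prog -i ./db_export.sh
.Ve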
.PP
\fIHow do I index my \s-1PDF\s0, Word, and compressed documents?\fR
.IX Subsection "How do I index my PDF, Word, and compressed documents?"
.PP
Swish-e can internally only parse \s-1HTML\s0, \s-1XML\s0 and \s-1TXT\s0 (text) files by
default, but can make use of \fIfilters\fR that will convert other types
of files such as \s-1MS\s0 Word documents, \s-1PDF\s0, or gzipped files into one of
the file types that Swish-e understands.
.PP
Please see SWISH-CONFIG
and the examples in the \fIfilters\fR and \fIfilter-bin\fR directories for more information.
.PP
See the next question to learn about the filtering options with Swish\-e.
.PP
\fIHow do I filter documents?\fR
.IX Subsection "How do I filter documents?"
.PP
The term \*(L"filter\*(R" in Swish-e means the conversion of a document of one type (one that
Swish-e cannot index directly) into a type that Swish-e can index, namely \s-1HTML\s0, plain text, or \s-1XML\s0.
To add to the confusion, there are a number of ways to accomplish this in Swish\-e.
So here's a bit of background.
.PP
The FileFilter directive was the first filtering method added to Swish\-e.
This feature allows you to specify a program to run for documents that match a given file extension.
For example, to filter \s-1PDF\s0 files (files that end in .pdf) you can specify the configuration setting of:
.PP
.Vb 1
\&    FileFilter .pdf pdftotext   "'%p' -"
.Ve
.PP
which says to run the program \*(L"pdftotext\*(R" passing it the pathname of the file (%p)
and a dash (which tells pdftotext to output to stdout).   Then for each .pdf file Swish-e runs this
program and reads in the filtered document from the output from the filter program.
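.PP
Because pdftotext produces plain text, it may also be useful to tell Swish-e
which parser to use for the filtered output.  A sketch, assuming the text
parser is wanted for \fI.pdf\fR files:
.PP
.Vb 2
\&    FileFilter .pdf pdftotext "'%p' -"
\&    IndexContents TXT .pdf
.Ve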
.PP
This has the advantage that it is easy to set up: a single line in the config file is all that is
needed to add the filter into Swish\-e.  But it also has a number of problems.  For example,
if you use a Perl script to do your filtering it can be very slow since the filter script must be
run (and thus compiled) for each processed document.
This is exacerbated when using the \-S http method since the \-S http method also uses a Perl script
that is run for every \s-1URL\s0 fetched.  Also, when using the \-S prog method of input
(reading input from a program), using FileFilter means that Swish-e must first read the file
in from the external program and then write the file out to a temporary file before running the
filter.
.PP
With \-S prog it makes much more sense to filter the document in the program that is
fetching the documents than to have swish-e read the file into memory, write it to a temporary
file and then run an external program.
.PP
The Swish-e distribution contains a couple of example \-S prog programs.  \fIspider.pl\fR is a reasonably
full-featured web spider that offers many more options than the \-S http method.  And it is much faster
than running \-S http, too.
.PP
The spider has a perl configuration file, which means you can add programming logic right into the
configuration file without editing the spider program.  One bit of logic that is provided in the
spider's configuration file is a \*(L"call\-back\*(R" function that allows you to filter the content.
In other words, before the spider passes a fetched web document to swish for indexing the spider can call
a simple subroutine in the spider's configuration file passing the document and its content type.
The subroutine can then look at the content type and decide if the document needs to be filtered.
.PP
For example, when processing a document of type \*(L"application/msword\*(R" the call-back subroutine
might call the doc2txt.pm perl module, and a document of type
\&\*(L"application/pdf\*(R" could use the pdf2html.pm module.  The \fIprog\-bin/SwishSpiderConfig.pl\fR file
shows this usage.
.PP
This system works reasonably well, but also means that more work is required
to set up the filters.  First, you must explicitly check for specific content types and then call
the appropriate Perl module, and second, you have to know how each module must be called and how
each returns the possibly modified content.
.PP
In comes SWISH::Filter.
.PP
To make things easier the SWISH::Filter Perl module was created.  The idea of this module is that
there is one interface used to filter all types of documents.  So instead of checking for specific
types of content you just pass the content type and the document to the SWISH::Filter module and
it returns a new content type and document if it was filtered.  The filters that do the actual work
are designed with a standard interface and work like filter \*(L"plug\-ins\*(R". Adding new filters
means just downloading the filter to a directory, and no changes are needed to the spider's configuration
file.  Download a filter for PostScript and the next time you run indexing your PostScript files will be indexed.
.PP
Since the filters are standardized, hopefully when you have the need to filter documents of a specific
type there will already be a filter ready for your use.
.PP
Now, note that the perl modules may or may not do the actual conversion of a document.
For example, the \s-1PDF\s0 conversion
module calls the pdfinfo and pdftotext programs.  Those programs (part of the Xpdf package)
must be installed separately from the filters.
.PP
The SwishSpiderConfig.pl example spider configuration file shows how to use the SWISH::Filter module for filtering.
This file is installed at \f(CW$prefix\fR/share/doc/swish\-e/examples/prog\-bin, where \f(CW$prefix\fR is normally /usr/local on
unix-type machines.
.PP
The SWISH::Filter method of filtering can also be used with the \-S http method of indexing.  By default
the \fIswishspider\fR program (the Perl helper script that fetches documents from the web) will attempt to
use the SWISH::Filter module if it can be found in Perl's library path.  This path is set automatically for
spider.pl but not for swishspider (because it would slow down a method that's already slow and spider.pl is
recommended over the \-S http method).
.PP
Therefore, all that's required to use this system with \-S http is adding
the filter directory to Perl's \f(CW@INC\fR search path, for example by setting the \s-1PERL5LIB\s0 environment variable.
.PP
For example, if the swish-e distribution was unpacked into ~/swish\-e:
.PP
.Vb 1
\&   PERL5LIB=~/swish-e/filters swish-e -c conf -S http
.Ve
.PP
will allow the \-S http method to make use of the SWISH::Filter module.
.PP
Note that if you are not using the SWISH::Filter module you may wish to edit the \fIswishspider\fR program
and disable the use of the SWISH::Filter module using this setting:
.PP
.Vb 1
\&    use constant USE_FILTERS  => 0;  # disable SWISH::Filter
.Ve
.PP
This prevents the program from attempting to use the SWISH::Filter module for every non-text
\&\s-1URL\s0 that is fetched.  Of course, if you are concerned with indexing speed you should be using
the \-S prog method with spider.pl instead of \-S http.
.PP
If you are not spidering, but you still want to make use of the SWISH::Filter module for
filtering you can use the DirTree.pl program (in \f(CW$prefix\fR/lib/swish\-e).  This is a simple
program that traverses the file system and uses SWISH::Filter for filtering.
.PP
Here are two examples of how to run a filter program, one using Swish\-e's
\&\f(CW\*(C`FileFilter\*(C'\fR directive, another using a \f(CW\*(C`prog\*(C'\fR input method program.
See the \fISwishSpiderConfig.pl\fR file for an example of using the SWISH::Filter
module.
.PP
These examples simply use the program \f(CW\*(C`/bin/cat\*(C'\fR as a filter and
index only .html files.
.PP
First, using the \f(CW\*(C`FileFilter\*(C'\fR method, here's the entire configuration
file (swish.conf):
.PP
.Vb 3
\&    IndexDir .
\&    IndexOnly .html
\&    FileFilter .html "/bin/cat"   "'%p'"
.Ve
.PP
and index with the command
.PP
.Vb 1
\&    swish-e -c swish.conf -v 1
.Ve
.PP
Now, the same thing using the \f(CW\*(C`\-S prog\*(C'\fR document source input method
and a Perl program called catfilter.pl.  You can see that it's much
more work than using the \f(CW\*(C`FileFilter\*(C'\fR method above, but provides a
place to do additional processing.  In this example, the \f(CW\*(C`prog\*(C'\fR method
is only slightly faster.  But if you needed a perl script to run as a
FileFilter then \f(CW\*(C`prog\*(C'\fR will be significantly faster.
.PP
.Vb 3
\&    #!/usr/local/bin/perl -w
\&    use strict;
\&    use File::Find;  # for recursing a directory tree
.Ve
.PP
.Vb 5
\&    $/ = undef;
\&    find(
\&        { wanted => \e&wanted, no_chdir => 1, },
\&        '.',
\&    );
.Ve
.PP
.Vb 3
\&    sub wanted {
\&        return if -d;
\&        return unless /\e.html$/;
.Ve
.PP
.Vb 1
\&        my $mtime  = (stat)[9];
.Ve
.PP
.Vb 3
\&        my $child = open( FH, '-|' );
\&        die "Failed to fork $!" unless defined $child;
\&        exec '/bin/cat', $_ unless $child;
.Ve
.PP
.Vb 2
\&        my $content = <FH>;
\&        my $size = length $content;
.Ve
.PP
.Vb 4
\&        print <<EOF;
\&    Content-Length: $size
\&    Last-Mtime: $mtime
\&    Path-Name: $_
.Ve
.PP
.Vb 1
\&    EOF
.Ve
.PP
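.Vb 5
\&        # $content already holds the whole filter output (slurped above),
\&        # so print it rather than reading the exhausted handle again
\&        print $content;
\&        close FH;
\&    }
.Ve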
.PP
And index with the command:
.PP
.Vb 1
\&    swish-e -S prog -i ./catfilter.pl -v 1
.Ve
.PP
This example will probably not work under Windows due to the '\-|' open.
A simple piped open may work just as well:
.PP
That is, replace:
.PP
.Vb 3
\&    my $child = open( FH, '-|' );
\&    die "Failed to fork $!" unless defined $child;
\&    exec '/bin/cat', $_ unless $child;
.Ve
.PP
with this:
.PP
.Vb 1
\&    open( FH, "/bin/cat $_ |" ) or die $!;
.Ve
.PP
Perl will try to avoid running the command through the shell if meta
characters are not passed to the open.  See \f(CW\*(C`perldoc \-f open\*(C'\fR for
more information.
.PP
\fIEh, but I just want to know how to index \s-1PDF\s0 documents!\fR
.IX Subsection "Eh, but I just want to know how to index PDF documents!"
.PP
See the examples in the \fIconf\fR directory and the comments in
the \fISwishSpiderConfig.pl\fR file.
.PP
See the previous question for the details on filtering.  The method you decide to use
will depend on how fast you want to index, and your comfort level with using Perl modules.
.PP
Regardless of the filtering method you use you will need to install the Xpdf packages
available from http://www.foolabs.com/xpdf/.
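.PP
As a minimal sketch of the \f(CW\*(C`FileFilter\*(C'\fR approach, assuming the Xpdf \fIpdftotext\fR program is
in your path, a configuration fragment along these lines should pass each \s-1PDF\s0 file through
pdftotext and index the text it writes to standard output.  The \f(CW\*(C`IndexContents\*(C'\fR line, which
asks Swish-e to parse the filter output as plain text, is an assumption about how you want the
output handled; check SWISH-CONFIG before relying on it:
.PP
.Vb 4
\&    # hypothetical swish.conf fragment for indexing PDF files
\&    IndexOnly .pdf
\&    FileFilter .pdf pdftotext "'%p' -"
\&    IndexContents TXT* .pdf
.Ve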
.PP
\fII'm using Windows and can't get Filters or the prog input method to work!\fR
.IX Subsection "I'm using Windows and can't get Filters or the prog input method to work!"
.PP
Both the \f(CW\*(C`\-S prog\*(C'\fR input method and filters use the \f(CW\*(C`popen()\*(C'\fR system
call to run the external program.  If your external program is, for
example, a perl script, you have to tell Swish-e to run perl, instead of
the script.  Swish-e will convert forward slashes to backslashes
when running under Windows.
.PP
For example, you would need to specify the path to perl as (assuming
this is where perl is on your system):
.PP
.Vb 1
\&    IndexDir e:/perl/bin/perl.exe
.Ve
.PP
Or run a filter like:
.PP
.Vb 1
\&    FileFilter .foo e:/perl/bin/perl.exe 'myscript.pl "%p"'
.Ve
.PP
It's often easier to just install Linux.
.PP
\fIHow do I index non-English words?\fR
.IX Subsection "How do I index non-English words?"
.PP
Swish-e indexes 8\-bit characters only.  This is the \s-1ISO\s0 8859\-1 Latin\-1
character set, and includes many non-English letters (and symbols).
As long as they are listed in \f(CW\*(C`WordCharacters\*(C'\fR they will be indexed.
.PP
Actually, you probably can index any 8\-bit character set, as long as
you don't mix character sets in the same index and don't use libxml2 for
parsing (see below).
.PP
The \f(CW\*(C`TranslateCharacters\*(C'\fR directive (SWISH-CONFIG)
can translate characters while indexing and searching.  You may
specify the mapping of one character to another character with the
\&\f(CW\*(C`TranslateCharacters\*(C'\fR directive.
.PP
\&\f(CW\*(C`TranslateCharacters :ascii7:\*(C'\fR is a predefined set of characters that
will translate eight-bit characters to ascii7 characters.  Using the
\&\f(CW\*(C`:ascii7:\*(C'\fR rule will, for example, translate \*(L"\(:A\(:a\(,c\*(R" to \*(L"aac\*(R".  This means:
searching \*(L"\(,Celik\*(R", \*(L"\(,celik\*(R" or \*(L"celik\*(R" will all match the same word.
.PP
Note: When using libxml2 for parsing, parsed documents are converted
internally (within libxml2) to \s-1UTF\-8\s0.  This is converted to \s-1ISO\s0 8859\-1
Latin\-1 when indexing.  In cases where a string cannot be converted
from \s-1UTF\-8\s0 to \s-1ISO\s0 8859\-1 (because it contains non 8859\-1 characters),
the string will be sent to Swish-e in \s-1UTF\-8\s0 encoding.  This will result
in some words being indexed incorrectly.  Setting \f(CW\*(C`ParserWarningLevel\*(C'\fR to 1
or more will display warnings when \s-1UTF\-8\s0 to 8859\-1 conversion fails.
.PP
\fICan I add/remove files from an index?\fR
.IX Subsection "Can I add/remove files from an index?"
.PP
Try building swish-e with the \f(CW\*(C`\-\-enable\-incremental\*(C'\fR option.
.PP
The rest of this \s-1FAQ\s0 applies to the default swish-e format.
.PP
Swish-e currently has no way to add or remove items from
its index.  But, Swish-e indexes so quickly that it's often possible to
reindex the entire document set when a file needs to be added, modified or removed.
If you are spidering a remote site then consider caching documents locally compressed.
.PP
Incremental additions can be handled in a couple of ways, depending on
your situation.  It's probably easiest to create one main index every
night (or every week), and then create an index of just the new files
between main indexing jobs and use the \f(CW\*(C`\-f\*(C'\fR option to pass both indexes
to Swish-e while searching.
.PP
You can merge the indexes into one index (instead of using \-f), but it's
not clear that this has any advantage over searching multiple indexes.
.PP
How does one create the incremental index?
.PP
One method is by using the \f(CW\*(C`\-N\*(C'\fR switch to pass a file path to
Swish-e when indexing.  It will only index files that have a last
modification date \f(CW\*(C`newer\*(C'\fR than the file supplied with the \f(CW\*(C`\-N\*(C'\fR switch.
.PP
This option has the disadvantage that Swish-e must process every file in
every directory as if they were going to be indexed (the test for \f(CW\*(C`\-N\*(C'\fR
is done last, right before indexing of the file contents begins and after
all other tests on the file have been completed) \*(-- all that just to
find a few new files.
.PP
Also, if you use the Swish-e index file as the file passed to \f(CW\*(C`\-N\*(C'\fR there
may be files that were added after indexing was started, but before the
index file was written.  This could result in a file not being added to
the index.
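.PP
A rough sketch of the \f(CW\*(C`\-N\*(C'\fR approach follows.  The file names (\fIswish.conf\fR, \fImain.index\fR,
\&\fIincr.index\fR and the \fIlast\-full\fR timestamp file) are made up for illustration.  Touching a separate
timestamp file before the full run starts, rather than passing the index file itself to \f(CW\*(C`\-N\*(C'\fR,
is one possible way to avoid missing files that are added while indexing is in progress:
.PP
.Vb 4
\&    # full indexing job (nightly or weekly)
\&    touch last-full.new
\&    swish-e -c swish.conf -f main.index
\&    mv last-full.new last-full
.Ve
.PP
.Vb 2
\&    # incremental job: only files newer than the timestamp file
\&    swish-e -c swish.conf -N last-full -f incr.index
.Ve
.PP
.Vb 2
\&    # search both indexes
\&    swish-e -w foo -f main.index incr.index
.Ve
.PP
With this arrangement a file changed during the full run can appear in both indexes, so the
front end may still need to remove duplicate paths from the results.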
.PP
Another option is to maintain a parallel directory tree that contains
symlinks pointing to the main files.  When a new file is added (or
changed) to the main directory tree you create a symlink to the real file
in the parallel directory tree.  Then just index the symlink directory
to generate the incremental index.
.PP
This option has the disadvantage that you need to have a central
program that creates the new files that can also create the symlinks.
But, indexing is quite fast since Swish-e only has to look at the files
that need to be indexed.  When you run full indexing you simply unlink
(delete) all the symlinks.
.PP
Both of these methods have issues where files could end up in both
indexes, or files being left out of an index.  Use of file locks while
indexing, and hash lookups during searches can help prevent these
problems.
.PP
\fII run out of memory trying to index my files.\fR
.IX Subsection "I run out of memory trying to index my files."
.PP
It's true that indexing can take up a lot of memory!  Swish-e is extremely
fast at indexing, but that comes at the cost of memory.
.PP
The best answer is to install more memory.
.PP
Another option is to use the \f(CW\*(C`\-e\*(C'\fR switch.  This will require less memory,
but indexing will take longer as not all data will be stored in memory
while indexing.  How much less memory and how much more time depends on
the documents you are indexing, and the hardware that you are using.
.PP
Here's an example of indexing all .html files in /usr/doc on Linux.
This first example is \fIwithout\fR \f(CW\*(C`\-e\*(C'\fR and used about 84M of memory:
.PP
.Vb 3
\&    270279 unique words indexed.
\&    23841 files indexed.  177640166 total bytes.
\&    Elapsed time: 00:04:45 CPU time: 00:03:19
.Ve
.PP
This is \fIwith\fR \f(CW\*(C`\-e\*(C'\fR, and used about 26M of memory:
.PP
.Vb 3
\&    270279 unique words indexed.
\&    23841 files indexed.  177640166 total bytes.
\&    Elapsed time: 00:06:43 CPU time: 00:04:12
.Ve
.PP
You can also build a number of smaller indexes and then merge them together
with \f(CW\*(C`\-M\*(C'\fR.  Using \f(CW\*(C`\-e\*(C'\fR while merging will save memory.
.PP
Finally, if you do build a number of smaller indexes, you can specify more
than one index when searching by using the \f(CW\*(C`\-f\*(C'\fR switch.  Sorting large
result sets by a property will be slower when specifying multiple index
files while searching.
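.PP
For illustration, assuming two configuration files (\fIdocs.conf\fR and \fIweb.conf\fR, both hypothetical
names) that each cover part of the document set, the smaller indexes could be built, merged
and searched roughly like this:
.PP
.Vb 3
\&    # build each part in economy mode
\&    swish-e -e -c docs.conf -f docs.index
\&    swish-e -e -c web.conf -f web.index
.Ve
.PP
.Vb 2
\&    # merge: input indexes first, output index last
\&    swish-e -e -M docs.index web.index merged.index
.Ve
.PP
.Vb 2
\&    # or skip the merge and search both indexes directly
\&    swish-e -w foo -f docs.index web.index
.Ve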
.PP
\fI\*(L"too many open files\*(R" when indexing with \-e option\fR
.IX Subsection "too many open files when indexing with -e option"
.PP
Some platforms report \*(L"too many open files\*(R" when using the \-e economy option.
The \-e feature uses many temporary files (something like 377) plus 
the index files
and this may exceed your system's limits.
.PP
Depending on your platform you may need to set \*(L"ulimit\*(R" or \*(L"unlimit\*(R".
.PP
For example, under Linux bash shell:
.PP
.Vb 1
\&  $ ulimit -n 1024
.Ve
.PP
Or under an old Sparc:
.PP
.Vb 1
\&  % unlimit openfiles
.Ve
.PP
\fIMy system admin says Swish-e uses too much of the \s-1CPU\s0!\fR
.IX Subsection "My system admin says Swish-e uses too much of the CPU!"
.PP
That's a good thing!  That expensive \s-1CPU\s0 is supposed to be busy.
.PP
Indexing takes a lot of work \*(-- to make indexing fast much of the work is
done in memory which reduces the amount of time Swish-e is waiting on I/O.
But, there are two things you can try:
.PP
The \f(CW\*(C`\-e\*(C'\fR option will run Swish-e in economy mode, which uses the disk
to store data while indexing.  This makes Swish-e run somewhat slower,
but also uses less memory.  Since it is writing to disk more often it
will be spending more time waiting on I/O and less time in \s-1CPU\s0.  Maybe.
.PP
The other thing is to simply lower the priority of the job using the
\&\fInice\fR\|(1) command:
.PP
.Vb 1
\&    nice -n 15 swish-e -c search.conf
.Ve
.PP
If you are concerned about search time, make sure you are using the \-b and \-m
switches to return only one page of results at a time.  If you know that your result
sets will be large, that you want to return results one page at a
time, and that many pages of the same query will often be requested,
it may be smarter to request all the documents on the first request, and
then cache the results to a temporary file.  The perl module File::Cache
makes this very simple to accomplish.
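.PP
As a small sketch of paging with \-b and \-m, assuming a page size of 20 results and the default
index file in the current directory (the query word is a placeholder):
.PP
.Vb 2
\&    swish-e -w foo -b 1 -m 20     # first page: results 1 to 20
\&    swish-e -w foo -b 21 -m 20    # second page: results 21 to 40
.Ve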
.Sh "Spidering"
.IX Subsection "Spidering"
\fIHow can I index documents on a web server?\fR
.IX Subsection "How can I index documents on a web server?"
.PP
If possible, use the file system method \f(CW\*(C`\-S fs\*(C'\fR of indexing to index
documents in your web area of the file system.  This avoids the overhead
of spidering a web server and is much faster.  (\f(CW\*(C`\-S fs\*(C'\fR is the default
method if \f(CW\*(C`\-S\*(C'\fR is not specified).
.PP
If this is impossible (the web server is not local, or documents are dynamically
generated), Swish-e provides two methods of spidering. First, it includes the http method
of indexing \f(CW\*(C`\-S http\*(C'\fR. A number of special configuration directives are available that
control spidering (see \*(L"Directives for the \s-1HTTP\s0 Access Method Only\*(R" in SWISH-CONFIG).  A perl helper
script (swishspider) is included in the \fIsrc\fR directory to assist with spidering web
servers. There are example configurations for spidering in the \fIconf\fR directory.
.PP
As of Swish-e 2.2, there's a general purpose \*(L"prog\*(R" document source where
a program can feed documents to it for indexing.  A number of example
programs can be found in the \f(CW\*(C`prog\-bin\*(C'\fR directory, including a program
to spider web servers.  The provided spider.pl program is full-featured
and is easily customized.
.PP
The advantage of the \*(L"prog\*(R" document source feature over the \*(L"http\*(R" method
is that the program is only executed one time, where the swishspider
program used in the \*(L"http\*(R" method is executed once for every document
read from the web server.  The forking of Swish-e and compiling of the
perl script can be quite expensive, time\-wise.
.PP
The other advantage of the \f(CW\*(C`spider.pl\*(C'\fR program is that it's simple and
efficient to add filtering (such as for \s-1PDF\s0 or \s-1MS\s0 Word docs) right into
the spider.pl's configuration, and it includes features such as \s-1MD5\s0 checks
to prevent duplicate indexing, options to avoid spidering some files,
or index but avoid spidering.  And since it's a perl program there's no
limit on the features you can add.
.PP
\fIWhy does swish report \*(L"./swishspider: not found\*(R"?\fR
.IX Subsection "Why does swish report ./swishspider: not found?"
.PP
Does the file \fIswishspider\fR exist where the error message displays?  If not, either
set the configuration option SpiderDirectory
to point to the directory where the \fIswishspider\fR program is found, or place the
\&\fIswishspider\fR program in the current directory when running swish\-e.
.PP
If you are running Windows, make sure \*(L"perl\*(R" is in your path.  Try typing \fIperl\fR from
a command prompt.
.PP
If you are not running Windows, make sure that the shebang line (the first line of the
swishspider program that starts with #!) points to the correct location of perl.
Typically this will be \fI/usr/bin/perl\fR or \fI/usr/local/bin/perl\fR.  Also, make sure that
you have execute and read permissions on \fIswishspider\fR.
.PP
The \fIswishspider\fR perl script is only used with the \-S http method of indexing.
.PP
\fII'm using the spider.pl program to spider my web site, but some large files are not indexed.\fR
.IX Subsection "I'm using the spider.pl program to spider my web site, but some large files are not indexed."
.PP
The \f(CW\*(C`spider.pl\*(C'\fR program has a default limit of 5MB file size.  This can
be changed with the \f(CW\*(C`max_size\*(C'\fR parameter setting.  See \f(CW\*(C`perldoc
spider.pl\*(C'\fR for more information.
.PP
\fII still don't think all my web pages are being indexed.\fR
.IX Subsection "I still don't think all my web pages are being indexed."
.PP
The \fIspider.pl\fR program has a number of debugging switches and can be
quite verbose in telling you what's happening, and why.  See \f(CW\*(C`perldoc
spider.pl\*(C'\fR for instructions.
.PP
\fISwish is not spidering Javascript links!\fR
.IX Subsection "Swish is not spidering Javascript links!"
.PP
Swish cannot follow links generated by Javascript, as they are generated
by the browser and are not part of the document.
.PP
\fIHow do I spider other websites and combine it with my own (filesystem) index?\fR
.IX Subsection "How do I spider other websites and combine it with my own (filesystem) index?"
.PP
You can either merge two indexes into a single index with \f(CW\*(C`\-M\*(C'\fR, or use \f(CW\*(C`\-f\*(C'\fR
to specify more than one index while searching.
.PP
You will have better results with the \f(CW\*(C`\-f\*(C'\fR method.
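.PP
A sketch of the \f(CW\*(C`\-f\*(C'\fR approach, under the assumption that \fIfs.conf\fR indexes your local files and
\&\fIspider.conf\fR is a \f(CW\*(C`\-S prog\*(C'\fR configuration that runs a copy of spider.pl from the current
directory.  The file names, the index names and the \s-1URL\s0 are all placeholders:
.PP
.Vb 3
\&    # spider.conf (hypothetical)
\&    IndexDir ./spider.pl
\&    SwishProgParameters default http://www.example.com/
.Ve
.PP
.Vb 3
\&    swish-e -c fs.conf -f fs.index
\&    swish-e -S prog -c spider.conf -f web.index
\&    swish-e -w foo -f fs.index web.index
.Ve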
.Sh "Searching"
.IX Subsection "Searching"
\fIHow do I limit searches to just parts of the index?\fR
.IX Subsection "How do I limit searches to just parts of the index?"
.PP
If you can identify \*(L"parts\*(R" of your index by the path name you have
two options.
.PP
The first option is by indexing the document path.  Add this to your
configuration:
.PP
.Vb 1
\&    MetaNames swishdocpath
.Ve
.PP
Now you can search for words or phrases in the path name:
.PP
.Vb 1
\&    swish-e -w 'foo AND swishdocpath=(sales)'
.Ve
.PP
So that will only find documents with the word \*(L"foo\*(R" and where the file's
path contains \*(L"sales\*(R".  That might not work as well as you like, though,
as both of these paths will match:
.PP
.Vb 2
\&    /web/sales/products/index.html
\&    /web/accounting/private/sales_we_messed_up.html
.Ve
.PP
This can be solved by searching with a phrase (assuming \*(L"/\*(R" is not
a WordCharacter):
.PP
.Vb 2
\&    swish-e -w 'foo AND swishdocpath=("/web/sales/")'
\&    swish-e -w 'foo AND swishdocpath=("web sales")'  (same thing)
.Ve
.PP
The second option is a bit more powerful.  With the \f(CW\*(C`ExtractPath\*(C'\fR
directive you can use a regular expression to extract out a sub-set of
the path and save it as a separate meta name:
.PP
.Vb 2
\&    MetaNames department
\&    ExtractPath department regex !^/web/([^/]+).+$!$1/
.Ve
.PP
Which says match a path that starts with \*(L"/web/\*(R" and extract out
everything after that up to, but not including the next \*(L"/\*(R" and save it in
variable \f(CW$1\fR, and then match everything from the \*(L"/\*(R" onward.  Then replace
the entire matched string with \f(CW$1\fR.  And that gets indexed as meta name
\&\*(L"department\*(R".
.PP
Now you can search like:
.PP
.Vb 1
\&    swish-e -w 'foo AND department=sales'
.Ve
.PP
and be sure that you will only match the documents in the /web/sales/*
path.  Note that you can map completely different areas of your file
system to the same metaname:
.PP
.Vb 3
\&    # flag the marketing specific pages
\&    ExtractPath department regex !^/web/(marketing|sales)/.+$!marketing/
\&    ExtractPath department regex !^/internal/marketing/.+$!marketing/
.Ve
.PP
.Vb 2
\&    # flag the technical departments pages
\&    ExtractPath department regex !^/web/(tech|bugs)/.+$!tech/
.Ve
.PP
Finally, if you have something more complicated, use \f(CW\*(C`\-S prog\*(C'\fR and
write a perl program or use a filter to set a meta tag when processing
each file.
.PP
\fIHow is ranking calculated?\fR
.IX Subsection "How is ranking calculated?"
.PP
The \f(CW\*(C`swishrank\*(C'\fR property value is calculated based on which Ranking Scheme (or algorithm)
you have selected. In this discussion, any time the word \fBfancy\fR is used, you should
consult the actual code for more details. It is open source, after all.
.PP
Things you can do to affect ranking:
.IP "MetaRankBias" 4
.IX Item "MetaRankBias"
You may configure your index to bias certain metaname values more or less than others.
See the \f(CW\*(C`MetaRankBias\*(C'\fR configuration option in SWISH-CONFIG.
.IP "IgnoreTotalWordCountWhenRanking" 4
.IX Item "IgnoreTotalWordCountWhenRanking"
Set to 1 (default) or 0 in your config file. See SWISH-CONFIG.
\&\fB\s-1NOTE:\s0\fR You must set this to 0 to use the \s-1IDF\s0 Ranking Scheme.
.IP "structure" 4
.IX Item "structure"
Each term's position in each \s-1HTML\s0 document is given a structure value based on the context
in which the word appears. The structure value is used to artificially inflate
the frequency of each term in that particular document.
These structural values are defined in \fIconfig.h\fR:
.Sp
.Vb 5
\& #define RANK_TITLE             7
\& #define RANK_HEADER            5
\& #define RANK_META              3
\& #define RANK_COMMENTS          1
\& #define RANK_EMPHASIZED        0
.Ve
.Sp
For example, if the word \f(CW\*(C`foo\*(C'\fR appears in the title of a document, the Scheme
will treat that document as if \f(CW\*(C`foo\*(C'\fR appeared 7 additional times.
.PP
All Schemes share the following characteristics:
.IP "\s-1AND\s0 searches" 4
.IX Item "AND searches"
The rank value is averaged for all \s-1AND\s0'd terms. Terms within a set of parentheses () are
averaged as a single term (this is an acknowledged weakness and is on the \s-1TODO\s0 list).
.IP "\s-1OR\s0 searches" 4
.IX Item "OR searches"
The rank value is summed and then doubled for each pair of \s-1OR\s0'd terms. This results
in higher ranks for documents that have multiple \s-1OR\s0'd terms.
.IP "scaled rank" 4
.IX Item "scaled rank"
After a document's raw rank score is calculated, a final rank score is calculated using a
fancy \f(CW\*(C`log()\*(C'\fR function. All the documents are then scaled against a base score of 1000.
The top-ranked document will therefore always have a \f(CW\*(C`swishrank\*(C'\fR value of 1000.
.PP
Here is a brief overview of how the different Schemes work. The number in parentheses after
the name is the value to invoke that scheme with \f(CW\*(C`swish\-e \-R\*(C'\fR or \f(CW\*(C`RankScheme()\*(C'\fR.
.IP "Default (0)" 4
.IX Item "Default (0)"
The default ranking scheme considers the number of times a term appears in a 
document (frequency), the MetaRankBias and the structure value. The rank might be summarized
as:
.Sp
.Vb 1
\& DocRank = Sum of ( structure + metabias )
.Ve
.Sp
Consider this output with the \s-1DEBUG_RANK\s0 variable set at compile time: 
.Sp
.Vb 12
\& Ranking Scheme: 0 
\& Word entry 0 at position 6 has struct 7
\& Word entry 1 at position 64 has struct 41
\& Word entry 2 at position 71 has struct 9
\& Word entry 3 at position 132 has struct 9
\& Word entry 4 at position 154 has struct 9
\& Word entry 5 at position 423 has struct 73
\& Word entry 6 at position 541 has struct 73
\& Word entry 7 at position 662 has struct 73
\& File num: 1104.  Raw Rank: 21.  Frequency: 8 scaled rank: 30445
\&  Structure tally:
\&  struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8
.Ve
.Sp
.Vb 1
\&  struct 0x9 = count of 3 ( BODY FILE ) x rank map of 1 = 3
.Ve
.Sp
.Vb 1
\&  struct 0x29 = count of 1 ( HEADING BODY FILE ) x rank map of 6 = 6
.Ve
.Sp
.Vb 1
\&  struct 0x49 = count of 3 ( EM BODY FILE ) x rank map of 1 = 3
.Ve
.Sp
Every word instance starts with a base score of 1.
Then for each instance of your word, a running
sum is taken of the structural value of that word position plus any bias you've configured.
In the example above, the raw rank is \f(CW\*(C`1 + 8 + 3 + 6 + 3 = 21\*(C'\fR.
.Sp
Consider this line:
.Sp
.Vb 1
\&  struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8
.Ve
.Sp
That means there was one instance of our word in the title of the file.
Its context was in the <head> tagset, inside the <title>.
The <title> is the most specific structure, so it gets the
\&\s-1RANK_TITLE\s0 score: 7. The base rank of 1 plus the structure score of 7 equals 8. If there
had been two instances of this word in the title, then the score would have been \f(CW\*(C`8 + 8 = 16\*(C'\fR.
.IP "\s-1IDF\s0 (1)" 4
.IX Item "IDF (1)"
\&\s-1IDF\s0 is short for Inverse Document Frequency. That's fancy ranking lingo for taking into
account the total frequency of a term across the entire index, in addition to the term's
frequency in a single document. \s-1IDF\s0 ranking also uses the relative density of a word in a
document to judge its relevancy. Words that appear more often in a doc make that doc's rank
higher, and longer docs are not weighted higher than shorter docs.
.Sp
The \s-1IDF\s0 Scheme might be summarized as:
.Sp
.Vb 1
\&  DocRank = Sum of ( density * idf * ( structure + metabias ) )
.Ve
.Sp
Consider this output from \s-1DEBUG_RANK:\s0
.Sp
.Vb 17
\& Ranking Scheme: 1 
\& File num: 1104  Word Score: 1  Frequency: 8  Total files: 1451   
\& Total word freq: 108   IDF: 2564  
\& Total words: 1145877   Indexed words in this doc: 562   
\& Average words: 789   Density: 1120    Word Weight: 28716   
\& Word entry 0 at position 6 has struct 7
\& Word entry 1 at position 64 has struct 41
\& Word entry 2 at position 71 has struct 9
\& Word entry 3 at position 132 has struct 9
\& Word entry 4 at position 154 has struct 9
\& Word entry 5 at position 423 has struct 73
\& Word entry 6 at position 541 has struct 73
\& Word entry 7 at position 662 has struct 73
\& Rank after IDF weighting: 574321  
\& scaled rank: 132609
\&  Structure tally:
\&  struct 0x7 = count of  1 ( HEAD TITLE FILE ) x rank map of 8 = 8
.Ve
.Sp
.Vb 1
\&  struct 0x9 = count of  3 ( BODY FILE ) x rank map of 1 = 3
.Ve
.Sp
.Vb 1
\&  struct 0x29 = count of  1 ( HEADING BODY FILE ) x rank map of 6 = 6
.Ve
.Sp
.Vb 1
\&  struct 0x49 = count of  3 ( EM BODY FILE ) x rank map of 1 = 3
.Ve
.Sp
It is similar to the default Scheme, but notice how the total number of files in the index
and the total word frequency (as opposed to the document frequency) are both part of the
equation.
.PP
Ranking is a complicated subject. SWISH-E allows for more Ranking Schemes to be developed
and experimented with, using the \-R option (from the swish-e command) and the \f(CW\*(C`RankScheme()\*(C'\fR
function (see the \s-1API\s0 documentation). Experiment and share your findings via the discussion list.
.PP
\fIHow can I limit searches to the title, body, or comment?\fR
.IX Subsection "How can I limit searches to the title, body, or comment?"
.PP
Use the \f(CW\*(C`\-t\*(C'\fR switch.
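.PP
For illustration, and assuming the context letters documented for \f(CW\*(C`\-t\*(C'\fR in SWISH-RUN
(H head, B body, t title, h headings, e emphasized, c comments), searches limited to the
title, body, or comments might look like this:
.PP
.Vb 3
\&    swish-e -w foo -t t     # title only
\&    swish-e -w foo -t B     # body only
\&    swish-e -w foo -t c     # comments only
.Ve
.PP
A comment search can only match if comments were indexed in the first place.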
.PP
\fII can't limit searches to title/body/comment.\fR
.IX Subsection "I can't limit searches to title/body/comment."
.PP
Or, \fII can't search with meta names, all the names are indexed as
\&\*(L"plain\*(R".\fR
.PP
Check in the config.h file if #define \s-1INDEXTAGS\s0 is set to 1. If it is,
change it to 0, recompile, and index again.  When \s-1INDEXTAGS\s0 is 1, \s-1ALL\s0
the tags are indexed as plain text, that is you index \*(L"title\*(R", \*(L"h1\*(R", and
so on, \s-1AND\s0 they lose their indexing meaning.  If \s-1INDEXTAGS\s0 is set to 0,
you will still index meta tags and comments, unless you have indicated
otherwise in the user config file with the IndexComments directive.
.PP
Also, check for the \f(CW\*(C`UndefinedMetaTags\*(C'\fR setting in your configuration
file.
.PP
\fII've tried running the included \s-1CGI\s0 script and I get an \*(L"Internal Server Error\*(R"\fR
.IX Subsection "I've tried running the included CGI script and I get an Internal Server Error"
.PP
Debugging \s-1CGI\s0 scripts is beyond the scope of this document.
Internal Server Error basically means \*(L"check the web server's log for
an error message\*(R", as it can mean a bad shebang (#!) line, a missing
perl module, \s-1FTP\s0 transfer error, or simply an error in the program.
The \s-1CGI\s0 script \fIswish.cgi\fR in the \fIexample\fR directory contains some
debugging suggestions.  Type \f(CW\*(C`perldoc swish.cgi\*(C'\fR for information.
.PP
There are also many, many \s-1CGI\s0 FAQs available on the Internet.  A quick web
search should offer help.  As a last resort you might ask your webadmin
for help...
.PP
\fIWhen I try to view the swish.cgi page I see the contents of the Perl program.\fR
.IX Subsection "When I try to view the swish.cgi page I see the contents of the Perl program."
.PP
Your web server is not configured to run the program as a \s-1CGI\s0 script.
This problem is described in \f(CW\*(C`perldoc swish.cgi\*(C'\fR.
.PP
\fIHow do I make Swish-e highlight words in search results?\fR
.IX Subsection "How do I make Swish-e highlight words in search results?"
.PP
Short answer:
.PP
Use the supplied swish.cgi or search.cgi scripts located in the \fIexample\fR directory.
.PP
Long answer:
.PP
Swish-e can't because it doesn't have access to the source documents when
returning results, of course.  But a front-end program of your creation
can highlight terms.  Your program can open up the source documents and
then use regular expressions to replace search terms with highlighted
or bolded words.
.PP
But, that will fail with all but the most simple source documents.
For \s-1HTML\s0 documents, for example, you must parse the document into words
and tags (and comments).  A word you wish to highlight may span multiple
\&\s-1HTML\s0 tags, or be a word in a \s-1URL\s0 and you wish to highlight the entire
link text.
.PP
Perl modules such as HTML::Parser and XML::Parser make word extraction
possible.  Next, you need to consider that Swish-e uses settings such
as WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar,
and IgnoreLastChar to define a \*(L"word\*(R".  That is, you can't consider
that a string of characters with white space on each side is a word.
.PP
Then things like TranslateCharacters, and \s-1HTML\s0 Entities may transform a
source word into something else, as far as Swish-e is concerned.  Finally,
searches can be limited by metanames, so you may need to limit your
highlighting to only parts of the source document.  Throw phrase searches
and stopwords into the equation and you can see that it's not a trivial
problem to solve.
.PP
All hope is not lost, though, as Swish-e does provide some help.
Using the \f(CW\*(C`\-H\*(C'\fR option it will return in the headers the current index
(or indexes) settings for WordCharacters (and others) required to parse
your source documents as it parses them during indexing, and will return a
\&\*(L"Parsed Words:\*(R" header that will show how it parsed the query internally.
If you use fuzzy indexing (word stemming, soundex, or metaphone)
then you will also need to stem each word in your
document before comparing with the \*(L"Parsed Words:\*(R" returned by Swish\-e.
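.PP
The comparison step might be sketched in Python as follows.  The sample
header line is made up for illustration and real output contains more
headers; the function names and the stem parameter are assumptions, with
stem standing in for a real stemmer.  The header may also carry query
syntax that this sketch does not filter out.
.PP
.Vb 20
```python
def parsed_words(output, marker="Parsed Words:"):
    """Return the words listed after the first marker found in output."""
    for line in output.splitlines():
        if marker in line:
            # Everything after the marker is a space separated word list.
            return line.split(marker, 1)[1].split()
    return []

def should_highlight(doc_word, query_words, stem=lambda w: w):
    """True if the (stemmed, lowercased) document word is a query word."""
    # The default stem is an identity placeholder, not a real stemmer.
    return stem(doc_word.lower()) in query_words

# Hypothetical header line; check real output from your own index.
sample = "# Parsed Words: run dog"
words = parsed_words(sample)
print(words)
print(should_highlight("Dog", words))
```
.Ve
.PP
With a fuzzy index, pass a stemming function as stem so that document
words are reduced the same way the indexed words were.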
.PP
The Swish-e stemming code is available through the Swish-e
Perl module (\s-1SWISH::API\s0), through the C library (included with the swish-e distribution),
or through the SWISH::Stemmer module available on \s-1CPAN\s0.  Also on \s-1CPAN\s0 is
the module Text::DoubleMetaphone.  Using \s-1SWISH::API\s0 probably provides the best
stemming support.
.PP
\fIDo filters affect the performance during search?\fR
.IX Subsection "Do filters affect the performance during search?"
.PP
No.  Filters (FileFilter or the \*(L"prog\*(R" method) are used only when building
the search index database.  No filters are called during search
requests.
.Sh "I have read the \s-1FAQ\s0 but I still have questions about using Swish\-e."
.IX Subsection "I have read the FAQ but I still have questions about using Swish-e."
The Swish-e discussion list is the place to go; see http://swish\-e.org/.
Please do not email developers directly.  The list is the best place to
ask questions.
.PP
Before you post, please read \fI\s-1QUESTIONS\s0 \s-1AND\s0 \s-1TROUBLESHOOTING\s0\fR located
in the \s-1INSTALL\s0 page.  You should also search the Swish-e
discussion list archive, which can be found on the swish-e web site.
.PP
In short, be sure to include the following when asking for help.
.IP "* The swish-e version (./swish\-e \-V)" 4
.IX Item "The swish-e version (./swish-e -V)"
.PD 0
.IP "* What you are indexing (and perhaps a sample), and the number of files" 4
.IX Item "What you are indexing (and perhaps a sample), and the number of files"
.IP "* Your Swish-e configuration file" 4
.IX Item "Your Swish-e configuration file"
.IP "* Any error messages that Swish-e is reporting" 4
.IX Item "Any error messages that Swish-e is reporting"
.PD
.SH "Document Info"
.IX Header "Document Info"
$Id: \s-1SWISH\-FAQ\s0.pod,v 1.36 2004/10/04 22:49:35 whmoseley Exp $
.PP
\&.