<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>SWISH Bug Fixes and Enhancements - Digital Library SunSITE</TITLE>
</HEAD>
<BODY VLINK ="#FF0000" LINK="#FF0000" ALINK="#FF0000" BGCOLOR="#FFFFFF">
<P ALIGN="CENTER"><A HREF="/cgi-bin/imagemap/newhead">
<IMG BORDER="0" ALT="Berkeley Digital Library SunSITE"
SRC="/Images/newhead.gif" WIDTH="510" HEIGHT="50" ISMAP></A></P>
<P ALIGN="CENTER">
<A HREF="/SWISH-E/"><IMG ALT="SWISH-E" WIDTH="112"
HEIGHT="49" BORDER="0" SRC="/Images/swish-e.gif"></A><BR><IMG
SRC="/Images/swishbanner2.gif"></P>
<P ALIGN="CENTER"><IMG ALT="" SRC="/Images/dotrule1.gif"></P>
<H1 ALIGN="CENTER">SWISH Bug Fixes and
Enhancements</H1>
<H3>Bug Fixes</H3>
<P>The following bugs have been fixed in SWISH-E:</P>
<DL>
<DT>Wild card *
<DD>Problem before the fix: in a multiple-word search, the results varied
with the position of the term containing the asterisk in the query.
<DT>Merge option -M
<DD>Problem before the fix: the merged file was not created in the right
format; consequently, any search on that index would cause swish to hang.
<DT>Unary operator "not"
<DD>Problem before the fix: unreliable results.
<DT>Explicit nested boolean
<DD>Problem before the fix: unreliable results.
</DL>
<H3>New Features</H3>
<PRE>
- Ignore specified characters in final position
--------------------------------------------------
It is sometimes convenient to treat certain characters as normal
characters in the middle of a word but to disregard them in final
position. To enable this option, the config.h file should contain
the following lines:
#define IGNORELAST 1
#define IGNORELASTCHAR "<list of char>"
For example, if "." is listed in the IGNORELASTCHAR variable, words
will be indexed as follows:
Word        Indexed as
z39.50      z39.50
z39.50.     z39.50
Note that the characters listed in the IGNORELASTCHAR variable must
also be listed in the ENDCHARS variable; otherwise the word is
discarded as invalid. The characters in the list are written in
sequence within the quotes, with no separators between them.
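As a rough illustration of the rule above, the following C sketch strips
trailing characters from a word. The function name strip_ignored_last is
hypothetical and is not taken from the SWISH-E source. It also assumes
that every trailing character found in the list is removed, not only one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of SWISH-E: removes every trailing
** character of `word` that appears in `ignore_chars`, which plays the
** role of IGNORELASTCHAR. The word is modified in place.
*/
void strip_ignored_last(char *word, const char *ignore_chars)
{
    size_t len = strlen(word);

    /* Walk back from the end while the last character is in the list. */
    while (len > 0 && strchr(ignore_chars, word[len - 1]) != NULL) {
        word[--len] = '\0';
    }
}
```

Called with the word "z39.50." and the list ".,", the sketch leaves
"z39.50"; the period in the middle of the word is untouched.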
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- Printing of removed common words
-------------------------------
This new swish version automatically prints out all the words that
are not indexed because they are too common according to the limits set
in the PLIMIT and FLIMIT variables in the config.h file.
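The comments in the config.h example describe a common word as one that
occurs in over PLIMIT percent of all indexed files and in over FLIMIT
files. A minimal C sketch of that test follows; the function name and
its parameters are assumptions made for illustration and do not come
from the SWISH-E source.

```c
/* Hypothetical helper, not part of SWISH-E: returns 1 when a word counts
** as too common, 0 otherwise.
**   files_with_word  number of indexed files that contain the word
**   total_files      number of files indexed
**   plimit           percentage threshold (the role of PLIMIT)
**   flimit           file-count threshold (the role of FLIMIT)
*/
int is_too_common(long files_with_word, long total_files,
                  int plimit, int flimit)
{
    if (total_files <= 0) {
        return 0;
    }
    /* "Over plimit percent" is checked with integers only:
    ** files_with_word / total_files > plimit / 100.
    */
    return files_with_word * 100 > (long)plimit * total_files
        && files_with_word > flimit;
}
```

Both conditions must hold: with PLIMIT 80 and FLIMIT 256, a word found in
all 100 files of a small collection is still indexed, because it does not
appear in more than 256 files.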
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- META data tag support
_______________________
It is now possible to search in META tags for words associated with a
particular metaName.
There are two ways to associate a word with a metaName:
1) <META NAME="metaName" CONTENT="words"> the usual HTML tag used
within <HEAD></HEAD>
2) <!--META START NAME="metaName" -->
some text of any length
<!--META END -->
In this way it is possible to mark almost any part of the text. Please
note, however, that words associated with metaNames are not searchable
in a plain search.
NOTE: Nested or overlapping META tags are not allowed and will lead to
unpredictable search results.
Step by Step indexing and search:
---------------------------------
Indexing:
The user configuration file has a new variable, MetaNames, that lists the
metaNames used in the files (see the user config file example at the
end of this doc). After adding the list of metaNames to the configuration
file, indexing proceeds as usual:
%swish-e -c <config.file>
If during indexing a metaName found in a file is not listed in the
config.file, SWISH-E can either abort the indexing with an error, or
issue a warning that names the unlisted metaName and the file that
contains it and then continue building the index, in which case the
words are not associated with any metaName.
To make this choice, set the variable OKNOMETA in the config.h file
(see the config.h file example at the end).
Meta names are case insensitive, so they can be written with any
combination of upper and lower case.
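The case-insensitive matching of meta names can be pictured with the
following C sketch. Both functions are hypothetical and only illustrate
the idea; they are not taken from the SWISH-E source, and the list of
known names stands in for the values given in the user configuration
file.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 when `a` and `b` are equal ignoring letter case, else 0. */
int equal_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same time. */
    return *a == '\0' && *b == '\0';
}

/* Hypothetical lookup: returns the position of `name` in the list
** `known` (which has `n` entries), or -1 when the name is not listed.
*/
int find_meta_name(const char *name, const char *const known[], size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (equal_ignoring_case(name, known[i])) {
            return (int)i;
        }
    }
    return -1;
}
```

A result of -1 corresponds to the case described above, where the
indexer either aborts or issues a warning, depending on OKNOMETA.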
Search:
The search query has a slightly different syntax and takes the form:
%swish-e -w "metaName = word" -f <index.file>
The equal sign indicates the presence of a metaName, and the search
results are all the files in which the META tag with NAME="metaName" has
CONTENT="word" (or in which "word" is contained in the area marked by the
<!--META START...> and <!--META END..> tags).
Spaces on either side of the '=' are optional, so the following are
equivalent:
%swish-e -w "metaName = word" -f <index.file>
%swish-e -w "metaName=word" -f <index.file>
%swish-e -w "metaName= word" -f <index.file>
To search on a word that contains a '=', precede the '=' with a '/':
%swish-e -w "test/=3 = x/=4 or y/=5" -f <index.file>
This query returns the files in which the word "x=4" is associated with
the metaName "test=3", or that contain the word "y=5" not associated
with any metaName.
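The escaping rule can be sketched in C as follows. The two functions are
hypothetical illustrations, not the SWISH-E query parser: one locates the
first '=' that is not preceded by '/', and the other turns each "/=" back
into a plain '='.

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of the first '=' in `query`
** that is not escaped by a preceding '/', or -1 if there is none.
*/
long find_meta_separator(const char *query)
{
    size_t i;

    for (i = 0; query[i] != '\0'; i++) {
        if (query[i] == '/' && query[i + 1] == '=') {
            i++;                /* skip the escaped '=' */
        } else if (query[i] == '=') {
            return (long)i;
        }
    }
    return -1;
}

/* Hypothetical helper: rewrites `s` in place, replacing each "/=" with
** a literal "=".
*/
void unescape_equals(char *s)
{
    char *out = s;

    while (*s != '\0') {
        if (s[0] == '/' && s[1] == '=') {
            s++;                /* drop the '/' and keep the '=' */
        }
        *out++ = *s++;
    }
    *out = '\0';
}
```

For the term "test/=3 = x/=4", the separator is the '=' surrounded by
spaces, and unescaping the two sides gives "test=3" and "x=4" once the
surrounding spaces are trimmed.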
Queries can also be constructed using any of the usual search features;
moreover, metaName and plain searches can be mixed in a single query,
e.g.
%swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)" -f yyy
This query retrieves all the files in which "metaName1" is associated
with either "a1" or "a4" and that do not contain both "a3" and "a7",
where "a3" and "a7" are not associated with any metaName.
###################################################################
config.h example
-----------------
/*
** SWISH Default Configuration File
**
** Kevin Hughes, kevinh@eit.com
** 3/11/94
**
** Two variables added: IGNORELAST and IGNORELASTCHAR
** G. Hill 3/12/97 ghill@library.berkeley.edu
**
**
** Added OKNOMETA to allow no failing in case the META name is
** not listed in the config.h
** G. Hill 4/15/97 ghill@library.berkeley.edu
**
** The following are user-definable options that you can change
** to fine-tune SWISH's default options.
*/
/* #define NEXTSTEP */
/* You may need to define this if compiling on a NeXTstep machine.
*/
#define INDEXPERMS 0644
/* After SWISH generates an index file, it changes the permissions
** of the file to this mode. Change to the mode you like
** (note that it must be an octal number). If you don't want
** permissions to be changed for you, comment out this line.
*/
#define PLIMIT 80
#define FLIMIT 256
/* SWISH uses these parameters to automatically mark words as
** being too common while indexing. For instance, if I defined PLIMIT
** as 80 and FLIMIT as 256, SWISH would define a common word as
** a word that occurs in over 80% of all indexed files and over
** 256 files. Making these numbers lower will most likely make your
** index files smaller. Making PLIMIT and FLIMIT small will also
** ensure that searching consumes only so much CPU resources.
*/
#define VERBOSE 2
/* You can define VERBOSE to be a number from 0 to 3. 0 is totally
** silent operation; 3 is very verbose.
*/
#define MAXHITS 500
/* MAXHITS is the maximum number of results to return from a search.
*/
#define DEFAULT_RULE AND_RULE
/* If a list of search words is specified without booleans,
** SWISH will assume they are connected by a default rule.
** This can be AND_RULE or OR_RULE.
*/
#define TITLETOPLINES 12
/* This is how many lines deep SWISH will look into an HTML file to
** attempt to find a <TITLE> tag.
*/
#define EMPHASIZECOMMENTS 0
/* Normally, words within HTML comments are not assigned a higher
** relevance rank. If you're including keywords in comments
** define this as 1 so matching results will rise to the top
** of search results.
*/
#define MINWORDLIMIT 2
/* This is the minimum length of a word. Anything shorter will not
** be indexed.
*/
#define MAXWORDLIMIT 40
/* This is the maximum length of a word. Anything longer will not
** be indexed.
*/
#define ASCIIENTITIES 1
/* If defined as 1, all entities in search words and indexed
** words will be converted to an ASCII equivalent. For instance,
** with this feature you can index the word "resum&eacute;" or
** "resum&#233;" and it will be indexed as the word "resume".
** If defined as 0, only numerical entities will be converted
** to named entities, if they exist.
*/
#define IGNOREALLV 0
#define IGNOREALLC 0
#define IGNOREALLN 0
/* If IGNOREALLV is 1, words containing all vowels won't be indexed.
** If IGNOREALLC is 1, words containing all consonants won't be indexed.
** If IGNOREALLN is 1, words containing all digits won't be indexed.
** Define as 0 to allow words with consistent characters.
** Vowels are defined as "aeiou", digits are "0123456789".
*/
#define IGNOREROWV 6
#define IGNOREROWC 8
#define IGNOREROWN 7
/* IGNOREROWV is the maximum number of consecutive vowels a word can have.
** IGNOREROWC is the maximum number of consecutive consonants a word can have.
** IGNOREROWN is the maximum number of consecutive digits a word can have.
** Vowels are defined as "aeiou", digits are "0123456789".
*/
#define IGNORESAME 15
/* IGNORESAME is the maximum times a character can repeat in a word.
*/
#define WORDCHARS "abcdefghijklmnopqrstuvwxyz=&#;0123456789.@\|/-"
/* WORDCHARS is a string of characters which SWISH permits to
** be in words. Any strings which do not include these characters
** will not be indexed. You can choose from any character in
** the following string:
**
** abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'\"`~,.[]{}()
**
** Note that if you omit "0123456789&#;" you will not be able to
** index HTML entities. DO NOT use the asterisk (*), lesser than
** and greater than signs (<), (>), or colon (:).
**
** Including any of these four characters may cause funny things to happen.
** If you have a pressing need to index 8-bit characters, please contact
** me for possible user testing in the future.
**
** Also note that if you specify the backslash character (\) or
** double quote (") you need to type a backslash before them to
** make the compiler understand them.
*/
#define BEGINCHARS "abcdefghijklmnopqrstuvwxyz&0123456789"
/* Of the characters that you decide can go into words, this is
** a list of characters that words can begin with. It should be
** a subset of (or equal to) WORDCHARS.
*/
#define ENDCHARS "abcdefghijklmnopqrstuvwxyz;0123456789,."
/* This is the same as BEGINCHARS, except you're testing for
** valid characters at the ends of words.
*/
/* Note that if you really want to edit the default stopwords, (words
** that are deemed too common to be indexed) then you can do so in the
** file "swish.h". They don't have to be in alphabetical order.
*/
#define IGNORELAST 1
/* Variable that, if set to 1, causes the characters in IGNORELASTCHAR to be
** disregarded when in the final position in a word. This variable was
** introduced to solve the z39.50 problem: certain characters, e.g. the
** period, should be valid in the middle of a word but disregarded at the
** end. The default is false.
*/
#define IGNORELASTCHAR ".,"
/* Array that contains the characters that are valid in the middle of
** a word but must be disregarded at the end. It is important to also
** list these characters in the ENDCHARS array; otherwise the word will
** not be indexed because it is considered invalid.
*/
#define OKNOMETA 1
/* Variable that defines whether indexing may continue when a META name is
** not listed in the METANAMES variable. A value of 1 causes the words to be
** indexed as regular words with no metaName attached, and only a warning
** naming the meta name and the file in which it was found is issued.
*/
#define INDEXTAGS 0
/* Normally, all data in tags in HTML files (except for words in
** comments) is ignored. If you want to index HTML files with the
** text within tags and all, define this to be 1 and not 0.
*/
######################################################################
User configuration file example
--------------------------------
# Sample SWISH configuration file
# Kevin Hughes, kevinh@eit.com, 3/11/95
#
# Added MetaNames variable to support META tags
# G.Hill ghill@library.berkeley.edu 4/97
IndexDir /home/ghill/swish/dir5/records
# This is a space-separated list of files and
# directories you want indexed. You can specify
# more than one of these directives.
IndexFile /home/ghill/swish/dir5/myindex5
# This is what the generated index file will be.
MetaNames NaMe1 nAme2
# List of metaNames used in the files to index; names
# are case insensitive.
IndexName "Improvement index"
IndexDescription "This is an index to test bug fixes in swish."
IndexPointer "http://xxxx"
IndexAdmin "Name, (e-mail address)"
# Extra information you can include in the index file.
IndexOnly .html
# Only files with these suffixes will be indexed.
IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.
FollowSymLinks no
# Put "yes" to follow symbolic links in indexing, else "no".
NoContents .gif .xbm .au .mov .mpg .pdf .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.
#ReplaceRules replace "/home/cleita/public_html/index/links" "http://sunsite.berkeley.edu/InternetIndex/Data"
# ReplaceRules allow you to make changes to file pathnames
# before they're indexed.
FileRules pathname contains admin testing demo trash construction confidential
FileRules filename contains # % ~ .bak .orig .old old.
FileRules title contains construction example pointers
FileRules directory contains .htaccess
# Files matching the above criteria will *not* be indexed.
IgnoreLimit 50 1000
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off auto-stopwording.
#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated by spaces
# and may span multiple directives.
</PRE>
<P ALIGN="CENTER">
<IMG ALT="" WIDTH="470" HEIGHT="10"
SRC="/Images/dotrule.gif"></P>
<P ALIGN="CENTER">
SWISH is Copyright © 1989, 1991 Free Software Foundation, Inc. <BR>
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA
<BR>SWISH-E is distributed with <B>no warranty</B> under the terms of the <A
HREF="http://www.fsf.org/copyleft/gpl.html">GNU General Public License</A>.<BR>
Public questions may be posted to
the <A HREF="mailto:swish-e@sunsite.berkeley.edu">SWISH-E Discussion</A>.
<BR>Document maintained at http://sunsite.berkeley.edu/SWISH-E/changes.html
by the SunSITE Manager.
<BR>Last update 8/12/97. SunSITE Manager:
<A HREF="mailto:manager@sunsite.berkeley.edu">
manager@sunsite.berkeley.edu</A></P>
</BODY>
</HTML>