	SWISH++
	Simple Web Indexing System for Humans: C++ version
	http://homepage.mac.com/pauljlucas/software/swish/
	________________________________________________________________

	SWISH++ is a Unix-based file indexing and searching engine
	(typically used to index and search files on web sites).  It
	was based on SWISH-E although SWISH++ is a complete rewrite.
	SWISH++ was developed to circumvent my difficulties with using
	the SWISH-E package.

	SWISH++ has been ported to compile and run under Microsoft
	Windows by Robert J. Lebowitz <lebowitz@finaltouch.com> and
	Christoph Conrad <christoph.conrad@gmx.de>.

	________________________________________________________________

	Features

	 1. Lightning-fast indexing
	    SWISH++ attains its speed chiefly by doing two things:
	    using good algorithms and data structures, and doing fast
	    I/O.

		A. SWISH++ uses the C++ Standard Template Library's map
		   class, which is typically implemented as a red-black
		   tree (or another balanced tree such as an AVL tree),
		   so lookups and insertions run in O(lg n) time in the
		   worst case.

		B. SWISH++ uses the mmap(2) Unix system call to read
		   files instead of using standard I/O. If you are
		   unfamiliar with mmap, it "maps" a file into memory
		   using the same virtual memory management mechanism
		   the operating system itself uses.  When the first
		   character of a file is read, a page fault occurs and
		   the operating system maps a page of the file into
		   memory, with no intermediate copy into a user-space
		   buffer; file access can hardly be faster than that.
		   Additionally, because the file is in memory, the
		   characters in it are accessed via pointers using
		   simple pointer arithmetic rather than through
		   library function calls and input buffers.

	    Other factors contributing to SWISH++'s speed are that it
	    does very little explicit dynamic memory allocation, uses
	    function inlining, and makes very few function calls in
	    inner loops.
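
	    As a rough illustration of points A and B, here is a
	    minimal sketch that counts words in a file by scanning an
	    mmap'ed view with plain pointers and storing the counts in
	    a std::map.  The function name count_words and its
	    word-splitting rule (runs of alphanumeric characters,
	    lower-cased) are assumptions made for this example; this
	    is not code from the SWISH++ sources.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (not SWISH++ code): count the words in a file by
// walking an mmap'ed view of it with pointer arithmetic.
std::map<std::string, int> count_words(const char *path) {
    std::map<std::string, int> counts;   // balanced tree: O(lg n) per insert

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return counts;
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {  // mmap rejects length 0
        close(fd);
        return counts;
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void *mem = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           // the mapping outlives the descriptor
    if (mem == MAP_FAILED)
        return counts;

    const char *p = static_cast<const char *>(mem);
    const char *const end = p + size;
    while (p != end) {
        // Skip characters that cannot be part of a word.
        while (p != end && !std::isalnum(static_cast<unsigned char>(*p)))
            ++p;
        const char *const start = p;
        // Advance to the end of the current word.
        while (p != end && std::isalnum(static_cast<unsigned char>(*p)))
            ++p;
        if (p != start) {
            std::string word(start, p);
            for (char &c : word)
                c = static_cast<char>(
                    std::tolower(static_cast<unsigned char>(c)));
            ++counts[word];
        }
    }
    munmap(mem, size);
    return counts;
}
```

	    Calling count_words("some/file.txt") returns an empty map
	    if the file cannot be opened or mapped.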

	 2. Indexes META elements, ALT, and other attributes
	    For HTML or XHTML files, SWISH++ indexes words in META
	    element CONTENT attributes and associates them with the
	    NAME attributes. Meta names can later be queried against
	    specifically, e.g.:

		search author = hawking

	    SWISH++ also indexes the words in ALT attributes (for the
	    AREA, IMG, and INPUT elements), STANDBY attributes (for the
	    OBJECT element), SUMMARY attributes (for the TABLE
	    element), and TITLE attributes (for any HTML or XHTML
	    element).

	 3. Selectively exclude text within HTML or XHTML elements
	    Text within HTML or XHTML elements belonging to specified
	    classes can be excluded from indexing.  This is most useful
	    for skipping text in common page headers, footers, and
	    pop-up menus.

	 4. Intelligently index mail and news files
	    SWISH++ indexes words in headers and associates them with
	    the names of the headers as meta names that can later be
	    queried against specifically, e.g.:

		search subject = big-bang

	    Similarly, words in vCard fields are associated with the
	    names of the fields as meta names that can also later be
	    queried against, e.g.:

		search title = professor
		search org = SLAC

	    Additionally, plain text, enriched text, and HTML in any
	    one of the ASCII, ISO-8859-1, UTF-7, or UTF-8 character sets
	    and in any one of the 7-bit, 8-bit, quoted-printable, or
	    base-64 encodings are decoded and converted on-the-fly, so
	    encoded bodies and attachments are indexed properly.

	    Lastly, attachments having other MIME types can be filtered
	    on-the-fly before being indexed, e.g., to convert Microsoft
	    Word or PDF attachments to plain text.

	 5. Index Unix manual page files
	    SWISH++ indexes words in sections and associates them with
	    the name of the section as meta names that can later be
	    queried against specifically, e.g.:

		search description = environment
		search author = lucas

	    SWISH++ can therefore be used as a much better apropos(1)
	    command replacement.

	 6. Index LaTeX and RTF documents
	    SWISH++ can ignore LaTeX and RTF markup. Additionally, for
	    LaTeX documents, SWISH++ sets the document title to the
	    content of the \title{...} command.

	 7. Index ID3 tags of MP3 files
	    SWISH++ indexes words in ID3 tags of MP3 files and
	    associates them with the names of the fields as meta names
	    that can later be queried against specifically, e.g.:

		search artist = roxette
		search title = dangerous

	    All ID3 tag versions through 2.4 are supported.
	    Additionally, text fields in any one of ASCII, ISO-8859-1,
	    UTF-8, or UTF-16 character sets are supported.

	 8. Index non-text files such as Microsoft Office documents
	    A separate text-extraction utility "extract" is included to
	    assist in indexing non-text files. It is essentially a
	    more sophisticated version of the Unix strings(1) command,
	    but employs the same word-determination heuristics used for
	    indexing.

	 9. Apply filters to files on-the-fly prior to indexing
	    Based on filename patterns, files can be filtered before
	    being indexed, e.g.: compressed files uncompressed, PDF
	    files converted to plain text, etc.
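
	    The idea of choosing a filter by filename pattern can be
	    sketched with the POSIX fnmatch(3) function.  The table
	    type, the function name filter_for, and the commands in
	    the usage note are illustrative assumptions; they do not
	    reflect SWISH++'s actual configuration syntax.

```cpp
#include <fnmatch.h>

#include <string>
#include <utility>
#include <vector>

// Hypothetical table: shell-style filename pattern -> filter command.
typedef std::vector<std::pair<std::string, std::string> > FilterTable;

// Return the command of the first pattern that matches file_name,
// or an empty string if the file needs no filtering.
std::string filter_for(const FilterTable &filters,
                       const std::string &file_name) {
    for (const auto &entry : filters)
        if (fnmatch(entry.first.c_str(), file_name.c_str(), 0) == 0)
            return entry.second;
    return std::string();
}
```

	    With a table mapping "*.gz" to "gunzip -c" and "*.pdf" to
	    "pdftotext", filter_for(table, "paper.pdf") selects the
	    second command, while "notes.txt" matches nothing and is
	    indexed as-is.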

	10. Modular indexing architecture
	    New indexing modules can be written to index other file
	    formats directly (without filters).

	11. Index new files incrementally
	    New files can be indexed and added to an existing index
	    incrementally.

	12. Index remote web sites
	    A separate utility "httpindex" is included that interfaces
	    SWISH++ to the wget(1) command enabling remote web sites to
	    be indexed. This is useful for searching all the servers
	    in your local area network simultaneously.

	13. Handles large collections of files
	    SWISH++ automatically splits and merges partial indices for
	    large collections of files as it goes, so it does not bring
	    your machine to its knees by exhausting physical memory and
	    causing it to swap like mad.

	14. Lightning-fast searching
	    The same mmap(2) technique used for indexing is used again
	    for searching. The generated index file is written to disk
	    such that it can be mmap'ed back into memory and binary
	    searched immediately, with no parsing of the data, also in
	    O(lg n) time.
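
	    To show what searching a mapped index "with no parsing"
	    can look like, here is a minimal sketch.  The layout (a
	    sorted table of offsets pointing at NUL-terminated words)
	    and the names WordIndex and find_word are assumptions for
	    illustration only, not SWISH++'s real index format.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical layout: `base` would be the address returned by mmap, and
// `offsets` a table inside the same mapping, sorted by the word each
// entry points to.
struct WordIndex {
    const char          *base;     // start of the mapped bytes
    const std::uint32_t *offsets;  // offsets[i] locates word i within base
    std::size_t          count;    // number of words in the table
};

// Return the position of `word` in the table, or -1 if it is absent.
// std::lower_bound performs O(lg n) string comparisons.
long find_word(const WordIndex &idx, const char *word) {
    const std::uint32_t *const first = idx.offsets;
    const std::uint32_t *const last  = idx.offsets + idx.count;
    const std::uint32_t *const it = std::lower_bound(
        first, last, word,
        [&idx](std::uint32_t off, const char *w) {
            return std::strcmp(idx.base + off, w) < 0;
        });
    if (it != last && std::strcmp(idx.base + *it, word) == 0)
        return static_cast<long>(it - first);
    return -1;
}
```

	    Because the comparison reads the words directly out of the
	    mapped bytes, nothing has to be copied or deserialized
	    before the first query.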

	15. Optional word stemming (suffix stripping)
	    SWISH++ allows stemming to be performed at the time of
	    searches, not at the time of index generation.  This allows
	    users to decide whether to perform stemming or not.
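
	    The effect of stemming at search time can be sketched as
	    follows.  The suffix stripper here is deliberately naive
	    and is only a stand-in: this README does not say which
	    stemming algorithm SWISH++ uses, and the names naive_stem
	    and match are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Deliberately naive suffix stripper, for illustration only.
std::string naive_stem(const std::string &word) {
    static const char *const suffixes[] = {"ing", "ed", "es", "s"};
    for (const char *suffix : suffixes) {
        const std::string s(suffix);
        // Strip only if at least three characters of stem remain.
        if (word.size() >= s.size() + 3 &&
            word.compare(word.size() - s.size(), s.size(), s) == 0)
            return word.substr(0, word.size() - s.size());
    }
    return word;
}

// The index holds unstemmed words; stemming is applied to both sides
// only when the user asks for it at query time.
std::vector<std::string> match(const std::vector<std::string> &indexed,
                               const std::string &query, bool stem) {
    std::vector<std::string> hits;
    const std::string q = stem ? naive_stem(query) : query;
    for (const std::string &w : indexed)
        if ((stem ? naive_stem(w) : w) == q)
            hits.push_back(w);
    return hits;
}
```

	    Because the index itself stores the original words, the
	    same index serves both exact and stemmed queries.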

	16. Ability to run as a search server
	    SWISH++'s search engine can run in the background as a
	    multi-threaded daemon process to function as a search
	    server accepting query requests and returning results via a
	    Unix domain or TCP socket or both.  For search-intensive
	    applications, such as a search engine on a heavily used web
	    site, this can yield a large performance improvement since
	    the start-up cost (fork(2), exec(2), and initialization) is
	    paid only once.

	17. Easy-to-parse results format
	    SWISH++ outputs its search results in the form:

		rank path_name file_size file_title

	    By placing the file_title, which may contain spaces, last,
	    you can easily parse it, e.g., in Perl:

		($rank,$path,$size,$title) = split( / /, $_, 4 );
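
	    For comparison, a C++ equivalent of the Perl split might
	    look like the sketch below.  The SearchResult type and the
	    parse_result function are hypothetical names; like the
	    Perl one-liner, the sketch assumes that path_name contains
	    no spaces.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical record for one line of search results.
struct SearchResult {
    int         rank;
    std::string path;
    long        size;
    std::string title;
};

// Split on the first three spaces only, so that the title (which may
// itself contain spaces) is kept whole.  Returns false on malformed input.
bool parse_result(const std::string &line, SearchResult &out) {
    const std::size_t a = line.find(' ');
    if (a == std::string::npos) return false;
    const std::size_t b = line.find(' ', a + 1);
    if (b == std::string::npos) return false;
    const std::size_t c = line.find(' ', b + 1);
    if (c == std::string::npos) return false;
    try {
        out.rank = std::stoi(line.substr(0, a));
        out.size = std::stol(line.substr(b + 1, c - b - 1));
    } catch (const std::exception &) {
        return false;               // rank or size was not a number
    }
    out.path  = line.substr(a + 1, b - a - 1);
    out.title = line.substr(c + 1);
    return true;
}
```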

	18. XML results format
	    Alternatively, SWISH++ can output search results in XML for
	    increased interoperability with other XML applications.

	19. Generously commented source code
	    The source code is clearly written with lots of comments
	    including references to other works in case you want to
	    modify it under the terms of the GNU General Public
	    License.

	________________________________________________________________

	Non-Features

	The following is a list of the features SWISH++ does not have
	that SWISH-E does. I wrote SWISH++ to solve my immediate
	indexing problems; therefore, I implemented only those features
	useful to me. If others can also benefit from the work, great.
	I may implement other features as time permits.

	1. Indexing and searching based on HTML tags
	   SWISH++ has no equivalent means for searching within
	   specific HTML tags (the SWISH-E -t option). I didn't have a
	   need for this feature.

	2. Document properties
	   This functionality can be achieved by using the extract_meta()
	   function in the included WWW Perl module.

	3. Crash and burn on files
	   SWISH++ will not crash while indexing any file.  Period.  If
	   it does, there's a bug and I'll fix it.

	________________________________________________________________

	Copyright (C) 1998-2002 by Paul J. Lucas <pauljlucas@mac.com>
	SWISH++ is available under the GNU General Public License.
	This file last updated: May 13, 2002