<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
	<title>
	  ht://Dig: Features and System requirements
	</title>
  </head>
  <body bgcolor="#eef7ff">
	<h1>
	  Features and System requirements
	</h1>
	<p>
	  ht://Dig Copyright &copy; 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
	  Please see the file <a href="COPYING">COPYING</a> for
	  license information.
	</p>
	<hr noshade>
	<h2>
	  Features
	</h2>
	<p>
	  Here are some of the major features of ht://Dig. They are in
	  no particular order.
	</p>
	<blockquote>
	<dl>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Intranet searching</strong>
	  </dt>
	  <dd>
		ht://Dig has the ability to search through many servers
		on a network by acting as a WWW browser.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		It is free</strong>
	  </dt>
	  <dd>
		The whole system is released under the
		<a href="COPYING">GNU Library General Public License (LGPL)</a>.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Robot exclusion is supported</strong>
	  </dt>
	  <dd>
		The <a href="http://www.robotstxt.org/wc/norobots.html">
		Standard for Robot Exclusion</a> is
		<a href="meta.html#robots">supported by ht://Dig.</a>
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Boolean expression searching</strong>
	  </dt>
	  <dd>
		Searches can be arbitrarily complex using boolean
		expressions.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Phrase searching</strong>
	  </dt>
	  <dd>
		A phrase can be searched for by enclosing it in quotes.
		Phrase searches can be combined with word searches, as in
		<code>Linux and "high quality"</code>.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Configurable search results</strong>
	  </dt>
	  <dd>
		The output of a search can easily be tailored to your
		needs by means of providing HTML templates.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Fuzzy searching</strong>
	  </dt>
	  <dd>
		Searches can be performed using various
		<a href="attrs.html#search_algorithm">configurable algorithms</a>.
		Currently the following algorithms are
		supported (in any combination):
		<ul>
		  <li>
			exact
		  </li>
		  <li>
			soundex
		  </li>
		  <li>
			metaphone
		  </li>
		  <li>
			common word endings
		  </li>
		  <li>
			synonyms
		  </li>
		  <li>
			accent stripping
		  </li>
		  <li>
			substring and prefix
		  </li>
		  <li>
			regular expressions
		  </li>
		  <li>
			simple spelling corrections
		  </li>
		</ul>
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Searching of many file formats</strong>
	  </dt>
	  <dd>
		Both HTML documents and plain text files can be
		searched directly by ht://Dig itself.  There is also a
		<a href="attrs.html#external_parsers">mechanism
		to allow external programs ("external parsers")</a> to be used
		while building the database so that arbitrary file formats
		can be searched. <br>
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Document retrieval using many transport services</strong>
	  </dt>
	  <dd>
		Several transport services can be handled by ht://Dig,
		including http://, ftp:// and file:///.
		There is also a
		<a href="attrs.html#external_protocols">mechanism
		to allow external programs ("external protocols")</a> to be used
		while building the database so that arbitrary transport
		services can be used. <br>
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Keywords can be added to HTML documents</strong>
	   </dt>
	  <dd>
		Any number of <a href="meta.html">keywords</a>
		can be added to HTML documents
		which will not show up when the document is viewed.
		This is used to make a document more likely to be found
		and also to make it appear higher in the list of
		matches.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Email notification of expired documents</strong>
	  </dt>
	  <dd>
		Special meta information can be added to HTML documents
		which can be used to
		<a href="notification.html">notify the maintainer</a> of those
		documents at a certain time. It is handy to get
		reminded when to remove the "New" images from a certain
		page, for example.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		A protected server can be indexed</strong>
	  </dt>
	  <dd>
		ht://Dig can be told to use a specific
		<a href="attrs.html#authorization">username and password</a>
		when it retrieves documents. This can be used
		to index a server or parts of a server that are
		protected by a username and password.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Searches on subsections of the database</strong>
	  </dt>
	  <dd>
		It is easy to set up a search which only returns
		documents whose
		<a href="hts_form.html#restrict">URL matches a certain pattern.</a>
		This becomes very useful for people who want to make their
		own data searchable without having to use a separate
		search engine or database.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Full source code included</strong>
	  </dt>
	  <dd>
		The search engine comes with full source code. The
		whole system is released under the terms and conditions
		of the <a href="COPYING">GNU Library General Public License (LGPL) version
		2.0</a>.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		The depth of the search can be limited</strong>
	  </dt>
	  <dd>
		Instead of limiting the search to a set of machines, it
		can also be restricted to documents that are a certain
		number of <a href="attrs.html#max_hop_count">"mouse-clicks"</a>
		away from the start document.
	  </dd>
	  <dt>
		<strong><img src="bdot.gif" width=9 height=9 alt="*">
		Full support for the ISO-Latin-1 character set</strong>
	  </dt>
	  <dd>
		Both SGML entities like '&amp;agrave;' and ISO-Latin-1
		characters can be indexed and searched.
	  </dd>
	</dl>
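	<p>
	  As an illustration of the robot exclusion, keyword and
	  notification features above, the head of an indexed document might
	  look something like the following. This is only a sketch: the tag
	  names are the ones we understand <a href="meta.html">meta.html</a>
	  and <a href="notification.html">notification.html</a> to describe,
	  while the keywords, address, date and subject are made-up values.
	  Please check those pages for the exact names and date formats your
	  version accepts.
	</p>
<pre>
&lt;head&gt;
  &lt;title&gt;Example page&lt;/title&gt;
  &lt;!-- index this page, but do not follow its links --&gt;
  &lt;meta name="robots" content="nofollow"&gt;
  &lt;!-- extra search keywords; browsers do not display these --&gt;
  &lt;meta name="htdig-keywords" content="widgets gadgets"&gt;
  &lt;!-- remind the maintainer about this page (hypothetical values) --&gt;
  &lt;meta name="htdig-email" content="webmaster@example.com"&gt;
  &lt;meta name="htdig-notification-date" content="2004-12-31"&gt;
  &lt;meta name="htdig-email-subject" content="Remove the New image"&gt;
&lt;/head&gt;
</pre>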
	</blockquote>
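	<p>
	  To illustrate searching on a subsection of the database, a search
	  form along the following lines could be used. This is a minimal
	  sketch, not a complete form: it assumes that htsearch is installed
	  as <code>/cgi-bin/htsearch</code> and that the documents of
	  interest live under the hypothetical URL shown. The
	  <code>restrict</code> and <code>words</code> input names are the
	  ones described in <a href="hts_form.html">hts_form.html</a>.
	</p>
<pre>
&lt;form method="post" action="/cgi-bin/htsearch"&gt;
  &lt;!-- only return documents whose URL matches this pattern --&gt;
  &lt;input type="hidden" name="restrict" value="http://www.example.com/manual/"&gt;
  &lt;!-- the search words typed by the user --&gt;
  &lt;input type="text" name="words" size="30"&gt;
  &lt;input type="submit" value="Search"&gt;
&lt;/form&gt;
</pre>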
	<hr size="4" noshade>
	<h1>
	  Requirements to build ht://Dig
	</h1>
	<p>
	  ht://Dig was developed under Unix using C++.
	</p>
	<p>
	  For this reason, you will need a Unix machine, a C compiler
	  and a C++ compiler. (The C compiler is needed to compile some
	  of the GNU libraries.)
	</p>
	<p>
	  Unfortunately, we only have access to a couple of different
	  Unix machines. ht://Dig has been tested on these machines:
	</p>
	<ul>
<!--
	  <li>
		Sun Solaris 2.5 SPARC (using gcc/g++ 2.7.2)
	  </li>
	  <li>
		Sun SunOS 4.1.4 SPARC (using gcc/gcc 2.7.0)
	  </li>
	  <li>
		HP/UX A.09.01 (using gcc/g++ 2.6.0)
	  </li>
	  <li>
		IRIX 5.3 (SGI C++ compiler. Don't know the version)
	  </li>
	  <li>
		Debian Linux 2.0 (using egcs 1.1b)
	  </li>
-->
	  <li>
		FreeBSD 4.6 (using gcc 2.95.3) <!-- lha -->
	  </li>
	  <li>
	        Mandrake Linux 8.2 (using gcc 3.2) <!-- lha -->
	  </li>
	  <li>
		Debian, 2.2.19 kernel (using gcc 2.95.4) <!-- lha -->
	  </li>
	  <li>
	        Debian on an Alpha <!-- lha -->
	  </li>
	  <li>
	        RedHat 7.3, 8.0 <!-- Jim Cole -->
	  </li>
	  <li>
	        Sun Solaris 2.8 = SunOS 5.8 (using gcc 3.1) <!-- lha -->
	  </li>
	  <li>
	        Sun Solaris 2.8 = SunOS 5.8 (using Sun's cc / g++ 3.1) <!-- lha -->
	  </li>
	  <li>
	        Mac OS X 10.2 (using gcc) <!-- Jim Cole -->
	  </li>

 	</ul>
	There are reports of ht://Dig working on a number of other platforms.
	<h3>
	  libstdc++
	</h3>
	<p>
	  If you plan on using g++ to compile ht://Dig, you have to make
	  sure that libstdc++ has been installed. Unfortunately, libstdc++ is a
	  separate package from gcc/g++. You can get libstdc++ from the
	  <a href="ftp://ftp.gnu.org/pub/gnu/">GNU software archive</a>.
	</p>

<!--		The current  Makefiles  don't use include...
	<h3>
	  Berkeley 'make'
	</h3>
	<p>
	  The building relies heavily on the make program. The problem
	  with this is that not all make programs are the same. The
	  requirement for the make program is that it understands the
	  'include' statement as in
	</p>
	<blockquote>
	  <code>include somefile otherfile</code>
	</blockquote>
	<p>
	  The Berkeley 4.4 make program doesn't use this syntax, instead
	  it wants
	</p>
	<blockquote>
	  <code>.include "somefile"</code><br>
	  <code>.include "otherfile"</code>
	</blockquote>
	<p>
	  and hence it cannot be used to build ht://Dig.
	</p>
	<p>
	  If your make program doesn't understand the right 'include'
	  syntax, it is best if you get and install
	  <a href="ftp://ftp.gnu.org/pub/gnu/">gnumake</a> before you try
	  to compile everything. The alternative is to change all the
	  Makefiles.
	</p>
-->
	<hr noshade>
	<h1>
	  Disk space requirements
	</h1>
	<p>
	  The search engine will require lots of disk space to store
	  its databases. Unfortunately, there is no exact formula to
	  compute the space requirements. It depends on the number of
	  documents you are going to index but also on the various
	  options you use.
	  </p>
	  <p>As a temporary measure, 3.2 betas use a very inefficient
	  database structure to enable phrase searching.  This will be
	  fixed before the release of 3.2.0.  Currently, indexing a site of
	  around 10,000 documents gives a database of around 400MB using the
	  default setting for
	  <a href="attrs.html#max_doc_size">maximum document size</a> and storing the
	  <a href="attrs.html#max_head_length">first 50,000 bytes of each document</a>
	  to enable context to be displayed.
	  </p>
	  <!-- To give you an idea of the space
	  requirements, here is what I have deduced from our own
	  database size at San Diego State University.
	</p>
	<p>
	  If you keep around the wordlist database (for update digging
	  instead of initial digging) I found that multiplying the
	  number of documents covered by 12,000 will come pretty close
	  to the space required.
	</p>
	<p>
	  We have about 13,000 documents:
	</p>
<pre>
         13,000
         12,000 x
    ===========
    156,000,000
</pre>
	or about 150 MB.
	<p>
	  Without the wordlist database, the factor drops down to about
	  7500:
	</p>
<pre>
         13,000
          7,500 x
     ===========
     97,500,000
</pre>
	or about 93 MB.
-->
	<p>
	  Keep in mind that we keep at most 50,000 bytes of each
	  document. This may seem like a lot, but most documents aren't very
	  big and it gives us a big enough chunk to almost always show
	  an excerpt of the matches.
	</p>
	<p>
	  You may find that if you store most of each document, the
	  databases are almost the same size as, or even larger than, the
	  documents themselves! Remember that if you're storing a
	  significant portion of each document (say 50,000 bytes as
	  above), you have that requirement, plus the size of the word
	  database and all the additional information about each document
	  (size, URL, date, etc.) required for searching.
	</p>
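	<p>
	  As a rough worked example, the 3.2 beta figure quoted above
	  (around 400MB for around 10,000 documents) comes to roughly 40KB
	  of database per document. If we assume, and it is only an
	  assumption, that the databases grow linearly with the number of
	  documents, a hypothetical site of 50,000 documents would need
	  something like:
	</p>
<pre>
    400 MB / 10,000 documents  =  about 40 KB per document
     40 KB x 50,000 documents  =  about 2,000 MB (2 GB)
</pre>
	<p>
	  Treat this as an order-of-magnitude guide only. The real figure
	  depends on the size of your documents and on the options you use.
	</p>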
	<hr size="4" noshade>

	Last modified: $Date: 2004/05/28 13:15:19 $

  </body>
</html>