File: SWISH-FAQ.html

swish-e 2.4.3-1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
  <head>
   <title>SWISH-Enhanced:  The Swish-e FAQ - Answers to Common Questions </title>
   <link href="./style.css" rel=stylesheet type="text/css" title="refstyle">
  </head>
  <body>

    <h1 class="banner">
        <a href="http://swish-e.org"><img border=0 src="images/swish.gif" alt="Swish-E Logo"></a><br>
        <img src="images/swishbanner1.gif"><br>
        <img src="images/dotrule1.gif"><br>
         The Swish-e FAQ - Answers to Common Questions 
    </h1>

    <hr>

    <p>
    <div class="navbar">
      <a href="./SWISH-SEARCH.html">Prev</a> |
      <a href="./index.html">Contents</a> |
      <a href="./SWISH-BUGS.html">Next</a>
    </div>
    <p>

    <div class="toc">
      
<A NAME="toc"></A>
<P><B>Table of Contents:</B></P>

<UL>

	<LI><A HREF="#Frequently_Asked_Questions">Frequently Asked Questions</A>
	<UL>

		<LI><A HREF="#General_Questions">General Questions</A>
		<UL>

			<LI><A HREF="#What_is_Swish_e_">What is Swish-e?</A>
			<LI><A HREF="#So_is_Swish_e_a_search_engine_">So, is Swish-e a search engine?</A>
			<LI><A HREF="#Should_I_upgrade_if_I_m_already_running_a_previous_version_of_Swish_e_">Should I upgrade if I'm already running a previous version of Swish-e?</A>
			<LI><A HREF="#Are_there_binary_distributions_available_for_Swish_e_on_platform_foo_">Are there binary distributions available for Swish-e on platform foo?</A>
			<LI><A HREF="#Do_I_need_to_reindex_my_site_each_time_I_upgrade_to_a_new_Swish_e_version_">Do I need to reindex my site each time I upgrade to a new Swish-e version?</A>
			<LI><A HREF="#What_s_the_advantage_of_using_the_libxml2_library_for_parsing_HTML_">What's the advantage of using the libxml2 library for parsing HTML?</A>
			<LI><A HREF="#Does_Swish_e_include_a_CGI_interface_">Does Swish-e include a CGI interface?</A>
			<LI><A HREF="#How_secure_is_Swish_e_">How secure is Swish-e?</A>
			<LI><A HREF="#Should_I_run_Swish_e_as_the_superuser_root_">Should I run Swish-e as the superuser (root)?</A>
			<LI><A HREF="#What_files_does_Swish_e_write_">What files does Swish-e write?</A>
			<LI><A HREF="#Can_I_index_PDF_and_MS_Word_documents_">Can I index PDF and MS-Word documents?</A>
			<LI><A HREF="#Can_I_index_documents_on_a_web_server_">Can I index documents on a web server?</A>
			<LI><A HREF="#Can_I_implement_keywords_in_my_documents_">Can I implement keywords in my documents? </A>
			<LI><A HREF="#What_are_document_properties_">What are document properties?</A>
			<LI><A HREF="#What_s_the_difference_between_MetaNames_and_PropertyNames_">What's the difference between MetaNames and PropertyNames?</A>
			<LI><A HREF="#Can_Swish_e_index_multi_byte_characters_">Can Swish-e index multi-byte characters?</A>
		</UL>

		<LI><A HREF="#Indexing">Indexing</A>
		<UL>

			<LI><A HREF="#How_do_I_pass_Swish_e_a_list_of_files_to_index_">How do I pass Swish-e a list of files to index?</A>
			<LI><A HREF="#How_does_Swish_e_know_which_parser_to_use_">How does Swish-e know which parser to use?</A>
			<LI><A HREF="#Can_I_reindex_and_search_at_the_same_time_">Can I reindex and search at the same time?</A>
			<LI><A HREF="#Can_I_index_phrases_">Can I index phrases? </A>
			<LI><A HREF="#How_can_I_prevent_phrases_from_matching_across_sentences_">How can I prevent phrases from matching across sentences?</A>
			<LI><A HREF="#Swish_e_isn_t_indexing_a_certain_word_or_phrase_">Swish-e isn't indexing a certain word or phrase.</A>
			<LI><A HREF="#How_do_I_keep_Swish_e_from_indexing_numbers_">How do I keep Swish-e from indexing numbers?</A>
			<LI><A HREF="#Swish_e_crashes_and_burns_on_a_certain_file_What_can_I_do_">Swish-e crashes and burns on a certain file. What can I do?</A>
			<LI><A HREF="#How_to_I_prevent_indexing_of_some_documents_">How do I prevent indexing of some documents?</A>
			<LI><A HREF="#How_do_I_prevent_indexing_parts_of_a_document_">How do I prevent indexing parts of a document?</A>
			<LI><A HREF="#How_do_I_modify_the_path_or_URL_of_the_indexed_documents_">How do I modify the path or URL of the indexed documents?</A>
			<LI><A HREF="#How_can_I_index_data_from_a_database_">How can I index data from a database?</A>
			<LI><A HREF="#How_do_I_index_my_PDF_Word_and_compressed_documents_">How do I index my PDF, Word, and compressed documents?</A>
			<LI><A HREF="#How_do_I_filter_documents_">How do I filter documents?</A>
			<LI><A HREF="#Eh_but_I_just_want_to_know_how_to_index_PDF_documents_">Eh, but I just want to know how to index PDF documents!</A>
			<LI><A HREF="#I_m_using_Windows_and_can_t_get_Filters_or_the_prog_input_method">I'm using Windows and can't get Filters or the prog input method</A>
			<LI><A HREF="#How_do_I_index_non_English_words_">How do I index non-English words?</A>
			<LI><A HREF="#Can_I_add_remove_files_from_an_index_">Can I add/remove files from an index?</A>
			<LI><A HREF="#I_run_out_of_memory_trying_to_index_my_files_">I run out of memory trying to index my files. </A>
			<LI><A HREF="#_too_many_open_files_when_indexing_with_e_option">&quot;too many open files&quot; when indexing with -e option</A>
			<LI><A HREF="#My_system_admin_says_Swish_e_uses_too_much_of_the_CPU_">My system admin says Swish-e uses too much of the CPU!</A>
		</UL>

		<LI><A HREF="#Spidering">Spidering</A>
		<UL>

			<LI><A HREF="#How_can_I_index_documents_on_a_web_server_">How can I index documents on a web server?</A>
			<LI><A HREF="#Why_does_swish_report_swishspider_not_found_">Why does swish report &quot;./swishspider: not found&quot;?</A>
			<LI><A HREF="#I_m_using_the_spider_pl_program_to_spider_my_web_site_but_some">I'm using the spider.pl program to spider my web site, but some</A>
			<LI><A HREF="#I_still_don_t_think_all_my_web_pages_are_being_indexed_">I still don't think all my web pages are being indexed.</A>
			<LI><A HREF="#Swish_is_not_spidering_Javascript_links_">Swish is not spidering Javascript links!</A>
			<LI><A HREF="#How_do_I_spider_other_websites_and_combine_it_with_my_own">How do I spider other websites and combine it with my own</A>
		</UL>

		<LI><A HREF="#Searching">Searching</A>
		<UL>

			<LI><A HREF="#How_do_I_limit_searches_to_just_parts_of_the_index_">How do I limit searches to just parts of the index?</A>
			<LI><A HREF="#How_is_ranking_calculated_">How is ranking calculated?</A>
			<LI><A HREF="#How_can_I_limit_searches_to_the_title_body_or_comment_">How can I limit searches to the title, body, or comment?</A>
			<LI><A HREF="#I_can_t_limit_searches_to_title_body_comment_">I can't limit searches to title/body/comment.</A>
			<LI><A HREF="#I_ve_tried_running_the_included_CGI_script_and_I_get_a_Internal">I've tried running the included CGI script and I get a &quot;Internal</A>
			<LI><A HREF="#When_I_try_to_view_the_swish_cgi_page_I_see_the_contents_of_the">When I try to view the swish.cgi page I see the contents of the</A>
			<LI><A HREF="#How_do_I_make_Swish_e_highlight_words_in_search_results_">How do I make Swish-e highlight words in search results?</A>
			<LI><A HREF="#Do_filters_effect_the_performance_during_search_">Do filters affect the performance during search?</A>
		</UL>

		<LI><A HREF="#I_have_read_the_FAQ_but_I_still_have_questions_about_using_Swish_e_">I have read the FAQ but I still have questions about using Swish-e.</A>
	</UL>

	<LI><A HREF="#Document_Info">Document Info</A>
</UL>

    </div>

    

	    [ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>

<P>
<H1><A NAME="Frequently_Asked_Questions">Frequently Asked Questions</A></H1>
<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H2><A NAME="General_Questions">General Questions</A></H2>
<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="What_is_Swish_e_">What is Swish-e?</A></H3>
<P>
Swish-e is <STRONG>S</STRONG>imple <STRONG>W</STRONG>eb <STRONG>I</STRONG>ndexing <STRONG>S</STRONG>ystem for <STRONG>H</STRONG>umans -
<STRONG>E</STRONG>nhanced. With it, you can quickly and easily index directories of files or
remote web sites and search the generated indexes for words and phrases.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="So_is_Swish_e_a_search_engine_">So, is Swish-e a search engine?</A></H3>
<P>
Well, yes. Probably the most common use of Swish-e is to provide a search
engine for web sites. The Swish-e distribution includes CGI scripts that
you can use to add a <EM>search engine</EM> to your web site. The CGI scripts can be found in the <EM>example</EM> directory of the distribution package. See the <EM>README</EM> file for information about the scripts.

<P>
But Swish-e can also be used to index all sorts of data, such as email
messages, data stored in a relational database management system, XML
documents, or documents such as Word and PDF documents -- or any
combination of those sources at the same time. Searches can be limited to
fields or <EM>MetaNames</EM> within a document, or limited to areas within an HTML document (e.g. body,
title). Programs other than CGI applications can use Swish-e, as well.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Should_I_upgrade_if_I_m_already_running_a_previous_version_of_Swish_e_">Should I upgrade if I'm already running a previous version
of Swish-e?</A></H3>
<P>
A large number of bug fixes, feature additions, and logic corrections were
made in version 2.2. In addition, indexing speed has been drastically
improved (reports of indexing times dropping from four hours to five minutes),
and major parts of the indexing and search parsers have been rewritten.
There are better debugging options, enhanced output formats, more document
metadata (e.g. last modified date, document summary), options for indexing
from external data sources, and faster spidering, just to name a few
changes. (See the CHANGES file for more information.)

<P>
Since so much effort has gone into version 2.2, support for previous
versions will probably be limited.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Are_there_binary_distributions_available_for_Swish_e_on_platform_foo_">Are there binary distributions available for Swish-e on platform foo?</A></H3>
<P>
Foo? Well, yes, there are some binary distributions available. Please see
the Swish-e web site for a list at <A
HREF="http://swish-e.org/">http://swish-e.org/</A>.

<P>
In general, it is recommended that you build Swish-e from source, if
possible.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Do_I_need_to_reindex_my_site_each_time_I_upgrade_to_a_new_Swish_e_version_">Do I need to reindex my site each time I upgrade to a new Swish-e
version?</A></H3>
<P>
Reindexing might not always be strictly necessary, but since you can't easily
tell whether the index format has changed between versions, it is a good rule to reindex after every upgrade.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="What_s_the_advantage_of_using_the_libxml2_library_for_parsing_HTML_">What's the advantage of using the libxml2 library for parsing HTML?</A></H3>
<P>
Swish-e may be linked with libxml2, a library for working with HTML and XML
documents. Swish-e can use libxml2 for parsing HTML and XML documents.

<P>
The libxml2 parser is a better parser than Swish-e's built-in HTML parser.
It offers more features, and it does a much better job of extracting
the text from a web page. In addition, you can use the
<CODE>ParserWarningLevel</CODE> configuration setting to find structural errors in your documents that
could (and would with Swish-e's HTML parser) cause documents to be indexed
incorrectly.
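
<P>
As a minimal sketch, a configuration along these lines selects the libxml2
HTML parser and turns on its warnings. It assumes Swish-e was built with
libxml2. The file extensions and the warning level are illustrative
choices, so check SWISH-CONFIG for the values your version accepts.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    # Parse .html and .htm files with the libxml2 HTML parser
    IndexContents HTML2 .html .htm

    # Ask the libxml2 parser to report problems it finds while indexing
    ParserWarningLevel 9</pre>
        </td>
	    
      </tr>
    </table>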

<P>
Libxml2 is not required, but is strongly recommended for parsing HTML
documents. It's also recommended for parsing XML, as it offers many more
features than the internal Expat xml.c parser.

<P>
The internal HTML parser will receive only limited support, and it has a number
of known bugs. For example, HTML entities may not always be converted correctly,
and entities in properties are not converted at all. The internal parser tends to
get confused by invalid HTML, whereas the libxml2 parser copes with it
more often. The libxml2 parser is also better at detecting document
structure.

<P>
If you are using the Perl module (the Perl interface to the Swish-e C library)
you may wish to build two versions of Swish-e, one with the libxml2 library
linked into the binary and one without, and build the Perl module against
the library without the libxml2 code. This saves space in the library.
Hopefully, the library will someday soon be split into indexing and
searching code (volunteers welcome).

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Does_Swish_e_include_a_CGI_interface_">Does Swish-e include a CGI interface?</A></H3>
<P>
Yes. Kind of.

<P>
There are two example CGI scripts included, swish.cgi and search.cgi. Both
are installed at <EM>$prefix/lib/swish-e</EM>.

<P>
Both require a bit of work to set up and use. Swish.cgi is probably what
most people will want to use as it contains more features. Search.cgi is
for those who want to start with a small script and customize it to fit
their needs.

<P>
An example of using swish.cgi is given in the <A HREF="././INSTALL.html">INSTALL</A> man page, and in the swish.cgi documentation. As is often the case, it
will be easier to use if you first read the documentation.

<P>
Please use caution about CGI scripts found on the Internet for use with
Swish-e. Some are not secure.

<P>
The included example CGI scripts were designed with security in mind.
Regardless, you are encouraged to have your local Perl expert review them
(and all other CGI scripts you use) before placing them into production. This
is just a good policy to follow.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_secure_is_Swish_e_">How secure is Swish-e?</A></H3>
<P>
We know of no security issues with using Swish-e. Careful attention has
been paid to common security problems, such as buffer overruns,
when programming Swish-e.

<P>
The most likely security issue with Swish-e is when it is run via a poorly
written CGI interface. This is not limited to CGI scripts written in Perl,
as it's just as easy to write an insecure CGI script in C, Java, PHP, or
Python. A good source of information is included with the Perl
distribution. Type <CODE>perldoc perlsec</CODE> at your local prompt for more information. Another must-read document is
located at
<CODE>http://www.w3.org/Security/faq/wwwsf4.html</CODE>.

<P>
Note that there are many <EM>free</EM> yet insecure and poorly written CGI scripts available -- even some designed
for use with Swish-e. Please carefully review any CGI script you use. Free
is not such a good price when you get your server hacked...

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Should_I_run_Swish_e_as_the_superuser_root_">Should I run Swish-e as the superuser (root)?</A></H3>
<P>
No. Never.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="What_files_does_Swish_e_write_">What files does Swish-e write?</A></H3>
<P>
Swish-e writes the index file, of course. This is specified with the
<A HREF="#item_IndexFile">IndexFile</A> configuration directive or by the <CODE>-f</CODE> command line switch.

<P>
The index file is actually a collection of files, but all start with the
file name specified with the <A HREF="#item_IndexFile">IndexFile</A> directive or the <CODE>-f</CODE>
command line switch.

<P>
For example, the file ending in <EM>.prop</EM> contains the document properties.

<P>
When creating the index files Swish-e appends the extension <EM>.temp</EM>
to the index file names. When indexing is complete Swish-e renames the
<EM>.temp</EM> files to the index files specified by <A HREF="#item_IndexFile">IndexFile</A> or <CODE>-f</CODE>. This is done so that existing indexes remain untouched until indexing
completes.
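
<P>
To illustrate, suppose the index is named <EM>index.swish-e</EM> (a
hypothetical name chosen for this sketch, as is the configuration file
name). Based on the description above, the files involved would look
roughly like this. Other files may appear depending on version and options.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -c swish.conf -f index.swish-e

    # While indexing is in progress
    index.swish-e.temp
    index.swish-e.prop.temp

    # After indexing completes
    index.swish-e
    index.swish-e.prop</pre>
        </td>
	    
      </tr>
    </table>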

<P>
Swish-e also writes temporary files in some cases during indexing (e.g. <CODE>-S http</CODE>, <CODE>-S prog</CODE> with filters), when merging, and when using <CODE>-e</CODE>. Temporary files are created with the <CODE>mkstemp(3)</CODE> function
(with 0600 permission on unix-like operating systems).

<P>
The temporary files are created in the directory specified by the
environment variables <CODE>TMPDIR</CODE> and <CODE>TMP</CODE>, checked in that order. If neither is set, Swish-e uses the
configuration setting
<A HREF="././SWISH-CONFIG.html#item_TmpDir">TmpDir</A>. Otherwise, the temporary files are created in the current directory.
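
<P>
For instance, either of the following should direct temporary files to a
dedicated directory. The path <EM>/var/tmp/swish</EM> and the file name
<EM>swish.conf</EM> are placeholders for this sketch. The directory must
already exist and be writable by the user running Swish-e.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    # In the shell, for a single run (TMPDIR is checked first)
    TMPDIR=/var/tmp/swish swish-e -c swish.conf

    # Or in the configuration file, used when TMPDIR and TMP are not set
    TmpDir /var/tmp/swish</pre>
        </td>
	    
      </tr>
    </table>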

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Can_I_index_PDF_and_MS_Word_documents_">Can I index PDF and MS-Word documents?</A></H3>
<P>
Yes, you can use a <EM>Filter</EM> to convert documents while indexing, or you can use a program that
&quot;feeds&quot; documents to Swish-e that have already been converted.
See <CODE>Indexing</CODE> below.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Can_I_index_documents_on_a_web_server_">Can I index documents on a web server?</A></H3>
<P>
Yes, Swish-e provides two ways to index (spider) documents on a web server.
See <CODE>Spidering</CODE> below.

<P>
Swish-e can retrieve documents from a file system or from a remote web
server. It can also execute a program that returns documents back to it.
This program can retrieve documents from a database, filter compressed
documents, convert PDF files, extract data from mail archives, or
spider remote web sites.
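
<P>
These three input methods are selected with the <CODE>-S</CODE> command line
switch. A rough sketch follows. The paths, the URL, the configuration file
names, and the program name <EM>myprogram.pl</EM> are all placeholders.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    # Index files from the local file system (the default method)
    swish-e -S fs -c swish.conf -i /var/www/htdocs

    # Spider a web server with the http method
    swish-e -S http -c http.conf -i http://localhost/index.html

    # Run an external program that feeds documents to Swish-e
    swish-e -S prog -c prog.conf -i ./myprogram.pl</pre>
        </td>
	    
      </tr>
    </table>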

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Can_I_implement_keywords_in_my_documents_">Can I implement keywords in my documents?</A></H3>
<P>
Yes, Swish-e can associate words with <EM>MetaNames</EM> while indexing, and you can limit your searches to these MetaNames while
searching.

<P>
In your HTML files you can put keywords in HTML META tags or in XML blocks.

<P>
Meta names can take two forms in your source documents. As an HTML META tag:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    &lt;META NAME=&quot;DC.subject&quot; CONTENT=&quot;digital libraries&quot;&gt;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Or as an XML-style element (which can also be used in HTML documents when using libxml2):

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    &lt;meta2&gt;
        Some Content
    &lt;/meta2&gt;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Then, to inform Swish-e about the existence of the meta names in your
documents, add or edit the <CODE>MetaNames</CODE> line in your configuration file:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    MetaNames DC.subject meta1 meta2</pre>
        </td>
	    
      </tr>
    </table>
    <P>
When searching you can now limit some or all search terms to that MetaName.
For example, you can look for documents that contain the word apple and also
have either fruit or cooking in the DC.subject meta tag.
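
<P>
Such a query might be written as follows. This sketch assumes the default
index file is being searched. The single quotes keep the shell from
interpreting the parentheses.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w 'apple and DC.subject=(fruit or cooking)'</pre>
        </td>
	    
      </tr>
    </table>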

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="What_are_document_properties_">What are document properties?</A></H3>
<P>
A document property is typically data that describes the document. For
example, properties might include a document's path name, its last modified
date, its title, or its size. Swish-e stores a document's properties in the
index file, and they can be reported back in search results.

<P>
Swish-e also uses properties for sorting. You may sort your results by one
or more properties, in ascending or descending order.

<P>
Properties can also be defined within your documents. HTML and XML files
can specify tags (see previous question) as properties. The <EM>contents</EM> of these tags can then be returned with search results. These user-defined
properties can also be used for sorting search results.

<P>
For example, if you had the following in your documents

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>   &lt;meta name=&quot;creator&quot; content=&quot;accounting department&quot;&gt;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
and <CODE>creator</CODE> is defined as a property (see <A HREF="#item_PropertyNames">PropertyNames</A> in
<A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>), Swish-e can return <CODE>accounting department</CODE>
with the result for that document:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w foo -p creator</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Or for sorting:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w foo -s creator</pre>
        </td>
	    
      </tr>
    </table>
    <P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="What_s_the_difference_between_MetaNames_and_PropertyNames_">What's the difference between MetaNames and PropertyNames?</A></H3>
<P>
MetaNames allow keyword searches in your documents. That is, you can use
MetaNames to restrict searches to just parts of your documents.

<P>
PropertyNames, on the other hand, define text that can be returned with
results, and can be used for sorting.

<P>
Both use <EM>meta tags</EM> found in your documents (as shown in the above two questions) to define the
text you wish to use as a property or meta name.

<P>
You may define a tag as <STRONG>both</STRONG> a property and a meta name. For example:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>   &lt;meta name=&quot;creator&quot; content=&quot;accounting department&quot;&gt;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
placed in your documents and then using configuration settings of:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    PropertyNames creator
    MetaNames creator</pre>
        </td>
	    
      </tr>
    </table>
    <P>
will allow you to limit your searches to documents created by accounting:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w 'foo and creator=(accounting)'</pre>
        </td>
	    
      </tr>
    </table>
    <P>
That will find all documents with the word <CODE>foo</CODE> that also have a creator meta tag that contains the word <CODE>accounting</CODE>. This is using MetaNames.

<P>
And you can also say:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w foo -p creator</pre>
        </td>
	    
      </tr>
    </table>
    <P>
which will return all documents with the word <CODE>foo</CODE>, and each result will also include the contents of the <CODE>creator</CODE> meta tag. This is using properties.

<P>
You can use properties and meta names at the same time, too:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w 'creator=(accounting or marketing)' -p creator -s creator</pre>
        </td>
	    
      </tr>
    </table>
    <P>
That searches only in the <CODE>creator</CODE> <EM>meta name</EM> for either of the words
<CODE>accounting</CODE> or <CODE>marketing</CODE>, prints out the contents of the <CODE>creator</CODE> <EM>property</EM>, and sorts the results by the <CODE>creator</CODE>
<EM>property name</EM>.

<P>
(See also the <CODE>-x</CODE> output format switch in <A HREF="././SWISH-RUN.html">SWISH-RUN</A>.)

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Can_Swish_e_index_multi_byte_characters_">Can Swish-e index multi-byte characters?</A></H3>
<P>
No. This will require much work to change. However, Swish-e works with
eight-bit characters, so many character sets can be used. Note that it
does call the ANSI-C <CODE>tolower()</CODE> function, which depends on
the current locale setting. See <CODE>locale(7)</CODE> for more information.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H2><A NAME="Indexing">Indexing</A></H2>
<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_pass_Swish_e_a_list_of_files_to_index_">How do I pass Swish-e a list of files to index?</A></H3>
<P>
Currently, there is not a configuration directive to include a file that
contains a list of files to index. But, there is a directive to include
another configuration file.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    IncludeConfigFile /path/to/other/config</pre>
        </td>
	    
      </tr>
    </table>
    <P>
And in <CODE>/path/to/other/config</CODE> you can say:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    IndexDir file1 file2 file3 file4 file5 ...
    IndexDir file20 file21 file22</pre>
        </td>
	    
      </tr>
    </table>
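    <P>
If the list of files already lives in a text file, a short script can turn it
into such an included configuration file. The following is only a sketch: the
names <CODE>files.txt</CODE> and <CODE>filelist.conf</CODE> and the sample paths
are assumptions, and it assumes the paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical input: files.txt holds one path per line (no spaces in paths).
printf '%s\n' docs/a.html docs/b.html docs/c.html > files.txt

# Emit one IndexDir line per path into a file that the main
# configuration can pull in with IncludeConfigFile.
: > filelist.conf
while IFS= read -r path; do
    if [ -n "$path" ]; then
        printf 'IndexDir %s\n' "$path" >> filelist.conf
    fi
done < files.txt

cat filelist.conf
```

The main configuration file would then contain an <CODE>IncludeConfigFile</CODE>
line pointing at the generated file.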
    <P>
You may also specify more than one configuration file on the command line:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    ./swish-e -c config_one config_two config_three</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Another option is to create a directory with symbolic links of the files to
index, and index just that directory.
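
<P>
The symbolic link approach could look roughly like this. The directory names
<CODE>docs</CODE> and <CODE>to_index</CODE>, the sample files, and
<CODE>links.conf</CODE> are made up for illustration. Swish-e does not follow
symbolic links unless <CODE>FollowSymLinks</CODE> is enabled, so the sketch
turns it on.

```shell
#!/bin/sh
# Hypothetical layout: two sample files stand in for the real documents.
mkdir -p docs to_index
echo '<html><body>one</body></html>' > docs/one.html
echo '<html><body>two</body></html>' > docs/two.html
printf '%s\n' docs/one.html docs/two.html > files.txt

# Link each listed file into to_index/ using an absolute target so the
# link resolves no matter where it is read from.
while IFS= read -r path; do
    ln -sf "$(pwd)/$path" "to_index/$(basename "$path")"
done < files.txt

# Configuration that indexes only the link directory.
printf 'IndexDir to_index\nFollowSymLinks yes\n' > links.conf
```

Files with the same base name in different directories would collide in this
flat layout; a real script would need to keep part of the directory structure.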

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_does_Swish_e_know_which_parser_to_use_">How does Swish-e know which parser to use?</A></H3>
<P>
Swish-e can parse HTML, XML, and text documents. The parser is chosen by
associating a file extension with a parser using the <A HREF="#item_IndexContents">IndexContents</A>
directive. You may set the default parser with the <A HREF="#item_DefaultContents">DefaultContents</A>
directive. If a document is not assigned a parser it will default to the
HTML parser (HTML2 if built with libxml2).
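
<P>
As an illustration, the script below writes a small configuration fragment that
maps extensions to parsers. The file name <CODE>parsers.conf</CODE> and the
particular extensions are assumptions chosen for the example, not defaults.

```shell
#!/bin/sh
# Write a sample configuration fragment; the extensions listed are
# illustrative assumptions.
cat > parsers.conf <<'EOF'
# Map file extensions to parsers
IndexContents HTML .html .htm .shtml
IndexContents XML  .xml
IndexContents TXT  .txt .log
# Anything without a mapping is treated as plain text
DefaultContents TXT
EOF
```

With a libxml2 build, <CODE>HTML2</CODE> and <CODE>XML2</CODE> could be named
in place of <CODE>HTML</CODE> and <CODE>XML</CODE>.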

<P>
You may use Filters or an external program to convert documents to HTML,
XML, or text.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Can_I_reindex_and_search_at_the_same_time_">Can I reindex and search at the same time?</A></H3>
<P>
Yes. Starting with version 2.2 Swish-e indexes to temporary files, and then
renames the files when indexing is complete. On most systems renames are
atomic. But, since Swish-e also generates more than one file during
indexing there will be a very short period of time between renaming the
various files when the index is out of sync.

<P>
Settings in <EM>src/config.h</EM> control some options related to temporary files, and their use during
indexing.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Can_I_index_phrases_">Can I index phrases?</A></H3>
<P>
Phrases are indexed automatically. To search for a phrase simply place
double quotes around the phrase.

<P>
For example:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w 'free and &quot;fast search engine&quot;'</pre>
        </td>
	    
      </tr>
    </table>
    <P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_can_I_prevent_phrases_from_matching_across_sentences_">How can I prevent phrases from matching across sentences?</A></H3>
<P>
Use the
<A HREF="././SWISH-CONFIG.html#item_BumpPositionCounterCharacters">BumpPositionCounterCharacters</A>
configuration directive.
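
<P>
A minimal sketch of such a setting follows. The choice of characters (period,
question mark, and exclamation mark) and the file name
<CODE>phrases.conf</CODE> are assumptions; check SWISH-CONFIG for how the
directive interacts with your <CODE>WordCharacters</CODE> setting.

```shell
#!/bin/sh
# Assumed boundary characters: each one found in the text bumps the word
# position counter, so a phrase cannot match across it.
cat > phrases.conf <<'EOF'
BumpPositionCounterCharacters .?!
EOF
```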

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Swish_e_isn_t_indexing_a_certain_word_or_phrase_">Swish-e isn't indexing a certain word or phrase.</A></H3>
<P>
There are a number of configuration parameters that control what Swish-e
considers a &quot;word&quot; and it has a debugging feature to help
pinpoint any indexing problems.

<P>
Configuration file directives (<A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>)
<A HREF="#item_WordCharacters">WordCharacters</A>, <A HREF="#item_BeginCharacters">BeginCharacters</A>, <A HREF="#item_EndCharacters">EndCharacters</A>,
<A HREF="#item_IgnoreFirstChar">IgnoreFirstChar</A>, and <A HREF="#item_IgnoreLastChar">IgnoreLastChar</A> are the main settings that Swish-e uses to define a &quot;word&quot;. See <A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A> and
<A HREF="././SWISH-RUN.html">SWISH-RUN</A> for details.

<P>
Swish-e also uses compile-time defaults for many settings. These are
located in the <EM>src/config.h</EM> file.

<P>
The command line arguments <CODE>-k</CODE>, <CODE>-v</CODE> and <CODE>-T</CODE> are useful when debugging these problems. Using <CODE>-T INDEXED_WORDS</CODE> while indexing will display each word as it is indexed. You should specify
one file when using this feature since it can generate a lot of output.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>     ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS</pre>
        </td>
	    
      </tr>
    </table>
    <P>
You may also wish to index a single file that contains the words that are not
being indexed as you expect, and then use -T to output debugging information
about the index. A useful command might be:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    ./swish-e -f index.swish-e -T INDEX_FULL</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Once you see how Swish-e is parsing and indexing your words, you can adjust
the configuration settings mentioned above to control what words are
indexed.

<P>
Another useful command might be:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>     ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS</pre>
        </td>
	    
      </tr>
    </table>
    <P>
This will show the whitespace-delimited words parsed from the document (PARSED_WORDS),
and how those words are split up into separate words for indexing
(INDEXED_WORDS).

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_keep_Swish_e_from_indexing_numbers_">How do I keep Swish-e from indexing numbers?</A></H3>
<P>
Swish-e indexes words as defined by the <A HREF="#item_WordCharacters">WordCharacters</A> setting, as described above. So to avoid indexing numbers you simply remove
digits from the <A HREF="#item_WordCharacters">WordCharacters</A> setting.
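
<P>
For instance, a configuration limited to unaccented lowercase letters might be
written as below. This is a sketch under the assumption that plain ASCII
letters are all you need; the file name <CODE>nodigits.conf</CODE> is made up,
and <CODE>BeginCharacters</CODE> and <CODE>EndCharacters</CODE> are narrowed
the same way only to keep the three settings consistent.

```shell
#!/bin/sh
# Word definition without digits (assumes plain ASCII letters suffice).
cat > nodigits.conf <<'EOF'
WordCharacters abcdefghijklmnopqrstuvwxyz
BeginCharacters abcdefghijklmnopqrstuvwxyz
EndCharacters abcdefghijklmnopqrstuvwxyz
EOF

# Sanity check: report whether any digit is left in the settings.
if grep -q '[0-9]' nodigits.conf; then
    echo "digits still present"
else
    echo "no digits configured"
fi
```

Note that characters outside <CODE>WordCharacters</CODE> split words, so a
string that mixes letters and digits would be indexed as its letter-only parts.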

<P>
There are also some settings in <EM>src/config.h</EM> that control what &quot;words&quot; are indexed. You can configure Swish-e to
never index words that are all digits, vowels, or consonants, or that
contain more than some consecutive number of digits, vowels, or consonants.
In general, you won't need to change these settings.

<P>
Also, there's an experimental feature called <A HREF="#item_IgnoreNumberChars">IgnoreNumberChars</A>
which allows you to define a set of characters that describe a number. If a
word is made up of <STRONG>only</STRONG> those characters it will not be indexed.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Swish_e_crashes_and_burns_on_a_certain_file_What_can_I_do_">Swish-e crashes and burns on a certain file. What can I do?</A></H3>
<P>
This shouldn't happen. If it does please post to the Swish-e discussion
list the details so it can be reproduced by the developers.

<P>
In the meantime, you can use a <A HREF="#item_FileRules">FileRules</A> directive to exclude the particular file by file name, path name, or title.
If there are serious problems in indexing certain types of files, they may
not have valid text in them (they may be binary files, for instance). You
can use NoContents so that only the path names of those files, not their contents, are indexed.
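
<P>
A configuration fragment along these lines could be used. The file name
<CODE>exclude.conf</CODE>, the document name, the directory, and the list of
suffixes are all hypothetical.

```shell
#!/bin/sh
# Hypothetical exclusions: one troublesome file, one directory, and a few
# binary file types whose contents should not be read.
cat > exclude.conf <<'EOF'
FileRules filename is broken-report.html
FileRules pathname contains /drafts/
NoContents .gif .jpg .png .exe
EOF
```

The patterns given to <CODE>FileRules</CODE> are regular expressions, so
characters such as the period match more than themselves unless escaped.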

<P>
Swish-e will issue a warning if an embedded null character is found in a
document. This warning will be an indication that you are trying to index
binary data. If you need to index binary files try to find a program that
will extract out the text (e.g. <CODE>strings(1)</CODE>,
<CODE>catdoc(1)</CODE>, <CODE>pdftotext(1)</CODE>).

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_to_I_prevent_indexing_of_some_documents_">How do I prevent indexing of some documents?</A></H3>
<P>
When using the file system to index your files you can use the
<A HREF="#item_FileRules">FileRules</A> directive. Other than <CODE>FileRules title</CODE>, <A HREF="#item_FileRules">FileRules</A>
only works with the file system (<CODE>-S fs</CODE>) indexing method, not with
<CODE>-S prog</CODE> or <CODE>-S http</CODE>.

<P>
If you are spidering, use a <EM>robots.txt</EM> file in your document root. This is a standard way to exclude files from
search engines, and is fully supported by Swish-e. See <A
HREF="http://www.robotstxt.org/">http://www.robotstxt.org/</A>.

<P>
You can also modify the <EM>spider.pl</EM> spider perl program to skip, index content only, or spider only listed web
pages. Type <CODE>perldoc spider.pl</CODE>
in the <CODE>prog-bin</CODE> directory for details.

<P>
If using the libxml2 library for parsing HTML, you may also use the Meta
Robots Exclusion in your documents:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    &lt;meta name=&quot;robots&quot; content=&quot;noindex&quot;&gt;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
See the <A HREF="././SWISH-CONFIG.html#item_obeyRobotsNoIndex">obeyRobotsNoIndex</A> directive.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_prevent_indexing_parts_of_a_document_">How do I prevent indexing parts of a document?</A></H3>
<P>
If you want to prevent Swish-e from indexing a common header, footer, or navigation
bar, and you are using libxml2 for parsing HTML, you may place a fake
HTML tag around the text you wish to ignore and name that tag in the
<A HREF="#item_IgnoreMetaTags">IgnoreMetaTags</A> directive. Because the fake tag is invalid HTML, this will generate an error message if the <CODE>ParserWarningLevel</CODE> is set.

<P>
<A HREF="#item_IgnoreMetaTags">IgnoreMetaTags</A> works with XML documents (and HTML documents when using libxml2 as the
parser), but not with documents parsed by the text (TXT) parser.
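
<P>
To make the fake-tag idea concrete, the sketch below writes a sample page and
a matching configuration. The tag name <CODE>ignore</CODE> and the file names
<CODE>page.html</CODE> and <CODE>ignore.conf</CODE> are invented for the
example, and a libxml2 build is assumed.

```shell
#!/bin/sh
# Sample page: the navigation bar is wrapped in a made-up <ignore> tag.
cat > page.html <<'EOF'
<html><head><title>Sample</title></head><body>
<ignore>Home | Products | Contact</ignore>
<p>Real page content.</p>
</body></html>
EOF

# Matching configuration: parse .html with the libxml2 HTML parser and
# skip everything inside <ignore> ... </ignore>.
cat > ignore.conf <<'EOF'
IndexContents HTML2 .html
IgnoreMetaTags ignore
EOF
```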

<P>
If you are using the libxml2 parser (HTML2 and XML2) then you can use the
following comments in your documents to prevent indexing:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>       &lt;!-- SwishCommand noindex --&gt;
       &lt;!-- SwishCommand index --&gt;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
These shorter forms may also be used:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>       &lt;!-- noindex --&gt;
       &lt;!-- index --&gt;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_modify_the_path_or_URL_of_the_indexed_documents_">How do I modify the path or URL of the indexed documents?</A></H3>
<P>
Use the <A HREF="#item_ReplaceRules">ReplaceRules</A> configuration directive to rewrite path names and URLs. If you are using the <CODE>-S prog</CODE> input method you may set the path to any string.
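
<P>
As a sketch, the fragment below turns local file paths into URLs by removing
a document root and prepending a host name. The directory
<CODE>/var/www/htdocs/</CODE>, the host <CODE>www.example.com</CODE>, and the
file name <CODE>rewrite.conf</CODE> are assumptions. The <CODE>preview</CODE>
helper is not part of Swish-e; it only imitates the two rules with
<CODE>sed</CODE> so the expected result can be checked by hand.

```shell
#!/bin/sh
# Hypothetical rewrite: strip the document root, then prepend the site URL.
cat > rewrite.conf <<'EOF'
ReplaceRules remove /var/www/htdocs/
ReplaceRules prepend http://www.example.com/
EOF

# Stand-in that mimics the two rules above with sed (illustration only).
preview() {
    printf '%s\n' "$1" |
        sed -e 's#/var/www/htdocs/##' -e 's#^#http://www.example.com/#'
}

preview /var/www/htdocs/manual/index.html
```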

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_can_I_index_data_from_a_database_">How can I index data from a database?</A></H3>
<P>
Use the &quot;prog&quot; document source method of indexing. Write a
program to extract out the data from your database, and format it as XML,
HTML, or text. See the examples in the <CODE>prog-bin</CODE> directory, and the next question.
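
<P>
The general shape of such a program is sketched below in shell. The
<CODE>rows</CODE> function is a stand-in for a real database query, and the
record layout and the <CODE>record/</CODE> path prefix are invented. Each
document is written as headers, a blank line, and then exactly
<CODE>Content-Length</CODE> bytes of content.

```shell
#!/bin/sh
# Hypothetical stand-in for a database query: one "id|title|body" row per line.
rows() {
    printf '%s\n' \
        '1|First record|Text of the first record' \
        '2|Second record|Text of the second record'
}

# Emit each row as an HTML document: headers, a blank line, then exactly
# Content-Length bytes of content.
emit_docs() {
    rows | while IFS='|' read -r id title body; do
        doc="<html><head><title>$title</title></head><body>$body</body></html>"
        size=$(printf '%s' "$doc" | wc -c | tr -d ' ')
        printf 'Path-Name: record/%s\n' "$id"
        printf 'Content-Length: %s\n' "$size"
        printf 'Document-Type: HTML\n'
        printf '\n'
        printf '%s' "$doc"
    done
}

emit_docs
```

Saved under a name such as <CODE>dbdump.sh</CODE> (hypothetical), the script
would be passed to Swish-e with <CODE>-S prog -i ./dbdump.sh</CODE>.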

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_index_my_PDF_Word_and_compressed_documents_">How do I index my PDF, Word, and compressed documents?</A></H3>
<P>
Swish-e can internally only parse HTML, XML and TXT (text) files by
default, but can make use of <EM>filters</EM> that will convert other types of files such as MS Word documents, PDF, or
gzipped files into one of the file types that Swish-e understands.

<P>
Please see <A HREF="././SWISH-CONFIG.html#Document_Filter_Directives">SWISH-CONFIG</A>
and the examples in the <EM>filters</EM> and <EM>filter-bin</EM> directory for more information.

<P>
See the next question to learn about the filtering options with Swish-e.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_filter_documents_">How do I filter documents?</A></H3>
<P>
The term &quot;filter&quot; in Swish-e means the conversion of a document
of one type (one that Swish-e cannot index directly) into a type that
Swish-e can index, namely HTML, plain text, or XML. To add to the
confusion, there are a number of ways to accomplish this in Swish-e. So
here's a bit of background.

<P>
The <A HREF="././SWISH-CONFIG.html#Document_Filter_Directives">FileFilter</A> directive was added to Swish-e first. This feature allows you to specify a
program to run for documents that match a given file extension. For
example, to filter PDF files (files that end in .pdf) you can specify the
configuration setting of:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    FileFilter .pdf pdftotext   &quot;'%p' -&quot;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
which says to run the program &quot;pdftotext&quot; passing it the pathname
of the file (%p) and a dash (which tells pdftotext to output to stdout).
Then for each .pdf file Swish-e runs this program and reads in the filtered
document from the output from the filter program.

<P>
This has the advantage that it is easy to set up -- a single line in the
config file is all that is needed to add the filter into Swish-e. But it
also has a number of problems. For example, if you use a Perl script to do
your filtering it can be very slow since the filter script must be run (and
thus compiled) for each processed document. This is exacerbated when using
the -S http method since the -S http method also uses a Perl script that is
run for every URL fetched. Also, when using -S prog method of input
(reading input from a program) using FileFilter means that Swish-e must
first read the file in from the external program and then write the file
out to a temporary file before running the filter.

<P>
With -S prog it makes much more sense to filter the document in the program
that is fetching the documents than to have swish-e read the file into
memory, write it to a temporary file and then run an external program.

<P>
The Swish-e distribution contains a couple of example -S prog programs.  <EM>spider.pl</EM> is a reasonably full-featured web spider that offers many more options than
the -S http method. And it is much faster than running -S http, too.

<P>
The spider has a perl configuration file, which means you can add
programming logic right into the configuration file without editing the
spider program. One bit of logic that is provided in the spider's
configuration file is a &quot;call-back&quot; function that allows you to
filter the content. In other words, before the spider passes a fetched web
document to swish for indexing the spider can call a simple subroutine in
the spider's configuration file passing the document and its content type.
The subroutine can then look at the content type and decide if the document
needs to be filtered.

<P>
For example, when processing a document of type
&quot;application/msword&quot; the call-back subroutine might call the
doc2txt.pm perl module, and a document of type &quot;application/pdf&quot;
could use the pdf2html.pm module. The <EM>prog-bin/SwishSpiderConfig.pl</EM> file shows this usage.

<P>
This system works reasonably well, but also means that more work is
required to set up the filters. First, you must explicitly check for
specific content types and then call the appropriate Perl module, and
second, you have to know how each module must be called and how each
returns the possibly modified content.

<P>
In comes SWISH::Filter.

<P>
To make things easier the SWISH::Filter Perl module was created. The idea
of this module is that there is one interface used to filter all types of
documents. So instead of checking for specific types of content you just
pass the content type and the document to the SWISH::Filter module and it
returns a new content type and document if it was filtered. The filters
that do the actual work are designed with a standard interface and work
like filter &quot;plug-ins&quot;. Adding new filters means just downloading
the filter to a directory and no changes are needed to the spider's
configuration file. Download a filter for Postscript and the next time you run
indexing your Postscript files will be indexed.

<P>
Since the filters are standardized, hopefully when you have the need to
filter documents of a specific type there will already be a filter ready
for your use.

<P>
Now, note that the Perl modules may or may not do the actual conversion of
a document. For example, the PDF conversion module calls the pdfinfo and
pdftotext programs. Those programs (part of the Xpdf package) must be
installed separately from the filters.

<P>
The SwishSpiderConfig.pl example spider configuration file shows how to use
the SWISH::Filter module for filtering. This file is installed at
$prefix/share/doc/swish-e/examples/prog-bin, where <CODE>$prefix</CODE> is
normally /usr/local on unix-type machines.

<P>
The SWISH::Filter method of filtering can also be used with the -S http
method of indexing. By default the <EM>swishspider</EM> program (the Perl helper script that fetches documents from the web) will
attempt to use the SWISH::Filter module if it can be found in Perl's library
path. This path is set automatically for spider.pl but not for swishspider
(because it would slow down a method that's already slow and spider.pl is
recommended over the -S http method).

<P>
Therefore, all that's required to use this system with -S http is setting
the <CODE>@INC</CODE> array to point to the filter directory.

<P>
For example, if the swish-e distribution was unpacked into ~/swish-e:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>   PERL5LIB=~/swish-e/filters swish-e -c conf -S http</pre>
        </td>
	    
      </tr>
    </table>
    <P>
will allow the -S http method to make use of the SWISH::Filter module.

<P>
Note that if you are not using the SWISH::Filter module you may wish to
edit the <EM>swishspider</EM> program and disable the use of the SWISH::Filter module using this setting:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    use constant USE_FILTERS  =&gt; 0;  # disable SWISH::Filter</pre>
        </td>
	    
      </tr>
    </table>
    <P>
This prevents the program from attempting to use the SWISH::Filter module
for every non-text URL that is fetched. Of course, if you are concerned
with indexing speed you should be using the -S prog method with spider.pl
instead of -S http.

<P>
If you are not spidering, but you still want to make use of the
SWISH::Filter module for filtering you can use the DirTree.pl program (in
$prefix/lib/swish-e). This is a simple program that traverses the file
system and uses SWISH::Filter for filtering.

<P>
Here are two examples of how to run a filter program, one using Swish-e's
<A HREF="#item_FileFilter">FileFilter</A> directive, another using a <A HREF="#item_prog">prog</A> input method program. See the <EM>SwishSpiderConfig.pl</EM> file for an example of using the SWISH::Filter module.

<P>
These filters simply use the program <CODE>/bin/cat</CODE> as a filter and only index .html files.

<P>
First, using the <A HREF="#item_FileFilter">FileFilter</A> method, here's the entire configuration file (swish.conf):

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    IndexDir .
    IndexOnly .html
    FileFilter .html &quot;/bin/cat&quot;   &quot;'%p'&quot;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
and index with the command

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -c swish.conf -v 1</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Now, the same thing using the <CODE>-S prog</CODE> document source input method and a Perl program called catfilter.pl. You
can see that it's much more work than using the <A HREF="#item_FileFilter">FileFilter</A> method above, but it provides a place to do additional processing. In this
example, the <A HREF="#item_prog">prog</A> method is only slightly faster. But if you needed a Perl script to run as a
FileFilter then <A HREF="#item_prog">prog</A> will be significantly faster.

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    #!/usr/local/bin/perl -w
    use strict;
    use File::Find;  # for recursing a directory tree

    $/ = undef;      # slurp mode: read each filtered document in one go
    find(
        { wanted =&gt; \&amp;wanted, no_chdir =&gt; 1, },
        '.',
    );

    sub wanted {
        return if -d;
        return unless /\.html$/;

        my $mtime  = (stat)[9];

        # Fork; the child execs the filter and its output is read from FH
        my $child = open( FH, '-|' );
        die &quot;Failed to fork $!&quot; unless defined $child;
        exec '/bin/cat', $_ unless $child;

        my $content = &lt;FH&gt;;
        close FH;
        my $size = length $content;

        # Headers, then a blank line, then the document itself
        print &lt;&lt;EOF;
    Content-Length: $size
    Last-Mtime: $mtime
    Path-Name: $_

    EOF

        # FH has already been read to the end, so print the saved content
        print $content;
    }</pre>
        </td>
	    
      </tr>
    </table>
    <P>
And index with the command:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -S prog -i ./catfilter.pl -v 1</pre>
        </td>
	    
      </tr>
    </table>
    <P>
This example will probably not work under Windows due to the '-|' open. A
simple piped open may work just as well:

<P>
That is, replace:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    my $child = open( FH, '-|' );
    die &quot;Failed to fork $!&quot; unless defined $child;
    exec '/bin/cat', $_ unless $child;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
with this:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    open( FH, &quot;/bin/cat $_ |&quot; ) or die $!;</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Perl will try to avoid running the command through the shell if meta
characters are not passed to the open. See <CODE>perldoc -f open</CODE> for more information.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Eh_but_I_just_want_to_know_how_to_index_PDF_documents_">Eh, but I just want to know how to index PDF documents!</A></H3>
<P>
See the examples in the <EM>conf</EM> directory and the comments in the <EM>SwishSpiderConfig.pl</EM> file.

<P>
See the previous question for the details on filtering. The method you
decide to use will depend on how fast you want to index, and your comfort
level with using Perl modules.

<P>
Regardless of the filtering method you use you will need to install the
Xpdf packages available from <A
HREF="http://www.foolabs.com/xpdf/">http://www.foolabs.com/xpdf/</A>.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="I_m_using_Windows_and_can_t_get_Filters_or_the_prog_input_method_to_work_">I'm using Windows and can't get Filters or the prog input method
to work!</A></H3>
<P>
Both the <CODE>-S prog</CODE> input method and filters use the <CODE>popen()</CODE> system call to run the external program. If your external program is, for
example, a perl script, you have to tell Swish-e to run perl, instead of
the script. Swish-e will convert forward slashes to backslashes when
running under Windows.

<P>
For example, you would need to specify the path to perl as (assuming this
is where perl is on your system):

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    IndexDir e:/perl/bin/perl.exe</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Or run a filter like:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    FileFilter .foo e:/perl/bin/perl.exe 'myscript.pl &quot;%p&quot;'</pre>
        </td>
	    
      </tr>
    </table>
    <P>
It's often easier to just install Linux.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_index_non_English_words_">How do I index non-English words?</A></H3>
<P>
Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1
character set, and includes many non-English letters (and symbols). As long
as they are listed in <A HREF="#item_WordCharacters">WordCharacters</A> they will be indexed.

<P>
Actually, you probably can index any 8-bit character set, as long as you
don't mix character sets in the same index and don't use libxml2 for
parsing (see below).

<P>
The <A HREF="#item_TranslateCharacters">TranslateCharacters</A> directive (<A HREF="././SWISH-CONFIG.html">SWISH-CONFIG</A>) can translate characters while indexing and searching. You may specify
the mapping of one character to another character with the
<A HREF="#item_TranslateCharacters">TranslateCharacters</A> directive.

<P>
<CODE>TranslateCharacters :ascii7:</CODE> is a predefined set of characters that will translate eight-bit characters
to ascii7 characters. Using the
<CODE>:ascii7:</CODE> rule will, for example, translate &quot;&Auml;&auml;&ccedil;&quot; to &quot;aac&quot;. This
means: searching &quot;&Ccedil;elik&quot;, &quot;&ccedil;elik&quot; or &quot;celik&quot;
will all match the same word.

<P>
Note: When using libxml2 for parsing, parsed documents are converted
internally (within libxml2) to UTF-8. This is converted to ISO 8859-1
Latin-1 when indexing. In cases where a string cannot be converted from
UTF-8 to ISO 8859-1 (because it contains non 8859-1 characters), the string
will be sent to Swish-e in UTF-8 encoding. This will result in some words
being indexed incorrectly. Setting <CODE>ParserWarningLevel</CODE> to 1 or more will display warnings when UTF-8 to 8859-1 conversion fails.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Can_I_add_remove_files_from_an_index_">Can I add/remove files from an index?</A></H3>
<P>
Try building swish-e with the <CODE>--enable-incremental</CODE> option.

<P>
The rest of this FAQ applies to the default swish-e format.

<P>
Swish-e currently has no way to add or remove items from its index. But,
Swish-e indexes so quickly that it's often possible to reindex the entire
document set when a file needs to be added, modified or removed. If you are
spidering a remote site, consider caching compressed copies of the documents locally.

<P>
Incremental additions can be handled in a couple of ways, depending on your
situation. It's probably easiest to create one main index every night (or
every week), and then create an index of just the new files between main
indexing jobs and use the <CODE>-f</CODE> option to pass both indexes to Swish-e while searching.

<P>
You can merge the indexes into one index (instead of using -f), but it's
not clear that this has any advantage over searching multiple indexes.

<P>
How does one create the incremental index?

<P>
One method is by using the <CODE>-N</CODE> switch to pass a file path to Swish-e when indexing. It will only index
files that have a last modification date <CODE>newer</CODE> than the file supplied with the <CODE>-N</CODE> switch.

<P>
This option has the disadvantage that Swish-e must process every file in
every directory as if it were going to be indexed (the <CODE>-N</CODE> test
is done last, right before indexing of the file contents begins and after all
other tests on the file have been completed), all just to find a few
new files.

<P>
Also, if you use the Swish-e index file as the file passed to <CODE>-N</CODE> there may be files that were added after indexing was started, but before
the index file was written. This could result in a file not being added to
the index.
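<P>
A minimal sketch of one way to narrow that window, assuming a dedicated timestamp file is used as the <CODE>-N</CODE> reference instead of the index itself. The stamp is taken <EM>before</EM> indexing starts and only promoted once the run has finished. The scratch directory, the file names <CODE>.last-index</CODE> and <CODE>.this-run</CODE>, and the config name are hypothetical. <CODE>find -newer</CODE> is used only to preview which files are newer than the marker.

```shell
#!/bin/sh
# Sketch: use a marker file, not the index itself, as the -N reference.
# All paths here are hypothetical and live in a scratch directory.
work=$(mktemp -d)
docs="$work/docs"
marker="$work/.last-index"
mkdir -p "$docs"

# Simulate one old document, an older marker, and one document added since.
touch -t 200001010000 "$docs/old.html"
touch -t 200101010000 "$marker"
echo "new page" > "$docs/new.html"

# Record the start time of this run before any file is examined.
touch "$work/.this-run"

# Preview the files that are newer than the marker.
changed=$(find "$docs" -type f -newer "$marker" | sort)
echo "$changed"

# The incremental run itself would look something like this (not executed here):
#   swish-e -c swish.conf -N "$marker" -f incremental.index
# After it succeeds, promote the start-time stamp so the next run
# picks up anything that arrived while indexing was in progress.
mv "$work/.this-run" "$marker"
```

Because the stamp predates the run, a file added during indexing is examined again next time. That trades a possible duplicate for a possible omission.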

<P>
Another option is to maintain a parallel directory tree that contains
symlinks pointing to the main files. When a file is added to (or changed in)
the main directory tree, create a symlink to the real file in the
parallel directory tree. Then just index the symlink directory to generate
the incremental index.

<P>
This option has the disadvantage that whatever program creates the new
files must also create the symlinks. But indexing
is quite fast, since Swish-e only has to look at the files that need to be
indexed. When you run a full index, simply unlink (delete) all the
symlinks.
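<P>
The symlink approach might be sketched as follows. The directory names and the helper functions <CODE>queue_for_index</CODE> and <CODE>clear_pending</CODE> are hypothetical. Swish-e may also need to be told to follow symlinks (see the FollowSymLinks directive in SWISH-CONFIG), so treat this as an outline rather than a tested recipe.

```shell
#!/bin/sh
# Sketch of the parallel symlink tree; directory names are hypothetical.
work=$(mktemp -d)
main="$work/main"
pending="$work/pending"
mkdir -p "$main/sales" "$pending"

# queue_for_index FILE: link a new or changed file into the pending tree,
# mirroring its path relative to the main tree.
queue_for_index() {
    rel=${1#"$main"/}
    mkdir -p "$pending/$(dirname "$rel")"
    ln -sf "$1" "$pending/$rel"
}

# clear_pending: remove every symlink, as done after a full reindex.
clear_pending() {
    find "$pending" -type l -exec rm -f {} +
}

echo "quarterly report" > "$main/sales/report.html"
queue_for_index "$main/sales/report.html"
queued=$(( $(find "$pending" -type l | wc -l) ))

# Incremental run over the symlinks only (not executed here):
#   swish-e -c swish.conf -i "$pending" -f incremental.index
clear_pending
remaining=$(( $(find "$pending" -type l | wc -l) ))
echo "queued=$queued remaining=$remaining"
```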

<P>
With both of these methods a file could end up in both indexes
or be left out of an index entirely. File locks while indexing and
hash lookups during searches can help prevent these problems.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="I_run_out_of_memory_trying_to_index_my_files_">I run out of memory trying to index my files.</A></H3>
<P>
It's true that indexing can take up a lot of memory! Swish-e is extremely
fast at indexing, but that comes at the cost of memory.

<P>
The best answer is to install more memory.

<P>
Another option is to use the <CODE>-e</CODE> switch. This will require less memory, but indexing will take longer as not
all data will be stored in memory while indexing. How much less memory and
how much more time depends on the documents you are indexing, and the
hardware that you are using.

<P>
Here's an example of indexing all .html files in /usr/doc on Linux. This
first example is <EM>without</EM>  <CODE>-e</CODE> and used about 84M of memory:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    270279 unique words indexed.
    23841 files indexed.  177640166 total bytes.
    Elapsed time: 00:04:45 CPU time: 00:03:19</pre>
        </td>
	    
      </tr>
    </table>
    <P>
This is <EM>with</EM>  <CODE>-e</CODE>, and used about 26M of memory:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    270279 unique words indexed.
    23841 files indexed.  177640166 total bytes.
    Elapsed time: 00:06:43 CPU time: 00:04:12</pre>
        </td>
	    
      </tr>
    </table>
    <P>
You can also build a number of smaller indexes and then merge them together with <CODE>-M</CODE>. Using <CODE>-e</CODE> while merging will save memory.

<P>
Finally, if you do build a number of smaller indexes, you can specify more
than one index when searching by using the <CODE>-f</CODE> switch. Sorting large result sets by a property will be slower when
specifying multiple index files while searching.
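<P>
As a rough sketch, the split-then-merge workflow could be planned like this. The function only prints the commands instead of running them. The directory names, index names and the helper <CODE>plan_split_index</CODE> are hypothetical, and the argument order shown for <CODE>-M</CODE> (input indexes first, output index last) should be confirmed against SWISH-RUN.

```shell
#!/bin/sh
# Sketch: print (not run) the commands for a split-then-merge workflow.
# Directory and index names are hypothetical.
plan_split_index() {
    parts=""
    for dir in "$@"; do
        name=$(basename "$dir")
        # One economy-mode indexing job per directory.
        echo "swish-e -e -i $dir -f $name.index"
        parts="$parts $name.index"
    done
    # Merge: input indexes followed by the output index.
    echo "swish-e -e -M$parts merged.index"
    # Alternative to merging: search all parts at once with -f.
    echo "swish-e -w foo -f$parts"
}

plan=$(plan_split_index /usr/doc/a /usr/doc/b)
echo "$plan"
```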

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="_too_many_open_files_when_indexing_with_e_option">&quot;too many open files&quot; when indexing with -e option</A></H3>
<P>
Some platforms report &quot;too many open files&quot; when using the -e
economy option. The -e feature uses many temporary files (something like
377) plus the index files and this may exceed your system's limits.

<P>
Depending on your platform you may need to set &quot;ulimit&quot; or
&quot;unlimit&quot;.

<P>
For example, under Linux bash shell:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>  $ ulimit -n 1024</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Or, under an old Sparc:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>  % unlimit openfiles</pre>
        </td>
	    
      </tr>
    </table>
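<P>
A wrapper script can check the limit before starting an economy-mode run. This is a minimal sketch: the threshold of 512 is an assumed margin above the roughly 377 temporary files mentioned above, not a documented requirement, and raising the limit can still fail if the hard limit is lower.

```shell
#!/bin/sh
# Sketch: raise the open-file limit only when it is too low for -e.
# 512 is an assumed margin above the roughly 377 temporary files.
needed=512
current=$(ulimit -n)

if [ "$current" = "unlimited" ] || [ "$current" -ge "$needed" ]; then
    status="limit $current is sufficient"
elif ulimit -n "$needed" 2>/dev/null; then
    status="raised limit from $current to $needed"
else
    status="could not raise limit above $current; ask your system administrator"
fi
echo "$status"
```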
    <P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="My_system_admin_says_Swish_e_uses_too_much_of_the_CPU_">My system admin says Swish-e uses too much of the CPU!</A></H3>
<P>
That's a good thing! That expensive CPU is supposed to be busy.

<P>
Indexing takes a lot of work. To make indexing fast, much of the work is
done in memory, which reduces the amount of time Swish-e spends waiting on I/O.
But there are two things you can try:

<P>
The <CODE>-e</CODE> option will run Swish-e in economy mode, which uses the disk to store data
while indexing. This makes Swish-e run somewhat slower, but also uses less
memory. Since it is writing to disk more often it will be spending more
time waiting on I/O and less time in CPU. Maybe.

<P>
The other thing is to simply lower the priority of the job using the
<CODE>nice(1)</CODE> command:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    /bin/nice -15 swish-e -c search.conf</pre>
        </td>
	    
      </tr>
    </table>
    <P>
If you are concerned about search time, make sure you are using the -b and -m
switches to return only one page of results at a time. If you know that your result sets
will be large, and that many pages of the same query will often be requested,
it may be smarter to request all the documents on the first request and then cache the
results to a temporary file. The perl module File::Cache makes this very
simple to accomplish.
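<P>
The caching idea can be sketched in shell as well. Everything here is illustrative: the cache directory, the helpers <CODE>cache_file</CODE> and <CODE>page_from_cache</CODE>, and the placeholder result lines are hypothetical. Real swish-e output also contains header lines that a front end would need to skip.

```shell
#!/bin/sh
# Sketch: cache a full result list once, then serve pages from the cache.
# The cache location and the fake result lines are hypothetical.
cache_dir=$(mktemp -d)

# cache_file QUERY: derive a stable file name from the query text.
cache_file() {
    printf '%s/%s.results' "$cache_dir" "$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
}

# page_from_cache FILE BEGIN MAX: print MAX lines starting at line BEGIN,
# mirroring the roles of -b (begin) and -m (max) on the command line.
page_from_cache() {
    sed -n "${2},$(( $2 + $3 - 1 ))p" "$1"
}

query='foo AND department=sales'
file=$(cache_file "$query")

# First request: run the full search once and keep the output.
# A real front end would run something like:  swish-e -w "$query" > "$file"
# Here 100 placeholder result lines stand in for that output.
i=1
while [ "$i" -le 100 ]; do
    echo "result $i"
    i=$(( i + 1 ))
done > "$file"

# Later requests for page 2 (results 11-20) read only the cache.
page2=$(page_from_cache "$file" 11 10)
echo "$page2"
```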

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H2><A NAME="Spidering">Spidering</A></H2>
<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_can_I_index_documents_on_a_web_server_">How can I index documents on a web server?</A></H3>
<P>
If possible, use the file system method <CODE>-S fs</CODE> of indexing to index documents in your web area of the file system. This
avoids the overhead of spidering a web server and is much faster. (<CODE>-S fs</CODE> is the default method if <CODE>-S</CODE> is not specified).

<P>
If this is impossible (the web server is not local, or documents are
dynamically generated), Swish-e provides two methods of spidering. First,
it includes the http method of indexing <CODE>-S http</CODE>. A number of special configuration directives are available that control
spidering (see <A HREF="././SWISH-CONFIG.html#Directives_for_the_HTTP_Access_Method_Only">Directives for the HTTP Access Method Only</A>). A perl helper script (swishspider) is included in the <EM>src</EM> directory to assist with spidering web servers. There are example
configurations for spidering in the <EM>conf</EM> directory.

<P>
As of Swish-e 2.2, there's a general purpose &quot;prog&quot; document
source where a program can feed documents to it for indexing. A number of
example programs can be found in the <CODE>prog-bin</CODE> directory, including a program to spider web servers. The provided
spider.pl program is full-featured and is easily customized.
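<P>
To give a feel for what such a program produces, here is a minimal sketch of a feeder written in shell. It assumes that each document is sent as a <CODE>Path-Name</CODE> header, a <CODE>Content-Length</CODE> header giving the size in bytes, a blank line, and then the content. Confirm the header names in the <CODE>-S prog</CODE> section of SWISH-RUN before relying on this. The helper <CODE>emit_doc</CODE> and the sample file are hypothetical.

```shell
#!/bin/sh
# Sketch of a "prog" feeder: emit each document as headers, a blank
# line, then the content. The header names are an assumption to verify
# against the -S prog documentation.

# emit_doc FILE: write one document in the header/body layout.
emit_doc() {
    size=$(( $(wc -c < "$1") ))      # byte count; arithmetic strips padding
    printf 'Path-Name: %s\n' "$1"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'
    cat "$1"
}

# Demonstrate with a scratch file (hypothetical content).
sample=$(mktemp)
printf '<html><title>Hello</title></html>\n' > "$sample"
output=$(emit_doc "$sample")
echo "$output"

# Typical use would pipe every document into the indexer, for example:
#   find /var/www -name '*.html' | while read -r f; do emit_doc "$f"; done \
#       | swish-e -S prog -i stdin -c swish.conf
```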

<P>
The advantage of the &quot;prog&quot; document source feature over the
&quot;http&quot; method is that the program is only executed one time,
whereas the swishspider program used in the &quot;http&quot; method is
executed once for every document read from the web server. Forking
Swish-e and compiling the perl script each time can be quite expensive, time-wise.

<P>
The other advantage of the <CODE>spider.pl</CODE> program is that it's simple and efficient to add filtering (such as for PDF
or MS Word docs) right into the spider.pl's configuration, and it includes
features such as MD5 checks to prevent duplicate indexing, options to avoid
spidering some files, or index but avoid spidering. And since it's a perl
program there's no limit on the features you can add.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Why_does_swish_report_swishspider_not_found_">Why does swish report &quot;./swishspider: not found&quot;?</A></H3>
<P>
Does the file <EM>swishspider</EM> exist where the error message displays? If not, either set the
configuration option <A HREF="././SWISH-CONFIG.html#item_SpiderDir">SpiderDirectory</A>
to point to the directory where the <EM>swishspider</EM> program is found, or place the
<EM>swishspider</EM> program in the current directory when running swish-e.

<P>
If you are running Windows, make sure &quot;perl&quot; is in your path. Try
typing <EM>perl</EM> from a command prompt.

<P>
If you are not running Windows, make sure that the shebang line (the first line
of the swishspider program that starts with #!) points to the correct
location of perl. Typically this will be <EM>/usr/bin/perl</EM> or <EM>/usr/local/bin/perl</EM>. Also, make sure that you have execute and read permissions on <EM>swishspider</EM>.
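<P>
The shebang check can be automated with a few lines of shell. This is a sketch only: <CODE>check_shebang</CODE> is a hypothetical helper, and the two scratch scripts stand in for a real <EM>swishspider</EM>.

```shell
#!/bin/sh
# check_shebang SCRIPT: report whether the interpreter named on the
# first line of SCRIPT exists and is executable.
check_shebang() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        '#!'*) ;;
        *) echo "no shebang line in $1"; return 1 ;;
    esac
    interp=${first_line#??}     # drop the leading "#!"
    interp=${interp# }          # drop one optional space
    interp=${interp%% *}        # drop interpreter arguments such as -w
    if [ -x "$interp" ]; then
        echo "ok: $interp"
    else
        echo "missing: $interp"
        return 1
    fi
}

# Demonstrate on two scratch scripts (hypothetical stand-ins for swishspider).
good=$(mktemp); printf '#!/bin/sh\n' > "$good"
bad=$(mktemp);  printf '#!/no/such/perl -w\n' > "$bad"
good_result=$(check_shebang "$good") || true
bad_result=$(check_shebang "$bad") || true
echo "$good_result"
echo "$bad_result"
```

Against a real installation the call would simply be <CODE>check_shebang ./swishspider</CODE>.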

<P>
The <EM>swishspider</EM> perl script is only used with the -S http method of indexing.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="I_m_using_the_spider_pl_program_to_spider_my_web_site_but_some_large_files_are_not_indexed_">I'm using the spider.pl program to spider my web site, but some
large files are not indexed.</A></H3>
<P>
The <CODE>spider.pl</CODE> program has a default limit of 5MB file size. This can be changed with the <CODE>max_size</CODE> parameter setting. See <CODE>perldoc
spider.pl</CODE> for more information.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="I_still_don_t_think_all_my_web_pages_are_being_indexed_">I still don't think all my web pages are being indexed.</A></H3>
<P>
The <EM>spider.pl</EM> program has a number of debugging switches and can be quite verbose in
telling you what's happening, and why. See <CODE>perldoc
spider.pl</CODE> for instructions.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Swish_is_not_spidering_Javascript_links_">Swish is not spidering Javascript links!</A></H3>
<P>
Swish cannot follow links generated by Javascript, as they are generated by
the browser and are not part of the document.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_spider_other_websites_and_combine_it_with_my_own_filesystem_index_">How do I spider other websites and combine it with my own
(filesystem) index?</A></H3>
<P>
You can either merge (<CODE>-M</CODE>) two indexes into a single index, or use <CODE>-f</CODE>
to specify more than one index while searching.

<P>
You will have better results with the <CODE>-f</CODE> method.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H2><A NAME="Searching">Searching</A></H2>
<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_limit_searches_to_just_parts_of_the_index_">How do I limit searches to just parts of the index?</A></H3>
<P>
If you can identify &quot;parts&quot; of your index by the path name you
have two options.

<P>
The first option is to index the document path. Add this to your
configuration:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    MetaNames swishdocpath</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Now you can search for words or phrases in the path name:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w 'foo AND swishdocpath=(sales)'</pre>
        </td>
	    
      </tr>
    </table>
    <P>
That will only find documents that contain the word &quot;foo&quot; and whose
path contains &quot;sales&quot;. It might not work as well as
you would like, though, as both of these paths will match:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    /web/sales/products/index.html
    /web/accounting/private/sales_we_messed_up.html</pre>
        </td>
	    
      </tr>
    </table>
    <P>
This can be solved by searching with a phrase (assuming &quot;/&quot; is
not a WordCharacter):

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w 'foo AND swishdocpath=(&quot;/web/sales/&quot;)'
    swish-e -w 'foo AND swishdocpath=(&quot;web sales&quot;)'  (same thing)</pre>
        </td>
	    
      </tr>
    </table>
    <P>
The second option is a bit more powerful. With the <A HREF="#item_ExtractPath">ExtractPath</A>
directive you can use a regular expression to extract a subset of the
path and save it as a separate meta name:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    MetaNames department
    ExtractPath department regex !^/web/([^/]+).+$!$1/</pre>
        </td>
	    
      </tr>
    </table>
    <P>
This matches a path that starts with &quot;/web/&quot;, captures
everything after that up to, but not including, the next &quot;/&quot; in
variable $1, and then matches everything from that &quot;/&quot;
onward. The entire matched string is then replaced with $1, and the result is
indexed as meta name &quot;department&quot;.
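<P>
One way to preview what a pattern like this extracts is to run the same substitution through <CODE>sed</CODE>. This is only an approximation: sed's extended regular expressions may differ in detail from the regex library Swish-e uses, and what Swish-e does with paths that do not match is not shown here. The helper <CODE>extract_department</CODE> and the <EM>/internal/marketing/plan.html</EM> path are hypothetical.

```shell
#!/bin/sh
# Sketch: apply the same pattern and replacement with sed to preview
# what would be stored under the "department" meta name.
# sed writes the first capture group as \1 where the directive uses $1.
extract_department() {
    printf '%s\n' "$1" | sed -n -E 's!^/web/([^/]+).+$!\1!p'
}

sales=$(extract_department /web/sales/products/index.html)
accounting=$(extract_department /web/accounting/private/sales_we_messed_up.html)
other=$(extract_department /internal/marketing/plan.html)
echo "$sales"        # sales
echo "$accounting"   # accounting
echo "[$other]"      # [] since the path does not start with /web/
```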

<P>
Now you can search like:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    swish-e -w 'foo AND department=sales'</pre>
        </td>
	    
      </tr>
    </table>
    <P>
and be sure that you will only match the documents in the /web/sales/*
path. Note that you can map completely different areas of your file system
to the same metaname:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    # flag the marketing specific pages
    ExtractPath department regex !^/web/(marketing|sales)/.+$!marketing/
    ExtractPath department regex !^/internal/marketing/.+$!marketing/</pre>
        </td>
	    
      </tr>
    </table>
    <P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>    # flag the technical departments pages
    ExtractPath department regex !^/web/(tech|bugs)/.+$!tech/</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Finally, if you have something more complicated, use <CODE>-S prog</CODE> and write a perl program or use a filter to set a meta tag when processing
each file.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_is_ranking_calculated_">How is ranking calculated?</A></H3>
<P>
The <CODE>swishrank</CODE> property value is calculated based on which Ranking Scheme (or algorithm)
you have selected. In this discussion, any time the word <STRONG>fancy</STRONG> is used, you should consult the actual code for more details. It is open
source, after all.

<P>
Things you can do to affect ranking:

<DL>
<P><DT><STRONG><A NAME="item_MetaRankBias">MetaRankBias</A></STRONG><DD>
<P>
You may configure your index to bias certain metaname values more or less
than others. See the <A HREF="#item_MetaRankBias">MetaRankBias</A> configuration option in <A HREF="././SWISH-CONFIG.html">the SWISH-CONFIG manpage</A>.

<P><DT><STRONG><A NAME="item_IgnoreTotalWordCountWhenRanking">IgnoreTotalWordCountWhenRanking</A></STRONG><DD>
<P>
Set to 1 (default) or 0 in your config file. See <A HREF="././SWISH-CONFIG.html">the SWISH-CONFIG manpage</A>.
<STRONG>NOTE:</STRONG> You must set this to 0 to use the IDF Ranking Scheme.

<P><DT><STRONG><A NAME="item_structure">structure</A></STRONG><DD>
<P>
Each term's position in each HTML document is given a structure value based
on the context in which the word appears. The structure value is used to
artificially inflate the frequency of each term in that particular
document. These structural values are defined in <EM>config.h</EM>:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre> #define RANK_TITLE             7
 #define RANK_HEADER            5
 #define RANK_META              3
 #define RANK_COMMENTS          1
 #define RANK_EMPHASIZED        0</pre>
        </td>
	    
      </tr>
    </table>
    <P>
For example, if the word <CODE>foo</CODE> appears in the title of a document, the Scheme will treat that document as
if <CODE>foo</CODE> appeared 7 additional times.

</DL>
<P>
All Schemes share the following characteristics:

<DL>
<P><DT><STRONG><A NAME="item_AND">AND searches</A></STRONG><DD>
<P>
The rank value is averaged for all AND'd terms. Terms within a set of
parentheses () are averaged as a single term (this is an acknowledged
weakness and is on the TODO list).

<P><DT><STRONG><A NAME="item_OR">OR searches</A></STRONG><DD>
<P>
The rank value is summed and then doubled for each pair of OR'd terms. This
results in higher ranks for documents that have multiple OR'd terms.

<P><DT><STRONG><A NAME="item_scaled">scaled rank</A></STRONG><DD>
<P>
After a document's raw rank score is calculated, a final rank score is
calculated using a fancy <CODE>log()</CODE> function. All the documents are then scaled against a base score of 1000.
The top-ranked document will therefore always have a <CODE>swishrank</CODE> value of 1000.

</DL>
<P>
Here is a brief overview of how the different Schemes work. The number in
parentheses after the name is the value to invoke that scheme with <CODE>swish-e -R</CODE> or <CODE>RankScheme()</CODE>.

<DL>
<P><DT><STRONG><A NAME="item_Default">Default (0)</A></STRONG><DD>
<P>
The default ranking scheme considers the number of times a term appears in
a document (frequency), the MetaRankBias and the structure value. The rank
might be summarized as:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre> DocRank = Sum of ( structure + metabias )</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Consider this output with the DEBUG_RANK variable set at compile time: 

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre> Ranking Scheme: 0 
 Word entry 0 at position 6 has struct 7
 Word entry 1 at position 64 has struct 41
 Word entry 2 at position 71 has struct 9
 Word entry 3 at position 132 has struct 9
 Word entry 4 at position 154 has struct 9
 Word entry 5 at position 423 has struct 73
 Word entry 6 at position 541 has struct 73
 Word entry 7 at position 662 has struct 73
 File num: 1104.  Raw Rank: 21.  Frequency: 8 scaled rank: 30445
  Structure tally:
  struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8
  struct 0x9 = count of 3 ( BODY FILE ) x rank map of 1 = 3
  struct 0x29 = count of 1 ( HEADING BODY FILE ) x rank map of 6 = 6
  struct 0x49 = count of 3 ( EM BODY FILE ) x rank map of 1 = 3</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Every word instance starts with a base score of 1. Then for each instance
of your word, a running sum is taken of the structural value of that word
position plus any bias you've configured. In the example above, the raw
rank is <CODE>1 + 8 + 3 + 6 + 3 = 21</CODE>.

<P>
Consider this line:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>  struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8</pre>
        </td>
	    
      </tr>
    </table>
    <P>
That means there was one instance of our word in the title of the file.
Its context was the &lt;head&gt; tagset, inside the &lt;title&gt;. The &lt;title&gt; is the most specific structure, so it gets the RANK_TITLE score:
7. The base rank of 1 plus the structure score of 7 equals 8. If there had
been two instances of this word in the title, then the score would have
been <CODE>8 + 8 = 16</CODE>.

<P><DT><STRONG><A NAME="item_IDF">IDF (1)</A></STRONG><DD>
<P>
IDF is short for Inverse Document Frequency. That's fancy ranking lingo for
taking into account the total frequency of a term across the entire index,
in addition to the term's frequency in a single document. IDF ranking also
uses the relative density of a word in a document to judge its relevancy.
Words that appear more often in a doc make that doc's rank higher, and
longer docs are not weighted higher than shorter docs.

<P>
The IDF Scheme might be summarized as:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre>  DocRank = Sum of ( density * idf * ( structure + metabias ) )</pre>
        </td>
	    
      </tr>
    </table>
    <P>
Consider this output from DEBUG_RANK:

<P>

    <table>
      <tr>

	<td bgcolor="#eeeeee" width="1">
	  &nbsp;
        </td>

	<td>
	  <pre> Ranking Scheme: 1 
 File num: 1104  Word Score: 1  Frequency: 8  Total files: 1451   
 Total word freq: 108   IDF: 2564  
 Total words: 1145877   Indexed words in this doc: 562   
 Average words: 789   Density: 1120    Word Weight: 28716   
 Word entry 0 at position 6 has struct 7
 Word entry 1 at position 64 has struct 41
 Word entry 2 at position 71 has struct 9
 Word entry 3 at position 132 has struct 9
 Word entry 4 at position 154 has struct 9
 Word entry 5 at position 423 has struct 73
 Word entry 6 at position 541 has struct 73
 Word entry 7 at position 662 has struct 73
 Rank after IDF weighting: 574321  
 scaled rank: 132609
  Structure tally:
  struct 0x7 = count of  1 ( HEAD TITLE FILE ) x rank map of 8 = 8
  struct 0x9 = count of  3 ( BODY FILE ) x rank map of 1 = 3
  struct 0x29 = count of  1 ( HEADING BODY FILE ) x rank map of 6 = 6
  struct 0x49 = count of  3 ( EM BODY FILE ) x rank map of 1 = 3</pre>
        </td>
	    
      </tr>
    </table>
    <P>
It is similar to the default Scheme, but notice how the total number of
files in the index and the total word frequency (as opposed to the document
frequency) are both part of the equation.

</DL>
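<P>
The arithmetic in the two DEBUG_RANK listings can be rechecked with a few lines of shell. The default-scheme sum follows the text above. The IDF part is an inference: the printed numbers are consistent with integer math of the shape shown, but the formula is not taken from the Swish-e source, so consult the code for the authoritative calculation.

```shell
#!/bin/sh
# Sketch: redo the arithmetic behind the two DEBUG_RANK listings.

# Structure tally shared by both listings: "count rank_map" per line.
tally=0
while read -r count rank_map; do
    tally=$(( tally + count * rank_map ))
done <<EOF
1 8
3 1
1 6
3 1
EOF

# Default scheme: the listing's raw rank is 1 plus the tally.
raw_rank=$(( 1 + tally ))
echo "raw rank: $raw_rank"                 # raw rank: 21

# The struct values are printed in decimal per word entry and in hex
# in the tally, so 41 and 0x29 are the same value.
hex41=$(printf '0x%x' 41)
echo "struct 41 = $hex41"                  # struct 41 = 0x29

# IDF scheme: an inferred shape that reproduces the printed numbers.
idf=2564
density=1120
word_weight=$(( idf * density / 100 ))
idf_rank=$(( 1 + word_weight * tally ))
echo "word weight: $word_weight"           # word weight: 28716
echo "rank after IDF weighting: $idf_rank" # rank after IDF weighting: 574321
```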
<P>
Ranking is a complicated subject. SWISH-E allows for more Ranking Schemes
to be developed and experimented with, using the -R option (from the
swish-e command) or the RankScheme() method (see the API documentation). Experiment
and share your findings via the discussion list.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_can_I_limit_searches_to_the_title_body_or_comment_">How can I limit searches to the title, body, or comment?</A></H3>
<P>
Use the <CODE>-t</CODE> switch.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="I_can_t_limit_searches_to_title_body_comment_">I can't limit searches to title/body/comment.</A></H3>
<P>
Or, <EM>I can't search with meta names, all the names are indexed as
"plain".</EM>



<P>
Check in the config.h file whether #define INDEXTAGS is set to 1. If it is,
change it to 0, recompile, and index again. When INDEXTAGS is 1, ALL the
tags are indexed as plain text (that is, you index &quot;title&quot;,
&quot;h1&quot;, and so on) and they lose their indexing meaning. If
INDEXTAGS is set to 0, you will still index meta tags and comments, unless
you have indicated otherwise in the user config file with the IndexComments
directive.

<P>
Also, check for the <A HREF="#item_UndefinedMetaTags">UndefinedMetaTags</A> setting in your configuration file.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="I_ve_tried_running_the_included_CGI_script_and_I_get_a_Internal_Server_Error_">I've tried running the included CGI script and I get an &quot;Internal
Server Error&quot;</A></H3>
<P>
Debugging CGI scripts is beyond the scope of this document. Internal
Server Error basically means &quot;check the web server's log for an error
message&quot;, as it can mean a bad shebang (#!) line, a missing perl
module, an FTP transfer error, or simply an error in the program. The CGI
script <EM>swish.cgi</EM> in the <EM>example</EM> directory contains some debugging suggestions. Type <CODE>perldoc swish.cgi</CODE> for information.

<P>
There are also many, many CGI FAQs available on the Internet. A quick web
search should offer help. As a last resort you might ask your webadmin for
help...

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="When_I_try_to_view_the_swish_cgi_page_I_see_the_contents_of_the_Perl_program_">When I try to view the swish.cgi page I see the contents of the
Perl program.</A></H3>
<P>
Your web server is not configured to run the program as a CGI script. This
problem is described in <CODE>perldoc swish.cgi</CODE>.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="How_do_I_make_Swish_e_highlight_words_in_search_results_">How do I make Swish-e highlight words in search results?</A></H3>
<P>
Short answer:

<P>
Use the supplied swish.cgi or search.cgi scripts located in the <EM>example</EM> directory.

<P>
Long answer:

<P>
Swish-e can't because it doesn't have access to the source documents when
returning results, of course. But a front-end program of your creation can
highlight terms. Your program can open up the source documents and then use
regular expressions to replace search terms with highlighted or bolded
words.
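<P>
As an illustration of that naive approach, here is a minimal sketch using <CODE>sed</CODE>. The helper <CODE>highlight</CODE> is hypothetical and assumes a search term free of regular expression metacharacters. The second call shows how the same substitution damages markup.

```shell
#!/bin/sh
# Sketch: the naive approach, wrapping each occurrence of a term in <b>.
# Suitable only for plain text and a term free of regex metacharacters.
highlight() {
    printf '%s\n' "$2" | sed "s/$1/<b>&<\/b>/g"
}

plain=$(highlight sales 'Quarterly sales figures')
echo "$plain"     # Quarterly <b>sales</b> figures

# The same substitution rewrites the term inside the href as well,
# which breaks the link.
html=$(highlight sales '<a href="/sales/">sales</a>')
echo "$html"      # <a href="/<b>sales</b>/"><b>sales</b></a>
```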

<P>
But, that will fail with all but the most simple source documents. For HTML
documents, for example, you must parse the document into words and tags
(and comments). A word you wish to highlight may span multiple HTML tags,
or be a word in a URL and you wish to highlight the entire link text.

<P>
Perl modules such as HTML::Parser and XML::Parser make word extraction
possible. Next, you need to consider that Swish-e uses settings such as
WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar, and
IgnoreLastChar to define a &quot;word&quot;. That is, you can't assume
that a string of characters with white space on each side is a word.

<P>
Then things like TranslateCharacters and HTML entities may transform a
source word into something else, as far as Swish-e is concerned. Finally,
searches can be limited by metanames, so you may need to limit your
highlighting to only parts of the source document. Throw phrase searches
and stopwords into the equation and you can see that it's not a trivial
problem to solve.

<P>
All hope is not lost, though, as Swish-e does provide some help. With the <CODE>-H</CODE> option it returns, in the headers, the settings of the current index (or indexes)
for WordCharacters (and others) that you need in order to parse your source
documents the way Swish-e parsed them during indexing. It also returns a &quot;Parsed
Words:&quot; header that shows how it parsed the query internally. If
you use fuzzy indexing (word stemming, soundex, or metaphone) then you will
also need to stem each word in your document before comparing it with the
&quot;Parsed Words:&quot; returned by Swish-e.

<P>
The Swish-e stemming code is available either by using the Swish-e Perl
module (SWISH::API) or the C library (included with the swish-e
distribution), or by using the SWISH::Stemmer module available on CPAN.
Also on CPAN is the module Text::DoubleMetaphone. Using SWISH::API probably
provides the best stemming support.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H3><A NAME="Do_filters_effect_the_performance_during_search_">Do filters affect the performance during search?</A></H3>
<P>
No. Filters (FileFilter or via &quot;prog&quot; method) are only used for
building the search index database. During search requests there will be no
filter calls.

<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H2><A NAME="I_have_read_the_FAQ_but_I_still_have_questions_about_using_Swish_e_">I have read the FAQ but I still have questions about using Swish-e.</A></H2>
<P>
The Swish-e discussion list is the place to go: <A
HREF="http://swish-e.org/">http://swish-e.org/</A>. Please do not email
developers directly. The list is the best place to ask questions.

<P>
Before you post please read <EM>QUESTIONS AND TROUBLESHOOTING</EM> located in the <A HREF="././INSTALL.html">INSTALL</A> page. You should also search the Swish-e discussion list archive which can
be found on the swish-e web site.

<P>
In short, be sure to include the following when asking for help:

<UL>
<P><LI><STRONG><A NAME="item_The">The swish-e version (./swish-e -V)</A></STRONG>
<P><LI><STRONG><A NAME="item_What">What you are indexing (and perhaps a sample), and the number
of files</A></STRONG>
<P><LI><STRONG><A NAME="item_Your">Your Swish-e configuration file</A></STRONG>
<P><LI><STRONG><A NAME="item_Any">Any error messages that Swish-e is reporting</A></STRONG>
</UL>
<P>
[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>
<H1><A NAME="Document_Info">Document Info</A></H1>
<P>
$Id: SWISH-FAQ.pod,v 1.36 2004/10/04 22:49:35 whmoseley Exp $


[ <B><FONT SIZE=-1><A HREF="#toc">TOC</A></FONT></B> ]
<HR>



    <p>
    <div class="navbar">
      <a href="./SWISH-SEARCH.html">Prev</a> |
      <a href="./index.html">Contents</a> |
      <a href="./SWISH-BUGS.html">Next</a>
    </div>
    <p>

    <P ALIGN="CENTER">
    <IMG ALT="" WIDTH="470" HEIGHT="10" SRC="images/dotrule1.gif"></P>
    <P ALIGN="CENTER">

    <div class="footer">
        <BR>SWISH-E is distributed with <B>no warranty</B> under the terms of the
        <A HREF="http://www.fsf.org/copyleft/gpl.html">GNU General Public License</A>,<BR>
        Free Software Foundation, Inc., 
        59 Temple Place - Suite 330, Boston, MA  02111-1307, USA<BR> 
        Public questions may be posted to 
        the <A HREF="http://swish-e.org/Discussion/">SWISH-E Discussion</A>.
    </div>

</body>
</html>