          WWWOFFLE - World Wide Web Offline Explorer - Version 2.9e
          =========================================================


The WWWOFFLE programs simplify World Wide Web browsing from computers that use
intermittent connections to the internet.

Description
-----------

The WWWOFFLE server is a proxy web server with special features for use with
intermittent internet links.  This means that it is possible to browse web pages
and read them without having to remain connected.

Basic Features
    - Caching of HTTP, FTP and finger protocols.
    - Allows the 'GET', 'HEAD', 'POST' and 'PUT' HTTP methods.
    - Interactive or command line control of online/offline/autodial status.
    - Highly configurable.
    - Low maintenance, start/stop and online/offline status can be automated.

While Online
    - Caching of pages that are viewed for later review.
    - Conditional fetching to only get pages that have changed.
        - Based on expiration date, time since last fetched or once per session.
    - Non cached support for SSL (Secure Socket Layer e.g. https).
    - Caching for https connections (compile time option).
    - Can be used with one or more external proxies based on web page.
    - Control which pages cannot be accessed.
        - Allow replacement of blocked pages.
    - Control which pages are not to be stored in the cache.
    - Create backups of cached pages when server cannot be contacted.
        - Option to create backup when server sends back an error page.
    - Requests compressed pages from web servers (compile time option).
    - Requests chunked transfer encoding from web servers.

While Offline
    - Can be configured to use dial-on-demand for pages that are not cached.
    - Selection of pages to download next time online
        - Using normal browser to follow links.
        - Command line interface to select pages for downloading.
    - Control which pages can be requested when offline.
    - Provides non-cached access to intranet servers.

Automated Download
    - Downloading of specified pages non-interactively.
    - Options to automatically fetch objects in requested pages
        - Understands various types of pages
            - HTML 4.0, Java classes, VRML (partial), XML (partial).
        - Options to fetch different classes of objects
            - Images, Stylesheets, Frames, Scripts, Java or other objects.
        - Option to not fetch webbug images (images of 1 pixel square).
    - Automatically follows links for pages that have been moved.
    - Can monitor pages at regular intervals to fetch those that have changed.
    - Recursive fetching
        - To specified depth.
        - On any host or limited to same server or same directory.
        - Chosen from command line or from browser.
        - Control over which links can be fetched recursively.

Convenience
    - Optional information footer on HTML pages showing date cached and options.
    - Options to modify HTML pages
        - Remove scripts.
        - Remove Java applets.
        - Remove stylesheets.
        - Remove shockwave flash animations.
        - Indicate cached and uncached links.
        - Remove the blink tag.
        - Remove the marquee tag.
        - Remove refresh tags.
        - Remove links to pages that are in the DontGet list.
        - Remove inline frames (iframes) that are in the DontGet list.
        - Replace images that are in the DontGet list.
        - Replace webbug images (images of 1 pixel square).
        - Demoronise HTML character sets.
        - Fix mixed Cyrillic character sets.
        - Stop animated GIFs.
        - Remove Cookies in meta tags.
    - Provides information about cached pages
        - Headers, raw and modified.
        - Contents, images, links etc.
        - Source code unmodified by WWWOFFLE.
    - Automatic proxy configuration with Proxy Auto-Config file.
    - Searchable cache with the addition of the ht://Dig, mnoGoSearch
      (UdmSearch), Namazu or Hyper Estraier programs.
    - Built in simple web-server for local pages
        - HTTP and HTTPS access (compile time option).
        - Allows CGI scripts.
    - Timeouts to stop proxy lockups
        - DNS name lookups.
        - Remote server connection.
        - Data transfer.
    - Continue or stop downloads interrupted by client.
        - Based on file size or fraction downloaded.
    - Purging of pages from cache
        - Based on URL matching.
        - To keep the cache size below a specified limit.
        - To keep the free disk space above a specified limit.
        - Interactive or command line control.
        - Compression of cached pages based on age.
    - Provides compressed pages to web browser (compile time option).
    - Use chunked transfer-encoding to web browser.

Indexes
    - Multiple indexes of pages stored in cache
        - Servers for each protocol (http, ftp ...).
        - Pages on each server.
        - Pages waiting to be fetched.
        - Pages requested last time offline.
        - Pages fetched last time online.
        - Pages monitored on a regular basis.
    - Configurable indexes
        - Sorted by name, date, server domain name, type of file.
        - Options to delete, refresh or monitor pages.
        - Selection of complete list of pages or hide un-interesting pages.

Security
    - Works with pages that require basic username/password authentication.
    - Automates proxy authentication for external proxies that require it.
    - Control over access to the proxy
        - Defaults to local host access only.
        - Host access configured by hostname or IP address.
        - Optional proxy authentication for user level access control.
    - Optional password control for proxy management functions.
    - HTTPS access to all proxy management web pages (compile time option).
    - Can censor incoming and outgoing HTTP headers to maintain user privacy.

Configuration
    - All options controlled using a configuration file.
    - Interactive web page to allow editing of the configuration file.
    - User customisable error and information pages.
    - Log file or syslog reporting with user specified error level.


Configuring A Web Browser
-------------------------

To use the WWWOFFLE programs, your web browser must be set up to use wwwoffled
as a proxy.  The proxy hostname will be 'localhost' (or the name of the host
that wwwoffled is running on), and the port number will be the one that is used
by wwwoffled (default 8080).
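
The same proxy settings apply to any HTTP client, not only to browsers.  As a
minimal sketch, assuming wwwoffled is running on the local machine with the
default port, a Python script could be pointed at the proxy as follows (the
function name make_proxy_opener is made up for this illustration):

```python
import urllib.request

# Assumption: wwwoffled runs on this host and listens on the default port.
PROXY_HOST = "localhost"
PROXY_PORT = 8080


def make_proxy_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Return an opener that sends HTTP and FTP requests through the proxy."""
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "ftp": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (needs a running wwwoffled, so it is left as a comment):
#   opener = make_proxy_opener()
#   page = opener.open("http://www.gedanken.demon.co.uk/wwwoffle/").read()
```

Building the opener does not contact the network; a request is only sent when
open() is called.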

There are lots of different browsers and it is not possible to list all the ways
to configure them here.  There should be an option in one of the menus or
described in the manual for the browser that explains how to configure a proxy.

You will also need to disable the caching that the web browser performs itself
between sessions to get the best out of the program.

Depending on which browser you use and which version, it is possible to request
pages to be refreshed while offline.  This is done using the 'reload' or
'refresh' button or key on the browser.  On many browsers there are two ways of
doing this; only the one that forces the proxy to reload the page will cause the
page to be refreshed.

The latest browser compatibility information is available at:

http://www.gedanken.demon.co.uk/wwwoffle/version-2.9/browser.html


Welcome Page
------------

There is a welcome page at URL 'http://localhost:8080/' that gives a very brief
description of the program and has links to the index pages, interactive control
page and the WWWOFFLE internet home pages.

The most important places to get information about WWWOFFLE are the WWWOFFLE
homepage which has information about WWWOFFLE in general:

http://www.gedanken.demon.co.uk/wwwoffle/

Or even better the WWWOFFLE version-2.9 user page which has more information
about this version of WWWOFFLE:

http://www.gedanken.demon.co.uk/wwwoffle/version-2.9/user.html


Index Of Cached Files
---------------------

To get the index of cached files, use the URL 'http://localhost:8080/index/'.
There are sufficient links on each of the index pages to allow easy navigation.

The indexes provide several levels of information:
   A list of the requests in the outgoing directory.
   A list of the files fetched the last time that the program was online.
      And for the previous 5 times before that.
   A list of the files requested the last time that the program was offline.
      And for the previous 5 times before that.
   A list of the files that are being monitored.
   A list of all hosts for each of the protocols (http,ftp etc.).
      A list of all of the files on a particular host.

These indexes can be sorted in a number of ways:
   No sorting (directory order on disk).
   By time of last modification (update).
   By time of last access.
   By date of last update with markers for each day.
   Alphabetically.
   By file extension.
   Random.

For each of the pages that are cached there are options to delete the page,
refresh it, select the interactive refresh page with the URL already filled in
or add the page to the list that is monitored regularly.

It is also possible to specify in the configuration file what URLs are not to be
listed in the indexes.


Interactive Refresh Page
------------------------

Pages can be specified by using whatever method is provided by the browser that
is used or as an alternative there is an interactive refresh page.  This allows
the user to enter a URL and then fetch it if it is not currently cached or
refresh it if it is in the cache.  There is also the option here to recursively
fetch the pages that are linked to by the page that is specified.  This
recursive fetching can be limited to pages from the same host, narrowed down to
links in the same directory (or subdirectory) or widened to fetch pages from any
web server.  This functionality is also provided in the 'wwwoffle' command line
program.


Monitoring Web-Pages
--------------------

Pages can be specified that are to be checked at regular intervals.  This can
either be every time that WWWOFFLE is online or at user specifiable times.  The
page will be monitored when the four specified conditions are all met:
A month of the year that it can be fetched in (can be set to all months).
A day of the month that the page can be fetched on (can be set to all days).
A day of the week that the page can be fetched on (can be set to all days).
An hour of the day that the page should be fetched after (can be more than one).

For example to get a URL every Saturday morning, use the following:

Month of year: all
Day of Month : all
Day of week  : Saturday
Hour of day  : 0 (24hr clock)
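
The way the four conditions combine can be sketched in Python.  This is only an
illustration of the rule described above, not the code that WWWOFFLE uses; the
name monitor_due and the use of None to mean "all" are assumptions, and the
sketch does not remember whether the page has already been fetched since the
given hour.

```python
from datetime import datetime

ALL = None  # hypothetical marker meaning "every value is allowed"
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def monitor_due(now, months=ALL, days_of_month=ALL, days_of_week=ALL,
                hours=(0,)):
    """Return True when all four monitoring conditions hold at time 'now'."""
    month_ok = months is ALL or now.month in months
    day_ok = days_of_month is ALL or now.day in days_of_month
    # datetime.weekday() counts from Monday = 0.
    weekday_ok = days_of_week is ALL or WEEKDAYS[now.weekday()] in days_of_week
    # The page is to be fetched *after* one of the listed hours.
    hour_ok = any(now.hour >= hour for hour in hours)
    return month_ok and day_ok and weekday_ok and hour_ok


# The "every Saturday morning" example from above:
saturday = datetime(2024, 1, 6, 9, 30)   # 6 January 2024 was a Saturday
sunday = datetime(2024, 1, 7, 9, 30)
print(monitor_due(saturday, days_of_week={"Saturday"}, hours=(0,)))
print(monitor_due(sunday, days_of_week={"Saturday"}, hours=(0,)))
```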


Interactive Control Page
------------------------

The behaviour and mode of operation of the WWWOFFLE daemon can be controlled from
an interactive control page at 'http://localhost:8080/control/'.  This has a
number of buttons that change the mode of the proxy server.  These provide the
same functionality as the 'wwwoffle' command line program.  To provide security,
this page can be password protected.  There is also the facility to delete pages
from the cache or from the spooled outgoing requests directory.


Interactive Configuration File Editing Page
-------------------------------------------

The interactive configuration file editing page allows the configuration file
wwwoffle.conf to be edited.  This facility can be reached via the configuration
editing page 'http://localhost:8080/configuration/'.  Each item in the
configuration file has a separate web-page with a form in it that lists the
current entries in the configuration file and allows each entry to be edited
individually.  When an entry has been updated, the configuration file needs to
be re-read.


Searching the Cache
-------------------

The four web indexing programs ht://Dig, mnoGoSearch (UdmSearch), Namazu and
Hyper Estraier can be used to create an index of the pages in the WWWOFFLE cache
for later searching.

For ht://Dig, version 3.1.0b4 or later is required; it can be found at
http://www.htdig.org/.

For mnoGoSearch (previously called UdmSearch), version 3.1.0 or later is
required; it can be found at http://mnogosearch.org/.

For Namazu, version 2.0.0 or later is required; it can be found at
http://www.namazu.org/.  Also required is mknmz-wwwoffle, which can be found at
http://www.naney.org/comp/distrib/mknmz-wwwoffle/.

For Hyper Estraier, version 0.5.3 or later is required; it can be found at
http://hyperestraier.sourceforge.net/.

The search forms for these programs are 'http://localhost:8080/search/htdig/',
'http://localhost:8080/search/mnogosearch/',
'http://localhost:8080/search/namazu/', and
'http://localhost:8080/search/hyperestraier/'.  These allow the search part of
the programs to be run to find the cached web-pages that you want.

For more information about configuring these programs to work with WWWOFFLE you
should read the file README.htdig, README.mnogosearch, README.namazu, or
README.hyperestraier.


Built-In Web-Server
-------------------

Any URLs to WWWOFFLE on port 8080 that refer to files in the directory '/' refer
to the files that are stored in the 'html' subdirectory.  This directory also
contains the message templates that WWWOFFLE uses to generate the internal web
pages.  When a file is requested from either of these locations it is first
looked for in the language specific sub-directory specified in the browser's
request header.  If it is not found in that location then it is looked for in
the directory named 'default' which by default is a symbolic link to the English
language pages, but can be changed.  If it is not found in this location then it
is looked for in the English language directory (since that will have a full set
of pages).
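
The search order described above can be sketched as a short Python function.
It is an illustration only: the function name, the '/spool/html' path and the
assumption that the English directory is called 'en' are not taken from this
document, and the quality values in the browser's Accept-Language header are
ignored for simplicity.

```python
from pathlib import PurePosixPath


def candidate_paths(html_dir, filename, accept_language):
    """List the places a file is looked for, in the order they are tried."""
    directories = []
    # 1. Languages named in the browser's request header, in the order given.
    for part in accept_language.split(","):
        tag = part.split(";")[0].strip().lower()
        if tag and tag != "*":
            directories.append(tag)
    # 2. The 'default' directory, then 3. the English directory (assumed 'en').
    directories += ["default", "en"]

    ordered = []
    for name in directories:
        if name not in ordered:       # do not try the same directory twice
            ordered.append(name)
    return [str(PurePosixPath(html_dir) / name / filename) for name in ordered]


for path in candidate_paths("/spool/html", "Welcome.html", "de, fr;q=0.8"):
    print(path)
```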

Any URLs that refer to files in the directory '/local/' are taken from the files
in the 'local' sub-directory of the spool directory if they exist.  If they do
not exist then they are searched for in the language subdirectories of the
'html' directory as described above.  This allows for trivial web-pages to be
provided without a separate web-server.  CGI scripts are available but disabled
by the default configuration file.  The MIME types used for these files are
those that are specified in the configuration file.

Important: The local web-page server will follow symbolic links, but will only
           allow access to files that are world readable.  See the FAQ for
           security issues.


Deleting Requests
-----------------

If no password is used for the control pages then it is possible for anybody to
delete requests that are recorded.  If a password is assigned then users that
know this password can delete any request (or cached file or other thing).
Individual users that do not know the password can delete pages that they have
requested, provided that they do it as soon as the "Will Get" page appears: the
"Delete" button on that page carries a once-only password that will delete that
request.


Backup Copies of Pages
----------------------

When a page is fetched while online a remote server error will overwrite any
existing web page.  In this case a backup copy of the page is made so that when
the error message has been read while offline the backup copy is placed back
into the cache.  This is automatic for all cases of files that have remote
server errors (and that do not use external proxies), no user intervention is
required.
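
The backup behaviour can be modelled with a small in-memory sketch in Python.
It only mirrors the sequence of events described above; the class and method
names are invented for the illustration and the real cache works on files in
the spool directory.

```python
class BackupCacheModel:
    """Toy model of the cache: one current copy and one backup per URL."""

    def __init__(self):
        self.pages = {}
        self.backups = {}

    def store_online(self, url, body, is_server_error=False):
        """Store a fetched page; keep a backup if an error would replace it."""
        if is_server_error:
            if url in self.pages and url not in self.backups:
                self.backups[url] = self.pages[url]
        else:
            # A good page replaces everything, so any old backup is dropped.
            self.backups.pop(url, None)
        self.pages[url] = body

    def read_offline(self, url):
        """Return the cached copy; after an error is read, restore the backup."""
        body = self.pages[url]
        if url in self.backups:
            self.pages[url] = self.backups.pop(url)
        return body


cache = BackupCacheModel()
cache.store_online("http://www.example.com/", "good page")
cache.store_online("http://www.example.com/", "error page", is_server_error=True)
print(cache.read_offline("http://www.example.com/"))   # the error is shown once
print(cache.read_offline("http://www.example.com/"))   # then the backup returns
```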


Lock Files
----------

When one WWWOFFLE process is downloading a file any other WWWOFFLE process that
tries to read that file will not be able to until the first one has finished.
This removes the problem of an incomplete page being displayed in the second
browser, or a second copy of the page being fetched.  If the lock file is not
removed by the first process within a timeout period then the second process
will produce an error message indicating the problem.
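
The waiting behaviour of the second process can be sketched in Python as a
polling loop with a timeout.  This is a simplified illustration, not the
WWWOFFLE implementation; the function name, the polling interval and the
default timeout are assumptions.

```python
import os
import time


def wait_for_unlock(lock_path, timeout=5.0, poll_interval=0.1):
    """Wait until lock_path disappears.

    Returns True if the lock file was removed (or never existed) before the
    timeout, and False if it is still present when the timeout expires; the
    caller would then report the error to the user.
    """
    deadline = time.monotonic() + timeout
    while os.path.exists(lock_path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A caller would read the cached file when the function returns True and produce
the error message when it returns False.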

This is now a configurable option; by default lock files are not used.


HTTPS Access to Internal Pages
------------------------------

All of the web pages that are available through normal HTTP access on port 8080
(e.g. http://localhost:8080/*) are also available with secure HTTPS access on
port 8443 if WWWOFFLE is compiled with the libgnutls encryption library.  This
applies to all pages: the indexes, the built-in web server and the control and
configuration pages.


Caching of HTTPS Web Pages
--------------------------

It is possible to configure WWWOFFLE so that it will intercept and cache
selected HTTPS connections.  This is disabled by default and there are three
steps to enable it.  WWWOFFLE must be compiled with encryption support, the
enable-caching option in the SSLOptions section of the configuration file must
be set true and the list of hosts to cache for must be set.

When WWWOFFLE is configured to cache an HTTPS web page it will request the page,
decrypt it, re-encrypt it and pass it to the browser.  The copy of the page that
is stored in the cache will be stored without encryption.  With this option all
other WWWOFFLE features like the DontGet section, the ModifyHTML section, the
OnlineOptions and others will be used.  Normally most of these options cannot be
applied to HTTPS pages because the exact URL is not known to WWWOFFLE and the
unencrypted contents are not visible.


HTTPS Server Certificates
-------------------------

To handle the encryption functions described above WWWOFFLE will create and
manage a set of server certificates.  One master certificate is used to sign all
of the other certificates that WWWOFFLE creates.  The created certificates are
either for the WWWOFFLE server HTTPS access pages or for a fake certificate that
is created for each server that is cached.  The certificates that are captured
by WWWOFFLE and stored are the certificates that are sent back by the real HTTPS
server.  The final set of certificates are the trusted certificates that
WWWOFFLE can use to confirm that the remote server is the one it claims to be.

The full set of certificates that WWWOFFLE stores can be seen through the
WWWOFFLE URL http://localhost:8080/certificates/ but is only available if
WWWOFFLE was compiled with encryption support.

To add trusted certificates to WWWOFFLE place the certificate file (in PEM
format) into the directory '/var/lib/wwwoffle/certificates/trusted' and
restart WWWOFFLE.


Spool Directory Layout
----------------------

In the spool directory there is a directory for each of the network protocols
that are handled.  In this directory there is a directory for each hostname that
has been contacted and has pages cached.  These directories have the name of the
host.  In each of these directories, there is an entry for each of the pages
that are cached, generated using a hashing function to give a constant length.
The entry consists of two files, one prefixed with 'D' that contains the data
and one prefixed with 'U' that contains the URL.
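
As an illustration of this layout, the Python sketch below builds the two file
names for a URL.  The hashing function is not specified in this document, so an
MD5 hex digest is used purely as a stand-in that gives a constant length; the
function name and the '/spool' path are also assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def spool_entry(spool_dir, url):
    """Return the data ('D') and URL ('U') file paths for a cached page."""
    parts = urlsplit(url)
    # Stand-in hash: NOT necessarily the function that WWWOFFLE uses.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # <spool>/<protocol>/<hostname>/
    host_dir = f"{spool_dir}/{parts.scheme}/{parts.hostname}"
    return {
        "data": f"{host_dir}/D{digest}",
        "url": f"{host_dir}/U{digest}",
    }


entry = spool_entry("/spool", "http://www.example.com/index.html")
print(entry["data"])
print(entry["url"])
```

Both names share the same hash part, so the 'U' file can always be found from
the 'D' file and the other way round.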

The outgoing directory is a single directory that holds all of the pending
requests.  The format is the same, with two files for each request, but using
an 'O' prefix instead of 'D' for the file containing the request, alongside the
file prefixed with 'U' that contains the URL.

The lasttime directory is a single directory that contains an entry for each of
the files that were fetched the last time that the program was online (the
prevtime* directories hold the same for the online sessions before that).  Each
entry consists of two files, one prefixed with 'D' that is a hard-link to the
real file and one prefixed with 'U' that contains the URL.

The lastout directory is a single directory that contains an entry for each of
the files that were requested the last time that the program was offline (the
prevout* directories hold the same for the offline sessions before that).  Each
entry consists of two files, one prefixed with 'D' that is a hard-link to the
real file and one prefixed with 'U' that contains the URL.

The monitor directory is a single directory that holds all of the regularly
monitored requests.  The format is the same as the outgoing directory, with two
files for each request using the 'O' and 'U' prefixes.  There is also a file
with an 'M' prefix that contains the information about when to monitor the URL.


The Programs and Configuration File
-----------------------------------

There are two programs that make up this utility, with three distinct functions.

wwwoffle  - A program to interact with and control the HTTP proxy daemon.

wwwoffled - A daemon process that acts as an HTTP proxy.
wwwoffles - A server that actually does the fetching of the web pages.

The wwwoffles function is combined with the wwwoffled function into the
wwwoffled program from version 1.1 onwards.  This is to simplify the procedure
of starting servers, and allow for future improvements.

The configuration file, called wwwoffle.conf by default contains all of the
parameters that are used to control the way the wwwoffled and wwwoffles
functions work.  The default installation location for this file is in the
directory /etc/wwwoffle.


WWWOFFLE - User control program
-------------------------------

The control program (wwwoffle) is used to control the action of the daemon
program (wwwoffled), or to request pages that are not in the cache.

The daemon program needs to know if the system is online or offline, when to
fetch the pages that have been previously requested and when to purge the cache
of old pages.


The first mode of operation is for controlling the daemon process.  These are the
functions that are also available on the interactive control page (except kill).

wwwoffle -online        Indicates to the daemon that the system is online.

wwwoffle -autodial      Indicates to the daemon that the system is in autodial
                        mode, this will use cached pages if they exist and use
                        the network as last resort, for dial-on-demand systems.

wwwoffle -offline       Indicates to the daemon that the system is offline.

wwwoffle -fetch         Commands the daemon to fetch the pages that were
                        requested by clients while the system was offline.
                        wwwoffle exits when the fetching is complete.
                        (This requires the daemon to be told it is online).

wwwoffle -config        Causes the configuration file for the daemon process to
                        be re-read.  The config file can also be re-read by
                        sending a HUP signal to the wwwoffled process.

wwwoffle -purge         Commands the daemon to purge from the cache the pages
                        that are older than the number of days specified in the
                        configuration file, using modification or access
                        time. Or if a maximum size is specified then delete the
                        oldest pages until the maximum size is not exceeded.

wwwoffle -status        Request from the wwwoffled proxy server the current
                        status of the program.  The online or offline mode, the
                        fetch and purge statuses, the number of current
                        processes and their PIDs are displayed.

wwwoffle -kill          Causes the daemon to exit cleanly at a convenient point.
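
These control options are often run from a script, for example when a dial-up
link comes up.  The Python sketch below builds the command lines for such a
session (go online, fetch the pending requests, go offline again).  The helper
names are invented for this illustration, and placing '-c <config-file>' after
the action is an assumption about argument order.

```python
import subprocess

# The control actions listed above.
CONTROL_ACTIONS = ("online", "autodial", "offline", "fetch",
                   "config", "purge", "status", "kill")


def control_command(action, config_file=None):
    """Build the argument list for one 'wwwoffle' control command."""
    if action not in CONTROL_ACTIONS:
        raise ValueError(f"unknown control action: {action}")
    argv = ["wwwoffle", f"-{action}"]
    if config_file is not None:
        argv += ["-c", config_file]
    return argv


def online_session_commands(config_file=None):
    """Commands for a simple session: online, fetch, then offline."""
    return [control_command(action, config_file)
            for action in ("online", "fetch", "offline")]


def run_online_session(config_file=None):
    """Run the session; requires wwwoffle to be installed and wwwoffled running."""
    for argv in online_session_commands(config_file):
        subprocess.run(argv, check=True)


for argv in online_session_commands():
    print(" ".join(argv))
```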


The second mode of operation is to specify URLs to get.

wwwoffle <URL> .. <URL> Specifies to the daemon URLs that must be fetched.
                        If online then it is got immediately, else the request
                        is stored for a later fetch.

wwwoffle <filename> ... The specified HTML file is read and all of the links
                        in it are used as if they had been specified on the
                        command line.

wwwoffle -post <URL>    Send a request using the POST method, the data is read
                        from stdin and should be provided correctly url-encoded.

wwwoffle -put <URL>     Send a request using the PUT method, the data is read
                        from stdin and should be provided correctly url-encoded.


wwwoffle -F             Force the wwwoffle server to refresh the URL.
                        (Or fetch it if not cached.)

wwwoffle -g[Sisfo]      Specifies that the URLs when fetched are to be parsed
                        for Stylesheets (S), images (i), scripts (s), frames (f)
                        or objects (o) and these are also to be fetched.  Using
                        -g without any following letters will get none of them.

wwwoffle -r[<depth>]    Specifies that the URL when fetched is to have the links
                        followed and these pages also fetched (to a depth
                        specified by the optional depth parameter, default 1).
                        Only links on the same server are to be fetched.

wwwoffle -R[<depth>]    This is the same as the '-r' option except that all of
                        the links are to be followed, even those to other
                        servers.

wwwoffle -d[<depth>]    This is the same as the '-r' option except that links
                        are only followed if they are in the same directory or a
                        sub-directory.

                        (If the -F, -(d|r|R) or -g[Sisfo] options are set they
                        override the options in the FetchOptions section of the
                        config file and only the -g[Sisfo] options are fetched.)
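
The difference between the '-r', '-R' and '-d' options can be sketched as a
Python function that decides whether a link found in a page would be followed.
This is an illustration of the rules described above rather than the actual
implementation; the function name is invented and "same server" is taken to
mean the same protocol, host and port.

```python
import posixpath
from urllib.parse import urlsplit


def link_in_scope(option, page_url, link_url):
    """Return True if link_url would be followed from page_url.

    option is 'r' (same server), 'R' (any server) or 'd' (same directory or
    a sub-directory on the same server).
    """
    if option == "R":
        return True
    page, link = urlsplit(page_url), urlsplit(link_url)
    same_server = (page.scheme, page.netloc) == (link.scheme, link.netloc)
    if option == "r":
        return same_server
    if option == "d":
        base_dir = posixpath.dirname(page.path)
        if not base_dir.endswith("/"):
            base_dir += "/"
        return same_server and link.path.startswith(base_dir)
    raise ValueError(f"unknown option: {option}")


page = "http://www.example.com/docs/guide/index.html"
print(link_in_scope("d", page, "http://www.example.com/docs/guide/ch1/a.html"))
print(link_in_scope("d", page, "http://www.example.com/docs/other.html"))
print(link_in_scope("r", page, "http://www.example.com/docs/other.html"))
print(link_in_scope("r", page, "http://other.example.org/"))
```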


The third mode of operation is to get a URL from the cache.

wwwoffle <URL>          Specifies the URL to get.

wwwoffle -o             Get the URL and output it on the standard output.
                        (Or request it if not already cached.)

wwwoffle -O             Get the URL and output it on the standard output
                        including the HTTP headers.
                        (Or request it if not already cached.)


The last mode of operation is to provide help in using the other modes.

wwwoffle -h             Gives help about the command line options.


With any of the first three modes of operation the WWWOFFLE server can be
specified in one of three different ways.

wwwoffle -c <config-file>
                        Can be used to specify the configuration file that
                        contains the port numbers, server hostname (the first
                        entry in the LocalHost section) and the password (if
                        required for the first mode of operation).  If there is
                        a password then this is the only way to specify it.

wwwoffle -p <host>[:<port>]
                        Can be used to specify the hostname and port number that
                        the daemon program listens to for control messages (first
                        mode) or proxy connections (second and third modes).

WWWOFFLE_PROXY          An environment variable that can be used to specify
                        either the argument to the -c option (must be the full
                        pathname) or the argument to the -p option.  (In this
                        case two ports can be specified, the first for the proxy
                        connection, the second for the control connection
                        e.g. 'localhost:8080:8081' or 'localhost:8080'.)
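
The two forms that the WWWOFFLE_PROXY value can take are sketched in the Python
function below.  It is an illustration only: the function name and the returned
dictionary keys are invented, a value starting with '/' is assumed to be the
configuration file pathname, and IPv6 addresses are not handled.

```python
def parse_wwwoffle_proxy(value):
    """Split a WWWOFFLE_PROXY value into its parts."""
    if value.startswith("/"):
        # Full pathname: treated like the argument to the -c option.
        return {"config_file": value}

    # Otherwise host[:proxy-port[:control-port]], like the -p option.
    parts = value.split(":")
    if len(parts) > 3 or not parts[0]:
        raise ValueError(f"cannot parse WWWOFFLE_PROXY value: {value!r}")
    return {
        "host": parts[0],
        "proxy_port": int(parts[1]) if len(parts) > 1 else None,
        "control_port": int(parts[2]) if len(parts) > 2 else None,
    }


print(parse_wwwoffle_proxy("localhost:8080:8081"))
print(parse_wwwoffle_proxy("localhost:8080"))
print(parse_wwwoffle_proxy("/etc/wwwoffle/wwwoffle.conf"))
```

A port that is not given is returned as None so that the caller can fall back
to its own default.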


WWWOFFLED - Daemon program
--------------------------

The daemon program (wwwoffled) runs as an HTTP proxy and also accepts
connections from the control program (wwwoffle).

The daemon program needs to maintain the current state of the system, online or
offline, as well as the other parameters in the configuration file.

As HTTP proxy requests come in, the program forks a copy of itself (the
wwwoffles function) to handle the requests.  The server program can also be
forked in response to the wwwoffle program requesting pages to be fetched.


wwwoffled -c <config-file>      Starts the daemon with the named configuration
                                file.

wwwoffled -d [level]            Starts the daemon in debugging mode, i.e. it
                                does not detach from the terminal and uses
                                standard error for the log messages.  The
                                optional numeric level (0 for none to 5 for all
                                or 6 for more) specifies the level of error
                                messages for standard error; if not specified
                                then log-level from the config file is used.

wwwoffled -f                    Start the daemon in debugging mode (implies -d)
                                and when the first HTTP request comes in handle
                                it without creating a child process and then
                                exit.

wwwoffled -p                    Print the PID of the daemon on standard out
                                before detaching from the terminal.

wwwoffled -h                    Gives help about the command line options.


There are a number of error and informational messages that are generated by the
program as it runs.  By default (in the config file) these go to syslog.  When
the -d flag is used the daemon does not detach from the terminal and the
messages are also written to standard error.

By using the run-uid and run-gid options in the config file, it is possible to
change the user id and group id that the program runs as.  This will require
that the program is started by root and that the specified user has read/write
access to the spool directory.
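
Before relying on run-uid and run-gid it may be worth confirming the access
requirement described above.  The following is a minimal sketch intended to be
run as the user that the daemon will switch to; check_spool_access is a
hypothetical helper and /var/spool/wwwoffle is only an assumed spool location.

```shell
#!/bin/sh
# check_spool_access DIR
# Hypothetical helper: succeed only if the current user can read, write and
# search the given spool directory.
check_spool_access() {
    dir=$1
    if [ -d "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "ok: $dir"
    else
        echo "missing read/write access: $dir" >&2
        return 1
    fi
}

# Example (assumed path, substitute the spool directory from your config file):
# check_spool_access /var/spool/wwwoffle
```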


WWWOFFLES - Server program
--------------------------

The server (wwwoffles) starts by being forked from the daemon (wwwoffled) in one
of four different modes.

Real  - When the system is online and acting as a proxy for a client.
        All requests for web pages are handled by forking a new server which
        will connect to the remote host and fetch the page.  This page is then
        stored in the cache as well as being returned to the client.  If the
        page is already in the cache then the remote server is asked whether a
        newer version exists; if not, the cached copy is used.

SpoolOrReal - When the system is in autodial mode and it has not yet been
        decided whether to use Spool or Real mode.  Spool mode is selected if
        the page is already cached, otherwise Real mode is used as a last
        resort.

Fetch - When the system is online and fetching pages that have been requested.
        All web page requests in the outgoing directory are fetched by the
        server connecting to the remote host to get the page.  This page is then
        stored in the cache; there is no client active.  If the page has been
        moved then the link is followed and that one fetched.

Spool - When the system is offline and acting as a proxy for a client.
        All requests for web pages are handled by forking a server that will
        either return a cached page or store the request.  If the page is
        cached, it is returned to the client, else a dummy page is returned
        (and stored in the cache), and the outgoing request is stored.
        If the cached page refers to a page that failed to be downloaded then it
        will be deleted from the cache.

Depending on the existence of files in the spool and other conditions, the mode
can be changed to one of several other modes.

RealNoCache - For requests for pages on the server machine or those specified
        not to be cached in the configuration file.

RealRefresh - Used by the refresh button on the index or the wwwoffle program
        to re-fetch a page while the system is online.

RealNoPassword - Used when a password was provided and two copies of the page
        are required, one with and one without the password.

FetchNoPassword - Used when a password was provided and two copies of the page
        are required, one with and one without the password.

SpoolGet - Used when the page does not exist in the cache so a request needs to
        be stored for it in the outgoing directory.

SpoolRefresh - Used when the refresh button on the index or the wwwoffle program
        is used; the existing spooled page (if there is one) is not
        overwritten, but a request is stored.

SpoolPragma - Used when the client requests the cache to refresh the page
        using the 'Pragma: no-cache' header; the existing spooled page (if there
        is one) is not overwritten, but a request is stored.

InternalPage - Used when the program is generating a web-page internally or is
        spooling a web-page with modifications.
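
The way the starting mode follows from the system state can be pictured with a
short shell sketch.  This is not the WWWOFFLE source (which is written in C)
and it only covers client proxy requests as described above; pick_mode and its
two arguments are hypothetical names chosen for this example, and the other
modes (Fetch, the refresh and password variants, InternalPage) are left out.

```shell
#!/bin/sh
# pick_mode STATUS CACHED
# Hypothetical helper: STATUS is online, offline or autodial; CACHED is yes
# or no depending on whether the page is already in the cache.
pick_mode() {
    status=$1
    cached=$2
    case $status in
        online)
            echo Real ;;
        offline)
            # Spool serves the cached page; SpoolGet stores a request instead.
            if [ "$cached" = yes ]; then echo Spool; else echo SpoolGet; fi ;;
        autodial)
            # SpoolOrReal: use the cache if possible, else go online.
            if [ "$cached" = yes ]; then echo Spool; else echo Real; fi ;;
        *)
            echo "unknown status: $status" >&2
            return 1 ;;
    esac
}

pick_mode offline no
pick_mode autodial yes
```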


WWWOFFLE-TOOLS - Cache maintenance program
------------------------------------------

This is a quick hack program that I wrote to allow you to list the contents of
the cache or move files around in it.  The programs are all named after common
UNIX commands with a 'wwwoffle-' prefix.

All of the programs should be invoked from the spool directory.

wwwoffle-rm     - Delete the URL that is specified on the command line.
                  To delete all URLs from a host it is easier to use
                  'rm -r http/foo' than use this.

wwwoffle-mv     - To rename URLs under one path in the spool to another path.
                  Because the URL is encoded in the filename just renaming the
                  files or the directory will not work.  Instead of 'mv http/foo
                  http/bar' use 'wwwoffle-mv http/foo http/bar'.  Also works for
                  complex cases: 'wwwoffle-mv http://foo/bar http://bar/foo'.

wwwoffle-ls     - To list a cached URL or the files in a cache directory in the
                  style of 'ls -l'.  As examples use 'wwwoffle-ls http/foo' to
                  list all of the URLs in the cache directory 'http/foo', use
                  'wwwoffle-ls http://foo/' to list the single URL 'http://foo/'
                  or use 'wwwoffle-ls outgoing' to list the outgoing requests.

wwwoffle-read   - Read data directly from the cache for the URL named on the
                  command line and output it on stdout.

wwwoffle-write  - Write data directly to the cache for the URL named on the
                  command line from stdin.  Note that this requires an HTTP
                  header to be included first or clients may get confused.
                  (echo "HTTP/1.0 200 OK" ; echo "" ; cat bar.html ) | \
                  wwwoffle-write http://www.foo/bar.html

wwwoffle-hash   - Print WWWOFFLE's encoding of the submitted URL. This is
                  useful for scripts working on the WWWOFFLE cache.

wwwoffle-fsck   - Checks the WWWOFFLE cache for consistency; it will rename any
                  files where the filename does not match the hash of the URL.

wwwoffle-gzip   - Compress the contents of the cache so that they take less
                  space but WWWOFFLE can still read them.

wwwoffle-gunzip - Uncompress the contents of the cache.
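
The wwwoffle-write example above can be wrapped in a small helper so that the
header is not forgotten.  This is only a sketch: with_http_header is a
hypothetical function, the Content-Type line is an addition that the original
example does not have, and the lines end with CR LF as in the HTTP
specification whereas the original example uses plain newlines.

```shell
#!/bin/sh
# with_http_header FILE [CONTENT-TYPE]
# Hypothetical helper: write a minimal HTTP/1.0 header followed by FILE.
with_http_header() {
    file=$1
    type=${2:-text/html}
    printf 'HTTP/1.0 200 OK\r\n'
    printf 'Content-Type: %s\r\n' "$type"
    printf '\r\n'               # blank line ends the header
    cat "$file"
}

# Usage, from the spool directory (needs the wwwoffle-tools programs):
# with_http_header bar.html | wwwoffle-write http://www.foo/bar.html
```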

All of the programs are the same executable and the name of the file determines
the function.  The wwwoffle-tools executable can also be used with a command
line parameter, for example 'wwwoffle-ls' is the same as 'wwwoffle-tools -ls'.
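
The idea of one executable choosing its function from its own name can be
sketched in shell.  tool_option is a hypothetical function that maps a program
name to the equivalent wwwoffle-tools option; it assumes that every tool
follows the same pattern as the 'wwwoffle-ls' and 'wwwoffle-tools -ls' pair
mentioned above, and the path used is just an example.

```shell
#!/bin/sh
# tool_option PROGRAM-NAME
# Hypothetical helper: print the wwwoffle-tools option implied by the name.
tool_option() {
    name=${1##*/}                   # strip any leading directory
    case $name in
        wwwoffle-tools)
            return 1 ;;             # generic name, the option must be given
        wwwoffle-?*)
            echo "-${name#wwwoffle-}" ;;
        *)
            return 1 ;;
    esac
}

tool_option /usr/bin/wwwoffle-ls
```

A script installed under several names would call 'tool_option "$0"' to find
out which function was requested.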

This program also accepts the '-c <config-file>' option and uses the
'WWWOFFLE_PROXY' environment variable so that the wwwoffle-write program uses
the correct permissions and uid/gid.

These are basically hacks and should not be considered as fully featured and
fully debugged programs.


audit-usage.pl - Perl script to check log files
-----------------------------------------------

The audit-usage.pl script (in the contrib directory) can be used to get audit
information from the output of the wwwoffled program.

If wwwoffled is started as

wwwoffled -c /etc/wwwoffle/wwwoffle.conf -d 4

then information about the program is written to standard error as it runs.  The
debug level needs to be 4 so that the URL information is output.

If this is captured into a log file then it can be analysed by the
audit-usage.pl program.  When run, this reports the host that each connection is
made from and the URL that is requested.  It also includes the timestamp
information and connections to the WWWOFFLE control port.
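
One way of capturing the output is sketched below.  capture_stderr is a
hypothetical helper and the log file name is arbitrary; how audit-usage.pl
expects to receive the log is an assumption here, so check the comments at the
top of the script before using it.

```shell
#!/bin/sh
# capture_stderr LOGFILE COMMAND [ARGS...]
# Hypothetical helper: run COMMAND with standard error appended to LOGFILE.
capture_stderr() {
    log=$1
    shift
    "$@" 2>>"$log"
}

# Usage (not run here, needs wwwoffled installed).  With -d the daemon stays
# attached to the terminal, so this runs in the foreground:
# capture_stderr wwwoffled.log wwwoffled -c /etc/wwwoffle/wwwoffle.conf -d 4
#
# Assumption: the script reads the log on standard input.
# perl contrib/audit-usage.pl < wwwoffled.log
```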


Test Programs
-------------

In the testprogs directory are two test programs that can be compiled if
required.  They are not needed for WWWOFFLE to work, but if you are customising
the information pages for WWWOFFLE to use or trying to debug the HTML parser
then they will be of use.

These are even more of a hack than the wwwoffle-tools programs; use at your own
risk.


Author and Copyright
--------------------

The two programs wwwoffle and wwwoffled were written by Andrew M. Bishop in
1996,97,98,99,2000,01,02,03,04,05 and are copyright Andrew M. Bishop 1996,97,98,
99,2000,01,02,03,04,05.

The programs known as wwwoffle-tools were written by Andrew M. Bishop in 1997,
98,99,2000,01,02,03,04,05 and are copyright Andrew M. Bishop 1997,98,99,2000,01,
02,03,04,05.

The Perl scripts update-config.pl and audit-usage.pl were written by Andrew
M. Bishop in 1998,99,2000,01,02,03,04,05 and are copyright Andrew M. Bishop
1998,99,2000,01,02,03,04,05.

They can be freely distributed according to the terms of the GNU General Public
License (see the file `COPYING').

If you wish to submit bug reports or other comments about the programs then
email the author amb@gedanken.demon.co.uk and put WWWOFFLE in the subject line.


Ht://Dig
- - - -

The htdig package is copyrighted by Andrew Scherpbier. The icons in the
html/search/htdig directory come from htdig as does the search form
html/search/htdig/search.html and configuration files in search/htdig/conf/*
(with modifications by myself).


mnoGoSearch (UdmSearch)
- - - - - - - - - - - -

The mnoGoSearch package is copyrighted by Lavtech.Com Corp and released under
the GPL.  The mnoGoSearch icon in the html/search/mnogosearch directory comes
from mnoGoSearch as does the search form html/search/mnogosearch/search.html and
configuration files in search/mnogosearch/conf/* (with modifications by myself).


Namazu
- - -

The Namazu package is copyrighted by the Namazu Project and mknmz-wwwoffle is
copyrighted by WATANABE Yoshimasa, both programs are released under the GPL.
The configuration files in search/namazu/conf/* come from Namazu (with
modifications by myself).


Hyper Estraier
- - - - - - -

The Hyper Estraier package is copyrighted by Mikio Hirabayashi and is released
under the LGPL.  The configuration files in search/hyperestraier/conf/* come
from Hyper Estraier (with modifications by myself).


With Source Code Contributions From
- - - - - - - - - - - - - - - - - -

Yannick Versley <sa6z225@public.uni-hamburg.de>
        Initial syslog code (much rewritten before inclusion).

Axel Rasmus Wienberg <2wienbe@informatik.uni-hamburg.de>
        Code to run wwwoffled as a specified uid/gid.

Andreas Dietrich <quasi@baccus.franken.de>
        Code to detach the program from the terminal like a *real* daemon.

Ullrich von Bassewitz <uz@wuschel.ibb.schwaben.com>
        Better handling of signals.
        Optimisation of the file handling in the outgoing directory.
        The log-level, max-servers and max-fetch-servers config options.

Tilman Bohn <tb@bohn.isdn.uni-heidelberg.de>
        Autodial mode.

Walter Pfannenmueller <pfn@online.de>
        Document parsing for Java, VRML, XML and some HTML.

Ben Winslow <rain@insane.loonybin.net>
        Configuration file DontGet section optional replacement Url.
        New FTP commands to get file size and modification time.

Ingo Kloecker <kloecker@math.u-bordeaux.fr>
        Disable animated GIFs (code now removed and rewritten).

David McNab <david@rebirthing.co.nz>
        Workaround winsock bug for cygwin (now lingering close on all systems).

Olaf Buddenhagen <olafbuddenhagen@web.de>
        A patch to do random sorting in the indexes.

Jan Lukoschus <jan+wof@lukoschus.de>
        The patch for wwwoffle-hash (for wwwoffle-tools).

Paul A. Rombouts <p.a.rombouts@home.nl>
        The patch to force re-requests of redirection URLs.
        The patch to allow wildcards to have more than two '*' characters.
        The patch to allow local CGI scripts to be run.
        The patch to keep the backup copy of a page in case of server error.

Marc Boucher <MarcB@box100.com>
        The patch to perform case insensitive matching of URL-SPECs.
        The patch to handle FTP requests made with a password (like HTTP).

Ilya Dogolazky <ilyad@math.uni-bonn.de>
        The patch for the fix-mixed-cyrillic option.

Dieter <netbsd@sopwith.solgatos.com>
        A patch with some 64-bit/32-bit compatibility fixes (that prompted me to
        go and find and fix a lot more).

Andreas Mohr <andi@rhlx01.fht-esslingen.de>
        A patch to add "const" to lots of structures and function parameters
        (this prompted me to go and do a lot more).

Nils Kassube <kassube@gmx.net>
        The patch for the referer-from option.


And Other Useful Contributions From
- - - - - - - - - - - - - - - - - -

Too many people to mention - (everybody that e-mailed me).
        Suggestions and bug reports.