WWWOFFLE - World Wide Web Offline Explorer - Version 2.9e
=========================================================
The WWWOFFLE programs simplify World Wide Web browsing from computers that use
intermittent connections to the internet.
Description
-----------
The WWWOFFLE server is a proxy web server with special features for use with
intermittent internet links. This means that it is possible to browse web pages
and read them without having to remain connected.
Basic Features
- Caching of HTTP, FTP and finger protocols.
- Allows the 'GET', 'HEAD', 'POST' and 'PUT' HTTP methods.
- Interactive or command line control of online/offline/autodial status.
- Highly configurable.
- Low maintenance, start/stop and online/offline status can be automated.
While Online
- Caching of pages that are viewed for later review.
- Conditional fetching to only get pages that have changed.
- Based on expiration date, time since last fetched or once per session.
- Non-cached support for SSL (Secure Sockets Layer, e.g. https).
- Caching for https connections. (compile time option).
- Can be used with one or more external proxies based on web page.
- Control which pages cannot be accessed.
- Allow replacement of blocked pages.
- Control which pages are not to be stored in the cache.
- Create backups of cached pages when server cannot be contacted.
- Option to create backup when server sends back an error page.
- Requests compressed pages from web servers (compile time option).
- Requests chunked transfer encoding from web servers.
While Offline
- Can be configured to use dial-on-demand for pages that are not cached.
- Selection of pages to download next time online
- Using normal browser to follow links.
- Command line interface to select pages for downloading.
- Control which pages can be requested when offline.
- Provides non-cached access to intranet servers.
Automated Download
- Downloading of specified pages non-interactively.
- Options to automatically fetch objects in requested pages
- Understands various types of pages
- HTML 4.0, Java classes, VRML (partial), XML (partial).
- Options to fetch different classes of objects
- Images, Stylesheets, Frames, Scripts, Java or other objects.
- Option to not fetch webbug images (images of 1 pixel square).
- Automatically follows links for pages that have been moved.
- Can monitor pages at regular intervals to fetch those that have changed.
- Recursive fetching
- To specified depth.
- On any host or limited to same server or same directory.
- Chosen from command line or from browser.
- Control over which links can be fetched recursively.
Convenience
- Optional information footer on HTML pages showing date cached and options.
- Options to modify HTML pages
- Remove scripts.
- Remove Java applets.
- Remove stylesheets.
- Remove shockwave flash animations.
- Indicate cached and uncached links.
- Remove the blink tag.
- Remove the marquee tag.
- Remove refresh tags.
- Remove links to pages that are in the DontGet list.
- Remove inline frames (iframes) that are in the DontGet list.
- Replace images that are in the DontGet list.
- Replace webbug images (images of 1 pixel square).
- Demoronise HTML character sets.
- Fix mixed Cyrillic character sets.
- Stop animated GIFs.
- Remove Cookies in meta tags.
- Provides information about cached pages
- Headers, raw and modified.
- Contents, images, links etc.
- Source code unmodified by WWWOFFLE.
- Automatic proxy configuration with Proxy Auto-Config file.
- Searchable cache with the addition of the ht://Dig, mnoGoSearch
(UdmSearch), Namazu or Hyper Estraier programs.
- Built in simple web-server for local pages
- HTTP and HTTPS access (compile time option).
- Allows CGI scripts.
- Timeouts to stop proxy lockups
- DNS name lookups.
- Remote server connection.
- Data transfer.
- Continue or stop downloads interrupted by client.
- Based on file size or fraction downloaded.
- Purging of pages from cache
- Based on URL matching.
- To keep the cache size below a specified limit.
- To keep the free disk space above a specified limit.
- Interactive or command line control.
- Compression of cached pages based on age.
- Provides compressed pages to web browser (compile time option).
- Use chunked transfer-encoding to web browser.
Indexes
- Multiple indexes of pages stored in cache
- Servers for each protocol (http, ftp ...).
- Pages on each server.
- Pages waiting to be fetched.
- Pages requested last time offline.
- Pages fetched last time online.
- Pages monitored on a regular basis.
- Configurable indexes
- Sorted by name, date, server domain name, type of file.
- Options to delete, refresh or monitor pages.
- Selection of complete list of pages or hide un-interesting pages.
Security
- Works with pages that require basic username/password authentication.
- Automates proxy authentication for external proxies that require it.
- Control over access to the proxy
- Defaults to local host access only.
- Host access configured by hostname or IP address.
- Optional proxy authentication for user level access control.
- Optional password control for proxy management functions.
- HTTPS access to all proxy management web pages (compile time option).
- Can censor incoming and outgoing HTTP headers to maintain user privacy.
Configuration
- All options controlled using a configuration file.
- Interactive web page to allow editing of the configuration file.
- User customisable error and information pages.
- Log file or syslog reporting with user specified error level.
Configuring A Web Browser
-------------------------
To use the WWWOFFLE programs, your web browser must be configured to use
WWWOFFLE as a proxy. The proxy hostname will be 'localhost' (or the name of the
host that wwwoffled is running on), and the port number will be the one that is
used by wwwoffled (default 8080).
There are lots of different browsers and it is not possible to list all the ways
to configure them here. There should be an option in one of the menus or
described in the manual for the browser that explains how to configure a proxy.
You will also need to disable the caching that the web browser performs itself
between sessions to get the best out of the program.
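For command-line clients that honour the conventional proxy environment
variables (an assumption about your tools, not something WWWOFFLE itself
requires), pointing them at wwwoffled can be sketched like this:

```shell
# Minimal sketch: direct proxy-aware command-line tools at a local
# wwwoffled instance. localhost:8080 is the default host and port
# mentioned above; adjust to your setup.
export http_proxy="http://localhost:8080/"
export ftp_proxy="http://localhost:8080/"

# Many tools (e.g. wget, curl) read these variables automatically.
echo "$http_proxy"
```

These settings only affect programs started from the same shell; browsers are
normally configured through their own proxy settings as described above.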
Depending on which browser and version you use, it is possible to request that
pages be refreshed while offline. This is done using the 'reload' or 'refresh'
button or key on the browser. On many browsers there are two ways of doing
this; the one that forces the proxy to reload the page is the one that will
cause the page to be refreshed.
The latest browser compatibility information is available at:
http://www.gedanken.demon.co.uk/wwwoffle/version-2.9/browser.html
Welcome Page
------------
There is a welcome page at URL 'http://localhost:8080/' that gives a very brief
description of the program and has links to the index pages, interactive control
page and the WWWOFFLE internet home pages.
The most important places to get information about WWWOFFLE are the WWWOFFLE
homepage which has information about WWWOFFLE in general:
http://www.gedanken.demon.co.uk/wwwoffle/
Or even better the WWWOFFLE version-2.9 user page which has more information
about this version of WWWOFFLE:
http://www.gedanken.demon.co.uk/wwwoffle/version-2.9/user.html
Index Of Cached Files
---------------------
To get the index of cached files, use the URL 'http://localhost:8080/index/'.
There are sufficient links on each of the index pages to allow easy navigation.
The indexes provide several levels of information:
- A list of the requests in the outgoing directory.
- A list of the files fetched the last time that the program was online,
  and for the previous 5 times before that.
- A list of the files requested the last time that the program was offline,
  and for the previous 5 times before that.
- A list of the files that are being monitored.
- A list of all hosts for each of the protocols (http, ftp etc.).
- A list of all of the files on a particular host.
These indexes can be sorted in a number of ways:
- No sorting (directory order on disk).
- By time of last modification (update).
- By time of last access.
- By date of last update with markers for each day.
- Alphabetically.
- By file extension.
- Random.
For each of the pages that are cached there are options to delete the page,
refresh it, select the interactive refresh page with the URL already filled in
or add the page to the list that is monitored regularly.
It is also possible to specify in the configuration file what URLs are not to be
listed in the indexes.
Interactive Refresh Page
------------------------
Pages can be specified by using whatever method is provided by the browser that
is used or as an alternative there is an interactive refresh page. This allows
the user to enter a URL and then fetch it if it is not currently cached or
refresh it if it is in the cache. There is also the option here to recursively
fetch the pages that are linked to by the page that is specified. This
recursive fetching can be limited to pages from the same host, narrowed down to
links in the same directory (or subdirectory) or widened to fetch pages from any
web server. This functionality is also provided in the 'wwwoffle' command line
program.
Monitoring Web-Pages
--------------------
Pages can be specified that are to be checked at regular intervals. This can
either be every time that WWWOFFLE is online or at user specifiable times. The
page will be monitored when the four specified conditions are all met:
- A month of the year that it can be fetched in (can be set to all months).
- A day of the month that the page can be fetched on (can be set to all days).
- A day of the week that the page can be fetched on (can be set to all days).
- An hour of the day that the page should be fetched after (can be more than one).
For example to get a URL every Saturday morning, use the following:
Month of year: all
Day of Month : all
Day of week : Saturday
Hour of day : 0 (24hr clock)
Interactive Control Page
------------------------
The behaviour and mode of operation of the WWWOFFLE daemon can be controlled from
an interactive control page at 'http://localhost:8080/control/'. This has a
number of buttons that change the mode of the proxy server. These provide the
same functionality as the 'wwwoffle' command line program. To provide security,
this page can be password protected. There is also the facility to delete pages
from the cache or from the spooled outgoing requests directory.
Interactive Configuration File Editing Page
-------------------------------------------
The interactive configuration file editing page allows the configuration file
wwwoffle.conf to be edited. This facility can be reached via the configuration
editing page 'http://localhost:8080/configuration/'. Each item in the
configuration file has a separate web-page with a form in it that lists the
current entries in the configuration file and allows each entry to be edited
individually. When an entry has been updated, the configuration file needs to
be re-read.
Searching the Cache
-------------------
The web indexing programs ht://Dig, mnoGoSearch (UdmSearch), Namazu and Hyper
Estraier can be used to create an index of the pages in the WWWOFFLE cache for
later searching.
For ht://Dig, version 3.1.0b4 or later is required; it can be found at
http://www.htdig.org/.
For mnoGoSearch (previously called UdmSearch), version 3.1.0 or later is
required; it can be found at http://mnogosearch.org/.
For Namazu, version 2.0.0 or later is required; it can be found at
http://www.namazu.org/. Also required is mknmz-wwwoffle, which can be found at
http://www.naney.org/comp/distrib/mknmz-wwwoffle/.
For Hyper Estraier, version 0.5.3 or later is required; it can be found at
http://hyperestraier.sourceforge.net/.
The search forms for these programs are 'http://localhost:8080/search/htdig/',
'http://localhost:8080/search/mnogosearch/',
'http://localhost:8080/search/namazu/', and
'http://localhost:8080/search/hyperestraier/'. These allow the search part of
the programs to be run to find the cached web-pages that you want.
For more information about configuring these programs to work with WWWOFFLE you
should read the file README.htdig, README.mnogosearch, README.namazu, or
README.hyperestraier.
Built-In Web-Server
-------------------
Any URLs to WWWOFFLE on port 8080 that refer to files in the directory '/' refer
to the files that are stored in the 'html' subdirectory. This directory also
contains the message templates that WWWOFFLE uses to generate the internal web
pages. When a file is requested from either of these locations it is first
looked for in the language specific sub-directory specified in the browser's
request header. If it is not found in that location then it is looked for in
the directory named 'default' which by default is a symbolic link to the English
language pages, but can be changed. If it is not found in this location then it
is looked for in the English language directory (since that will have a full set
of pages).
Any URLs that refer to files in the directory '/local/' are taken from the files
in the 'local' sub-directory of the spool directory if they exist. If they do
not exist then they are searched for in the language subdirectories of the
'html' directory as described above. This allows for trivial web-pages to be
provided without a separate web-server. CGI scripts are available but disabled
by the default configuration file. The MIME types used for these files are
those specified in the configuration file.
Important: The local web-page server will follow symbolic links, but will only
allow access to files that are world readable. See the FAQ for
security issues.
Deleting Requests
-----------------
If no password is used for the control pages then it is possible for anybody to
delete requests that are recorded. If a password is assigned then users that
know this password can delete any request (or cached file or other thing).
Individual users that do not know the password can delete pages that they have
requested, provided that they do it immediately after the "Will Get" page
appears; the "Delete" button on that page has a once-only password that will
delete that request.
Backup Copies of Pages
----------------------
When a page is fetched while online, a remote server error would normally
overwrite any existing cached page. In this case a backup copy of the existing
page is made, so that once the error message has been read while offline the
backup copy is put back into the cache. This is automatic for all files that
get remote server errors (and that do not use external proxies); no user
intervention is required.
Lock Files
----------
When one WWWOFFLE process is downloading a file any other WWWOFFLE process that
tries to read that file will not be able to until the first one has finished.
This removes the problem of an incomplete page being displayed in the second
browser, or a second copy of the page being fetched. If the lock file is not
removed by the first process within a timeout period then the second process
will produce an error message indicating the problem.
This is a configurable option; the default is that lock files are not used.
HTTPS Access to Internal Pages
------------------------------
All of the web pages that are available through normal HTTP access on port 8080
(e.g. http://localhost:8080/*) are also available with secure HTTPS access on
port 8443 if WWWOFFLE is compiled with the libgnutls encryption library. This
applies to all pages; indexes, built-in web server and control and configuration
pages.
Caching of HTTPS Web Pages
--------------------------
It is possible to configure WWWOFFLE so that it will intercept and cache
selected HTTPS connections. This is disabled by default and there are three
steps to enable it. WWWOFFLE must be compiled with encryption support, the
enable-caching option in the SSLOptions section of the configuration file must
be set true and the list of hosts to cache for must be set.
When WWWOFFLE is configured to cache an HTTPS web page it will request the page,
decrypt it, re-encrypt it and pass it to the browser. The copy of the page that
is stored in the cache will be stored without encryption. With this option all
other WWWOFFLE features like the DontGet section, the ModifyHTML section, the
OnlineOptions and others will be used. Normally most of these options cannot be
applied to HTTPS pages because the exact URL is not known to WWWOFFLE and the
unencrypted contents are not visible.
HTTPS Server Certificates
-------------------------
To handle the encryption functions described above WWWOFFLE will create and
manage a set of server certificates. One master certificate is used to sign all
of the other certificates that WWWOFFLE creates. The created certificates are
either for the WWWOFFLE server HTTPS access pages or are fake certificates
created for each server that is cached. The certificates that are captured
by WWWOFFLE and stored are the certificates that are sent back by the real HTTPS
server. The final set of certificates are the trusted certificates that
WWWOFFLE can use to confirm that the remote server is the one it claims to be.
The full set of certificates that WWWOFFLE stores can be seen through the
WWWOFFLE URL http://localhost:8080/certificates/ but is only available if
WWWOFFLE was compiled with encryption support.
To add trusted certificates to WWWOFFLE place the certificate file (in PEM
format) into the directory '/var/lib/wwwoffle/certificates/trusted' and
restart WWWOFFLE.
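The steps above can be sketched as follows. The commands are run under a
temporary prefix so they are safe to try; on a real system the target directory
is the /var/lib/wwwoffle/certificates/trusted directory named above, and
'ca-cert.pem' is a hypothetical PEM-format certificate file:

```shell
# Sketch: install a trusted certificate (simulated under a temporary
# prefix; use the real /var/lib/wwwoffle path and restart WWWOFFLE,
# e.g. 'wwwoffle -kill' then start wwwoffled again, on a live system).
PREFIX=$(mktemp -d)
TRUSTED_DIR="$PREFIX/var/lib/wwwoffle/certificates/trusted"
mkdir -p "$TRUSTED_DIR"

# 'ca-cert.pem' is a made-up placeholder certificate for illustration.
printf -- '-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n' \
    > "$PREFIX/ca-cert.pem"

cp "$PREFIX/ca-cert.pem" "$TRUSTED_DIR/"
ls "$TRUSTED_DIR"
```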
Spool Directory Layout
----------------------
In the spool directory there is a directory for each of the network protocols
that are handled. In this directory there is a directory for each hostname that
has been contacted and has pages cached. These directories have the name of the
host. In each of these directories, there is an entry for each of the pages
that are cached, generated using a hashing function to give a constant length.
The entry consists of two files, one prefixed with 'D' that contains the data
and one prefixed with 'U' that contains the URL.
The outgoing directory is a single directory that all of the pending requests
are contained in, the format is the same with two files for each, but using 'O'
for the file containing the request instead of 'D' and one prefixed with 'U'
that contains the URL.
The lasttime (and prevtime*) directories are a single directory that contains an
entry for each of the files that were fetched the last time that the program was
online. Each entry consists of two files, one prefixed with 'D' that is a
hard-link to the real file and one prefixed with 'U' that contains the URL.
The lastout (and prevout*) directories are a single directory that contains an
entry for each of the files that were requested the last time that the program
was offline. Each entry consists of two files, one prefixed with 'D' that is a
hard-link to the real file and one prefixed with 'U' that contains the URL.
The monitor directory is a single directory that all of the regularly monitored
requests are contained in, the format is the same as the outgoing directory with
two files for each, using 'O' and 'U' prefixes. There is also a file with an
'M' prefix that contains the information about when to monitor the URL.
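The layout described above can be illustrated by rebuilding a tiny example of
it in a temporary directory. The hash-like entry name 'A1b2C3d4' is invented
for illustration; real entries are generated by WWWOFFLE's hashing function:

```shell
# Sketch of the spool layout: one directory per protocol, one per
# host, and for each cached page a 'D<hash>' file with the data and
# a 'U<hash>' file with the URL. 'A1b2C3d4' is a made-up hash.
SPOOL=$(mktemp -d)

mkdir -p "$SPOOL/http/www.example.com"
echo "<html>...</html>"            > "$SPOOL/http/www.example.com/DA1b2C3d4"
echo "http://www.example.com/page" > "$SPOOL/http/www.example.com/UA1b2C3d4"

# The outgoing directory holds pending requests: 'O' replaces 'D'.
mkdir -p "$SPOOL/outgoing"
echo "GET http://www.example.com/page HTTP/1.0" > "$SPOOL/outgoing/OA1b2C3d4"
echo "http://www.example.com/page"              > "$SPOOL/outgoing/UA1b2C3d4"

find "$SPOOL" -type f
```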
The Programs and Configuration File
-----------------------------------
There are two programs that make up this utility, with three distinct functions.
wwwoffle - A program to interact with and control the HTTP proxy daemon.
wwwoffled - A daemon process that acts as an HTTP proxy.
wwwoffles - A server that actually does the fetching of the web pages.
The wwwoffles function is combined with the wwwoffled function into the
wwwoffled program from version 1.1 onwards. This is to simplify the procedure
of starting servers, and allow for future improvements.
The configuration file, called wwwoffle.conf by default, contains all of the
parameters that are used to control the way the wwwoffled and wwwoffles
functions work. The default installation location for this file is the
directory /etc/wwwoffle.
WWWOFFLE - User control program
-------------------------------
The control program (wwwoffle) is used to control the action of the daemon
program (wwwoffled), or to request pages that are not in the cache.
The daemon program needs to know if the system is online or offline, when to
fetch the pages that have been previously requested and when to purge the cache
of old pages.
The first mode of operation is for controlling the daemon process. These are the
functions that are also available on the interactive control page (except kill).
wwwoffle -online Indicates to the daemon that the system is online.
wwwoffle -autodial Indicates to the daemon that the system is in autodial
mode, this will use cached pages if they exist and use
the network as last resort, for dial-on-demand systems.
wwwoffle -offline Indicates to the daemon that the system is offline.
wwwoffle -fetch Commands the daemon to fetch the pages that were
requested by clients while the system was offline.
wwwoffle exits when the fetching is complete.
(This requires the daemon to be told it is online).
wwwoffle -config Causes the configuration file for the daemon process to
be re-read. The config file can also be re-read by
sending a HUP signal to the wwwoffled process.
wwwoffle -purge Commands the daemon to purge from the cache the pages
that are older than the number of days specified in the
configuration file, using modification or access
time. Or if a maximum size is specified then delete the
oldest pages until the maximum size is not exceeded.
wwwoffle -status Request from the wwwoffled proxy server the current
status of the program. The online or offline mode, the
fetch and purge statuses, the number of current
processes and their PIDs are displayed.
wwwoffle -kill Causes the daemon to exit cleanly at a convenient point.
The second mode of operation is to specify URLs to get.
wwwoffle <URL> .. <URL> Specifies to the daemon URLs that must be fetched.
If online they are fetched immediately, else the
requests are stored for a later fetch.
wwwoffle <filename> ... The specified HTML file is read and all of the links
in it are used as if they had been specified on the
command line.
wwwoffle -post <URL> Send a request using the POST method, the data is read
from stdin and should be provided correctly url-encoded.
wwwoffle -put <URL> Send a request using the PUT method, the data is read
from stdin and should be provided correctly url-encoded.
wwwoffle -F Force the wwwoffle server to refresh the URL.
(Or fetch it if not cached.)
wwwoffle -g[Sisfo] Specifies that the URLs when fetched are to be parsed
for stylesheets (S), images (i), scripts (s), frames (f)
or objects (o), and these are also to be fetched. Using
-g without any following letters will get none of them.
wwwoffle -r[<depth>] Specifies that the URL when fetched is to have the links
followed and these pages also fetched (to a depth
specified by the optional depth parameter, default 1).
Only links on the same server are to be fetched.
wwwoffle -R[<depth>] This is the same as the '-r' option except that all of
the links are to be followed, even those to other
servers.
wwwoffle -d[<depth>] This is the same as the '-r' option except that links
are only followed if they are in the same directory or a
sub-directory.
(If the -F, -(d|r|R) or -g[Sisfo] options are set they
override the options in the FetchOptions section of the
config file, and only the object types specified by
-g[Sisfo] are fetched.)
The third mode of operation is to get a URL from the cache.
wwwoffle <URL> Specifies the URL to get.
wwwoffle -o Get the URL and output it on the standard output.
(Or request it if not already cached.)
wwwoffle -O Get the URL and output it on the standard output
including the HTTP headers.
(Or request it if not already cached.)
The last mode of operation is to provide help in using the other modes.
wwwoffle -h Gives help about the command line options.
With any of the first three modes of operation the WWWOFFLE server can be
specified in one of three different ways.
wwwoffle -c <config-file>
Can be used to specify the configuration file that
contains the port numbers, server hostname (the first
entry in the LocalHost section) and the password (if
required for the first mode of operation). If there is
a password then this is the only way to specify it.
wwwoffle -p <host>[:<port>]
Can be used to specify the hostname and port number that
the daemon program listens to for control messages (first
mode) or proxy connections (second and third modes).
WWWOFFLE_PROXY An environment variable that can be used to specify
either the argument to the -c option (must be the full
pathname) or the argument to the -p option. (In this
case two ports can be specified, the first for the proxy
connection, the second for the control connection
e.g. 'localhost:8080:8081' or 'localhost:8080'.)
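A typical offline/online session with the control program might look like the
following sketch. Only flags documented above are used; a stub function stands
in for the real binary so the sequence can be shown (and run) without a live
wwwoffled daemon, and should be dropped on a real system:

```shell
# Sketch of a typical WWWOFFLE session. The stub below merely echoes
# each call; remove it where wwwoffled is actually running.
wwwoffle() { echo "would run: wwwoffle $*"; }

wwwoffle -offline                  # browse; uncached requests are spooled
wwwoffle http://www.example.com/   # or queue a URL from the command line
wwwoffle -online                   # the connection is up
wwwoffle -fetch                    # download everything that was requested
wwwoffle -offline                  # back to reading from the cache
```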
WWWOFFLED - Daemon program
--------------------------
The daemon program (wwwoffled) runs as an HTTP proxy and also accepts
connections from the control program (wwwoffle).
The daemon program needs to maintain the current state of the system, online or
offline, as well as the other parameters in the configuration file.
As HTTP proxy requests come in, the program forks a copy of itself (the
wwwoffles function) to handle the requests. The server program can also be
forked in response to the wwwoffle program requesting pages to be fetched.
wwwoffled -c <config-file> Starts the daemon with the named configuration
file.
wwwoffled -d [level] Starts the daemon in debugging mode, i.e. it does
not detach from the terminal and uses standard
error for the log messages. The optional
numeric level (0 for none to 5 for all, or 6 for
more) specifies the level of error messages for
standard error; if not specified, the log-level
from the config file is used.
wwwoffled -f Start the daemon in debugging mode (implies -d)
and when the first HTTP request comes in handle
it without creating a child process and then
exit.
wwwoffled -p Print the PID of the daemon on standard out
before detaching from the terminal.
wwwoffled -h Gives help about the command line options.
There are a number of error and informational messages that are generated by the
program as it runs. By default (in the config file) these go to syslog, by
using the -d flag the daemon does not detach from the terminal and the errors
are also on standard error.
By using the run-uid and run-gid options in the config file, it is possible to
change the user id and group id that the program runs as. This will require
that the program is started by root and that the specified user has read/write
access to the spool directory.
WWWOFFLES - Server program
--------------------------
The server (wwwoffles) starts by being forked from the daemon (wwwoffled) in one
of three different modes.
Real - When the system is online and acting as a proxy for a client.
All requests for web pages are handled by forking a new server which
will connect to the remote host and fetch the page. This page is then
stored in the cache as well as being returned to the client. If the
page is already in the cache then the remote server is asked for a newer
page if one exists, else the cached one is used.
SpoolOrReal - When the system is in autodial mode and it has not yet been
decided whether Spool or Real mode applies. Spool mode is selected
if the page is already cached, otherwise Real mode is used as a
last resort.
Fetch - When the system is online and fetching pages that have been requested.
All web page requests in the outgoing directory are fetched by the
server connecting to the remote host to get the page. This page is then
stored in the cache, there is no client active. If the page has been
moved then the link is followed and that one fetched.
Spool - When the system is offline and acting as a proxy for a client.
All requests for web pages are handled by forking a server that will
either return a cached page or store the request. If the page is
cached, it is returned to the client, else a dummy page is returned
(and stored in the cache), and the outgoing request is stored.
If the cached page refers to a page that failed to be downloaded then it
will be deleted from the cache.
Depending on the existence of files in the spool and other conditions, the mode
can be changed to one of several other modes.
RealNoCache - For requests for pages on the server machine or those specified
not to be cached in the configuration file.
RealRefresh - Used by the refresh button on the index or the wwwoffle program
to re-fetch a page while the system is online.
RealNoPassword - Used when a password was provided and two copies of the page
are required, one with and one without the password.
FetchNoPassword - Used when a password was provided and two copies of the page
are required, one with and one without the password.
SpoolGet - Used when the page does not exist in the cache so a request needs to
be stored for it in the outgoing directory.
SpoolRefresh - Used when the refresh button on the index or the wwwoffle program
is used; the existing spooled page (if there is one) is not
overwritten, but a request is stored.
SpoolPragma - Used when the client asks the cache to refresh the page
using the 'Pragma: no-cache' header; the existing spooled page (if there
is one) is not overwritten, but a request is stored.
InternalPage - Used when the program is generating a web-page internally or is
spooling a web-page with modifications.
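The initial mode selection described above can be sketched as a small decision function. This is a simplified illustration of the rules, not the actual wwwoffles code; the function and parameter names are invented:

```python
def initial_mode(online: bool, autodial: bool, fetching: bool, cached: bool) -> str:
    """Simplified sketch of how wwwoffles might pick its starting mode.

    An illustration of the rules described above, not the real logic.
    """
    if fetching:
        return "Fetch"          # online, processing the outgoing directory
    if online and not autodial:
        return "Real"           # act as a normal proxy, refreshing the cache
    if autodial:
        # SpoolOrReal: prefer the cache, go online only as a last resort
        return "Spool" if cached else "Real"
    return "Spool"              # offline: serve from cache or store a request
```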
WWWOFFLE-TOOLS - Cache maintenance program
------------------------------------------
This is a quick hack of a program that I wrote to allow you to list the contents
of the cache or move files around in it. The tools are all named after common
UNIX commands with a 'wwwoffle-' prefix.
All of the programs should be invoked from the spool directory.
wwwoffle-rm - Delete the URL that is specified on the command line.
To delete all URLs from a host it is easier to use
'rm -r http/foo' than use this.
wwwoffle-mv - To rename URLs under one path in the spool to another path.
Because the URL is encoded in the filename just renaming the
files or the directory will not work. Instead of 'mv http/foo
http/bar' use 'wwwoffle-mv http/foo http/bar'. This also works for
complex cases: 'wwwoffle-mv http://foo/bar http://bar/foo'.
wwwoffle-ls - To list a cached URL or the files in a cache directory in the
style of 'ls -l'. As examples use 'wwwoffle-ls http/foo' to
list all of the URLs in the cache directory 'http/foo', use
'wwwoffle-ls http://foo/' to list the single URL 'http://foo/'
or use 'wwwoffle-ls outgoing' to list the outgoing requests.
wwwoffle-read - Read data directly from the cache for the URL named on the
command line and output it on stdout.
wwwoffle-write - Write data directly to the cache for the URL named on the
command line from stdin. Note this requires an HTTP header to
be included first or clients may get confused.
(echo "HTTP/1.0 200 OK" ; echo "" ; cat bar.html ) | \
wwwoffle-write http://www.foo/bar.html
wwwoffle-hash - Print WWWOFFLE's encoding of the submitted URL. This is
useful for scripts working on the WWWOFFLE cache.
wwwoffle-fsck - Checks the WWWOFFLE cache for consistency; it will rename any
files whose filename does not match the hash of the URL.
wwwoffle-gzip - Compress the contents of the cache so that they take less
space but WWWOFFLE can still read them.
wwwoffle-gunzip - Uncompress the contents of the cache.
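The reason a plain 'mv' breaks the cache (as noted for wwwoffle-mv above) is that each URL is encoded into its cache filename, so renaming the file or directory leaves a name that no longer matches the URL. A toy illustration, using MD5 in place of WWWOFFLE's real encoding scheme:

```python
import hashlib

def cache_filename(url: str) -> str:
    """Toy stand-in for WWWOFFLE's URL-to-filename encoding (not the real scheme)."""
    return "D" + hashlib.md5(url.encode()).hexdigest()[:16]

old = cache_filename("http://foo/bar")
new = cache_filename("http://bar/foo")
# A plain 'mv' would keep the old encoded name, which no longer matches
# the new URL -- wwwoffle-mv re-encodes each filename instead.
assert old != new
```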
All of the programs are the same executable and the name of the file determines
the function. The wwwoffle-tools executable can also be used with a command
line parameter, for example 'wwwoffle-ls' is the same as 'wwwoffle-tools -ls'.
This program also accepts the '-c <config-file>' option and uses the
'WWWOFFLE_PROXY' environment variable so that the wwwoffle-write program uses
the correct permissions and uid/gid.
These are basically hacks and should not be considered as fully featured and
fully debugged programs.
audit-usage.pl - Perl script to check log files
-----------------------------------------------
The audit-usage.pl script (in the contrib directory) can be used to get audit
information from the output of the wwwoffled program.
If wwwoffled is started as
wwwoffled -c /etc/wwwoffle/wwwoffle.conf -d 4
then information about the program's operation is written to standard error.
The debug level needs to be 4 so that the URL information is output.
If this output is captured into a log file then it can be analysed by the
audit-usage.pl program. When run, it reports the host that each connection was
made from and the URL that was requested. It also includes timestamp
information and connections to the WWWOFFLE control port.
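A minimal sketch of the kind of extraction audit-usage.pl performs. The log line format below is an assumption for illustration, not the exact wwwoffled output:

```python
import re

# Assumed shape of a debug-level-4 log line; real wwwoffled output differs.
LINE = re.compile(r"^\[(?P<time>[^\]]+)\] connection from (?P<host>\S+) for URL (?P<url>\S+)")

def audit(lines):
    """Yield (timestamp, host, url) tuples from matching log lines."""
    for line in lines:
        m = LINE.match(line)
        if m:
            yield m.group("time"), m.group("host"), m.group("url")

sample = [
    "[Jan 01 12:00:00] connection from 192.168.1.2 for URL http://example.com/",
    "unrelated debug output",
]
print(list(audit(sample)))
```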
Test Programs
-------------
In the testprogs directory are two test programs that can be compiled if
required. They are not needed for WWWOFFLE to work, but if you are customising
the information pages for WWWOFFLE to use or trying to debug the HTML parser
then they will be of use.
These are even more hacks than the wwwoffle-tools programs; use at your own risk.
Author and Copyright
--------------------
The two programs wwwoffle and wwwoffled were written by Andrew M. Bishop in
1996,97,98,99,2000,01,02,03,04,05 and are copyright Andrew M. Bishop
1996,97,98,99,2000,01,02,03,04,05.
The programs known as wwwoffle-tools were written by Andrew M. Bishop in 1997,
98,99,2000,01,02,03,04,05 and are copyright Andrew M. Bishop 1997,98,99,2000,01,
02,03,04,05.
The Perl scripts update-config.pl and audit-usage.pl were written by Andrew
M. Bishop in 1998,99,2000,01,02,03,04,05 and are copyright Andrew M. Bishop
1998,99,2000,01,02,03,04,05.
They can be freely distributed according to the terms of the GNU General Public
License (see the file `COPYING').
If you wish to submit bug reports or other comments about the programs then
email the author amb@gedanken.demon.co.uk and put WWWOFFLE in the subject line.
Ht://Dig
- - - -
The htdig package is copyrighted by Andrew Scherpbier. The icons in the
html/search/htdig directory come from htdig as does the search form
html/search/htdig/search.html and configuration files in search/htdig/conf/*
(with modifications by myself).
mnoGoSearch (UdmSearch)
- - - - - - - - - - - -
The mnoGoSearch package is copyrighted by Lavtech.Com Corp and released under
the GPL. The mnoGoSearch icon in the html/search/mnogosearch directory comes
from mnoGoSearch as does the search form html/search/mnogosearch/search.html and
configuration files in search/mnogosearch/conf/* (with modifications by myself).
Namazu
- - -
The Namazu package is copyrighted by the Namazu Project and mknmz-wwwoffle is
copyrighted by WATANABE Yoshimasa; both programs are released under the GPL.
The configuration files in search/namazu/conf/* come from Namazu (with
modifications by myself).
Hyper Estraier
- - - - - - -
The Hyper Estraier package is copyrighted by Mikio Hirabayashi and is released
under the LGPL. The configuration files in search/hyperestraier/conf/* come
from Hyper Estraier (with modifications by myself).
With Source Code Contributions From
- - - - - - - - - - - - - - - - - -
Yannick Versley <sa6z225@public.uni-hamburg.de>
Initial syslog code (much rewritten before inclusion).
Axel Rasmus Wienberg <2wienbe@informatik.uni-hamburg.de>
Code to run wwwoffled as a specified uid/gid.
Andreas Dietrich <quasi@baccus.franken.de>
Code to detach the program from the terminal like a *real* daemon.
Ullrich von Bassewitz <uz@wuschel.ibb.schwaben.com>
Better handling of signals.
Optimisation of the file handling in the outgoing directory.
The log-level, max-servers and max-fetch-servers config options.
Tilman Bohn <tb@bohn.isdn.uni-heidelberg.de>
Autodial mode.
Walter Pfannenmueller <pfn@online.de>
Document parsing for Java/VRML/XML and some HTML.
Ben Winslow <rain@insane.loonybin.net>
The optional replacement URL in the configuration file DontGet section.
New FTP commands to get file size and modification time.
Ingo Kloecker <kloecker@math.u-bordeaux.fr>
Disable animated GIFs (code now removed and rewritten).
David McNab <david@rebirthing.co.nz>
Workaround winsock bug for cygwin (now lingering close on all systems).
Olaf Buddenhagen <olafbuddenhagen@web.de>
A patch to do random sorting in the indexes.
Jan Lukoschus <jan+wof@lukoschus.de>
The patch for wwwoffle-hash (for wwwoffle-tools).
Paul A. Rombouts <p.a.rombouts@home.nl>
The patch to force re-requests of redirection URLs.
The patch to allow wildcards to have more than two '*' characters.
The patch to allow local CGI scripts to be run.
The patch to keep the backup copy of a page in case of server error.
Marc Boucher <MarcB@box100.com>
The patch to perform case insensitive matching of URL-SPECs.
The patch to handle FTP requests made with a password (like HTTP).
Ilya Dogolazky <ilyad@math.uni-bonn.de>
The patch for the fix-mixed-cyrillic option.
Dieter <netbsd@sopwith.solgatos.com>
A patch with some 64-bit/32-bit compatibility fixes (that prompted me to
go and find and fix a lot more).
Andreas Mohr <andi@rhlx01.fht-esslingen.de>
A patch to add "const" to lots of structures and function parameters
(this prompted me to go and do a lot more).
Nils Kassube <kassube@gmx.net>
The patch for the referer-from option.
And Other Useful Contributions From
- - - - - - - - - - - - - - - - - -
Too many people to mention - (everybody that e-mailed me).
Suggestions and bug reports.