Checkbot -- a WWW link verifier
Checkbot is a perl5 script which can verify links within a region of
the World Wide Web. It checks all pages within an identified region,
and all links within that region. After checking all links within the
region, it will also check all links which point outside of the
region, and then stop.
Checkbot regularly writes reports on its findings, including all
servers found in the region, and all links with problems on those
Checkbot was written originally to check a number of servers at
once. This has implied some design decisions, so you might want to
keep that in mind when making suggestions. Speaking of which, be sure
to check the to do file on the website for things which have been
suggested for Checkbot.
Making and installing Checkbot is easy:
You will need to have the following Perl modules installed in order to
properly install Checkbot:
Mail::Send (optional, contained in the MailTools package)
Time::Duration (optional, used for additional info in report)
WHERE TO FIND IT
Checkbot is distributed at: http://degraaff.org/checkbot/
Problems, bug reports, and feature enhancements are welcome at
There is an announcement mailing list to which announcements of new
versions are posted. You can sign up for the list at
Hans de Graaff <email@example.com>
Changes in versino 1.80 (15-Oct-2008)
* Fix handling of nofollow robots tag.
* Require newer version of LWP for better handling of character
* Ignore mms scheme.
* Minor clarification in output.
Changes in version 1.79 (3-Feb-2007)
* Correctly parse documents to avoid problems with UTF-8
documents. This avoids the "Parsing of undecoded UTF-8 will give
garbage when decoding entities" messages.
* Allow regular expressions in the suppression file, and complain if
the suppression file is not a proper file.
* More robust handling of HTTP and FTP servers that have problems
responding to HEAD requests.
* Use the original URL to report problems.
* Ensure XHTML compliance.
Changes in version 1.78 (3-May-2006)
* Don't throw errors for links that cannot be expected to be valid
all the time (e.g. the classid attribute of an object element)
* Better fallbacks for some cases where the HEAD request does not
* Add more classes and ids to allow more styling of results pages
(including example CSS file)
* Ensure XHTML compliance
* Better checks for optional dependencies
Changes in version 1.77 (28-Jul-2005)
* Fix silly build-related problem that prevented checkbot 1.76 from
running at all.
* Check for presence of robots meta tag and act on it.
Changes in version 1.76 (25-Jul-2005)
* Error reports now include the page title for easier identification.
* Documentation updates.
Changes in version 1.75 (22-Apr-2004)
* New --cookies option to accept cookies from servers while checking.
* New --noproxy option indicates which domains should not be
passed through the proxy.
* New error code for unknown domains; only known non-checkable
schemes are ignored now.
* Minor bug fixes.
* Documentation updates.
Changes in version 1.74 (17-Dec-2003)
* New --suppress option allows Response code/URL combinations not
to be reported as problems.
* Checkbot warnings are now handled as pseudo-HTTP status messages
so that they can make use of all Checkbot features such as
* Option --allow-simple-hosts is deprecated due to this change.
* More robust handling of (lack of) status messages.
* Checkbot now requires LWP 5.70 due to bugfixes in this release,
although it should still also work with older LWP versions.
* Documentation fixes.
Changes in version 1.73 (31-Aug-2003)
* Checkbot now tries to produce valid XHTML 1.1
* URLs matching the --ignore option are now completely ignored;
they used to be checked but not reported.
* Proxy support works again, but --proxy now applies to all links
* Documentation fixes
Changes in version 1.72 (04-May-2003)
* URLs with query strings are now checked by default, the
--exclude option can be used to revert to the previous behavior
* The server results page contains shortcut links to each section
* Removed warning for unqualified hostnames for news: URLs
* Handling of signals such as SIGINT
* Bug and documentation fixes
Changes in version 1.71 (29-Dec-2002)
* New --filter option allows rewriting of URLs before they will be checked
* Problematic links are now reported for each page on which they occur
* New statistics which should work correctly
* Much simplified storage of information on problem links
* Duplicate links are now properly detected and not checked twice
* Rewritten internals for link checking, as a consequence internal
and external links are checked at the same time now, not in two
passes like before
* Rewritten internals for message output
* A simple test case for 'make test'
* Minor cleanups of the code
Version 1.70 was only released for testing purposes
Changes in version 1.69
* Improved makefile and packaging
* Better default for --match argument
* Additional instance of using GET instead of HEAD added
* Bug fixes in printing of web server feedback
Changes in version 1.68
* Add --allow-simple-hosts which doesn't check for unqualified hosts
* Mention --style option in help and added example style file
* Change --sleep implementation so that fractional seconds can be used
* Fix a bug with handling <base> tags
* Tighten checks for http and https schemes
* Remove harmless warnings