- About this service
- What it does
- Use it online
- Install it locally
- Robots exclusion
- Comments, suggestions and bugs
About this service
In order to check the validity of the technical reports that W3C publishes, the Systems Team has developed a link checker.
A first version was developed in August 1998 by Renaud Bruyeron. Since it lacked some functionality, Hugo Haas rewrote it more or less from scratch in November 1999. It has been improved by Ville Skyttä and many other volunteers since then.
The source code is available publicly under the W3C IPR software notice from CPAN (released versions) and CVS (development and archived release versions).
What it does
The link checker reads an HTML or XHTML document and extracts a list of anchors and links.
It checks that no anchor is defined twice.
It then checks that all the links are dereferenceable, including the fragments. It warns about HTTP redirects, including directory redirects.
It can recursively check part of a Web site.
There is a command line version and a CGI version. Both support HTTP basic authentication; in the CGI version, this is achieved by passing the authorization information from the user's browser through to the site being tested.
Use it online
There is an online version of the link checker.
In the online version (and in general, when run as a CGI script), the number of documents that can be checked recursively is limited. Both the command line version and the online one sleep at least one second between requests to each server to avoid abuse and congestion of the target server.
Install it locally
The link checker is written in Perl. It is packaged as a standard CPAN distribution, and depends on a few other modules which are also available from CPAN.
In order to install it:
- Install Perl.
- Install the following CPAN distributions, as well as any distributions they depend on (a sample installation command sequence follows this list). Depending on your Perl version, you might already have some of these installed. The latest versions of these distributions may require a recent version of Perl; as long as the minimum version requirements below are satisfied, an older version that works with your Perl is fine. For an introduction to installing Perl modules, see The CPAN FAQ.
- W3C-LinkChecker (the link checker itself)
- CGI.pm (required for CGI mode only)
- Config-General (optional, version 2.06 or newer; required only for reading the (optional) configuration file)
- HTML-Parser (version 3.00 or newer)
- libwww-perl (version 5.66 or newer; version 5.70 or newer recommended, except 5.76, which has a bug that may cause the link checker to follow redirects to file: URLs)
- Net-IP
- TermReadKey (optional but recommended; required only in command line mode for password input)
- Time-HiRes
- URI
- Optionally, install the link checker configuration file etc/checklink.conf from the link checker distribution package into /etc/w3c/checklink.conf, or set the W3C_CHECKLINK_CFG environment variable to the location where you installed it.
- Optionally, install the checklink script into a location on your web server that allows execution of CGI scripts (typically a directory named cgi-bin somewhere below your web server's root directory).
- See also the README and INSTALL file(s) included in the above distributions.
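As a rough sketch, the steps above might look like the following on a Unix-like system, assuming the stock cpan client (cpanm or your distribution's package manager would work just as well). The module names below are the ones provided by the listed distributions, and the configuration file path is just the example location mentioned above:

    # Required distributions (a CPAN client normally pulls in dependencies)
    cpan HTML::Parser LWP Net::IP Time::HiRes URI

    # Optional pieces: configuration file support, command line password
    # input, and CGI mode
    cpan Config::General Term::ReadKey CGI

    # The link checker itself
    cpan W3C::LinkChecker

    # Optionally point the link checker at a configuration file
    export W3C_CHECKLINK_CFG=/etc/w3c/checklink.conf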
Running checklink --help shows how to use the command line version. The distribution package also includes more extensive POD documentation; use perldoc checklink (or man checklink on Unixish systems) to view it.
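For example, a basic run and a small recursive run might look like this; the option names used below (--recursive, --depth) are assumptions and should be confirmed against the --help output or the POD documentation of your installed version:

    # Check a single document
    checklink http://www.example.org/

    # Check part of a site recursively, two levels deep (option names
    # assumed; verify with checklink --help)
    checklink --recursive --depth 2 http://www.example.org/

    # View the full documentation
    perldoc checklink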
If you want to enable the authentication capabilities with Apache, have a look at Steven Drake's hack.
Some environment variables affect the way the link checker uses FTP; in particular, passive mode is the default. See Net::FTP(3) for more information.
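For instance, assuming the FTP_PASSIVE variable documented in Net::FTP(3) is what the link checker picks up, active mode could be requested like this:

    # Net::FTP honors the FTP_PASSIVE environment variable; a false value
    # requests active mode instead of the passive default
    FTP_PASSIVE=0 checklink ftp://ftp.example.org/pub/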
There are multiple alternatives for configuring the default NNTP server for use with news: URIs that have no explicit hostname; see Net::NNTP(3) for more information.
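One such alternative, based on the environment variables described in Net::NNTP(3), is to set NNTPSERVER before running the link checker:

    # Net::NNTP falls back to NNTPSERVER (then NEWSHOST) when no news
    # host is given explicitly
    export NNTPSERVER=news.example.org
    checklink http://www.example.org/page-with-news-links.html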
Robots exclusion
As of version 4.0, the link checker honors robots exclusion rules. To place rules specific to the W3C Link Checker in /robots.txt files, sites can use the W3C-checklink user agent string. For example, to allow the link checker to access all documents on a server and to disallow all other robots, one could use the following:
    User-Agent: *
    Disallow: /

    User-Agent: W3C-checklink
    Disallow:
Robots exclusion support in the link checker is based on the LWP::RobotUA Perl module. It currently supports the "original 1994 version" of the standard. The robots META tag, i.e. <meta name="robots" content="...">, is not supported.
Other than that, the link checker's implementation honors robots exclusion rules as fully as possible: if a /robots.txt file disallows it, not even the first document submitted as the root of a link checker run is fetched.
Note that /robots.txt rules affect only user agents that honor them; they are not a generic method for access control.
Comments, suggestions and bugs
The current version has proven to be stable. It could, however, be improved; see the list of open enhancement ideas and bugs for details.
Please send comments, suggestions and bug reports about the link checker to the www-validator mailing list (archives), with 'checklink' in the subject.