This file is no longer being updated, as I have moved my project
management into ShadowPlan. I hope to have an HTML version of the
current state online soon. -- Hans 30-Apr-2001
* Doing a HEAD on an ISMAP link causes a 500 server error. Maybe a
  GET would work better in that case; see the sketch below.
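  A rough sketch of the HEAD-then-GET fallback; the URL is made up
  and the code is untested:

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua  = LWP::UserAgent->new;
      my $url = 'http://www.example.com/map';    # made-up ISMAP target

      my $response = $ua->head($url);
      if ($response->is_error) {                 # e.g. the 500 described above
          $response = $ua->get($url);            # fall back to a full GET
      }
      print "$url: ", $response->status_line, "\n";
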
* Add a mechanism for entering authentication data so that
  password-protected areas can be checked as well. Possibly allow
  multiple pairs. Rather not enter the passwords on the command line.
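  One possible approach is to read host/realm/user/password tuples
  from a file and hand them to LWP; the file name and format below
  are made up:

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua = LWP::UserAgent->new;

      # Read "host:port realm user password" lines from a file instead
      # of the command line; the file name is made up.
      open my $fh, '<', '.checkbot-auth' or die "cannot open .checkbot-auth: $!";
      while (my $line = <$fh>) {
          chomp $line;
          next if $line =~ /^\s*(#|$)/;                    # skip comments and blanks
          my ($netloc, $realm, $user, $pass) = split ' ', $line, 4;
          $ua->credentials($netloc, $realm, $user, $pass); # LWP will use these pairs
      }
      close $fh;
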
* Obey robot rules. My current idea is to obey robot rules by default
  on all 'external' requests, and ignore them on 'local'
  requests. Both of these should be changeable.
  Obeying internal robots.txt files might be a nice way to have a
  good default set of exclusions.
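  A sketch of how this could work with the WWW::RobotRules module,
  fetching each host's robots.txt once (the bookkeeping is
  simplified):

      use strict;
      use warnings;
      use URI;
      use LWP::UserAgent;
      use WWW::RobotRules;

      my $ua    = LWP::UserAgent->new;
      my $rules = WWW::RobotRules->new('Checkbot');

      my %robots_seen;
      # Fetch and remember the robots.txt of a host the first time we see it.
      sub learn_robots_txt {
          my ($url) = @_;
          my $uri  = URI->new($url);
          my $host = $uri->host_port;
          return if $robots_seen{$host}++;
          my $robots_url = $uri->scheme . '://' . $host . '/robots.txt';
          my $response   = $ua->get($robots_url);
          $rules->parse($robots_url, $response->content) if $response->is_success;
      }

      # Before checking an external link:
      #   learn_robots_txt($url);
      #   next unless $rules->allowed($url);   # skip what robots.txt forbids
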
* Make it possible to convert http:// references to file://
references, so that a local server can be checked without going
through the WWW server.
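  A sketch of such a rewrite; the server prefix and document root are
  made-up examples:

      use strict;
      use warnings;

      my $server_prefix = 'http://www.example.com';   # made-up local server
      my $document_root = '/var/www/htdocs';          # made-up document root

      sub to_file_url {
          my ($url) = @_;
          if ($url =~ m{^\Q$server_prefix\E(/.*)?$}) {
              my $path = defined $1 ? $1 : '/';
              return "file://$document_root$path";
          }
          return $url;                                # leave everything else alone
      }

      print to_file_url('http://www.example.com/docs/index.html'), "\n";
      # prints: file:///var/www/htdocs/docs/index.html
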
* Keep state between runs, but make sure we are still able to run
  Checkbot on several areas (concurrently). Uses for state
  information: a list of consistently bad hosts, remembering previous
  bad links and checking just those with a `quick' option, reporting
  on hosts which keep timing out.
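  A sketch of keeping such state with the Storable module; the file
  name and the layout of the state hash are just guesses at what we
  would need:

      use strict;
      use warnings;
      use Storable qw(store retrieve);

      my $state_file = 'checkbot.state';              # made-up file name
      my $state = -e $state_file
                ? retrieve($state_file)
                : { bad_links => {}, slow_hosts => {} };

      # ... during the run, record problems ...
      $state->{bad_links}{'http://www.example.com/missing.html'} = 404;

      # ... at the end of the run, save for the next (quick) run ...
      store($state, $state_file);
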
* Retry links with time-out problems after checking all other links,
  to better deal with transient problems. See above as well.
* Parse client-side (and server-side) MAPs if possible.
* Maybe use a Netscape feature to open problem links in a new browser
  window, so that the problem links page remains visible and
  available. Frames? (*shudder*)
* Include (or link to) a page which contains explanations for the
different error messages. (But watch out for server-specific
messages, if any)
* The external link count is way off. Write code to parse the
  external queue first, and then run through it to actually check the
  links.
* Keep an internal list of hosts to which we cannot connect, so that
  we avoid stalling for a while on every link to such a host.
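  A minimal sketch of such a skip list, keyed on host name (how a
  connection failure is detected is left out):

      use strict;
      use warnings;
      use URI;

      my %dead_host;

      sub mark_host_dead { $dead_host{ URI->new($_[0])->host } = 1 }
      sub host_is_dead   { $dead_host{ URI->new($_[0])->host } }

      # In the checking loop (pseudo-usage):
      #   next if host_is_dead($url);          # do not even try this link
      #   mark_host_dead($url) if $connect_failed;
      print host_is_dead('http://www.example.com/x.html') ? "skip\n" : "check\n";
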
* The exclude option is somewhat confusing, in that the links
  matching this option will still be checked (only the links on these
  pages will really be excluded). Maybe add a new option to really
  ignore matching links? Or just redo the current options.
  Perhaps a solution is to make separate exclude options for internal
  and external links?
* Add an option to count hops instead of using match, and only follow
  links up to that many hops away? Suggested for single-page checking,
  but might be useful on a larger scale as well, for instance against
  servers that create recursive symlinks by accident.
  This option could be further specialized to apply to specific match
  expressions only. See the sketch below.
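  A sketch of how a hop limit might work, with fetching and link
  extraction stubbed out; the hop-limit value and helper names are
  hypothetical:

      use strict;
      use warnings;

      my $max_hops = 3;
      my @queue    = ([ 'http://www.example.com/', 0 ]);  # [ url, hops from start ]
      my %seen;

      while (my $item = shift @queue) {
          my ($url, $hops) = @$item;
          next if $seen{$url}++;
          check_url($url);                               # whatever Checkbot does now
          next if $hops >= $max_hops;                    # stop following links here
          push @queue, [ $_, $hops + 1 ] for links_on_page($url);
      }

      sub check_url     { print "checking $_[0]\n" }     # placeholder
      sub links_on_page { return () }                    # placeholder
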
* Sort problems on the server page in a different order
(e.g. critical errors first).
* Sort the problem reports by page instead of by type of problem.
* Improve the reporting pages to be clearer about what is going on.
* Using IP addresses in reports can conflict with HTTP/1.1 virtual
  hosting. Also, in the current situation several reports will be
  generated when a server has several names for the same IP address,
  or when the same content is served through two web servers. The
  latter situation could be solved by adding an option which would
  indicate that these hosts are actually the same.
* Get Checkbot listed as a module, so that it can be installed with
  the standard CPAN tools.
* List all options on the generated page, not just the URLs and the
  match expression.
* Perhaps use HTML::Summary to provide summaries of the documents.
* Proxy handling is a bit of a mess; I should look into this and
  clean it up.
* Setting a timeout on the user agent doesn't always make URLs time
  out. Leon Abelman points to the FAQ and suggests either an extra
  option, or dealing with the timeout ourselves. See the sketch after
  the quote below.
  21) I set the $ua->timeout() to 1 minute, but LWP might still take
  a long time to fetch a document. Is this a bug?
  No, $ua->timeout sets a timeout on how long LWP allows the
  connection to be idle before giving up. If only a single byte
  arrives every 50 seconds, LWP will keep on listening but will
  probably not finish downloading for a long, long time. If you want
  an absolute timeout you can do it with alarm() or by setting up a
  callback and measuring the time yourself. Take a look at the
  preceding two questions. (Gisle Aas)
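  A sketch of the alarm() approach the FAQ suggests; the URL and the
  60 second limit are made up:

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua      = LWP::UserAgent->new;
      my $url     = 'http://www.example.com/slow-page';  # made-up slow URL
      my $seconds = 60;

      my $response = eval {
          local $SIG{ALRM} = sub { die "absolute timeout\n" };
          alarm($seconds);
          my $r = $ua->get($url);
          alarm(0);
          $r;
      };
      alarm(0);                               # make sure the alarm is cleared
      if (!defined $response) {
          print "$url: gave up after $seconds seconds\n";
      } else {
          print "$url: ", $response->status_line, "\n";
      }
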
* Sometimes it is difficult (for the Web editors) to locate the
  source of problem links, especially when the URLs are extracted
  from a database (as we often do). Maybe it would be useful if
  Checkbot could (optionally) list the link text (the text between <a
  href="..."> and </a>) in the Checkbot output.
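  A sketch of collecting link text with the HTML::Parser module; the
  sample document is made up:

      use strict;
      use warnings;
      use HTML::Parser;

      my %link_text;
      my ($current_href, $current_text);

      my $parser = HTML::Parser->new(
          api_version => 3,
          start_h => [ sub {
              my ($tag, $attr) = @_;
              if ($tag eq 'a' && defined $attr->{href}) {
                  $current_href = $attr->{href};
                  $current_text = '';
              }
          }, 'tagname,attr' ],
          text_h  => [ sub { $current_text .= $_[0] if defined $current_href }, 'dtext' ],
          end_h   => [ sub {
              if ($_[0] eq 'a' && defined $current_href) {
                  $link_text{$current_href} = $current_text;
                  undef $current_href;
              }
          }, 'tagname' ],
      );

      $parser->parse('<p><a href="foo.html">A link to foo</a></p>');
      $parser->eof;
      print "$_ -> $link_text{$_}\n" for keys %link_text;
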
* Only send a report by email when there is something to report?
* I am wondering if you could implement some kind of parallelism in
checking the links (e.g. checking a definable number of links in
parallel, speeding things up). Especially external links sometimes
take a long time to check. It would be nice if that could be sped
up by spawning several HTTP requests in parallel. -- Thomas
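  One way this could be done is with the separate LWP::Parallel CPAN
  package; a rough, untested sketch with made-up URLs:

      use strict;
      use warnings;
      use HTTP::Request;
      use LWP::Parallel::UserAgent;

      my @urls = ('http://www.example.com/', 'http://www.example.org/');

      my $pua = LWP::Parallel::UserAgent->new;
      $pua->max_req(5);                        # parallel requests per host

      $pua->register(HTTP::Request->new(HEAD => $_)) for @urls;
      my $entries = $pua->wait(30);            # wait at most 30 seconds overall

      for my $entry (values %$entries) {
          my $response = $entry->response;
          print $response->request->url, ': ',
                $response->code, ' ', $response->message, "\n";
      }
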
* Also check local links: I tried your checkbot, and would like to
  thank you very much for this nice tool - despite the fact that I
  needed something which checks local references (<a name="..."> / <a
  href="...#...">) as well :-( -- Michael Hoennig
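  A sketch of the first half of such a check: collecting the anchors
  (<a name> and id attributes) defined in a page, so that #fragment
  references can be verified against them; the sample document and
  fragments are made up:

      use strict;
      use warnings;
      use HTML::Parser;

      my %anchors;
      my $parser = HTML::Parser->new(
          api_version => 3,
          start_h => [ sub {
              my ($tag, $attr) = @_;
              $anchors{ $attr->{name} } = 1 if $tag eq 'a' && defined $attr->{name};
              $anchors{ $attr->{id} }   = 1 if defined $attr->{id};
          }, 'tagname,attr' ],
      );

      $parser->parse('<h1 id="top">Title</h1><a name="section2">Part two</a>');
      $parser->eof;

      # A reference "...#section2" is fine, "...#section9" would be broken.
      for my $fragment ('section2', 'section9') {
          print "#$fragment: ", ($anchors{$fragment} ? "ok" : "MISSING"), "\n";
      }
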
* Have the --sleep parameter accept fractional seconds, and implement
  timers to handle this internally. Would be useful to avoid stressing
  the server too much while still having a speedy run; see the sketch
  below. -- Thomas Kuerten
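  A sketch using the Time::HiRes module, whose sleep() accepts
  fractional seconds; the 0.25 value and URLs are just examples:

      use strict;
      use warnings;
      use Time::HiRes qw(sleep);

      my $sleep_time = 0.25;          # e.g. --sleep 0.25 (hypothetical value)

      for my $url ('http://www.example.com/a.html', 'http://www.example.com/b.html') {
          # check_url($url);          # the existing per-link check would go here
          print "checked $url\n";
          sleep($sleep_time);         # Time::HiRes::sleep takes fractional seconds
      }
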