File: TODO

This file is no longer being updated, as I have moved my project
management into ShadowPlan. I hope to have an HTML version of the
current state online soon. -- Hans 30-Apr-2001

*  Doing a HEAD on an ISMAP link causes a 500 server error. Maybe a
   GET would be better for this.
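
   A minimal LWP sketch of such a fallback (the helper name check_url
   and the exact status codes handled are assumptions for
   illustration):

     use strict;
     use LWP::UserAgent;

     # Hypothetical helper: try HEAD first, fall back to GET when the
     # server rejects HEAD (e.g. 500 on ISMAP links, 405 Not Allowed).
     sub check_url {
         my ($ua, $url) = @_;
         my $response = $ua->head($url);
         if ($response->code == 500 || $response->code == 405) {
             $response = $ua->get($url);
         }
         return $response;
     }

     my $ua = LWP::UserAgent->new;
     print check_url($ua, 'http://www.example.com/map.html')->status_line, "\n";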

*  Add a mechanism for entering authentication data in order to check
   such areas as well. Possibly allow multiple user/password pairs.
   Preferably the passwords should not be entered on the command line.
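
   One way this could look with LWP's built-in credentials support
   (host, realm and user/password values are placeholders; the
   passwords would be read from a file or a prompt rather than the
   command line):

     use strict;
     use LWP::UserAgent;

     my $ua = LWP::UserAgent->new;

     # Register one user/password pair per protected area.
     $ua->credentials('www.example.com:80', 'Protected Area',
                      'someuser', 'secret');
     $ua->credentials('intranet.example.com:80', 'Staff Only',
                      'otheruser', 'secret2');

     my $response = $ua->get('http://www.example.com/private/');
     print $response->status_line, "\n";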

*  Obey robot rules. My current idea is to obey robot rules by default
   on all 'external' requests, and ignore them on 'local'
   requests. Both defaults should be configurable.

   Obeying internal robots.txt might be a nice way to have a good
   exclude mechanism.
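
   A sketch of the external case with WWW::RobotRules, which ships
   with libwww-perl (deciding what counts as 'local' is left out, and
   the host name is a placeholder):

     use strict;
     use LWP::Simple qw(get);
     use WWW::RobotRules;

     my $rules = WWW::RobotRules->new('Checkbot/1.67');

     # Fetch and parse robots.txt of an external host before checking
     # any of its URLs.
     my $robots_url = 'http://www.example.com/robots.txt';
     my $robots_txt = get($robots_url);
     $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

     my $url = 'http://www.example.com/some/page.html';
     if ($rules->allowed($url)) {
         print "would check $url\n";
     } else {
         print "skipping $url (disallowed by robots.txt)\n";
     }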

*  Make it possible to convert http:// references to file://
   references, so that a local server can be checked without going
   through the WWW server.
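
   A rough idea of the rewrite, assuming a simple mapping from the
   server's document root (the host name and path are placeholders):

     use strict;
     use URI;

     my $local_host = 'www.example.com';
     my $doc_root   = '/var/www/html';

     # Rewrite URLs on our own server into file:// URLs; anything
     # else is returned unchanged.
     sub to_file_url {
         my ($url) = @_;
         my $uri = URI->new($url);
         return $url
             unless ($uri->scheme || '') eq 'http'
                 && $uri->host eq $local_host;
         return 'file://' . $doc_root . $uri->path;
     }

     print to_file_url('http://www.example.com/docs/index.html'), "\n";
     # prints file:///var/www/html/docs/index.html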

*  Keep state between runs, but make sure we are still able to run
   Checkbot on several areas (concurrently). Uses for state
   information: a list of consistently bad hosts, remembering previous
   bad links and rechecking just those with a `quick' option, and
   reporting on hosts which keep timing out.
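
   Keeping state could be as simple as serialising a hash between
   runs; a sketch with Storable (the state file name, keyed per
   checked area so that concurrent runs stay independent, is an
   assumption):

     use strict;
     use Storable qw(store retrieve);

     my $area       = 'www.example.com';
     my $state_file = "checkbot-state-$area.sto";

     my $state = -e $state_file ? retrieve($state_file) : {};

     # Remember hosts that consistently fail and links that were bad
     # last time, so a `quick' run can recheck just those.
     $state->{bad_hosts}{'unreachable.example.org'}++;
     $state->{bad_links}{'http://www.example.com/missing.html'} = 404;

     store($state, $state_file);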

*  Retry links that timed out after all other links have been checked,
   to deal better with transient problems. See above as well.

*  Parse client-side (and server-side) MAPs if possible.
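
   For client-side maps the <area href=...> tags can be picked up
   directly; a minimal HTML::Parser sketch (server-side maps cannot be
   resolved this way):

     use strict;
     use HTML::Parser;

     my $html = '<map name="nav"><area href="/a.html">'
              . '<area href="/b.html"></map>';

     # Collect the href of every <area> tag in a client-side map.
     my @map_links;
     HTML::Parser->new(
         api_version => 3,
         start_h     => [ sub {
             my ($tag, $attr) = @_;
             push @map_links, $attr->{href}
                 if $tag eq 'area' && $attr->{href};
         }, 'tagname, attr' ],
     )->parse($html)->eof;

     print "$_\n" for @map_links;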

*  Maybe use a Netscape feature to open problem links in a new
   browser, so that the problem links page remains visible and
   available. Frames? (*shudder*)

*  Include (or link to) a page which contains explanations for the
   different error messages. (But watch out for server-specific
   messages, if any)

*  The external link count is way off. Write code to parse the
   external queue first, and then run through it to actually check the
   links. 

*  Keep an internal list of hosts to which we cannot connect, so that
   we avoid being stalled for a while on each link to that host.
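
   This could be a plain hash keyed on host name; a small sketch
   (assumes absolute http URLs):

     use strict;
     use URI;

     my %dead_host;    # hosts we failed to connect to in this run

     sub should_skip {
         my ($url) = @_;
         return $dead_host{ URI->new($url)->host };
     }

     sub mark_dead {
         my ($url) = @_;
         $dead_host{ URI->new($url)->host } = 1;
     }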

*  The exclude option is somewhat confusing, in that links matching
   this option will still be checked (only the links on those pages
   are really excluded). Maybe add a new option that really ignores
   matching links? Or just redo the current options.

   Perhaps a solution is to make separate exclude options for internal
   and external links?

*  Add an option to count hops instead of using match, and only hop
   that many links away? Suggested for single page checking, but might
   be useful on a larger scale as well? Yes, for instance against
   servers that create recursive symlinks by accident.

   This option could be further specialized to apply to specific match
   rules.
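
   A sketch of a breadth-first crawl that stops after a fixed number
   of hops (extract_links is a hypothetical stand-in for Checkbot's
   own link extraction):

     use strict;

     sub extract_links { return () }    # hypothetical

     sub crawl_with_hop_limit {
         my ($start_url, $max_hops) = @_;
         my %seen  = ($start_url => 1);
         my @queue = ([ $start_url, 0 ]);   # [ url, hops from start ]
         while (my $item = shift @queue) {
             my ($url, $hops) = @$item;
             # ... check $url here ...
             next if $hops >= $max_hops;    # do not follow any further
             for my $link (extract_links($url)) {
                 next if $seen{$link}++;
                 push @queue, [ $link, $hops + 1 ];
             }
         }
     }

     crawl_with_hop_limit('http://www.example.com/', 2);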

*  Sort problems on the server page in a different order
   (e.g. critical errors first). 

*  Sort the problem reports by page instead of by type of problem for
   easier fixing.
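
   A sketch of both orderings over a list of problem records (the
   field names and the severity ranking are assumptions):

     use strict;

     my @problems = (
         { page => '/b.html', url => 'http://x/1', code => 500 },
         { page => '/a.html', url => 'http://x/2', code => 404 },
         { page => '/a.html', url => 'http://x/3', code => 302 },
     );

     # Critical errors first; unknown codes sort last.
     my %severity = (500 => 1, 404 => 2, 403 => 3, 302 => 4);
     my @by_severity = sort {
         ($severity{$a->{code}} || 99) <=> ($severity{$b->{code}} || 99)
     } @problems;

     # Group by page for easier fixing, then by severity within a page.
     my @by_page = sort {
         $a->{page} cmp $b->{page}
             || ($severity{$a->{code}} || 99)
                <=> ($severity{$b->{code}} || 99)
     } @problems;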

*  Improve the reporting pages to be clearer about what is going on,
   etc.

*  Using IP addresses in reports can conflict with HTTP 1.1 virtual
   hosting. Also, in the current situation several reports will be
   generated when a server has several names for the same IP address,
   or when the same content is served through two web servers. The
   latter situation could be solved by adding an option which would
   indicate these are actually the same:
   www1.domain=www2.domain=www3.domain. 
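
   Parsing such an option could map every alias onto one canonical
   name; a sketch (the option syntax is taken from the example above):

     use strict;

     # Aliases joined with '='; the first name is the canonical one.
     my $option = 'www1.domain=www2.domain=www3.domain';

     my %canonical_host;
     my ($canonical, @aliases) = split /=/, $option;
     $canonical_host{$_} = $canonical for $canonical, @aliases;

     # Normalise the host name before filing a report.
     sub report_host {
         my ($host) = @_;
         return $canonical_host{$host} || $host;
     }

     print report_host('www2.domain'), "\n";   # prints www1.domain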

*  Get Checkbot listed as a module, so that it can be installed with
   CPAN?

*  List all options on the generated page, not just the URLs and the
   match option?

*  Perhaps use HTML::Summary to provide summaries of the documents
   with errors?

*  Proxy handling is a bit of a mess; I should look into this and
   clean it up.
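
   For reference, LWP already offers two ways to configure a proxy,
   which a cleanup could build on (the proxy address is a placeholder):

     use strict;
     use LWP::UserAgent;

     my $ua = LWP::UserAgent->new;

     # Either take the settings from the environment
     # (http_proxy, ftp_proxy, no_proxy)...
     $ua->env_proxy;

     # ...or set them explicitly.
     $ua->proxy(['http', 'ftp'], 'http://proxy.example.com:8080/');
     $ua->no_proxy('localhost', 'example.com');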

*  Setting a timeout on the useragent doesn't always make URLs time
   out. Leon Abelman points to the FAQ and suggests either an extra
   option, or dealing with the timeout ourselves.

   http://people.we.mediaone.net/kfrankel/lwpfaq.txt

   21) I set the $ua->timeout() to 1 minute, but LWP might still use
   a long time to fetch a document.  Is this a bug?

   No, $ua->timeout sets a timeout on how long LWP allows the
   connection to be idle before giving up.  If only a single byte
   arrives each 50 seconds, LWP will keep on listening but will
   probably not finish downloading for a long, long time.  If you want
   an absolute timeout you can do it with alarm() or by setting up a
   callback and measuring the time yourself.  Take a look at the
   preceding two questions.  (Gisle Aas)
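
   A sketch of the alarm() approach the FAQ suggests, wrapping the
   whole request in an absolute limit (the limit itself is arbitrary):

     use strict;
     use LWP::UserAgent;

     my $ua = LWP::UserAgent->new;
     $ua->timeout(60);                  # idle timeout, as before

     my $max_seconds = 120;             # absolute per-request limit
     my $url = 'http://www.example.com/slow.html';

     my $response = eval {
         local $SIG{ALRM} = sub { die "checkbot: absolute timeout\n" };
         alarm($max_seconds);
         my $r = $ua->get($url);
         alarm(0);
         $r;
     };
     if (!defined $response) {
         print "$url gave up after ${max_seconds}s\n";
     } else {
         print $response->status_line, "\n";
     }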


*  Sometimes it is difficult (for the Web editors) to locate the
   source of problem links, especially when the URLs are extracted
   from a database (as we often do). Maybe it would be useful if
   Checkbot could (optionally) list the link text (the text between <a
   href="..."> and </a>) in the Checkbot output.

*  Only send a report by email when there is something to report?

*  I am wondering if you could implement some kind of parallelism in
   checking the links (e.g. checking a definable number of links in
   parallel, speeding things up). Especially external links sometimes
   take a long time to check. It would be nice if that could be sped
   up by spawning several HTTP requests in parallel. -- Thomas
   Schürger <>
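
   One way to sketch this with Parallel::ForkManager (an assumption,
   not something Checkbot uses today), with each child doing one
   request:

     use strict;
     use LWP::UserAgent;
     use Parallel::ForkManager;

     my @urls = (
         'http://www.example.com/',
         'http://www.example.org/',
         'http://www.example.net/',
     );

     # Check up to 4 external links at the same time.
     my $pm = Parallel::ForkManager->new(4);
     my $ua = LWP::UserAgent->new(timeout => 30);

     for my $url (@urls) {
         $pm->start and next;     # parent: move on to the next URL
         my $response = $ua->head($url);
         print "$url: ", $response->status_line, "\n";
         $pm->finish;             # child: done
     }
     $pm->wait_all_children;

   Collecting the results back into a single report would still need
   some IPC, for instance the module's run_on_finish callback.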

*  Also check local links: I tried your checkbot and would like to
   thank you very much for this nice tool - despite the fact that I
   needed something which checks local references (<a name="..."> / <a
   href="...#...">) as well :-( -- Michael Hoennig

*  Have the --sleep parameter accept sub-second times, and implement
   timers to handle this internally. Would be useful to avoid
   stressing the server too much while still having a speedy run.
   -- Thomas Kuerten <thomas@noc.de>
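
   Time::HiRes (bundled with recent Perls) would make this
   straightforward; a sketch with a hypothetical fractional value:

     use strict;
     use Time::HiRes qw(sleep);

     my $sleep = 0.25;    # hypothetical fractional --sleep value

     my @urls_to_check = ('http://www.example.com/');
     for my $url (@urls_to_check) {
         # ... check $url here ...
         sleep($sleep);   # Time::HiRes::sleep takes fractional seconds
     }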