<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Check websites for broken links</title>
    <link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
    <link rel="stylesheet" href="pygments.css" type="text/css" />
<style type="text/css">
img { border: 0; }
</style>

  </head>
  <body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
 <td><img
  src="logo64x64.png" border="0" alt="LinkChecker"/></td>
 <td><h1>LinkChecker</h1></td>
</tr></table>
</div>

<h1>Documentation</h1>

<h2>Basic usage</h2>

<p>To check a URL like <code>http://www.myhomepage.org/</code>, it is enough to
run <code>linkchecker http://www.myhomepage.org/</code>. This checks the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are checked for validity as well.</p>
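<p>For scripted use, the command line above can be driven from Python. The
sketch below only builds the argument list (<code>--recursion-level</code> is
the documented option; actually running it requires <code>linkchecker</code>
on the PATH):</p>

```python
import subprocess

def build_linkchecker_command(url, recursion_level=None):
    """Build a linkchecker command line; recursion_level=None keeps the default."""
    cmd = ["linkchecker"]
    if recursion_level is not None:
        cmd += ["--recursion-level", str(recursion_level)]
    cmd.append(url)
    return cmd

# To actually run the check:
# subprocess.run(build_linkchecker_command("http://www.myhomepage.org/", 2))
```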

<h2>Performed checks</h2>

<p>All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other invalid syntax issues are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
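<p>A minimal illustration of such a preliminary syntax test, using only the
standard library (the function name and rules are simplified assumptions,
not LinkChecker's actual implementation):</p>

```python
from urllib.parse import urlsplit

def syntax_check(url):
    """Return "ok", "warning" (minor quoting mistake) or "error"."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return "error"        # e.g. malformed IPv6 brackets
    if not parts.scheme:
        return "error"        # a URL without a scheme is invalid syntax
    if " " in url:
        return "warning"      # unquoted space: a minor quoting mistake
    return "ok"
```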

<ul>
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>

<p>After connecting to the given HTTP server the given path
or query is requested. All redirections are followed, and
if user/password is given it will be used as authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.</p></li>
<li><p>Local files (<code>file:</code>)</p>

<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or non-existing files are errors.</p>

<p>File contents are checked for recursion.</p></li>
<li><p>Mail links (<code>mailto:</code>)</p>

<p>A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list fails.
For each mail address we check the following things:</p>

<ol>
<li>Check the address syntax, both of the part before and after
the @ sign.</li>
<li>Look up the MX DNS records. If no MX record is found,
print an error.</li>
<li>Check if one of the mail hosts accepts an SMTP connection.
Hosts with higher priority are checked first.
If no host accepts SMTP, print a warning.</li>
<li>Try to verify the address with the VRFY command. If an
answer is received, print the verified address as an info.</li>
</ol></li>
<li><p>FTP links (<code>ftp:</code>)</p>

<p>For FTP links we do the following:</p>

<ol>
<li>Connect to the specified host.</li>
<li>Try to log in with the given user and password. The default
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.</li>
<li>Try to change to the given directory.</li>
<li>List the file with the NLST command.</li>
</ol></li>
<li><p>Telnet links (<code>telnet:</code>)</p>

<p>We try to connect and if user/password are given, login to the
given telnet server.</p></li>
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp:</code>)</p>

<p>We try to connect to the given NNTP server. If a news group or
article is specified, we try to request it from the server.</p></li>
<li><p>Ignored links (<code>javascript:</code>, etc.)</p>

<p>An ignored link only prints a warning. No further checking
is performed.</p>

<p>Here is a complete list of recognized, but ignored links. The most
prominent among them are JavaScript links.</p>

<ul>
<li><code>acap:</code>      (application configuration access protocol)</li>
<li><code>afs:</code>       (Andrew File System global file names)</li>
<li><code>chrome:</code>    (Mozilla specific)</li>
<li><code>cid:</code>       (content identifier)</li>
<li><code>clsid:</code>     (Microsoft specific)</li>
<li><code>data:</code>      (data)</li>
<li><code>dav:</code>       (dav)</li>
<li><code>fax:</code>       (fax)</li>
<li><code>find:</code>      (Mozilla specific)</li>
<li><code>gopher:</code>    (Gopher)</li>
<li><code>imap:</code>      (internet message access protocol)</li>
<li><code>isbn:</code>      (ISBN (int. book numbers))</li>
<li><code>javascript:</code> (JavaScript)</li>
<li><code>ldap:</code>      (Lightweight Directory Access Protocol)</li>
<li><code>mailserver:</code> (Access to data available from mail servers)</li>
<li><code>mid:</code>       (message identifier)</li>
<li><code>mms:</code>       (multimedia stream)</li>
<li><code>modem:</code>     (modem)</li>
<li><code>nfs:</code>       (network file system protocol)</li>
<li><code>opaquelocktoken:</code> (opaquelocktoken)</li>
<li><code>pop:</code>       (Post Office Protocol v3)</li>
<li><code>prospero:</code>  (Prospero Directory Service)</li>
<li><code>rsync:</code>     (rsync protocol)</li>
<li><code>rtsp:</code>      (real time streaming protocol)</li>
<li><code>service:</code>   (service location)</li>
<li><code>shttp:</code>     (secure HTTP)</li>
<li><code>sip:</code>       (session initiation protocol)</li>
<li><code>tel:</code>       (telephone)</li>
<li><code>tip:</code>       (Transaction Internet Protocol)</li>
<li><code>tn3270:</code>    (Interactive 3270 emulation sessions)</li>
<li><code>vemmi:</code>     (versatile multimedia interface)</li>
<li><code>wais:</code>      (Wide Area Information Servers)</li>
<li><code>z39.50r:</code>   (Z39.50 Retrieval)</li>
<li><code>z39.50s:</code>   (Z39.50 Session)</li>
</ul></li>
</ul>
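<p>The HTTP rule above can be condensed into a small classifier. This is an
illustrative sketch (the function name and return format are assumptions,
not LinkChecker's API):</p>

```python
def classify_http(final_status, permanently_moved=False):
    """Warn on permanently moved pages; any final non-2xx status is an error."""
    messages = []
    if permanently_moved:
        messages.append("warning: permanently moved")
    if 200 <= final_status < 300:
        messages.append("valid")
    else:
        messages.append("error: status %d" % final_status)
    return messages
```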
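<p>Step 1 of the mail check can be sketched with a (deliberately simplified)
regular expression; steps 2&#8211;4 are only outlined in comments, because an MX
lookup needs a DNS resolver library that is not part of the standard
library:</p>

```python
import re

# Simplified sketch; the real address parser is considerably stricter.
_ADDRESS = re.compile(r"[^@\s]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def mail_syntax_ok(address):
    """Step 1: check the parts before and after the @ sign."""
    return _ADDRESS.match(address) is not None

# Outline of the remaining steps:
#   2) look up the MX DNS records; no MX record -> error
#   3) try smtplib.SMTP(host) for each MX host, highest priority first;
#      if no host accepts SMTP -> warning
#   4) smtp.verify(address)  # VRFY command; an answer -> info
```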

<h2>Recursion</h2>

<p>Before LinkChecker descends recursively into a URL, the URL has to
fulfill several conditions. They are checked in this order:</p>

<ol>
<li><p>A URL must be valid.</p></li>
<li><p>A URL must be parseable. This currently includes HTML files,
Opera bookmark files, and directories. If a file type cannot
be determined (for example, it does not have a common HTML file
extension and the content does not look like HTML), it is assumed
to be non-parseable.</p></li>
<li><p>The URL content must be retrievable. This is usually the case,
with exceptions such as mailto: links and unknown URL types.</p></li>
<li><p>The maximum recursion level must not be exceeded. It is configured
with the <code>--recursion-level</code> option and is unlimited by default.</p></li>
<li><p>It must not match the ignored URL list. This is controlled with
the <code>--ignore-url</code> option.</p></li>
<li><p>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
"nofollow" directive in the HTML header data.</p></li>
</ol>
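<p>The six conditions can be summarized as one predicate. This is a sketch
under assumptions: the signature is invented, and shell-style patterns stand
in for the <code>--ignore-url</code> matching:</p>

```python
from fnmatch import fnmatch

def should_recurse(url, valid, parseable, retrievable,
                   depth, max_depth, ignore_patterns, nofollow):
    """Check the recursion conditions in the documented order.
    max_depth=None models the unlimited --recursion-level default."""
    if not valid:
        return False                      # 1. URL must be valid
    if not parseable:
        return False                      # 2. content must be parseable
    if not retrievable:
        return False                      # 3. content must be retrievable
    if max_depth is not None and depth >= max_depth:
        return False                      # 4. recursion level exceeded
    if any(fnmatch(url, pat) for pat in ignore_patterns):
        return False                      # 5. matches the ignored URL list
    if nofollow:
        return False                      # 6. robots "nofollow" directive
    return True
```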

<p>Note that the directory recursion reads all files in that
directory, not just a subset like <code>index.htm*</code>.</p>
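<p>Condition 6 above, the "nofollow" directive in the HTML header data, can
be detected with the standard library parser; a minimal sketch:</p>

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Look for <meta name="robots" content="... nofollow ..."> tags."""
    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        if (d.get("name") or "").lower() == "robots":
            if "nofollow" in (d.get("content") or "").lower():
                self.nofollow = True

def has_nofollow(html_text):
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.nofollow
```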
    <div class="footer">
      &copy; Copyright 2009, Bastian Kleineidam.
    </div>
  </body>
</html>