File: checkbot.pl

package info (click to toggle)
checkbot 1.67-1
  • links: PTS
  • area: main
  • in suites: woody
  • size: 116 kB
  • ctags: 28
  • sloc: perl: 728; makefile: 52
file content (1512 lines) | stat: -rwxr-xr-x 50,554 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
use Config;
use File::Basename qw(basename dirname);
chdir(dirname($0));
($file = basename($0)) =~ s/\.pl$//;
$file =~ s/\.pl$//
        if ($Config{'osname'} eq 'VMS' or
            $Config{'osname'} eq 'OS2');  # "case-forgiving"
open OUT,">$file" or die "Can't create $file: $!";
chmod(0755, $file);
print "Extracting $file (with variable substitutions)\n";
 
print OUT <<"!GROK!THIS!";
$Config{'startperl'} -w
    eval 'exec perl -S \$0 "\$@"'
        if 0;
 
!GROK!THIS!
 
print OUT <<'!NO!SUBS!';
#!/usr/bin/perl -w
#
# checkbot - A perl5 script to check validity of links in www document trees
#
# Hans de Graaff <hans@degraaff.org>, 1994-2001.
# Based on Dimitri Tischenko, Delft University of Technology, 1994
# Based on the testlinks script by Roy Fielding
# With contributions from Bruce Speyer <bruce.speyer@elecomm.com>
#
# Info-URL: http://degraaff.org/checkbot/
# Comments to: checkbot@degraaff.org
#
# $Id: checkbot.pl,v 1.67 2001/12/16 16:15:55 graaff Exp $
# (Log information can be found at the end of the script)

require 5.004;
use strict;

require LWP;

use File::Basename;

=head1 NAME

Checkbot - WWW Link Verifier

=head1 SYNOPSIS

checkbot [B<--debug>] [B<--help>] [B<--verbose>] [B<--url> start URL] 
         [B<--match> match string] [B<--exclude> exclude string]
         [B<--proxy> proxy URL] [B<--internal-only>]
         [B<--ignore> ignore string] [B<--file> file name]
         [B<--style> style file URL]
         [B<--mailto> email address]
         [B<--note> note] [B<--sleep> seconds] [B<--timeout> timeout]
         [B<--interval> seconds] [B<--dontwarn> HTTP responde codes]
         [B<--enable-virtual>]
         [B<--language> language code]
         [start URLs]

=head1 DESCRIPTION

Checkbot verifies the links in a specific portion of the World Wide
Web. It creates HTML pages with diagnostics.

Checkbot uses LWP to find URLs on pages and to check them. It supports
the same schemes as LWP does, and find the same links that
HTML::LinkExtor will find.

Options for Checkbot are:

=over 4

=item --url <start URL>

Set the start URL. Checkbot starts checking at this URL, and then
recursively checks all links found on this page. The start URL takes
precedence over additional URLs specified on the command line.

If no scheme is protocol for the URL, the file protocol is assumed.

=item --match <match string>

This option selects which pages Checkbot considers local. If the
I<match string> is contained within the URL, then Checkbot considers
the page local, retrieves it, and will check all the links contained
on it. Otherwise the page is considered external and it is only
checked with a HEAD request.

If no explicit I<match string> is given, the first start URL (See
option C<--url>) will be used as a match string instead.

The I<match string> can be a perl regular expression.

For example, to check the main server page and all HTML pages directly
underneath it, but not the HTML pages in the subdirectories of the
server, the I<match string> would be
"www.someserver.xyz/($|[^/]+.html)".

=item --exclude <exclude string>

URLs matching the I<exclude string> are considered to be external,
even if they happen to match the I<match string> (See option C<--match>).

The I<exclude string> can be a perl regular expression.

=item --ignore <ignore string>

If a URL has an error, and matches the I<ignore string>, its error
will not be listed. This can be useful to stop certain errors from
being listed.

The I<ignore string> can be a perl regular expression.

=item --proxy <proxy URL>

This attribute specifies the URL for a proxy server.  Only external URLs 
are queried through this proxy server, because Checkbot assumes all 
internal URLs can be accessed directly.  Currently only the HTTP and FTP 
protocols will be send to the proxy server.

=item --internal-only 

Skip the checking of external links at the end of the Checkbot
run. Only matching links are checked. Not that some redirections may
still cause external links to be checked.

=item --mailto <email address>

Send mail to the I<email address> when Checkbot is done
checking. Includes a small summary of the results.

=item --note <note>

The I<note> is included verbatim in the mail message (See option
C<--mailto>). This can be useful to include the URL of the summary HTML page
for easy reference, for instance.

Only meaningful in combination with the C<--mailto> option.

=item --help

Shows brief help message on the standard output.

=item --verbose

Show verbose output while running. Includes all links checked, results
from the checks, etc.

=item --debug

Enable debugging mode. Not really supported anymore, but it will keep
some files around that otherwise would be deleted.

=item --sleep <seconds>

Number of I<seconds> to sleep in between requests. Default is 0
seconds. Setting this option can be useful to keep the load on the web
server down while running Checkbot.

=item --timeout <timeout>

Default timeout for the requests, specified in seconds. The default is
2 minutes.

=item --interval <seconds>

The maximum interval between updates in seconds. Default is 3 hours
(10800 seconds). Checkbot will start the intervale at one minute, and
gradually extend it towards the maximum interval.

=item --file <file name>

Write the summary pages into file I<file name>. Default is C<checkbot.html>.

=item --style <URL of style file>

When this option is used, Checkbot embeds this URL as a link to a
style file on each page it writes. This makes it easy to customize the
layout of pages generated by Checkbot.

=item --dontwarn <HTTP response codes regular expression>

Do not include warnings on the result pages for those HTTP response
codes which match the regular expression. For instance, --dontwarn
"(301|404)" would not include 301 and 404 response codes.

=item --enable-virtual

This option enables dealing with virtual servers. Checkbot then
assumes that all hostnames for internal servers are unique, even
though their IP addresses may be the same. Normally Checkbot uses the
IP address to distinguish servers. This has the advantage that if a
server has two names (e.g. www and bamboozle) its pages only get
checked once. When you want to check multiple virtual servers this
causes problems, which this feature work around by using the hostname
to distinguish the server.

=item --language

The argument for this option is a two-letter language code. Checkbot
will use language negotiation to request files in that language. The
default is to request English language (language code 'en').

=back

=head1 PREREQUISITES

This script uses the C<LWP> modules.

=head1 COREQUISITES

This script can send mail when C<Mail::Send> is present.

=head1 AUTHOR

Hans de Graaff <hans@degraaff.org>

=pod OSNAMES 

any

=cut

# Prepare the use of DBM files, and show our preferences
use AnyDBM_File;
@AnyDBM_File::ISA = qw(DB_File GDBM_File NDBM_File);

# Declare some global variables, avoids ugly use of main:: all around
my @starturls = ();
# A hash of warnings generated by Checkbot apart from HTTP responses
my %warning = ();

# This hash is indexed by two fields, and contains an array of responses:
# $problems{HTTP Response code}{URL} = @( HTTP::Response );
my %problems = ();

# Version information
my $VERSION;
( $VERSION ) = sprintf("%d.%02d", q$Revision: 1.67 $ =~ /(\d+)\.(\d+)/);

# If on a Mac we should ask for the arguments through some MacPerl stuff
if ($^O eq 'MacOS') {
  $main::mac_answer = eval "MacPerl::Ask('Enter Command-Line Options')";
  push(@ARGV, split(' ', $main::mac_answer));
}

&check_options();
&init_modules();
&init_globals();
&setup();

# Start actual application
&check_internal();

# Empty checked array to clear up memory
undef %main::checked;
%main::checked = ();

if (defined $main::opt_internal_only) {
  print STDERR "*** Not checking external links because of --internal-only option.\n"
    if defined $main::opt_verbose;
} else {
  check_external();
}

&create_page(1);
&send_mail if defined $main::opt_mailto;

&clean_up();

exit 0;

### Initialization and setup routines

sub check_options {

  # Get command-line arguments
  use Getopt::Long;
  my $result = GetOptions(qw(debug help verbose url=s match=s exclude|x=s file=s style=s ignore|z=s mailto|M=s note|N=s proxy=s internal-only sleep=i timeout=i interval=i dontwarn=s enable-virtual language=s));

  # Handle arguments, some are mandatory, some have defaults
  &print_help if (($main::opt_help && $main::opt_help) 
                  || (!$main::opt_url && $#ARGV == -1));
  $main::opt_timeout = 120 unless defined($main::opt_timeout) && length($main::opt_timeout);
  $main::opt_verbose = 0 unless $main::opt_verbose;
  $main::opt_sleep = 0 unless defined($main::opt_sleep) && length($main::opt_sleep);
  $main::opt_interval = 10800 unless defined $main::opt_interval and length $main::opt_interval;
  $main::opt_dontwarn = "xxx" unless defined $main::opt_dontwarn and length $main::opt_dontwarn;
  $main::opt_enable_virtual = 0 unless defined $main::opt_enable_virtual;
  # Set the default language and make sure it is a two letter, lowercase code
  $main::opt_language = 'en' unless defined $main::opt_language;
  $main::opt_language = lc(substr($main::opt_language, 0, 2));
  $main::opt_language =~ tr/a-z//cd;
  if ($main::opt_language !~ /[a-z][a-z]/) {
    warn "Argument --language $main::opt_language is not a valid language code\nUsing English as a default.\n";
    $main::opt_language = 'en';
  }

  # The default for opt_match will be set later, because we might want
  # to muck with opt_url first.

  # Display messages about the options
  print STDERR "*** Starting Checkbot $VERSION in verbose mode\n" 
    if $main::opt_verbose;
  print STDERR "    Will skip checking of external links\n"
    if $main::opt_internal_only;
}

sub init_modules {

# Using keep-alive from HTTP 1.1 will be a great efficiency
# improvement, but it wont be available until libwww-perl 5.54. This
# code hints at what should be done, but it's not an actual
# implementation.

#      if ( $server->{keep_alive} ) {
#           eval {
#               require LWP::Protocol::http11;
#               LWP::Protocol::implementor('http', 'LWP::Protocol::http11');
#           };
#           warn "Cannot use keep alives, please upgrade LWP -- $@\n" if $@;
#      }

  use URI;
  # Prepare the user agent to be used:
  use LWP::UserAgent;
  use LWP::MediaTypes;
  #use LWP::Debug qw(- +debug);
  use HTML::LinkExtor;
  $main::ua = new LWP::UserAgent;
  $main::ua->agent("Checkbot/$VERSION LWP/" . LWP::Version);
  $main::ua->timeout($main::opt_timeout);

  require Mail::Send if defined $main::opt_mailto;

  use HTTP::Status;
}

sub init_globals {
  my $url;

  # Remember start time
  $main::start_time = localtime();

  # Directory and files for output
  if ($main::opt_file) {
    $main::file = $main::opt_file;
    $main::file =~ /(.*)\./;
    $main::server_prefix = $1;
  } else { 
    $main::file = "checkbot.html";
    $main::server_prefix = "checkbot";
  }
  $main::tmpdir = ($ENV{'TMPDIR'} or $ENV{'TMP'} or $ENV{'TEMP'} or "/tmp") . "/Checkbot.$$";

  $main::cur_queue  = $main::tmpdir . "/queue";
  $main::new_queue  = $main::tmpdir . "/queue-new";
  $main::extfile = $main::tmpdir . "/external";

  # Set up hashes to be used
  %main::checked = ();
  %main::servers = ();
  %main::servers_get_only = ();

  # Initialize the start URLs. --url takes precedence. Otherwise
  # just process URLs in order as they appear on the command line.
  unshift(@ARGV, $main::opt_url) if $main::opt_url;
  foreach (@ARGV) {
    $url = URI->new($_);
    # If no scheme is defined we will assume file is used, so that
    # it becomes easy to check a single file.
    $url->scheme('file') unless defined $url->scheme;
    $url->host('localhost') if $url->scheme eq 'file';
    if (!defined $url->host) {
      warn "No host specified in URL $url, ignoring it.\n";
      next;
    }
    push(@starturls, $url);
  }
  die "There are no valid starting URLs to begin checking with!\n"
    if scalar(@starturls) == -1;

  # Set the automatic matching expression to a concatenation of the starturls
  if (!defined $main::opt_match) {
    my @matchurls;
    my $matchurl;
    foreach $matchurl (@starturls) {
      push(@matchurls, quotemeta $matchurl);
    }
    $main::opt_match = '(' . join('|', @matchurls) . ')';
    print STDERR "--match defaults to $main::opt_match\n" if $main::opt_verbose;
  }
  # Variables to keep track of number of links
  $main::LINKS = 0;
  $main::DUPS  = 1;
  $main::PROBL = 2;
  $main::TODO = 3;

  @main::st_int = (0, 0, 0, scalar(@starturls));
  @main::st_ext = (0, 0, 0, 0);

  # We write out our status every now and then.
  $main::cp_int = 1;
  $main::cp_last = 0;
}


sub setup {

  mkdir $main::tmpdir, 0755
    || die "$0: unable to create directory $main::tmpdir: $!\n";

  # Explicitly set the record separator. I had the problem that this
  # was not defined under my perl 5.00502. This should fix that, and
  # not cause problems for older versions of perl.
  $/ = "\n";

  open(EXTERNAL, ">$main::extfile")
    || die "$0: Unable to open EXTERNAL $main::extfile for writing: $!\n";
  open(CURRENT, ">$main::cur_queue")
    || die "$0: Unable to open CURRENT $main::cur_queue for writing: $!\n";
  open(QUEUE, ">$main::new_queue")
    || die "$0: Unable to open QUEUE $main::new_queue for writing: $!\n";


  # Prepare CURRENT queue with starting URLs
  foreach (@starturls) {
    print CURRENT $_->as_string . "|\n";
  }
  close CURRENT;

  open(CURRENT, $main::cur_queue) 
    || die "$0: Unable to open CURRENT $main::cur_queue for reading: $!\n";

}

### Cleaning up after running Checkbot
sub clean_up {
  unless (defined($main::opt_debug)) {
    unlink $main::cur_queue, $main::new_queue, $main::extfile;
    rmdir $main::tmpdir;
  }
}


### Main application code

sub check_internal {
  my $line;

  # As long as there are links to do, check them
  while ( $main::st_int[$main::TODO] > 0 ) {
    # Read a line from the queue, and process it
    while (defined ($line = <CURRENT>) ) {
      chomp($line);
      &handle_url($line);
      &check_point();
      $main::st_int[$main::TODO]--;
    }

    # Move queues around, and try again, but only if there are still
    # things to do
    if ($main::st_int[$main::TODO]) {
      print STDERR "*** Moving queues around, $main::st_int[$main::TODO] to do\n" 
	if $main::opt_verbose;
      close CURRENT if defined(<CURRENT>);
      close QUEUE;

      # TODO: should check whether these succeed
      unlink($main::cur_queue);
      rename($main::new_queue, $main::cur_queue);
    
      open(CURRENT, "$main::cur_queue") 
	|| die "$0: Unable to open $main::cur_queue for reading: $!\n";
      open(QUEUE, ">$main::new_queue") 
	|| die "$0: Unable to open $main::new_queue for writing: $!\n";

      # This should not happen, but it might anyway,and we don't want
      # to loop forever.
      $main::st_int[$main::TODO] = 0 if -z $main::cur_queue;
    }
  }
  close CURRENT;
  close QUEUE;
  close EXTERNAL;
}

sub handle_url {
  my ($line) = @_;
  my ($urlstr, $urlparent) = split(/\|/, $line);
  my $reqtype;
  my $response;
  my $type;

  # Add this URL to the ones we've seen already, return 
  # if it is a duplicate
  return if &add_checked($urlstr);

  my $url = URI->new($urlstr);
  $main::st_int[$main::LINKS]++;
	
  if (defined($url->scheme) 
      && $url->scheme =~ /^(http|file|ftp|gopher|nntp)$/o
      && $url->path !~ /[=\?]/o ) {
    if ($url->path =~ /\/$/o || $url->path eq "") {
      $type = 'text/html';
    } else {
      $type = guess_media_type($url->path);
    }

    # If we are unsure about the type it could well be a special
    # extension for e.g. on-the-fly HTML, or a script or some
    # such. Should ask the server. We do this when we cannot determine
    # what the type of file is, or when we request a file with the
    # HTTP protocol and the type is not text/html. In these cases we
    # can logically expect to get HTMl in return, so we ask the server
    # to be sure.

    if ($type eq 'application/octet-stream'
	or $url->scheme eq 'http' && $type ne 'text/html') {
      $response = performRequest('HEAD', $url, $urlparent, $type, $main::opt_language);
      $type = $response->content_type;
    }

    if ($type =~ /html/o 
        && $url->scheme =~ /^(http|file|ftp|gopher)$/o
        && (!defined $main::opt_exclude || $url !~ /$main::opt_exclude/o)) {
      $reqtype = 'GET';
    } else {
      $reqtype = 'HEAD';
    }

    $response = performRequest($reqtype, $url, $urlparent, $type, $main::opt_language)
      unless defined($response) and $reqtype eq 'HEAD';

    if ($response->is_success) {
      sleep($main::opt_sleep) unless $main::opt_debug || $url->scheme eq 'file';

      # If this url moves off of this site then return now
      if ($url !~ /$main::opt_match/o 
	  || (defined $main::opt_exclude && $url =~ /$main::opt_exclude/o)) {
	print STDERR " Exclude $url\n" if $main::opt_verbose;
      } else {
	&handle_doc($response) if $reqtype eq 'GET';
      }
    } else {
      if (defined $main::opt_ignore && $url =~ /$main::opt_ignore/o) {
	print STDERR "Ignore  $url\n" if $main::opt_verbose;
      } else {
	print STDERR "         ", $response->status_line, "\n" 
	  if $main::opt_verbose;
        push @{$problems{$response->code}{$url}}, $response unless $response->code =~ /$main::opt_dontwarn/o;
	$main::st_int[$main::PROBL]++;
      }

      if ($response->is_redirect) {
	if ($response->code == 300) {  # multiple choices, but no redirection available
	  print STDERR "        Multiple choices.\n" if $main::opt_verbose;
	} else {
	  my $baseURI = URI->new($url);
	  if (defined $response->header('Location')) {
	    my $redir_url = URI->new_abs($response->header('Location'), $baseURI);
	    print STDERR "         Redirected to " . $redir_url . "\n" if $main::opt_verbose;
	    add_to_queue($redir_url, $urlparent);
	  } else {
	    print STDERR "         Location header missing from redirect response.\n" if $main::opt_verbose;
	  }
	}
      }
    }

    # Done with this URL
  } else {
    # not interested in other URLs right now
    print STDERR "  Ignore $url\n" if $main::opt_verbose;
  }
}

sub performRequest {
  my ($reqtype, $url, $urlparent, $type, $language) = @_;

  # Output what we are going to do with this link
  printf STDERR "    %4s %s (%s)\n", $reqtype, $url, $type
    if $main::opt_verbose;

  my %header_hash = (
	'Referer'	=> $urlparent
  );
  if (defined($language) && ($language ne '')) {
    $header_hash{'Accept-Language'} = $language;
  }
  my $ref_header = new HTTP::Headers(%header_hash);
  my $request = new HTTP::Request($reqtype, $url, $ref_header);
  my $response = $main::ua->simple_request($request);

  return $response;
}

sub check_external {
  my $newurl;
  my $urlparent;
  my $prevurl = "";
  my $reqtype;

  # Add a proxy to the user agent, if defined
  $main::ua->proxy(['http', 'ftp'], $main::opt_proxy) 
    if defined($main::opt_proxy);

  # I'll just read in all external URLs in an array, sort them, and
  # then remove duplicates and tally them. This improves over the old
  # situation because we don't rely on sort and wc being there, and
  # the tally will actually be correct. Drawback: we might get into
  # memory problems more easily. Oh well, buy more, I suppose.

  print STDERR "*** Reading and sorting external URLs\n" 
    if $main::opt_verbose;;
  open EXTERNAL, $main::extfile
    or die "$0: Unable to open $main::extfile for reading: $!\n";
  # Gobble!
  my @externals = sort <EXTERNAL>;

  close EXTERNAL;

  $main::st_ext[$main::TODO] = $#externals;
  $main::st_ext[$main::DUPS] = $main::st_ext[$main::LINKS] - $main::st_ext[$main::TODO];

  print STDERR "*** Checking $main::st_ext[$main::TODO] external links\n"
    if $main::opt_verbose;
  # We know that our list is sorted, but the same URL
  # can exist several types, once for each parent
  # For now we just look at the first url/parent pair, but
  # ideally we should list this for each pair (i.e. for
  # each page on which the link occurs.
  foreach (@externals) {
    ($newurl, $urlparent) = split(/\|/);
    $main::st_ext[$main::TODO]--;

    next if $prevurl eq $newurl;
    $prevurl = $newurl;

    my $url = URI->new_abs($newurl, $urlparent);

    if ($url->scheme =~ /^(http|file|ftp|gopher|nntp)$/o) { 

      my $response = get_headers($url, $urlparent, $main::opt_language);

      if ($response->is_error || $response->is_redirect) {
	if (defined $main::opt_ignore && $url =~ /$main::opt_ignore/o) {
	  print STDERR "Ignore  $url error\n" if $main::opt_verbose;
	} else {
	  printf STDERR "          ", $response->status_line
	    if $main::opt_verbose;
          push @{$problems{$response->code}{$url}}, $response unless $response->code =~ /$main::opt_dontwarn/o;
	  $main::st_ext[$main::PROBL]++;
	}
      }
    }
    &check_point();
  }
}


# This routine creates a (temporary) WWW page based on the current
# findings This allows somebody to monitor the process, but is also
# convenient when this program crashes or waits because of diskspace
# or memory problems

sub create_page {
    my($final_page) = @_;

    my $path = "";
    my $prevpath = "";
    my $prevcode = 0;
    my $prevmessage = "";

    print STDERR "*** Start writing results page\n" if $main::opt_verbose;

    open(OUT, ">$main::file.new") 
	|| die "$0: Unable to open $main::file.new for writing:\n";
    print OUT "<html><head>\n";
    if (!$final_page) {
      printf OUT "<META HTTP-EQUIV=\"Refresh\" CONTENT = %d>\n",
      int($main::cp_int * 60 / 2 - 5);
    }

    print OUT "<title>Checkbot report</title>\n";
    print OUT "<link rel=\"StyleSheet\" type=\"text/css\" href=\"$main::opt_style\">\n" if defined $main::opt_style;
    print OUT "</head>\n";
    print OUT "<body>\n";
    print OUT "<h1><em>Checkbot</em>: main report</h1>\n";

    # Show the status of this checkbot session
    print OUT "<table><tr><th>Status:</th><td>";
    if ($final_page) {
      print OUT "Done.\n"
    } else {
      print OUT "Running since $main::start_time.<br>\n";
      print OUT "Last update at ". localtime() . ".<br>\n";
      print OUT "Next update in <b>", int($main::cp_int), "</b> minutes.\n";
    }
    print OUT "</td></tr></table>\n\n";

    # Summary (very brief overview of key statistics)
    print OUT "<hr><h2>Report summary</h2>\n";

    print OUT "<table>\n";
    print OUT "<tr> <td> </td> <th>Total<br>links</th> <th>Links<br>To Do</th>";
    print OUT "<th>Unique</br>links</th> <th>Problem<br>links</th> ";
    print OUT "<th>Ratio</th> </tr>\n";

    if ($main::st_int[$main::LINKS]) {
    print OUT "<tr> <th>Internal</th>";
    printf OUT "<td align=right>%d</td> <td align=right>%d</td> <td align=right>%d</td> <td align=right>%d</td> <td align=right>%d%%</td> </tr>\n",
    $main::st_int[$main::LINKS] + $main::st_int[$main::DUPS],
    $main::st_int[$main::TODO],
    $main::st_int[$main::LINKS], $main::st_int[$main::PROBL],
    $main::st_int[$main::PROBL] / $main::st_int[$main::LINKS] * 100;
    }

  if ($main::st_ext[$main::LINKS]) {
    print OUT "<tr> <th>External</th>";
    printf OUT "<td align=right>%d</td> <td align=right>%d</td> <td align=right>%d</td> <td align=right>%d</td> <td align=right>%d%%</td> </tr>\n",
    $main::st_ext[$main::LINKS] + $main::st_ext[$main::DUPS],
    $main::st_ext[$main::TODO],
    $main::st_ext[$main::LINKS], $main::st_ext[$main::PROBL],
    $main::st_ext[$main::PROBL] / $main::st_ext[$main::LINKS] * 100;
  }

    print OUT "</table>\n\n";

    # Server information
    printAllServers($final_page);

    # Checkbot session parameters
    print OUT "<hr><h2>Checkbot session parameters</h2>\n";
    print OUT "<table>\n";
    print OUT "<tr><th align=left>--url</th><td>Start URL(s)</td><td>",
              join(',', @starturls), "</td></tr>\n";
    print OUT "<tr><th align=left>--match</th><td>Match regular expression</td><td>$main::opt_match</td></tr>\n";
    print OUT "<tr><th align=left>--exclude</th><td>Exclude regular expression</td><td>$main::opt_exclude</td></tr>\n" if defined $main::opt_exclude;
    print OUT "<tr><th align=left>--ignore</th><td>Ignore regular expression</td><td>$main::opt_ignore</td></tr>\n" if defined $main::opt_ignore;
    print OUT "<tr><th align=left>--dontwarn</th><td>Don't warn for these codes</td><td>$main::opt_dontwarn</td></tr>\n" if $main::opt_dontwarn ne 'xxx';
    print OUT "<tr><th align=left>--enable-virtual</th><td>Use virtual names only</td><td>yes</td></tr>\n" if $main::opt_enable_virtual;
    print OUT "<tr><th align=left>--internal-only</th><td>Check only internal links</td><td>yes</td></tr>\n" if defined $main::opt_internal_only;
    
    print OUT "</table>\n";

    # Statistics for types of links

    print OUT signature();

    close(OUT);

    rename($main::file, $main::file . ".bak");
    rename($main::file . ".new", $main::file);

    unlink $main::file . ".bak" unless $main::opt_debug;

    print STDERR "*** Done writing result page\n" if $main::opt_verbose;
}

# Create a list of all the servers, and create the corresponding table
# and subpages. We use the servers overview for this. This can result
# in strange effects when the same server (e.g. IP address) has
# several names, because several entries will appear. However, when
# using the IP address there are also a number of tricky situations,
# e.g. with virtual hosting. Given that likely the servers have
# different names for a reasons, I think it is better to have
# duplicate entries in some cases, instead of working off of the IP
# addresses.

sub printAllServers {
  my ($finalPage) = @_;

  my $server;
  print OUT "<hr><h2>Overview per server</h2>\n";
  print OUT "<table> <tr> <th>Server</th><th>Server<br>Type</th><th>Unique<br>links</th><th>Problem<br>links</th><th>Ratio</th></tr>\n";

  foreach $server (sort keys %main::servers) {
    print_server($server, $finalPage);
  }
  print OUT "</table>\n\n";
}

sub get_server_type {
  my($server) = @_;

  my $result;

  if ( ! defined($main::server_type{$server})) {
    if ($server eq 'localhost') {
      $result = 'Direct access through filesystem';
    } else {
      my $request = new HTTP::Request('HEAD', "http://$server/");
      my $response = $main::ua->simple_request($request);
      $result = $response->header('Server');
    }
    $result = "Unknown server type" if ! defined $result or $result eq "";
    print STDERR "=== Server $server is a $result\n" if $main::opt_verbose;
    $main::server_type{$server} = $result;
  }
  $main::server_type{$server};
}

sub add_checked {
  my($urlstr) = @_;
  my $item;
  my $result = 0;

  unless ($main::opt_enable_virtual) {
    # Substitute hostname with IP-address. This keeps us from checking
    # the same pages for each name of the server, wasting time & resources.
    # Only do this if we are not dealing with virtual servers.
    my $url = URI->new($urlstr);

    # TODO: This should be fixed in a more sane way. Really the URI
    # class should deal with stuff like this. Checkbot can't be expected
    # to know which URI schemes do have a host component and which types
    # don't.
    $url->host(ip_address($url->host)) if $url->scheme =~ /^(http|ftp|gopher|https|ldap|news|nntp|pop|rlogin|snews|telnet)$/;
    $urlstr = $url->as_string;
  }

  if (defined $main::checked{$urlstr}) {
    $result = 1;
    $main::st_int[$main::DUPS]++;
    $main::checked{$urlstr}++;
  } else {
    $main::checked{$urlstr} = 1;
  }

  return $result;
}

# Parse document, and get the links
sub handle_doc {
  my ($response) = @_;
  my ($doc_new, $doc_dup, $doc_ext) = (0, 0, 0);

  # TODO: we are making an assumption here that the $reponse->base is
  # valid, which might not always be true! This needs to be fixed, but
  # first let's try to find out why this stuff is sometimes not
  # valid...

  # When we received the document we can add a notch to its server
  $main::servers{$response->base->authority}++;

  my $p = HTML::LinkExtor->new(undef, $response->base);
  $p->parse($response->content);
  $p->eof;

  # Parse the links we found in this document
  my @links = $p->links();
  foreach (@links) {
    my ($tag, %l) = @{$_};
    foreach (keys %l) {
      # Get the canonical URL, so we don't need to worry about base, case, etc.
      my $url = $l{$_}->canonical;

      # Remove fragments, if any
      $url->fragment(undef);

      # Check whether URL has fully-qualified hostname
      if ($url->scheme =~ /^(http|ftp)/) {
        if (! defined $url->host) {
          print STDERR "--> No host name found in URL: $url\n" if $main::opt_verbose;
          $warning{'No host name found in URL'}{$url} .= $response->base->as_string . "\n";
          next;
        } elsif ($url->host !~ /\./) {
          print STDERR "--> Unqualified host name: $url\n" if $main::opt_verbose;
          $warning{'Unqualified host name in URL'}{$url} .= $response->base->as_string . "\n";
        }
      }

      if ($url =~ /$main::opt_match/o) {
	if (defined $main::checked{$url}) {
	  $doc_dup++;
	} else {
	  add_to_queue($url, $response->base);
	  $doc_new++;
	}
      } else {
	# Add this as an external link if we can check the protocol later
	if ($url =~ /^(http|ftp|gopher):/o) {
	  print EXTERNAL $url . "|" . $response->base . "\n";
	  $doc_ext++;
	} else {
	  print STDERR "--> Ignored: external URL is not HTTP, FTP or gopher: $url\n" if $main::opt_verbose;
	}
      }
    }
  }
  $main::st_int[$main::DUPS] += $doc_dup;
  $main::st_ext[$main::LINKS] += $doc_ext;
  $main::st_ext[$main::TODO] += $doc_ext;
  if ($main::opt_verbose) {
    my @string = ();
    push(@string, "$doc_new new") if $doc_new > 0;
    push(@string, "$doc_dup dup") if $doc_dup > 0;
    push(@string, "$doc_ext ext") if $doc_ext > 0;
    printf STDERR "         (%s)\n", join(',', @string) if $#string > 0;
  }
}

sub get_headers {

# A better solution here would be to use GET exclusively. Here is how
# to do that. We would have to set this max_size thing in
# check_external, I guess...

# Set $ua->max_size(1) and then try a normal GET request.

  my ($url, $urlparent, $language) = @_;

  # Normally, we would only need to do a HEAD, but given the way LWP
  # handles gopher requests, we need to do a GET on those to get at
  # least a 500 and 501 error. We would need to parse the document
  # returned by LWP to find out if we had problems finding the
  # file. -- Patch by Bruce Speyer <bspeyer@texas-one.org>

  # We also need to do GET instead of HEAD if we know the remote
  # server won't accept it.  The standard way for an HTTP server to
  # indicate this is by returning a 405 ("Method Not Allowed") or 501
  # ("Not Implemented").  Other circumstances may also require sending
  # GETs instead of HEADs to a server.  Details are documented below.
  # -- Larry Gilbert <larry@n2h2.com>

  my ($reqtype, $ref_header, $request, $response);

  foreach $reqtype ('HEAD', 'GET') {
    # Under normal circumstances, we will go through this loop only once
    # and thus make a single HEAD request.

    # If we already know a HEAD won't work, skip it and do GET
    next if $reqtype eq 'HEAD' && ($url->scheme =~ /^gopher$/o ||
				   $main::servers_get_only{$url->authority} );

    print STDERR "     $reqtype $url\n" if $main::opt_verbose;

    my %header_hash = (
	'Referer'	=> $urlparent
    );
    if (defined($language) && ($language ne '')) {
      $header_hash{'Accept-Language'} = $language;
    }
    $ref_header = new HTTP::Headers(%header_hash);
    $request = new HTTP::Request($reqtype, $url, $ref_header);
    $response = $main::ua->simple_request($request);

    if ($reqtype eq 'HEAD') {

      # 405 and 501 are standard indications that HEAD shouldn't be used
      if ($response->code =~ /^(405|501)$/o) {
	print STDERR "Server doesn't like HEAD requests; retrying\n"
	  if $main::opt_verbose;
	$main::servers_get_only{$url->authority}++;
	next;
      }

      # Microsoft IIS has been seen dropping the connection prematurely
      # when it should be returning 405 instead
      elsif ($response->status_line =~ /^500 unexpected EOF/o) {
	print STDERR "Server hung up on HEAD request; retrying\n"
	  if $main::opt_verbose;
	$main::servers_get_only{$url->authority}++;
	next;
      }

      # Netscape Enterprise has been seen returning 500 and even 404
      # (yes, 404!!) in response to HEAD requests
      elsif (defined $response->server and $response->server =~ /^Netscape-Enterprise/o &&
	     $response->code =~ /^(404|500)$/o) {
	print STDERR "Unreliable response to HEAD request; retrying\n"
	  if $main::opt_verbose;
	$main::servers_get_only{$url->authority}++;
	next;
      }

      # JavaWebServer/1.1.3 has been seen returning 500
      elsif (defined $response->server and $response->server =~ /^JavaWebServer/o &&
	     $response->code =~ /^500$/o) {
	print STDERR "Unreliable response to HEAD request; retrying\n"
	  if $main::opt_verbose;
	$main::servers_get_only{$url->authority}++;
	next;
      }

      # ALEPH has been seen returning 500
      elsif (defined $response->server and $response->server =~ /^ALEPH/o &&
	     $response->code =~ /^500$/o) {
	print STDERR "Unreliable response to HEAD request; retrying\n"
	  if $main::opt_verbose;
	$main::servers_get_only{$url->authority}++;
	next;
      }

      # If a HEAD request resulted in nothing noteworthy, no need for
      # any further attempts
      else { last; }
    }
  }

  return $response;
}


sub add_to_queue {
  my ($url, $parent) = @_;

  print QUEUE $url . '|' . $parent . "\n";
  $main::st_int[$main::TODO]++;
}

sub print_server {
  my($server, $final_page) = @_;

  my $host = $server;
  $host =~ s/(.*):\d+/$1/;

  print STDERR "    Writing server $server (really ", ip_address($host), ")\n" if $main::opt_verbose;

  my $server_problem = &count_problems($server);
  my $filename = "$main::server_prefix-$server.html";
  $filename =~ s/:/-/o;

  print OUT "<tr> <td>";
  print OUT "<a href=\"@{[ (fileparse($filename))[0] ]}\">" if $server_problem > 0;
  print OUT "$server";
  print OUT "</a>" if $server_problem > 0;
  print OUT "</td>";
  print OUT "<td>" . &get_server_type($server) . "</td>";
  printf OUT "<td align=right>%d</td> <td align=right>%d</td>",
  $main::servers{$server} + $server_problem,
  $server_problem;
  my $ratio = $server_problem / ($main::servers{$server} + $server_problem) * 100;
  print OUT "<td align=right>";
  print OUT "<b>" unless $ratio < 0.5;
  printf OUT "%4d%%", $ratio;
  print OUT "</b>" unless $ratio < 0.5;
  print OUT "</td>";
  print OUT "</tr>\n";

  # Create this server file
  open(SERVER, ">$filename")
    || die "Unable to open server file $filename for writing: $!";
  print SERVER "<html><head>\n";
  if (!$final_page) {
    printf SERVER "<META HTTP-EQUIV=\"Refresh\" CONTENT = %d>\n",
      int($main::cp_int * 60 / 2 - 5);
  }
  print SERVER "<link rel=\"StyleSheet\" type=\"text/css\" href=\"$main::opt_style\">\n" if defined $main::opt_style;
  print SERVER "<title>Checkbot: output for server $server</title></head>\n";
  print SERVER "<body><h2><em>Checkbot</em>: report for server <tt>$server</tt></h2>\n";
  print SERVER "Go To: <a href=\"@{[ (fileparse($main::file))[0] ]}\">Main report page</a>.\n";

  printServerProblems($server, $final_page);
  printServerWarnings($server, $final_page);

  print SERVER "\n";
  print SERVER signature();

  close SERVER;
}

# Return a string containing Checkbot's signature for HTML pages
sub signature {
  return "<hr>\nPage created by <a href=\"http://degraaff.org/checkbot/\">Checkbot $VERSION</a> on <em>" . localtime() . "</em>.\n</body></html>";
}

# Loop through all possible problems, select relevant ones for this server
# and display them in a meaningful way.
sub printServerProblems {
  my ($server) = @_;
  $server = quotemeta $server;

  my $separator = "<hr>\n";

  foreach my $code (sort keys %problems) {
    my $codeOutput = '';
    my %thisServerList = ();

    foreach my $url (sort keys %{ $problems{$code}}) {

      foreach my $response (@{ $problems{$code}{$url} }) {
	my $parent = $response->request->header('Referer');
	next if $parent !~ $server;

	chomp $parent;
	$thisServerList{$parent}{$url} = $response;
      }
    }

    foreach my $parent (sort keys %thisServerList) {
      my $urlOutput = '';
      foreach my $url (sort keys %{ $thisServerList{$parent} }) {
	my $response = $thisServerList{$parent}{$url};
	$urlOutput .= "<li><a href=\"$url\">$url</a><br>\n";
	$urlOutput .= $response->message . "\n"
           if defined $response->message
              and $response->message ne status_message($code);
      }
      if ($urlOutput ne '') {
	$codeOutput .= "<dt><a href=\"$parent\">$parent</a>\n<dd>\n";
	$codeOutput .= "<ul>\n$urlOutput\n</ul>\n\n";
      }
    }

    if ($codeOutput ne '') {
      print SERVER $separator if $separator;
      $separator = '';
      print SERVER "<h4>$code ", status_message($code), "</h4>\n";
      print SERVER "<dl>\n$codeOutput\n</dl>\n";
    }
  }

}

sub printServerWarnings {
  my ($server,$finalPage) = @_;

  my $warningType;
  my $oldWarningType = '';
  my $url;
  my $parent;
  my $separator = "<hr>\n";

  foreach $warningType (sort keys %warning) {
    my %thisServerList = ();

    foreach $url (sort keys %{$warning{$warningType}}) {
      # Get only those parent URLs that match this server
      my @parents = grep(/$server/, split(/\n/, $warning{$warningType}{$url}));
      if ($#parents >= 0) {
	foreach my $parent (@parents) {
	  $thisServerList{$parent}{$url} = 1;
	}
      }
    }

    next if ! keys %thisServerList;

    print SERVER $separator if $separator;
    $separator = '';
    print SERVER "<h4>$warningType</h4>\n<dl>\n" unless $oldWarningType eq $warningType;
    $oldWarningType = $warningType;

    foreach my $parent (sort keys %thisServerList) {
      print SERVER "<dt><a href=\"$parent\">$parent</a>\n<dd>\n";
      foreach my $url (sort keys %{ $thisServerList{$parent} }) {
	print SERVER "<a href=\"$url\">$url</a><br>";
      }
    }
    print SERVER "</dl>\n";
  }
}

sub check_point {
    if ( ($main::cp_last + 60 * $main::cp_int < time()) 
	 || ($main::opt_debug && $main::opt_verbose)) {
	&create_page(0);
	$main::cp_last = time();
	$main::cp_int = $main::cp_int * 1.25 unless $main::opt_debug;
        $main::cp_int = $main::cp_int > $main::opt_interval ? $main::opt_interval : $main::cp_int;
    }
}

sub send_mail {
  my $msg = new Mail::Send;
  my $sub = 'Checkbot results for ';
  $sub .= join(', ', @starturls);

  $msg->to($main::opt_mailto);
  $msg->subject($sub);

  my $fh = $msg->open;

  print $fh "Checkbot results for:\n  " . join("\n  ", @starturls) . "\n\n";
  print $fh "User-supplied note: $main::opt_note\n\n"
    if defined $main::opt_note;

  print $fh "A brief summary of the results follows:\n\n";
  my $line = sprintf "Internal: %6d total, %6d unique, %6d problems, ratio = %3d%%\n",
    $main::st_int[$main::LINKS] + $main::st_int[$main::DUPS],
    $main::st_int[$main::LINKS], $main::st_int[$main::PROBL],
    $main::st_int[$main::PROBL] / $main::st_int[$main::LINKS] * 100;
  if ($main::st_ext[$main::LINKS] > 0) {
    $line .= sprintf "External: %6d total, %6d unique, %6d problems, ratio = %3d%%\n",
      $main::st_ext[$main::LINKS] + $main::st_ext[$main::DUPS],
      $main::st_ext[$main::LINKS], $main::st_ext[$main::PROBL],
      $main::st_ext[$main::PROBL] / $main::st_ext[$main::LINKS] * 100;
  }
  print $fh $line;

  print $fh "\n\n-- \nCheckbot $VERSION\n";
  print $fh "<URL:http://degraaff.org/checkbot/>\n";

  $fh->close;
}

sub print_help {
  print "Checkbot $VERSION command line options:\n\n";
  print "  --debug            Debugging mode: No pauses, stop after 25 links.\n";
  print "  --verbose          Verbose mode: display many messages about progress.\n";
  print "  --url url          Start URL\n";
  print "  --match match      Check pages only if URL matches `match'\n";
  print "                     If no match is given, the start URL is used as a match\n";
  print "  --exclude exclude  Exclude pages if the URL matches 'exclude'\n";
  print "  --ignore ignore    Do not list error messages for pages that the\n";
  print "                     URL matches 'ignore'\n";
  print "  --file file        Write results to file, default is checkbot.html\n";
  print "  --mailto address   Mail brief synopsis to address when done.\n";
  print "  --note note        Include Note (e.g. URL to report) along with Mail message.\n";
  print "  --proxy URL        URL of proxy server for external http and ftp requests.\n";
  print "  --internal-only    Only check internal links, skip checking external links.\n";
  print "  --sleep seconds    Sleep this many seconds between requests (default 0)\n";
  print "  --timeout seconds  Timeout for http requests in seconds (default 120)\n";
  print "  --interval seconds Maximum time interval between updates (default 10800)\n";
  print "  --dontwarn codes   Do not write warnings for these HTTP response codes\n";
  print "  --enable-virtual   Use only virtual names, not IP numbers for servers\n";
  print "  --language         Specify 2-letter language code for language negotiation\n";
  print "\n";
  print "Options --match, --exclude, and --ignore can take a perl regular expression\nas their argument\n\n";
  print "Use 'perldoc checkbot' for more verbose documentation.\n\n";
  print "Checkbot WWW page     : http://degraaff.org/checkbot/\n";
  print "Mail bugs and problems: checkbot\@degraaff.org\n";
    
  exit 0;
}

sub ip_address {
  my($host) = @_;

  return $main::ip_cache{$host} if defined $main::ip_cache{$host};

  my($name,$aliases,$adrtype,$length,@addrs) = gethostbyname($host);
  if (defined $addrs[0]) {
    my($n1,$n2,$n3,$n4) = unpack ('C4',$addrs[0]);
    $main::ip_cache{$host} = "$n1.$n2.$n3.$n4";
  } else {
    # Whee! No IP-address found for this host. Just keep whatever we
    # got for the host. If this really is some kind of error it will
    # be found later on.
    $main::ip_cache{$host} = $host;
   }
}

sub count_problems {
  my ($server) = @_;
  $server = quotemeta $server;
  my $count = 0;

  foreach my $code (sort keys %problems) {
    foreach my $url (sort keys %{ $problems{$code}}) {
      foreach my $response (@{ $problems{$code}{$url} }) {
	my $parent = URI->new($response->request->header('Referer'));
	$count++ if $parent =~ m/$server/;
      }
    }
  }

  return $count;
}


# $Log: checkbot.pl,v $
# Revision 1.67  2001/12/16 16:15:55  graaff
#  * A --language option to ask the server for pages in other languages
#  * Bug fixes related to URI package and non-standard server names
#  * Some other minor bugfixes detailed in the ChangeLog
#  * Added example for use of the --match argument
#
# Revision 1.66  2001/10/25 19:46:42  graaff
# Fixed two bugs introduced in 1.65 related to using URI.
#
# Revision 1.65  2001/10/21 08:07:19  graaff
#  * Move to URI from URI::URL
#  * Start writing results pages right at the start
#  * A few small bugs fixed (see ChangeLog)
#
# Revision 1.64  2001/04/15 19:34:30  graaff
# Fix printing of starting URLs in email, removed duplicate header in
# report, try more environment variables to set temporary directory,
# avoid using printf on pipes, and fix silly typo.
#
# Revision 1.63  2001/01/14 15:25:52  graaff
#  * Require LWP 5.50. A bug fix was introduced in LWP 5.49 which solves
#    problems with relative Location: headers for Checkbot.
#  * Create a defaults --match argument based on all the start URLs, not
#    just the first one.
#  * Remove .bak files when the new files are written correctly.
#  * Create correct URLs when --file argument also contains directories.
#  * Deal with redirects without a Location: header.
#  * Don't exclude checkbot's own pages automagically.
#  * Always ask server about file type for HTTP requests when uncertain.
#  * Make output well-formed HTML.
#  * Several typo's and other output fixes.
#
# Revision 1.62  2000/09/24 13:51:46  graaff
# Checkbot 1.62.
#
# Revision 1.61  2000/06/29 19:56:48  graaff
# Updated Makefile.PL
# Use GET instead of HEAD for confused servers.
# Update email and web address.
#
# Revision 1.60  2000/04/30 13:34:32  graaff
# Add option --dontwarn to avoid listing certain HTTP responses. Deal
# with 300 Multiple Choices HTTP Response. Fix warning with
# --internal-only option and add message when used. Use MacPerl stuff to
# get command line options on a Mac. Check whether URLs on command line
# have a proper host.
#
# Revision 1.59  2000/01/30 20:23:32  graaff
# --internal-only option, hide some warnings when not running verbose,
# and fixed a warning.
#
# Revision 1.58  2000/01/02 15:39:59  graaff
# Deal with hostnameless URIs, use TMPDIR where available, and work
# nicely with the new HTML::LinkExtor.
#
# Revision 1.57  1999/10/24 16:11:00  graaff
# Added URI check.
#
# Revision 1.56  1999/07/31 14:52:17  graaff
# Fixed redirection URL's, deal with new URI way of handling
# hostname-less URI's.
#
# Revision 1.55  1999/05/09 15:30:34  graaff
# List broken links under the pages that contain them, instead of the
# other way around, reverting back to the way things are in 1.53 and
# earlier.
# Handle redirected, but unqualified links.
# Only print each warning header once.
# Documentation fixes.
#
# Revision 1.54  1999/01/18 22:22:29  graaff
# Fixed counting of problem links to correct checkbot.html results page.
#
# Revision 1.53  1999/01/17 20:59:14  graaff
# Fixed internal problem storage.
# Changed report to collate HTTP response codes.
# Added warning section to pages with additional warnings.
# Hammered out bug with record separator in perl 5.005.
#
# Revision 1.52  1998/10/10 08:41:50  graaff
# new version, some documentation work, and the HTML::Parse problem fixed.
#
# Revision 1.51  1997/09/06 14:01:58  graaff
# per 5.004 changes and address changes
#
# Revision 1.50  1997/04/28 07:10:26  graaff
# Fixed small problem with VERSION
#
# Revision 1.49  1997/04/27 19:24:22  graaff
# A bunch of smaller stuff
#
# Revision 1.48  1997/04/05 15:28:35  graaff
# Small fixes
#
# Revision 1.47  1997/01/28 13:48:00  graaff
# Protect against corrupted todo link count
#
# Revision 1.46  1996/12/30 15:27:11  graaff
# Several bugs fixed and features added, see changelog
#
# Revision 1.45  1996/12/24 13:59:15  graaff
# Deal with IP address not found.
#
# Revision 1.44  1996/12/11 16:16:07  graaff
# Proxy support, small bugs fixed.
#
# Revision 1.43  1996/12/05 12:35:41  graaff
# Checked URLs indexed with IP address, small changes to layout etc.
#
# Revision 1.42  1996/11/04 13:21:07  graaff
# Fixed several small problems. See ChangeLog.
#
# Revision 1.41  1996/10/04 15:15:35  graaff
# use long option names now
#
# Revision 1.40  1996/09/28 08:18:14  graaff
# updated, see ChangeLog
#
# Revision 1.39  1996/09/25 13:25:48  graaff
# update rev
#
# Revision 1.4  1996/09/25 12:53:04  graaff
# Moved checkbot back to checkbot.pl so that we can substitute some
# variables upon installation.
#
# Revision 1.37  1996/09/12 13:12:05  graaff
# Updates, and checkbot now requires LWP 5.02, which fixes some bugs.
#
# Revision 1.36  1996/09/05 14:13:58  graaff
# Mainly documentation fixes. Also fixed comparison.
#
# Revision 1.35  1996/09/01 19:39:24  graaff
# Small stuff. See Changelog.
#
# Revision 1.34  1996/08/07 08:10:18  graaff
# Stupid bug in parsing the LinkExtor output fixed.
#
# Revision 1.33  1996/08/05 06:47:43  graaff
# Fixed silly bug in calculation of percentage for each server.
#
# Revision 1.32  1996/08/02 21:51:18  graaff
# Use the new LinkExtor to retrieve links from a document. Uses less
# memory, and should be quicker.
#
# Revision 1.31  1996/08/02 21:38:39  graaff
# Added a number of patches by Bruce Speyer.
# Added POD documentation.
# Added summary to mail message.
#
# Revision 1.30  1996/08/02 11:11:09  graaff
# See ChangeLog
#
# Revision 1.29  1996/07/27 20:28:35  graaff
# See Changelog
#
# Revision 1.28  1996/07/23 12:32:09  graaff
# See ChangeLog
#
# Revision 1.27  1996/07/22 20:34:44  graaff
# Fixed silly bug in columns printf
#
# Revision 1.26  1996/06/22 12:52:57  graaff
# redirection, optimization, correct base url
#
# Revision 1.25  1996/06/20 14:13:52  graaff
# Major rewrite of initialization. Fixed todo links indicators.
#
# Revision 1.24  1996/06/19 15:49:38  graaff
# added -M option, fixed division by 0 bug
#
# Revision 1.23  1996/06/01 17:33:40  graaff
# lwp-win32 changes, and counting cleanup
#
# Revision 1.22  1996/05/29 18:36:37  graaff
# Fixed error in regexp, small bugs
#
# Revision 1.21  1996/05/26 08:06:13  graaff
# Possibly add ending slash to URL's
#
# Revision 1.20  1996/05/13 17:01:17  graaff
# hide messages behind verbose flag
#
# Revision 1.19  1996/05/13 13:05:53  graaff
# See ChangeLog
#
# Revision 1.18  1996/05/05 07:25:38  graaff
# see changelog
#
# Revision 1.17  1996/04/29 16:23:11  graaff
# Updated, see Changelog for details.
#
# Revision 1.16  1996/04/29 06:43:57  graaff
# Updated
#
# Revision 1.15  1996/04/28 19:42:11  graaff
# See Changelog
#
# Revision 1.14  1996/03/29 10:09:36  graaff
# See ChangeLog
#
# Revision 1.13  1996/03/24 19:16:23  graaff
# See Changelog
#
# Revision 1.12  1996/03/22 13:10:03  graaff
# *** empty log message ***
#
# Revision 1.11  1996/03/17 09:33:26  graaff
# See ChangeLog
#
# Revision 1.10  1996/02/27 09:05:22  graaff
# See ChangeLog
#
# Revision 1.9  1996/02/26 14:47:31  graaff
# Fixed bug with referer field, added -x option to help, make server
# page auto-refresh.
#
# Revision 1.8  1996/02/24 12:14:48  graaff
# Added -x option
#
# Revision 1.7  1995/12/08 12:44:33  graaff
# Major rewrite of internals
# Changed the way the checked links are kept
#
# Revision 1.6  1995/11/29 07:52:10  graaff
# Small fixes to verbose layout.
#
# Revision 1.5  1995/11/27 08:50:46  graaff
# stupid bug in calling sort
#
# Revision 1.4  1995/11/24 15:48:34  graaff
# Fixed numerous small problems, mostly in the output.
# Fixed checking of external links (each link now gets checked only once)
# Sorting of errors is now done by error code, by error text, by page.
#
# Revision 1.3  1995/11/22 09:51:58  graaff
# Last part of major revision towards Perl 5 and libwww5. Checkbot now
# seems to work again, and at least generates the proper reports.
# However, more work, in particular cleanups, is needed.
#
# Revision 1.2  1995/08/25 11:28:57  graaff
# First rewrite towards perl 5, most stuff done in a crude way.
#
# Revision 1.1  1995/08/25 09:16:29  graaff
# First version is identical to the perl4 version. I will change it
# gradually.
#
!NO!SUBS!