File: MakeHandler.pl

newsclipper 1.32-5
#!/usr/bin/perl

# use perl                                  -*- mode: Perl; -*-

use strict;
use File::Cache;

my $VERSION = 0.62;
my $COMPATIBLE_NEWS_CLIPPER_VERSION = 1.18;

my %info;
# author_name
# author_email
# maintainer_name
# maintainer_email
# name
# description
# category
# url: url for the web page
# license
# for_news_clipper_version
# language
# urlcode: code that sets "my $url"
# defaults: code to set defaults
# attributes: an array of all the attributes and their possible values, as
#   "attribute:(value1|value2|value3)".
# patterncode: code to set the start and end patterns
# getfunction: the name of the Get function (GetHtml, e.g.)
# getcode: the Get code.
# date: code for the GetUpdateTimes function
# defaulthandlers: code that contains the GetDefaultHandlers function

my $handlerserver = 'handlers.newsclipper.com';
#my $handlerserver = 'localhost';

sub Prompt;
my ($editor,@prompt);

print <<EOF;
Greetings! This is News Clipper's MakeHandler.pl program, which is designed to
help you write handlers. Well, actually, it will help you write *Acquisition*
handlers. General filter and output handlers are a bit more involved, and it's
not likely that you'll need to write them.

This utility will invoke a text editor to allow you to enter responses.  The
main control will happen here at the script, but most of the data entry will
be done in the editor.  Just enter the requested information in the space
between the prompts, which look like this:

vvvvv
<THIS IS WHERE YOU ANSWER>
^^^^^

Please read the hints for handler writers, at
http://www.newsclipper.com/handlers.htm#Handler_Tutorial

EOF

print "Enter the name of your favorite editor (notepad, vi, emacs): ";
$editor = <STDIN>;
chomp $editor;

#-------------------------------------------------------------------------------

print "First, we need a few details...\n\n";

push @prompt,<<EOF;
What's your name?
EOF

push @prompt,<<EOF;
Where can people reach you by email, in case they have problems with your
handler, or want to make modifications to it?
EOF

push @prompt,<<EOF;
Enter the handler's name, in all lower case.
EOF

push @prompt,<<EOF;
Enter a one-line description of the handler, so people can understand what it
does when it is listed on the handler webpage.
EOF

push @prompt,<<EOF;
Enter a URL that people can visit to get an idea of where the data comes
from.
EOF

push @prompt,<<EOF;
What language will the handler be most suited for?
EOF

push @prompt,<<EOF;
Sometimes people like to put a license on their code that gives other people
limited rights to modify, copy, or sell it. The GPL is pretty popular, as
is the Artistic license. For a summary of the major open source licenses, see
http://www.oreilly.com/catalog/opensources/book/perens.html

Enter the license type, if you want to license your code:
EOF

push @prompt,<<EOF;
Please choose a category for your handler. If you would like to use a category
that is not here, you can enter anything you would like now. However, you will
need to email the database maintainer at SubmitHandler\@newsclipper.com and
ask that the category be added before you submit the handler to the database.

General
Tech
Business
Sports
Science
Weather
Music
Local
Humor
Comics
Linux
Programming
Personal Computers
Miscellaneous
EOF

($info{author_name},$info{author_email},$info{name},$info{description},
  $info{url},$info{language},$info{license},$info{category}) = Prompt(@prompt);

$info{maintainer_name} = $info{author_name};
$info{maintainer_email} = $info{author_email};
$info{for_news_clipper_version} = $COMPATIBLE_NEWS_CLIPPER_VERSION;

#-------------------------------------------------------------------------------

print <<EOF;
Some handlers get their information from different URLs, depending on what
parameters the user enters. For example, the yahootopstories handler can
grab data from several different URLs that all share a common format.

Will your URL depend on a parameter?
EOF

my $yesno;
$yesno = <STDIN>;
if ($yesno =~ /^y/i)
{
push @prompt,<<EOF;
What is the name of the attribute that you would like the URL to depend on?
("source", for example.)
EOF

push @prompt,<<EOF;
Enter the values (in lower case) and corresponding URLs, like so:
'headlines' => 'http://some.server.com/headlines'
'technews' => 'http://some.server.com/technews'
'humor' => 'http://some.server.com/humor'
EOF

push @prompt,<<EOF;
What is the default value for the attribute?
EOF

my ($attr,$values,$default) = Prompt(@prompt);

$values =~ s/\n/,\n    /gs;

$info{urlcode} = "  my \%urlMap = (\n    $values,\n  );\n\n";
$info{urlcode} .= "  my \$url = \$urlMap{\$attributes->{'$attr'}};";

$info{defaults} .= "  \$attributes->{'$attr'} = '$default'\n    " .
                   "unless defined \$attributes->{'$attr'};\n";

$values =~ s/'\s*=>\s*'[^']+'/|/gs;
$values =~ s/['\n ,]//g;
$values =~ s/\|$//gs;

push @{$info{attributes}}, "$attr:($values)";
}
else
{
push @prompt,<<EOF;
What is the URL from which News Clipper should grab the data?
EOF

($info{urlcode}) = Prompt(@prompt);
$info{urlcode} = "  my \$url = '$info{urlcode}';";
}

#-------------------------------------------------------------------------------

RETRY1:

print <<EOF;
Choose an acquisition function:
(1) GetUrl -- Gets raw data from a URL, without making links absolute. Use
              this for text and such. Grabs all the data from the URL.
(2) GetText -- Extracts text from HTML, fixing < to &lt;, etc.
(3) GetHtml -- Extracts HTML, fixing relative links.
(4) GetImages -- Extracts images, fixing relative links.
(5) GetLinks -- Extracts hyperlinks, fixing relative links and removing
    formatting.
EOF

my $input = <STDIN>;
chomp $input;
goto RETRY1 if $input !~ /^[12345]$/;

$info{getcode} = '  my $data = ';

# Map the menu choice to the acquisition function it selects.
my %getFunctions = (
  1 => 'GetUrl',
  2 => 'GetText',
  3 => 'GetHtml',
  4 => 'GetImages',
  5 => 'GetLinks',
);

$info{getfunction} = $getFunctions{$input};
$info{getcode} .= $info{getfunction};

$info{getcode} .= "(\$url,\$startPattern,\$endPattern);\n"
  if $info{getfunction} ne 'GetUrl';
$info{getcode} .= "(\$url);\n"
  if $info{getfunction} eq 'GetUrl';

$info{getcode} .= "  return undef unless defined \$data;\n";

#-------------------------------------------------------------------------------

if ($info{getfunction} ne 'GetUrl')
{
print <<EOF;
Sometimes there are several sections on a web page, and you want to allow the
user to choose only one. For example, a site might put headlines and tech
articles on the same web page. Answer "no" to this question if there are
several pieces of data, and it's likely that the user will want them all.

Do you want the grabbed data to depend on a parameter?
EOF

$yesno = <STDIN>;
if ($yesno =~ /^y/i)
{
push @prompt,<<EOF;
First, a word about patterns. You should prefix the pattern with (?i) if you
want it to be case insensitive. ^ matches the beginning of the web page, and \$
matches the end. You can use \\n to match the end of a line.

Try to choose something that is unlikely to change when the web site gets
redesigned. For example, if you're grabbing a comic, and you know that the
comic is the only image that has a filename like "blah29385829.gif", don't try
to precisely grab the <img src="blah29385829.gif"> tag using GetHtml. Instead,
grab all the images using GetImages, then weed out everything but the one you
want. (When using GetLinks and GetImages, you can afford to have a "loose"
match if it allows you to pick a better pattern.)

You should return "clean" HTML, without any unclosed <em>s and such. In fact,
you should strip out <font> tags, since they restrict the web designer. Later
you can manually edit the handler and clean up the HTML using the TrimTags and
StripTags functions.

What is the name of the attribute that you would like the grabbed data to
depend on? ("source", for example.)
EOF

push @prompt,<<EOF;
Enter the values (in lower case) and corresponding starting and ending
patterns that News Clipper can use to grab the information, like so:
'headlines' => ['<!-- Start Headlines -->','<!-- End Headlines -->'],
'technews' => ['<!-- Start Technews -->','<!-- End Technews -->'],
'humor' => ['<!-- Start Humor -->','<!-- End Humor -->'],
EOF

push @prompt,<<EOF;
What is the default value for the attribute?
EOF

my ($attr,$values,$default) = Prompt(@prompt);

$values =~ s/\n/\n    /gs;

$info{patterncode} = "  my \%patternMap = (\n    $values,\n  );\n\n";
$info{patterncode} .= "  my \$startPattern = \$patternMap{\$attributes->{'$attr'}}[0];\n";
$info{patterncode} .= "  my \$endPattern = \$patternMap{\$attributes->{'$attr'}}[1];\n";

$info{defaults} .= "  \$attributes->{'$attr'} = '$default'\n    " .
                   "unless defined \$attributes->{'$attr'};\n";

$values =~ s/'\s*=>\s*\[[^\]]+\]/|/gs;
$values =~ s/['\n ,]//g;
$values =~ s/\|$//gs;

push @{$info{attributes}}, "$attr:($values)";
}
else
{
push @prompt,<<EOF;
First, a word about patterns. You should prefix the pattern with (?i) if you
want it to be case insensitive. ^ matches the beginning of the web page, and \$
matches the end. You can use \\n to match the end of a line.

Try to choose something that is unlikely to change when the web site gets
redesigned. For example, if you're grabbing a comic, and you know that the
comic is the only image that has a filename like "blah29385829.gif", don't try
to precisely grab the <img src="blah29385829.gif"> tag using GetHtml. Instead,
grab all the images using GetImages, then weed out everything but the one you
want. (When using GetLinks and GetImages, you can afford to have a "loose"
match if it allows you to pick a better pattern.)

You should return "clean" HTML, without any unclosed <em>s and such. In fact,
you should strip out <font> tags, since they restrict the web designer. Later
you can manually edit the handler and clean up the HTML using the TrimTags and
StripTags functions.

What is the starting pattern News Clipper can use to grab the data?
EOF

push @prompt,<<EOF;
What is the ending pattern News Clipper can use to grab the data?
EOF

my ($start,$end) = Prompt(@prompt);

$info{patterncode} = "  my \$startPattern = '$start';\n";
$info{patterncode} .= "  my \$endPattern = '$end';\n";
}
}

#-------------------------------------------------------------------------------

if ($info{getfunction} =~ /(Url|Text|Html)/)
{
  $info{defaulthandlers} =<<EOF;
sub GetDefaultHandlers
{
  my \$self = shift;
  my \$inputAttributes = shift;

  my \$returnVal =<<'  EOF';
    <output name='string'>
  EOF

  return \$returnVal;
}
EOF
}
else
{
  $info{defaulthandlers} =<<EOF;
sub GetDefaultHandlers
{
  my \$self = shift;
  my \$inputAttributes = shift;

  my \$returnVal =<<'  EOF';
    <filter name='limit' number=10>
    <output name='array'>
  EOF

  return \$returnVal;
}
EOF
}

#-------------------------------------------------------------------------------

push @prompt,<<EOF;
Now specify the times at which you know the remote server updates its data,
and that News Clipper should refresh its cached data.  Please be a little
conservative here -- if you specify every hour of the day, lots of people will
be hitting the remote server when they probably aren't even looking at their
News Clipper webpage.

Date specifications are of the form "[day] hour,hour,hour [time zone]". If you
leave out the day, every day is assumed. If you leave out the time zone,
Pacific Standard Time is assumed. If you leave out everything, the default of
"2,5,8,11,14,17,20,23 PST" is used.  For example, if you are making a handler
for a daily comic, you might want to just use '7', since the comic changes at
6 am PST every day.

The days are: sun,mon,tue,wed,thu,fri,sat. You can specify multiple times, for
example:

mon 6,8 EST
tue 16 CST
20

would update Mondays at 6am and 8am EST, Tuesdays at 4pm CST, and every day
at 8pm PST.

Enter your date specification:
EOF

my ($datespec) = Prompt(@prompt);

if ($datespec eq '')
{
  $info{date} = '';
}
else
{
  $datespec =~ s/\n/',\n    '/gs;

  $info{date} =<<EOF;
sub GetUpdateTimes
{
  return ['$datespec'];
}
EOF
}

#-------------------------------------------------------------------------------

my $attributes;
$attributes = join ' ',@{$info{attributes}} if defined $info{attributes};
$attributes = '' unless defined $info{attributes};

$attributes =~ s/:([^)]+)\)/=X/g;
$attributes = " $attributes" if $attributes ne '';

my $att2;
$att2 = join "\n  ",@{$info{attributes}} if defined $info{attributes};
$att2 = '' unless defined $info{attributes};
$att2 = "\n  $att2" if $att2 ne '';

my $code =<<"    EOF";
--------> THESE LINES WILL BE REMOVED BY MAKEHANDLER               <--------
--------> EDIT THIS VERSION OF THE HANDLER IF YOU NEED TO.         <--------
--------> FOR EXAMPLE, IF YOU WERE DOING THE "ASTROPIC" HANDLER,   <--------
--------> YOU WOULD WANT TO RETURN A HASH CONTAINING SEVERAL DATA  <--------
--------> ITEMS, SO YOU'D HAVE TO EDIT THE "GET" FUNCTION, AS WELL <--------
--------> AS THE "GETDEFAULTHANDLERS" FUNCTION.                    <--------
# -*- mode: Perl; -*-

package NewsClipper::Handler::Acquisition::$info{name};

use vars qw( \@ISA \$VERSION \%handlerInfo );

--------> FIX THESE COMMENTS. ADD ANY ADDITIONAL ATTRIBUTES YOU    <--------
--------> NEED, AND EXPLAIN WHAT THE ATTRIBUTES MEAN. THIS IS THE  <--------
--------> DOCUMENTATION THAT THE USER WILL REFER TO IN ORDER TO    <--------
--------> USE YOUR HANDLER.                                        <--------
\$handlerInfo{'Author_Name'}              = '$info{author_name}';
\$handlerInfo{'Author_Email'}             = '$info{author_email}';
\$handlerInfo{'Maintainer_Name'}          = '$info{maintainer_name}';
\$handlerInfo{'Maintainer_Email'}         = '$info{maintainer_email}';
\$handlerInfo{'Description'}              = <<'EOF';
$info{description}
EOF
\$handlerInfo{'Category'}                 = '$info{category}';
\$handlerInfo{'URL'}                      = <<'EOF';
$info{url}
EOF
\$handlerInfo{'License'}                  = '$info{license}';
\$handlerInfo{'For_News_Clipper_Version'} = '$info{for_news_clipper_version}';
\$handlerInfo{'Language'}                 = '$info{language}';
\$handlerInfo{'Notes'}                    = <<'EOF';
EOF
\$handlerInfo{'Syntax'}                   = <<'EOF';
<input name=$info{name}$attributes>$att2
EOF

use strict;
use NewsClipper::Handler;
\@ISA = qw(NewsClipper::Handler);

# - The first number should be incremented when a change is made to the
#   handler that will break people's input files.
# - The second number should be incremented when a change is made that won't
#   break people's input files, but changes the functionality.
# - The third number should be incremented when only a bugfix is applied.

\$VERSION = do {my \@r=('0.1.0'=~/\\d+/g);sprintf "\%d."."\%02d"x\$#r,\@r};

# ------------------------------------------------------------------------------

sub ComputeURL
{
  my \$self = shift;
  my \$attributes = shift;

$info{urlcode}

  return \$url;
}

# ------------------------------------------------------------------------------

# This subroutine checks the handler's attributes to make sure they are valid,
# and sets any default attributes if necessary.

sub ProcessAttributes
{
  my \$self = shift;
  my \$attributes = shift;
  my \$handlerRole = shift;

  # Set defaults here. You can safely delete this function if your handler has
  # no attributes with default values.

  # \$attributes->{'some_attribute'} = 'default_value'
  #   unless defined \$attributes->{'some_attribute'};

  # Verify any attributes you need to here. Output an error and return undef
  # if something is wrong.

  # unless (\$attributes->{somevalue} > 0)
  # {
  #   error "The \\"somevalue\\" attribute for handler \\"HANDLERNAME\\" " .
  #     "should be greater than 0.\\n";
  #   return undef;
  # }

$info{defaults}

  return \$attributes;
}

# ------------------------------------------------------------------------------


# This function is used to get the raw data from the URL.
sub Get
{
  my \$self = shift;
  my \$attributes = shift;

$info{patterncode}

  my \$url = \$self->ComputeURL(\$attributes);

$info{getcode}
--------> IF YOU NEED TO DO ADDITIONAL PROCESSING, LIKE A          <--------
--------> \@\$data = grep {/\\d{5}\\.gif/} \@\$data;                     <--------
--------> TO FILTER OUT IMAGES THAT DON'T HAVE 5 DIGITS, OR IF     <--------
--------> YOU NEED TO SPLIT THE DATA UP FURTHER, DO IT HERE.       <--------
  return \$data;
}

# ------------------------------------------------------------------------------

--------> MAKEHANDLER TRIED TO MAKE A GOOD GUESS HERE. YOU MIGHT   <--------
--------> NEED TO CHANGE THIS.                                     <--------
$info{defaulthandlers}

# ------------------------------------------------------------------------------

$info{date}

1;

    EOF

open FILE,">MakeHandler.inp" or die "Can't write MakeHandler.inp: $!\n";
print FILE $code;
close FILE;

system ("$editor MakeHandler.inp");

open FILE,"<MakeHandler.inp" or die "Can't read MakeHandler.inp: $!\n";
my $finishedcode = join '',<FILE>;
close FILE;

$finishedcode =~ s/-------->[^<]+<--------//gs;
$finishedcode =~ s/\n\n\n+/\n\n/gs;

$finishedcode =~ s/^\n*//s;
$finishedcode =~ s/\n*$/\n/s;

open FILE, ">$info{name}.pm" or die "Can't write $info{name}.pm: $!\n";
print FILE $finishedcode;
close FILE;

print <<EOF;
A basic handler called $info{name}.pm has been created for you.

To try it out, put it in your acquisition handlers directory (typically
NewsClipper/Handler/Acquisition/) and put
<!--newsclipper
  <input name=$info{name}>
-->
in your input file.

Have fun!
EOF

#-------------------------------------------------------------------------------

sub Prompt
{
open FILE,">MakeHandler.inp" or die "Can't write MakeHandler.inp: $!\n";
foreach my $data (@prompt)
{
  print FILE $data,"vvvvv\n\n","^^^^^\n","-"x78,"\n";
}
close FILE;

RETRY2:

system ("$editor MakeHandler.inp");

open FILE,"<MakeHandler.inp" or die "Can't read MakeHandler.inp: $!\n";
my $returnString = join '',<FILE>;
close FILE;

my @returnvals = ();

foreach my $data (@prompt)
{
  my $response = undef;

  ($response) = $returnString =~ /vvvvv\s*(.*?)\s*\^\^\^\^\^/s;

  unless (defined $response)
  {
    print "Sorry, I couldn't figure out what you answered for the ",
          "question:\n\n$data\nTry again. <press enter to confirm>\n";
    # Wait for the user to press enter before reopening the editor.
    <STDIN>;
    goto RETRY2;
  }

  $returnString =~ s/vvvvv/vvvv/;

  push @returnvals, $response;
}

@prompt = ();

return @returnvals;
}

#-------------------------------------------------------------------------------

# Needed by compiler

#perl2exe_include File/Spec/Win32

#-------------------------------------------------------------------------------

=head1 NAME

MakeHandler.pl - A generator for handlers suitable for use by News Clipper.

=head1 DESCRIPTION

I<MakeHandler.pl> is a handler generator. It asks the user a few questions,
and then creates a handler.pm file, which can then be edited further. It
jump-starts the handler writing process.

Handlers are the extensible mechanism by which I<News Clipper> can be
customized to acquire and display information from new data sources. News
Clipper provides an API of useful functions that can be used by the handler
writer.

For more information and hints about writing handlers, see
http://www.newsclipper.com/makehan.htm. Also
see the API description in the documentation for NewsClipper.pl.
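
Answers are read back from the editor file by matching the text between each
pair of C<vvvvv> and C<^^^^^> marker lines. The following is a minimal
standalone sketch of that technique, not the script's own code. The
C<extract_answers> name and the sample text are hypothetical, chosen only for
illustration:

    use strict;
    use warnings;

    # Hypothetical helper: return the trimmed text found between each
    # vvvvv / ^^^^^ marker pair, in order of appearance.
    sub extract_answers
    {
      my ($text) = @_;
      my @answers;

      while ($text =~ /vvvvv\s*(.*?)\s*\^\^\^\^\^/gs)
      {
        push @answers, $1;
      }

      return @answers;
    }

    my $sample = "Name?\nvvvvv\nJane Doe\n^^^^^\nURL?\nvvvvv\n\n^^^^^\n";
    my @answers = extract_answers($sample);

    # An answer block left untouched yields an empty string.
    print scalar(@answers), " answers; first is '$answers[0]'\n";

If a marker line is deleted or altered in the editor, the corresponding answer
cannot be found, which is why the markers should be left exactly as they
appear.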

=head1 OPTIONS AND ARGUMENTS

None.

=head1 PREREQUISITES

File::Cache, which this script loads at startup, must be installed. No other
non-core Perl modules are needed.

=head1 AUTHOR

Spinnaker Software, Inc.
David Coppit, <david@coppit.org>, http://coppit.org/

=begin CPAN

=pod COREQUISITES

none

=pod OSNAMES

any

=pod SCRIPT CATEGORIES

HTML/Preprocessors

=end CPAN

=cut