<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>
1. Sitescooper README
</title>
</head>
<body bgcolor="#ffffff" text="#000000" link="#3300cc" vlink="#660066">
<h1>
1. Sitescooper README
</h1>
<p>
This is sitescooper, a perl script which you run on your Palm
Computing handheld organizer's hotsync machine. It will
retrieve news stories automatically from various news
websites and convert them into Palm DOC, iSilo, RichReader or
text format; in addition, it can now convert into any other
format for which you have a conversion program that takes
text or HTML input.
</p>
<p>
(If you've just installed sitescooper, you probably don't want to
read the blurb again, so go straight to the Installation page.)
</p>
<p>
HTTP and local files, using the file:/// protocol, are both
supported.
</p>
<p>
Multiple types of sites can be snarfed:
</p>
<blockquote>
1-level sites, where the text to be converted is all present
on one page (such as Slashdot, Linux Weekly News, BluesNews,
NTKnow, Ars Technica);
</blockquote>
<blockquote>
2-level sites, where the text to be converted is linked to
from a Table of Contents page (such as Wired News, BBC News,
and I, Cringely);
</blockquote>
<blockquote>
3-level sites, where the text to be converted is linked to
from a Table of Contents page, which in turn is linked to
from a list-of-issues page (such as PalmPower or New Scientist).
</blockquote>
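<p>
The difference between the three kinds of site can be pictured as
how many levels of links must be followed before the story text is
reached. The sketch below is only an illustration in Python, not
sitescooper's own Perl code: the <tt>pages</tt> dictionary, its
<tt>links</tt> and <tt>text</tt> keys, and the function name
<tt>collect_stories</tt> are all hypothetical stand-ins for real
fetched pages.
</p>

```python
def collect_stories(pages, start, levels):
    """Follow links from `start` down `levels` levels and return story text.

    `pages` is a hypothetical in-memory stand-in for the web: it maps a
    page name to a dict with a list of "links" and a "text" string.
    """
    current = [start]
    # A 1-level site needs no link following; each extra level adds one hop.
    for _ in range(levels - 1):
        next_level = []
        for url in current:
            next_level.extend(pages[url]["links"])
        current = next_level
    return [pages[url]["text"] for url in current]


# Hypothetical 3-level site: issues list, then table of contents, then stories.
pages = {
    "issues": {"links": ["toc1"], "text": ""},
    "toc1": {"links": ["s1", "s2"], "text": ""},
    "s1": {"links": [], "text": "Story one"},
    "s2": {"links": [], "text": "Story two"},
}
stories = collect_stories(pages, "issues", 3)
print(stories)
```

<p>
Starting from <tt>toc1</tt> with two levels, or from <tt>s1</tt> with
one level, walks the same structure as a 2-level or 1-level site.
</p>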
<p>
In addition, sites that post news as items on one big page,
such as Slashdot, Ars Technica, and BluesNews, are supported
using diff.
</p>
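<p>
As a rough illustration of the diff idea, the Python sketch below
compares the previously seen copy of a page with the current one and
keeps only the lines that are new. It uses Python's standard
<tt>difflib</tt> module rather than the diff that sitescooper itself
uses, and the function name <tt>new_lines</tt> and the sample page
text are assumptions made for the example.
</p>

```python
import difflib


def new_lines(previous, current):
    """Return the lines of `current` that do not appear in `previous`."""
    diff = difflib.ndiff(previous.splitlines(), current.splitlines())
    # ndiff marks lines that exist only in the second input with "+ ";
    # keep those lines and strip the two-character marker.
    return [line[2:] for line in diff if line.startswith("+ ")]


# Hypothetical page contents from two consecutive runs.
yesterday = "Story A\nStory B\n"
today = "Story C\nStory A\nStory B\n"
fresh = new_lines(yesterday, today)
print(fresh)
```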
<p>
It even trims out sidebar tables automatically, on the assumption
that tables narrower than 30% of the average browser width are not
part of the news story. Effectively, sitescooper is a <a
href=http://www.research.ibm.com/networked_data_systems/transcoding/>transcoder</a>
for handheld PCs.
</p>
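<p>
The width heuristic can be sketched roughly as follows. This is a
Python illustration under stated assumptions, not sitescooper's actual
logic: it assumes the width is read from a table's WIDTH attribute,
given either as a percentage or in pixels, and the 640-pixel "average
browser width" is a made-up default. Only the 30% threshold comes from
the description above.
</p>

```python
def is_sidebar(width, browser_width=640, threshold=0.30):
    """Guess whether a table is a sidebar from its WIDTH attribute value.

    `width` is the raw attribute string, such as "25%" or "150", or None
    when the table gives no width. `browser_width` is an assumed value.
    """
    if width is None:
        return False  # no width given, so keep the table
    width = width.strip()
    if width.endswith("%"):
        fraction = float(width[:-1]) / 100.0
    else:
        fraction = float(width) / browser_width
    # Tables narrower than the threshold are treated as sidebars.
    return threshold > fraction


print(is_sidebar("25%"), is_sidebar("150"), is_sidebar("400"))
```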
<p>
The script should run easily on most UNIX variants that
support perl, as well as the Win32 platform, even Windows 95
(tested with ActivePerl 5.00502 build 509). It has been
reported to work on a Mac, using MacPerl 5.1.9r4.
</p>
<p>
Output is supported in the following formats:
</p>
<ul>
<li>
<p>
plain text
</p>
</li>
<li>
<p>
<a href="http://plucker.gnu-designs.org/">Plucker</a>, an HTML-based
format for Palm Computing organizers. Plucker is free software
licensed under the GPL, like sitescooper.
</p>
</li>
<li>
<p>
iSilo, an HTML-based format for the Palm Computing
organizers from DC and Co., available from <tt><a href=
"http://www.isilo.com/">http://www.isilo.com/</a></tt>
</p>
</li>
<li>
<p>
RichReader format, an RTF-based format with formatting,
see <tt><a href=
"http://users.erols.com/arenakm/palm/RichReader.html">
http://users.erols.com/arenakm/palm/RichReader.html</a></tt>
</p>
</li>
<li>
<p>
DOC format, as used by AportisDoc, TealDoc, CSpotRun,
etc.
</p>
</li>
<li>
<p>
any other format using the -pipe switch.
</p>
</li>
</ul>
<p>
DOC format, Plucker format, and text are all free. RichReader is
shareware, and iSilo has both shareware and free readers
available.
</p>
<p>
You may ask, "why not just use AvantGo, 'lynx -dump' and 'makedoc', or
some other web-page-downloading software?" Well, sitescooper has several
advantages:
</p>
<ul>
<li>
<p>
it will follow links, and has a sophisticated set of
mechanisms to follow the right links and use the
"printing version" of a story;
</p>
</li>
<li>
<p>
it can use heuristics to trim out irrelevant tables;
</p>
</li>
<li>
<p>
the HTML rendering code is optimised for viewing on a
Palm handheld: it trims all images (even their ALT
text), forms, and extraneous headers and footers (based
on the .site file), which leaves much more space free on
your handheld;
</p>
</li>
<li>
<p>
it's <i>very</i> configurable for each target site -- you can
even use Perl code in a site file to rewrite the HTML as it's
scooped;
</p>
</li>
<li>
<p>
it tracks what stories you've already read, and is quite
sophisticated about removing text you've seen before;
</p>
</li>
<li>
<p>
it's portable to UNIX, Win32, Mac, and any other
perl-supporting platform;
</p>
</li>
<li>
<p>
it's free software, distributed under the GNU GPL.
</p>
</li>
</ul>
<p>
In short, it's pretty neat.
</p>
<p>
Pick up the latest version of sitescooper at the following
URL:
</p>
<blockquote>
<tt><a href="http://sitescooper.org/">
http://sitescooper.org/</a></tt>
</blockquote>
<p>
Sitescooper is distributed under the GNU GPL, and as such
is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
</p>
<p>
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
The full text of the GPL is available <a href=gpl.html>here</a>.
</p>
<p>
The next thing to do is to follow the links below to the next
section, Installing.
</p>
<!-- start of nav links --><hr>
<p align=right>
<nobr> [
<a href=index.html>README</a> ]
<br>
[
<a href=running.html>Running</a> ]|[
<a href=sitescooper.html>Command-line Arguments Reference</a> ]
<br>
[
<a href=writing_site.html>Writing a Site File</a> ]|[
<a href=site_params.html>Site File Parameters Reference</a> ]
<br>
[
<a href=rss-to-site.html>The rss-to-site Conversion Tool</a> ]|[
<a href=subs-to-site.html>The subs-to-site Conversion Tool</a> ]
<br>
[
<a href=contributing.html>Contributing</a> ]|[
<a href=gpl.html>GPL</a> ]|[
<a href=http://sitescooper.org/>Home Page</a> ]
</nobr>
</p>
<!-- end of nav links --> </body></html>