PROPAGANDA:
See http://www.math.uio.no/~janl/w3mir/ for propaganda.
--------------------------------------------------------------------------
Q: Where can I get a new version of w3mir?
A: The w3mir homepage is at http://www.math.uio.no/~janl/w3mir/.
W3mir is also distributed on CPAN: http://www.perl.com/
Q: Are there any mailing lists?
A: Yes, see below.
Q: Should I subscribe to any of the mailing lists?
A: Yes, if you use w3mir at all you should subscribe to
w3mir-info@usit.uio.no. Send e-mail to janl@math.uio.no to be
subscribed.
Q: I found a bug!
A: See below.
--------------------------------------------------------------------------
BUGS:
- -lc switch does not work too well.
Please see below for how to report bugs.
--------------------------------------------------------------------------
FEATURES (NOT bugs):
- URLs with two slashes ('//') in the path component do not work
as some might expect. According to my reading of the HTTP/URL spec
it is an illegal construct, which is a Good Thing, because I don't
know how to handle it if it's legal.
- If you start at http://foo/bar/ then index.html might be fetched twice.
- Some documents point to a location above the server root, e.g.,
http://some.server/../stuff.html. Netscape and other browsers, in
defiance of the URL standards documents, will change the URL to
http://some.server/stuff.html. W3mir will not.
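
The last point can be illustrated with a small sketch. It is written in
Python rather than w3mir's Perl, purely for illustration, and the function
name climbs_above_root is hypothetical and not part of w3mir. It walks the
path segments and reports whether a '..' would step above the server root,
which is the kind of URL that browsers rewrite and w3mir leaves alone.

```python
from urllib.parse import urlsplit

def climbs_above_root(url):
    """Return True if a '..' in the URL path would step above the server root.

    Hypothetical helper for illustration only; not w3mir code.
    """
    depth = 0
    for segment in urlsplit(url).path.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                return True
        elif segment not in ("", "."):
            # Ordinary path segment: one level deeper.
            depth += 1
    return False
```

With this sketch, http://some.server/../stuff.html is flagged, while
http://some.server/a/../stuff.html is not, since its '..' stays inside
the server root.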
--------------------------------------------------------------------------
MAIL LISTS, REPORTING BUGS:
Please send bug reports to w3mir-core@usit.uio.no, and include the URL
and command line that triggered the bug. Ideas (please see the todo
lists further down), questions about usage, general discussions and
other related talk should go to w3mir-info@usit.uio.no. To subscribe
to these lists email janl@math.uio.no. The w3mir-core list is intended
for w3mir hackers only.
--------------------------------------------------------------------------
COPYRIGHTS:
w3mir, w3http.pm, w3pdfuri.pm and htmlop.pm are free, but they are
copyrighted by the various involved hackers. If you want to copy,
hack or distribute w3mir you can do that provided you comply with the
'Artistic License' enclosed in the w3mir distribution in the file
named Artistic.
--------------------------------------------------------------------------
CREDITS:
- Oscar Nierstrasz: Wrote htget
- Gorm Haug Eriksen: Started w3mir on the foundations of htget,
contributed code later.
- Nicolai Langfeldt: Learning from Oscar's and Gorm's mistakes, rewrote
everything.
- Chris Szurgot: Adapting to win32, good ideas and code contributions,
debugging, and criticism.
- Ed Jordan: patch, debugging.
- Rik Faith: Uses w3mir extensively, not shy about complaining and
commenting and suggesting.
- The libwww-perl author(s), who made adding some new features
ridiculously easy.
--------------------------------------------------------------------------
TODO LIST:
Currently I'm preparing for version 1 of w3mir.
* TODO for version 1:
- Fix bugs discovered.
- Release 1.0
* TODO, after version 1:
Some of these are speculative, some others are very useful.
- CSS parsing/support at the same level as HTML
- Full support for APPLETS/OBJECT tags.
- Alias rules. These would enable w3mir to map ukoln.bath.ac.uk and
bubl.bath.ac.uk to www.ukoln.ac.uk and know that the objects contained
in these are all the same. Another use would be to mirror from a mirror
instead of the _real_ site, when the original site that your references
point to is on a slow link while the mirror is on a fast link.
- FTP support (easy if through a http style ftp proxy, but is that what
we want?)
- SSL support
- Retrieve recursively until N-th order links are reached. This
differs significantly from directory recursion, which we do now. When
this is done w3mir should also know the difference between inline and
external links; inline links should always be retrieved. Trivia question:
What order is needed to reach every findable document on the web from
Yahoo?
- Integrate with cvs or rcs (or another version control system) to make
the retriever able to reproduce the mirrored site for any given date.
- Some text processing: Adding and removing text/sgml comments when suitable
options and tags are found. Suggested by Ed Jordan.
- Put retrieval date-stamps as comments in html files, to document when
and how the document was retrieved.
- Example: If you're mirroring a site primarily to get to the papers, but
the site has n versions of each paper: foo.ps.gz, foo.ps.Z, foo.dvi.gz,
foo.dvi.Z, foo.tar.gz, foo.zip, and you only need one version. Implement
a way to get only one version of documents provided in multiple versions,
something like a multi-axis preference list to get only the most attractive
version of the doc.
- Logging of retrievals to file; need to change every print to a function call.
- Your suggestion here.
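
The multi-axis preference list from the example above could look roughly
like the following sketch. It is written in Python rather than w3mir's
Perl, and nothing in it exists in w3mir: the two preference lists, their
ordering, and the function names are all assumptions made for illustration.
One axis ranks the document format, the other the compression, and only
the best-ranked version of each document is kept.

```python
import os

# Hypothetical preference lists, most attractive first (assumed ordering).
FORMAT_PREFERENCE = [".ps", ".dvi", ".tar", ".zip"]
COMPRESSION_PREFERENCE = [".gz", ".Z", ""]

def split_name(filename):
    """Split 'foo.ps.gz' into ('foo', '.ps', '.gz')."""
    compression = ""
    for suffix in (".gz", ".Z"):
        if filename.endswith(suffix):
            compression = suffix
            filename = filename[:-len(suffix)]
            break
    base, fmt = os.path.splitext(filename)
    return base, fmt, compression

def rank(filename):
    """Return a (format rank, compression rank) tuple; lower is better."""
    _, fmt, compression = split_name(filename)
    # Unknown formats or compressions sort after all known ones.
    fmt_rank = (FORMAT_PREFERENCE.index(fmt)
                if fmt in FORMAT_PREFERENCE else len(FORMAT_PREFERENCE))
    comp_rank = (COMPRESSION_PREFERENCE.index(compression)
                 if compression in COMPRESSION_PREFERENCE
                 else len(COMPRESSION_PREFERENCE))
    return (fmt_rank, comp_rank)

def pick_versions(filenames):
    """Keep only the most attractive version of each document."""
    best = {}
    for name in filenames:
        base = split_name(name)[0]
        if base not in best or rank(name) < rank(best[base]):
            best[base] = name
    return sorted(best.values())
```

Given the six versions of foo listed above, pick_versions would keep only
foo.ps.gz under these assumed preferences.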
* TODO, http related
- Use Keep-alive. Then we should probably stop using 30 second pauses
between document retrievals.
- HTTP/1.1? HTTP/1.1 servers should do keep-alive even with 1.0 requests.
- Separate queues for each server, interleave requests.
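
The last item, separate queues per server with interleaved requests, could
be sketched as follows. This is Python rather than w3mir's Perl, and the
function name interleave_by_server is hypothetical. It only computes a
fetch order: URLs are grouped into one queue per server, then one URL is
taken from each queue in turn, so that no single server receives
back-to-back requests while other servers still have work queued.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit

def interleave_by_server(urls):
    """Return the URLs reordered round-robin across servers.

    Hypothetical helper for illustration only; not w3mir code.
    """
    # One queue per server (host[:port]), in order of first appearance.
    queues = OrderedDict()
    for url in urls:
        queues.setdefault(urlsplit(url).netloc, deque()).append(url)

    order = []
    while queues:
        # Take one URL from every server that still has work queued.
        for server in list(queues):
            order.append(queues[server].popleft())
            if not queues[server]:
                del queues[server]
    return order
```

A real implementation would also have to decide how pauses between
requests apply per server, which this sketch does not address.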
* If perl gets threads:
- Make the retrieval and analysis engines separate threads, and have
one retrieval thread per method/server/port and do parallel retrievals.