PROPAGANDA:

See http://www.math.uio.no/~janl/w3mir/ for propaganda.

--------------------------------------------------------------------------
Q: Where can I get a new version of w3mir?
A: The w3mir homepage is at http://www.math.uio.no/~janl/w3mir/.
   W3mir is also distributed on CPAN: http://www.perl.com/

Q: Are there any mailing lists?
A: Yes, see below.

Q: Should I subscribe to any of the mailing lists?
A: Yes.  If you use w3mir at all you should subscribe to
   w3mir-info@usit.uio.no.  Send e-mail to janl@math.uio.no to be
   subscribed.

Q: I found a bug!
A: See below.

--------------------------------------------------------------------------
BUGS:

- The -lc switch does not work too well.

Please see below for how to report bugs.

--------------------------------------------------------------------------
FEATURES (NOT bugs):

- URLs with two slashes ('//') in the path component do not work
  as some might expect.  According to my reading of the HTTP/URL specs
  it is an illegal construct, which is a Good Thing, because I don't
  know how to handle it if it's legal.
- If you start at http://foo/bar/ then index.html might be fetched twice.
- Some documents point to a location above the server root, e.g.,
  http://some.server/../stuff.html.  Netscape and other browsers, in
  defiance of the URL standards documents, will change the URL to
  http://some.server/stuff.html.  W3mir will not.
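
The two URL cases above can be illustrated with a small sketch.  This is
Python written for this README, not w3mir code; the function names
has_empty_path_segment and clamp_above_root are made up for the example.
The first detects a '//' in the path component, the second shows the
browser-style rewrite that w3mir deliberately does not perform.

```python
from urllib.parse import urlsplit, urlunsplit


def has_empty_path_segment(url):
    """Return True if the path component (not the 'http://' part) contains '//'."""
    return "//" in urlsplit(url).path


def clamp_above_root(url):
    """Browser-style cleanup: '..' segments that would climb above the
    server root are silently dropped.  Hypothetical helper, for illustration."""
    parts = urlsplit(url)
    kept = []
    for segment in parts.path.split("/")[1:]:
        if segment == "..":
            if kept:
                kept.pop()  # ordinary step up inside the tree
            # otherwise we are already at the root and the '..' is dropped
        elif segment != ".":
            kept.append(segment)
    path = "/" + "/".join(kept)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(has_empty_path_segment("http://foo/a//b.html"))
    print(clamp_above_root("http://some.server/../stuff.html"))
```

A mirroring tool that keeps the URL as written, as w3mir does, would
simply skip the clamp_above_root step.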

--------------------------------------------------------------------------
MAIL LISTS, REPORTING BUGS:

Please send bug reports to w3mir-core@usit.uio.no, and include the URL
and command line that triggered the bug.  Send ideas (please see the
TODO lists further down first), questions about usage, general
discussion and other related talk to w3mir-info@usit.uio.no.  To
subscribe to these lists, email janl@math.uio.no.  The w3mir-core list
is intended for w3mir hackers only.

--------------------------------------------------------------------------
COPYRIGHTS:

w3mir, w3http.pm, w3pdfuri.pm and htmlop.pm are free, but they are
copyrighted by the various hackers involved.  If you want to copy,
hack or distribute w3mir you can do so provided you comply with the
'Artistic License' enclosed in the w3mir distribution in the file
named Artistic.

--------------------------------------------------------------------------
CREDITS:

- Oscar Nierstrasz: Wrote htget
- Gorm Haug Eriksen: Started w3mir on the foundations of htget,
  contributed code later.
- Nicolai Langfeldt: Learning from Oscar's and Gorm's mistakes, rewrote
  everything.
- Chris Szurgot: Adapting to win32, good ideas and code contributions,
  debugging.  And criticism.
- Ed Jordan: Patch, debugging.
- Rik Faith: Uses w3mir extensively, not shy about complaining,
  commenting and suggesting.
- The libwww-perl author(s), who made adding some new features
  ridiculously easy.

--------------------------------------------------------------------------
TODO LIST:

Currently I'm preparing for version 1 of w3mir.

* TODO for version 1:

- Fix bugs discovered.
- Release 1.0

* TODO, after version 1:

Some of these are speculative, some others are very useful.

- CSS parsing/support at the same level as HTML
- Full support for APPLET/OBJECT tags.
- Alias rules.  These would enable w3mir to map ukoln.bath.ac.uk and
  bubl.bath.ac.uk to www.ukoln.ac.uk and know that the objects contained
  in these are all the same.  Another use would be to mirror from a mirror
  instead of the _real_ site, when the original site to which you have
  references is on a slow link while the mirror is on a fast link.
- FTP support (easy if through a http style ftp proxy, but is that what
  we want?)
- SSL support
- Retrieve recursively until N-th order links are reached.  This
  differs significantly from the directory recursion which we do now.  When
  this is done w3mir should also know the difference between inline and
  external links; inline links should always be retrieved.  Trivia question:
  What order is needed to reach every findable document on the web from
  Yahoo?
- Integrate with cvs or rcs (or another version control system) to make
  the retriever able to reproduce the mirrored site for any given date.
- Some text processing:  Adding and removing text/sgml comments when suitable
  options and tags are found.  Suggested by Ed Jordan.
- Put retrieval date-stamps as comments in HTML files, to document when
  and how the document was retrieved.
- Implement a way to get only one version of documents provided in
  multiple versions, something like a multi-axis preference list to get
  only the most attractive version of the doc.  Example: you're mirroring
  a site primarily to get to the papers, but the site has n versions of
  each paper (foo.ps.gz, foo.ps.Z, foo.dvi.gz, foo.dvi.Z, foo.tar.gz,
  foo.zip) and you only need one version.
- Logging of retrievals to file; this needs every print changed to a
  function call.
- Your suggestion here.
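
The "one version of each paper" idea in the list above could look
something like the following.  This is only a sketch in Python, not
w3mir code: the two preference lists, their ordering, and the function
names are assumptions chosen for the example.  Each file is ranked on two
axes (document format, then compression) and the best-ranked file per
base name wins.

```python
# Hypothetical preference axes, most attractive first.
FORMAT_PREFERENCE = ["ps", "dvi", "tar", "zip"]
COMPRESSION_PREFERENCE = ["gz", "Z", ""]


def split_name(filename):
    """Split 'foo.ps.gz' into ('foo', 'ps', 'gz') and 'foo.zip' into ('foo', 'zip', '')."""
    parts = filename.split(".")
    if len(parts) == 1:
        return filename, "", ""
    if len(parts) >= 3 and parts[-1] in ("gz", "Z"):
        return ".".join(parts[:-2]), parts[-2], parts[-1]
    return ".".join(parts[:-1]), parts[-1], ""


def rank(filename):
    """Lower tuples are more attractive; unknown values sort last on their axis."""
    _, fmt, compression = split_name(filename)

    def position(value, preference):
        return preference.index(value) if value in preference else len(preference)

    return (position(fmt, FORMAT_PREFERENCE),
            position(compression, COMPRESSION_PREFERENCE))


def pick_preferred(filenames):
    """Return a dict mapping each base name to its most attractive version."""
    best = {}
    for name in filenames:
        base = split_name(name)[0]
        if base not in best or rank(name) < rank(best[base]):
            best[base] = name
    return best


if __name__ == "__main__":
    versions = ["foo.ps.gz", "foo.ps.Z", "foo.dvi.gz",
                "foo.dvi.Z", "foo.tar.gz", "foo.zip"]
    print(pick_preferred(versions))
```

Comparing tuples keeps the axes independent: changing the order of either
list changes the winner without touching the selection code.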

* TODO, HTTP related:
- Use Keep-Alive.  Then we should probably stop using 30 second pauses
  between document retrievals.
- HTTP/1.1?  HTTP/1.1 servers should do keep-alive even with 1.0 requests.
- Separate queues for each server, interleave requests.
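
The last item above, one queue per server with interleaved requests, can
be sketched as a simple round-robin.  This is Python for illustration
only, not w3mir code; interleave_by_server is a made-up name, and using
the host part of the URL as the queue key is an assumption.

```python
from collections import deque
from urllib.parse import urlsplit


def interleave_by_server(urls):
    """Put each URL in a queue for its server, then take one URL from each
    queue in turn so that no single server gets back-to-back requests
    while other servers still have work pending."""
    queues = {}
    for url in urls:
        queues.setdefault(urlsplit(url).netloc, deque()).append(url)

    order = []
    while queues:
        for host in list(queues):
            order.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]  # this server has nothing left to fetch
    return order


if __name__ == "__main__":
    for url in interleave_by_server(["http://a/1", "http://a/2", "http://b/1"]):
        print(url)
```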

* If perl gets threads:
- Make the retrieval and analysis engines separate threads, and have
  one retrieval thread per method/server/port to do parallel retrievals.
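
A rough sketch of that thread layout, in Python rather than Perl and not
taken from w3mir: one retrieval thread per (method, server, port) group
feeds a queue that a single analysis thread drains.  All names here are
hypothetical, and fake_fetch is a stand-in that does no network I/O.

```python
import queue
import threading
from collections import defaultdict
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def server_key(url):
    """Group key: (method, server, port), filling in the default port."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port or DEFAULT_PORTS.get(parts.scheme))


def fake_fetch(url):
    """Placeholder for a real retrieval; returns a dummy document body."""
    return "<html>body of %s</html>" % url


def retrieve_worker(urls, fetched):
    """Retrieval thread: fetch this group's URLs one after another."""
    for url in urls:
        fetched.put((url, fake_fetch(url)))


def mirror(urls):
    groups = defaultdict(list)
    for url in urls:
        groups[server_key(url)].append(url)

    fetched = queue.Queue()
    analysed = {}

    def analyse():
        """Analysis thread: consume documents until the None sentinel arrives."""
        while True:
            item = fetched.get()
            if item is None:
                break
            url, body = item
            analysed[url] = len(body)  # stand-in for real HTML analysis

    analyser = threading.Thread(target=analyse)
    analyser.start()
    workers = [threading.Thread(target=retrieve_worker, args=(group, fetched))
               for group in groups.values()]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    fetched.put(None)  # all retrieval is done; tell the analyser to stop
    analyser.join()
    return analysed


if __name__ == "__main__":
    print(sorted(mirror(["http://a/1", "http://a:8080/2", "http://b/1"])))
```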