File: crawler.txt

This file contains an explanation of the crawl variables.

$CRAWL is assumed to be a $CRAWLER_OBJECT returned by crawl_new().
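
As a quick orientation, a minimal sketch of driving a crawl follows.  The
exact crawl_new() signature is not documented in this file; the start-URL
and max-depth arguments below are assumptions based on the $CRAWL->{start}
and $CRAWL->{depth} entries described later.

	use LW2;

	# assumed signature: start URL + max depth (see note above)
	my $CRAWL = LW2::crawl_new('http://www.example.com/', 2);

	$CRAWL->{config}->{save_cookies} = 1;  # config keys listed below
	$CRAWL->{crawl}->();                   # same as LW2::crawl($CRAWL)

	print "parsed $CRAWL->{parsed_page_count} page(s)\n";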

---------------------------------------------------------------------------
Crawl data structures
---------------------------------------------------------------------------

%$CRAWL->{config}
- configuration values (see below); key=config key, value=value of key

&$CRAWL->{crawl}
- subfunction which just calls LW2::crawl($CRAWL)

&$CRAWL->{reset}
- subfunction which resets all the values in $CRAWL

%$CRAWL->{track}
- All the URLs seen/requested; key=url, value=HTTP response code, or '?' if
  not actually requested

%$CRAWL->{request}
- Libwhisker request hash used during crawling

%$CRAWL->{response}
- Libwhisker response hash used during crawling

$CRAWL->{depth}
- Default max depth set by crawl_new()

$CRAWL->{start}
- Default start URL set by crawl_new()

@$CRAWL->{errors}
- All errors encountered during crawling

@$CRAWL->{urls}
- Temporary array used internally by crawl()

%$CRAWL->{server_tags}
- Server banners encountered while crawling; key=banner, value=# times seen

%$CRAWL->{referrers}
- Keeps track of who refers to what URL; key=target URL, value=anon array
  of all URLs that point to it

%$CRAWL->{offsites}
- All URLs that point to other hosts; key=URL, value=# times seen

%$CRAWL->{non_http}
- All non-http/https URLs found; key=URL, value=# times seen

%$CRAWL->{cookies}
- All cookies encountered during crawling; key=cookie string, value=# times
  seen

%$CRAWL->{forms}
- URLs which were the target of <form> tags; key=URL, value=# times seen

%$CRAWL->{jar}
- Temporary hash used internally by crawl() to track cookies

$CRAWL->{parsed_page_count}
- The number of HTML pages parsed for URLs
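
Once a crawl completes, these are ordinary Perl structures and can be
walked directly.  A short sketch, continuing the $CRAWL object from the
example above (the referrers hash is only populated when save_referrers
is enabled, as described in the config section):

	# URLs seen, with response codes ('?' = never actually requested)
	while (my ($url, $code) = each %{ $CRAWL->{track} }) {
		print "$code $url\n";
	}

	# target URL -> anon array of URLs that point to it
	for my $target (keys %{ $CRAWL->{referrers} }) {
		print "$target <- @{ $CRAWL->{referrers}->{$target} }\n";
	}

	# any errors hit along the way
	print "ERROR: $_\n" for @{ $CRAWL->{errors} };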

---------------------------------------------------------------------------
Crawl config options & values:
---------------------------------------------------------------------------

You generally access the values below by:

	$CRAWL->{config}->{KEY}=VALUE;

where 'KEY' is the name of the config key (such as save_cookies), and
VALUE is the value to assign to that key.
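
For example, a typical setup might look like the following sketch (every
key used here is described below):

	$CRAWL->{config}->{save_cookies}   = 1;
	$CRAWL->{config}->{save_referrers} = 1;
	$CRAWL->{config}->{follow_moves}   = 1;
	$CRAWL->{config}->{url_limit}      = 500;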

---------------------------------------------------------------------------

save_cookies (value: 0 or 1)
- save encountered cookies into %$CRAWL->{cookies}; key is entire cookie
  string, value is how many times cookie was encountered

save_offsites (value: 0 or 1)
- save all URLs not on this host to %$CRAWL->{offsites}; key is
  offsite URL, value is how many times it was referenced

save_referrers (value: 0 or 1)
- save the URLs that refer to the given URL in %$CRAWL->{referrers};
  key is target URL, and the value is an anon array of all URLs that
  referred to it

save_non_http (value: 0 or 1)
- save any non-http/https URLs into %$CRAWL->{non_http}; basically
  all your ftp://, mailto:, and javascript: URLs, etc.

follow_moves (value: 0 or 1)
- crawl will transparently follow the URL given in a 30x move response

use_params (value: 0 or 1)
- crawl will factor in URI parameters when considering if a URI is unique 
  or not (otherwise parameters are discarded)

params_double_record (value: 0 or 1)
- if both use_params and params_double_record are set, crawl will make two
  track entries for each URI which has parameters: one with and one without
  the parameters

reuse_cookies (value: 0 or 1)
- crawl will resubmit any received/prior cookies, much like a browser would

skip_ext (value: anonymous hash)
- the keys of the anonymous hash are file extensions that crawl() should
  skip trying to crawl; defaults to common binary/multimedia files (gif,
  jpg, pdf, etc)
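
  For example, extra extensions can be added (a sketch; whether keys carry
  a leading dot is not documented here, so inspect the defaults first and
  match their format):

	# see what the defaults look like before adding your own
	print join(' ', keys %{ $CRAWL->{config}->{skip_ext} }), "\n";

	# assumed key format: bare extension, no leading dot
	$CRAWL->{config}->{skip_ext}->{zip} = 1;
	$CRAWL->{config}->{skip_ext}->{exe} = 1;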

save_skipped (value: 0 or 1)
- any URLs that are skipped via skip_ext, or that are beyond the specified
  DEPTH, will be recorded in the tracking hash with a value of '?'
  (instead of an HTTP response code).
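
  For example, the skipped entries can be pulled back out of the tracking
  hash afterwards (a sketch):

	# everything recorded but never actually fetched
	my @skipped = grep { $CRAWL->{track}->{$_} eq '?' }
	              keys %{ $CRAWL->{track} };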

callback (value: 0 or \&sub)
- crawl will call this function (if this is a reference to a function), 
  passing it the current URI and the @ST array.  If the function returns a 
  TRUE value, then crawl will skip that URI.  Set to value 0 (zero) if you
  do not want to use a callback.
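
  A sketch of such a callback; the contents of @ST are not documented in
  this file, so it is merely received here:

	$CRAWL->{config}->{callback} = sub {
		my ($uri, @ST) = @_;          # @ST contents undocumented here
		return 1 if $uri =~ /logout/; # TRUE: crawl skips this URI
		return 0;                     # FALSE: crawl proceeds normally
	};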

netloc_bug (value: 0 or 1)
- technically a URL of the form '//www.host.com/url' is valid; the scheme
  (http/https) is assumed.  However, it's also possible to have bad
  relative references such as '//dir/file', which is similar in spirit
  to '/dir//file' (i.e. too many slashes).  When netloc_bug is enabled,
  any URL of the form '//blah/url' will be turned into 'http://blah/url'.
  This option was formerly called 'slashdot_bug' in LW 1.x, since 
  slashdot.org was the first site I encountered using it (it makes for a
  great way to catch web crawlers ;)  Note that this is enabled by
  default.

source_callback (value: 0 or \&sub)
- crawl will call this function (if this is a reference to a function), 
  passing references to %hin and %hout, right before it parses the page
  for HTML links.  This allows the callback function to review or
  modify the HTML before it's parsed for links.  Return value is ignored.
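
  A sketch; note that the location of the page body inside the response
  hash ($hout->{whisker}->{data} below) is an assumption not confirmed by
  this file:

	$CRAWL->{config}->{source_callback} = sub {
		my ($hin, $hout) = @_;  # request/response hash refs

		# strip HTML comments before link parsing; the body slot
		# used here is an assumption (see note above)
		$hout->{whisker}->{data} =~ s/<!--.*?-->//gs
			if defined $hout->{whisker}->{data};

		# return value is ignored by crawl()
	};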
  
url_limit (value: integer)
- number of URLs that crawl will queue up at one time; defaults to 1000

do_head (value: 0 or 1)
- use HEAD requests to determine if a file has a content-type worth
  downloading.  Potentially saves some time, assuming the server properly
  supports HEAD requests.  Set to value 1 to use (0/off by default).

normalize_uri (value: 0 or 1)
- when set, crawl() will normalize found URIs in order to ensure there
  are not duplicates (normalization means turning '/blah/../foo' and
  '/./foo' into '/foo')