This file explains the crawl variables.
$CRAWL is assumed to be the crawler object returned by crawl_new().
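
A minimal usage sketch for orientation. The http_new_request() helper and the
exact crawl_new() argument order shown here are assumptions (check the LW2
POD); www.example.com is just a placeholder host:

  use LW2;   # or require 'LW2.pm', depending on how Libwhisker is installed

  # build a request hash describing the target host (assumed helper/options)
  my $request = LW2::http_new_request(host => 'www.example.com');

  # start URL, max depth, and the request hash; argument order is assumed
  my $CRAWL = LW2::crawl_new('http://www.example.com/', 2, $request);

  # run the crawl; this subfunction just calls LW2::crawl($CRAWL)
  $CRAWL->{crawl}->();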
---------------------------------------------------------------------------
Crawl data structures
---------------------------------------------------------------------------
%$CRAWL->{config}
- configuration values (see below); key=config key, value=value of key
&$CRAWL->{crawl}
- subfunction which just calls LW2::crawl($CRAWL)
&$CRAWL->{reset}
- subfunction which resets all the values in $CRAWL
%$CRAWL->{track}
- All the URLs seen/requested; key=url, value=HTTP response code, or '?' if
not actually requested
%$CRAWL->{request}
- Libwhisker request hash used during crawling
%$CRAWL->{response}
- Libwhisker response hash used during crawling
$CRAWL->{depth}
- Default max depth set by crawl_new()
$CRAWL->{start}
- Default start URL set by crawl_new()
@$CRAWL->{errors}
- All errors encountered during crawling
@$CRAWL->{urls}
- Temporary array used internally by crawl()
%$CRAWL->{server_tags}
- Server banners encountered while crawling; key=banner, value=# times seen
%$CRAWL->{referrers}
- Keeps track of who refers to what URL; key=target URL, value=anon array
of all URLs that point to it
%$CRAWL->{offsites}
- All URLs that point to other hosts; key=URL, value=# times seen
%$CRAWL->{non_http}
- All non-http/https URLs found; key=URL, value=# times seen
%$CRAWL->{cookies}
- All cookies encountered during crawling; key=cookie string, value=# times
seen
%$CRAWL->{forms}
- URLs which were the target of <form> tags; key=URL, value=# times seen
%$CRAWL->{jar}
- Temporary hash used internally by crawl() to track cookies
$CRAWL->{parsed_page_count}
- The number of HTML pages parsed for URLs
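
Once a crawl has completed, the hashes above can be walked directly. A short
sketch, using only keys documented in this file:

  # value is an HTTP response code, or '?' if never actually requested
  while (my ($url, $code) = each %{ $CRAWL->{track} }) {
      print "$url => $code\n";
  }

  print "Parsed ", $CRAWL->{parsed_page_count}, " HTML pages\n";

  # server banners and how many times each was seen
  foreach my $banner (keys %{ $CRAWL->{server_tags} }) {
      print "Server: $banner (seen $CRAWL->{server_tags}->{$banner} times)\n";
  }

  # any errors collected while crawling
  print "Error: $_\n" foreach @{ $CRAWL->{errors} };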
---------------------------------------------------------------------------
Crawl config options & values:
---------------------------------------------------------------------------
You generally set the values below via:
$CRAWL->{config}->{KEY}=VALUE;
where 'KEY' is the config key name (such as save_cookies) and VALUE is the
value to assign to that key.
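
For example, a short sketch that only touches keys documented below:

  $CRAWL->{config}->{save_cookies}  = 1;   # record cookies in %$CRAWL->{cookies}
  $CRAWL->{config}->{save_offsites} = 1;   # record off-host URLs
  $CRAWL->{config}->{do_head}       = 0;   # leave HEAD pre-checks off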
---------------------------------------------------------------------------
save_cookies (value: 0 or 1)
- save encountered cookies into %$CRAWL->{cookies}; key is entire cookie
string, value is how many times cookie was encountered
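  For example, listing the saved cookies after a crawl might look like:

      while (my ($cookie, $count) = each %{ $CRAWL->{cookies} }) {
          print "seen $count time(s): $cookie\n";
      }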
save_offsites (value: 0 or 1)
- save all URLs not on this host to %$CRAWL->{offsites}; key is
offsite URL, value is how many times it was referenced
save_referrers (value: 0 or 1)
- save the URLs that refer to the given URL in %$CRAWL->{referrers};
key is target URL, and the value is an anon array of all URLs that
referred to it
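  A sketch of dumping the referrer map after a crawl (structure as above):

      foreach my $target (keys %{ $CRAWL->{referrers} }) {
          print "$target is linked to from:\n";
          print "    $_\n" foreach @{ $CRAWL->{referrers}->{$target} };
      }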
save_non_http (value: 0 or 1)
- save any non-http/https URLs into %$CRAWL->{non_http}; basically
all your ftp://, mailto:, and javascript: URLs, etc.
follow_moves (value: 0 or 1)
- crawl will transparently follow the URL given in a 30x move response
use_params (value: 0 or 1)
- crawl will factor in URI parameters when considering if a URI is unique
or not (otherwise parameters are discarded)
params_double_record (value: 0 or 1)
- if both use_params and params_double_record are set, crawl will make two
track entries for each URI which has parameters: one with and one without
the parameters
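  For example (a sketch; '/index.cgi?id=2' is just a hypothetical URI):

      $CRAWL->{config}->{use_params}           = 1;
      $CRAWL->{config}->{params_double_record} = 1;
      # a crawled '/index.cgi?id=2' would then produce two track entries:
      # '/index.cgi?id=2' and '/index.cgi'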
reuse_cookies (value: 0 or 1)
- crawl will resubmit any received/prior cookies, much like a browser would
skip_ext (value: anonymous hash)
- the keys of the anonymous hash are file extensions that crawl() should
skip trying to crawl; defaults to common binary/multimedia files (gif,
jpg, pdf, etc)
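  For example (a sketch; whether the keys carry a leading dot is an
  assumption, so mirror whatever crawl_new() put in the hash):

      $CRAWL->{config}->{skip_ext}->{mpg} = 1;      # also skip .mpg files
      delete $CRAWL->{config}->{skip_ext}->{pdf};   # do crawl .pdf files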
save_skipped (value: 0 or 1)
- any URLs that are skipped via skip_ext, or that are beyond the specified DEPTH,
will be recorded in the tracking hash with a value of '?' (instead of an
HTTP response code).
callback (value: 0 or \&sub)
- crawl will call this function (if this is a reference to a function),
passing it the current URI and the @ST array. If the function returns a
TRUE value, then crawl will skip that URI. Set to value 0 (zero) if you
do not want to use a callback.
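  A sketch of a callback that skips anything under /admin/; the extra
  arguments are simply ignored here, since @ST is internal to crawl():

      sub my_callback {
          my ($uri, @st) = @_;              # current URI plus crawl()'s @ST
          return 1 if $uri =~ m{/admin/};   # TRUE  => skip this URI
          return 0;                         # FALSE => crawl it
      }
      $CRAWL->{config}->{callback} = \&my_callback;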
netloc_bug (value: 0 or 1)
- technically a URL of the form '//www.host.com/url' is valid; the scheme
(http/https) is assumed. However, it's also possible to have bad
relative references such as '//dir/file', which is similar in spirit
to '/dir//file' (i.e. too many slashes). When netloc_bug is enabled,
any URL of the form '//blah/url' will be turned into 'http://blah/url'.
This option was formerly called 'slashdot_bug' in LW 1.x, since
slashdot.org was the first site I encountered using it (it makes for a
great way to catch web crawlers ;) Note that this is enabled by
default.
source_callback (value: 0 or \&sub)
- crawl will call this function (if this is a reference to a function),
passing references to %hin and %hout, right before it parses the page
for HTML links. This allows the callback function to review or
modify the HTML before it's parsed for links. Return value is ignored.
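  A sketch of a source callback that strips HTML comments before the page is
  parsed for links; where the page body lives inside %hout (the
  'whisker'->'data' key used below) is an assumption, so verify it against
  the LW2 POD:

      sub my_source_callback {
          my ($hin, $hout) = @_;            # request / response hash refs
          $hout->{whisker}->{data} =~ s/<!--.*?-->//gs
              if defined $hout->{whisker}->{data};
          return;                           # return value is ignored
      }
      $CRAWL->{config}->{source_callback} = \&my_source_callback;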
url_limit (value: integer)
- number of URLs that crawl will queue up at one time; defaults to 1000
do_head (value: 0 or 1)
- use head requests to determine if a file has a content-type worth
downloading. Potentially saves some time, assuming the server properly
supports HEAD requests. Set to value 1 to use (0/off by default).
normalize_uri (value: 0 or 1)
- when set, crawl() will normalize found URIs in order to ensure there
are not duplicates (normalization means turning '/blah/../foo' and
'/./foo' into '/foo')