This file contains an explanation of the crawl variables.  $CRAWL is
assumed to be a $CRAWLER_OBJECT returned by crawl_new().

---------------------------------------------------------------------------
Crawl data structures
---------------------------------------------------------------------------

%$CRAWL->{config}       - configuration values (see below); key=config key,
                          value=value of key
&$CRAWL->{crawl}        - subfunction which just calls LW2::crawl($CRAWL)
&$CRAWL->{reset}        - subfunction which resets all the values in $CRAWL
%$CRAWL->{track}        - All the URLs seen/requested; key=url, value=HTTP
                          response code, or '?' if not actually requested
%$CRAWL->{request}      - Libwhisker request hash used during crawling
%$CRAWL->{response}     - Libwhisker response hash used during crawling
$CRAWL->{depth}         - Default max depth set by crawl_new()
$CRAWL->{start}         - Default start URL set by crawl_new()
@$CRAWL->{errors}       - All errors encountered while crawling
@$CRAWL->{urls}         - Temporary array used internally by crawl()
%$CRAWL->{server_tags}  - Server banners encountered while crawling;
                          key=banner, value=# times seen
%$CRAWL->{referrers}    - Keeps track of who refers to what URL; key=target
                          URL, value=anon array of all URLs that point to it
%$CRAWL->{offsites}     - All URLs that point to other hosts; key=URL,
                          value=# times seen
%$CRAWL->{non_http}     - All non-http/https URLs found; key=URL,
                          value=# times seen
%$CRAWL->{cookies}      - All cookies encountered during crawling;
                          key=cookie string, value=# times seen
%$CRAWL->{forms}        - URLs which were the target of <form> tags;
                          key=URL, value=# times seen
%$CRAWL->{jar}          - Temporary hash used internally by crawl() to
                          track cookies
$CRAWL->{parsed_page_count} - The number of HTML pages parsed for URLs
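As a rough sketch of how these structures might be read after a crawl
(the crawl_new()/crawl() parameters shown are assumptions based on the
LW2 function reference; check that document for the exact signatures, and
note that www.example.com is just a placeholder host):

    use LW2;

    # Build a crawler object: start URL, max depth, and a request hash to
    # use for each fetch (assumed crawl_new() parameter order).
    my $request = LW2::http_new_request(host => 'www.example.com');
    my $CRAWL   = LW2::crawl_new('http://www.example.com/', 2, $request);

    LW2::crawl($CRAWL);   # uses the default start URL and depth in $CRAWL

    # every URL seen, with its HTTP response code ('?' if never requested)
    foreach my $url (sort keys %{ $CRAWL->{track} }) {
        print "$url  $CRAWL->{track}->{$url}\n";
    }

    # server banners and how many times each was seen
    foreach my $banner (keys %{ $CRAWL->{server_tags} }) {
        print "Server: $banner ($CRAWL->{server_tags}->{$banner})\n";
    }

    # any errors encountered while crawling
    print "Error: $_\n" foreach @{ $CRAWL->{errors} };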
---------------------------------------------------------------------------
Crawl config options & values:
---------------------------------------------------------------------------

You generally access the values below by:

        $CRAWL->{config}->{KEY} = VALUE;

where 'KEY' is the target config key (such as save_cookies), and VALUE is
the config value for that key.
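For example (the particular keys and values here are arbitrary picks from
the list below):

    $CRAWL->{config}->{save_cookies}    = 1;  # record cookies in %$CRAWL->{cookies}
    $CRAWL->{config}->{follow_moves}    = 1;  # transparently follow 30x redirects
    $CRAWL->{config}->{skip_ext}->{iso} = 1;  # also skip .iso links (bare-extension
                                              # keys, like the gif/jpg/pdf defaults)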
save_cookies (value: 0 or 1) - save encountered cookies into
        %$CRAWL->{cookies}; key is the entire cookie string, value is how
        many times the cookie was encountered

save_offsites (value: 0 or 1) - save all URLs not on this host to
        %$CRAWL->{offsites}; key is the offsite URL, value is how many
        times it was referenced

save_referrers (value: 0 or 1) - save the URLs that refer to a given URL
        in %$CRAWL->{referrers}; key is the target URL, and the value is
        an anon array of all URLs that referred to it

save_non_http (value: 0 or 1) - save any non-http/https URLs into
        %$CRAWL->{non_http}; basically all your ftp://, mailto:, and
        javascript: URLs, etc.

follow_moves (value: 0 or 1) - crawl will transparently follow the URL
        given in a 30x move response

use_params (value: 0 or 1) - crawl will factor in URI parameters when
        considering whether a URI is unique or not (otherwise parameters
        are discarded)

params_double_record (value: 0 or 1) - if both use_params and
        params_double_record are set, crawl will make two track entries
        for each URI which has parameters: one with and one without the
        parameters

reuse_cookies (value: 0 or 1) - crawl will resubmit any received/prior
        cookies, much like a browser would

skip_ext (value: anonymous hash) - the keys of the anonymous hash are file
        extensions that crawl() should skip trying to crawl; defaults to
        common binary/multimedia files (gif, jpg, pdf, etc)

save_skipped (value: 0 or 1) - any URLs that are skipped via skip_ext, or
        are above the specified DEPTH, will be recorded in the tracking
        hash with a value of '?' (instead of an HTTP response code)

callback (value: 0 or \&sub) - crawl will call this function (if this is a
        reference to a function), passing it the current URI and the @ST
        array; if the function returns a TRUE value, then crawl will skip
        that URI. Set to 0 (zero) if you do not want to use a callback.
        See the sketch after this list.

netloc_bug (value: 0 or 1) - technically a URL of the form
        '//www.host.com/url' is valid; the scheme (http/https) is assumed.
        However, it's also possible to have bad relative references such
        as '//dir/file', which is similar in spirit to '/dir//file' (i.e.
        too many slashes). When netloc_bug is enabled, any URL of the form
        '//blah/url' will be turned into 'http://blah/url'. This option
        was formerly called 'slashdot_bug' in LW 1.x, since slashdot.org
        was the first site I encountered using it (it makes for a great
        way to catch web crawlers ;)  Note that this is enabled by
        default.

source_callback (value: 0 or \&sub) - crawl will call this function (if
        this is a reference to a function), passing references to %hin and
        %hout, right before it parses the page for HTML links. This allows
        the callback function to review or modify the HTML before it's
        parsed for links. The return value is ignored. See the sketch
        after this list.

url_limit (value: integer) - the number of URLs that crawl will queue up
        at one time; defaults to 1000

do_head (value: 0 or 1) - use HEAD requests to determine whether a file
        has a content-type worth downloading; potentially saves some time,
        assuming the server properly supports HEAD requests. Set to 1 to
        use (0/off by default).

normalize_uri (value: 0 or 1) - when set, crawl() will normalize found
        URIs in order to ensure there are no duplicates (normalization
        means turning '/blah/../foo' and '/./foo' into '/foo')
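A minimal sketch of the two callback options (the logout-skipping pattern
and the comment-stripping edit are arbitrary examples, and the location of
the response body under $hout->{whisker}->{data} is assumed here):

    # Skip callback: crawl() passes the current URI and the @ST array;
    # returning a TRUE value makes crawl() skip that URI.
    sub skip_logout {
        my ($uri, @ST) = @_;
        return 1 if $uri =~ /logout/i;
        return 0;
    }

    # Source callback: called with references to %hin and %hout right
    # before the page is parsed for links; the return value is ignored.
    sub strip_comments {
        my ($hin, $hout) = @_;
        $hout->{whisker}->{data} =~ s/<!--.*?-->//gs
                if defined $hout->{whisker}->{data};
        return;
    }

    $CRAWL->{config}->{callback}        = \&skip_logout;
    $CRAWL->{config}->{source_callback} = \&strip_comments;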