File: crawler.txt

This file contains an explanation of the crawl variables.

$CRAWL is assumed to be a $CRAWLER_OBJECT returned by crawl_new().
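
As a quick orientation, a minimal sketch of driving a crawl follows.  The
exact crawl_new() signature is not documented in this file; the start-URL
and max-depth arguments below are assumptions based on the $CRAWL->{start}
and $CRAWL->{depth} entries described later.

	use LW2;

	# assumed signature: start URL + max depth (see note above)
	my $CRAWL = LW2::crawl_new('http://www.example.com/', 2);

	$CRAWL->{config}->{save_cookies} = 1;  # config keys listed below
	$CRAWL->{crawl}->();                   # same as LW2::crawl($CRAWL)

	print "parsed $CRAWL->{parsed_page_count} page(s)\n";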

---------------------------------------------------------------------------
Crawl data structures
---------------------------------------------------------------------------

%$CRAWL->{config}
- configuration values (see below); key=config key, value=value of key

&$CRAWL->{crawl}
- subfunction which just calls LW2::crawl($CRAWL)

&$CRAWL->{reset}
- subfunction which resets all the values in $CRAWL

%$CRAWL->{track}
- All the URLs seen/requested; key=url, value=HTTP response code, or '?' if
  not actually requested

%$CRAWL->{request}
- Libwhisker request hash used during crawling

%$CRAWL->{response}
- Libwhisker response hash used during crawling

$CRAWL->{depth}
- Default max depth set by crawl_new()

$CRAWL->{start}
- Default start URL set by crawl_new()

@$CRAWL->{errors}
- All errors encountered during crawling

@$CRAWL->{urls}
- Temporary array used internally by crawl()

%$CRAWL->{server_tags}
- Server banners encountered while crawling; key=banner, value=# times seen

%$CRAWL->{referrers}
- Keeps track of who refers to what URL; key=target URL, value=anon array
  of all URLs that point to it

%$CRAWL->{offsites}
- All URLs that point to other hosts; key=URL, value=# times seen

%$CRAWL->{non_http}
- All non-http/https URLs found; key=URL, value=# times seen

%$CRAWL->{cookies}
- All cookies encountered during crawling; key=cookie string, value=# times
  seen

%$CRAWL->{forms}
- URLs which were the target of <form> tags; key=URL, value=# times seen

%$CRAWL->{jar}
- Temporary hash used internally by crawl() to track cookies

$CRAWL->{parsed_page_count}
- The number of HTML pages parsed for URLs
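
Once a crawl completes, these are ordinary Perl structures and can be
walked directly.  A short sketch, continuing the $CRAWL object from the
example above (the referrers hash is only populated when save_referrers
is enabled, as described in the config section):

	# URLs seen, with response codes ('?' = never actually requested)
	while (my ($url, $code) = each %{ $CRAWL->{track} }) {
		print "$code $url\n";
	}

	# target URL -> anon array of URLs that point to it
	for my $target (keys %{ $CRAWL->{referrers} }) {
		print "$target <- @{ $CRAWL->{referrers}->{$target} }\n";
	}

	# any errors hit along the way
	print "ERROR: $_\n" for @{ $CRAWL->{errors} };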

---------------------------------------------------------------------------
Crawl config options & values:
---------------------------------------------------------------------------

You generally access the values below by:

	$CRAWL->{config}->{KEY}=VALUE;

where 'KEY' is the name of the config key (such as save_cookies), and
VALUE is the value to assign to that key.
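
For example, a typical setup might look like the following sketch (every
key used here is described below):

	$CRAWL->{config}->{save_cookies}   = 1;
	$CRAWL->{config}->{save_referrers} = 1;
	$CRAWL->{config}->{follow_moves}   = 1;
	$CRAWL->{config}->{url_limit}      = 500;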

---------------------------------------------------------------------------

save_cookies (value: 0 or 1)
- save encountered cookies into %$CRAWL->{cookies}; key is entire cookie
  string, value is how many times cookie was encountered

save_offsites (value: 0 or 1)
- save all URLs not on this host to %$CRAWL->{offsites}; key is
  offsite URL, value is how many times it was referenced

save_referrers (value: 0 or 1)
- save the URLs that refer to the given URL in %$CRAWL->{referrers};
  key is target URL, and the value is an anon array of all URLs that
  referred to it

save_non_http (value: 0 or 1)
- save any non-http/https URLs into %$CRAWL->{non_http}; basically
  all your ftp://, mailto:, and javascript: URLs, etc.

follow_moves (value: 0 or 1)
- crawl will transparently follow the URL given in a 30x move response

use_params (value: 0 or 1)
- crawl will factor in URI parameters when considering if a URI is unique 
  or not (otherwise parameters are discarded)

params_double_record (value: 0 or 1)
- if both use_params and params_double_record are set, crawl will make two
  track entries for each URI which has parameters: one with and one without
  the parameters

reuse_cookies (value: 0 or 1)
- crawl will resubmit any received/prior cookies, much like a browser would

skip_ext (value: anonymous hash)
- the keys of the anonymous hash are file extensions that crawl() should
  skip trying to crawl; defaults to common binary/multimedia files (gif,
  jpg, pdf, etc)
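
  For example, extra extensions can be added (a sketch; whether keys carry
  a leading dot is not documented here, so inspect the defaults first and
  match their format):

	# see what the defaults look like before adding your own
	print join(' ', keys %{ $CRAWL->{config}->{skip_ext} }), "\n";

	# assumed key format: bare extension, no leading dot
	$CRAWL->{config}->{skip_ext}->{zip} = 1;
	$CRAWL->{config}->{skip_ext}->{exe} = 1;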

save_skipped (value: 0 or 1)
- any URLs that are skipped via skip_ext, or that are beyond the specified
  DEPTH, will be recorded in the tracking hash with a value of '?'
  (instead of an HTTP response code).
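
  For example, the skipped entries can be pulled back out of the tracking
  hash afterwards (a sketch):

	# everything recorded but never actually fetched
	my @skipped = grep { $CRAWL->{track}->{$_} eq '?' }
	              keys %{ $CRAWL->{track} };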

callback (value: 0 or \&sub)
- crawl will call this function (if this is a reference to a function), 
  passing it the current URI and the @ST array.  If the function returns a 
  TRUE value, then crawl will skip that URI.  Set to value 0 (zero) if you
  do not want to use a callback.
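
  A sketch of such a callback; the contents of @ST are not documented in
  this file, so it is merely received here:

	$CRAWL->{config}->{callback} = sub {
		my ($uri, @ST) = @_;          # @ST contents undocumented here
		return 1 if $uri =~ /logout/; # TRUE: crawl skips this URI
		return 0;                     # FALSE: crawl proceeds normally
	};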

netloc_bug (value: 0 or 1)
- technically a URL of the form '//www.host.com/url' is valid; the scheme
  (http/https) is assumed.  However, it's also possible to have bad
  relative references such as '//dir/file', which is similar in spirit
  to '/dir//file' (i.e. too many slashes).  When netloc_bug is enabled,
  any URL of the form '//blah/url' will be turned into 'http://blah/url'.
  This option was formerly called 'slashdot_bug' in LW 1.x, since 
  slashdot.org was the first site I encountered using it (it makes for a
  great way to catch web crawlers ;)  Note that this is enabled by
  default.

source_callback (value: 0 or \&sub)
- crawl will call this function (if this is a reference to a function), 
  passing references to %hin and %hout, right before it parses the page
  for HTML links.  This allows the callback function to review or
  modify the HTML before it's parsed for links.  Return value is ignored.
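
  A sketch; note that the location of the page body inside the response
  hash ($hout->{whisker}->{data} below) is an assumption not confirmed by
  this file:

	$CRAWL->{config}->{source_callback} = sub {
		my ($hin, $hout) = @_;  # request/response hash refs

		# strip HTML comments before link parsing; the body slot
		# used here is an assumption (see note above)
		$hout->{whisker}->{data} =~ s/<!--.*?-->//gs
			if defined $hout->{whisker}->{data};

		# return value is ignored by crawl()
	};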
  
url_limit (value: integer)
- number of URLs that crawl will queue up at one time; defaults to 1000

do_head (value: 0 or 1)
- use HEAD requests to determine if a file has a content-type worth
  downloading.  Potentially saves some time, assuming the server properly
  supports HEAD requests.  Set to value 1 to use (0/off by default).

normalize_uri (value: 0 or 1)
- when set, crawl() will normalize found URIs in order to ensure there
  are not duplicates (normalization means turning '/blah/../foo' and
  '/./foo' into '/foo')