NAME
    Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or
    XPath expressions

SYNOPSIS
      use URI;
      use Web::Scraper;
      use Encode;

      # First, create your scraper block
      my $authors = scraper {
          # Parse all TDs inside 'table[width="100%"]' and store them in
          # an array 'authors'.  We embed another scraper for each TD.
          process 'table[width="100%"] td', "authors[]" => scraper {
            # And, in each TD,
            # get the URI of "a" element
            process "a", uri => '@href';
            # get text inside "small" element
            process "small", fullname => 'TEXT';
          };
      };

      my $res = $authors->scrape( URI->new("http://search.cpan.org/author/?A") );

      # iterate the array 'authors'
      for my $author (@{$res->{authors}}) {
          # output is like:
          # Andy Adler      http://search.cpan.org/~aadler/
          # Aaron K Dancygier       http://search.cpan.org/~aakd/
          # Aamer Akhter    http://search.cpan.org/~aakhter/
          print Encode::encode("utf8", "$author->{fullname}\t$author->{uri}\n");
      }

    The resulting data structure looks like this:

      {
          authors => [
              { fullname => $fullname, uri => $uri },
              { fullname => $fullname, uri => $uri },
          ]
      }

DESCRIPTION
    Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent
    Scrapi. It provides a DSL-ish interface for traversing HTML documents
    and returning a neatly arranged Perl data structure.

    The *scraper* and *process* blocks define which segments of a document
    to extract. Web::Scraper understands HTML and CSS Selectors as well as
    XPath expressions.

METHODS
  scraper
      $scraper = scraper { ... };

    Creates a new Web::Scraper object by wrapping the DSL code that will be
    fired when the *scrape* method is called.

  scrape
      $res = $scraper->scrape(URI->new($uri));
      $res = $scraper->scrape($html_content);
      $res = $scraper->scrape(\$html_content);
      $res = $scraper->scrape($http_response);
      $res = $scraper->scrape($html_element);

    Retrieves the HTML from URI, HTTP::Response, HTML::Tree or text strings
    and creates a DOM object, then fires the callback scraper code to
    retrieve the data structure.

    If you pass a URI or HTTP::Response object, Web::Scraper automatically
    guesses the encoding of the content by looking at the Content-Type
    header and META tags. Otherwise you need to decode the HTML to Unicode
    yourself before passing it to the *scrape* method.
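    As a minimal sketch of the second case, the following decodes raw bytes
    with Encode before scraping. The sample markup and the variable names
    are made up for illustration, and the input is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Web::Scraper;

# Hypothetical input: raw UTF-8 bytes, e.g. as read from a file or a socket.
my $bytes = "<html><body><h1>Caf\xC3\xA9</h1></body></html>";

# Decode the bytes into a Perl character string before scraping.
my $html = decode('UTF-8', $bytes);

my $scraper = scraper {
    process 'h1', title => 'TEXT';
};

my $res = $scraper->scrape($html);

# The extracted text is a character string, so encode it again on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$res->{title}\n";
```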

    When you pass the HTML content as a string instead of a URI or
    HTTP::Response, you can optionally pass the base URL as the second
    argument.

      $res = $scraper->scrape($html_content, "http://example.com/foo");

    This way Web::Scraper can resolve the relative links found in the
    document.
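    A small sketch of the effect, using made-up markup and a placeholder
    base URL. The expectation that the link comes back unresolved when no
    base URL is given is an assumption about the module's behaviour, so
    check it against your installed version.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical document containing a relative link.
my $html = '<html><body><a href="/about">About</a></body></html>';

my $scraper = scraper {
    process 'a', link => '@href';
};

# No base URL: the attribute value is expected to come back as written.
my $plain = $scraper->scrape($html);

# With a base URL (a placeholder here), the link is resolved against it.
my $resolved = $scraper->scrape($html, 'http://example.com/foo');

print "without base: $plain->{link}\n";
print "with base:    $resolved->{link}\n";
```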

  process
      scraper {
          process "tag.class", key => 'TEXT';
          process '//tag[contains(@foo, "bar")]', key2 => '@attr';
          process '//comment()', 'comments[]' => 'TEXT';
      };

    *process* finds the elements in the HTML that match a CSS selector or
    an XPath expression, then extracts their text or attributes into the
    result stash.

    If the first argument begins with "//" or "id(" it is treated as an
    XPath expression; otherwise it is treated as a CSS selector.

      # <span class="date">2008/12/21</span>
      # date => "2008/12/21"
      process ".date", date => 'TEXT';

      # <div class="body"><a href="http://example.com/">foo</a></div>
      # link => URI->new("http://example.com/")
      process ".body > a", link => '@href';

      # <div class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
      # comment => " HTML Comment here "
      #
      # NOTE: Comment nodes are accessible only when
      # HTML::TreeBuilder::XPath (version >= 0.14) and/or
      # HTML::TreeBuilder::LibXML (version >= 0.13) is installed.
      process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

      # <div class="body"><a href="http://example.com/">foo</a></div>
      # link => URI->new("http://example.com/"), text => "foo"
      process ".body > a", link => '@href', text => 'TEXT';

      # <ul><li>foo</li><li>bar</li></ul>
      # list => [ "foo", "bar" ]
      process "li", "list[]" => "TEXT";

      # <ul><li id="1">foo</li><li id="2">bar</li></ul>
      # list => [ { id => "1", text => "foo" }, { id => "2", text => "bar" } ];
      process "li", "list[]" => { id => '@id', text => "TEXT" };

  process_first
    "process_first" is the same as "process" but stops when the first
    matching result is found.

      # <span class="date">2008/12/21</span>
      # <span class="date">2008/12/22</span>
      # date => "2008/12/21"
      process_first ".date", date => 'TEXT';

  result
    "result" lets the scraper return something other than the default
    stash: the single value stored under one key, or a hash reference
    built from several keys.

      process 'a', 'want[]' => 'TEXT';
      result 'want';
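    The two forms can be compared side by side. This is only a sketch: the
    markup and the key names "heading" and "note" are invented for the
    example, and "result" is placed last so that its value becomes the
    value of the scraper block.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical document used only for this example.
my $html = <<'HTML';
<html><body>
<h1>Heading</h1>
<p>Some note</p>
<a href="/a">one</a><a href="/b">two</a>
</body></html>
HTML

# One key: scrape() returns the value stored under that key.
my $single = scraper {
    process 'a',  'want[]' => 'TEXT';
    process 'h1', heading  => 'TEXT';
    result 'want';
};

# Several keys: scrape() returns a hash reference holding only those keys.
my $several = scraper {
    process 'a',  'want[]' => 'TEXT';
    process 'h1', heading  => 'TEXT';
    process 'p',  note     => 'TEXT';
    result 'want', 'heading';
};

my $list = $single->scrape($html);
my $hash = $several->scrape($html);

print join(", ", @$list), "\n";
print join(", ", sort keys %$hash), "\n";
```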

EXAMPLES
    There are many examples in the "eg/" dir packaged in this distribution.
    It is recommended to look through these.

NESTED SCRAPERS
    Scrapers can be nested, so that an inner scraper runs against each node
    already matched by the outer one.

      # <ul>
      # <li class="foo"><a href="foo1">bar1</a></li>
      # <li class="bar"><a href="foo2">bar2</a></li>
      # <li class="foo"><a href="foo3">bar3</a></li>
      # </ul>
      # friends => [ {href => 'foo1'}, {href => 'foo2'}, {href => 'foo3'} ];
      process 'li', 'friends[]' => scraper {
        process 'a', href => '@href';
      };

FILTERS
    Filters are applied to the result after processing. They can be declared
    as anonymous subroutines or as class names.

      process $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];
      process $exp, $key => [ 'TEXT', 'Something' ];
      process $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];

    Filters can be stacked:

      process $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar', \&baz ];

    See the Web::Scraper::Filter documentation for more about filters.
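    The following sketch stacks a code filter and a class filter. The class
    MyApp::Filter::Upper is hypothetical and defined inline purely for
    illustration; it assumes, as described in Web::Scraper::Filter, that a
    filter class inherits from Web::Scraper::Filter and implements a
    "filter" method returning the new value.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical filter class, defined inline for the sake of the example.
package MyApp::Filter::Upper;
use base 'Web::Scraper::Filter';

sub filter {
    my ($self, $value) = @_;
    return uc $value;    # the returned value replaces the extracted one
}

package main;

# Hypothetical markup with surrounding whitespace in the text.
my $html = '<html><body><p class="name">  alice  </p></body></html>';

my $scraper = scraper {
    process '.name', name => [
        'TEXT',
        sub { s/^\s+|\s+$//g },     # code filter: edits $_ in place
        '+MyApp::Filter::Upper',    # class filter, given by its full name
    ];
};

my $res = $scraper->scrape($html);
print "$res->{name}\n";
```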

XML backends
    By default HTML::TreeBuilder::XPath is used. It can be replaced by an
    XML::LibXML backend using the Web::Scraper::LibXML module.

      use Web::Scraper::LibXML;

      # same as Web::Scraper
      my $scraper = scraper { ... };

AUTHOR
    Tatsuhiko Miyagawa <miyagawa@bulknews.net>

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

SEE ALSO
    <http://blog.labnotes.org/category/scrapi/>

    HTML::TreeBuilder::XPath