Parse-MediaWikiDump

Parse::MediaWikiDump is a collection of classes for processing MediaWiki
dump files, such as those published at
http://download.wikimedia.org/wikipedia/en/. The package requires XML::Parser.
It provides straightforward access to the information in supported dump
files.

Currently the following dump files are supported:
  * Current page dumps for all languages
  * Current links dumps for all languages

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install

LIMITATIONS

Parse::MediaWikiDump cannot currently handle full page dumps (dumps in which
a page has more than one revision). When it encounters one,
Parse::MediaWikiDump aborts processing of the archive.

Parse::MediaWikiDump is not as fast as it could be, but it is faster than
using most other XML parsing frameworks. The parser could be rewritten to be
faster and to handle full page dumps.
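
Because a full page dump aborts processing, it can help to check a file
before handing it to the parser. The following is a minimal sketch, not part
of Parse::MediaWikiDump: has_multiple_revisions is a hypothetical helper that
scans an uncompressed dump for opening <page> and <revision> tags. It assumes
that article text inside the dump is XML-escaped, so these tags only appear
as real elements.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper, not part of Parse::MediaWikiDump: returns 1 if any
# <page> in the file holds more than one <revision>, otherwise 0.
# It matches opening tags in the raw text instead of parsing the XML.
sub has_multiple_revisions {
  my ($file) = @_;
  open(my $fh, '<', $file) or die "unable to open $file: $!";

  my $revisions = 0;
  while (my $line = <$fh>) {
    # visit tags in document order, even when several share one line
    while ($line =~ /<(page|revision)[\s>]/g) {
      if ($1 eq 'page') {
        $revisions = 0;            # a new page starts, reset the counter
      } elsif (++$revisions > 1) {
        close($fh);
        return 1;                  # second revision inside the same page
      }
    }
  }

  close($fh);
  return 0;
}

# only run the check when a file name was given on the command line
if (my $file = shift(@ARGV)) {
  print has_multiple_revisions($file)
    ? "full page dump (more than one revision per page)\n"
    : "current page dump\n";
}
```

The scan stops at the first page with a second revision, so a full page dump
is usually identified after reading only a small part of the file.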

EXAMPLE

Extract the text for a given article from the given dump file:

#!/usr/bin/perl

use strict;
use warnings;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
my $title = shift(@ARGV) or die "must specify an article title";
my $dump = Parse::MediaWikiDump::Pages->new($file);

binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

#this is the only currently known value but there could be more in the future
if ($dump->case ne 'first-letter') {
  die "unable to handle any case setting besides 'first-letter'";
}

#enforce the MediaWiki case rules
$title = case_fixer($title);

#iterate over the entire dump file, article by article
while(my $page = $dump->next) {
  if ($page->title eq $title) {
    print STDERR "Located text for $title\n";
    my $text = $page->text;
    print $$text;
    exit 0;
  }
}

print STDERR "Unable to find article text for $title\n";
exit 1;

#removes any case sensitivity from the very first letter of the title
#but not from the optional namespace name
sub case_fixer {
  my $title = shift;

  #check for namespace
  if ($title =~ /^(.+?):(.+)/) {
    $title = $1 . ':' . ucfirst($2);
  } else {
    $title = ucfirst($title);
  }

  return $title;
}
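
To make the behaviour of case_fixer concrete, the sketch below repeats the
function so that it runs on its own and applies it to a few sample titles.
The sample titles are illustrative assumptions, not taken from any dump.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# same function as in the example above, repeated so this sketch is standalone
sub case_fixer {
  my $title = shift;

  #check for namespace
  if ($title =~ /^(.+?):(.+)/) {
    $title = $1 . ':' . ucfirst($2);
  } else {
    $title = ucfirst($title);
  }

  return $title;
}

# illustrative titles only
for my $input ('perl', 'talk:perl', 'star wars: a new hope') {
  print "$input => ", case_fixer($input), "\n";
}

# prints:
# perl => Perl
# talk:perl => talk:Perl
# star wars: a new hope => star wars: a new hope
```

Everything before the first colon is treated as a namespace and left
untouched, so a title that merely contains a colon is not changed when the
character after the colon is a space.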

COPYRIGHT & LICENSE
       Copyright 2005 Tyler Riddle, all rights reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.