Parse-MediaWikiDump
Parse::MediaWikiDump is a collection of classes for processing various
MediaWiki dump files, such as those at
http://download.wikimedia.org/wikipedia/en/. The package requires XML::Parser.
With this software, getting at the information in supported dump files is
nearly trivial.
Currently the following dump files are supported:
* Current page dumps for all languages
* Current links dumps for all languages
INSTALLATION
To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install
LIMITATIONS
Parse::MediaWikiDump currently cannot handle full page dumps (dumps where
each page has more than one revision). When it encounters one,
Parse::MediaWikiDump aborts processing of the archive.

Parse::MediaWikiDump is not as fast as it could be, but it is faster than
most other XML parsing frameworks. The parser could stand to be rewritten to
be faster and to handle full page dumps.
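
Because processing aborts on a full page dump, a caller may want to keep
whatever was gathered before the abort. The sketch below is one way to do
that. It assumes the abort surfaces as a Perl exception (die), which this
README does not confirm. count_pages_safely is a hypothetical helper and
not part of Parse::MediaWikiDump, and a stand-in iterator replaces the real
dump object so the script runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::MediaWikiDump.
# $next is a code ref that returns the next page, or undef at end of dump.
# ASSUMPTION: an aborted dump is reported by die; eval traps it here.
sub count_pages_safely {
    my ($next) = @_;
    my $count = 0;

    my $ok = eval {
        while (defined(my $page = $next->())) {
            $count++;
        }
        1;
    };

    # report the error text only if the loop did not finish normally
    my $error = $ok ? undef : ($@ || 'unknown error');
    return ($count, $error);
}

# stand-in iterator: two pages, then a simulated abort
my @fake = ('page one', 'page two');
my ($count, $error) = count_pages_safely(sub {
    die "simulated abort: page has more than one revision\n" unless @fake;
    return shift @fake;
});

print "processed $count pages\n";
print "stopped early: $error" if defined $error;
```

With a real dump, the iterator would be sub { $dump->next }, where $dump is
the Parse::MediaWikiDump::Pages object created as in the EXAMPLE section.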
EXAMPLE
Extract the text for a given article from the given dump file:
    #!/usr/bin/perl

    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
    my $title = shift(@ARGV) or die "must specify an article title";
    my $dump = Parse::MediaWikiDump::Pages->new($file);

    binmode(STDOUT, ':utf8');
    binmode(STDERR, ':utf8');

    #this is the only currently known value but there could be more in the future
    if ($dump->case ne 'first-letter') {
        die "unable to handle any case setting besides 'first-letter'";
    }

    #enforce the MediaWiki case rules
    $title = case_fixer($title);

    #iterate over the entire dump file, article by article
    while (defined(my $page = $dump->next)) {
        if ($page->title eq $title) {
            print STDERR "Located text for $title\n";
            #text returns a reference to the article text
            my $text = $page->text;
            print $$text;
            exit 0;
        }
    }

    print STDERR "Unable to find article text for $title\n";
    exit 1;

    #removes any case sensitivity from the very first letter of the title
    #but not from the optional namespace name
    sub case_fixer {
        my $title = shift;

        #check for namespace
        if ($title =~ /^(.+?):(.+)/) {
            $title = $1 . ':' . ucfirst($2);
        } else {
            $title = ucfirst($title);
        }

        return $title;
    }
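
The case_fixer routine above treats everything before the first colon as a
namespace, so a title such as "foo: a bar" keeps its lowercase first letter.
The following is a minimal sketch of a variant that applies the namespace
rule only when the prefix appears in a caller-supplied list. The name
case_fixer_ns and the namespace list are hypothetical and chosen for
illustration. The exact-match comparison is a simplification and may not
reflect how MediaWiki itself matches namespace names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of case_fixer, not part of Parse::MediaWikiDump.
# The list of namespace names is supplied by the caller.
sub case_fixer_ns {
    my ($title, @namespaces) = @_;

    if ($title =~ /^(.+?):(.+)/) {
        my ($prefix, $rest) = ($1, $2);

        # only treat the prefix as a namespace if it is a known one
        if (grep { $_ eq $prefix } @namespaces) {
            return $prefix . ':' . ucfirst($rest);
        }
    }

    # no known namespace: uppercase the first letter of the whole title
    return ucfirst($title);
}

my @namespaces = ('Talk', 'User', 'Template');  # illustrative list only

print case_fixer_ns('Talk:perl', @namespaces), "\n";
print case_fixer_ns('foo: a bar', @namespaces), "\n";
```

The first call finds the known prefix "Talk" and uppercases the first letter
of the remainder; the second falls through and uppercases the first letter
of the whole title.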
COPYRIGHT & LICENSE
Copyright 2005 Tyler Riddle, all rights reserved.
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.