Parse-MediaWikiDump

Parse::MediaWikiDump is a collection of classes for processing various
MediaWiki dump files, such as those at
http://download.wikimedia.org/wikipedia/en/. The package requires XML::Parser.
With it, reading the information in a supported dump file takes only a few
lines of code.

Currently the following dump files are supported:
  * Current page dumps for all languages
  * Current links dumps for all languages

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install

EXAMPLE

Extract the text for a given article from the given dump file:

#!/usr/bin/perl

use strict;
use warnings;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
my $title = shift(@ARGV) or die "must specify an article title";
my $dump = Parse::MediaWikiDump::Pages->new($file);

binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

#this is the only currently known value but there could be more in the future
if ($dump->case ne 'first-letter') {
  die "unable to handle any case setting besides 'first-letter'";
}

#enforce the MediaWiki case rules
$title = case_fixer($title);

#iterate over the entire dump file, article by article
while(my $page = $dump->next) {
  if ($page->title eq $title) {
    print STDERR "Located text for $title\n";
    my $text = $page->text;
    print $$text;
    exit 0;
  }
}

print STDERR "Unable to find article text for $title\n";
exit 1;

#removes any case sensitivity from the very first letter of the title
#but not from the optional namespace name
sub case_fixer {
  my $title = shift;

  #check for namespace
  if ($title =~ /^(.+?):(.+)/) {
    $title = $1 . ':' . ucfirst($2);
  } else {
    $title = ucfirst($title);
  }

  return $title;
}
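
The title normalization done by case_fixer can be tried on its own, without
a dump file. The sketch below copies the subroutine from the script above and
runs it on a few sample titles. The sample titles are hypothetical, chosen
only to exercise both branches:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Same logic as case_fixer in the example script: uppercase the first
# letter of the page name and leave any namespace prefix untouched.
sub case_fixer {
  my $title = shift;

  # a colon separates an optional namespace from the page name
  if ($title =~ /^(.+?):(.+)/) {
    $title = $1 . ':' . ucfirst($2);
  } else {
    $title = ucfirst($title);
  }

  return $title;
}

# hypothetical sample titles, not taken from any particular dump
for my $title ('perl', 'talk:perl', 'Perl') {
  print "$title => ", case_fixer($title), "\n";
}
```

Only the page name is changed; the namespace is passed through as typed, so
it must already match the spelling used in the dump.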

COPYRIGHT & LICENSE
       Copyright 2005 Tyler Riddle, all rights reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.