#! /usr/bin/perl -w
use strict;
# -- Filter PDF to simple HTML for swish
# --
# -- 2000-05 rasc
#
=pod
This filter requires two programs, "pdfinfo" and "pdftotext".
These programs are part of the xpdf package found at
http://www.foolabs.com/xpdf/xpdf.html.

Both programs must be found in the PATH when indexing is run; alternatively,
set the path explicitly in this program:

    $ENV{PATH} = '/path/to/programs';

"pdfinfo" extracts the document info from a PDF file, if any exists,
and creates metanames for Swish-e to index.  See the pdfinfo(1) man page
for a list of the available keywords.

An HTML title is created from the "title" and "subject" PDF info data.
Adjust as needed below.

How the extracted keyword info is indexed by Swish-e is controlled by
the following Swish-e configuration settings: MetaNames, PropertyNames,
and UndefinedMetaTags.  An illustrative configuration sketch follows
after this documentation block.

Passing the -raw option to pdftotext may improve indexing time by
reducing the size of the converted output.
=cut
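# The lines below are an illustrative sketch of how this filter might be
# hooked into a Swish-e configuration file.  MetaNames, PropertyNames and
# UndefinedMetaTags are the directives named above; the FileFilter line,
# the script name/path, the metaname list and the option values are
# assumptions, so check the Swish-e configuration documentation before
# copying them.
#
#   FileFilter .pdf ./pdf2html.pl      # adjust to this script's actual name/path
#   MetaNames author keywords creator
#   PropertyNames author
#   UndefinedMetaTags auto
#
# The filter can also be run by hand to inspect its output:
#
#   perl pdf2html.pl some-document.pdf > some-document.html
#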
my $file = shift || die "Usage: $0 <filename>\n";
#
# -- read pdf meta information
#
my %metadata;
open F, "pdfinfo $file |" ||
die "$0: Failed to open $file $!";
while (<F>) {
    if ( /^\s*([^:]+):\s+(.+)$/ ) {
        my ( $metaname, $value ) = ( lc( $1 ), escapeHTML( $2 ) );
        $metaname =~ tr/ /_/;
        $metadata{$metaname} = $value;
    }
}
close F or die "$0: Failed close on pipe to pdfinfo for $file: $?";
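# For reference, pdfinfo prints one "Keyword: value" pair per line, e.g.
# (values invented for illustration):
#
#   Title:          Annual Report
#   Subject:        Finances
#   Author:         J. Smith
#   CreationDate:   Mon May  1 10:00:00 2000
#   Pages:          12
#   Page size:      612 x 792 pts (letter)
#
# so the loop above would produce metanames such as "author", "creationdate",
# "pages" and "page_size" (spaces in keyword names become underscores).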
# Set the default title from the title and subject info
my @title = grep { $_ } @metadata{ qw/title subject/ };
delete $metadata{$_} for qw/title subject/;
my $title = join ' // ', ( @title ? @title : 'Unknown title' );
my $metadata =
    join "\n",
    map { qq[<meta name="$_" content="$metadata{$_}">] }
    sort keys %metadata;
print <<EOF;
<html>
<head>
<title>
$title
</title>
$metadata
</head>
<body>
EOF
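# With the sample pdfinfo output sketched earlier, the header printed above
# would look roughly like this (illustrative only):
#
#   <html>
#   <head>
#   <title>
#   Annual Report // Finances
#   </title>
#   <meta name="author" content="J. Smith">
#   <meta name="creationdate" content="Mon May  1 10:00:00 2000">
#   ...
#   </head>
#   <body>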
# Might be faster to use sysread and read in larger blocks
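# To try the -raw option mentioned in the notes above, the command below
# could be changed to, e.g.:  pdftotext -raw $file -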
open F, "pdftotext $file - |" or die "$0: failed to run pdftotext: $!";
print escapeHTML($_) while ( <F> );
close F or die "$0: Failed close on pipe to pdftotext for $file: $?";
print "</body></html>\n";
# How are URLs printed with pdftotext?
sub escapeHTML {
    my $str = shift;
    for ( $str ) {
        s/&/&amp;/go;
        s/</&lt;/go;
        s/>/&gt;/go;
        s/"/&quot;/go;
    }
    return $str;
}