org.pdfbox.searchengine.lucene
Class LucenePDFDocument

java.lang.Object
  extended by org.pdfbox.searchengine.lucene.LucenePDFDocument

public final class LucenePDFDocument
extends Object

This class creates a document for the Lucene search engine. It should plug easily into the IndexHTML or IndexFiles classes that come with the Lucene project. The class populates the following fields.

Lucene Field Name   Description
path                File system path if loaded from a file
url                 URL to PDF document
contents            Entire contents of PDF document, indexed but not stored
summary             First 500 characters of content
modified            The modified date/time according to the url or path
uid                 A unique identifier for the Lucene document
CreationDate        From PDF meta-data if available
Creator             From PDF meta-data if available
Keywords            From PDF meta-data if available
ModificationDate    From PDF meta-data if available
Producer            From PDF meta-data if available
Subject             From PDF meta-data if available
Trapped             From PDF meta-data if available
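
The summary field is described as the first 500 characters of the content. A minimal sketch of that truncation follows, using only the standard library. The class name, method name and constant are hypothetical and are not part of LucenePDFDocument; the sketch only illustrates the rule stated in the table, not how the class implements it.

```java
final class SummarySketch {
    // 500 comes from the field table above; the constant name is an assumption.
    static final int SUMMARY_LENGTH = 500;

    /** Returns at most the first SUMMARY_LENGTH characters of the extracted text. */
    static String summarize(String contents) {
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        // Build a text that is longer than the summary limit.
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 1200; i++) {
            longText.append('x');
        }
        System.out.println(summarize(longText.toString()).length());
        System.out.println(summarize("short text"));
    }
}
```

Text shorter than the limit is returned unchanged, so a short PDF would have the same text in both the contents and the summary fields.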

Version:
$Revision: 1.22 $
Author:
Ben Litchfield

Constructor Summary
LucenePDFDocument()
          Constructor.
 
Method Summary
 Document convertDocument(File file)
          This will take a reference to a PDF document and create a Lucene document.
 Document convertDocument(InputStream is)
          Convert the PDF stream to a Lucene document.
 Document convertDocument(URL url)
          Convert the document from a PDF to a Lucene document.
 DateTools.Resolution getDateTimeResolution()
          Get the Lucene date/time resolution.
static Document getDocument(File file)
          This will get a Lucene document from a PDF file.
static Document getDocument(InputStream is)
          This will get a Lucene document from a PDF stream.
static Document getDocument(URL url)
          This will get a Lucene document from a PDF at a URL.
static void main(String[] args)
          This will test creating a document.
 void setDateTimeResolution(DateTools.Resolution resolution)
          Set the Lucene date/time resolution.
 void setTextStripper(PDFTextStripper aStripper)
          Set the text stripper that will be used during extraction.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LucenePDFDocument

public LucenePDFDocument()
Constructor.

Method Detail

setTextStripper

public void setTextStripper(PDFTextStripper aStripper)
Set the text stripper that will be used during extraction.

Parameters:
aStripper - The new PDF text stripper.

getDateTimeResolution

public DateTools.Resolution getDateTimeResolution()
Get the Lucene date/time resolution.

Returns:
The current date/time resolution.

setDateTimeResolution

public void setDateTimeResolution(DateTools.Resolution resolution)
Set the Lucene date/time resolution.

Parameters:
resolution - The new date/time resolution.
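
To show what a date/time resolution means in practice, here is a sketch that imitates how Lucene's DateTools turns a timestamp into an index string: the time is formatted in GMT and every field finer than the chosen resolution is dropped. The enum and method below are hypothetical stand-ins written with java.time so the example runs without Lucene, and they cover only three of the resolutions. Which fields of the resulting document use the resolution is not stated on this page.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class ResolutionSketch {
    /** Hypothetical stand-in for a subset of DateTools.Resolution. */
    enum Resolution {
        DAY("yyyyMMdd"),
        MINUTE("yyyyMMddHHmm"),
        SECOND("yyyyMMddHHmmss");

        final DateTimeFormatter formatter;

        Resolution(String pattern) {
            // Format in GMT/UTC so the string does not depend on the local time zone.
            this.formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        }
    }

    /** Formats a timestamp, keeping only the fields up to the given resolution. */
    static String toIndexString(long epochMillis, Resolution resolution) {
        return resolution.formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        for (Resolution r : Resolution.values()) {
            System.out.println(r + " -> " + toIndexString(now, r));
        }
    }
}
```

A coarser resolution produces fewer distinct terms in the index, at the cost of less precise date queries.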

convertDocument

public Document convertDocument(InputStream is)
                         throws IOException
Convert the PDF stream to a Lucene document.

Parameters:
is - The input stream.
Returns:
The input stream converted to a Lucene document.
Throws:
IOException - If there is an error converting the PDF.

convertDocument

public Document convertDocument(File file)
                         throws IOException
This will take a reference to a PDF document and create a Lucene document.

Parameters:
file - A reference to a PDF document.
Returns:
The converted Lucene document.
Throws:
IOException - If there is an exception while converting the document.

convertDocument

public Document convertDocument(URL url)
                         throws IOException
Convert the document from a PDF to a Lucene document.

Parameters:
url - A URL to a PDF document.
Returns:
The PDF converted to a Lucene document.
Throws:
IOException - If there is an error while converting the document.

getDocument

public static Document getDocument(InputStream is)
                            throws IOException
This will get a Lucene document from a PDF stream.

Parameters:
is - The stream to read the PDF from.
Returns:
The Lucene document.
Throws:
IOException - If there is an error parsing or indexing the document.

getDocument

public static Document getDocument(File file)
                            throws IOException
This will get a Lucene document from a PDF file.

Parameters:
file - The file to get the document for.
Returns:
The Lucene document.
Throws:
IOException - If there is an error parsing or indexing the document.
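
A static conversion method of this shape fits naturally into a directory-walking indexer such as the IndexFiles program mentioned in the class description. The sketch below shows one possible arrangement. The PdfConverter interface and the convertAll method are hypothetical and stand in for a call such as LucenePDFDocument.getDocument(File), so that the example compiles without PDFBox or Lucene on the classpath.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PdfWalkSketch {
    /** Hypothetical stand-in for a converter such as LucenePDFDocument.getDocument(File). */
    interface PdfConverter<D> {
        D convert(File file) throws IOException;
    }

    /** Converts every regular file under root whose name ends in ".pdf", ignoring case. */
    static <D> List<D> convertAll(Path root, PdfConverter<D> converter) throws IOException {
        List<Path> pdfs;
        try (Stream<Path> paths = Files.walk(root)) {
            pdfs = paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                .sorted()
                .collect(Collectors.toList());
        }
        // Convert outside the stream so an IOException propagates to the caller.
        List<D> converted = new ArrayList<>();
        for (Path pdf : pdfs) {
            converted.add(converter.convert(pdf.toFile()));
        }
        return converted;
    }

    public static void main(String[] args) throws IOException {
        // With PDFBox and Lucene available, the converter could be
        // LucenePDFDocument::getDocument; here the file name stands in for the result.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String name : convertAll(root, File::getName)) {
            System.out.println(name);
        }
    }
}
```

In a real indexer each converted document would be handed to an index writer instead of being collected in a list.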

getDocument

public static Document getDocument(URL url)
                            throws IOException
This will get a Lucene document from a PDF at a URL.

Parameters:
url - The URL to get the document for.
Returns:
The Lucene document.
Throws:
IOException - If there is an error parsing or indexing the document.

main

public static void main(String[] args)
                 throws IOException
This will test creating a document. Usage: java org.pdfbox.searchengine.lucene.LucenePDFDocument <pdf-document>

Parameters:
args - command line arguments.
Throws:
IOException - If there is an error.