org.pdfbox.util
Class PDFText2HTML

java.lang.Object
  extended by org.pdfbox.util.PDFStreamEngine
      extended by org.pdfbox.util.PDFTextStripper
          extended by org.pdfbox.util.PDFText2HTML

public class PDFText2HTML
extends PDFTextStripper

Wraps stripped text in simple HTML and tries to form HTML paragraphs. Paragraphs that are broken across pages, columns, or figures are not rejoined.

Version:
$Revision: 1.3 $
Author:
jjb - http://www.johnjbarton.com
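
The general idea in the class description (plain text in, simple HTML with paragraphs out) can be illustrated without PDFBox at all. The sketch below is not this class's implementation. The class name SimpleHtmlWrapper and its methods are hypothetical, and it assumes that a blank line marks a paragraph break and that only &, < and > need escaping.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only (hypothetical names); this is not PDFBox code. */
class SimpleHtmlWrapper {

    /** Escape the characters that are significant in HTML text content. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Split text into paragraphs. Assumption: a blank line marks a paragraph break. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<String>();
        for (String block : text.split("\\n\\s*\\n")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    /** Build a minimal HTML document: a header with a title, then one p element per paragraph. */
    static String wrap(String title, String text) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(title)).append("</title></head>\n");
        html.append("<body>\n");
        for (String paragraph : splitParagraphs(text)) {
            html.append("<p>").append(escape(paragraph)).append("</p>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }
}
```

Calling SimpleHtmlWrapper.wrap("Report", text) returns the whole document as a string. A paragraph that was split by a page break arrives as two separate blocks and stays split, which mirrors the limitation stated in the class description.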

Field Summary
 
Fields inherited from class org.pdfbox.util.PDFTextStripper
charactersByArticle, output
 
Constructor Summary
PDFText2HTML()
          Constructor.
 
Method Summary
 void endDocument(PDDocument pdf)
          This method is available for subclasses of this class. It will be called after processing of the document finishes.
protected  void endParagraph()
          Write out the end of a paragraph.
protected  void flushText()
          This will print the text to the output stream.
protected  String getTitleGuess()
          Returns the guessed title of the document.
protected  TextPosition guessTitle(Iterator textIter)
          This method will attempt to guess the title of the document.
 boolean isSuppressParagraphs()
          Returns the current value of the suppressParagraphs flag.
 void setSuppressParagraphs(boolean shouldSuppressParagraphs)
          Sets the suppressParagraphs flag.
protected  void startParagraph()
          Write out the start of a paragraph.
protected  void writeCharacters(TextPosition position)
          Write the string to the output stream.
protected  void writeHeader()
          Write the header to the output document.
 
Methods inherited from class org.pdfbox.util.PDFTextStripper
endPage, getCharactersByArticle, getCurrentPageNo, getEndBookmark, getEndPage, getLineSeparator, getOutput, getPageSeparator, getStartBookmark, getStartPage, getText, getText, getWordSeparator, processPage, processPages, setEndBookmark, setEndPage, setLineSeparator, setPageSeparator, setShouldSeparateByBeads, setSortByPosition, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, shouldSeparateByBeads, shouldSortByPosition, shouldSuppressDuplicateOverlappingText, showCharacter, startDocument, startPage, writeText, writeText
 
Methods inherited from class org.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getXObjects, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix, showString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PDFText2HTML

public PDFText2HTML()
             throws IOException
Constructor.

Throws:
IOException - If there is an error during initialization.

Method Detail

writeHeader

protected void writeHeader()
                    throws IOException
Write the header to the output document.

Throws:
IOException - If there is a problem writing out the header to the document.

getTitleGuess

protected String getTitleGuess()
Returns the guessed title of the document.

Returns:
The guessed title of this document.

flushText

protected void flushText()
                  throws IOException
This will print the text to the output stream.

Overrides:
flushText in class PDFTextStripper
Throws:
IOException - If there is an error writing the text.

endDocument

public void endDocument(PDDocument pdf)
                 throws IOException
This method is available for subclasses of this class. It will be called after processing of the document finishes.

Overrides:
endDocument in class PDFTextStripper
Parameters:
pdf - The PDF document that is being processed.
Throws:
IOException - If an IO error occurs.

guessTitle

protected TextPosition guessTitle(Iterator textIter)
This method will attempt to guess the title of the document.

Parameters:
textIter - The characters on the first page.
Returns:
The text position that is guessed to be the title.
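
How the title is guessed is not documented here. One plausible heuristic, shown purely as an assumption and not as what this class does, is to pick the piece of text on the first page with the largest font size. The Fragment class below is a hypothetical stand-in and is not PDFBox's TextPosition.

```java
import java.util.Iterator;

/** Illustrative sketch only (hypothetical names); this is not PDFBox code. */
class TitleGuessSketch {

    /** Hypothetical stand-in for a positioned piece of text with a font size. */
    static class Fragment {
        final String text;
        final float fontSize;

        Fragment(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Assumed heuristic: the fragment with the largest font size is the title.
     * The first fragment wins ties. Returns null when there are no fragments.
     */
    static Fragment guessTitle(Iterator<Fragment> firstPage) {
        Fragment best = null;
        while (firstPage.hasNext()) {
            Fragment candidate = firstPage.next();
            if (best == null || candidate.fontSize > best.fontSize) {
                best = candidate;
            }
        }
        return best;
    }
}
```

A heuristic like this fails on documents whose largest text is a logo or a running header, so a caller would want a fallback such as an empty title.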

startParagraph

protected void startParagraph()
                       throws IOException
Write out the start of a paragraph.

Overrides:
startParagraph in class PDFTextStripper
Throws:
IOException - If there is an error writing to the stream.

endParagraph

protected void endParagraph()
                     throws IOException
Write out the end of a paragraph.

Overrides:
endParagraph in class PDFTextStripper
Throws:
IOException - If there is an error writing to the stream.
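
startParagraph and endParagraph are hooks that a subclass overrides to change what surrounds each paragraph. The self-contained sketch below shows that pattern with a suppress flag. Both classes are hypothetical stand-ins rather than PDFBox code, and the exact markup written by this class is an assumption.

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical plain-text base class; stands in for a text stripper. Not PDFBox code. */
class PlainParagraphWriter {
    protected final Writer output;

    PlainParagraphWriter(Writer output) {
        this.output = output;
    }

    /** Hook: called before the text of each paragraph. Plain text writes nothing. */
    protected void startParagraph() throws IOException {
    }

    /** Hook: called after the text of each paragraph. Plain text ends the line. */
    protected void endParagraph() throws IOException {
        output.write("\n");
    }

    /** Template method: the hooks surround the paragraph text. */
    void writeParagraph(String text) throws IOException {
        startParagraph();
        output.write(text);
        endParagraph();
    }
}

/** Hypothetical HTML variant: overrides the hooks and honors a suppress flag. */
class HtmlParagraphWriter extends PlainParagraphWriter {
    private boolean suppressParagraphs = false;

    HtmlParagraphWriter(Writer output) {
        super(output);
    }

    boolean isSuppressParagraphs() {
        return suppressParagraphs;
    }

    void setSuppressParagraphs(boolean shouldSuppressParagraphs) {
        this.suppressParagraphs = shouldSuppressParagraphs;
    }

    @Override
    protected void startParagraph() throws IOException {
        if (!suppressParagraphs) {
            output.write("<p>");   // assumed opening markup
        }
    }

    @Override
    protected void endParagraph() throws IOException {
        if (!suppressParagraphs) {
            output.write("</p>");  // assumed closing markup
        }
        output.write("\n");
    }
}
```

With the flag off, each call to writeParagraph emits the text inside a p element. With the flag on, only the text and a newline are written, which is useful when the caller adds its own markup.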

writeCharacters

protected void writeCharacters(TextPosition position)
                        throws IOException
Write the string to the output stream.

Overrides:
writeCharacters in class PDFTextStripper
Parameters:
position - The text to write to the stream.
Throws:
IOException - If there is an error when writing the text.

isSuppressParagraphs

public boolean isSuppressParagraphs()
Returns:
The current value of the suppressParagraphs flag.

setSuppressParagraphs

public void setSuppressParagraphs(boolean shouldSuppressParagraphs)
Parameters:
shouldSuppressParagraphs - The new value of the suppressParagraphs flag.