Package org.apache.pdfbox.tools
Class PDFText2HTML
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.LegacyPDFStreamEngine
      - org.apache.pdfbox.text.PDFTextStripper
        - org.apache.pdfbox.tools.PDFText2HTML
-
public class PDFText2HTML extends PDFTextStripper
Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.
-
Nested Class Summary
Nested Classes (Modifier and Type, Class: Description)
- private static class PDFText2HTML.FontState: A helper class to maintain the current font state.
-
Field Summary
Fields (Modifier and Type, Field)
- private PDFText2HTML.FontState fontState
- private static int INITIAL_PDF_TO_HTML_BYTES
-
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
Constructor Summary
Constructors (Constructor: Description)
- PDFText2HTML(): Constructor.
-
Method Summary
Methods (Modifier and Type, Method: Description)
- private static void appendEscaped(java.lang.StringBuilder builder, char character)
- protected void endArticle(): Write out the article separator.
- void endDocument(PDDocument document): This method is available for subclasses of this class.
- private static java.lang.String escape(java.lang.String chars): Escape some HTML characters.
- protected java.lang.String getTitle(): This method will attempt to guess the title of the document using either the document properties or the first lines of text.
- protected void startArticle(boolean isLTR): Write out the article separator (div tag) with proper text direction information.
- protected void startDocument(PDDocument document): This method is available for subclasses of this class.
- protected void writeParagraphEnd(): Writes the paragraph end "</p>" to the output.
- protected void writeString(java.lang.String chars): Write a string to the output stream and escape some HTML characters.
- protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions): Write a string to the output stream, maintain font state, and escape some HTML characters.
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator
-
Methods inherited from class org.apache.pdfbox.text.LegacyPDFStreamEngine
computeFontHeight, showGlyph
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, markedContentPoint, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Detail
-
INITIAL_PDF_TO_HTML_BYTES
private static final int INITIAL_PDF_TO_HTML_BYTES
- See Also: Constant Field Values
-
fontState
private final PDFText2HTML.FontState fontState
-
Method Detail
-
startDocument
protected void startDocument(PDDocument document) throws java.io.IOException
Description copied from class: PDFTextStripper
This method is available for subclasses of this class. It will be called before processing of the document starts.
- Overrides: startDocument in class PDFTextStripper
- Parameters: document - The PDF document that is being processed.
- Throws: java.io.IOException - If an IO error occurs.
-
endDocument
public void endDocument(PDDocument document) throws java.io.IOException
This method is available for subclasses of this class. It will be called after processing of the document finishes.
- Overrides: endDocument in class PDFTextStripper
- Parameters: document - The PDF document that is being processed.
- Throws: java.io.IOException - If an IO error occurs.
-
getTitle
protected java.lang.String getTitle()
This method will attempt to guess the title of the document using either the document properties or the first lines of text.
- Returns: the title.
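The fallback order described for getTitle can be illustrated with a small standalone sketch. It does not use PDFBox: guessTitle, metadataTitle, and firstLines are hypothetical names, and choosing the first non-blank line is an assumption about what "the first lines of text" means, not the library's actual heuristic.

```java
import java.util.Arrays;
import java.util.List;

final class TitleGuessSketch {
    // Hypothetical helper: prefer the title from the document properties,
    // otherwise fall back to the first non-blank line of extracted text.
    static String guessTitle(String metadataTitle, List<String> firstLines) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        for (String line : firstLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed;
            }
        }
        // Nothing usable was found; an empty title is an assumption of this sketch.
        return "";
    }

    public static void main(String[] args) {
        // No metadata title, so the first non-blank text line is used.
        System.out.println(guessTitle(null, Arrays.asList("  ", "Annual Report", "2020")));
    }
}
```

Running main prints "Annual Report", since the metadata title is absent and the first line is blank.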
-
startArticle
protected void startArticle(boolean isLTR) throws java.io.IOException
Write out the article separator (div tag) with proper text direction information.
- Overrides: startArticle in class PDFTextStripper
- Parameters: isLTR - true if direction of text is left to right
- Throws: java.io.IOException - If there is an error writing to the stream.
-
endArticle
protected void endArticle() throws java.io.IOException
Write out the article separator.
- Overrides: endArticle in class PDFTextStripper
- Throws: java.io.IOException - If there is an error writing to the stream.
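As a rough illustration of div-based article separators with direction information, the following standalone sketch builds the opening and closing tags as strings. The class and method names are hypothetical, and the exact markup (the dir attribute value, and a closing div tag for the end of an article) is an assumption rather than a statement of what PDFText2HTML emits.

```java
final class ArticleTagSketch {
    // Opening separator: left-to-right is the HTML default, so only
    // right-to-left articles get an explicit dir attribute in this sketch.
    static String articleStart(boolean isLTR) {
        return isLTR ? "<div>" : "<div dir=\"rtl\">";
    }

    // Closing separator (assumed to be the matching end tag).
    static String articleEnd() {
        return "</div>";
    }

    public static void main(String[] args) {
        System.out.println(articleStart(false) + "text" + articleEnd());
    }
}
```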
-
writeString
protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions) throws java.io.IOException
Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
- Overrides: writeString in class PDFTextStripper
- Parameters: text - The text to write to the stream. textPositions - the corresponding text positions
- Throws: java.io.IOException - If there is an error writing to the stream.
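One way to picture "maintaining font state per word" is a small state machine that remembers which inline tags are open and emits only the tags needed to reach the style of the next word. The sketch below is standalone and hypothetical: FontStateSketch, push, and clear are illustrative names, the boolean flags stand in for whatever would be derived from the text positions, and tracking only bold and italic is an assumption.

```java
final class FontStateSketch {
    private boolean bold;
    private boolean italic;

    // Returns the tags needed to switch to the requested style, followed by
    // the word. The word is assumed to be HTML-escaped already.
    String push(String word, boolean wantBold, boolean wantItalic) {
        StringBuilder out = new StringBuilder();
        // <i> is always nested inside <b>, so any change to bold must
        // close an open <i> first to keep the tags well nested.
        if (italic && (!wantItalic || bold != wantBold)) {
            out.append("</i>");
            italic = false;
        }
        if (bold && !wantBold) {
            out.append("</b>");
            bold = false;
        }
        if (wantBold && !bold) {
            out.append("<b>");
            bold = true;
        }
        if (wantItalic && !italic) {
            out.append("<i>");
            italic = true;
        }
        return out.append(word).toString();
    }

    // Closes whatever is still open, e.g. at the end of a paragraph.
    String clear() {
        StringBuilder out = new StringBuilder();
        if (italic) {
            out.append("</i>");
            italic = false;
        }
        if (bold) {
            out.append("</b>");
            bold = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        FontStateSketch state = new FontStateSketch();
        StringBuilder html = new StringBuilder("<p>");
        html.append(state.push("Hello", true, false)).append(' ');
        html.append(state.push("world", true, true));
        html.append(state.clear()).append("</p>");
        System.out.println(html); // prints <p><b>Hello <i>world</i></b></p>
    }
}
```

Closing open tags before the paragraph end tag is what keeps the resulting markup well nested across paragraphs.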
-
writeString
protected void writeString(java.lang.String chars) throws java.io.IOException
Write a string to the output stream and escape some HTML characters.
- Overrides: writeString in class PDFTextStripper
- Parameters: chars - String to be written to the stream
- Throws: java.io.IOException - If there is an error writing to the stream.
-
writeParagraphEnd
protected void writeParagraphEnd() throws java.io.IOException
Writes the paragraph end "</p>" to the output and clears the font state. Writes something (if defined) at the end of a paragraph.
- Overrides: writeParagraphEnd in class PDFTextStripper
- Throws: java.io.IOException - if something went wrong
-
escape
private static java.lang.String escape(java.lang.String chars)
Escape some HTML characters.
- Parameters: chars - String to be escaped
- Returns: the escaped String.
-
appendEscaped
private static void appendEscaped(java.lang.StringBuilder builder, char character)
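Since escape and appendEscaped are private and undocumented beyond "escape some HTML characters", the following standalone sketch only illustrates the general technique. The set of escaped characters (the four markup-significant ones plus a numeric reference for anything above code 126) is an assumption and may not match the class's actual behavior. The method names mirror the signatures above for readability only.

```java
final class HtmlEscapeSketch {
    // Appends one character to the builder, replacing it with an HTML
    // character reference when it would otherwise be read as markup.
    static void appendEscaped(StringBuilder builder, char character) {
        switch (character) {
            case '<':
                builder.append("&lt;");
                break;
            case '>':
                builder.append("&gt;");
                break;
            case '&':
                builder.append("&amp;");
                break;
            case '"':
                builder.append("&quot;");
                break;
            default:
                if (character > 126) {
                    // Assumption: non-ASCII becomes a numeric character reference.
                    builder.append("&#").append((int) character).append(';');
                } else {
                    builder.append(character);
                }
        }
    }

    // Escapes a whole string by delegating to appendEscaped per character.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            appendEscaped(builder, chars.charAt(i));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & \"c\"")); // prints a &lt; b &amp; &quot;c&quot;
    }
}
```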
-