Package org.apache.pdfbox.tools
Class PDFText2Markdown
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.LegacyPDFStreamEngine
      - org.apache.pdfbox.text.PDFTextStripper
        - org.apache.pdfbox.tools.PDFText2Markdown
- Direct Known Subclasses:
FilteredText2Markdown
public class PDFText2Markdown extends PDFTextStripper
Convert PDF text to Markdown format. Each line in the PDF is converted to a corresponding Markdown paragraph. Bold and italic formatting is also applied based on font properties.
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- private static class PDFText2Markdown.FontState: A helper class to maintain the current font state.
-
Field Summary
Fields (Modifier and Type, Field):
- private PDFText2Markdown.FontState fontState
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
-
Constructor Summary
Constructors:
- PDFText2Markdown(): Constructor.
-
Method Summary
All Methods (Static Methods, Instance Methods, Concrete Methods):
- private static void appendEscaped(java.lang.StringBuilder builder, char character)
- protected void endArticle(): Write out the article separator.
- private static java.lang.String escape(java.lang.String chars): Escape some Markdown characters.
- protected void startArticle(boolean isLTR): Write out the article separator with proper text direction information.
- protected void writeParagraphEnd(): Writes the Markdown paragraph end to the output.
- protected void writeString(java.lang.String chars): Write a string to the output stream and escape some Markdown characters.
- protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions): Write a string to the output stream, maintain font state, and escape some Markdown characters.
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endDocument, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator
-
Methods inherited from class org.apache.pdfbox.text.LegacyPDFStreamEngine
computeFontHeight, showGlyph
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, markedContentPoint, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Detail
-
fontState
private final PDFText2Markdown.FontState fontState
-
-
Method Detail
-
escape
private static java.lang.String escape(java.lang.String chars)
Escape some Markdown characters.
- Parameters:
chars - String to be escaped
- Returns:
the escaped String.
-
appendEscaped
private static void appendEscaped(java.lang.StringBuilder builder, char character)
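The two private helpers above are not documented beyond "escape some Markdown characters", so the following is only a minimal sketch of how such a pair could work together. The class name `MarkdownEscapeSketch` and the set of escaped characters in `SPECIAL` are assumptions made for illustration; the characters actually escaped by PDFText2Markdown are not listed on this page.

```java
final class MarkdownEscapeSketch {
    // ASSUMPTION: an illustrative set of Markdown-significant characters.
    // The real set used by PDFText2Markdown is not documented here.
    private static final String SPECIAL = "\\`*_[]#";

    // Appends one character, prefixing it with a backslash if it is in SPECIAL.
    static void appendEscaped(StringBuilder builder, char character) {
        if (SPECIAL.indexOf(character) >= 0) {
            builder.append('\\');
        }
        builder.append(character);
    }

    // Escapes a whole string by delegating to appendEscaped for each character.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length() + 8);
        for (int i = 0; i < chars.length(); i++) {
            appendEscaped(builder, chars.charAt(i));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // prints: 2 \* 3 = 6 \[approx\]
        System.out.println(escape("2 * 3 = 6 [approx]"));
    }
}
```

Splitting the work into a per-character helper and a per-string wrapper keeps the escaping rule in one place, which is one plausible reason for the two-method shape shown in the summary.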
-
startArticle
protected void startArticle(boolean isLTR) throws java.io.IOException
Write out the article separator with proper text direction information.
- Overrides:
startArticle in class PDFTextStripper
- Parameters:
isLTR - true if direction of text is left to right
- Throws:
java.io.IOException - If there is an error writing to the stream.
-
endArticle
protected void endArticle() throws java.io.IOException
Write out the article separator.
- Overrides:
endArticle in class PDFTextStripper
- Throws:
java.io.IOException - If there is an error writing to the stream.
-
writeString
protected void writeString(java.lang.String text, java.util.List<TextPosition> textPositions) throws java.io.IOException
Write a string to the output stream, maintain font state, and escape some Markdown characters. The font state is only preserved per word.
- Overrides:
writeString in class PDFTextStripper
- Parameters:
text - The text to write to the stream.
textPositions - The corresponding text positions.
- Throws:
java.io.IOException - If there is an error writing to the stream.
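To illustrate what "maintain font state" per word could mean, here is a hedged, library-free sketch. It does not use PDFBox types: `Word` is a hypothetical stand-in for a word plus the font name taken from its text positions, and deciding bold or italic from substrings of the font name is an assumption, not the documented behavior of PDFText2Markdown. Escaping of the word text is left out to keep the sketch short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class WordFontStateSketch {
    // Hypothetical stand-in for a word and the font name of its glyphs.
    static final class Word {
        final String text;
        final String fontName;
        Word(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    // ASSUMPTION: style is guessed from the font name only.
    static String markersFor(String fontName) {
        String name = fontName.toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        return (bold ? "**" : "") + (italic ? "*" : "");
    }

    // Opens markers when a word's style differs from the open style and closes
    // them right after the last word of a run, so no marker follows a space.
    static String render(List<Word> words) {
        StringBuilder out = new StringBuilder();
        String open = "";
        for (int i = 0; i < words.size(); i++) {
            Word word = words.get(i);
            String wanted = markersFor(word.fontName);
            if (!wanted.equals(open)) {
                out.append(wanted);
                open = wanted;
            }
            out.append(word.text);
            String next = i + 1 < words.size() ? markersFor(words.get(i + 1).fontName) : "";
            if (!next.equals(open)) {
                out.append(open); // "**", "*" and "***" close with the same string
                open = "";
            }
            if (i + 1 < words.size()) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
                new Word("Total", "Helvetica-Bold"),
                new Word("cost", "Helvetica-Bold"),
                new Word("is", "Helvetica"),
                new Word("approximate", "Helvetica-Oblique"));
        // prints: **Total cost** is *approximate*
        System.out.println(render(words));
    }
}
```

Closing markers directly after a word, rather than when the next styled word arrives, avoids output such as `**Total cost **is`, which Markdown renderers generally do not treat as emphasis.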
-
writeString
protected void writeString(java.lang.String chars) throws java.io.IOException
Write a string to the output stream and escape some Markdown characters.
- Overrides:
writeString in class PDFTextStripper
- Parameters:
chars - String to be written to the stream.
- Throws:
java.io.IOException - If there is an error writing to the stream.
-
writeParagraphEnd
protected void writeParagraphEnd() throws java.io.IOException
Writes the Markdown paragraph end to the output. It also clears the font state. Write something (if defined) at the end of a paragraph.
- Overrides:
writeParagraphEnd in class PDFTextStripper
- Throws:
java.io.IOException - if something went wrong
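Clearing the font state at the end of a paragraph can be pictured as closing whatever emphasis markers are still open before the paragraph break is written. The sketch below shows that idea only. The class `ParagraphEndSketch`, its `open` and `clear` methods, and the blank line used as the paragraph end are all assumptions for illustration, not the API of PDFText2Markdown.FontState.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ParagraphEndSketch {
    // Markers opened so far, most recent on top.
    private final Deque<String> openMarkers = new ArrayDeque<>();

    // Records a marker as open and returns it so the caller can write it.
    String open(String marker) {
        openMarkers.push(marker);
        return marker;
    }

    // Returns the closing markers in reverse opening order and resets the state.
    String clear() {
        StringBuilder closing = new StringBuilder();
        while (!openMarkers.isEmpty()) {
            closing.append(openMarkers.pop());
        }
        return closing.toString();
    }

    public static void main(String[] args) {
        ParagraphEndSketch state = new ParagraphEndSketch();
        StringBuilder out = new StringBuilder();
        out.append(state.open("**")).append(state.open("*")).append("unfinished emphasis");
        out.append(state.clear()); // closes "*" first, then "**"
        out.append("\n\n");        // ASSUMPTION: a blank line ends the paragraph
        System.out.print(out);
    }
}
```

Resetting the state here means emphasis never leaks from one paragraph into the next, even if the last word of a paragraph was bold or italic.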
-
-