Class PDFText2Markdown

  • Direct Known Subclasses:
    FilteredText2Markdown

    public class PDFText2Markdown
    extends PDFTextStripper
    Converts PDF text to Markdown. Each line in the PDF becomes a Markdown paragraph, and bold and italic formatting is applied based on font properties.
    • Constructor Detail

      • PDFText2Markdown

        public PDFText2Markdown()
        Creates a new PDFText2Markdown instance.
    • Method Detail

      • escape

        private static java.lang.String escape(java.lang.String chars)
        Escapes some Markdown characters.
        Parameters:
        chars - The string to be escaped.
        Returns:
        The escaped string.
      • appendEscaped

        private static void appendEscaped(java.lang.StringBuilder builder,
                                          char character)
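The relationship between escape and appendEscaped can be illustrated with a minimal, standalone sketch. It is not the library's implementation: the class name MarkdownEscapeSketch and the set of characters in SPECIAL are assumptions made for illustration, since this page only says that "some" Markdown characters are escaped.

```java
// Hypothetical sketch; not the PDFText2Markdown source.
class MarkdownEscapeSketch {

    // Assumption: characters treated as Markdown syntax (backslash, backtick, *, _, [, ], #).
    private static final String SPECIAL = "\\`*_[]#";

    // Returns a copy of chars with each special character preceded by a backslash.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            appendEscaped(builder, chars.charAt(i));
        }
        return builder.toString();
    }

    // Appends one character, adding a backslash first if it is in SPECIAL.
    static void appendEscaped(StringBuilder builder, char character) {
        if (SPECIAL.indexOf(character) >= 0) {
            builder.append('\\');
        }
        builder.append(character);
    }

    public static void main(String[] args) {
        System.out.println(escape("a*b_c"));
    }
}
```

Appending character by character into a single StringBuilder avoids building intermediate strings, which is presumably why the per-character helper takes the builder as a parameter.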
      • startArticle

        protected void startArticle(boolean isLTR)
                             throws java.io.IOException
        Write out the article separator with proper text direction information.
        Overrides:
        startArticle in class PDFTextStripper
        Parameters:
        isLTR - true if direction of text is left to right
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • endArticle

        protected void endArticle()
                           throws java.io.IOException
        Write out the article separator.
        Overrides:
        endArticle in class PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • writeString

        protected void writeString(java.lang.String text,
                                   java.util.List<TextPosition> textPositions)
                            throws java.io.IOException
        Writes a string to the output stream, maintaining the font state and escaping some Markdown characters. The font state is preserved only per word.
        Overrides:
        writeString in class PDFTextStripper
        Parameters:
        text - The text to write to the stream.
        textPositions - The corresponding text positions.
        Throws:
        java.io.IOException - If there is an error writing to the stream.
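One way to picture per-word font state tracking is the standalone sketch below. It is an illustration under stated assumptions, not the library's code: the class FontStateSketch, its methods, the font-name heuristics (looking for "bold", "italic" or "oblique" in the name), and the choice of `**` and `*` as markers are all hypothetical. Markers are closed and reopened only when the style changes between words, and any open marker is closed when the paragraph ends.

```java
import java.util.Locale;

// Hypothetical sketch; not the PDFText2Markdown source.
class FontStateSketch {
    private boolean bold;
    private boolean italic;
    private final StringBuilder out = new StringBuilder();

    // Assumption: style is guessed from the font name.
    static boolean looksBold(String fontName) {
        return fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    static boolean looksItalic(String fontName) {
        String name = fontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    // Marker for the currently tracked style.
    private String marker() {
        if (bold && italic) {
            return "***";
        }
        if (bold) {
            return "**";
        }
        return italic ? "*" : "";
    }

    // Writes one word, switching markers only when the style changes.
    void writeWord(String word, String fontName) {
        boolean newBold = looksBold(fontName);
        boolean newItalic = looksItalic(fontName);
        boolean changed = newBold != bold || newItalic != italic;
        if (changed) {
            out.append(marker()); // close the previous style
        }
        if (out.length() > 0) {
            out.append(' ');
        }
        if (changed) {
            bold = newBold;
            italic = newItalic;
            out.append(marker()); // open the new style
        }
        out.append(word);
    }

    // Closes any open marker, clears the font state, and returns the paragraph.
    String endParagraph() {
        out.append(marker());
        bold = false;
        italic = false;
        String paragraph = out.toString();
        out.setLength(0);
        return paragraph;
    }
}
```

Closing the marker before the separating space keeps the emphasis tight around the word, which Markdown parsers generally require for emphasis to be recognized.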
      • writeString

        protected void writeString(java.lang.String chars)
                            throws java.io.IOException
        Writes a string to the output stream, escaping some Markdown characters.
        Overrides:
        writeString in class PDFTextStripper
        Parameters:
        chars - String to be written to the stream.
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • writeParagraphEnd

        protected void writeParagraphEnd()
                                  throws java.io.IOException
        Writes the Markdown paragraph end to the output and clears the font state.

        Write something (if defined) at the end of a paragraph.

        Overrides:
        writeParagraphEnd in class PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing to the stream.