Class COSParser

  • All Implemented Interfaces:
    ICOSParser
    Direct Known Subclasses:
    BruteForceParser, FDFParser, PDFParser

    public class COSParser
    extends BaseParser
    implements ICOSParser
    COS parser that first reads the startxref entry and the xref tables to determine which objects are valid, and then parses only those objects. This class is a much enhanced version of the QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
    • Field Detail

      • PDF_DEFAULT_VERSION

        private static final java.lang.String PDF_DEFAULT_VERSION
        See Also:
        Constant Field Values
      • FDF_DEFAULT_VERSION

        private static final java.lang.String FDF_DEFAULT_VERSION
        See Also:
        Constant Field Values
      • XREF_TABLE

        private static final char[] XREF_TABLE
      • STARTXREF

        private static final char[] STARTXREF
      • ENDSTREAM

        private static final byte[] ENDSTREAM
      • ENDOBJ

        private static final byte[] ENDOBJ
      • strmBuf

        private final byte[] strmBuf
      • keyStoreInputStream

        private java.io.InputStream keyStoreInputStream
      • password

        private java.lang.String password
      • keyAlias

        private java.lang.String keyAlias
      • SYSPROP_EOFLOOKUPRANGE

        public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
        The range within which the %%EOF marker is searched for. Useful if there are additional characters after %%EOF within the PDF.
        See Also:
        Constant Field Values
      • DEFAULT_TRAIL_BYTECOUNT

        private static final int DEFAULT_TRAIL_BYTECOUNT
        How many trailing bytes to read for EOF marker.
        See Also:
        Constant Field Values
      • EOF_MARKER

        protected static final char[] EOF_MARKER
        EOF-marker.
      • OBJ_MARKER

        protected static final char[] OBJ_MARKER
        obj-marker.
      • fileLen

        protected long fileLen
        file length.
      • isLenient

        private boolean isLenient
        Whether the parser uses its auto-healing capability.
      • initialParseDone

        protected boolean initialParseDone
      • trailerWasRebuild

        private boolean trailerWasRebuild
      • decompressedObjects

        private final java.util.Map<java.lang.Long,​java.util.Map<COSObjectKey,​COSBase>> decompressedObjects
        Intermediate cache. Contains all objects of already read compressed object streams. Objects are removed after dereferencing them.
      • readTrailBytes

        private int readTrailBytes
        How many trailing bytes to read when searching for the EOF marker.
      • LOG

        private static final org.apache.commons.logging.Log LOG
      • xrefTrailerResolver

        protected XrefTrailerResolver xrefTrailerResolver
        Collects all xref/trailer objects and resolves them into a single object using the startxref reference.
    • Constructor Detail

      • COSParser

        public COSParser​(RandomAccessRead source)
                  throws java.io.IOException
        Default constructor.
        Parameters:
        source - input representing the pdf.
        Throws:
        java.io.IOException - if something went wrong
      • COSParser

        public COSParser​(RandomAccessRead source,
                         java.lang.String password,
                         java.io.InputStream keyStore,
                         java.lang.String keyAlias)
                  throws java.io.IOException
        Constructor for encrypted pdfs.
        Parameters:
        source - input representing the pdf.
        password - password to be used for decryption.
        keyStore - key store to be used for decryption when using public key security
        keyAlias - alias to be used for decryption when using public key security
        Throws:
        java.io.IOException - if the source data could not be read
      • COSParser

        public COSParser​(RandomAccessRead source,
                         java.lang.String password,
                         java.io.InputStream keyStore,
                         java.lang.String keyAlias,
                         RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction)
                  throws java.io.IOException
        Constructor for encrypted pdfs.
        Parameters:
        source - input representing the pdf.
        password - password to be used for decryption.
        keyStore - key store to be used for decryption when using public key security
        keyAlias - alias to be used for decryption when using public key security
        streamCacheCreateFunction - a function to create an instance of the stream cache
        Throws:
        java.io.IOException - if the source data could not be read
    • Method Detail

      • setEOFLookupRange

        public void setEOFLookupRange​(int byteCount)
        Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

        The new value must be at least 16. In practice, however, it should not be lower than 1000; even 2000 has proved insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.

        If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overridden later.

        Parameters:
        byteCount - number of trailing bytes
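        The validation described above can be sketched roughly as follows. This is a minimal illustration, not the PDFBox implementation: the class name EofLookupRangeSketch, the property key example.eofLookupRange and the default of 2048 bytes are assumptions, and the sketch assumes that values below 16 are silently ignored.

```java
final class EofLookupRangeSketch {
    // Hypothetical property key and default value, chosen for illustration only.
    static final String SYSPROP = "example.eofLookupRange";
    static final int DEFAULT_TRAIL_BYTECOUNT = 2048;
    static final int MIN_TRAIL_BYTECOUNT = 16;

    private int readTrailBytes = DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // Apply the system property once on initialization, if it holds a valid integer.
        String value = System.getProperty(SYSPROP);
        if (value != null) {
            try {
                setEOFLookupRange(Integer.parseInt(value.trim()));
            } catch (NumberFormatException e) {
                // Keep the default when the property is not a number.
            }
        }
    }

    // Assumption: values below the minimum leave the current setting unchanged.
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= MIN_TRAIL_BYTECOUNT) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

        A later call to setEOFLookupRange still overrides whatever the system property supplied, which matches the order described above.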
      • retrieveTrailer

        protected COSDictionary retrieveTrailer()
                                         throws java.io.IOException
        Read the trailer information and provide a COSDictionary containing the trailer information.
        Returns:
        a COSDictionary containing the trailer information
        Throws:
        java.io.IOException - if something went wrong
      • resetTrailerResolver

        protected boolean resetTrailerResolver()
        Indicates whether the xref trailer resolver should be reset or not. Should be overridden if the xref trailer resolver is needed after the initial parsing.
        Returns:
        true if the xref trailer resolver should be reset
      • parseXref

        private COSDictionary parseXref​(long startXRefOffset)
                                 throws java.io.IOException
        Parses cross reference tables.
        Parameters:
        startXRefOffset - start offset of the first table
        Returns:
        the trailer dictionary
        Throws:
        java.io.IOException - if something went wrong
      • parseXrefObjStream

        private long parseXrefObjStream​(long objByteOffset,
                                        boolean isStandalone)
                                 throws java.io.IOException
        Parses an xref object stream starting with indirect object id.
        Returns:
        value of PREV item in dictionary or -1 if no such item exists
        Throws:
        java.io.IOException
      • getStartxrefOffset

        private long getStartxrefOffset()
                                 throws java.io.IOException
        Looks for and parses startxref. We first look for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or within the range set via setEOFLookupRange(int)) and then search backwards to find startxref.
        Returns:
        the offset of StartXref
        Throws:
        java.io.IOException - If something went wrong.
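        The lookup order described above (last '%%EOF' in the trailing bytes, then the 'startxref' that precedes it) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual private method: StartxrefLookupSketch and findStartxrefOffset are hypothetical names, the whole file is assumed to be available as a byte array, and the offset is assumed to fit into a long.

```java
import java.nio.charset.StandardCharsets;

final class StartxrefLookupSketch {
    /** Returns the offset written after the last 'startxref' preceding the last '%%EOF', or -1. */
    static long findStartxrefOffset(byte[] file, int trailByteCount) {
        int tailStart = Math.max(0, file.length - trailByteCount);
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte indices.
        String tail = new String(file, tailStart, file.length - tailStart,
                StandardCharsets.ISO_8859_1);
        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1; // no EOF marker within the lookup range
        }
        int marker = tail.lastIndexOf("startxref", eof);
        if (marker < 0) {
            return -1; // no startxref before the EOF marker
        }
        int pos = marker + "startxref".length();
        while (pos < eof && Character.isWhitespace(tail.charAt(pos))) {
            pos++;
        }
        int digitsStart = pos;
        while (pos < eof && tail.charAt(pos) >= '0' && tail.charAt(pos) <= '9') {
            pos++;
        }
        if (pos == digitsStart) {
            return -1; // keyword found, but no number follows it
        }
        return Long.parseLong(tail.substring(digitsStart, pos));
    }
}
```

        Trailing garbage after '%%EOF' is tolerated only while the marker still lies inside the last trailByteCount bytes, which is why a larger lookup range helps with damaged files.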
      • lastIndexOf

        protected int lastIndexOf​(char[] pattern,
                                  byte[] buf,
                                  int endOff)
        Searches for the last occurrence of pattern within the buffer. The lookup starts just before endOff and goes backwards until offset 0.
        Parameters:
        pattern - pattern to search for
        buf - buffer to search pattern in
        endOff - offset (exclusive) where lookup starts at
        Returns:
        start offset of pattern within buffer or -1 if pattern could not be found
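        A backward search with this contract could look like the sketch below. It is a straightforward illustration written for this page, not the PDFBox code; the class name LastIndexOfSketch is hypothetical, and clamping endOff to the buffer length is an added safety assumption.

```java
final class LastIndexOfSketch {
    /** Returns the start offset of the last match that ends before endOff, or -1. */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        if (pattern.length == 0) {
            return -1; // nothing to search for
        }
        // Assumption: an endOff beyond the buffer is clamped to the buffer length.
        int end = Math.min(endOff, buf.length);
        for (int start = end - pattern.length; start >= 0; start--) {
            boolean match = true;
            for (int i = 0; i < pattern.length; i++) {
                if (buf[start + i] != (byte) pattern[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

        Because endOff is exclusive, passing the start offset of a previous match as endOff continues the search with the next earlier occurrence.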
      • isLenient

        public boolean isLenient()
        Returns true if the parser is lenient, meaning that the auto-healing capabilities of the parser are used.
        Returns:
        true if parser is lenient
      • setLenient

        protected void setLenient​(boolean lenient)
        Change the parser leniency flag. This method can only be called before the parsing of the file.
        Parameters:
        lenient - try to handle malformed PDFs.
      • dereferenceCOSObject

        public COSBase dereferenceCOSObject​(COSObject obj)
                                     throws java.io.IOException
        Description copied from interface: ICOSParser
        Dereference the COSBase object which is referenced by the given COSObject.
        Specified by:
        dereferenceCOSObject in interface ICOSParser
        Parameters:
        obj - the COSObject which references the COSBase object to be dereferenced.
        Returns:
        the referenced object
        Throws:
        java.io.IOException - if something went wrong when dereferencing the COSBase object
      • createRandomAccessReadView

        public RandomAccessReadView createRandomAccessReadView​(long startPosition,
                                                               long streamLength)
                                                        throws java.io.IOException
        Description copied from interface: ICOSParser
        Creates a random access read view starting at the given position with the given length.
        Specified by:
        createRandomAccessReadView in interface ICOSParser
        Parameters:
        startPosition - start position within the underlying random access read
        streamLength - stream length
        Returns:
        the random access read view
        Throws:
        java.io.IOException - if something went wrong when creating the view for the RandomAccessRead
      • parseObjectDynamically

        protected COSBase parseObjectDynamically​(COSObjectKey objKey,
                                                 boolean requireExistingNotCompressedObj)
                                          throws java.io.IOException
        Parse the object for the given object key.
        Parameters:
        objKey - key of object to be parsed
        requireExistingNotCompressedObj - if true, the object to be parsed must be defined in the xref (note: null objects may be missing from the xref) and it must not be a compressed object within an object stream (this is used to avoid getting stuck in a loop in a malicious PDF)
        Returns:
        the parsed object (which is also added to document object)
        Throws:
        java.io.IOException - If an IO error occurs.
      • getObjectOffset

        private java.lang.Long getObjectOffset​(COSObjectKey objKey,
                                               boolean requireExistingNotCompressedObj)
                                        throws java.io.IOException
        Throws:
        java.io.IOException
      • parseFileObject

        private COSBase parseFileObject​(java.lang.Long objOffset,
                                        COSObjectKey objKey)
                                 throws java.io.IOException
        Throws:
        java.io.IOException
      • parseObjectStreamObject

        protected COSBase parseObjectStreamObject​(long objstmObjNr,
                                                  COSObjectKey key)
                                           throws java.io.IOException
        Parse the object with the given key from the object stream with the given number.
        Parameters:
        objstmObjNr - the object number of the object stream
        key - the key of the object to be parsed
        Returns:
        the parsed object
        Throws:
        java.io.IOException - if something went wrong when parsing the object
      • getLength

        private COSNumber getLength​(COSBase lengthBaseObj)
                             throws java.io.IOException
        Returns length value referred to or defined in given object.
        Throws:
        java.io.IOException
      • parseCOSStream

        protected COSStream parseCOSStream​(COSDictionary dic)
                                    throws java.io.IOException
        This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy the stream data without testing for 'endstream' or 'endobj', so it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data has been read.
        Parameters:
        dic - dictionary that goes with this stream.
        Returns:
        parsed pdf stream.
        Throws:
        java.io.IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
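        The length-based strategy described above can be illustrated with the following sketch. It is not the PDFBox implementation: LengthBasedStreamSketch and readStreamData are hypothetical names, the file is assumed to be held in a byte array, and the length is assumed to be already resolved to a plain integer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthBasedStreamSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Copies exactly 'length' bytes from 'dataStart', then requires the 'endstream' keyword. */
    static byte[] readStreamData(byte[] file, int dataStart, int length) throws IOException {
        if (length < 0 || dataStart < 0 || dataStart > file.length - length) {
            throw new IOException("stream too short for declared length " + length);
        }
        // The data is copied blindly; keywords inside it are never inspected.
        byte[] data = Arrays.copyOfRange(file, dataStart, dataStart + length);
        int pos = dataStart + length;
        // Skip the end-of-line bytes that may separate the data from the keyword.
        while (pos < file.length && (file[pos] == '\r' || file[pos] == '\n' || file[pos] == ' ')) {
            pos++;
        }
        if (!matchesAt(file, pos, ENDSTREAM)) {
            throw new IOException("'endstream' expected at offset " + pos);
        }
        return data;
    }

    private static boolean matchesAt(byte[] buf, int pos, byte[] keyword) {
        if (pos > buf.length - keyword.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (buf[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }
}
```

        With a correct length, stream data that itself contains the text 'endstream' is read without problems; with a wrong length the check after the data fails and an IOException is raised.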
      • readUntilEndStream

        private long readUntilEndStream​(EndstreamFilterStream out)
                                 throws java.io.IOException
        This method will read through the current stream object until we find the keyword "endstream", meaning we're at the end of this object. Some PDF files, however, omit the "endstream" keyword and close the object with just "endobj", so this case is handled as well. This method is optimized using buffered IO and a reduced number of byte compare operations.
        Parameters:
        out - stream we write out to.
        Throws:
        java.io.IOException - if something went wrong
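        The fallback described above (stop at "endstream", or at "endobj" when "endstream" is missing) can be sketched as a plain forward scan. This is a deliberately simple illustration without the buffering and comparison optimizations mentioned above; EndStreamScanSketch and findStreamEnd are hypothetical names, and the data is assumed to be available as a byte array.

```java
import java.nio.charset.StandardCharsets;

final class EndStreamScanSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns the offset of the first 'endstream' or 'endobj' at or after 'from', or -1. */
    static int findStreamEnd(byte[] buf, int from) {
        for (int pos = Math.max(0, from); pos < buf.length; pos++) {
            if (matchesAt(buf, pos, ENDSTREAM) || matchesAt(buf, pos, ENDOBJ)) {
                return pos;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] buf, int pos, byte[] keyword) {
        if (pos > buf.length - keyword.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (buf[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }
}
```

        Unlike the length-based read, this scan stops at the first keyword it sees, so it is only suitable as a fallback when the declared length cannot be trusted.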
      • validateStreamLength

        private boolean validateStreamLength​(long streamLength)
                                      throws java.io.IOException
        Throws:
        java.io.IOException
      • checkXRefOffset

        private long checkXRefOffset​(long startXRefOffset)
                              throws java.io.IOException
        Check if the cross reference table/stream can be found at the current offset.
        Parameters:
        startXRefOffset - the expected start offset of the cross reference table/stream
        Returns:
        the revised offset
        Throws:
        java.io.IOException
      • checkXRefStreamOffset

        private boolean checkXRefStreamOffset​(long startXRefOffset)
                                       throws java.io.IOException
        Check if the cross reference stream can be found at the current offset.
        Parameters:
        startXRefOffset - the expected start offset of the XRef stream
        Returns:
        true if a cross reference stream was found at the expected offset
        Throws:
        java.io.IOException - if something went wrong
      • calculateXRefFixedOffset

        private long calculateXRefFixedOffset​(long objectOffset)
                                       throws java.io.IOException
        Try to find a fixed offset for the given xref table/stream.
        Parameters:
        objectOffset - the given offset where to look at
        Returns:
        the fixed offset
        Throws:
        java.io.IOException - if something went wrong
      • validateXrefOffsets

        private boolean validateXrefOffsets​(java.util.Map<COSObjectKey,​java.lang.Long> xrefOffset)
                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • checkXrefOffsets

        private void checkXrefOffsets()
                               throws java.io.IOException
        Check the XRef table by dereferencing all objects and fixing the offset if necessary.
        Throws:
        java.io.IOException - if something went wrong.
      • findObjectKey

        private COSObjectKey findObjectKey​(COSObjectKey objectKey,
                                           long offset,
                                           java.util.Map<COSObjectKey,​java.lang.Long> xrefOffset)
                                    throws java.io.IOException
        Check if the given object can be found at the given offset. Returns the provided object key if everything is ok. If the generation number differs it will be fixed and a new object key is returned.
        Parameters:
        objectKey - the key of object we are looking for
        offset - the offset where to look
        xrefOffset - a map with all known xref entries
        Returns:
        the found/fixed object key
        Throws:
        java.io.IOException - if something went wrong
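        The check described above can be sketched as follows: an object header of the form "objNr genNr obj" is expected at the given offset, and a differing generation number is taken from the file. This is a simplified illustration, not the actual private method: ObjectKeyCheckSketch is a hypothetical name, the key is represented as a two-element long array instead of a COSObjectKey, the xrefOffset map is left out, and returning null for a mismatching object number is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ObjectKeyCheckSketch {
    // An object header consists of object number, generation number and the keyword 'obj'.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(\\d{1,10})\\s+(\\d{1,5})\\s+obj");

    /** Returns {objNr, genNr} as found in the file, or null if the object is not at 'offset'. */
    static long[] findObjectKey(long objNr, int genNr, byte[] file, int offset) {
        if (offset < 0 || offset >= file.length) {
            return null;
        }
        int len = Math.min(64, file.length - offset);
        String head = new String(file, offset, len, StandardCharsets.ISO_8859_1);
        Matcher m = OBJ_HEADER.matcher(head);
        if (!m.lookingAt()) {
            return null; // no object header at this offset
        }
        long foundNr = Long.parseLong(m.group(1));
        int foundGen = Integer.parseInt(m.group(2));
        if (foundNr != objNr) {
            return null; // assumption: a different object number counts as "not found"
        }
        // If foundGen differs from genNr, the returned key carries the fixed generation.
        return new long[] { foundNr, foundGen };
    }
}
```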
      • getBruteForceParser

        private BruteForceParser getBruteForceParser()
                                              throws java.io.IOException
        Throws:
        java.io.IOException
      • checkPages

        protected void checkPages​(COSDictionary root)
                           throws java.io.IOException
        Check if all entries of the pages dictionary are present. Those which can't be dereferenced are removed.
        Parameters:
        root - the root dictionary of the pdf
        Throws:
        java.io.IOException - if the page tree root is null
      • checkPagesDictionary

        private int checkPagesDictionary​(COSDictionary pagesDict,
                                         java.util.Set<COSObject> set)
      • parseStartXref

        private long parseStartXref()
                             throws java.io.IOException
        This will parse the startxref section from the stream and return the startxref value.
        Returns:
        the startxref value or -1 on parsing error
        Throws:
        java.io.IOException - If an IO error occurs.
      • isString

        private boolean isString​(byte[] string)
                          throws java.io.IOException
        Checks if the given string can be found at the current offset.
        Parameters:
        string - the bytes of the string to look for
        Returns:
        true if the bytes are in place, false if not
        Throws:
        java.io.IOException - if something went wrong
      • isString

        protected boolean isString​(char[] string)
                            throws java.io.IOException
        Checks if the given string can be found at the current offset.
        Parameters:
        string - the characters of the string to look for
        Returns:
        true if the bytes are in place, false if not
        Throws:
        java.io.IOException - if something went wrong
      • parseTrailer

        private boolean parseTrailer()
                              throws java.io.IOException
        This will parse the trailer from the stream and add it to the state.
        Returns:
        false on parsing error
        Throws:
        java.io.IOException - If an IO error occurs.
      • parsePDFHeader

        protected boolean parsePDFHeader()
                                  throws java.io.IOException
        Parse the header of a PDF.
        Returns:
        true if a PDF header was found
        Throws:
        java.io.IOException - if something went wrong
      • parseFDFHeader

        protected boolean parseFDFHeader()
                                  throws java.io.IOException
        Parse the header of an FDF.
        Returns:
        true if an FDF header was found
        Throws:
        java.io.IOException - if something went wrong
      • parseHeader

        private boolean parseHeader​(java.lang.String headerMarker,
                                    java.lang.String defaultVersion)
                             throws java.io.IOException
        Throws:
        java.io.IOException
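        The two parameters suggest that the PDF and FDF header checks share one routine that looks for a marker and falls back to a default version. A rough sketch of that idea is shown below; it is an assumption about the general approach, not the actual private method. HeaderParseSketch and parseHeaderVersion are hypothetical names, the markers "%PDF-" and "%FDF-" follow the respective file formats, and the default versions used in the usage note are placeholders.

```java
final class HeaderParseSketch {
    /**
     * Returns the version found after the marker, the default version if the
     * version is malformed, or null if the marker is missing from the line.
     */
    static String parseHeaderVersion(String firstLine, String headerMarker, String defaultVersion) {
        int idx = firstLine.indexOf(headerMarker);
        if (idx < 0) {
            return null; // no header found
        }
        // Garbage before the marker is tolerated; only the text after it is inspected.
        String rest = firstLine.substring(idx + headerMarker.length());
        if (rest.length() >= 3
                && isAsciiDigit(rest.charAt(0))
                && rest.charAt(1) == '.'
                && isAsciiDigit(rest.charAt(2))) {
            return rest.substring(0, 3);
        }
        return defaultVersion;
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

        For example, parseHeaderVersion("%PDF-1.7", "%PDF-", "1.4") yields "1.7", while a line such as "%PDF-xyz" yields the supplied default.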
      • parseXrefTable

        protected boolean parseXrefTable​(long startByteOffset)
                                  throws java.io.IOException
        This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
        Parameters:
        startByteOffset - the offset to start at
        Returns:
        false on parsing error
        Throws:
        java.io.IOException - If an IO error occurs.
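        For orientation, a classic xref table consists of the keyword 'xref', one or more subsections headed by "firstObjectNumber count", and one entry per object made up of a 10-digit byte offset, a 5-digit generation number and the flag 'n' (in use) or 'f' (free). The sketch below parses that layout from a string. It is an illustration only, not the PDFBox method: XrefTableSketch is a hypothetical name, and it returns a map of in-use offsets (or null on a parsing error) instead of updating parser state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class XrefTableSketch {
    /** Returns object number -> byte offset for in-use entries, or null on a parsing error. */
    static Map<Long, Long> parseXrefTable(String section) {
        String[] lines = section.trim().split("\\r?\\n|\\r");
        if (!lines[0].trim().equals("xref")) {
            return null;
        }
        Map<Long, Long> offsets = new LinkedHashMap<>();
        int i = 1;
        try {
            // Each loop iteration handles one subsection; the table ends at 'trailer'.
            while (i < lines.length && !lines[i].trim().equals("trailer")) {
                String[] header = lines[i].trim().split("\\s+");
                if (header.length != 2) {
                    return null;
                }
                long first = Long.parseLong(header[0]);
                int count = Integer.parseInt(header[1]);
                i++;
                for (int k = 0; k < count; k++, i++) {
                    if (i >= lines.length) {
                        return null; // fewer entries than announced
                    }
                    String[] entry = lines[i].trim().split("\\s+");
                    if (entry.length != 3) {
                        return null;
                    }
                    long offset = Long.parseLong(entry[0]);
                    Integer.parseInt(entry[1]); // generation: validated, not stored in this sketch
                    if (entry[2].equals("n")) {
                        offsets.put(first + k, offset);
                    } else if (!entry[2].equals("f")) {
                        return null;
                    }
                }
            }
        } catch (NumberFormatException e) {
            return null;
        }
        return offsets;
    }
}
```

        Free entries are skipped because they do not point at an object in the file; only the in-use offsets are needed to locate objects later.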
      • getEncryption

        protected PDEncryption getEncryption()
                                      throws java.io.IOException
        This will get the encryption dictionary. The document must be parsed before this is called.
        Returns:
        The encryption dictionary of the document that was parsed.
        Throws:
        java.io.IOException - If there is an error getting the document.
      • getAccessPermission

        protected AccessPermission getAccessPermission()
                                                throws java.io.IOException
        This will get the AccessPermission. The document must be parsed before this is called.
        Returns:
        The access permission of the document that was parsed.
        Throws:
        java.io.IOException - If there is an error getting the document.
      • prepareDecryption

        protected void prepareDecryption()
                                  throws java.io.IOException
        Prepare for decryption.
        Throws:
        InvalidPasswordException - If the password is incorrect.
        java.io.IOException - if something went wrong