Class CompoundCharacterTokenizer


  • public class CompoundCharacterTokenizer
    extends java.lang.Object
    Takes text that contains compound glyphs to substitute and splits it into chunks: the parts that should be substituted and the parts that can be processed normally.
    • Field Detail

      • GLYPH_ID_SEPARATOR

        private static final java.lang.String GLYPH_ID_SEPARATOR
        See Also:
        Constant Field Values
      • regexExpression

        private final java.util.regex.Pattern regexExpression
    • Constructor Detail

      • CompoundCharacterTokenizer

        public CompoundCharacterTokenizer(java.util.Set<java.lang.String> compoundWords)
        Constructor. Calls getRegexFromTokens, which returns a string like (_79_99_)|(_80_99_)|(_92_99_), then compiles that string into the pattern assigned to regexExpression. See the code in GlyphArraySplitterRegexImpl for how the compound-word strings are created.

        The compound words are assumed to be sorted in descending order of length, so that longer alternatives appear in the regular expression before shorter ones.

        Parameters:
        compoundWords - A set of strings like _79_99_, _80_99_ or _92_99_ .
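The regex construction described for this constructor can be sketched as follows. This is a minimal illustration, not the library's code: the class name CompoundRegexSketch and the helper regexFromTokens are hypothetical, and the sketch assumes the compound words contain only digits and "_", so no regex escaping is needed.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not part of the documented API.
class CompoundRegexSketch {

    // Joins the compound words into one alternation: "(w1)|(w2)|(w3)".
    // Iteration order is preserved, so the caller is expected to pass the
    // words already sorted by descending length.
    static String regexFromTokens(Set<String> compoundWords) {
        StringJoiner joiner = new StringJoiner(")|(", "(", ")");
        for (String word : compoundWords) {
            joiner.add(word);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order, unlike HashSet.
        Set<String> words = new LinkedHashSet<>();
        words.add("_79_99_");
        words.add("_80_99_");
        words.add("_92_99_");

        String regex = regexFromTokens(words);
        Pattern pattern = Pattern.compile(regex);
        System.out.println(regex);
        System.out.println(pattern.matcher("_66_80_99_").find());
    }
}
```

Java tries the alternatives of a regular expression from left to right, which is why the order of the input set matters in a sketch like this one.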
      • CompoundCharacterTokenizer

        @Deprecated
        public CompoundCharacterTokenizer(java.util.regex.Pattern pattern)
        Constructor.
        Parameters:
        pattern -
    • Method Detail

      • validateCompoundWords

        private void validateCompoundWords(java.util.Set<java.lang.String> compoundWords)
        Validates the compound words. They must not be null or empty, and each must start and end with GLYPH_ID_SEPARATOR.
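A check of this kind might look like the sketch below. Several details are assumptions rather than documented behavior: the separator value "_" is inferred from the examples on this page, the choice of IllegalArgumentException is a guess, and the class and method names are hypothetical.

```java
import java.util.Set;

// Hypothetical sketch; not the library's implementation.
class CompoundWordValidationSketch {

    // Assumed value, inferred from examples such as "_79_99_".
    static final String SEPARATOR = "_";

    // Rejects a null or empty set, and any word that does not both
    // start and end with the separator.
    static void validate(Set<String> compoundWords) {
        if (compoundWords == null || compoundWords.isEmpty()) {
            throw new IllegalArgumentException("Compound words must not be null or empty");
        }
        for (String word : compoundWords) {
            if (word == null || !word.startsWith(SEPARATOR) || !word.endsWith(SEPARATOR)) {
                throw new IllegalArgumentException(
                        "Compound word must start and end with " + SEPARATOR + ": " + word);
            }
        }
    }
}
```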
      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String text)
        Tokenize a string into tokens.
        Parameters:
        text - A string like "_66_71_71_74_79_70_"
        Returns:
        A list of tokens like "_66_", "_71_71_", "74_79_70_". The "_" is sometimes missing at the beginning or end of a token; the caller has to clean this up.
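The splitting behavior, and the cleanup left to the caller, can be illustrated with a simplified sketch. It is an assumption-laden stand-in, not the actual tokenize implementation: the class TokenizeSketch and both helpers are hypothetical, and the exact placement of "_" at token boundaries may differ from what the real method returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not part of the documented API.
class TokenizeSketch {

    // Simplified splitter: emits each unmatched run of text and each
    // regex match, in the order they occur in the input.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        int last = 0;
        while (matcher.find()) {
            if (matcher.start() > last) {
                tokens.add(text.substring(last, matcher.start()));
            }
            tokens.add(matcher.group());
            last = matcher.end();
        }
        if (last < text.length()) {
            tokens.add(text.substring(last));
        }
        return tokens;
    }

    // Caller-side cleanup: extracts the glyph IDs from a token whether or
    // not it has a leading or trailing "_".
    static List<Integer> glyphIds(String token) {
        List<Integer> ids = new ArrayList<>();
        for (String part : token.split("_")) {
            if (!part.isEmpty()) {
                ids.add(Integer.valueOf(part));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(_71_71_)");
        for (String token : tokenize(pattern, "_66_71_71_74_79_70_")) {
            System.out.println(token + " -> " + glyphIds(token));
        }
    }
}
```

With the single compound word _71_71_, this sketch splits "_66_71_71_74_79_70_" into "_66", "_71_71_" and "74_79_70_". Parsing each token by splitting on "_" and skipping empty parts makes the caller indifferent to whether a separator is present at either end.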
      • getRegexFromTokens

        private java.lang.String getRegexFromTokens(java.util.Set<java.lang.String> compoundWords)