Package org.apache.fontbox.ttf.gsub
Class CompoundCharacterTokenizer
- java.lang.Object
-
- org.apache.fontbox.ttf.gsub.CompoundCharacterTokenizer
-
public class CompoundCharacterTokenizer extends java.lang.Object
Takes text containing compound glyphs to substitute and splits it into chunks: the parts that should be substituted and the parts that can be processed normally.
-
-
Field Summary
Fields
Modifier and Type, Field
private static java.lang.String GLYPH_ID_SEPARATOR
private java.util.regex.Pattern regexExpression
-
Constructor Summary
Constructors
Constructor, Description
CompoundCharacterTokenizer(java.util.regex.Pattern pattern) Deprecated.
CompoundCharacterTokenizer(java.util.Set<java.lang.String> compoundWords) Constructor.
-
Method Summary
All Methods, Instance Methods, Concrete Methods
Modifier and Type, Method, Description
private java.lang.String getRegexFromTokens(java.util.Set<java.lang.String> compoundWords)
java.util.List<java.lang.String> tokenize(java.lang.String text) Tokenize a string into tokens.
private void validateCompoundWords(java.util.Set<java.lang.String> compoundWords) Validate the compound words.
-
-
-
Field Detail
-
GLYPH_ID_SEPARATOR
private static final java.lang.String GLYPH_ID_SEPARATOR
- See Also:
- Constant Field Values
-
regexExpression
private final java.util.regex.Pattern regexExpression
-
-
Constructor Detail
-
CompoundCharacterTokenizer
public CompoundCharacterTokenizer(java.util.Set<java.lang.String> compoundWords)
Constructor. Calls getRegexFromTokens, which returns a string like (_79_99_)|(_80_99_)|(_92_99_), and compiles it into the regular expression assigned to regexExpression. See the code in GlyphArraySplitterRegexImpl for how these strings are created. The compound words are assumed to be sorted in descending order of length.
- Parameters:
compoundWords - A set of strings like _79_99_, _80_99_ or _92_99_.
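The regular expression described above can be sketched as follows. This is a minimal illustration, not the library's source: the class name RegexFromTokensSketch and the method regexFromTokens are hypothetical, and the joining strategy is inferred only from the example string (_79_99_)|(_80_99_)|(_92_99_).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual CompoundCharacterTokenizer code.
final class RegexFromTokensSketch
{
    // Wraps each compound word in a group and joins the groups with "|".
    // The words are not quoted because they are assumed to contain only
    // digits and underscores, which are not regex metacharacters.
    static String regexFromTokens(Set<String> compoundWords)
    {
        StringJoiner joiner = new StringJoiner(")|(", "(", ")");
        for (String word : compoundWords)
        {
            joiner.add(word);
        }
        return joiner.toString();
    }

    public static void main(String[] args)
    {
        // A LinkedHashSet keeps insertion order, so the caller controls
        // the order of the alternatives.
        Set<String> words = new LinkedHashSet<>(
                Arrays.asList("_79_99_", "_80_99_", "_92_99_"));
        String regex = regexFromTokens(words);
        System.out.println(regex); // prints (_79_99_)|(_80_99_)|(_92_99_)
        Pattern pattern = Pattern.compile(regex);
        System.out.println(pattern.matcher("_80_99_").matches()); // prints true
    }
}
```

The ordering assumption matters because java.util.regex tries alternatives from left to right and takes the first one that matches at a given position. If a shorter word were listed before a longer word that starts with the same glyph IDs, the shorter one would win.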
-
CompoundCharacterTokenizer
@Deprecated public CompoundCharacterTokenizer(java.util.regex.Pattern pattern)
Deprecated.
Constructor.
- Parameters:
pattern -
-
-
Method Detail
-
validateCompoundWords
private void validateCompoundWords(java.util.Set<java.lang.String> compoundWords)
Validate the compound words. They must not be null or empty, and each must start and end with the GLYPH_ID_SEPARATOR.
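A check of this kind might look like the sketch below. It rests on several assumptions: the separator value "_" is inferred from the examples on this page (the constant's value is not shown here), the exception type is a guess, and the names ValidateCompoundWordsSketch and validate are hypothetical.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch; not the actual CompoundCharacterTokenizer code.
final class ValidateCompoundWordsSketch
{
    // Assumed value, inferred from examples such as _79_99_.
    private static final String GLYPH_ID_SEPARATOR = "_";

    // Rejects a null or empty set, and any word that does not both
    // start and end with the separator.
    static void validate(Set<String> compoundWords)
    {
        if (compoundWords == null || compoundWords.isEmpty())
        {
            throw new IllegalArgumentException(
                    "Compound words must not be null or empty");
        }
        for (String word : compoundWords)
        {
            if (word == null
                    || !word.startsWith(GLYPH_ID_SEPARATOR)
                    || !word.endsWith(GLYPH_ID_SEPARATOR))
            {
                throw new IllegalArgumentException(
                        "Compound word must start and end with "
                        + GLYPH_ID_SEPARATOR + ": " + word);
            }
        }
    }

    public static void main(String[] args)
    {
        validate(Collections.singleton("_79_99_")); // passes silently
        try
        {
            validate(Collections.singleton("79_99_")); // leading "_" missing
        }
        catch (IllegalArgumentException e)
        {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```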
-
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String text)
Tokenize a string into tokens.
- Parameters:
text - A string like "_66_71_71_74_79_70_"
- Returns:
A list of tokens like "_66_", "_71_71_", "74_79_70_". The "_" is sometimes missing at the beginning or end; the caller has to clean this up.
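To show why separators can go missing at token boundaries, here is a minimal sketch of regex-based splitting. It assumes the tokenizer walks the matches of the compiled pattern and emits the unmatched text between them as ordinary tokens; the class TokenizeSketch and its static tokenize method are hypothetical, and the single compound word _71_71_ is chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual CompoundCharacterTokenizer code.
final class TokenizeSketch
{
    // Emits each regex match as one token and each non-empty stretch of
    // text between matches (or after the last match) as another token.
    static List<String> tokenize(Pattern pattern, String text)
    {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        int endOfPreviousMatch = 0;
        while (matcher.find())
        {
            String gap = text.substring(endOfPreviousMatch, matcher.start());
            if (!gap.isEmpty())
            {
                tokens.add(gap);
            }
            tokens.add(matcher.group());
            endOfPreviousMatch = matcher.end();
        }
        String tail = text.substring(endOfPreviousMatch);
        if (!tail.isEmpty())
        {
            tokens.add(tail);
        }
        return tokens;
    }

    public static void main(String[] args)
    {
        Pattern pattern = Pattern.compile("(_71_71_)");
        System.out.println(tokenize(pattern, "_66_71_71_74_79_70_"));
        // prints [_66, _71_71_, 74_79_70_]
    }
}
```

In this sketch the match consumes the separators on both of its sides, so the neighbouring tokens lose them: the token before the match has no trailing "_" and the token after it has no leading "_". That is the kind of cleanup left to the caller.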
-
getRegexFromTokens
private java.lang.String getRegexFromTokens(java.util.Set<java.lang.String> compoundWords)
-
-