Class DefaultICUTokenizerConfig
- java.lang.Object
  - org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
    - org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig
public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary (for CJK and Myanmar, only when the cjkAsWords and myanmarAsWords constructor arguments are true).
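To make the UAX#29 word-boundary idea concrete, here is a minimal sketch. It assumes the JDK's java.text.BreakIterator as a stand-in for ICU's com.ibm.icu.text.BreakIterator, so its rules may differ from ICU's in detail. The class name WordBreakSketch and the helper words are hypothetical and not part of Lucene.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /** Splits text at word boundaries and keeps only segments that contain a letter or digit. */
    static List<String> words(String text) {
        // Assumption: JDK stand-in for com.ibm.icu.text.BreakIterator.getWordInstance(ULocale.ROOT).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Skip segments made up only of whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

A real tokenizer would also attach a token type and offsets to each segment; this sketch only shows the boundary iteration.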
Field Summary
Fields (Modifier and Type, Field, Description)
- private boolean cjkAsWords
- private static com.ibm.icu.text.BreakIterator cjkBreakIterator
- private static com.ibm.icu.text.RuleBasedBreakIterator defaultBreakIterator
- private boolean myanmarAsWords
- private static com.ibm.icu.text.RuleBasedBreakIterator myanmarSyllableIterator
- static java.lang.String WORD_EMOJI: Token type for words that appear to be emoji sequences
- static java.lang.String WORD_HANGUL: Token type for words containing Korean hangul
- static java.lang.String WORD_HIRAGANA: Token type for words containing Japanese hiragana
- static java.lang.String WORD_IDEO: Token type for words containing ideographic characters
- static java.lang.String WORD_KATAKANA: Token type for words containing Japanese katakana
- static java.lang.String WORD_LETTER: Token type for words that contain letters
- static java.lang.String WORD_NUMBER: Token type for words that appear to be numbers
Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
EMOJI_SEQUENCE_STATUS
Constructor Summary
Constructors (Constructor, Description)
- DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords): Creates a new config.
-
Method Summary
Methods (Modifier and Type, Method, Description)
- boolean combineCJ(): true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
- com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script): Return a breakiterator capable of processing a given script.
- java.lang.String getType(int script, int ruleStatus): Return a token type value for a given script and BreakIterator rule status.
- private static com.ibm.icu.text.RuleBasedBreakIterator readBreakIterator(java.lang.String filename)
Field Detail
-
WORD_IDEO
public static final java.lang.String WORD_IDEO
Token type for words containing ideographic characters
-
WORD_HIRAGANA
public static final java.lang.String WORD_HIRAGANA
Token type for words containing Japanese hiragana
-
WORD_KATAKANA
public static final java.lang.String WORD_KATAKANA
Token type for words containing Japanese katakana
-
WORD_HANGUL
public static final java.lang.String WORD_HANGUL
Token type for words containing Korean hangul
-
WORD_LETTER
public static final java.lang.String WORD_LETTER
Token type for words that contain letters
-
WORD_NUMBER
public static final java.lang.String WORD_NUMBER
Token type for words that appear to be numbers
-
WORD_EMOJI
public static final java.lang.String WORD_EMOJI
Token type for words that appear to be emoji sequences
-
cjkBreakIterator
private static final com.ibm.icu.text.BreakIterator cjkBreakIterator
-
defaultBreakIterator
private static final com.ibm.icu.text.RuleBasedBreakIterator defaultBreakIterator
-
myanmarSyllableIterator
private static final com.ibm.icu.text.RuleBasedBreakIterator myanmarSyllableIterator
-
cjkAsWords
private final boolean cjkAsWords
-
myanmarAsWords
private final boolean myanmarAsWords
Constructor Detail
-
DefaultICUTokenizerConfig
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Creates a new config. This object is lightweight, but the first time the class is referenced, breakiterators will be initialized.
Parameters:
- cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
- myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
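The effect of the two constructor flags can be sketched as a small decision function. This is an illustration of the documented behaviour only, not the library's implementation. It uses the JDK's Character.UnicodeScript in place of ICU's integer script codes, and the names SegmentationChoiceSketch, Strategy, and strategyFor are hypothetical.

```java
public class SegmentationChoiceSketch {

    /** Hypothetical labels for the three behaviours described in the parameter docs. */
    enum Strategy { DICTIONARY_WORDS, UAX29_DEFAULT, SYLLABLES }

    /** Picks a segmentation strategy for a script, given the two constructor flags. */
    static Strategy strategyFor(Character.UnicodeScript script,
                                boolean cjkAsWords,
                                boolean myanmarAsWords) {
        switch (script) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
                // cjkAsWords: dictionary-based words, otherwise UAX#29 defaults.
                return cjkAsWords ? Strategy.DICTIONARY_WORDS : Strategy.UAX29_DEFAULT;
            case MYANMAR:
                // myanmarAsWords: dictionary-based words, otherwise syllables.
                return myanmarAsWords ? Strategy.DICTIONARY_WORDS : Strategy.SYLLABLES;
            case THAI:
            case LAO:
            case KHMER:
                // Per the class description, these scripts are broken into words with a dictionary.
                return Strategy.DICTIONARY_WORDS;
            default:
                return Strategy.UAX29_DEFAULT;
        }
    }

    public static void main(String[] args) {
        // U+65E5 is a Han character; U+1000 is a Myanmar letter.
        System.out.println(strategyFor(Character.UnicodeScript.of(0x65E5), true, false));
        System.out.println(strategyFor(Character.UnicodeScript.of(0x1000), true, false));
    }
}
```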
Method Detail
-
combineCJ
public boolean combineCJ()
Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Specified by:
combineCJ in class ICUTokenizerConfig
-
getBreakIterator
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a breakiterator capable of processing a given script.
Specified by:
getBreakIterator in class ICUTokenizerConfig
-
getType
public java.lang.String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by:
getType in class ICUTokenizerConfig
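One plausible way to map a script and a rule status to a token type is sketched below. It is an assumption about the shape of such a mapping, not the library's actual code. The RuleStatus enum stands in for the integer rule-status values, Character.UnicodeScript stands in for ICU script codes, and the returned strings are just the constant names, since this page does not list their values.

```java
public class TokenTypeSketch {

    /** Hypothetical stand-in for the integer rule status reported by a break iterator. */
    enum RuleStatus { IDEO, KANA, LETTER, NUMBER, EMOJI, NONE }

    /** Returns the name of the token type constant that would apply (placeholder strings). */
    static String typeFor(Character.UnicodeScript script, RuleStatus status) {
        switch (status) {
            case IDEO:
                return "WORD_IDEO";
            case KANA:
                // Kana tokens are split by script into hiragana and katakana.
                return script == Character.UnicodeScript.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
            case LETTER:
                // Hangul gets its own type; every other letter token is a generic word.
                return script == Character.UnicodeScript.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
            case NUMBER:
                return "WORD_NUMBER";
            case EMOJI:
                return "WORD_EMOJI";
            default:
                return "OTHER"; // placeholder for anything unclassified
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(Character.UnicodeScript.HANGUL, RuleStatus.LETTER));
        System.out.println(typeFor(Character.UnicodeScript.LATIN, RuleStatus.NUMBER));
    }
}
```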
-
readBreakIterator
private static com.ibm.icu.text.RuleBasedBreakIterator readBreakIterator(java.lang.String filename)