Package org.apache.lucene.analysis.standard
Fast, general-purpose grammar-based tokenizer
StandardTokenizer
implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29.
Unlike with UAX29URLEmailTokenizer from the analysis module, URLs and email addresses are
not kept as single tokens; instead they are split into
tokens according to the UAX#29 word break rules.
StandardAnalyzer includes
StandardTokenizer,
LowerCaseFilter
and StopFilter.
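That chain can be sketched as a custom Analyzer wiring StandardTokenizer, LowerCaseFilter and StopFilter together by hand. This is a minimal sketch, not StandardAnalyzer itself: the tiny stop set is illustrative, and exact package locations of LowerCaseFilter, StopFilter and CharArraySet vary across Lucene versions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardChainExample {

  // StandardTokenizer -> LowerCaseFilter -> StopFilter: the same chain
  // StandardAnalyzer applies, here with a tiny illustrative stop set.
  static Analyzer newChain() {
    final CharArraySet stops = new CharArraySet(List.of("the", "and"), true);
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream sink = new LowerCaseFilter(source);
        sink = new StopFilter(sink, stops);
        return new TokenStreamComponents(source, sink);
      }
    };
  }

  // Drain an analyzer's token stream into a list of term strings.
  static List<String> analyze(Analyzer analyzer, String text) throws IOException {
    List<String> terms = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream("f", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        terms.add(term.toString());
      }
      ts.end();
    }
    return terms;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(analyze(newChain(), "The Quick AND the Dead"));
  }
}
```

In practice you would use StandardAnalyzer directly and pass it your own stop word set; the hand-built chain above only makes the composition explicit.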
Class Summary

ClassicAnalyzer: Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
ClassicFilter: Normalizes tokens extracted with ClassicTokenizer.
ClassicFilterFactory: Factory for ClassicFilter.
ClassicTokenizer: A grammar-based tokenizer constructed with JFlex.
ClassicTokenizerFactory: Factory for ClassicTokenizer.
ClassicTokenizerImpl: This class implements the classic Lucene StandardTokenizer up until 3.0.
StandardAnalyzer: Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a configurable list of stop words.
StandardTokenizer: A grammar-based tokenizer constructed with JFlex.
StandardTokenizerFactory: Factory for StandardTokenizer.
StandardTokenizerImpl: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
UAX29URLEmailAnalyzer: Filters UAX29URLEmailTokenizer with LowerCaseFilter and StopFilter, using a list of English stop words.
UAX29URLEmailTokenizer: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs.
UAX29URLEmailTokenizerFactory: Factory for UAX29URLEmailTokenizer.
UAX29URLEmailTokenizerImpl: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs.
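The difference between the two tokenizers can be seen by running both over text containing an email address. This is a hedged sketch: it assumes UAX29URLEmailTokenizer is importable from this package (in newer Lucene releases it has moved, e.g. to an email-specific package), and the exact token splits StandardTokenizer produces follow the UAX#29 rules rather than any guarantee made here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EmailTokenizationExample {

  // Collect the terms a tokenizer produces for the given text.
  static List<String> tokenize(Tokenizer tokenizer, String text) throws IOException {
    tokenizer.setReader(new StringReader(text));
    List<String> terms = new ArrayList<>();
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      terms.add(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
    return terms;
  }

  public static void main(String[] args) throws IOException {
    String text = "mail alice@example.com today";
    // StandardTokenizer splits the address per the UAX#29 word break rules...
    System.out.println(tokenize(new StandardTokenizer(), text));
    // ...while UAX29URLEmailTokenizer keeps the whole address as one token.
    System.out.println(tokenize(new UAX29URLEmailTokenizer(), text));
  }
}
```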