Class SimplePatternSplitTokenizerFactory
- java.lang.Object
  - org.apache.lucene.analysis.util.AbstractAnalysisFactory
    - org.apache.lucene.analysis.util.TokenizerFactory
      - org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizerFactory
public class SimplePatternSplitTokenizerFactory extends TokenizerFactory
Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.
This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described at RegExp
- "maxDeterminizedStates" (optional, default 10000) is the limit on the total state count for the determinized automaton computed from the regexp
The pattern matches the characters that should split tokens, like String.split, and the matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
  </analyzer>
</fieldType>
- Since: 6.5.0
- See Also: SimplePatternSplitTokenizer
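The splitting rules described above (each separator match is consumed as one run, and empty tokens are dropped) can be illustrated with the JDK alone. The sketch below is not the Lucene implementation: it uses java.util.regex as a stand-in for Lucene's RegExp, whose syntax and matching semantics differ, and the class and method names (SplitSemanticsSketch, splitTokens) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; java.util.regex stands in for Lucene's RegExp.
class SplitSemanticsSketch {

    // Splits input on every match of separatorRegex and never emits an empty token.
    static List<String> splitTokens(String input, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(separatorRegex).matcher(input);
        int start = 0; // index where the current candidate token begins
        while (m.find()) {
            // Text between the previous separator and this one is a token,
            // unless it is empty (leading or adjacent separators).
            if (m.start() > start) {
                tokens.add(input.substring(start, m.start()));
            }
            start = m.end();
        }
        // Trailing text after the last separator, if any.
        if (start < input.length()) {
            tokens.add(input.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same separator pattern as the fieldType example: runs of simple whitespace.
        System.out.println(splitTokens("  hello \t\n world  ", "[ \t\r\n]+"));
        // prints [hello, world]
    }
}
```

Because the `+` quantifier consumes the whole whitespace run in one match, leading, trailing, and repeated separators produce no empty tokens.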
Field Summary
Fields
Modifier and Type | Field | Description
private Automaton | dfa |
private int | maxDeterminizedStates |
static java.lang.String | NAME | SPI name
static java.lang.String | PATTERN |
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
-
Constructor Summary
Constructors
Constructor | Description
SimplePatternSplitTokenizerFactory(java.util.Map<java.lang.String,java.lang.String> args) | Creates a new SimplePatternSplitTokenizerFactory
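The constructor takes its configuration as a string-to-string map. As a rough sketch of the documented contract ("pattern" required, "maxDeterminizedStates" optional with a default of 10000), the stand-alone class below mimics that argument handling with the JDK only. It is not the Lucene class: FactoryArgsSketch and its field names are hypothetical, and throwing IllegalArgumentException for a missing "pattern" is an assumption of this sketch.

```java
import java.util.Map;

// Hypothetical stand-in that mirrors the documented arguments; not the Lucene factory.
class FactoryArgsSketch {
    static final String PATTERN = "pattern";                     // required argument
    static final String MAX_STATES = "maxDeterminizedStates";    // optional argument
    static final int DEFAULT_MAX_STATES = 10000;                 // documented default

    final String pattern;
    final int maxDeterminizedStates;

    FactoryArgsSketch(Map<String, String> args) {
        String p = args.get(PATTERN);
        if (p == null) {
            // Assumption: a missing required argument is reported this way.
            throw new IllegalArgumentException("missing required parameter '" + PATTERN + "'");
        }
        this.pattern = p;
        String max = args.get(MAX_STATES);
        this.maxDeterminizedStates = (max == null) ? DEFAULT_MAX_STATES : Integer.parseInt(max);
    }

    public static void main(String[] args) {
        FactoryArgsSketch defaults = new FactoryArgsSketch(Map.of("pattern", "[ \t\r\n]+"));
        System.out.println(defaults.maxDeterminizedStates); // prints 10000

        FactoryArgsSketch custom = new FactoryArgsSketch(
                Map.of("pattern", "[ \t\r\n]+", "maxDeterminizedStates", "500"));
        System.out.println(custom.maxDeterminizedStates); // prints 500
    }
}
```

In a Solr schema these same keys appear as attributes on the tokenizer element, as in the fieldType example in the class description.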
-
Method Summary
Modifier and Type | Method | Description
SimplePatternSplitTokenizer | create(AttributeFactory factory) | Creates a TokenStream of the specified input using the given AttributeFactory
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
-
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Detail
-
NAME
public static final java.lang.String NAME
SPI name
- See Also:
- Constant Field Values
-
PATTERN
public static final java.lang.String PATTERN
- See Also:
- Constant Field Values
-
dfa
private final Automaton dfa
-
maxDeterminizedStates
private final int maxDeterminizedStates
-
-
Method Detail
-
create
public SimplePatternSplitTokenizer create(AttributeFactory factory)
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
- Specified by:
create in class TokenizerFactory