Class HMMChineseTokenizer
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.Tokenizer
        - org.apache.lucene.analysis.util.SegmentingTokenizerBase
          - org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer

All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public class HMMChineseTokenizer extends SegmentingTokenizerBase
Tokenizer for Chinese or mixed Chinese-English text.
The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
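The first of the two stages described above (breaking text into sentences) can be sketched with the JDK alone. This page only states that the class holds a java.text.BreakIterator for sentence breaking; the use of BreakIterator.getSentenceInstance(Locale.ROOT) below is an assumption, and SentenceSplitDemo and splitSentences are hypothetical names chosen for illustration. The second stage, word segmentation within each sentence, is not shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Lucene.
final class SentenceSplitDemo {

    // Returns the sentences of `text`, in order, as reported by the JDK
    // sentence BreakIterator. Assumption: a Locale.ROOT sentence instance.
    static List<String> splitSentences(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);

        List<String> result = new ArrayList<>();
        int start = sentences.first();
        // next() returns the following boundary, or DONE when none remain.
        for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
            result.add(text.substring(start, end));
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : splitSentences("Hello world. How are you?")) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Each (start, end) pair produced by the loop corresponds to the kind of range that setNextSentence(int sentenceStart, int sentenceEnd) receives, although the offsets in the real tokenizer refer to its internal buffer rather than to a String.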
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
private OffsetAttribute offsetAtt
private static java.text.BreakIterator sentenceProto: used for breaking the text into sentences
private CharTermAttribute termAtt
private java.util.Iterator<SegToken> tokens
private TypeAttribute typeAtt
private WordSegmenter wordSegmenter
-
Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
buffer, BUFFERMAX, offset
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors (Constructor, Description):
HMMChineseTokenizer(): Creates a new HMMChineseTokenizer
HMMChineseTokenizer(AttributeFactory factory): Creates a new HMMChineseTokenizer, supplying the AttributeFactory
-
Method Summary
Methods (Modifier and Type, Method, Description):
protected boolean incrementWord(): Returns true if another word is available
void reset(): This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
protected void setNextSentence(int sentenceStart, int sentenceEnd): Provides the next input sentence for analysis
-
Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
end, incrementToken, isSafeEnd
-
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
sentenceProto
private static final java.text.BreakIterator sentenceProto
used for breaking the text into sentences
-
termAtt
private final CharTermAttribute termAtt
-
offsetAtt
private final OffsetAttribute offsetAtt
-
typeAtt
private final TypeAttribute typeAtt
-
wordSegmenter
private final WordSegmenter wordSegmenter
-
tokens
private java.util.Iterator<SegToken> tokens
-
-
Constructor Detail
-
HMMChineseTokenizer
public HMMChineseTokenizer()
Creates a new HMMChineseTokenizer
-
HMMChineseTokenizer
public HMMChineseTokenizer(AttributeFactory factory)
Creates a new HMMChineseTokenizer, supplying the AttributeFactory
-
-
Method Detail
-
setNextSentence
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Description copied from class: SegmentingTokenizerBase
Provides the next input sentence for analysis
Specified by:
setNextSentence in class SegmentingTokenizerBase
-
incrementWord
protected boolean incrementWord()
Description copied from class: SegmentingTokenizerBase
Returns true if another word is available
Specified by:
incrementWord in class SegmentingTokenizerBase
-
reset
public void reset() throws java.io.IOException
Description copied from class: TokenStream
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
Overrides:
reset in class SegmentingTokenizerBase
Throws:
java.io.IOException
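The reset() contract can be illustrated with a small, self-contained model. The classes below (ToyStream, ToyWordStream, ResetContractDemo) are hypothetical stand-ins written for this sketch and are not Lucene classes; they only mimic the two rules stated above: a consumer calls reset() before incrementToken(), and an overriding reset() calls super.reset() so that the base class state is restored as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical consumer: reset, drain, close. The same stream can be consumed again.
final class ResetContractDemo {
    static List<String> consume(ToyWordStream stream) {
        List<String> out = new ArrayList<>();
        stream.reset();                    // must precede incrementToken()
        while (stream.incrementToken()) {
            out.add(stream.current);
        }
        stream.close();
        return out;
    }

    public static void main(String[] args) {
        ToyWordStream stream = new ToyWordStream("a", "b");
        System.out.println(consume(stream));
        System.out.println(consume(stream)); // reuse works because reset() restores all state
    }
}

// Hypothetical base class modelling the "reset before use" rule.
abstract class ToyStream {
    private boolean ready = false;

    void reset() {
        ready = true;
    }

    final boolean incrementToken() {
        if (!ready) {
            throw new IllegalStateException("reset() was not called");
        }
        return next();
    }

    void close() {
        ready = false;
    }

    protected abstract boolean next();
}

// Hypothetical stateful subclass: its reset() clears its own state and calls super.reset().
final class ToyWordStream extends ToyStream {
    private final List<String> words;
    private Iterator<String> remaining;
    String current;

    ToyWordStream(String... words) {
        this.words = Arrays.asList(words);
    }

    @Override
    void reset() {
        super.reset();                 // omit this and incrementToken() throws
        remaining = words.iterator();  // restart as if freshly created
        current = null;
    }

    @Override
    protected boolean next() {
        if (!remaining.hasNext()) {
            return false;
        }
        current = remaining.next();
        return true;
    }
}
```

In this model, calling incrementToken() on a stream that was never reset raises IllegalStateException, which mirrors the failure mode the description attributes to Tokenizer when super.reset() is skipped.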
-
-