Class JapaneseNumberFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.ja.JapaneseNumberFilter
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public class JapaneseNumberFilter extends TokenFilter
A TokenFilter that normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters.
Japanese numbers are often written using a combination of kanji and Arabic numbers with various kinds of punctuation. For example, 3.2千 means 3200. This filter performs this kind of normalization so that a search for 3200 matches 3.2千 in text. The normalized numbers can also be used for range facets and similar features.
Notice that this filter uses a token composition scheme and relies on punctuation tokens being found in the token stream. Please make sure your JapaneseTokenizer has discardPunctuation set to false. If punctuation characters such as ． (U+FF0E FULLWIDTH FULL STOP) are removed from the token stream, this filter would find input tokens 3 and 2千 and give outputs 3 and 2000 instead of 3200, which is likely not the intended result. If you want to remove punctuation characters from your index that are not part of normalized numbers, add a StopFilter with the punctuation you wish to remove after JapaneseNumberFilter in your analyzer chain.
Below are some examples of normalizations this filter supports. The input is untokenized text and the result is the single term attribute emitted for the input.
- 〇〇七 becomes 7
- 一〇〇〇 becomes 1000
- 三千2百2十三 becomes 3223
- 兆六百万五千一 becomes 1000006005001
- 3.2千 becomes 3200
- 1.2万345.67 becomes 12345.67
- 4,647.100 becomes 4647.1
- 15,7 becomes 157 (be aware of this weakness)
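The normalizations listed above can be reproduced with a small stand-alone sketch. It is not the Lucene implementation: the class name KanjiNumberSketch and its methods are hypothetical, it only knows the large units 万, 億, 兆 and 京, and it is more lenient than the real filter about mixing digit styles. The kanji constants are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.math.BigDecimal;

/** Stand-alone sketch of kanji number composition; not the Lucene implementation. */
final class KanjiNumberSketch {
    // Kanji digits zero to nine, in order (U+3007, U+4E00, U+4E8C, ... U+4E5D).
    private static final String DIGITS =
        "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";
    // Medium units: ten (U+5341), hundred (U+767E), thousand (U+5343).
    private static final String MEDIUM = "\u5341\u767e\u5343";
    // Large units: 10^4 (U+4E07), 10^8 (U+5104), 10^12 (U+5146), 10^16 (U+4EAC).
    private static final String LARGE = "\u4e07\u5104\u5146\u4eac";

    private final String s;
    private int pos;

    private KanjiNumberSketch(String s) { this.s = s; }

    /** Returns the normalized number, or the input unchanged if it does not parse fully. */
    static String normalize(String input) {
        KanjiNumberSketch p = new KanjiNumberSketch(input);
        BigDecimal value = p.sum(true);
        if (value == null || p.pos < input.length()) return input;
        return value.stripTrailingZeros().toPlainString();
    }

    // Adds up consecutive pairs: large pairs for a whole number, medium pairs inside one.
    private BigDecimal sum(boolean large) {
        BigDecimal total = null;
        while (pos < s.length()) {
            BigDecimal pair = pair(large);
            if (pair == null) break;
            total = (total == null) ? pair : total.add(pair);
        }
        return total;
    }

    // A pair is an optional factor followed by an optional unit (factor times unit).
    private BigDecimal pair(boolean large) {
        BigDecimal factor = large ? sum(false) : basic();
        int unit = pos < s.length() ? (large ? LARGE : MEDIUM).indexOf(s.charAt(pos)) : -1;
        if (unit < 0) return factor; // no unit: the factor stands alone (may be null)
        pos++;
        int exponent = large ? 4 * (unit + 1) : unit + 1;
        return (factor == null ? BigDecimal.ONE : factor).scaleByPowerOfTen(exponent);
    }

    // A basic number: digits (half-width, full-width or kanji), ',' separators, one decimal point.
    private BigDecimal basic() {
        int start = pos;
        StringBuilder digits = new StringBuilder();
        boolean hasDigit = false, hasPoint = false;
        for (; pos < s.length(); pos++) {
            char c = s.charAt(pos);
            if (c >= '\uFF01' && c <= '\uFF5E') c -= 0xFEE0; // fold full-width ASCII to half-width
            if (c >= '0' && c <= '9') { digits.append(c); hasDigit = true; }
            else if (DIGITS.indexOf(c) >= 0) { digits.append(DIGITS.indexOf(c)); hasDigit = true; }
            else if (c == '.' && !hasPoint) { digits.append(c); hasPoint = true; }
            else if (c != ',') break;
        }
        if (!hasDigit) { pos = start; return null; }
        return new BigDecimal(digits.toString());
    }

    public static void main(String[] args) {
        System.out.println(normalize("3.2\u5343"));       // 3.2 thousand, prints 3200
        System.out.println(normalize("1.2\u4e07345.67")); // prints 12345.67
        System.out.println(normalize("15,7"));            // prints 157
    }
}
```

Normalizing 2千 on its own with this sketch yields 2000, which illustrates why losing the punctuation token between 3 and 2千 produces 3 and 2000 rather than 3200.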
Tokens preceded by a token with a PositionIncrementAttribute of zero are left untouched and emitted as-is.
This filter does not use any part-of-speech information for its normalization. The motivation for this is to also support n-grammed token streams in the future.
This filter may in some cases normalize tokens that are not numbers in their context. For example, 田中京一 is a name and means Tanaka Kyōichi, but 京一 (Kyōichi) out of context can, strictly speaking, also represent the number 10000000000000001. This filter respects the KeywordAttribute, which can be used to prevent specific normalizations from happening.
Also notice that token attributes such as PartOfSpeechAttribute, ReadingAttribute, InflectionAttribute and BaseFormAttribute are left unchanged. They inherit the values of the last token used to compose the normalized number and can be wrong. Hence, for 10万 (100000), ReadingAttribute will be set to マン. This is a known issue and is subject to a future improvement.
Japanese formal numbers (daiji), accounting numbers and decimal fractions are currently not supported.
-
-
Nested Class Summary
Nested Classes (modifier and type, class: description)
- static class JapaneseNumberFilter.NumberBuffer: Buffer that holds a Japanese number string and a position index used as a parsed-to marker
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields (modifier and type, field)
- private boolean exhausted
- private static char[] exponents
- private int fallThroughTokens
- private KeywordAttribute keywordAttr
- private static char NO_NUMERAL
- private java.lang.StringBuilder numeral
- private static char[] numerals
- private OffsetAttribute offsetAttr
- private PositionIncrementAttribute posIncrAttr
- private PositionLengthAttribute posLengthAttr
- private AttributeSource.State state
- private CharTermAttribute termAttr
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
- JapaneseNumberFilter(TokenStream input)
-
Method Summary
Methods (modifier and type, method: description)
- private int arabicNumeralValue(char c): Returns the numeric value for the specified Arabic numeral character.
- boolean incrementToken(): Consumers (e.g., IndexWriter) use this method to advance the stream to the next token.
- boolean isArabicNumeral(char c): Arabic numeral predicate.
- private boolean isDecimalPoint(char c): Decimal point predicate.
- private boolean isFullWidthArabicNumeral(char c): Arabic full-width numeral predicate.
- private boolean isHalfWidthArabicNumeral(char c): Arabic half-width numeral predicate.
- private boolean isKanjiNumeral(char c): Kanji numeral predicate that tests if the provided character is one of 〇, 一, 二, 三, 四, 五, 六, 七, 八, or 九.
- boolean isNumeral(char c): Numeral predicate.
- boolean isNumeral(java.lang.String input): Numeral predicate.
- boolean isNumeralPunctuation(char c): Numeral punctuation predicate.
- boolean isNumeralPunctuation(java.lang.String input): Numeral punctuation predicate.
- private boolean isThousandSeparator(char c): Thousand separator predicate.
- private int kanjiNumeralValue(char c): Returns the value for the provided kanji numeral.
- java.lang.String normalizeNumber(java.lang.String number): Normalizes a Japanese number.
- private java.math.BigDecimal parseBasicNumber(JapaneseNumberFilter.NumberBuffer buffer): Parses a basic number, which is a sequence of Arabic numbers or a sequence of 0-9 kanji numerals (〇 to 九).
- java.math.BigDecimal parseLargeKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer): Parses large kanji numerals (ten thousands or larger).
- private java.math.BigDecimal parseLargePair(JapaneseNumberFilter.NumberBuffer buffer): Parses a pair of large numbers, i.e. the large kanji factor is 10,000 (万) or larger.
- java.math.BigDecimal parseMediumKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer): Parses medium kanji numerals (tens, hundreds or thousands).
- private java.math.BigDecimal parseMediumNumber(JapaneseNumberFilter.NumberBuffer buffer): Parses a "medium sized" number, typically less than 10,000 (万), but might be larger due to a larger factor from parseBasicNumber(JapaneseNumberFilter.NumberBuffer).
- private java.math.BigDecimal parseMediumPair(JapaneseNumberFilter.NumberBuffer buffer): Parses a pair of "medium sized" numbers, i.e. the large kanji factor is at most 1,000 (千).
- private java.math.BigDecimal parseNumber(JapaneseNumberFilter.NumberBuffer buffer): Parses a Japanese number.
- void reset(): This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
termAttr
private final CharTermAttribute termAttr
-
offsetAttr
private final OffsetAttribute offsetAttr
-
keywordAttr
private final KeywordAttribute keywordAttr
-
posIncrAttr
private final PositionIncrementAttribute posIncrAttr
-
posLengthAttr
private final PositionLengthAttribute posLengthAttr
-
NO_NUMERAL
private static char NO_NUMERAL
-
numerals
private static char[] numerals
-
exponents
private static char[] exponents
-
state
private AttributeSource.State state
-
numeral
private java.lang.StringBuilder numeral
-
fallThroughTokens
private int fallThroughTokens
-
exhausted
private boolean exhausted
-
-
Constructor Detail
-
JapaneseNumberFilter
public JapaneseNumberFilter(TokenStream input)
-
-
Method Detail
-
incrementToken
public final boolean incrementToken() throws java.io.IOException
Description copied from class: TokenStream
Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.
The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.
This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.
To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().
- Specified by:
incrementToken in class TokenStream
- Returns:
false for end of stream; true otherwise
- Throws:
java.io.IOException
-
reset
public void reset() throws java.io.IOException
Description copied from class: TokenFilter
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
NOTE: The default implementation chains the call to the input TokenStream, so be sure to call super.reset() when overriding this method.
- Overrides:
reset in class TokenFilter
- Throws:
java.io.IOException
-
normalizeNumber
public java.lang.String normalizeNumber(java.lang.String number)
Normalizes a Japanese number.
- Parameters:
number - number to normalize
- Returns:
normalized number, or the number to normalize unchanged on error (no op)
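The "no op on error" contract and the trailing-zero behaviour seen in the example 4,647.100 becoming 4647.1 can be sketched as follows. This is an assumption-laden illustration: NormalizeFallbackSketch and parseOrNull are hypothetical names, and parseOrNull is a deliberately tiny stand-in that only understands plain decimals with comma separators, not kanji.

```java
import java.math.BigDecimal;

/** Sketch of a normalize method that falls back to its input on error. */
final class NormalizeFallbackSketch {
    // Hypothetical stand-in for a real parser: plain decimals with ',' separators only.
    static BigDecimal parseOrNull(String number) {
        try {
            return new BigDecimal(number.replace(",", ""));
        } catch (NumberFormatException e) {
            return null; // malformed input
        }
    }

    static String normalize(String number) {
        BigDecimal parsed = parseOrNull(number);
        if (parsed == null) {
            return number; // no op: hand back the input unchanged
        }
        // Drop trailing zeros, and use plain notation so 1000 is not printed as 1E+3.
        return parsed.stripTrailingZeros().toPlainString();
    }
}
```

Using toPlainString() rather than toString() matters here, because stripTrailingZeros() on a value such as 1000 leaves a negative scale that toString() would render in scientific notation.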
-
parseNumber
private java.math.BigDecimal parseNumber(JapaneseNumberFilter.NumberBuffer buffer)
Parses a Japanese number.
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseLargePair
private java.math.BigDecimal parseLargePair(JapaneseNumberFilter.NumberBuffer buffer)
Parses a pair of large numbers, i.e. the large kanji factor is 10,000 (万) or larger.
- Parameters:
buffer - buffer to parse
- Returns:
parsed pair, or null on error or end of input
-
parseMediumNumber
private java.math.BigDecimal parseMediumNumber(JapaneseNumberFilter.NumberBuffer buffer)
Parses a "medium sized" number, typically less than 10,000 (万), but might be larger due to a larger factor from parseBasicNumber(JapaneseNumberFilter.NumberBuffer).
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseMediumPair
private java.math.BigDecimal parseMediumPair(JapaneseNumberFilter.NumberBuffer buffer)
Parses a pair of "medium sized" numbers, i.e. the large kanji factor is at most 1,000 (千).
- Parameters:
buffer - buffer to parse
- Returns:
parsed pair, or null on error or end of input
-
parseBasicNumber
private java.math.BigDecimal parseBasicNumber(JapaneseNumberFilter.NumberBuffer buffer)
Parses a basic number, which is a sequence of Arabic numbers or a sequence of 0-9 kanji numerals (〇 to 九).
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseLargeKanjiNumeral
public java.math.BigDecimal parseLargeKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer)
Parses large kanji numerals (ten thousands or larger).
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseMediumKanjiNumeral
public java.math.BigDecimal parseMediumKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer)
Parses medium kanji numerals (tens, hundreds or thousands).
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error
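To make the split between medium and large kanji numerals concrete, the sketch below maps each unit character to its power of ten. It is only an illustration under stated assumptions: KanjiUnitSketch, mediumUnit and largeUnit are hypothetical names, the set of large units is limited to 万, 億, 兆 and 京, and the real methods read from a NumberBuffer rather than taking a single character.

```java
import java.math.BigDecimal;

/** Sketch mapping kanji unit characters to powers of ten. */
final class KanjiUnitSketch {
    // Medium units in order: ten (U+5341), hundred (U+767E), thousand (U+5343).
    private static final String MEDIUM_UNITS = "\u5341\u767e\u5343";
    // Large units in order: 10^4 (U+4E07), 10^8 (U+5104), 10^12 (U+5146), 10^16 (U+4EAC).
    private static final String LARGE_UNITS = "\u4e07\u5104\u5146\u4eac";

    /** Returns 10, 100 or 1000 for a medium unit, or null for any other character. */
    static BigDecimal mediumUnit(char c) {
        int i = MEDIUM_UNITS.indexOf(c);
        return i < 0 ? null : BigDecimal.TEN.pow(i + 1);
    }

    /** Returns 10^4, 10^8, 10^12 or 10^16 for a large unit, or null otherwise. */
    static BigDecimal largeUnit(char c) {
        int i = LARGE_UNITS.indexOf(c);
        return i < 0 ? null : BigDecimal.TEN.pow(4 * (i + 1));
    }
}
```

Medium units step by one power of ten while large units step by four, which is why a medium number acts as the factor in front of a large unit.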
-
isNumeral
public boolean isNumeral(java.lang.String input)
Numeral predicate.
- Parameters:
input - string to test
- Returns:
true if and only if input is a numeral
-
isNumeral
public boolean isNumeral(char c)
Numeral predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a numeral
-
isNumeralPunctuation
public boolean isNumeralPunctuation(java.lang.String input)
Numeral punctuation predicate.
- Parameters:
input - string to test
- Returns:
true if and only if input is a numeral punctuation string
-
isNumeralPunctuation
public boolean isNumeralPunctuation(char c)
Numeral punctuation predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a numeral punctuation character
-
isArabicNumeral
public boolean isArabicNumeral(char c)
Arabic numeral predicate. Both half-width and full-width characters are supported.
- Parameters:
c - character to test
- Returns:
true if and only if c is an Arabic numeral
-
isHalfWidthArabicNumeral
private boolean isHalfWidthArabicNumeral(char c)
Arabic half-width numeral predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a half-width Arabic numeral
-
isFullWidthArabicNumeral
private boolean isFullWidthArabicNumeral(char c)
Arabic full-width numeral predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a full-width Arabic numeral
-
arabicNumeralValue
private int arabicNumeralValue(char c)
Returns the numeric value for the specified Arabic numeral character. Behavior is undefined if a non-Arabic numeral is provided.
- Parameters:
c - Arabic numeral character
- Returns:
numeral value
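The half-width and full-width Arabic numeral predicates and the value lookup might look like the following minimal sketch. ArabicNumeralSketch and its method names are hypothetical; the only facts relied on are that half-width digits occupy U+0030 to U+0039 and full-width digits occupy U+FF10 to U+FF19.

```java
/** Sketch of Arabic numeral predicates covering half-width and full-width digits. */
final class ArabicNumeralSketch {
    static boolean isHalfWidth(char c) {
        return c >= '0' && c <= '9';
    }

    static boolean isFullWidth(char c) {
        return c >= '\uFF10' && c <= '\uFF19'; // FULLWIDTH DIGIT ZERO to NINE
    }

    static boolean isArabicNumeral(char c) {
        return isHalfWidth(c) || isFullWidth(c);
    }

    /** Numeric value of an Arabic numeral; the result is meaningless for other characters. */
    static int value(char c) {
        int zero = isHalfWidth(c) ? '0' : '\uFF10';
        return c - zero;
    }
}
```

As in the documented method, the value lookup does not validate its argument, so callers are expected to check the predicate first.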
-
isKanjiNumeral
private boolean isKanjiNumeral(char c)
Kanji numeral predicate that tests if the provided character is one of 〇, 一, 二, 三, 四, 五, 六, 七, 八, or 九. Kanji for larger numbers give a false value.
- Parameters:
c - character to test
- Returns:
true if and only if the character is one of 〇, 一, 二, 三, 四, 五, 六, 七, 八, or 九 (0 to 9)
-
kanjiNumeralValue
private int kanjiNumeralValue(char c)
Returns the value for the provided kanji numeral. Only numeric values for the characters where isKanjiNumeral(char) returns true are supported; behavior is undefined for other characters.
- Parameters:
c - kanji numeral character
- Returns:
numeral value
- See Also:
isKanjiNumeral(char)
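One simple way to realize the kanji digit predicate and value lookup is an ordered string whose index is the digit value. This is a sketch, not the library's approach (the field summary suggests a char[] table instead); KanjiDigitSketch is a hypothetical name, and where the documented method leaves behavior undefined this sketch chooses to throw.

```java
/** Sketch of kanji digit lookup using the position in an ordered string. */
final class KanjiDigitSketch {
    // Kanji digits zero to nine, in order:
    // U+3007, U+4E00, U+4E8C, U+4E09, U+56DB, U+4E94, U+516D, U+4E03, U+516B, U+4E5D.
    private static final String KANJI_DIGITS =
        "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";

    static boolean isKanjiNumeral(char c) {
        return KANJI_DIGITS.indexOf(c) >= 0;
    }

    static int kanjiNumeralValue(char c) {
        int value = KANJI_DIGITS.indexOf(c);
        if (value < 0) {
            // Design choice of this sketch; the documented method leaves this case undefined.
            throw new IllegalArgumentException("not a kanji digit: " + c);
        }
        return value;
    }
}
```

Unit characters such as 千 are intentionally absent from the table, matching the note that kanji for larger numbers give a false value.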
-
isDecimalPoint
private boolean isDecimalPoint(char c)
Decimal point predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a decimal point
-
isThousandSeparator
private boolean isThousandSeparator(char c)
Thousand separator predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a thousand separator
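The punctuation predicates can be sketched as below. Which characters the real filter accepts is an assumption here: the sketch treats the half-width and full-width full stop as decimal points and the half-width and full-width comma as thousand separators, and NumeralPunctuationSketch is a hypothetical name.

```java
/** Sketch of decimal point, thousand separator and numeral punctuation predicates. */
final class NumeralPunctuationSketch {
    // Assumed set: '.' and FULLWIDTH FULL STOP (U+FF0E).
    static boolean isDecimalPoint(char c) {
        return c == '.' || c == '\uFF0E';
    }

    // Assumed set: ',' and FULLWIDTH COMMA (U+FF0C).
    static boolean isThousandSeparator(char c) {
        return c == ',' || c == '\uFF0C';
    }

    static boolean isNumeralPunctuation(char c) {
        return isDecimalPoint(c) || isThousandSeparator(c);
    }

    /** True when every character is numeral punctuation (vacuously true for an empty string). */
    static boolean isNumeralPunctuation(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (!isNumeralPunctuation(input.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

A string predicate of this shape lets a filter decide whether a whole token, such as a lone full-width full stop, is punctuation that may join two numeric tokens.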
-
-