Class AbstractDictionary
- java.lang.Object
-
- org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
-
- Direct Known Subclasses:
BigramDictionary, WordDictionary
abstract class AbstractDictionary extends java.lang.Object
SmartChineseAnalyzer abstract dictionary implementation.
Contains methods for dealing with GB2312 encoding.
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static int CHAR_NUM_IN_FILE: Dictionary data contains 6768 Chinese characters with frequency statistics.
- static int GB2312_CHAR_NUM: Last Chinese character in GB2312 (87 * 94).
- static int GB2312_FIRST_CHAR: First Chinese character in GB2312 (15 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 0-14 are unassigned or punctuation.
-
Constructor Summary
Constructors (Constructor, Description):
- AbstractDictionary()
-
Method Summary
Methods (Modifier and Type, Method, Description):
- java.lang.String getCCByGB2312Id(int ccid): Transcode from GB2312 ID to Unicode.
- short getGB2312Id(char ch): Transcode from Unicode to GB2312.
- long hash1(char c): 32-bit FNV hash function.
- long hash1(char[] carray): 32-bit FNV hash function.
- int hash2(char c): djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c.
- int hash2(char[] carray): djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c.
-
Field Detail
-
GB2312_FIRST_CHAR
public static final int GB2312_FIRST_CHAR
First Chinese character in GB2312 (15 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 0-14 are unassigned or punctuation.
- See Also:
- Constant Field Values
-
GB2312_CHAR_NUM
public static final int GB2312_CHAR_NUM
Last Chinese character in GB2312 (87 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 88-94 are unassigned.
- See Also:
- Constant Field Values
-
CHAR_NUM_IN_FILE
public static final int CHAR_NUM_IN_FILE
Dictionary data contains 6768 Chinese characters with frequency statistics.
- See Also:
- Constant Field Values
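The three constants above are linked by simple grid arithmetic. The following is a minimal sketch, not the class's source: the class name `Gb2312Grid` and the helper `cellId` are hypothetical, and it assumes an ID is the zero-based offset `row * 94 + column`, as the `15 * 94` and `87 * 94` expressions suggest.

```java
public class Gb2312Grid {
    // Number of columns in each GB2312 row.
    static final int COLUMNS = 94;

    // Hypothetical helper: zero-based (row, column) to a linear cell ID.
    public static int cellId(int row, int column) {
        return row * COLUMNS + column;
    }

    public static void main(String[] args) {
        int first = cellId(15, 0); // start of the Chinese character block
        int end = cellId(87, 0);   // end of the Chinese character block
        System.out.println(first);       // prints 1410
        System.out.println(end);         // prints 8178
        // 72 rows of 94 cells: matches the 6768 entries in the dictionary data.
        System.out.println(end - first); // prints 6768
    }
}
```

Under this assumption, the 6768 cells between the two bounds are exactly the 72 rows that hold Chinese characters, which is why the data file size and the grid bounds agree.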
-
Method Detail
-
getCCByGB2312Id
public java.lang.String getCCByGB2312Id(int ccid)
Transcode from GB2312 ID to Unicode
GB2312 is divided into a 94 * 94 grid, containing 7445 characters consisting of 6763 Chinese characters and 682 symbols. Some regions are unassigned (reserved).
- Parameters:
ccid - GB2312 id
- Returns:
- unicode String
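One way such a transcoding could work is to turn the ID back into a two-byte GB2312 sequence and decode it. This is a sketch under assumptions, not the method's actual implementation: the class `Gb2312IdToUnicode` and method `fromId` are hypothetical, the ID is assumed to be `row * 94 + column` (zero-based), each byte is assumed to be the index plus 0xA1 (the EUC-CN byte layout), and the runtime is assumed to provide the `GB2312` charset.

```java
import java.nio.charset.Charset;

public class Gb2312IdToUnicode {
    // Assumption: the JDK in use ships the GB2312 charset.
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Hypothetical helper: linear grid ID to a one-character Unicode string.
    public static String fromId(int id) {
        if (id < 0 || id >= 94 * 94) {
            return ""; // outside the grid; the empty string is this sketch's own convention
        }
        byte[] bytes = {
            (byte) (id / 94 + 0xA1), // row index to lead byte
            (byte) (id % 94 + 0xA1)  // column index to trail byte
        };
        return new String(bytes, GB2312);
    }

    public static void main(String[] args) {
        String first = fromId(15 * 94); // ID of the first Chinese character
        System.out.println(Integer.toHexString(first.charAt(0))); // prints 554a
    }
}
```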
-
getGB2312Id
public short getGB2312Id(char ch)
Transcode from Unicode to GB2312
- Parameters:
ch - input character in Unicode, or character in Basic Latin range.
- Returns:
- position in GB2312
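The reverse direction can be sketched by encoding the character to GB2312 bytes and folding them into a grid position. The names `Gb2312UnicodeToId` and `toId` are hypothetical, the `row * 94 + column` layout and the 0xA1 byte offset are assumptions, and returning -1 for input that does not encode to two bytes is this sketch's own convention.

```java
import java.nio.charset.Charset;

public class Gb2312UnicodeToId {
    // Assumption: the JDK in use ships the GB2312 charset.
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Hypothetical helper: Unicode character to linear grid ID.
    public static int toId(char ch) {
        byte[] bytes = String.valueOf(ch).getBytes(GB2312);
        if (bytes.length != 2) {
            return -1; // single-byte (for example Basic Latin) or unmappable input
        }
        int row = (bytes[0] & 0xFF) - 0xA1;    // lead byte to zero-based row
        int column = (bytes[1] & 0xFF) - 0xA1; // trail byte to zero-based column
        return row * 94 + column;
    }

    public static void main(String[] args) {
        System.out.println(toId('\u554A')); // prints 1410
        System.out.println(toId('A'));      // prints -1
    }
}
```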
-
hash1
public long hash1(char c)
32-bit FNV hash function
- Parameters:
c - input character
- Returns:
- hashcode
-
hash1
public long hash1(char[] carray)
32-bit FNV hash function
- Parameters:
carray - character array
- Returns:
- hashcode
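For readers unfamiliar with FNV, here is a sketch of the textbook 32-bit FNV-1a variant applied to a character array. It is not taken from this class: the name `Fnv1a32Sketch` is hypothetical, and the choice of FNV-1a over FNV-1, the low-byte-then-high-byte order for each char, and the `int` return type are all assumptions (the documented methods return `long`, so the real constants or post-processing may differ).

```java
public class Fnv1a32Sketch {
    static final int OFFSET_BASIS = 0x811C9DC5; // standard 32-bit FNV offset basis
    static final int PRIME = 0x01000193;        // standard 32-bit FNV prime

    // Textbook FNV-1a: XOR in one byte, then multiply by the prime.
    public static int hash(char[] chars) {
        int h = OFFSET_BASIS;
        for (char c : chars) {
            h = (h ^ (c & 0xFF)) * PRIME; // low byte of the char
            h = (h ^ (c >>> 8)) * PRIME;  // high byte of the char
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(hash("ab".toCharArray())));
    }
}
```

Java `int` multiplication wraps modulo 2^32, which is the reduction the 32-bit variant requires, so no explicit masking is needed.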
-
hash2
public int hash2(char c)
djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
- Parameters:
c - character
- Returns:
- hashcode
-
hash2
public int hash2(char[] carray)
djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
- Parameters:
carray - character array
- Returns:
- hashcode
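The two djb2 variants described above can be sketched as follows. The class and method names are hypothetical, the seed 5381 is the conventional djb2 starting value rather than something stated on this page, and feeding each whole `char` into one step is an assumption; the class itself may split characters into bytes.

```java
public class Djb2Sketch {
    // Additive variant: hash(i) = hash(i - 1) * 33 + str[i]
    public static int djb2Add(char[] chars) {
        int hash = 5381; // conventional djb2 seed (assumption)
        for (char c : chars) {
            hash = ((hash << 5) + hash) + c; // (hash << 5) + hash == hash * 33
        }
        return hash;
    }

    // XOR variant: hash(i) = hash(i - 1) * 33 ^ str[i]
    public static int djb2Xor(char[] chars) {
        int hash = 5381;
        for (char c : chars) {
            hash = (hash * 33) ^ c; // parenthesized to make the precedence explicit
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(djb2Add(new char[] {'a'})); // prints 177670
        System.out.println(djb2Xor(new char[] {'a'})); // prints 177604
    }
}
```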