Class Dictionary
- java.lang.Object
-
- org.apache.lucene.analysis.hunspell.Dictionary
-
public class Dictionary extends java.lang.Object
In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
-
-
Nested Class Summary
Nested Classes
Modifier and Type / Class / Description
private static class Dictionary.DoubleASCIIFlagParsingStrategy
  Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded as two ASCII characters whose codes must be combined into a single character.
(package private) static class Dictionary.FlagParsingStrategy
  Abstraction of the process of parsing flags taken from the affix and dic files.
private static class Dictionary.NumFlagParsingStrategy
  Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded in its numerical form.
private static class Dictionary.SimpleFlagParsingStrategy
  Simple implementation of Dictionary.FlagParsingStrategy that treats the chars in each String as individual flags.
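The three strategies differ only in how a raw flag string from the .aff or .dic file becomes a char[]. The following is a minimal, self-contained sketch of that idea, not the Lucene classes themselves. The class and method names are hypothetical, the comma separator for numeric flags follows the Hunspell FLAG num convention, and the bit layout of the two-character case (first code shifted left by 8 bits, then OR-ed with the second) is an assumption that this page does not state.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the three flag parsing strategies. */
final class FlagParsingSketch {

  /** Simple strategy: every char of the raw string is one flag. */
  static char[] parseSimple(String rawFlags) {
    return rawFlags.toCharArray();
  }

  /** Numeric strategy: flags are comma-separated decimal numbers. */
  static char[] parseNumeric(String rawFlags) {
    String[] parts = rawFlags.trim().split(",");
    char[] flags = new char[parts.length];
    for (int i = 0; i < parts.length; i++) {
      flags[i] = (char) Integer.parseInt(parts[i].trim());
    }
    return flags;
  }

  /** Double-ASCII strategy: each pair of ASCII chars is combined into one char. */
  static char[] parseDoubleAscii(String rawFlags) {
    if (rawFlags.length() % 2 != 0) {
      throw new IllegalArgumentException("Expected an even number of characters: " + rawFlags);
    }
    char[] flags = new char[rawFlags.length() / 2];
    for (int i = 0; i < flags.length; i++) {
      char first = rawFlags.charAt(2 * i);
      char second = rawFlags.charAt(2 * i + 1);
      // Assumed layout: high byte from the first char, low byte from the second.
      flags[i] = (char) ((first << 8) | second);
    }
    return flags;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(parseSimple("AB")));  // prints [A, B]
    System.out.println((int) parseNumeric("101,2000")[1]);   // prints 2000
    System.out.println((int) parseDoubleAscii("Aa")[0]);     // prints 16737
  }
}
```

Whatever the input format, each flag ends up as a single char, so later flag comparisons do not depend on how the affix file encoded its flags.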
-
Field Summary
Fields
Modifier and Type / Field / Description
(package private) byte[] affixData
private static java.lang.String ALIAS_KEY
private int aliasCount
private java.lang.String[] aliases
(package private) boolean alternateCasing
(package private) static java.util.Map<java.lang.String,java.lang.String> CHARSET_ALIASES
(package private) int circumfix
private static java.lang.String CIRCUMFIX_KEY
(package private) boolean complexPrefixes
private static java.lang.String COMPLEXPREFIXES_KEY
private int currentAffix
private static java.nio.file.Path DEFAULT_TEMP_DIR
(package private) static java.util.regex.Pattern ENCODING_PATTERN
  pattern accepts optional BOM + SET + any whitespace
private static java.lang.String FLAG_KEY
(package private) char FLAG_SEPARATOR
(package private) BytesRefHash flagLookup
private Dictionary.FlagParsingStrategy flagParsingStrategy
(package private) boolean fullStrip
private static java.lang.String FULLSTRIP_KEY
(package private) boolean hasStemExceptions
(package private) FST<CharsRef> iconv
private static java.lang.String ICONV_KEY
private char[] ignore
private static java.lang.String IGNORE_KEY
(package private) boolean ignoreCase
(package private) int keepcase
private static java.lang.String KEEPCASE_KEY
private static java.lang.String LANG_KEY
(package private) java.lang.String language
private static java.lang.String LONG_FLAG_TYPE
private static java.lang.String MORPH_ALIAS_KEY
(package private) char MORPH_SEPARATOR
private int morphAliasCount
private java.lang.String[] morphAliases
(package private) int needaffix
private static java.lang.String NEEDAFFIX_KEY
(package private) boolean needsInputCleaning
(package private) boolean needsOutputCleaning
(package private) static char[] NOFLAGS
private static java.lang.String NUM_FLAG_TYPE
(package private) FST<CharsRef> oconv
private static java.lang.String OCONV_KEY
(package private) int onlyincompound
private static java.lang.String ONLYINCOMPOUND_KEY
(package private) java.util.ArrayList<CharacterRunAutomaton> patterns
private static java.lang.String PREFIX_CONDITION_REGEX_PATTERN
private static java.lang.String PREFIX_KEY
(package private) FST<IntsRef> prefixes
private static java.lang.String PSEUDOROOT_KEY
private int stemExceptionCount
private java.lang.String[] stemExceptions
(package private) char[] stripData
(package private) int[] stripOffsets
private static java.lang.String SUFFIX_CONDITION_REGEX_PATTERN
private static java.lang.String SUFFIX_KEY
(package private) FST<IntsRef> suffixes
private java.nio.file.Path tempPath
(package private) boolean twoStageAffix
private static java.lang.String UTF8_FLAG_TYPE
(package private) FST<IntsRef> words
-
Constructor Summary
Constructors
Constructor / Description
Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.io.InputStream dictionary)
  Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase)
  Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
-
Method Summary
Methods
Modifier and Type / Method / Description
private FST<IntsRef> affixFST(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes)
(package private) static void applyMappings(FST<CharsRef> fst, java.lang.StringBuilder sb)
(package private) char caseFold(char c)
  folds single character (according to LANG if present)
(package private) java.lang.CharSequence cleanInput(java.lang.CharSequence input, java.lang.StringBuilder reuse)
(package private) static char[] decodeFlags(BytesRef b)
(package private) static void encodeFlags(BytesRefBuilder b, char[] flags)
(package private) static java.lang.String escapeDash(java.lang.String re)
private java.lang.String getAliasValue(int id)
(package private) static java.nio.file.Path getDefaultTempDir()
  Returns the default temporary directory.
(package private) static java.lang.String getDictionaryEncoding(java.io.InputStream affix)
  Parses the encoding specified in the affix file readable through the provided InputStream
(package private) static Dictionary.FlagParsingStrategy getFlagParsingStrategy(java.lang.String flagLine)
  Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file
boolean getIgnoreCase()
  Returns true if this dictionary was constructed with the ignoreCase option
private java.nio.charset.CharsetDecoder getJavaEncoding(java.lang.String encoding)
  Retrieves the CharsetDecoder for the given encoding.
(package private) java.lang.String getStemException(int id)
(package private) static boolean hasFlag(char[] flags, char flag)
(package private) static int indexOfSpaceOrTab(java.lang.String text, int start)
(package private) IntsRef lookup(FST<IntsRef> fst, char[] word, int offset, int length)
(package private) IntsRef lookupPrefix(char[] word, int offset, int length)
(package private) IntsRef lookupSuffix(char[] word, int offset, int length)
(package private) IntsRef lookupWord(char[] word, int offset, int length)
  Looks up Hunspell word forms from the dictionary
(package private) static int morphBoundary(java.lang.String line)
private void parseAffix(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes, java.lang.String header, java.io.LineNumberReader reader, java.lang.String conditionPattern, java.util.Map<java.lang.String,java.lang.Integer> seenPatterns, java.util.Map<java.lang.String,java.lang.Integer> seenStrips)
  Parses a specific affix rule putting the result into the provided affix map
private void parseAlias(java.lang.String line)
private FST<CharsRef> parseConversions(java.io.LineNumberReader reader, int num)
private void parseMorphAlias(java.lang.String line)
private java.lang.String parseStemException(java.lang.String morphData)
private void readAffixFile(java.io.InputStream affixStream, java.nio.charset.CharsetDecoder decoder)
  Reads the affix file through the provided InputStream, building up the prefix and suffix maps
private void readDictionaryFiles(Directory tempDir, java.lang.String tempFileNamePrefix, java.util.List<java.io.InputStream> dictionaries, java.nio.charset.CharsetDecoder decoder, Builder<IntsRef> words)
  Reads the dictionary file through the provided InputStreams, building up the words map
static void setDefaultTempDir(java.nio.file.Path tempDir)
  Used by test framework
(package private) java.lang.String unescapeEntry(java.lang.String entry)
-
-
-
Field Detail
-
NOFLAGS
static final char[] NOFLAGS
-
ALIAS_KEY
private static final java.lang.String ALIAS_KEY
- See Also:
- Constant Field Values
-
MORPH_ALIAS_KEY
private static final java.lang.String MORPH_ALIAS_KEY
- See Also:
- Constant Field Values
-
PREFIX_KEY
private static final java.lang.String PREFIX_KEY
- See Also:
- Constant Field Values
-
SUFFIX_KEY
private static final java.lang.String SUFFIX_KEY
- See Also:
- Constant Field Values
-
FLAG_KEY
private static final java.lang.String FLAG_KEY
- See Also:
- Constant Field Values
-
COMPLEXPREFIXES_KEY
private static final java.lang.String COMPLEXPREFIXES_KEY
- See Also:
- Constant Field Values
-
CIRCUMFIX_KEY
private static final java.lang.String CIRCUMFIX_KEY
- See Also:
- Constant Field Values
-
IGNORE_KEY
private static final java.lang.String IGNORE_KEY
- See Also:
- Constant Field Values
-
ICONV_KEY
private static final java.lang.String ICONV_KEY
- See Also:
- Constant Field Values
-
OCONV_KEY
private static final java.lang.String OCONV_KEY
- See Also:
- Constant Field Values
-
FULLSTRIP_KEY
private static final java.lang.String FULLSTRIP_KEY
- See Also:
- Constant Field Values
-
LANG_KEY
private static final java.lang.String LANG_KEY
- See Also:
- Constant Field Values
-
KEEPCASE_KEY
private static final java.lang.String KEEPCASE_KEY
- See Also:
- Constant Field Values
-
NEEDAFFIX_KEY
private static final java.lang.String NEEDAFFIX_KEY
- See Also:
- Constant Field Values
-
PSEUDOROOT_KEY
private static final java.lang.String PSEUDOROOT_KEY
- See Also:
- Constant Field Values
-
ONLYINCOMPOUND_KEY
private static final java.lang.String ONLYINCOMPOUND_KEY
- See Also:
- Constant Field Values
-
NUM_FLAG_TYPE
private static final java.lang.String NUM_FLAG_TYPE
- See Also:
- Constant Field Values
-
UTF8_FLAG_TYPE
private static final java.lang.String UTF8_FLAG_TYPE
- See Also:
- Constant Field Values
-
LONG_FLAG_TYPE
private static final java.lang.String LONG_FLAG_TYPE
- See Also:
- Constant Field Values
-
PREFIX_CONDITION_REGEX_PATTERN
private static final java.lang.String PREFIX_CONDITION_REGEX_PATTERN
- See Also:
- Constant Field Values
-
SUFFIX_CONDITION_REGEX_PATTERN
private static final java.lang.String SUFFIX_CONDITION_REGEX_PATTERN
- See Also:
- Constant Field Values
-
patterns
java.util.ArrayList<CharacterRunAutomaton> patterns
-
flagLookup
BytesRefHash flagLookup
-
stripData
char[] stripData
-
stripOffsets
int[] stripOffsets
-
affixData
byte[] affixData
-
currentAffix
private int currentAffix
-
flagParsingStrategy
private Dictionary.FlagParsingStrategy flagParsingStrategy
-
aliases
private java.lang.String[] aliases
-
aliasCount
private int aliasCount
-
morphAliases
private java.lang.String[] morphAliases
-
morphAliasCount
private int morphAliasCount
-
stemExceptions
private java.lang.String[] stemExceptions
-
stemExceptionCount
private int stemExceptionCount
-
hasStemExceptions
boolean hasStemExceptions
-
tempPath
private final java.nio.file.Path tempPath
-
ignoreCase
boolean ignoreCase
-
complexPrefixes
boolean complexPrefixes
-
twoStageAffix
boolean twoStageAffix
-
circumfix
int circumfix
-
keepcase
int keepcase
-
needaffix
int needaffix
-
onlyincompound
int onlyincompound
-
ignore
private char[] ignore
-
needsInputCleaning
boolean needsInputCleaning
-
needsOutputCleaning
boolean needsOutputCleaning
-
fullStrip
boolean fullStrip
-
language
java.lang.String language
-
alternateCasing
boolean alternateCasing
-
ENCODING_PATTERN
static final java.util.regex.Pattern ENCODING_PATTERN
pattern accepts optional BOM + SET + any whitespace
-
CHARSET_ALIASES
static final java.util.Map<java.lang.String,java.lang.String> CHARSET_ALIASES
-
FLAG_SEPARATOR
final char FLAG_SEPARATOR
- See Also:
- Constant Field Values
-
MORPH_SEPARATOR
final char MORPH_SEPARATOR
- See Also:
- Constant Field Values
-
DEFAULT_TEMP_DIR
private static java.nio.file.Path DEFAULT_TEMP_DIR
-
-
Constructor Detail
-
Dictionary
public Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.io.InputStream dictionary) throws java.io.IOException, java.text.ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
- Throws:
java.io.IOException - Can be thrown while reading from the InputStreams
java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
-
Dictionary
public Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase) throws java.io.IOException, java.text.ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
- Throws:
java.io.IOException - Can be thrown while reading from the InputStreams
java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
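Both constructors leave the streams open, so the caller owns their lifetime. The sketch below illustrates that contract with try-with-resources. To keep it runnable without Lucene on the classpath it does not construct a Dictionary: `consume` is a hypothetical stand-in for any reader that, like the constructors, reads its streams without closing them, and `TrackingStream` exists only to make the close calls observable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamOwnershipSketch {

  /** Hypothetical stand-in for a constructor that reads its streams but never closes them. */
  static int consume(InputStream affix, InputStream dictionary) throws IOException {
    return affix.readAllBytes().length + dictionary.readAllBytes().length;
  }

  /** InputStream that records whether close() was called. */
  static final class TrackingStream extends ByteArrayInputStream {
    boolean closed;

    TrackingStream(String content) {
      super(content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  /** Returns {closed by the consumer?, closed by the caller afterwards?}. */
  static boolean[] run() throws IOException {
    TrackingStream affix = new TrackingStream("SET UTF-8\n");
    TrackingStream dictionary = new TrackingStream("1\nword\n");
    boolean closedByConsumer = false;
    // The caller opened the streams, so the caller closes them (Java 9+ syntax).
    try (affix; dictionary) {
      consume(affix, dictionary);
      closedByConsumer = affix.closed || dictionary.closed;
    }
    return new boolean[] {closedByConsumer, affix.closed && dictionary.closed};
  }
}
```

With a real Dictionary, the constructor call would take the place of `consume` inside the same try-with-resources block.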
-
-
Method Detail
-
lookupWord
IntsRef lookupWord(char[] word, int offset, int length)
Looks up Hunspell word forms from the dictionary
-
lookupPrefix
IntsRef lookupPrefix(char[] word, int offset, int length)
-
lookupSuffix
IntsRef lookupSuffix(char[] word, int offset, int length)
-
readAffixFile
private void readAffixFile(java.io.InputStream affixStream, java.nio.charset.CharsetDecoder decoder) throws java.io.IOException, java.text.ParseException
Reads the affix file through the provided InputStream, building up the prefix and suffix maps
- Parameters:
affixStream - InputStream to read the content of the affix file from
decoder - CharsetDecoder to decode the content of the file
- Throws:
java.io.IOException - Can be thrown while reading from the InputStream
java.text.ParseException
-
affixFST
private FST<IntsRef> affixFST(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes) throws java.io.IOException
- Throws:
java.io.IOException
-
escapeDash
static java.lang.String escapeDash(java.lang.String re)
-
parseAffix
private void parseAffix(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes, java.lang.String header, java.io.LineNumberReader reader, java.lang.String conditionPattern, java.util.Map<java.lang.String,java.lang.Integer> seenPatterns, java.util.Map<java.lang.String,java.lang.Integer> seenStrips) throws java.io.IOException, java.text.ParseException
Parses a specific affix rule, putting the result into the provided affix map
- Parameters:
affixes - Map where the result of the parsing will be put
header - Header line of the affix rule
reader - LineNumberReader to read the content of the rule from
conditionPattern - String.format(String, Object...) pattern to be used to generate the condition regex pattern
seenPatterns - map from condition -> index of patterns, for deduplication.
- Throws:
java.io.IOException - Can be thrown while reading the rule
java.text.ParseException
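The conditionPattern and seenPatterns parameters work together: the condition from an affix rule is turned into a regex with String.format, and the map ensures each distinct regex is stored once. The sketch below illustrates this under stated assumptions. The two format strings are guesses at the values of PREFIX_CONDITION_REGEX_PATTERN and SUFFIX_CONDITION_REGEX_PATTERN, which this page lists but does not show, and the class, field and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConditionPatternSketch {
  // Assumed values: a prefix condition anchors at the start, a suffix condition at the end.
  static final String PREFIX_CONDITION = "%s.*";
  static final String SUFFIX_CONDITION = ".*%s";

  // regex -> index into patterns, so a repeated condition reuses its slot
  final Map<String, Integer> seenPatterns = new HashMap<>();
  final List<String> patterns = new ArrayList<>();

  /** Returns the index of the regex built from the condition, adding it only if it is new. */
  int patternIndex(String conditionPattern, String condition) {
    String regex = String.format(conditionPattern, condition);
    Integer index = seenPatterns.get(regex);
    if (index == null) {
      index = patterns.size();
      seenPatterns.put(regex, index);
      patterns.add(regex);
    }
    return index;
  }
}
```

Affix files tend to repeat the same few conditions across many rules, so storing an index per rule instead of a regex keeps the pattern list short.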
-
parseConversions
private FST<CharsRef> parseConversions(java.io.LineNumberReader reader, int num) throws java.io.IOException, java.text.ParseException
- Throws:
java.io.IOException
java.text.ParseException
-
getDictionaryEncoding
static java.lang.String getDictionaryEncoding(java.io.InputStream affix) throws java.io.IOException, java.text.ParseException
Parses the encoding specified in the affix file readable through the provided InputStream
- Parameters:
affix - InputStream for reading the affix file
- Returns:
Encoding specified in the affix file
- Throws:
java.io.IOException - Can be thrown while reading from the InputStream
java.text.ParseException - Thrown if the first non-empty non-comment line read from the file does not adhere to the format SET <encoding>
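The contract described here (skip empty and comment lines, then require SET <encoding>) can be sketched with the standard library alone. This is an illustration, not the Lucene implementation. The regex is an assumption modelled on the ENCODING_PATTERN description (optional BOM, then SET, then whitespace), the `#` comment marker follows Hunspell convention, and throwing at end of input is a choice made for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingSketch {
  // Optional UTF-8 BOM as it appears when its bytes are decoded as ISO-8859-1,
  // then SET, whitespace, and the encoding name.
  static final Pattern ENCODING = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+(\\S+)");

  static String dictionaryEncoding(InputStream affix) throws IOException, ParseException {
    // ISO-8859-1 maps every byte to one char, so the real encoding can be read
    // before it is known. The reader is deliberately not closed: the caller owns the stream.
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(affix, StandardCharsets.ISO_8859_1));
    String line;
    int lineNumber = 0;
    while ((line = reader.readLine()) != null) {
      lineNumber++;
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      Matcher matcher = ENCODING.matcher(trimmed);
      if (matcher.find()) {
        return matcher.group(2);
      }
      throw new ParseException("Expected SET <encoding> but found: " + trimmed, lineNumber);
    }
    throw new ParseException("No SET <encoding> line found", lineNumber);
  }
}
```

The returned name is still the Hunspell spelling of the encoding; mapping it to a Java charset is a separate step.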
-
getJavaEncoding
private java.nio.charset.CharsetDecoder getJavaEncoding(java.lang.String encoding)
Retrieves the CharsetDecoder for the given encoding. Note: this is not perfect, since encoding names such as ISCII-DEVANAGARI and MICROSOFT-CP1251 are probably also allowed.
- Parameters:
encoding - Encoding to retrieve the CharsetDecoder for
- Returns:
CharsetDecoder for the given encoding
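Hunspell encoding names do not always match the names java.nio.charset.Charset recognises, which is presumably what CHARSET_ALIASES is for. The sketch below shows one way such a lookup could work. The alias entry, the class and method names, and the choice to replace malformed input rather than fail are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Map;

final class JavaEncodingSketch {
  // Hypothetical alias table: Hunspell spelling -> name known to java.nio.charset.Charset.
  static final Map<String, String> ALIASES = Map.of("microsoft-cp1251", "windows-1251");

  static CharsetDecoder decoderFor(String encoding) {
    String canonical = ALIASES.getOrDefault(encoding, encoding);
    // Charset.forName throws UnsupportedCharsetException for names the JVM does not know.
    return Charset.forName(canonical)
        .newDecoder()
        // Sketch choice: substitute the replacement character instead of failing on bad bytes.
        .onMalformedInput(CodingErrorAction.REPLACE);
  }
}
```

Names missing from the alias table are passed through unchanged, so standard names such as UTF-8 keep working.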
-
getFlagParsingStrategy
static Dictionary.FlagParsingStrategy getFlagParsingStrategy(java.lang.String flagLine)
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file
- Parameters:
flagLine - Line containing the flag information
- Returns:
FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
-
unescapeEntry
java.lang.String unescapeEntry(java.lang.String entry)
-
morphBoundary
static int morphBoundary(java.lang.String line)
-
indexOfSpaceOrTab
static int indexOfSpaceOrTab(java.lang.String text, int start)
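This page gives only the signature of indexOfSpaceOrTab, so the following is a guess at its behaviour based on the name: find the first space or tab at or after start. Returning -1 when neither occurs mirrors String.indexOf and is an assumption.

```java
final class WhitespaceIndexSketch {

  /** Index of the first ' ' or '\t' at or after start, or -1 if neither occurs. */
  static int indexOfSpaceOrTab(String text, int start) {
    int space = text.indexOf(' ', start);
    int tab = text.indexOf('\t', start);
    if (space < 0) {
      return tab;
    }
    if (tab < 0) {
      return space;
    }
    return Math.min(space, tab);
  }
}
```

A helper like this is handy for .dic lines, where the word and its flags are separated from morphological data by either kind of whitespace.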
-
readDictionaryFiles
private void readDictionaryFiles(Directory tempDir, java.lang.String tempFileNamePrefix, java.util.List<java.io.InputStream> dictionaries, java.nio.charset.CharsetDecoder decoder, Builder<IntsRef> words) throws java.io.IOException
Reads the dictionary file through the provided InputStreams, building up the words map
- Parameters:
dictionaries - InputStreams to read the dictionary file through
decoder - CharsetDecoder used to decode the contents of the file
- Throws:
java.io.IOException - Can be thrown while reading from the file
-
decodeFlags
static char[] decodeFlags(BytesRef b)
-
encodeFlags
static void encodeFlags(BytesRefBuilder b, char[] flags)
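decodeFlags and encodeFlags convert between a char[] of flags and a byte sequence. The page does not document the byte layout, so the sketch below assumes two bytes per flag, high byte first, and uses a plain byte[] in place of BytesRef and BytesRefBuilder so that it runs without Lucene. The class name is hypothetical.

```java
final class FlagCodecSketch {

  /** Writes each flag as two bytes, high byte first (assumed layout). */
  static byte[] encodeFlags(char[] flags) {
    byte[] out = new byte[flags.length * 2];
    for (int i = 0; i < flags.length; i++) {
      out[2 * i] = (byte) (flags[i] >> 8);
      out[2 * i + 1] = (byte) flags[i];
    }
    return out;
  }

  /** Reverses encodeFlags: every pair of bytes becomes one char. */
  static char[] decodeFlags(byte[] bytes) {
    char[] flags = new char[bytes.length / 2];
    for (int i = 0; i < flags.length; i++) {
      int high = bytes[2 * i] & 0xff;     // mask to undo sign extension
      int low = bytes[2 * i + 1] & 0xff;
      flags[i] = (char) ((high << 8) | low);
    }
    return flags;
  }
}
```

The property that matters is the round trip: decoding an encoded array yields the original flags.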
-
parseAlias
private void parseAlias(java.lang.String line)
-
getAliasValue
private java.lang.String getAliasValue(int id)
-
getStemException
java.lang.String getStemException(int id)
-
parseMorphAlias
private void parseMorphAlias(java.lang.String line)
-
parseStemException
private java.lang.String parseStemException(java.lang.String morphData)
-
hasFlag
static boolean hasFlag(char[] flags, char flag)
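hasFlag reports whether a flag occurs in a flag array. A minimal sketch follows, assuming the array is kept sorted so that a binary search is valid; the page does not state how flags are ordered, and a linear scan would be needed if they are not.

```java
import java.util.Arrays;

final class HasFlagSketch {

  /** Assumes sortedFlags is in ascending order, as binary search requires. */
  static boolean hasFlag(char[] sortedFlags, char flag) {
    return Arrays.binarySearch(sortedFlags, flag) >= 0;
  }
}
```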
-
cleanInput
java.lang.CharSequence cleanInput(java.lang.CharSequence input, java.lang.StringBuilder reuse)
-
caseFold
char caseFold(char c)
folds single character (according to LANG if present)
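Case folding that depends on LANG matters mainly for languages with dotted and dotless i. The sketch below assumes that the alternateCasing field is switched on for Turkish and Azeri dictionaries and that folding otherwise falls back to Character.toLowerCase. The language codes and the class and method names are assumptions, not taken from this page.

```java
final class CaseFoldSketch {

  /** Assumption: only these LANG values need the dotted/dotless i rules. */
  static boolean needsAlternateCasing(String language) {
    return "tr_TR".equals(language) || "az_AZ".equals(language);
  }

  static char caseFold(char c, boolean alternateCasing) {
    if (alternateCasing) {
      if (c == 'I') {
        return '\u0131'; // LATIN SMALL LETTER DOTLESS I
      }
      if (c == '\u0130') { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 'i';
      }
    }
    return Character.toLowerCase(c);
  }
}
```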
-
applyMappings
static void applyMappings(FST<CharsRef> fst, java.lang.StringBuilder sb) throws java.io.IOException
- Throws:
java.io.IOException
-
getIgnoreCase
public boolean getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option
-
setDefaultTempDir
public static void setDefaultTempDir(java.nio.file.Path tempDir)
Used by test framework
-
getDefaultTempDir
static java.nio.file.Path getDefaultTempDir() throws java.io.IOException
Returns the default temporary directory. By default, java.io.tmpdir. If not accessible or not available, an IOException is thrown
- Throws:
java.io.IOException
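The described behaviour (use java.io.tmpdir, fail with an IOException if it is missing or unusable) can be sketched as follows. The checks for "accessible" used here, an existing and writable directory, are an assumption, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class DefaultTempDirSketch {

  static Path defaultTempDir() throws IOException {
    String property = System.getProperty("java.io.tmpdir");
    if (property == null) {
      throw new IOException("java.io.tmpdir is not set");
    }
    Path dir = Paths.get(property);
    // Assumed meaning of "accessible": the path is an existing, writable directory.
    if (!Files.isDirectory(dir) || !Files.isWritable(dir)) {
      throw new IOException("java.io.tmpdir is not an accessible directory: " + dir);
    }
    return dir;
  }
}
```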
-
-