Class FreeTextSuggester
- java.lang.Object
  - org.apache.lucene.search.suggest.Lookup
    - org.apache.lucene.search.suggest.analyzing.FreeTextSuggester
- All Implemented Interfaces: Accountable
public class FreeTextSuggester extends Lookup implements Accountable
Builds an ngram model from the text sent to build(org.apache.lucene.search.suggest.InputIterator) and predicts based on the last grams-1 tokens in the request sent to lookup(java.lang.CharSequence, boolean, int). This tries to handle the "long tail" of suggestions, for when the incoming query is a never-before-seen query string.
Likely this suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.
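To make "the last grams-1 tokens" concrete, here is a minimal standalone sketch. It does not use Lucene: whitespace splitting stands in for the query analyzer, and the class and method names (NgramContextSketch, contextTokens) are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class NgramContextSketch {
    /**
     * Returns the last (grams - 1) tokens of an already-tokenized request,
     * or all tokens when the request is shorter than that.
     */
    static List<String> contextTokens(List<String> tokens, int grams) {
        int keep = Math.max(0, grams - 1);
        int from = Math.max(0, tokens.size() - keep);
        return tokens.subList(from, tokens.size());
    }

    public static void main(String[] args) {
        // Assumption: whitespace splitting stands in for the query analyzer.
        List<String> request = Arrays.asList("new york city".split(" "));
        System.out.println(contextTokens(request, 3)); // [york, city]
        System.out.println(contextTokens(request, 2)); // [city]
    }
}
```

With a trigram model (grams = 3) the two preceding tokens form the context; with the default bigram model only the last token does.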
Note that the weight for each suggestion is unused, and the suggestions are the analyzed forms (so your analysis process should normally be very "light").
This uses the stupid backoff language model to smooth scores across ngram models; see "Large language models in machine translation", http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.1126 for details.
From lookup(java.lang.CharSequence, boolean, int), the key of each result is the ngram token; the value is Long.MAX_VALUE * score (fixed point, cast to long). Divide by Long.MAX_VALUE to get the score back, which ranges from 0.0 to 1.0. onlyMorePopular is unused.
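The fixed-point encoding described above can be sketched in plain Java. The class and method names here (FixedPointScoreSketch, encode, decode) are hypothetical; in real use the long would come from the value of a Lookup.LookupResult.

```java
final class FixedPointScoreSketch {
    /** Encodes a score in [0.0, 1.0] as Long.MAX_VALUE * score, cast to long. */
    static long encode(double score) {
        return (long) (Long.MAX_VALUE * score);
    }

    /** Recovers the score by dividing the stored value by Long.MAX_VALUE. */
    static double decode(long value) {
        return (double) value / Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        long stored = encode(0.4);
        // Prints the stored long and a value very close to 0.4.
        System.out.println(stored + " -> " + decode(stored));
        System.out.println(decode(Long.MAX_VALUE)); // 1.0
    }
}
```

The round trip is only approximate, since a double cannot represent every long exactly, so compare decoded scores with a tolerance rather than for equality.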
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.search.suggest.Lookup
Lookup.LookupPriorityQueue, Lookup.LookupResult
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static double ALPHA: The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur, and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned.
- static java.lang.String CODEC_NAME: Codec name used in the header for the saved model.
- private long count: Number of entries the lookup was built with.
- static int DEFAULT_GRAMS: By default we use a bigram model.
- static byte DEFAULT_SEPARATOR: The default character used to join multiple tokens into a single ngram token.
- private FST<java.lang.Long> fst: Holds 1gram, 2gram, 3gram models as a single FST.
- private int grams
- private Analyzer indexAnalyzer: Analyzer that will be used for analyzing suggestions at index time.
- private Analyzer queryAnalyzer: Analyzer that will be used for analyzing suggestions at query time.
- private byte separator
- private long totTokens
- static int VERSION_CURRENT: Current version of the saved model file format.
- static int VERSION_START: Initial version of the saved model file format.
- (package private) static java.util.Comparator<java.lang.Long> weightComparator
Fields inherited from class org.apache.lucene.search.suggest.Lookup
CHARSEQUENCE_COMPARATOR
-
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
-
Constructor Summary
Constructors (Constructor, Description):
- FreeTextSuggester(Analyzer analyzer): Instantiate, using the provided analyzer for both indexing and lookup, using a bigram model by default.
- FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer): Instantiate, using the provided indexing and lookup analyzers, using a bigram model by default.
- FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams): Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.).
- FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator): Instantiate, using the provided indexing and lookup analyzers, and specified model (2 = bigram, 3 = trigram, etc.).
-
Method Summary
Methods (Modifier and Type, Method, Description):
- private Analyzer addShingles(Analyzer other)
- void build(InputIterator iterator): Builds up a new internal Lookup representation based on the given InputIterator.
- void build(InputIterator iterator, double ramBufferSizeMB): Build the suggest index, using up to the specified amount of temporary RAM while building.
- private int countGrams(BytesRef token)
- private long decodeWeight(java.lang.Long output): cost -> weight
- private long encodeWeight(long ngramCount): weight -> cost
- java.lang.Object get(java.lang.CharSequence key): Returns the weight associated with an input string, or null if it does not exist.
- java.util.Collection<Accountable> getChildResources(): Returns nested resources of this class.
- long getCount(): Get the number of entries the lookup was built with.
- boolean load(DataInput input): Discard current lookup data and load it from a previously saved copy.
- java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, boolean onlyMorePopular, int num): Look up a key and return possible completion for this key.
- java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, int num): Lookup, without any context.
- java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, boolean onlyMorePopular, int num): Look up a key and return possible completion for this key.
- java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, int num): Retrieve suggestions.
- private java.lang.Long lookupPrefix(FST<java.lang.Long> fst, FST.BytesReader bytesReader, BytesRef scratch, FST.Arc<java.lang.Long> arc)
- long ramBytesUsed(): Returns byte size of the underlying FST.
- boolean store(DataOutput output): Persist the constructed lookup data to a directory.
-
-
-
Field Detail
-
CODEC_NAME
public static final java.lang.String CODEC_NAME
Codec name used in the header for the saved model.
- See Also: Constant Field Values
-
VERSION_START
public static final int VERSION_START
Initial version of the saved model file format.
- See Also: Constant Field Values
-
VERSION_CURRENT
public static final int VERSION_CURRENT
Current version of the saved model file format.
- See Also: Constant Field Values
-
DEFAULT_GRAMS
public static final int DEFAULT_GRAMS
By default we use a bigram model.
- See Also: Constant Field Values
-
ALPHA
public static final double ALPHA
The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur, and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned.
- See Also: Constant Field Values
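The effect of ALPHA can be illustrated with a small standalone sketch of stupid-backoff scoring over a map of ngram counts. This is not the Lucene implementation (which stores the model in an FST); the class name, the space-joined ngram keys, and the sample counts are all assumptions made for illustration. Only the 0.4 constant comes from the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class StupidBackoffSketch {
    static final double ALPHA = 0.4; // backoff constant stated in the field description

    /**
     * Scores `word` given `context`, trying the longest ngram first and
     * multiplying by ALPHA each time we back off to a shorter context.
     * Ngram keys are tokens joined by a single space (an assumption of this sketch).
     */
    static double score(Map<String, Long> counts, long totalTokens, String[] context, String word) {
        double backoff = 1.0;
        for (int start = 0; start <= context.length; start++) {
            String prefix = String.join(" ", Arrays.copyOfRange(context, start, context.length));
            String ngram = prefix.isEmpty() ? word : prefix + " " + word;
            long ngramCount = counts.getOrDefault(ngram, 0L);
            long denom = prefix.isEmpty() ? totalTokens : counts.getOrDefault(prefix, 0L);
            if (ngramCount > 0 && denom > 0) {
                return backoff * ngramCount / denom;
            }
            backoff *= ALPHA; // the ngram was not seen: back off one order
        }
        return 0.0; // the word never occurred at all
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("the", 4L);
        counts.put("cat", 2L);
        counts.put("the cat", 2L);
        long totalTokens = 10;
        // Bigram seen: count("the cat") / count("the")
        System.out.println(score(counts, totalTokens, new String[] {"the"}, "cat")); // 0.5
        // Bigram unseen: ALPHA * count("cat") / totalTokens, about 0.08
        System.out.println(score(counts, totalTokens, new String[] {"black"}, "cat"));
        // Trigram and bigram unseen: ALPHA * ALPHA * unigram score, about 0.032
        System.out.println(score(counts, totalTokens, new String[] {"big", "black"}, "cat"));
    }
}
```

Each backoff step multiplies the score by ALPHA, so a suggestion that is only supported by a shorter ngram is ranked below one supported by the full context.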
-
fst
private FST<java.lang.Long> fst
Holds 1gram, 2gram, 3gram models as a single FST.
-
indexAnalyzer
private final Analyzer indexAnalyzer
Analyzer that will be used for analyzing suggestions at index time.
-
totTokens
private long totTokens
-
queryAnalyzer
private final Analyzer queryAnalyzer
Analyzer that will be used for analyzing suggestions at query time.
-
grams
private final int grams
-
separator
private final byte separator
-
count
private long count
Number of entries the lookup was built with
-
DEFAULT_SEPARATOR
public static final byte DEFAULT_SEPARATOR
The default character used to join multiple tokens into a single ngram token. The input tokens produced by the analyzer must not contain this character.
- See Also: Constant Field Values
-
weightComparator
static final java.util.Comparator<java.lang.Long> weightComparator
-
-
Constructor Detail
-
FreeTextSuggester
public FreeTextSuggester(Analyzer analyzer)
Instantiate, using the provided analyzer for both indexing and lookup, using a bigram model by default.
-
FreeTextSuggester
public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer)
Instantiate, using the provided indexing and lookup analyzers, using a bigram model by default.
-
FreeTextSuggester
public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams)
Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.).
-
FreeTextSuggester
public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator)
Instantiate, using the provided indexing and lookup analyzers, and specified model (2 = bigram, 3 = trigram, etc.). The separator is passed to ShingleFilter.setTokenSeparator(java.lang.String) to join multiple tokens into a single ngram token; it must be an ASCII (7-bit-clean) byte. No input tokens should contain this byte; otherwise IllegalArgumentException is thrown.
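The separator rules can be sketched without Lucene as follows. Everything in this block is hypothetical illustration: the class and method names, the byte value 0x1e used for the separator, and the idea of counting grams as "separators plus one" are assumptions, not the library's documented behavior.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class SeparatorSketch {
    // Assumed value for illustration; check DEFAULT_SEPARATOR in your Lucene version.
    static final byte SEPARATOR = 0x1e;

    /** Rejects separators that are not 7-bit clean (high bit set). */
    static void checkSeparator(byte separator) {
        if ((separator & 0x80) != 0) {
            throw new IllegalArgumentException("separator must be a 7-bit ASCII byte");
        }
    }

    /** Joins UTF-8 encoded tokens with the separator, rejecting tokens that contain it. */
    static byte[] join(List<String> tokens, byte separator) {
        checkSeparator(separator);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < tokens.size(); i++) {
            byte[] bytes = tokens.get(i).getBytes(StandardCharsets.UTF_8);
            for (byte b : bytes) {
                if (b == separator) {
                    throw new IllegalArgumentException("token contains the separator byte");
                }
            }
            if (i > 0) {
                out.write(separator);
            }
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    /** Number of grams in a joined token: one more than the number of separators. */
    static int countGrams(byte[] ngram, byte separator) {
        int grams = 1;
        for (byte b : ngram) {
            if (b == separator) {
                grams++;
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        byte[] bigram = join(Arrays.asList("new", "york"), SEPARATOR);
        System.out.println(bigram.length);                    // 8
        System.out.println(countGrams(bigram, SEPARATOR));    // 2
    }
}
```

Because a 7-bit ASCII byte never occurs inside a multi-byte UTF-8 sequence, checking the raw bytes is enough to detect a clash between a token and the separator.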
-
-
Method Detail
-
ramBytesUsed
public long ramBytesUsed()
Returns byte size of the underlying FST.
- Specified by: ramBytesUsed in interface Accountable
-
getChildResources
public java.util.Collection<Accountable> getChildResources()
Description copied from interface: Accountable
Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).
- Specified by: getChildResources in interface Accountable
- See Also: Accountables
-
build
public void build(InputIterator iterator) throws java.io.IOException
Description copied from class: Lookup
Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
-
build
public void build(InputIterator iterator, double ramBufferSizeMB) throws java.io.IOException
Build the suggest index, using up to the specified amount of temporary RAM while building. Note that the weights for the suggestions are ignored.
- Throws: java.io.IOException
-
store
public boolean store(DataOutput output) throws java.io.IOException
Description copied from class: Lookup
Persist the constructed lookup data to a directory. Optional operation.
- Specified by: store in class Lookup
- Parameters: output - DataOutput to write the data to.
- Returns: true if successful, false if unsuccessful or not supported.
- Throws: java.io.IOException - when fatal IO error occurs.
-
load
public boolean load(DataInput input) throws java.io.IOException
Description copied from class: Lookup
Discard current lookup data and load it from a previously saved copy. Optional operation.
-
lookup
public java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, boolean onlyMorePopular, int num)
Description copied from class: Lookup
Look up a key and return possible completion for this key.
- Overrides: lookup in class Lookup
- Parameters:
  key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
  onlyMorePopular - return only more popular results
  num - maximum number of results to return
- Returns: a list of possible completions, with their relative weight (e.g. popularity)
-
lookup
public java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, int num)
Lookup, without any context.
-
lookup
public java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, boolean onlyMorePopular, int num)
Description copied from class: Lookup
Look up a key and return possible completion for this key.
- Specified by: lookup in class Lookup
- Parameters:
  key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
  contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
  onlyMorePopular - return only more popular results
  num - maximum number of results to return
- Returns: a list of possible completions, with their relative weight (e.g. popularity)
-
getCount
public long getCount()
Description copied from class: Lookup
Get the number of entries the lookup was built with.
-
countGrams
private int countGrams(BytesRef token)
-
lookup
public java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, int num) throws java.io.IOException
Retrieve suggestions.
- Throws: java.io.IOException
-
encodeWeight
private long encodeWeight(long ngramCount)
weight -> cost
-
decodeWeight
private long decodeWeight(java.lang.Long output)
cost -> weight
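The "weight -> cost" and "cost -> weight" notes do not say how the mapping works. One plausible scheme, shown here purely as an assumption, subtracts the ngram count from Long.MAX_VALUE so that more frequent ngrams get a lower cost, which suits a search that looks for minimum-cost paths. The class name is hypothetical and the private methods of FreeTextSuggester may differ.

```java
final class WeightCostSketch {
    /** weight -> cost: a larger ngram count yields a smaller cost (assumed scheme). */
    static long encodeWeight(long ngramCount) {
        return Long.MAX_VALUE - ngramCount;
    }

    /** cost -> weight: inverts encodeWeight. */
    static long decodeWeight(long cost) {
        return Long.MAX_VALUE - cost;
    }

    public static void main(String[] args) {
        long frequent = encodeWeight(100);
        long rare = encodeWeight(3);
        System.out.println(frequent < rare);        // true
        System.out.println(decodeWeight(frequent)); // 100
    }
}
```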
-
lookupPrefix
private java.lang.Long lookupPrefix(FST<java.lang.Long> fst, FST.BytesReader bytesReader, BytesRef scratch, FST.Arc<java.lang.Long> arc) throws java.io.IOException
- Throws: java.io.IOException
-
get
public java.lang.Object get(java.lang.CharSequence key)
Returns the weight associated with an input string, or null if it does not exist.
-
-