Class Builder<T>
- java.lang.Object
-
- org.apache.lucene.util.fst.Builder<T>
-
public class Builder<T> extends java.lang.ObjectBuilds a minimal FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the subclasses of
Outputs.FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static classBuilder.Arc<T>Expert: holds a pending (seen but not yet serialized) arc.(package private) static classBuilder.CompiledNode(package private) static classBuilder.FixedLengthArcsBufferReusable buffer for building nodes with fixed length arcs (binary search or direct addressing).(package private) static interfaceBuilder.Nodestatic classBuilder.UnCompiledNode<T>Expert: holds a pending (seen but not yet serialized) Node.
-
Field Summary
Fields Modifier and Type Field Description (package private) booleanallowFixedLengthArcs(package private) longarcCount(package private) longbinarySearchNodeCount(package private) BytesStorebytesprivate NodeHash<T>dedupHash(package private) static floatDIRECT_ADDRESSING_MAX_OVERSIZING_FACTORDefault oversizing factor used to decide whether to encode a node with direct addressing or binary search.(package private) longdirectAddressingExpansionCredit(package private) floatdirectAddressingMaxOversizingFactor(package private) longdirectAddressingNodeCountprivate booleandoShareNonSingletonNodes(package private) Builder.FixedLengthArcsBufferfixedLengthArcsBufferprivate Builder.UnCompiledNode<T>[]frontier(package private) FST<T>fst(package private) longlastFrozenNodeprivate IntsRefBuilderlastInputprivate intminSuffixCount1private intminSuffixCount2private TNO_OUTPUT(package private) longnodeCount(package private) int[]numBytesPerArc(package private) int[]numLabelBytesPerArcprivate intshareMaxTailLength
-
Constructor Summary
Constructors Constructor Description Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, boolean allowFixedLengthArcs, int bytesPageBits)Instantiates an FST/FSA builder with all the possible tuning and construction tweaks.Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)Instantiates an FST/FSA builder without any pruning.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description voidadd(IntsRef input, T output)Add the next input/output pair.private voidcompileAllTargets(Builder.UnCompiledNode<T> node, int tailLength)private Builder.CompiledNodecompileNode(Builder.UnCompiledNode<T> nodeIn, int tailLength)FST<T>finish()Returns final FST.private voidfreezeTail(int prefixLenPlus1)longfstRamBytesUsed()longgetArcCount()floatgetDirectAddressingMaxOversizingFactor()longgetMappedStateCount()longgetNodeCount()longgetTermCount()Builder<T>setDirectAddressingMaxOversizingFactor(float factor)Overrides the default the maximum oversizing of fixed array allowed to enable direct addressing of arcs instead of binary search.private booleanvalidOutput(T output)
-
-
-
Field Detail
-
DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
static final float DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
Default oversizing factor used to decide whether to encode a node with direct addressing or binary search. Default is 1: ensure no oversizing on average.This factor does not determine whether to encode a node with a list of variable length arcs or with fixed length arcs. It only determines the effective encoding of a node that is already known to be encoded with fixed length arcs. See
FST.shouldExpandNodeWithFixedLengthArcs()andFST.shouldExpandNodeWithDirectAddressing().For English words we measured 217K nodes, only 3.27% nodes are encoded with fixed length arcs, and 99.99% of them with direct addressing. Overall FST memory reduced by 1.67%.
For worst case we measured 168K nodes, 50% of them are encoded with fixed length arcs, and 14% of them with direct encoding. Overall FST memory reduced by 0.8%.
Use
TestFstDirectAddressing.main()andTestFstDirectAddressing.testWorstCaseForDirectAddressing()to evaluate a change.
-
NO_OUTPUT
private final T NO_OUTPUT
-
minSuffixCount1
private final int minSuffixCount1
-
minSuffixCount2
private final int minSuffixCount2
-
doShareNonSingletonNodes
private final boolean doShareNonSingletonNodes
-
shareMaxTailLength
private final int shareMaxTailLength
-
lastInput
private final IntsRefBuilder lastInput
-
frontier
private Builder.UnCompiledNode<T>[] frontier
-
lastFrozenNode
long lastFrozenNode
-
numBytesPerArc
int[] numBytesPerArc
-
numLabelBytesPerArc
int[] numLabelBytesPerArc
-
fixedLengthArcsBuffer
final Builder.FixedLengthArcsBuffer fixedLengthArcsBuffer
-
arcCount
long arcCount
-
nodeCount
long nodeCount
-
binarySearchNodeCount
long binarySearchNodeCount
-
directAddressingNodeCount
long directAddressingNodeCount
-
allowFixedLengthArcs
boolean allowFixedLengthArcs
-
directAddressingMaxOversizingFactor
float directAddressingMaxOversizingFactor
-
directAddressingExpansionCredit
long directAddressingExpansionCredit
-
bytes
BytesStore bytes
-
-
Constructor Detail
-
Builder
public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)
Instantiates an FST/FSA builder without any pruning. A shortcut toBuilder(FST.INPUT_TYPE, int, int, boolean, boolean, int, Outputs, boolean, int)with pruning options turned off.
-
Builder
public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, boolean allowFixedLengthArcs, int bytesPageBits)
Instantiates an FST/FSA builder with all the possible tuning and construction tweaks. Read parameter documentation carefully.- Parameters:
inputType- The input type (transition labels). Can be anything fromFST.INPUT_TYPEenumeration. Shorter types will consume less memory. Strings (character sequences) are represented asFST.INPUT_TYPE.BYTE4(full unicode codepoints).minSuffixCount1- If pruning the input graph during construction, this threshold is used for telling if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node is kept.minSuffixCount2- (Note: only Mike McCandless knows what this one is really doing...)doShareSuffix- Iftrue, the shared suffixes will be compacted into unique paths. This requires an additional RAM-intensive hash map for lookups in memory. Setting this parameter tofalsecreates a single suffix path for all input sequences. This will result in a larger FST, but requires substantially less memory and CPU during building.doShareNonSingletonNodes- Only used if doShareSuffix is true. Set this to true to ensure FST is fully minimal, at cost of more CPU and more RAM during building.shareMaxTailLength- Only used if doShareSuffix is true. Set this to Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more CPU and more RAM during building.outputs- The output type for each input sequence. Applies only if building an FST. For FSA, useNoOutputs.getSingleton()andNoOutputs.getNoOutput()as the singleton output object.allowFixedLengthArcs- Pass false to disable the fixed length arc optimization (binary search or direct addressing) while building the FST; this will make the resulting FST smaller but slower to traverse.bytesPageBits- How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be large then make this larger. For example 15 bits = 32768 byte pages.
-
-
Method Detail
-
setDirectAddressingMaxOversizingFactor
public Builder<T> setDirectAddressingMaxOversizingFactor(float factor)
Overrides the default the maximum oversizing of fixed array allowed to enable direct addressing of arcs instead of binary search.Setting this factor to a negative value (e.g. -1) effectively disables direct addressing, only binary search nodes will be created.
- See Also:
DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
-
getDirectAddressingMaxOversizingFactor
public float getDirectAddressingMaxOversizingFactor()
-
getTermCount
public long getTermCount()
-
getNodeCount
public long getNodeCount()
-
getArcCount
public long getArcCount()
-
getMappedStateCount
public long getMappedStateCount()
-
compileNode
private Builder.CompiledNode compileNode(Builder.UnCompiledNode<T> nodeIn, int tailLength) throws java.io.IOException
- Throws:
java.io.IOException
-
freezeTail
private void freezeTail(int prefixLenPlus1) throws java.io.IOException- Throws:
java.io.IOException
-
add
public void add(IntsRef input, T output) throws java.io.IOException
Add the next input/output pair. The provided input must be sorted after the previous one according toIntsRef.compareTo(org.apache.lucene.util.IntsRef). It's also OK to add the same input twice in a row with different outputs, as long asOutputsimplements theOutputs.merge(T, T)method. Note that input is fully consumed after this method is returned (so caller is free to reuse), but output is not. So if your outputs are changeable (egByteSequenceOutputsorIntSequenceOutputs) then you cannot reuse across calls.- Throws:
java.io.IOException
-
validOutput
private boolean validOutput(T output)
-
finish
public FST<T> finish() throws java.io.IOException
Returns final FST. NOTE: this will return null if nothing is accepted by the FST.- Throws:
java.io.IOException
-
compileAllTargets
private void compileAllTargets(Builder.UnCompiledNode<T> node, int tailLength) throws java.io.IOException
- Throws:
java.io.IOException
-
fstRamBytesUsed
public long fstRamBytesUsed()
-
-