Package org.apache.lucene.util.fst
Class FST<T>
- java.lang.Object
-
- org.apache.lucene.util.fst.FST<T>
-
- All Implemented Interfaces:
Accountable
public final class FST<T> extends java.lang.Object implements Accountable
Represents an finite state machine (FST), using a compact byte[] format.The format is similar to what's used by Morfologik (https://github.com/morfologik/morfologik-stemming).
See the
package documentationfor some simple examples.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static classFST.Arc<T>Represents a single arc.static classFST.BytesReaderReads bytes stored in an FST.static classFST.INPUT_TYPESpecifies allowed range of each int input label for this FST.
-
Field Summary
Fields Modifier and Type Field Description private static longARC_SHALLOW_RAM_BYTES_USEDstatic byteARCS_FOR_BINARY_SEARCHValue of the arc flags to declare a node with fixed length arcs designed for binary search.(package private) static byteARCS_FOR_DIRECT_ADDRESSINGValue of the arc flags to declare a node with fixed length arcs and bit table designed for direct addressing.private static longBASE_RAM_BYTES_USEDprivate static intBIT_ARC_HAS_FINAL_OUTPUTstatic intBIT_ARC_HAS_OUTPUTThis flag is set if the arc has an output.private static intBIT_FINAL_ARC(package private) static intBIT_LAST_ARCprivate static intBIT_STOP_NODE(package private) static intBIT_TARGET_NEXT(package private) BytesStorebytesABytesStore, used during building, or during reading when the FST is very large (more than 1 GB).private static intDEFAULT_MAX_BLOCK_BITSprivate static floatDIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTORMaximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing.(package private) TemptyOutputstatic intEND_LABELIf arc has this label then that arc is final/acceptedprivate static java.lang.StringFILE_FORMAT_NAMEprivate static longFINAL_END_NODE(package private) static intFIXED_LENGTH_ARC_DEEP_NUM_ARCS(package private) static intFIXED_LENGTH_ARC_SHALLOW_DEPTH(package private) static intFIXED_LENGTH_ARC_SHALLOW_NUM_ARCSprivate FSTStorefstStore(package private) FST.INPUT_TYPEinputTypeprivate static longNON_FINAL_END_NODEOutputs<T>outputsprivate longstartNodeprivate static intVERSION_CURRENTprivate static intVERSION_START-
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
-
Constructor Summary
Constructors Constructor Description FST(DataInput metaIn, DataInput in, Outputs<T> outputs)Load a previously saved FST.FST(DataInput metaIn, DataInput in, Outputs<T> outputs, FSTStore fstStore)Load a previously saved FST; maxBlockBits allows you to control the size of the byte[] pages used to hold the FST bytes.FST(FST.INPUT_TYPE inputType, Outputs<T> outputs, int bytesPageBits)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description (package private) longaddNode(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn)FST.Arc<T>findTargetArc(int labelToMatch, FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in)Finds an arc leaving the incoming arc, replacing the arc in place.(package private) voidfinish(long newStartNode)private static booleanflag(int flags, int bit)FST.BytesReadergetBytesReader()Returns aFST.BytesReaderfor this FST, positioned at position 0.TgetEmptyOutput()FST.Arc<T>getFirstArc(FST.Arc<T> arc)Fills virtual 'start' arc, ie, an empty incoming arc to the FST's start nodeprivate static intgetNumPresenceBytes(int labelRange)Gets the number of bytes required to flag the presence of each arc in the given label range, one bit per arc.(package private) booleanisExpandedTarget(FST.Arc<T> follow, FST.BytesReader in)Returns whetherarc's target points to a node in expanded format (fixed length arcs).longramBytesUsed()Return the memory usage of this object in bytes.static <T> FST<T>read(java.nio.file.Path path, Outputs<T> outputs)Reads an automaton from a file.private FST.Arc<T>readArc(FST.Arc<T> arc, FST.BytesReader in)Reads an arc.FST.Arc<T>readArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex)Reads a present direct addressing node arc, with the provided index in the label range.private FST.Arc<T>readArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex, int presenceIndex)Reads a present direct addressing node arc, with the provided index in the label range and its corresponding presence index (which is the count of presence bits before it).FST.Arc<T>readArcByIndex(FST.Arc<T> arc, FST.BytesReader in, int idx)(package private) static <T> FST.Arc<T>readEndArc(FST.Arc<T> follow, FST.Arc<T> arc)FST.Arc<T>readFirstRealTargetArc(long nodeAddress, FST.Arc<T> arc, FST.BytesReader in)FST.Arc<T>readFirstTargetArc(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in)Follow thefollowarc and read the first arc of its target; this changes the providedarc(2nd arg) in-place and returns it.intreadLabel(DataInput in)Reads one BYTE1/2/4 label from the providedDataInput.FST.Arc<T>readLastArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in)Reads the last arc of a direct addressing node.(package private) FST.Arc<T>readLastTargetArc(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in)Follows thefollowarc and reads the last arc of its target; this changes the providedarc(2nd arg) in-place and returns it.FST.Arc<T>readNextArc(FST.Arc<T> arc, FST.BytesReader in)In-place read; returns the arc.(package private) intreadNextArcLabel(FST.Arc<T> arc, FST.BytesReader in)Peeks at next arc's label; does not alter arc.FST.Arc<T>readNextRealArc(FST.Arc<T> arc, FST.BytesReader in)Never returns null, but you should never call this if arc.isLast() is true.private voidreadPresenceBytes(FST.Arc<T> arc, FST.BytesReader in)Reads the presence bits of a direct-addressing node.private longreadUnpackedNodeTarget(FST.BytesReader in)voidsave(java.nio.file.Path path)Writes an automaton to a file.voidsave(DataOutput metaOut, DataOutput out)private voidseekToNextNode(FST.BytesReader in)(package private) voidsetEmptyOutput(T v)private booleanshouldExpandNodeWithDirectAddressing(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange)Returns whether the given node should be expanded with direct addressing instead of binary search.private booleanshouldExpandNodeWithFixedLengthArcs(Builder<T> builder, Builder.UnCompiledNode<T> node)Returns whether the given node should be expanded with fixed length arcs.static <T> booleantargetHasArcs(FST.Arc<T> arc)returns true if the node at this address has any outgoing arcsjava.lang.StringtoString()private voidwriteLabel(DataOutput out, int v)private voidwriteNodeForBinarySearch(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArc)private voidwriteNodeForDirectAddressing(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArcWithoutLabel, int labelRange)private voidwritePresenceBits(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, long dest, int numPresenceBytes)-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface org.apache.lucene.util.Accountable
getChildResources
-
-
-
-
Field Detail
-
BASE_RAM_BYTES_USED
private static final long BASE_RAM_BYTES_USED
-
ARC_SHALLOW_RAM_BYTES_USED
private static final long ARC_SHALLOW_RAM_BYTES_USED
-
BIT_FINAL_ARC
private static final int BIT_FINAL_ARC
- See Also:
- Constant Field Values
-
BIT_LAST_ARC
static final int BIT_LAST_ARC
- See Also:
- Constant Field Values
-
BIT_TARGET_NEXT
static final int BIT_TARGET_NEXT
- See Also:
- Constant Field Values
-
BIT_STOP_NODE
private static final int BIT_STOP_NODE
- See Also:
- Constant Field Values
-
BIT_ARC_HAS_OUTPUT
public static final int BIT_ARC_HAS_OUTPUT
This flag is set if the arc has an output.- See Also:
- Constant Field Values
-
BIT_ARC_HAS_FINAL_OUTPUT
private static final int BIT_ARC_HAS_FINAL_OUTPUT
- See Also:
- Constant Field Values
-
ARCS_FOR_BINARY_SEARCH
public static final byte ARCS_FOR_BINARY_SEARCH
Value of the arc flags to declare a node with fixed length arcs designed for binary search.- See Also:
- Constant Field Values
-
ARCS_FOR_DIRECT_ADDRESSING
static final byte ARCS_FOR_DIRECT_ADDRESSING
Value of the arc flags to declare a node with fixed length arcs and bit table designed for direct addressing.- See Also:
- Constant Field Values
-
FIXED_LENGTH_ARC_SHALLOW_DEPTH
static final int FIXED_LENGTH_ARC_SHALLOW_DEPTH
-
FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS
static final int FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS
-
FIXED_LENGTH_ARC_DEEP_NUM_ARCS
static final int FIXED_LENGTH_ARC_DEEP_NUM_ARCS
-
DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR
private static final float DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR
Maximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing. This factor prevents expansions that are obviously too costly even if there are sufficient credits.
-
FILE_FORMAT_NAME
private static final java.lang.String FILE_FORMAT_NAME
- See Also:
- Constant Field Values
-
VERSION_START
private static final int VERSION_START
- See Also:
- Constant Field Values
-
VERSION_CURRENT
private static final int VERSION_CURRENT
- See Also:
- Constant Field Values
-
FINAL_END_NODE
private static final long FINAL_END_NODE
- See Also:
- Constant Field Values
-
NON_FINAL_END_NODE
private static final long NON_FINAL_END_NODE
- See Also:
- Constant Field Values
-
END_LABEL
public static final int END_LABEL
If arc has this label then that arc is final/accepted- See Also:
- Constant Field Values
-
inputType
final FST.INPUT_TYPE inputType
-
emptyOutput
T emptyOutput
-
bytes
final BytesStore bytes
ABytesStore, used during building, or during reading when the FST is very large (more than 1 GB). If the FST is less than 1 GB then bytesArray is set instead.
-
fstStore
private final FSTStore fstStore
-
startNode
private long startNode
-
DEFAULT_MAX_BLOCK_BITS
private static final int DEFAULT_MAX_BLOCK_BITS
-
-
Method Detail
-
flag
private static boolean flag(int flags, int bit)
-
ramBytesUsed
public long ramBytesUsed()
Description copied from interface:AccountableReturn the memory usage of this object in bytes. Negative values are illegal.- Specified by:
ramBytesUsedin interfaceAccountable
-
toString
public java.lang.String toString()
- Overrides:
toStringin classjava.lang.Object
-
finish
void finish(long newStartNode) throws java.io.IOException- Throws:
java.io.IOException
-
getEmptyOutput
public T getEmptyOutput()
-
setEmptyOutput
void setEmptyOutput(T v)
-
save
public void save(DataOutput metaOut, DataOutput out) throws java.io.IOException
- Throws:
java.io.IOException
-
save
public void save(java.nio.file.Path path) throws java.io.IOExceptionWrites an automaton to a file.- Throws:
java.io.IOException
-
read
public static <T> FST<T> read(java.nio.file.Path path, Outputs<T> outputs) throws java.io.IOException
Reads an automaton from a file.- Throws:
java.io.IOException
-
writeLabel
private void writeLabel(DataOutput out, int v) throws java.io.IOException
- Throws:
java.io.IOException
-
readLabel
public int readLabel(DataInput in) throws java.io.IOException
Reads one BYTE1/2/4 label from the providedDataInput.- Throws:
java.io.IOException
-
targetHasArcs
public static <T> boolean targetHasArcs(FST.Arc<T> arc)
returns true if the node at this address has any outgoing arcs
-
addNode
long addNode(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn) throws java.io.IOException
- Throws:
java.io.IOException
-
shouldExpandNodeWithFixedLengthArcs
private boolean shouldExpandNodeWithFixedLengthArcs(Builder<T> builder, Builder.UnCompiledNode<T> node)
Returns whether the given node should be expanded with fixed length arcs. Nodes will be expanded depending on their depth (distance from the root node) and their number of arcs.Nodes with fixed length arcs use more space, because they encode all arcs with a fixed number of bytes, but they allow either binary search or direct addressing on the arcs (instead of linear scan) on lookup by arc label.
-
shouldExpandNodeWithDirectAddressing
private boolean shouldExpandNodeWithDirectAddressing(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange)
Returns whether the given node should be expanded with direct addressing instead of binary search.Prefer direct addressing for performance if it does not oversize binary search byte size too much, so that the arcs can be directly addressed by label.
-
writeNodeForBinarySearch
private void writeNodeForBinarySearch(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArc)
-
writeNodeForDirectAddressing
private void writeNodeForDirectAddressing(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArcWithoutLabel, int labelRange)
-
writePresenceBits
private void writePresenceBits(Builder<T> builder, Builder.UnCompiledNode<T> nodeIn, long dest, int numPresenceBytes)
-
getNumPresenceBytes
private static int getNumPresenceBytes(int labelRange)
Gets the number of bytes required to flag the presence of each arc in the given label range, one bit per arc.
-
readPresenceBytes
private void readPresenceBytes(FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Reads the presence bits of a direct-addressing node. Actually we don't read them here, we just keep the pointer to the bit-table start and we skip them.- Throws:
java.io.IOException
-
getFirstArc
public FST.Arc<T> getFirstArc(FST.Arc<T> arc)
Fills virtual 'start' arc, ie, an empty incoming arc to the FST's start node
-
readLastTargetArc
FST.Arc<T> readLastTargetArc(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Follows thefollowarc and reads the last arc of its target; this changes the providedarc(2nd arg) in-place and returns it.- Returns:
- Returns the second argument
(
arc). - Throws:
java.io.IOException
-
readUnpackedNodeTarget
private long readUnpackedNodeTarget(FST.BytesReader in) throws java.io.IOException
- Throws:
java.io.IOException
-
readFirstTargetArc
public FST.Arc<T> readFirstTargetArc(FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Follow thefollowarc and read the first arc of its target; this changes the providedarc(2nd arg) in-place and returns it.- Returns:
- Returns the second argument (
arc). - Throws:
java.io.IOException
-
readFirstRealTargetArc
public FST.Arc<T> readFirstRealTargetArc(long nodeAddress, FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
- Throws:
java.io.IOException
-
isExpandedTarget
boolean isExpandedTarget(FST.Arc<T> follow, FST.BytesReader in) throws java.io.IOException
Returns whetherarc's target points to a node in expanded format (fixed length arcs).- Throws:
java.io.IOException
-
readNextArc
public FST.Arc<T> readNextArc(FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
In-place read; returns the arc.- Throws:
java.io.IOException
-
readNextArcLabel
int readNextArcLabel(FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Peeks at next arc's label; does not alter arc. Do not call this if arc.isLast()!- Throws:
java.io.IOException
-
readArcByIndex
public FST.Arc<T> readArcByIndex(FST.Arc<T> arc, FST.BytesReader in, int idx) throws java.io.IOException
- Throws:
java.io.IOException
-
readArcByDirectAddressing
public FST.Arc<T> readArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex) throws java.io.IOException
Reads a present direct addressing node arc, with the provided index in the label range.- Parameters:
rangeIndex- The index of the arc in the label range. It must be present. The real arc offset is computed based on the presence bits of the direct addressing node.- Throws:
java.io.IOException
-
readArcByDirectAddressing
private FST.Arc<T> readArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in, int rangeIndex, int presenceIndex) throws java.io.IOException
Reads a present direct addressing node arc, with the provided index in the label range and its corresponding presence index (which is the count of presence bits before it).- Throws:
java.io.IOException
-
readLastArcByDirectAddressing
public FST.Arc<T> readLastArcByDirectAddressing(FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Reads the last arc of a direct addressing node. This method is equivalent to callreadArcByDirectAddressing(Arc, BytesReader, int)withrangeIndexequal toarc.numArcs() - 1, but it is faster.- Throws:
java.io.IOException
-
readNextRealArc
public FST.Arc<T> readNextRealArc(FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Never returns null, but you should never call this if arc.isLast() is true.- Throws:
java.io.IOException
-
readArc
private FST.Arc<T> readArc(FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Reads an arc.
Precondition: The arc flags byte has already been read and set; the given BytesReader is positioned just after the arc flags byte.- Throws:
java.io.IOException
-
findTargetArc
public FST.Arc<T> findTargetArc(int labelToMatch, FST.Arc<T> follow, FST.Arc<T> arc, FST.BytesReader in) throws java.io.IOException
Finds an arc leaving the incoming arc, replacing the arc in place. This returns null if the arc was not found, else the incoming arc.- Throws:
java.io.IOException
-
seekToNextNode
private void seekToNextNode(FST.BytesReader in) throws java.io.IOException
- Throws:
java.io.IOException
-
getBytesReader
public FST.BytesReader getBytesReader()
Returns aFST.BytesReaderfor this FST, positioned at position 0.
-
-