Package org.apache.lucene.codecs.memory
Class FSTTermsWriter
- java.lang.Object
  - org.apache.lucene.codecs.FieldsConsumer
    - org.apache.lucene.codecs.memory.FSTTermsWriter
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public class FSTTermsWriter extends FieldsConsumer
FST-based term dictionary, using metadata as the FST output. The FST directly holds the mapping <term, metadata>. Term metadata consists of three parts:
1. term statistics: docFreq, totalTermFreq;
2. a monotonic long[], e.g. the pointer to the postings list for that term;
3. a generic byte[], e.g. other information needed by the postings reader.
File:
- .tst: Term Dictionary
Term Dictionary
The .tst file contains a list of FSTs, one for each field. Each FST maps a term to its corresponding statistics (e.g. docFreq) and metadata (e.g. information for the postings list reader, such as the file pointer to the postings list).
Typically the metadata is separated into two parts:
- Monotonic long array: some metadata always ascends in term order. The FST uses this part to share outputs between arcs.
- Generic byte array: Used to store non-monotonic metadata.
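To make the idea of sharing monotonic outputs between arcs concrete, here is a minimal sketch. It is a simplified illustration, not Lucene's FST output implementation: the class name, the element-wise minimum rule, and the sample pointer values are all assumptions chosen for clarity.

```java
import java.util.Arrays;

/** Illustrative only: a simplified model of factoring shared long[] outputs. */
final class MonotonicOutputSketch {

  /** Element-wise minimum: the part of two outputs that could sit on a shared arc. */
  static long[] common(long[] a, long[] b) {
    long[] shared = new long[a.length];
    for (int i = 0; i < a.length; i++) {
      shared[i] = Math.min(a[i], b[i]);
    }
    return shared;
  }

  /** What remains for a deeper arc once the shared part is factored out. */
  static long[] subtract(long[] full, long[] shared) {
    long[] rest = new long[full.length];
    for (int i = 0; i < full.length; i++) {
      rest[i] = full[i] - shared[i];
    }
    return rest;
  }

  /** Accumulating outputs along a path restores the original value. */
  static long[] add(long[] prefix, long[] rest) {
    long[] sum = new long[prefix.length];
    for (int i = 0; i < prefix.length; i++) {
      sum[i] = prefix[i] + rest[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    long[] first = {100, 7};   // hypothetical pointers for an earlier term
    long[] second = {140, 9};  // hypothetical pointers for a later term
    long[] shared = common(first, second);
    System.out.println(Arrays.toString(shared));                   // [100, 7]
    System.out.println(Arrays.toString(subtract(second, shared))); // [40, 2]
  }
}
```

Because the values only grow with term order, the shared part is never larger than either output, so the remainders stay non-negative.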
- TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
- FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
- TermFST --> FST<TermData>
- TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?, <DocFreq[Same?], (TotalTermFreq-DocFreq)>?
- Header --> IndexHeader
- DirOffset --> Uint64
- DocFreq, LongsSize, BytesSize, NumFields, FieldNumber, DocCount --> VInt
- TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta --> VLong
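VInt and VLong in the grammar above are variable-length integer encodings. The sketch below shows the general scheme, assuming the usual layout of seven payload bits per byte, low-order group first, with the high bit set when more bytes follow. The class and method names are hypothetical; real code would go through the IndexOutput methods rather than a hand-rolled encoder.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative variable-length encoding of non-negative longs. */
final class VarIntSketch {

  /** Encodes value using 7 bits per byte; the high bit marks "more bytes follow". */
  static byte[] encodeVLong(long value) {
    if (value < 0) {
      throw new IllegalArgumentException("negative values are not supported in this sketch");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits plus continuation bit
      value >>>= 7;
    }
    out.write((int) value); // final byte, continuation bit clear
    return out.toByteArray();
  }

  /** Decodes a single value from the start of bytes. */
  static long decodeVLong(byte[] bytes) {
    long value = 0;
    int shift = 0;
    for (byte b : bytes) {
      value |= (long) (b & 0x7F) << shift;
      shift += 7;
      if ((b & 0x80) == 0) {
        break; // last byte of this value
      }
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(encodeVLong(127).length); // 1
    System.out.println(encodeVLong(128).length); // 2
    System.out.println(decodeVLong(encodeVLong(16384)));
  }
}
```

Small values such as DocFreq for rare terms therefore cost a single byte, which is why the grammar uses these types for most counters.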
Notes:
- The formats of PostingsHeader and the generic meta bytes are defined by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (non-monotonic data such as pulsed postings).
- The layout of TermData is determined by the FST: typically, monotonic metadata is dense around shallow arcs, while deeper arcs carry only generic bytes and term statistics.
- The Flag byte indicates which parts of the metadata exist on the current arc. In particular, the monotonic part is omitted when it is an array of 0s.
- Since LongsSize is per-field fixed, it is only written once in field summary.
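The role of the Flag byte described in the notes above can be sketched as follows. The bit positions and names (HAS_LONGS, HAS_BYTES, HAS_STATS) are assumptions made for illustration; the page does not specify the actual bit layout, so treat this as one plausible arrangement rather than the on-disk format.

```java
/** Illustrative flag computation; bit assignments are assumed, not taken from the format. */
final class TermDataFlagSketch {
  static final int HAS_LONGS = 1;      // assumed: monotonic long[] is present (not all zeros)
  static final int HAS_BYTES = 1 << 1; // assumed: generic byte[] is present
  static final int HAS_STATS = 1 << 2; // assumed: term statistics are present

  static boolean allZero(long[] longs) {
    for (long v : longs) {
      if (v != 0) {
        return false;
      }
    }
    return true;
  }

  /** Builds a flag describing which metadata parts an arc carries. */
  static int flagFor(long[] longs, byte[] bytes, int docFreq) {
    int flag = 0;
    if (!allZero(longs)) {
      flag |= HAS_LONGS; // an all-zero monotonic part is omitted entirely
    }
    if (bytes != null && bytes.length > 0) {
      flag |= HAS_BYTES;
    }
    if (docFreq > 0) {
      flag |= HAS_STATS;
    }
    return flag;
  }

  public static void main(String[] args) {
    System.out.println(flagFor(new long[] {0, 0}, null, 0));           // 0
    System.out.println(flagFor(new long[] {5, 0}, new byte[] {1}, 3)); // 7
  }
}
```

A reader would test the same bits before attempting to read each optional part, which is what allows shallow and deep arcs to carry different subsets of the metadata.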
Nested Class Summary
Nested Classes (Modifier and Type, Class)
- private static class FSTTermsWriter.FieldMetaData
- (package private) class FSTTermsWriter.TermsWriter
Field Summary
Fields (Modifier and Type, Field)
- (package private) FieldInfos fieldInfos
- (package private) java.util.List<FSTTermsWriter.FieldMetaData> fields
- (package private) int maxDoc
- (package private) IndexOutput out
- (package private) PostingsWriterBase postingsWriter
- (package private) static java.lang.String TERMS_CODEC_NAME
- (package private) static java.lang.String TERMS_EXTENSION
- static int TERMS_VERSION_CURRENT
- static int TERMS_VERSION_START
Constructor Summary
Constructors
- FSTTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter)
Method Summary
All Methods (Modifier and Type, Method, Description)
- void close()
- void write(Fields fields, NormsProducer norms): Write all fields, terms and postings.
- private void writeTrailer(IndexOutput out, long dirStart)
Methods inherited from class org.apache.lucene.codecs.FieldsConsumer
merge
Field Detail
-
TERMS_EXTENSION
static final java.lang.String TERMS_EXTENSION
- See Also:
- Constant Field Values
-
TERMS_CODEC_NAME
static final java.lang.String TERMS_CODEC_NAME
- See Also:
- Constant Field Values
-
TERMS_VERSION_START
public static final int TERMS_VERSION_START
- See Also:
- Constant Field Values
-
TERMS_VERSION_CURRENT
public static final int TERMS_VERSION_CURRENT
- See Also:
- Constant Field Values
-
postingsWriter
final PostingsWriterBase postingsWriter
-
fieldInfos
final FieldInfos fieldInfos
-
out
IndexOutput out
-
maxDoc
final int maxDoc
-
fields
final java.util.List<FSTTermsWriter.FieldMetaData> fields
-
-
Constructor Detail
-
FSTTermsWriter
public FSTTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter) throws java.io.IOException
- Throws:
java.io.IOException
-
-
Method Detail
-
writeTrailer
private void writeTrailer(IndexOutput out, long dirStart) throws java.io.IOException
- Throws:
java.io.IOException
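The signature above, together with the DirOffset entry in the file format, suggests a trailer that records where the field directory begins. The sketch below illustrates that general pattern with plain java.io classes. It assumes DirOffset is an 8-byte big-endian value at the very end of the data and that it points at the start of FieldSummary; any footer or checksum the real file may carry is deliberately left out, and all names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Illustrative "directory offset in the trailer" layout; not the actual .tst writer. */
final class TrailerSketch {

  /** Writes headers, then the field summary, then the offset at which the summary started. */
  static byte[] writeFile(byte[] headers, byte[] fieldSummary) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.write(headers);
    long dirStart = out.size(); // position where the field summary begins
    out.write(fieldSummary);
    out.writeLong(dirStart);    // trailer: 8-byte offset, big-endian
    out.flush();
    return bytes.toByteArray();
  }

  /** Reads the offset back from the last 8 bytes, as a reader would before seeking. */
  static long readDirStart(byte[] file) {
    return ByteBuffer.wrap(file, file.length - Long.BYTES, Long.BYTES).getLong();
  }

  public static void main(String[] args) throws IOException {
    byte[] file = writeFile(new byte[] {1, 2, 3}, new byte[] {9, 9});
    System.out.println(readDirStart(file)); // 3
  }
}
```

Placing the offset last lets the writer stream everything in one pass and still tell the reader where to find the per-field directory.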
-
write
public void write(Fields fields, NormsProducer norms) throws java.io.IOException
Description copied from class: FieldsConsumer
Write all fields, terms and postings. This is the "pull" API, allowing you to iterate more than once over the postings, somewhat analogous to using a DOM API to traverse an XML tree.
Notes:
- You must compute index statistics, including each term's docFreq and totalTermFreq, as well as the summary sumTotalTermFreq, sumDocFreq and docCount.
- You must skip terms that have no docs and fields that have no terms, even though the provided Fields API will expose them; this typically requires lazily writing the field or term until you've actually seen the first term or document.
- The provided Fields instance is limited: you cannot call any methods that return statistics/counts; you cannot pass a non-null live docs when pulling docs/positions enums.
- Specified by:
write in class FieldsConsumer
- Throws:
java.io.IOException
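The statistics that the notes for write require (docFreq, totalTermFreq, sumTotalTermFreq, sumDocFreq, docCount) and the rule about skipping terms without documents can be illustrated with a small self-contained sketch. It uses a hypothetical in-memory postings map (term to docID to within-document frequency) in place of the Fields API, so none of the names below are Lucene's.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;

/** Illustrative per-field statistics over a hypothetical in-memory postings map. */
final class FieldStatsSketch {
  long numTerms;
  long sumTotalTermFreq;
  long sumDocFreq;
  int docCount;

  /** postings: term -> (docID -> frequency of the term in that document). */
  static FieldStatsSketch compute(SortedMap<String, SortedMap<Integer, Integer>> postings) {
    FieldStatsSketch stats = new FieldStatsSketch();
    BitSet docsSeen = new BitSet();
    for (Map.Entry<String, SortedMap<Integer, Integer>> term : postings.entrySet()) {
      int docFreq = 0;        // number of documents containing this term
      long totalTermFreq = 0; // total occurrences of this term
      for (Map.Entry<Integer, Integer> posting : term.getValue().entrySet()) {
        docFreq++;
        totalTermFreq += posting.getValue();
        docsSeen.set(posting.getKey());
      }
      if (docFreq == 0) {
        continue; // skip terms that have no docs
      }
      stats.numTerms++;
      stats.sumDocFreq += docFreq;
      stats.sumTotalTermFreq += totalTermFreq;
    }
    stats.docCount = docsSeen.cardinality(); // documents with at least one term in the field
    return stats;
  }
}
```

A field whose numTerms ends up as 0 would likewise be skipped entirely, which is why the per-term work happens before anything is committed for the field.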
-
close
public void close() throws java.io.IOException
- Specified by:
close in interface java.lang.AutoCloseable
- Specified by:
close in interface java.io.Closeable
- Specified by:
close in class FieldsConsumer
- Throws:
java.io.IOException
-
-