Class SloppyPhraseMatcher
- java.lang.Object
-
- org.apache.lucene.search.PhraseMatcher
-
- org.apache.lucene.search.SloppyPhraseMatcher
-
final class SloppyPhraseMatcher extends PhraseMatcher
Find all slop-valid position-combinations (matches) encountered while traversing/hopping the PhrasePositions.
The sloppy frequency contribution of a match depends on the distance:
- highest freq for distance=0 (exact match).
- freq gets lower as distance gets higher.
Example: for query "a b"~2, a document "x a b a y" can be matched twice: once for "a b" (distance=0), and once for "b a" (distance=2).
Possibly not all valid combinations are encountered, because for efficiency we always propagate the least PhrasePosition. This allows to base on PriorityQueue and move forward faster. As result, for example, document "a b c b a" would score differently for queries "a b c"~4 and "c b a"~4, although they really are equivalent. Similarly, for doc "a b c b a f g", query "c b"~2 would get same score as "g f"~2, although "c b"~2 could be matched twice. We may want to fix this in the future (currently not, for performance reasons).
-
-
Field Summary
Fields Modifier and Type Field Description private DocIdSetIteratorapproximationprivate booleancaptureLeadMatchprivate booleancheckedRptsprivate intendprivate booleanhasMultiTermRptsprivate booleanhasRptsprivate ImpactsDISIimpactsApproximationprivate intleadEndOffsetprivate intleadOffsetprivate intleadOrdprivate intleadPositionprivate intmatchLengthprivate intnumPostingsprivate PhrasePositions[]phrasePositionsprivate booleanpositionedprivate PhraseQueuepqprivate PhrasePositions[][]rptGroupsprivate PhrasePositions[]rptStackprivate intslop
-
Constructor Summary
Constructors Constructor Description SloppyPhraseMatcher(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch)
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description private booleanadvancePP(PhrasePositions pp)advance a PhrasePosition and update 'end', return false if exhaustedprivate booleanadvanceRepeatGroups()At initialization (each doc), each repetition group is sorted by (query) offset.private booleanadvanceRpts(PhrasePositions pp)pp was just advanced.(package private) DocIdSetIteratorapproximation()Approximation that only matches documents that have all terms.private voidcaptureLead(PhrasePositions pp)private intcollide(PhrasePositions pp)index of a pp2 colliding with pp, or -1 if noneintendOffset()The end offset of the current matchintendPosition()The end position of the current matchprivate voidfillQueue()Fill the queue (all pps are already placedprivate java.util.ArrayList<java.util.ArrayList<PhrasePositions>>gatherRptGroups(java.util.LinkedHashMap<Term,java.lang.Integer> rptTerms)Detect repetition groups.(package private) ImpactsDISIimpactsApproximation()Approximation that is aware of impacts.private booleaninitComplex()with repeats: not so simple.private booleaninitFirstTime()initialize with checking for repeats.private booleaninitPhrasePositions()Initialize PhrasePositions in place.private voidinitSimple()no repeats: simplest case, and most common.private PhrasePositionslesser(PhrasePositions pp, PhrasePositions pp2)compare two pps, but only by position and offset(package private) floatmaxFreq()An upper bound on the number of possible matches on this documentbooleannextMatch()Find the next match on the current document, returningfalseif there are none.private voidplaceFirstPositions()move all PPs to their first positionprivate java.util.ArrayList<FixedBitSet>ppTermsBitSets(PhrasePositions[] rpp, java.util.HashMap<Term,java.lang.Integer> tord)bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is setprivate PhrasePositions[]repeatingPPs(java.util.HashMap<Term,java.lang.Integer> rptTerms)find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRptsprivate java.util.LinkedHashMap<Term,java.lang.Integer>repeatingTerms()find repeating terms and assign them ordinal valuesvoidreset()Called afterPhraseMatcher.approximation()has been advanced(package private) floatsloppyWeight()The slop-adjusted weight of the current match The sum of the slop-adjusted weights is used as the freq for scoringprivate voidsortRptGroups(java.util.ArrayList<java.util.ArrayList<PhrasePositions>> rgs)sort each repetition group by (query) offset.intstartOffset()The start offset of the current matchintstartPosition()The start position of the current matchprivate java.util.HashMap<Term,java.lang.Integer>termGroups(java.util.LinkedHashMap<Term,java.lang.Integer> tord, java.util.ArrayList<FixedBitSet> bb)map each term to the single group that contains itprivate inttpPos(PhrasePositions pp)Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset)private voidunionTermGroups(java.util.ArrayList<FixedBitSet> bb)union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms-
Methods inherited from class org.apache.lucene.search.PhraseMatcher
getMatchCost
-
-
-
-
Field Detail
-
phrasePositions
private final PhrasePositions[] phrasePositions
-
slop
private final int slop
-
numPostings
private final int numPostings
-
pq
private final PhraseQueue pq
-
captureLeadMatch
private final boolean captureLeadMatch
-
approximation
private final DocIdSetIterator approximation
-
impactsApproximation
private final ImpactsDISI impactsApproximation
-
end
private int end
-
leadPosition
private int leadPosition
-
leadOffset
private int leadOffset
-
leadEndOffset
private int leadEndOffset
-
leadOrd
private int leadOrd
-
hasRpts
private boolean hasRpts
-
checkedRpts
private boolean checkedRpts
-
hasMultiTermRpts
private boolean hasMultiTermRpts
-
rptGroups
private PhrasePositions[][] rptGroups
-
rptStack
private PhrasePositions[] rptStack
-
positioned
private boolean positioned
-
matchLength
private int matchLength
-
-
Constructor Detail
-
SloppyPhraseMatcher
SloppyPhraseMatcher(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch)
-
-
Method Detail
-
approximation
DocIdSetIterator approximation()
Description copied from class:PhraseMatcherApproximation that only matches documents that have all terms.- Specified by:
approximationin classPhraseMatcher
-
impactsApproximation
ImpactsDISI impactsApproximation()
Description copied from class:PhraseMatcherApproximation that is aware of impacts.- Specified by:
impactsApproximationin classPhraseMatcher
-
maxFreq
float maxFreq() throws java.io.IOExceptionDescription copied from class:PhraseMatcherAn upper bound on the number of possible matches on this document- Specified by:
maxFreqin classPhraseMatcher- Throws:
java.io.IOException
-
reset
public void reset() throws java.io.IOExceptionDescription copied from class:PhraseMatcherCalled afterPhraseMatcher.approximation()has been advanced- Specified by:
resetin classPhraseMatcher- Throws:
java.io.IOException
-
sloppyWeight
float sloppyWeight()
Description copied from class:PhraseMatcherThe slop-adjusted weight of the current match The sum of the slop-adjusted weights is used as the freq for scoring- Specified by:
sloppyWeightin classPhraseMatcher
-
nextMatch
public boolean nextMatch() throws java.io.IOExceptionDescription copied from class:PhraseMatcherFind the next match on the current document, returningfalseif there are none.- Specified by:
nextMatchin classPhraseMatcher- Throws:
java.io.IOException
-
captureLead
private void captureLead(PhrasePositions pp) throws java.io.IOException
- Throws:
java.io.IOException
-
startPosition
public int startPosition()
Description copied from class:PhraseMatcherThe start position of the current match- Specified by:
startPositionin classPhraseMatcher
-
endPosition
public int endPosition()
Description copied from class:PhraseMatcherThe end position of the current match- Specified by:
endPositionin classPhraseMatcher
-
startOffset
public int startOffset() throws java.io.IOExceptionDescription copied from class:PhraseMatcherThe start offset of the current match- Specified by:
startOffsetin classPhraseMatcher- Throws:
java.io.IOException
-
endOffset
public int endOffset() throws java.io.IOExceptionDescription copied from class:PhraseMatcherThe end offset of the current match- Specified by:
endOffsetin classPhraseMatcher- Throws:
java.io.IOException
-
advancePP
private boolean advancePP(PhrasePositions pp) throws java.io.IOException
advance a PhrasePosition and update 'end', return false if exhausted- Throws:
java.io.IOException
-
advanceRpts
private boolean advanceRpts(PhrasePositions pp) throws java.io.IOException
pp was just advanced. If that caused a repeater collision, resolve by advancing the lesser of the two colliding pps. Note that there can only be one collision, as by the initialization there were no collisions before pp was advanced.- Throws:
java.io.IOException
-
lesser
private PhrasePositions lesser(PhrasePositions pp, PhrasePositions pp2)
compare two pps, but only by position and offset
-
collide
private int collide(PhrasePositions pp)
index of a pp2 colliding with pp, or -1 if none
-
initPhrasePositions
private boolean initPhrasePositions() throws java.io.IOExceptionInitialize PhrasePositions in place. A one time initialization for this scorer (on first doc matching all terms):- Check if there are repetitions
- If there are, find groups of repetitions.
- no repetitions: "ho my"~2
- repetitions: "ho my my"~2
- repetitions: "my ho my"~2
- Returns:
- false if PPs are exhausted (and so current doc will not be a match)
- Throws:
java.io.IOException
-
initSimple
private void initSimple() throws java.io.IOExceptionno repeats: simplest case, and most common. It is important to keep this piece of the code simple and efficient- Throws:
java.io.IOException
-
initComplex
private boolean initComplex() throws java.io.IOExceptionwith repeats: not so simple.- Throws:
java.io.IOException
-
placeFirstPositions
private void placeFirstPositions() throws java.io.IOExceptionmove all PPs to their first position- Throws:
java.io.IOException
-
fillQueue
private void fillQueue()
Fill the queue (all pps are already placed
-
advanceRepeatGroups
private boolean advanceRepeatGroups() throws java.io.IOExceptionAt initialization (each doc), each repetition group is sorted by (query) offset. This provides the start condition: no collisions.Case 1: no multi-term repeats
It is sufficient to advance each pp in the group by one less than its group index. So lesser pp is not advanced, 2nd one advance once, 3rd one advanced twice, etc.Case 2: multi-term repeats
- Returns:
- false if PPs are exhausted.
- Throws:
java.io.IOException
-
initFirstTime
private boolean initFirstTime() throws java.io.IOExceptioninitialize with checking for repeats. Heavy work, but done only for the first candidate doc.If there are repetitions, check if multi-term postings (MTP) are involved.
Without MTP, once PPs are placed in the first candidate doc, repeats (and groups) are visible.
With MTP, a more complex check is needed, up-front, as there may be "hidden collisions".
For example P1 has {A,B}, P1 has {B,C}, and the first doc is: "A C B". At start, P1 would point to "A", p2 to "C", and it will not be identified that P1 and P2 are repetitions of each other.The more complex initialization has two parts:
(1) identification of repetition groups.
(2) advancing repeat groups at the start of the doc.
For (1), a possible solution is to just create a single repetition group, made of all repeating pps. But this would slow down the check for collisions, as all pps would need to be checked. Instead, we compute "connected regions" on the bipartite graph of postings and terms.- Throws:
java.io.IOException
-
sortRptGroups
private void sortRptGroups(java.util.ArrayList<java.util.ArrayList<PhrasePositions>> rgs)
sort each repetition group by (query) offset. Done only once (at first doc) and allows to initialize faster for each doc.
-
gatherRptGroups
private java.util.ArrayList<java.util.ArrayList<PhrasePositions>> gatherRptGroups(java.util.LinkedHashMap<Term,java.lang.Integer> rptTerms) throws java.io.IOException
Detect repetition groups. Done once - for first doc- Throws:
java.io.IOException
-
tpPos
private final int tpPos(PhrasePositions pp)
Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset)
-
repeatingTerms
private java.util.LinkedHashMap<Term,java.lang.Integer> repeatingTerms()
find repeating terms and assign them ordinal values
-
repeatingPPs
private PhrasePositions[] repeatingPPs(java.util.HashMap<Term,java.lang.Integer> rptTerms)
find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRpts
-
ppTermsBitSets
private java.util.ArrayList<FixedBitSet> ppTermsBitSets(PhrasePositions[] rpp, java.util.HashMap<Term,java.lang.Integer> tord)
bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is set
-
unionTermGroups
private void unionTermGroups(java.util.ArrayList<FixedBitSet> bb)
union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms
-
termGroups
private java.util.HashMap<Term,java.lang.Integer> termGroups(java.util.LinkedHashMap<Term,java.lang.Integer> tord, java.util.ArrayList<FixedBitSet> bb) throws java.io.IOException
map each term to the single group that contains it- Throws:
java.io.IOException
-
-