Package org.apache.lucene.classification
Class BM25NBClassifier
java.lang.Object
    org.apache.lucene.classification.BM25NBClassifier
All Implemented Interfaces:
Classifier<BytesRef>
public class BM25NBClassifier extends java.lang.Object implements Classifier<BytesRef>
A classifier that approximates a naive Bayes classifier by using pure queries on BM25.
-
Field Summary
Fields
Modifier and Type | Field | Description
private Analyzer | analyzer | Analyzer to be used for tokenizing unseen input text
private java.lang.String | classFieldName | name of the field to be used as a class / category output
private IndexReader | indexReader | IndexReader used to access the Classifier's index
private IndexSearcher | indexSearcher | IndexSearcher to run searches on the index for retrieving frequencies
private Query | query | Query used to optionally filter the document set used for classification
private java.lang.String[] | textFieldNames | names of the fields to be used as input text
-
Constructor Summary
Constructors
Constructor | Description
BM25NBClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames) | Creates a new NaiveBayes classifier.
-
Method Summary
Modifier and Type | Method | Description
ClassificationResult<BytesRef> | assignClass(java.lang.String inputDocument) | Assign a class (with score) to the given text String
private java.util.List<ClassificationResult<BytesRef>> | assignClassNormalizedList(java.lang.String inputDocument) | Calculate probabilities for all classes for a given input text
private double | calculateLogLikelihood(java.lang.String[] tokens, Term term) |
private double | calculateLogPrior(Term term) |
java.util.List<ClassificationResult<BytesRef>> | getClasses(java.lang.String text) | Get all the classes (sorted by score, descending) assigned to the given text String.
java.util.List<ClassificationResult<BytesRef>> | getClasses(java.lang.String text, int max) | Get the first max classes (sorted by score, descending) assigned to the given text String.
private double | getTermProbForClass(Term classTerm, java.lang.String... words) |
private java.util.ArrayList<ClassificationResult<BytesRef>> | normClassificationResults(java.util.List<ClassificationResult<BytesRef>> assignedClasses) | Normalize the classification results based on the max score available
private java.lang.String[] | tokenize(java.lang.String text) | Tokenize a String on this classifier's text fields and analyzer
-
Field Detail
-
indexReader
private final IndexReader indexReader
IndexReader used to access the Classifier's index
-
textFieldNames
private final java.lang.String[] textFieldNames
names of the fields to be used as input text
-
classFieldName
private final java.lang.String classFieldName
name of the field to be used as a class / category output
-
indexSearcher
private final IndexSearcher indexSearcher
IndexSearcher to run searches on the index for retrieving frequencies
-
Constructor Detail
-
BM25NBClassifier
public BM25NBClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames)
Creates a new NaiveBayes classifier.
Parameters:
indexReader - the reader on the index to be used for classification
analyzer - an Analyzer used to analyze unseen text
query - a Query to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
classFieldName - the name of the field used as the output for the classifier. NOTE: must not be heavily analyzed, as the returned class will be a token indexed for this field
textFieldNames - the names of the fields used as the inputs for the classifier. Per-field boosting is NOT supported
-
Method Detail
-
assignClass
public ClassificationResult<BytesRef> assignClass(java.lang.String inputDocument) throws java.io.IOException
Description copied from interface: Classifier
Assign a class (with score) to the given text String
Specified by:
assignClass in interface Classifier<BytesRef>
Parameters:
inputDocument - a String containing text to be classified
Returns:
a ClassificationResult holding assigned class of type T and score
Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text) throws java.io.IOException
Description copied from interface: Classifier
Get all the classes (sorted by score, descending) assigned to the given text String.
Specified by:
getClasses in interface Classifier<BytesRef>
Parameters:
text - a String containing text to be classified
Returns:
the whole list of ClassificationResult, the classes and scores. Returns null if the classifier can't make lists.
Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text, int max) throws java.io.IOException
Description copied from interface: Classifier
Get the first max classes (sorted by score, descending) assigned to the given text String.
Specified by:
getClasses in interface Classifier<BytesRef>
Parameters:
text - a String containing text to be classified
max - the maximum number of elements in the returned list
Returns:
the list of ClassificationResult (classes and scores), cut to at most max elements. Returns null if the classifier can't make lists.
Throws:
java.io.IOException - If there is a low-level I/O error.
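The "sorted by score, descending" and "first max" contract described above can be illustrated without Lucene. The following is only a sketch under stated assumptions: `TopClassesSketch`, `Scored`, and `top` are hypothetical names that stand in for a list of classification results, and nothing here is taken from the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not Lucene code: sort results by score
// (highest first) and keep at most `max` of them.
class TopClassesSketch {

    // Minimal stand-in for a (class label, score) pair.
    static final class Scored {
        final String label;
        final double score;

        Scored(String label, double score) {
            this.label = label;
            this.score = score;
        }
    }

    static List<Scored> top(List<Scored> results, int max) {
        // Copy first so the caller's list is left untouched.
        List<Scored> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        // Clamp the limit so a negative or oversized max cannot throw.
        int limit = Math.max(0, Math.min(max, sorted.size()));
        return new ArrayList<>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Scored> all = Arrays.asList(
                new Scored("tech", 0.30),
                new Scored("sports", 0.65),
                new Scored("politics", 0.05));
        for (Scored s : top(all, 2)) {
            System.out.println(s.label + " " + s.score);
        }
    }
}
```

Copying before sorting is a design choice of this sketch: it keeps the input list in its original order for the caller.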
-
assignClassNormalizedList
private java.util.List<ClassificationResult<BytesRef>> assignClassNormalizedList(java.lang.String inputDocument) throws java.io.IOException
Calculate probabilities for all classes for a given input text
Parameters:
inputDocument - the input text as a String
Returns:
a List of ClassificationResult, one for each existing class
Throws:
java.io.IOException - if assigning probabilities fails
-
normClassificationResults
private java.util.ArrayList<ClassificationResult<BytesRef>> normClassificationResults(java.util.List<ClassificationResult<BytesRef>> assignedClasses)
Normalize the classification results based on the max score available
Parameters:
assignedClasses - the list of assigned classes
Returns:
the normalized results
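This page does not say how the max score is used in the normalization. One common approach, shown here as a sketch, assumes the scores are log-domain values (the `calculateLogPrior` and `calculateLogLikelihood` helper names suggest this, but the page does not confirm it): subtract the max before exponentiating so that very negative scores do not underflow to zero. `ScoreNormalizationSketch` and `normalize` are hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Lucene code: turns log-domain scores into
// values in [0, 1] that sum to 1, anchored on the max score for stability.
class ScoreNormalizationSketch {

    static List<Double> normalize(List<Double> logScores) {
        List<Double> normalized = new ArrayList<>();
        if (logScores.isEmpty()) {
            return normalized;
        }
        // The largest (least negative) log score anchors the computation.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) {
            max = Math.max(max, s);
        }
        // log(sum(exp(s))) = max + log(sum(exp(s - max))); each exponent is <= 0.
        double sum = 0.0;
        for (double s : logScores) {
            sum += Math.exp(s - max);
        }
        double logTotal = max + Math.log(sum);
        // exp(s - logTotal) equals exp(s) / sum(exp(all scores)).
        for (double s : logScores) {
            normalized.add(Math.exp(s - logTotal));
        }
        return normalized;
    }

    public static void main(String[] args) {
        // exp(-1050) underflows to 0.0 on its own; the shifted form does not.
        System.out.println(normalize(Arrays.asList(-1050.0, -1052.0, -1060.0)));
    }
}
```

The input order is preserved, so the result at index i corresponds to the score at index i.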
-
tokenize
private java.lang.String[] tokenize(java.lang.String text) throws java.io.IOException
Tokenize a String on this classifier's text fields and analyzer
Parameters:
text - the String representing an input text (to be classified)
Returns:
a String array of the resulting tokens
Throws:
java.io.IOException - if tokenization fails
-
calculateLogLikelihood
private double calculateLogLikelihood(java.lang.String[] tokens, Term term) throws java.io.IOException
Throws:
java.io.IOException
-
getTermProbForClass
private double getTermProbForClass(Term classTerm, java.lang.String... words) throws java.io.IOException
Throws:
java.io.IOException
-
calculateLogPrior
private double calculateLogPrior(Term term) throws java.io.IOException
Throws:
java.io.IOException
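The three private helpers above carry no descriptions. Their names suggest the usual naive Bayes decomposition, in which a class score is the log prior plus the sum of per-token log likelihoods. The sketch below shows that decomposition on toy counts. It is an assumption-laden illustration, not the Lucene implementation: it uses plain term counts with add-one smoothing rather than BM25 queries, and `NaiveBayesLogScoreSketch` and all of its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not Lucene code: classic naive Bayes scoring in log space.
class NaiveBayesLogScoreSketch {

    // log P(class), estimated from document counts.
    static double logPrior(int docsInClass, int totalDocs) {
        return Math.log((double) docsInClass / totalDocs);
    }

    // Sum over tokens of log P(token | class), with add-one (Laplace) smoothing
    // so that unseen tokens do not produce log(0).
    static double logLikelihood(List<String> tokens, Map<String, Integer> termCounts,
                                int totalTerms, int vocabularySize) {
        double sum = 0.0;
        for (String token : tokens) {
            int count = termCounts.getOrDefault(token, 0);
            sum += Math.log((count + 1.0) / (totalTerms + vocabularySize));
        }
        return sum;
    }

    // Class score = log prior + log likelihood; higher (closer to 0) is better.
    static double score(List<String> tokens, int docsInClass, int totalDocs,
                        Map<String, Integer> termCounts, int totalTerms, int vocabularySize) {
        return logPrior(docsInClass, totalDocs)
                + logLikelihood(tokens, termCounts, totalTerms, vocabularySize);
    }

    public static void main(String[] args) {
        Map<String, Integer> sports = new HashMap<>();
        sports.put("goal", 4);
        sports.put("match", 3);
        sports.put("team", 3);
        Map<String, Integer> tech = new HashMap<>();
        tech.put("code", 5);
        tech.put("bug", 3);
        tech.put("team", 2);

        List<String> tokens = Arrays.asList("goal", "match");
        // Toy corpus: 5 documents (3 sports, 2 tech), 10 terms per class, 5 distinct terms.
        System.out.println("sports: " + score(tokens, 3, 5, sports, 10, 5));
        System.out.println("tech:   " + score(tokens, 2, 5, tech, 10, 5));
    }
}
```

Working in log space turns a product of many small probabilities into a sum, which avoids floating-point underflow on long inputs.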
-