Package org.apache.lucene.classification
Class SimpleNaiveBayesClassifier
java.lang.Object
    org.apache.lucene.classification.SimpleNaiveBayesClassifier
All Implemented Interfaces:
Classifier<BytesRef>
Direct Known Subclasses:
CachingNaiveBayesClassifier, SimpleNaiveBayesDocumentClassifier
public class SimpleNaiveBayesClassifier extends java.lang.Object implements Classifier<BytesRef>
A simplistic Lucene based NaiveBayes classifier, see http://en.wikipedia.org/wiki/Naive_Bayes_classifier
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected Analyzer | analyzer | Analyzer to be used for tokenizing unseen input text
protected java.lang.String | classFieldName | name of the field to be used as a class / category output
protected IndexReader | indexReader | IndexReader used to access the Classifier's index
protected IndexSearcher | indexSearcher | IndexSearcher to run searches on the index for retrieving frequencies
protected Query | query | Query used to optionally filter the document set to be used to classify
protected java.lang.String[] | textFieldNames | names of the fields to be used as input text
-
Constructor Summary
Constructors
Constructor | Description
SimpleNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames) | Creates a new NaiveBayes classifier.
-
Method Summary
Methods
Modifier and Type | Method | Description
ClassificationResult<BytesRef> | assignClass(java.lang.String inputDocument) | Assign a class (with score) to the given text String
protected java.util.List<ClassificationResult<BytesRef>> | assignClassNormalizedList(java.lang.String inputDocument) | Calculate probabilities for all classes for a given input text
private double | calculateLogLikelihood(java.lang.String[] tokenizedText, Term term, int docsWithClass) |
private double | calculateLogPrior(Term term, int docsWithClassSize) |
protected int | countDocsWithClass() | count the number of documents in the index having at least a value for the 'class' field
private int | docCount(Term term) |
java.util.List<ClassificationResult<BytesRef>> | getClasses(java.lang.String text) | Get all the classes (sorted by score, descending) assigned to the given text String.
java.util.List<ClassificationResult<BytesRef>> | getClasses(java.lang.String text, int max) | Get the first max classes (sorted by score, descending) assigned to the given text String.
private double | getTextTermFreqForClass(Term term) | Returns the average number of unique terms times the number of docs belonging to the input class
private int | getWordFreqForClass(java.lang.String word, Term term) | Returns the number of documents of the input class (from the whole index or from a subset) that contain the word (in a specific field, or in all the fields if none is selected)
protected java.util.ArrayList<ClassificationResult<BytesRef>> | normClassificationResults(java.util.List<ClassificationResult<BytesRef>> assignedClasses) | Normalize the classification results based on the max score available
protected java.lang.String[] | tokenize(java.lang.String text) | tokenize a String on this classifier's text fields and analyzer
-
-
-
Field Detail
-
indexReader
protected final IndexReader indexReader
IndexReader used to access the Classifier's index
-
textFieldNames
protected final java.lang.String[] textFieldNames
names of the fields to be used as input text
-
classFieldName
protected final java.lang.String classFieldName
name of the field to be used as a class / category output
-
indexSearcher
protected final IndexSearcher indexSearcher
IndexSearcher to run searches on the index for retrieving frequencies
-
-
Constructor Detail
-
SimpleNaiveBayesClassifier
public SimpleNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames)
Creates a new NaiveBayes classifier.
Parameters:
indexReader - the reader on the index to be used for classification
analyzer - an Analyzer used to analyze unseen text
query - a Query to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
classFieldName - the name of the field used as the output for the classifier. NOTE: must not be heavily analyzed, as the returned class will be a token indexed for this field
textFieldNames - the names of the fields used as the inputs for the classifier; per-field boosting is NOT supported
-
-
Method Detail
-
assignClass
public ClassificationResult<BytesRef> assignClass(java.lang.String inputDocument) throws java.io.IOException
Description copied from interface: Classifier
Assign a class (with score) to the given text String
Specified by:
assignClass in interface Classifier<BytesRef>
Parameters:
inputDocument - a String containing text to be classified
Returns:
a ClassificationResult holding assigned class of type T and score
Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text) throws java.io.IOException
Description copied from interface: Classifier
Get all the classes (sorted by score, descending) assigned to the given text String.
Specified by:
getClasses in interface Classifier<BytesRef>
Parameters:
text - a String containing text to be classified
Returns:
the whole list of ClassificationResult, the classes and scores. Returns null if the classifier can't make lists.
Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text, int max) throws java.io.IOException
Description copied from interface: Classifier
Get the first max classes (sorted by score, descending) assigned to the given text String.
Specified by:
getClasses in interface Classifier<BytesRef>
Parameters:
text - a String containing text to be classified
max - the maximum number of elements in the returned list
Returns:
the list of ClassificationResult (classes and scores), cut to at most max elements. Returns null if the classifier can't make lists.
Throws:
java.io.IOException - If there is a low-level I/O error.
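The "sorted by score, descending, cut to max" contract described above can be illustrated with a small, Lucene-free sketch. ScoredClass and TopClassesSketch are hypothetical names introduced only for this example; they stand in for ClassificationResult<BytesRef> and are not part of the Lucene API. Treating a negative max as zero is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ClassificationResult<BytesRef>; not the Lucene type.
final class ScoredClass {
    final String label;
    final double score;

    ScoredClass(String label, double score) {
        this.label = label;
        this.score = score;
    }
}

final class TopClassesSketch {
    // Returns at most max results, highest score first. The input list is left untouched.
    static List<ScoredClass> topClasses(List<ScoredClass> results, int max) {
        List<ScoredClass> sorted = new ArrayList<>(results);
        // Descending order: compare b to a instead of a to b.
        sorted.sort((a, b) -> Double.compare(b.score, a.score));
        // Assumption: a negative max is treated as 0 rather than raising an error.
        int limit = Math.max(0, Math.min(max, sorted.size()));
        return new ArrayList<>(sorted.subList(0, limit));
    }
}
```

Calling TopClassesSketch.topClasses(results, 2) on three scored classes would return the two highest-scoring ones; asking for more elements than exist simply returns the whole sorted list.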
-
assignClassNormalizedList
protected java.util.List<ClassificationResult<BytesRef>> assignClassNormalizedList(java.lang.String inputDocument) throws java.io.IOException
Calculate probabilities for all classes for a given input text
Parameters:
inputDocument - the input text as a String
Returns:
a List of ClassificationResult, one for each existing class
Throws:
java.io.IOException - if assigning probabilities fails
-
countDocsWithClass
protected int countDocsWithClass() throws java.io.IOException
count the number of documents in the index having at least a value for the 'class' field
Returns:
the number of documents having a value for the 'class' field
Throws:
java.io.IOException - if accessing term vectors or searching fails
-
tokenize
protected java.lang.String[] tokenize(java.lang.String text) throws java.io.IOException
tokenize a String on this classifier's text fields and analyzer
Parameters:
text - the String representing an input text (to be classified)
Returns:
a String array of the resulting tokens
Throws:
java.io.IOException - if tokenization fails
-
calculateLogLikelihood
private double calculateLogLikelihood(java.lang.String[] tokenizedText, Term term, int docsWithClass) throws java.io.IOException
Throws:
java.io.IOException
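This page gives no description for calculateLogLikelihood, so the following is only a sketch of how a per-class log likelihood is commonly computed in a naive Bayes classifier with add-one (Laplace) smoothing. It is an assumption, not a statement of what this class does internally. LogLikelihoodSketch, the docsContainingWord map, and the length normalization are all hypothetical choices made for illustration; the map and the textTermFreqForClass value play the roles that getWordFreqForClass and getTextTermFreqForClass appear to play.

```java
import java.util.Map;

final class LogLikelihoodSketch {
    /**
     * @param tokens               tokens of the text to classify
     * @param docsContainingWord   hypothetical per-class statistic: word -> number of class docs containing it
     * @param textTermFreqForClass hypothetical per-class denominator term
     * @param docsWithClass        number of documents that have a value for the class field
     */
    static double logLikelihood(String[] tokens,
                                Map<String, Integer> docsContainingWord,
                                double textTermFreqForClass,
                                int docsWithClass) {
        if (tokens.length == 0) {
            return 0.0; // nothing to score
        }
        double sum = 0.0;
        for (String word : tokens) {
            // Add-one smoothing keeps unseen words from producing log(0).
            double num = docsContainingWord.getOrDefault(word, 0) + 1;
            double den = textTermFreqForClass + docsWithClass;
            sum += Math.log(num / den);
        }
        // Assumption: divide by the token count so long texts are not penalized.
        return sum / tokens.length;
    }
}
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numeric underflow on long inputs.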
-
getTextTermFreqForClass
private double getTextTermFreqForClass(Term term) throws java.io.IOException
Returns the average number of unique terms times the number of docs belonging to the input class
Parameters:
term - the term representing the class
Returns:
the average number of unique terms
Throws:
java.io.IOException - if a low level I/O problem happens
-
getWordFreqForClass
private int getWordFreqForClass(java.lang.String word, Term term) throws java.io.IOException
Returns the number of documents of the input class (from the whole index or from a subset) that contain the word (in a specific field, or in all the fields if none is selected)
Parameters:
word - the token produced by the analyzer
term - the term representing the class
Returns:
the number of documents of the input class
Throws:
java.io.IOException - if a low level I/O problem happens
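The quantity described above (how many documents of a given class contain a given word) can be illustrated without an index. The sketch below counts over an in-memory list; LabeledDoc and WordFreqSketch are hypothetical names for this example only, and the real method presumably obtains the count by searching the index rather than by scanning documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory document: a class label plus the set of tokens it contains.
final class LabeledDoc {
    final String label;
    final Set<String> tokens;

    LabeledDoc(String label, String... tokens) {
        this.label = label;
        this.tokens = new HashSet<>(Arrays.asList(tokens));
    }
}

final class WordFreqSketch {
    // Counts documents that carry the given label AND contain the given word.
    static int wordFreqForClass(List<LabeledDoc> docs, String word, String label) {
        int count = 0;
        for (LabeledDoc doc : docs) {
            if (doc.label.equals(label) && doc.tokens.contains(word)) {
                count++;
            }
        }
        return count;
    }
}
```

Note that this is a document count, not a term occurrence count: a document containing the word several times still contributes 1.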
-
calculateLogPrior
private double calculateLogPrior(Term term, int docsWithClassSize) throws java.io.IOException
- Throws:
java.io.IOException
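No description is given for calculateLogPrior on this page. In a naive Bayes classifier the class prior is usually the fraction of labeled documents that belong to the class, so a plausible reading, offered here only as an assumption, is log(docs in class) minus log(docs with any class). LogPriorSketch and its parameter names are hypothetical.

```java
final class LogPriorSketch {
    /**
     * @param docsInClass       number of documents labeled with the class in question
     * @param docsWithClassSize number of documents that have any class value
     * @return log(docsInClass / docsWithClassSize), computed as a difference of logs
     */
    static double logPrior(int docsInClass, int docsWithClassSize) {
        // A class with zero documents yields negative infinity, i.e. probability 0.
        return Math.log((double) docsInClass) - Math.log((double) docsWithClassSize);
    }
}
```

Under this reading, the final per-class score would be the sum of the log prior and the log likelihood of the input tokens.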
-
docCount
private int docCount(Term term) throws java.io.IOException
- Throws:
java.io.IOException
-
normClassificationResults
protected java.util.ArrayList<ClassificationResult<BytesRef>> normClassificationResults(java.util.List<ClassificationResult<BytesRef>> assignedClasses)
Normalize the classification results based on the max score available
Parameters:
assignedClasses - the list of assigned classes
Returns:
the normalized results
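One common way to normalize log-space scores "based on the max score" is the log-sum-exp trick: subtract the maximum before exponentiating so the result stays numerically stable, then rescale so the values sum to 1. The sketch below shows that technique on a plain array of finite scores. Whether this method uses exactly this scheme is an assumption; NormalizationSketch is a hypothetical name.

```java
final class NormalizationSketch {
    // Converts finite log scores into probabilities that sum to 1.
    static double[] normalize(double[] logScores) {
        if (logScores.length == 0) {
            return new double[0];
        }
        // Find the max score; subtracting it keeps every exponent <= 0.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) {
            max = Math.max(max, s);
        }
        double sumExp = 0.0;
        for (double s : logScores) {
            sumExp += Math.exp(s - max);
        }
        // log(sum(exp(x_i))) = max + log(sum(exp(x_i - max)))
        double logTotal = max + Math.log(sumExp);
        double[] normalized = new double[logScores.length];
        for (int i = 0; i < logScores.length; i++) {
            normalized[i] = Math.exp(logScores[i] - logTotal);
        }
        return normalized;
    }
}
```

For example, log scores of log(1) and log(3) normalize to 0.25 and 0.75. Exponentiating the raw scores directly could underflow to 0 when the log scores are strongly negative, which is why the max is subtracted first.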
-
-