Class TextExtractor
java.lang.Object
    TextExtractor
All Implemented Interfaces:
CharStreamSource
public class TextExtractor extends java.lang.Object implements CharStreamSource
Extracts the textual content from HTML markup.
The output is ideal for feeding into a text search engine such as Apache Lucene, especially when the IncludeAttributes property has been set to true.
Use one of the following methods to obtain the output:
- writeTo(Writer)
- appendTo(Appendable)
- toString()
The process removes all of the tags and decodes the result, collapsing all white space. A space character is included in the output where a normal tag is present in the source, unless the tag belongs to an inline-level element. An exception to this is the BR element, which is also converted to a space despite being an inline-level element.
Text inside SCRIPT and STYLE elements contained within this segment is ignored.
Setting the ExcludeNonHTMLElements property results in the exclusion of any content within a non-HTML element.
See the excludeElement(StartTag) method for details on how to implement a more complex mechanism to determine whether the content of each Element is to be excluded from the output.
All tags that are not normal tags, such as server tags and comments, are removed from the output without adding white space to the output.
Note that segments on which the Segment.ignoreWhenParsing() method has been called are treated as text rather than markup, resulting in their inclusion in the output. To remove specific segments before extracting the text, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment to be removed. Then create a new source document using new Source(outputDocument.toString()) and perform the text extraction on this new source object.
Extracting the text from an entire Source object performs a full sequential parse automatically.
To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the Renderer class instead.
- Example:
- Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
-
-
Constructor Summary
Constructors
- TextExtractor(Segment segment): Constructs a new TextExtractor based on the specified Segment.
-
Method Summary
All Methods (Instance Methods, Concrete Methods)
- void appendTo(java.lang.Appendable appendable): Appends the output to the specified Appendable object.
- boolean excludeElement(StartTag startTag): Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
- boolean getConvertNonBreakingSpaces(): Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.
- long getEstimatedMaximumOutputLength(): Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
- boolean getExcludeNonHTMLElements(): Indicates whether the content of non-HTML elements is excluded from the output.
- boolean getIncludeAttributes(): Indicates whether any attribute values are included in the output.
- boolean includeAttribute(StartTag startTag, Attribute attribute): Indicates whether the value of the specified attribute in the specified start tag is included in the output.
- TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces): Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.
- TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements): Sets whether the content of non-HTML elements is excluded from the output.
- TextExtractor setIncludeAttributes(boolean includeAttributes): Sets whether any attribute values are included in the output.
- java.lang.String toString(): Returns the output as a string.
- void writeTo(java.io.Writer writer): Writes the output to the specified Writer.
-
-
-
Constructor Detail
-
TextExtractor
public TextExtractor(Segment segment)
Constructs a new TextExtractor based on the specified Segment.
- Parameters:
segment - the segment from which the text will be extracted.
- See Also:
Segment.getTextExtractor()
-
-
Method Detail
-
writeTo
public void writeTo(java.io.Writer writer) throws java.io.IOException
Description copied from interface: CharStreamSource
Writes the output to the specified Writer.
- Specified by:
writeTo in interface CharStreamSource
- Parameters:
writer - the destination java.io.Writer for the output.
- Throws:
java.io.IOException - if an I/O exception occurs.
-
appendTo
public void appendTo(java.lang.Appendable appendable) throws java.io.IOException
Description copied from interface: CharStreamSource
Appends the output to the specified Appendable object.
- Specified by:
appendTo in interface CharStreamSource
- Parameters:
appendable - the destination java.lang.Appendable object for the output.
- Throws:
java.io.IOException - if an I/O exception occurs.
-
getEstimatedMaximumOutputLength
public long getEstimatedMaximumOutputLength()
Description copied from interface: CharStreamSource
Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuilder capacity. There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.
Although implementations of this method should never return a value less than -1, users of this method must not assume that this will always be the case. Standard practice is to interpret any negative value as meaning that no estimate is available.
- Specified by:
getEstimatedMaximumOutputLength in interface CharStreamSource
- Returns:
the estimated maximum number of characters in the output, or -1 if no estimate is available.
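The guidance above (treat any negative value as "no estimate", and use the value only as a capacity hint) can be sketched as follows. This is a minimal illustration rather than library code: EstimatedSource is a hypothetical stand-in that mirrors the two CharStreamSource methods used here so that the sketch compiles without the library, and FALLBACK_CAPACITY and MAX_INITIAL_CAPACITY are assumed values chosen for the example.

```java
import java.io.IOException;

/** Hypothetical stand-in mirroring the two CharStreamSource methods used below. */
interface EstimatedSource {
    long getEstimatedMaximumOutputLength();
    void appendTo(Appendable appendable) throws IOException;
}

final class EstimateSketch {
    // Assumed values for this sketch; they are not defined by the library.
    static final int FALLBACK_CAPACITY = 256;
    static final int MAX_INITIAL_CAPACITY = 1 << 20;

    /** Turns an estimate into a safe initial StringBuilder capacity. */
    static int initialCapacity(long estimate) {
        if (estimate < 0) {
            // Any negative value, not only -1, means no estimate is available.
            return FALLBACK_CAPACITY;
        }
        // The estimate is a long, so cap it before narrowing to int.
        return (int) Math.min(estimate, MAX_INITIAL_CAPACITY);
    }

    /** Collects the output of a source into a string, pre-sizing the buffer. */
    static String collect(EstimatedSource source) throws IOException {
        StringBuilder sb = new StringBuilder(
                initialCapacity(source.getEstimatedMaximumOutputLength()));
        // The builder still grows if the real output exceeds the estimate.
        source.appendTo(sb);
        return sb.toString();
    }
}
```

Because the estimate carries no guarantee, the sketch never uses it to truncate or validate the output, only to size the buffer.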
-
toString
public java.lang.String toString()
Description copied from interface: CharStreamSource
Returns the output as a string.
- Specified by:
toString in interface CharStreamSource
- Overrides:
toString in class java.lang.Object
- Returns:
the output as a string.
-
setConvertNonBreakingSpaces
public TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.
The default value is that of the static Config.ConvertNonBreakingSpaces property at the time the TextExtractor is instantiated.
- Parameters:
convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
- Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
- See Also:
getConvertNonBreakingSpaces()
-
getConvertNonBreakingSpaces
public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.
See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.
- Returns:
true if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise false.
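The practical effect of this property on downstream tokenizing can be illustrated with plain Java strings. The sketch below does not call the library. It assumes that an unconverted &nbsp; reference ends up in the output as the character U+00A0, and NbspSketch and tokens are hypothetical names used only for this illustration.

```java
final class NbspSketch {
    /** The non-breaking space character that &nbsp; decodes to. */
    static final char NBSP = '\u00A0';

    /**
     * Splits text into tokens on white space, optionally converting
     * non-breaking spaces to ordinary spaces first.
     */
    static String[] tokens(String text, boolean convertNonBreakingSpaces) {
        String s = convertNonBreakingSpaces ? text.replace(NBSP, ' ') : text;
        // By default, \s in java.util.regex does not match U+00A0,
        // so an unconverted non-breaking space does not separate tokens.
        return s.trim().split("\\s+");
    }
}
```

With the input "10\u00A0kg", converting yields the two tokens "10" and "kg", while leaving the character in place yields a single token. This is one reason the setting matters when the output is fed to an indexer that splits on ordinary white space.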
-
setIncludeAttributes
public TextExtractor setIncludeAttributes(boolean includeAttributes)
Sets whether any attribute values are included in the output.
If the value of this property is true, then each attribute still has to match the conditions implemented in the includeAttribute(StartTag,Attribute) method in order for its value to be included in the output.
The default value is false.
- Parameters:
includeAttributes - specifies whether any attribute values are included in the output.
- Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
- See Also:
getIncludeAttributes()
-
getIncludeAttributes
public boolean getIncludeAttributes()
Indicates whether any attribute values are included in the output.
See the setIncludeAttributes(boolean) method for a full description of this property.
- Returns:
true if any attribute values are included in the output, otherwise false.
-
includeAttribute
public boolean includeAttribute(StartTag startTag, Attribute attribute)
Indicates whether the value of the specified attribute in the specified start tag is included in the output.
This method is ignored if the IncludeAttributes property is set to false, in which case no attribute values are included in the output.
If the IncludeAttributes property is set to true, every attribute of every start tag encountered in the segment is checked using this method to determine whether the value of the attribute should be included in the output.
The default implementation of this method returns true if the name of the specified attribute is one of title, alt, label, summary, content*, or href, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each attribute.
* The value of a content attribute is only included if a name attribute is also present in the specified start tag, as the content attribute of a META tag only contains human readable text if the name attribute is used as opposed to an http-equiv attribute.
- Example:
-
To include only the values of title and alt attributes:
// Requires java.util.Arrays, java.util.HashSet and java.util.Set.
final Set<String> includeAttributeNames = new HashSet<>(Arrays.asList("title", "alt"));
TextExtractor textExtractor = new TextExtractor(segment) {
    @Override
    public boolean includeAttribute(StartTag startTag, Attribute attribute) {
        // getKey() returns the attribute name in lower case.
        return includeAttributeNames.contains(attribute.getKey());
    }
};
textExtractor.setIncludeAttributes(true);
String extractedText = textExtractor.toString();
- Parameters:
startTag - the start tag containing the attribute.
attribute - the attribute to check for inclusion.
- Returns:
true if the value of the specified attribute should be included in the output, otherwise false.
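The default rule described above, including the special case for content, can be modelled as a standalone predicate. This is a sketch of the documented behaviour only, not the library's implementation: DefaultAttributeRuleSketch and include are hypothetical names, and the start tag is represented by a plain map whose keys are assumed to be lower-case attribute names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class DefaultAttributeRuleSketch {
    // Attribute names whose values are included unconditionally.
    private static final Set<String> ALWAYS_INCLUDED =
            new HashSet<>(Arrays.asList("title", "alt", "label", "summary", "href"));

    /**
     * Models the documented default rule.
     *
     * @param startTagAttributes all attributes of the start tag, keyed by lower-case name
     * @param attributeName the name of the attribute being checked
     */
    static boolean include(Map<String, String> startTagAttributes, String attributeName) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        if (ALWAYS_INCLUDED.contains(name)) {
            return true;
        }
        // content is only included when the same tag also has a name attribute,
        // which excludes META tags that use http-equiv instead.
        return name.equals("content") && startTagAttributes.containsKey("name");
    }
}
```

Under this model, the content of a tag such as META name="description" is included, while the content of META http-equiv="refresh" is not.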
-
setExcludeNonHTMLElements
public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
Sets whether the content of non-HTML elements is excluded from the output.
The default value is false, meaning that content from all elements meeting the other criteria is included.
- Parameters:
excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output.
- Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
- See Also:
getExcludeNonHTMLElements()
-
getExcludeNonHTMLElements
public boolean getExcludeNonHTMLElements()
Indicates whether the content of non-HTML elements is excluded from the output.
See the setExcludeNonHTMLElements(boolean) method for a full description of this property.
- Returns:
true if the content of non-HTML elements is excluded from the output, otherwise false.
-
excludeElement
public boolean excludeElement(StartTag startTag)
Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its associated element should be excluded from the output.
The default implementation of this method is to always return false, so that every element is included, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.
All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements. Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.
- Example:
-
To extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":
TextExtractor textExtractor = new TextExtractor(segment) {
    @Override
    public boolean excludeElement(StartTag startTag) {
        // Comparing against the constant is safe when the class attribute is absent.
        return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
    }
};
String extractedText = textExtractor.toString();
- Parameters:
startTag - the start tag of the element to check for inclusion.
- Returns:
true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false.
-
-