Class Segment
- java.lang.Object
  - Segment
- All Implemented Interfaces:
java.lang.CharSequence, java.lang.Comparable<Segment>
- Direct Known Subclasses:
Attribute, CharacterReference, Element, FormControl, SequentialListSegment, Source, Tag
public class Segment extends java.lang.Object implements java.lang.Comparable<Segment>, java.lang.CharSequence
Represents a segment of a Source document.
Many of the tag search methods are defined in this class.
The span of a segment is defined by the combination of its begin and end character positions.
-
-
Method Summary
Modifier and Type | Method | Description
char | charAt(int index) | Returns the character at the specified index.
int | compareTo(Segment segment) | Compares this Segment object to another object.
boolean | encloses(int pos) | Indicates whether this segment encloses the specified character position in the source document.
boolean | encloses(Segment segment) | Indicates whether this Segment encloses the specified Segment.
boolean | equals(java.lang.Object object) | Compares the specified object with this Segment for equality.
java.util.List<CharacterReference> | getAllCharacterReferences() | Returns a list of all CharacterReference objects that are enclosed by this segment.
java.util.List<Element> | getAllElements() |
java.util.List<Element> | getAllElements(java.lang.String name) |
java.util.List<Element> | getAllElements(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive) |
java.util.List<Element> | getAllElements(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern) |
java.util.List<Element> | getAllElements(StartTagType startTagType) |
java.util.List<Element> | getAllElementsByClass(java.lang.String className) |
java.util.List<StartTag> | getAllStartTags() |
java.util.List<StartTag> | getAllStartTags(java.lang.String name) |
java.util.List<StartTag> | getAllStartTags(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive) |
java.util.List<StartTag> | getAllStartTags(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern) |
java.util.List<StartTag> | getAllStartTags(StartTagType startTagType) |
java.util.List<StartTag> | getAllStartTagsByClass(java.lang.String className) |
java.util.List<Tag> | getAllTags() |
java.util.List<Tag> | getAllTags(TagType tagType) |
int | getBegin() | Returns the character position in the Source document at which this segment begins, inclusive.
java.util.List<Element> | getChildElements() | Returns a list of the immediate children of this segment in the document element hierarchy.
java.lang.String | getDebugInfo() | Returns a string representation of this object useful for debugging purposes.
int | getEnd() | Returns the character position in the Source document immediately after the end of this segment.
Element | getFirstElement() |
Element | getFirstElement(java.lang.String name) |
Element | getFirstElement(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive) |
Element | getFirstElement(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern) |
Element | getFirstElementByClass(java.lang.String className) |
StartTag | getFirstStartTag() |
StartTag | getFirstStartTag(java.lang.String name) |
StartTag | getFirstStartTag(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive) |
StartTag | getFirstStartTag(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern) |
StartTag | getFirstStartTag(StartTagType startTagType) |
StartTag | getFirstStartTagByClass(java.lang.String className) |
java.util.List<FormControl> | getFormControls() | Returns a list of the FormControl objects that are enclosed by this segment.
FormFields | getFormFields() | Returns the FormFields object representing all form fields that are enclosed by this segment.
int | getMaxDepthIndicator() | Returns an indication of the maximum depth of nested elements within this segment.
java.util.Iterator<Segment> | getNodeIterator() | Returns an iterator over every tag, character reference and plain text segment contained within this segment.
Renderer | getRenderer() | Performs a simple rendering of the HTML markup in this segment into text.
RowColumnVector | getRowColumnVector() | Returns a RowColumnVector object representing the row and column number of the start of this segment in the source document.
Source | getSource() | Returns the Source document containing this segment.
java.util.List<Segment> | getStyleURISegments() |
TextExtractor | getTextExtractor() | Extracts the textual content from the HTML markup of this segment.
java.util.List<Attribute> | getURIAttributes() |
int | hashCode() | Returns a hash code value for the segment.
void | ignoreWhenParsing() | Causes this segment to be ignored when parsing.
boolean | isWhiteSpace() | Indicates whether this segment consists entirely of white space.
static boolean | isWhiteSpace(char ch) | Indicates whether the specified character is white space.
int | length() | Returns the length of the segment.
Attributes | parseAttributes() | Parses any Attributes within this segment.
java.lang.CharSequence | subSequence(int beginIndex, int endIndex) | Returns a new character sequence that is a subsequence of this sequence.
java.lang.String | toString() | Returns the source text of this segment as a String.
-
-
-
Method Detail
-
getSource
public final Source getSource()
Returns the Source document containing this segment.
If a StreamedSource is in use, this method throws an UnsupportedOperationException.
- Returns:
- the Source document containing this segment.
-
getBegin
public final int getBegin()
Returns the character position in the Source document at which this segment begins, inclusive.
Use the Source.getRowColumnVector(int pos) method to determine the row and column numbers corresponding to this character position.
- Returns:
- the character position in the Source document at which this segment begins, inclusive.
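To make the relationship between a character position and a row/column pair concrete, here is a minimal sketch in plain Java. It does not use the library: `RowColumnSketch` and `rowColumn` are hypothetical names, and the sketch assumes rows and columns are counted from 1 and that lines end with `\n`. In real code the `Source.getRowColumnVector(int pos)` method mentioned above should be used instead.

```java
final class RowColumnSketch {
    /** Returns {row, column} for a character position; both are counted from 1 (an assumption). */
    static int[] rowColumn(CharSequence text, int pos) {
        if (pos < 0 || pos > text.length()) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        int row = 1;
        int col = 1;
        // Walk the characters before pos, starting a new row after each line feed.
        for (int i = 0; i < pos; i++) {
            if (text.charAt(i) == '\n') {
                row++;
                col = 1;
            } else {
                col++;
            }
        }
        return new int[] {row, col};
    }

    public static void main(String[] args) {
        // Position 6 is the '<' of "<b>" on the second line.
        int[] rc = rowColumn("<p>\n  <b>hi</b>\n</p>", 6);
        System.out.println("row " + rc[0] + ", column " + rc[1]); // row 2, column 3
    }
}
```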
-
getEnd
public final int getEnd()
Returns the character position in the Source document immediately after the end of this segment.
The character at the position specified by this property is not included in the segment.
- Returns:
- the character position in the Source document immediately after the end of this segment.
- See Also:
getBegin()
-
equals
public final boolean equals(java.lang.Object object)
Compares the specified object with this Segment for equality.
Returns true if and only if the specified object is also a Segment, and both segments have the same Source, and the same begin and end positions.
- Overrides:
equals in class java.lang.Object
- Parameters:
object - the object to be compared for equality with this Segment.
- Returns:
true if the specified object is equal to this Segment, otherwise false.
-
hashCode
public int hashCode()
Returns a hash code value for the segment.
The current implementation returns the sum of the begin and end positions, although this is not guaranteed in future versions.
- Overrides:
hashCode in class java.lang.Object
- Returns:
- a hash code value for the segment.
-
length
public int length()
Returns the length of the segment. This is defined as the number of characters between the begin and end positions.
- Specified by:
length in interface java.lang.CharSequence
- Returns:
- the length of the segment.
-
encloses
public final boolean encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment.
This is the case if getBegin() <= segment.getBegin() && getEnd() >= segment.getEnd().
Note that a segment encloses itself.
- Parameters:
segment - the segment to be tested for being enclosed by this segment.
- Returns:
true if this Segment encloses the specified Segment, otherwise false.
-
encloses
public final boolean encloses(int pos)
Indicates whether this segment encloses the specified character position in the source document.
This is the case if getBegin() <= pos < getEnd().
- Parameters:
pos - the position in the Source document.
- Returns:
true if this segment encloses the specified character position in the source document, otherwise false.
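The span rules above (inclusive begin, exclusive end) can be illustrated with a small stand-alone sketch. `SpanSketch` is a hypothetical class written only for this illustration; it mirrors the conditions quoted in the two `encloses` descriptions and the definition of `length()`, and is not the library's implementation.

```java
final class SpanSketch {
    final int begin; // inclusive
    final int end;   // exclusive

    SpanSketch(int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.begin = begin;
        this.end = end;
    }

    /** Number of characters between the begin and end positions. */
    int length() {
        return end - begin;
    }

    /** begin <= pos < end */
    boolean encloses(int pos) {
        return begin <= pos && pos < end;
    }

    /** begin <= other.begin && end >= other.end, so a span encloses itself. */
    boolean encloses(SpanSketch other) {
        return begin <= other.begin && end >= other.end;
    }

    public static void main(String[] args) {
        SpanSketch span = new SpanSketch(2, 5);
        System.out.println(span.encloses(2));    // true: begin is inclusive
        System.out.println(span.encloses(5));    // false: end is exclusive
        System.out.println(span.encloses(span)); // true: a span encloses itself
    }
}
```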
-
toString
public java.lang.String toString()
Returns the source text of this segment as a String.
The returned String is newly created with every call to this method, unless this segment is itself an instance of Source.
- Specified by:
toString in interface java.lang.CharSequence
- Overrides:
toString in class java.lang.Object
- Returns:
- the source text of this segment as a String.
-
getRenderer
public Renderer getRenderer()
Performs a simple rendering of the HTML markup in this segment into text.
The output can be configured by setting any number of properties on the returned Renderer instance before obtaining its output.
- Returns:
- an instance of Renderer based on this segment.
- See Also:
getTextExtractor()
-
getTextExtractor
public TextExtractor getTextExtractor()
Extracts the textual content from the HTML markup of this segment.
The output can be configured by setting properties on the returned TextExtractor instance before obtaining its output.
- Returns:
- an instance of TextExtractor based on this segment.
- See Also:
getRenderer()
-
getNodeIterator
public java.util.Iterator<Segment> getNodeIterator()
Returns an iterator over every tag, character reference and plain text segment contained within this segment.
See the Source.iterator() method for a detailed description.
- Example:
The following code demonstrates the typical usage of this method to make an exact copy of this segment to writer (assuming no server tags are present):

    for (Iterator<Segment> nodeIterator=segment.getNodeIterator(); nodeIterator.hasNext();) {
      Segment nodeSegment=nodeIterator.next();
      if (nodeSegment instanceof Tag) {
        Tag tag=(Tag)nodeSegment;
        // HANDLE TAG
        // Uncomment the following line to ensure each tag is valid XML:
        // writer.write(tag.tidy()); continue;
      } else if (nodeSegment instanceof CharacterReference) {
        CharacterReference characterReference=(CharacterReference)nodeSegment;
        // HANDLE CHARACTER REFERENCE
        // Uncomment the following line to decode all character references instead of copying them verbatim:
        // characterReference.appendCharTo(writer); continue;
      } else {
        // HANDLE PLAIN TEXT
      }
      // unless specific handling has prevented getting to here, simply output the segment as is:
      writer.write(nodeSegment.toString());
    }

- Returns:
- an iterator over every tag, character reference and plain text segment contained within this segment.
-
getAllTags
public java.util.List<Tag> getAllTags()
Returns a list of all Tag objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Tag class documentation for more details about the behaviour of this method.
-
getAllTags
public java.util.List<Tag> getAllTags(TagType tagType)
Returns a list of all Tag objects of the specified type that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the tagType parameter is equivalent to getAllTags().
- Parameters:
tagType - the type of tags to get.
- Returns:
- a list of all Tag objects of the specified type that are enclosed by this segment.
- See Also:
getAllStartTags(StartTagType)
-
getAllStartTags
public java.util.List<StartTag> getAllStartTags()
Returns a list of all StartTag objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Tag class documentation for more details about the behaviour of this method.
-
getAllStartTags
public java.util.List<StartTag> getAllStartTags(StartTagType startTagType)
Returns a list of all StartTag objects of the specified type that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the startTagType parameter is equivalent to getAllStartTags().
-
getAllStartTags
public java.util.List<StartTag> getAllStartTags(java.lang.String name)
Returns a list of all normal StartTag objects with the specified name that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the name parameter is equivalent to getAllStartTags(), which may include non-normal start tags.
This method also returns unregistered tags if the specified name is not a valid XML tag name.
-
getAllStartTags
public java.util.List<StartTag> getAllStartTags(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
- Returns:
- a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment.
- See Also:
getAllStartTags(String attributeName, Pattern valueRegexPattern)
-
getAllStartTags
public java.util.List<StartTag> getAllStartTags(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern)
Returns a list of all StartTag objects with the specified attribute name and value pattern that are enclosed by this segment.
Specifying a null argument to the valueRegexPattern parameter performs the search on the attribute name only, without regard to the attribute value. This will also match an attribute that has no value at all.
See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
- Returns:
- a list of all StartTag objects with the specified attribute name and value pattern that are enclosed by this segment.
- See Also:
getAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
-
getAllStartTagsByClass
public java.util.List<StartTag> getAllStartTagsByClass(java.lang.String className)
Returns a list of all StartTag objects with the specified class that are enclosed by this segment.
This matches start tags with a class attribute that contains the specified class name, either as an exact match or where the specified class name is one of multiple class names separated by white space in the attribute value.
See the Tag class documentation for more details about the behaviour of this method.
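The class-name matching rule described above (an exact match, or one of several names separated by white space) can be sketched as follows. `ClassAttributeSketch.hasClass` is a hypothetical helper, not part of the library, and the case-sensitive comparison is an assumption made for this illustration.

```java
final class ClassAttributeSketch {
    /**
     * Returns true if className is one of the white-space separated names
     * in the value of a class attribute. Comparison is case sensitive (an assumption).
     */
    static boolean hasClass(String classAttributeValue, String className) {
        if (classAttributeValue == null || className == null || className.isEmpty()) {
            return false;
        }
        // Split on runs of white space so "nav   main" yields ["nav", "main"].
        for (String token : classAttributeValue.trim().split("\\s+")) {
            if (token.equals(className)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasClass("main", "main"));     // true: exact match
        System.out.println(hasClass("nav  main", "main")); // true: one of several names
        System.out.println(hasClass("navbar", "nav"));     // false: substring only
    }
}
```

Note that a plain substring test would wrongly match `nav` inside `navbar`, which is why the sketch compares whole tokens.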
-
getChildElements
public java.util.List<Element> getChildElements()
Returns a list of the immediate children of this segment in the document element hierarchy.
The returned list may include an element that extends beyond the end of this segment, as long as it begins within this segment.
An element found at the start of this segment is included in the list. Note however that if this segment is an Element, the overriding Element.getChildElements() method is called instead, which only returns the children of the element.
Calling getChildElements() on an Element is much more efficient than calling it on a Segment.
The objects in the list are all of type Element.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
See the Source.getChildElements() method for more details.
- Returns:
- a list of the immediate children of this segment in the document element hierarchy, guaranteed not null.
- See Also:
Element.getParentElement()
-
getAllElements
public java.util.List<Element> getAllElements()
Returns a list of all Element objects that are enclosed by this segment.
The Source.fullSequentialParse() method should be called after construction of the Source object if this method is to be used on a large proportion of the source. It is called automatically if this method is called on the Source object itself.
The elements returned correspond exactly with the start tags returned in the getAllStartTags() method.
If this segment is itself an Element, the result includes this element in the list.
-
getAllElements
public java.util.List<Element> getAllElements(java.lang.String name)
Returns a list of all Element objects with the specified name that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllStartTags(String name) method, except that elements which are not entirely enclosed by this segment are excluded.
Specifying a null argument to the name parameter is equivalent to getAllElements(), which may include elements of non-normal tags.
This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
If this segment is itself an Element with the specified name, the result includes this element in the list.
-
getAllElements
public java.util.List<Element> getAllElements(StartTagType startTagType)
Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllTags(TagType) method, except that elements which are not entirely enclosed by this segment are excluded.
If this segment is itself an Element with the specified type, the result includes this element in the list.
-
getAllElements
public java.util.List<Element> getAllElements(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
Returns a list of all Element objects with the specified attribute name/value pair that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllStartTags(String attributeName, String value, boolean valueCaseSensitive) method, except that elements which are not entirely enclosed by this segment are excluded.
If this segment is itself an Element with the specified name/value pair, the result includes this element in the list.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
- Returns:
- a list of all Element objects with the specified attribute name/value pair that are enclosed by this segment.
- See Also:
getAllElements(String attributeName, Pattern valueRegexPattern)
-
getAllElements
public java.util.List<Element> getAllElements(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern)
Returns a list of all Element objects with the specified attribute name and value pattern that are enclosed by this segment.
The elements returned correspond with the start tags returned in the getAllStartTags(String attributeName, Pattern valueRegexPattern) method, except that elements which are not entirely enclosed by this segment are excluded.
Specifying a null argument to the valueRegexPattern parameter performs the search on the attribute name only, without regard to the attribute value. This will also match an attribute that has no value at all.
If this segment is itself an Element with the specified attribute name and value pattern, the result includes this element in the list.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
- Returns:
- a list of all Element objects with the specified attribute name and value pattern that are enclosed by this segment.
- See Also:
getAllElements(String attributeName, String value, boolean valueCaseSensitive)
-
getAllElementsByClass
public java.util.List<Element> getAllElementsByClass(java.lang.String className)
Returns a list of all Element objects with the specified class that are enclosed by this segment.
This matches elements with a class attribute that contains the specified class name, either as an exact match or where the specified class name is one of multiple class names separated by white space in the attribute value.
The elements returned correspond with the start tags returned in the getAllStartTagsByClass(String className) method, except that elements which are not entirely enclosed by this segment are excluded.
If this segment is itself an Element with the specified class, the result includes this element in the list.
-
getAllCharacterReferences
public java.util.List<CharacterReference> getAllCharacterReferences()
Returns a list of all CharacterReference objects that are enclosed by this segment.
- Returns:
- a list of all CharacterReference objects that are enclosed by this segment.
-
getURIAttributes
public java.util.List<Attribute> getURIAttributes()
Returns a list of all attributes enclosed by this segment that have URI values.
According to the HTML 4.01 specification, the following attributes have URI values:
HTML element name | Attribute name
A | href
APPLET | codebase
APPLET | archive
AREA | href
BASE | href
BLOCKQUOTE | cite
BODY | background
FORM | action
FRAME | longdesc
FRAME | src
DEL | cite
HEAD | profile
IFRAME | longdesc
IFRAME | src
IMG | longdesc
IMG | src
IMG | usemap
INPUT | src
INPUT | usemap
INS | cite
LINK | href
OBJECT | archive
OBJECT | classid
OBJECT | codebase
OBJECT | data
OBJECT | usemap
Q | cite
SCRIPT | src
Attributes from other elements may also be returned if the attribute name matches one of those in the list above.
This method is often used in conjunction with the getStyleURISegments() method in order to find all URIs in a document.
The attributes are returned in order of appearance.
- Returns:
- a list of all attributes enclosed by this segment that have URI values.
- See Also:
getStyleURISegments()
-
getStyleURISegments
public java.util.List<Segment> getStyleURISegments()
Returns a list of all URI segments inside the CSS of STYLE elements and style attribute values enclosed by this segment.
If this segment does not contain any tags, the entire segment is assumed to be CSS.
The URI segments are found by searching the CSS for the functional notation "url()" as described in section 4.3.4 of the CSS2 specification.
The segments are returned in order of appearance.
- Returns:
- a list of all URI segments inside STYLE elements and style attribute values enclosed by this segment.
- See Also:
getURIAttributes()
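As a rough illustration of what searching CSS for the `url()` functional notation involves, the following stand-alone sketch locates the begin and end positions of each URI in a CSS string. The class name `CssUrlSketch`, the method `findUriSpans` and the simplified regular expression are all assumptions made for this example; the library's own search may treat escapes, comments and edge cases differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssUrlSketch {
    // url( optional-space optional-quote URI same-quote optional-space )
    // Simplified: does not handle escaped quotes or CSS comments.
    private static final Pattern URL_FUNCTION =
        Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    /** Returns {begin, end} positions of each URI inside url(...), in order of appearance. */
    static List<int[]> findUriSpans(CharSequence css) {
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = URL_FUNCTION.matcher(css);
        while (matcher.find()) {
            // Group 2 is the URI itself, without quotes or surrounding white space.
            spans.add(new int[] {matcher.start(2), matcher.end(2)});
        }
        return spans;
    }

    public static void main(String[] args) {
        String css = "a{background:url('x.png')} b{background:url(y.gif)}";
        for (int[] span : findUriSpans(css)) {
            System.out.println(span[0] + ".." + span[1] + " " + css.substring(span[0], span[1]));
        }
    }
}
```

Returning positions rather than strings mirrors the idea of a segment: each result identifies a span of the original text that could later be replaced or rewritten in place.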
-
getFirstStartTag
public final StartTag getFirstStartTag()
Returns the first StartTag enclosed by this segment.
This is functionally equivalent to getAllStartTags().iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
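The difference in behaviour on an empty result is worth spelling out: calling `iterator().next()` on an empty list throws a `NoSuchElementException`, whereas the getFirst methods return null. The sketch below shows that contract with a hypothetical generic helper, `firstOrNull`, over any `Iterable`; it illustrates the return-value convention only and does not reproduce the library's early-exit search.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class FirstOrNullSketch {
    /** Returns the first item, or null if there are no items. */
    static <T> T firstOrNull(Iterable<T> items) {
        Iterator<T> iterator = items.iterator();
        return iterator.hasNext() ? iterator.next() : null;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("<p>", "<b>");
        List<String> none = Collections.emptyList();
        System.out.println(firstOrNull(tags)); // <p>
        System.out.println(firstOrNull(none)); // null, where none.iterator().next() would throw
    }
}
```

Callers of the getFirst methods therefore need a null check rather than a try/catch.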
-
getFirstStartTag
public final StartTag getFirstStartTag(StartTagType startTagType)
Returns the first StartTag of the specified type enclosed by this segment.
This is functionally equivalent to getAllStartTags(startTagType).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
-
getFirstStartTag
public final StartTag getFirstStartTag(java.lang.String name)
Returns the first normal StartTag with the specified name enclosed by this segment.
This is functionally equivalent to getAllStartTags(name).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
Specifying a null argument to the name parameter is equivalent to getFirstStartTag().
-
getFirstStartTag
public final StartTag getFirstStartTag(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
Returns the first StartTag with the specified attribute name/value pair enclosed by this segment.
This is functionally equivalent to getAllStartTags(attributeName,value,valueCaseSensitive).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
- Returns:
- the first StartTag with the specified attribute name/value pair enclosed by this segment, or null if none exists.
- See Also:
getFirstStartTag(String attributeName, Pattern valueRegexPattern)
-
getFirstStartTag
public final StartTag getFirstStartTag(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern)
Returns the first StartTag with the specified attribute name and value pattern that is enclosed by this segment.
This is functionally equivalent to getAllStartTags(attributeName,valueRegexPattern).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
- Returns:
- the first StartTag with the specified attribute name and value pattern that is enclosed by this segment, or null if none exists.
- See Also:
getFirstStartTag(String attributeName, String value, boolean valueCaseSensitive)
-
getFirstStartTagByClass
public final StartTag getFirstStartTagByClass(java.lang.String className)
Returns the first StartTag with the specified class that is enclosed by this segment.
This is functionally equivalent to getAllStartTagsByClass(className).iterator().next(), but does not search beyond the first start tag and returns null if no such start tag exists.
-
getFirstElement
public final Element getFirstElement()
Returns the first Element enclosed by this segment.
This is functionally equivalent to getAllElements().iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element, this element is returned, not the first child element.
-
getFirstElement
public final Element getFirstElement(java.lang.String name)
Returns the first normal Element with the specified name enclosed by this segment.
This is functionally equivalent to getAllElements(name).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
Specifying a null argument to the name parameter is equivalent to getFirstElement().
If this segment is itself an Element with the specified name, this element is returned.
-
getFirstElement
public final Element getFirstElement(java.lang.String attributeName, java.lang.String value, boolean valueCaseSensitive)
Returns the first Element with the specified attribute name/value pair enclosed by this segment.
This is functionally equivalent to getAllElements(attributeName,value,valueCaseSensitive).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element with the specified attribute name/value pair, this element is returned.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
- Returns:
- the first Element with the specified attribute name/value pair enclosed by this segment, or null if none exists.
- See Also:
getFirstElement(String attributeName, Pattern valueRegexPattern)
-
getFirstElement
public final Element getFirstElement(java.lang.String attributeName, java.util.regex.Pattern valueRegexPattern)
Returns the first Element with the specified attribute name and value pattern that is enclosed by this segment.
This is functionally equivalent to getAllElements(attributeName,valueRegexPattern).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element with the specified attribute name and value pattern, this element is returned.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
valueRegexPattern - the regular expression pattern that must match the attribute value, may be null.
- Returns:
- the first Element with the specified attribute name and value pattern that is enclosed by this segment, or null if none exists.
- See Also:
getFirstElement(String attributeName, String value, boolean valueCaseSensitive)
-
getFirstElementByClass
public final Element getFirstElementByClass(java.lang.String className)
Returns the first Element with the specified class that is enclosed by this segment.
This is functionally equivalent to getAllElementsByClass(className).iterator().next(), but does not search beyond the first enclosed element and returns null if no such element exists.
If this segment is itself an Element with the specified class, this element is returned.
-
getFormControls
public java.util.List<FormControl> getFormControls()
Returns a list of the FormControl objects that are enclosed by this segment.
- Returns:
- a list of the FormControl objects that are enclosed by this segment.
-
getFormFields
public FormFields getFormFields()
Returns the FormFields object representing all form fields that are enclosed by this segment.
This is equivalent to new FormFields(getFormControls()).
- Returns:
- the FormFields object representing all form fields that are enclosed by this segment.
- See Also:
getFormControls()
-
parseAttributes
public Attributes parseAttributes()
Parses any Attributes within this segment. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations.
This is equivalent to source.parseAttributes(getBegin(),getEnd()).
- Returns:
- the Attributes within this segment, or null if too many errors occur while parsing.
-
ignoreWhenParsing
public void ignoreWhenParsing()
Causes the this segment to be ignored when parsing.Ignored segments are treated as blank spaces by the parsing mechanism, but are included as normal text in all other functions.
This method was originally the only means of preventing server tags located inside normal tags from interfering with the parsing of those tags (such as where an attribute of a normal tag uses a server tag to dynamically set its value), as well as preventing non-server tags from being recognised inside server tags.
It is not necessary to use this method to ignore server tags located inside normal tags, as the attributes parser automatically ignores any server tags.
It is not necessary to use this method to ignore non-server tags inside server tags, or the contents of SCRIPT elements, as the parser does this automatically when performing a full sequential parse.
This leaves only very few scenarios where calling this method still provides a significant benefit.
One such case is where XML-style server tags are used inside normal tags. Here is an example using an XML-style JSP tag:
<a href="<i18n:resource path="/Portal"/>?BACK=TRUE">back</a>
The first double-quote of "/Portal" will be interpreted as the end quote for the href attribute, as there is no way for the parser to recognise the i18n:resource element as a server tag. Such use of XML-style server tags inside normal tags is generally seen as bad practice, but it is nevertheless valid JSP. The only way to ensure that this library is able to parse the normal tag surrounding it is to find these server tags first and call the ignoreWhenParsing method to ignore them before parsing the rest of the document.
It is important to understand the difference between ignoring the segment when parsing and removing the segment completely. Any text inside a segment that is ignored when parsing is treated by most functions as content, and as such is included in the output of tools such as TextExtractor and Renderer.
To remove segments completely, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment. Then create a new source document using new Source(outputDocument.toString()) and perform the desired operations on this new source object.
Calling this method after the Source.fullSequentialParse() method has been called is not permitted and throws an IllegalStateException.
Any tags appearing in this segment that are found before this method is called will remain in the tag cache, and so will continue to be found by the tag search methods. If this is undesirable, the Source.clearCache() method can be called to remove them from the cache. Calling the Source.fullSequentialParse() method after this method clears the cache automatically.
For best performance, this method should be called on all segments that need to be ignored without calling any of the tag search methods in between.
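The "find these server tags first" step can be illustrated without the library. The following is a minimal sketch in plain Java, not the library's own code: the class name ServerTagBlanker and its regular expression are hypothetical, and the pattern only covers self-closing tags with a namespace prefix. It overwrites each match with spaces in a copy of the text, so every other character keeps its position, which is the same idea as replacing a span with spaces of equal length.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library described here.
final class ServerTagBlanker {
    // Assumption: server tags look like <prefix:name ... />. Real server tag
    // syntax varies, so this pattern would need adapting for other cases.
    private static final Pattern PREFIXED_EMPTY_TAG =
            Pattern.compile("<[A-Za-z][\\w-]*:[A-Za-z][\\w-]*\\b[^<>]*/>");

    // Returns a copy of the text in which each matched tag is overwritten
    // with spaces, leaving the length and all other positions unchanged.
    static String blank(String text) {
        StringBuilder result = new StringBuilder(text);
        Matcher matcher = PREFIXED_EMPTY_TAG.matcher(text);
        while (matcher.find()) {
            for (int i = matcher.start(); i < matcher.end(); i++) {
                result.setCharAt(i, ' ');
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"<i18n:resource path=\"/Portal\"/>?BACK=TRUE\">back</a>";
        System.out.println(blank(html));
    }
}
```

With the library itself, the matched begin and end positions would instead be used to identify the segments on which this method is called, before any other parsing takes place.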
-
compareTo
public int compareTo(Segment segment)
Compares this Segment object to another object.
If the argument is not a Segment, a ClassCastException is thrown.
A segment is considered to be before another segment if its begin position is earlier, or in the case that both segments begin at the same position, its end position is earlier.
Segments that begin and end at the same position are considered equal for the purposes of this comparison, even if they relate to different source documents.
Note: this class has a natural ordering that is inconsistent with equals. This means that this method may return zero in some cases where calling the equals(Object) method with the same argument returns false.
- Specified by:
compareTo in interface java.lang.Comparable<Segment>
- Parameters:
segment - the segment to be compared
- Returns:
a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
- Throws:
java.lang.ClassCastException - if the argument is not a Segment
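The ordering rule can be illustrated with a small stand-alone sketch. The Span class below is a hypothetical stand-in for a segment and does not use the library. Its equals method assumes that equality also takes the owning document into account, which is one way for the natural ordering to be inconsistent with equals.

```java
import java.util.Objects;

// Hypothetical stand-in for a segment; not the library's Segment class.
final class SpanOrderingSketch {
    static final class Span implements Comparable<Span> {
        final String document; // label for the owning document (assumption)
        final int begin;       // begin position, inclusive
        final int end;         // end position, exclusive

        Span(String document, int begin, int end) {
            this.document = document;
            this.begin = begin;
            this.end = end;
        }

        // Earlier begin sorts first; ties are broken by the earlier end.
        // The owning document plays no part in the comparison.
        @Override
        public int compareTo(Span other) {
            if (begin != other.begin) {
                return Integer.compare(begin, other.begin);
            }
            return Integer.compare(end, other.end);
        }

        // Assumption: equality also requires the same owning document.
        @Override
        public boolean equals(Object object) {
            if (!(object instanceof Span)) {
                return false;
            }
            Span other = (Span) object;
            return begin == other.begin && end == other.end
                    && document.equals(other.document);
        }

        @Override
        public int hashCode() {
            return Objects.hash(document, begin, end);
        }
    }

    public static void main(String[] args) {
        Span first = new Span("A", 0, 5);
        Span second = new Span("B", 0, 5);
        System.out.println(first.compareTo(second)); // same span, different documents
        System.out.println(first.equals(second));
    }
}
```

One practical consequence of such an ordering is that sorted collections such as java.util.TreeSet, which rely on compareTo rather than equals, treat two elements that compare as zero as duplicates.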
-
isWhiteSpace
public final boolean isWhiteSpace()
Indicates whether this segment consists entirely of white space.
- Returns:
true if this segment consists entirely of white space, otherwise false.
-
getMaxDepthIndicator
public int getMaxDepthIndicator()
Returns an indication of the maximum depth of nested elements within this segment.
A high return value can indicate that the segment contains a large number of incorrectly nested tags that could result in a StackOverflowError if its content is parsed.
The usefulness of this method is debatable as a StackOverflowError is a recoverable error that can be easily caught. The use of this method to pre-detect and avoid a stack overflow may save some memory and processing resources in certain circumstances, but the cost of calling this method to check every segment or document will very often exceed any benefit.
It is up to the application developer to determine what return value constitutes an unreasonable level of nesting given the stack space allocated to the application and other factors.
Note that the return value is an approximation only and is usually greater than the actual maximum element depth that would be reported by calling the Element.getDepth() method on the most nested element.
- Returns:
an indication of the maximum depth of nested elements within this segment.
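The alternative of simply catching the overflow can be sketched in plain Java. In Java the throwable raised when the call stack is exhausted is java.lang.StackOverflowError. The class below is hypothetical: its recursive method only stands in for deeply nested processing, and it does not call the library.

```java
// Hypothetical illustration; the recursion stands in for deeply nested parsing.
final class DeepRecursionSketch {
    // Recurses once per level, consuming one stack frame each time.
    static int descend(int levels) {
        return levels == 0 ? 0 : 1 + descend(levels - 1);
    }

    // Returns true if processing the given number of levels exhausted the stack.
    static boolean overflows(int levels) {
        try {
            descend(levels);
            return false;
        } catch (StackOverflowError error) {
            // The stack has unwound by the time this handler runs,
            // so execution can continue normally from here.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overflows(10));
        System.out.println(overflows(100_000_000));
    }
}
```

Whether a given depth overflows depends on the stack size configured for the thread, so any threshold is application-specific.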
-
isWhiteSpace
public static final boolean isWhiteSpace(char ch)
Indicates whether the specified character is white space.
The HTML 4.01 specification section 9.1 specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not recognise it as white space and renders it as an unprintable character (empty square). Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
- Parameters:
ch - the character to test.
- Returns:
true if the specified character is white space, otherwise false.
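The character list above translates directly into a membership test. The following is a minimal sketch under that reading, with hypothetical class and method names; it is not the library's source. Note that the set differs from java.lang.Character.isWhitespace(char), which for example accepts the vertical tab.

```java
// Hypothetical sketch of a check based on the six characters listed above.
final class WhiteSpaceSketch {
    static boolean isHtmlWhiteSpace(char ch) {
        switch (ch) {
            case ' ':      // space
            case '\t':     // tab
            case '\f':     // form feed
            case '\n':     // line feed
            case '\r':     // carriage return
            case '\u200B': // zero-width space
                return true;
            default:
                return false;
        }
    }

    // Returns true if every character in the text passes the check above.
    static boolean isAllWhiteSpace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isHtmlWhiteSpace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllWhiteSpace(" \t\r\n"));
        System.out.println(isAllWhiteSpace(" x "));
    }
}
```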
-
getRowColumnVector
public RowColumnVector getRowColumnVector()
Returns a RowColumnVector object representing the row and column number of the start of this segment in the source document.
- Returns:
a RowColumnVector object representing the row and column number of the start of this segment in the source document.
- See Also:
Source.getRowColumnVector(int pos)
-
getDebugInfo
public java.lang.String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
- Returns:
a string representation of this object useful for debugging purposes.
-
charAt
public char charAt(int index)
Returns the character at the specified index.
This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length().
However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for an invalid argument value.
- Specified by:
charAt in interface java.lang.CharSequence
- Parameters:
index - the index of the character.
- Returns:
the character at the specified index.
-
subSequence
public java.lang.CharSequence subSequence(int beginIndex, int endIndex)
Returns a new character sequence that is a subsequence of this sequence.
This is logically equivalent to toString().subSequence(beginIndex,endIndex) for valid values of beginIndex and endIndex.
However, because this implementation works directly on the underlying document source text, it should not be assumed that an IndexOutOfBoundsException is thrown for invalid argument values as described in the String.subSequence(int,int) method.
- Specified by:
subSequence in interface java.lang.CharSequence
- Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
- Returns:
a new character sequence that is a subsequence of this sequence.
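The caveat about missing exceptions in charAt and subSequence follows from reading through to the underlying text. The sketch below shows the general idea with a hypothetical WindowSketch class: a view over part of a string that offsets each index without checking it against its own length. It illustrates the principle only and is not the library's implementation.

```java
// Hypothetical view over part of a larger string; not the library's code.
final class WindowSketch implements CharSequence {
    private final String text; // the whole underlying text
    private final int begin;   // start of the window, inclusive
    private final int end;     // end of the window, exclusive

    WindowSketch(String text, int begin, int end) {
        this.text = text;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int length() {
        return end - begin;
    }

    // Offsets the index into the underlying text without checking it against
    // length(), so an index past the window may still return a character.
    @Override
    public char charAt(int index) {
        return text.charAt(begin + index);
    }

    // Same approach: only the bounds of the underlying text are enforced.
    @Override
    public CharSequence subSequence(int beginIndex, int endIndex) {
        return text.subSequence(begin + beginIndex, begin + endIndex);
    }

    @Override
    public String toString() {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        WindowSketch window = new WindowSketch("hello world", 0, 5);
        System.out.println(window);           // the window itself
        System.out.println(window.charAt(1)); // a valid index
    }
}
```

Callers that need strict bounds checking can validate the index against length() themselves, or call toString() first and index into the resulting string.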
-
-