public final class Tokenizers extends Object
All methods return immutable objects provided the arguments are also immutable.
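Because every factory method returns a `Tokenizer`, the factories compose freely. The composition semantics of `chain` and `filter` can be sketched in plain Java; the functional interface `Tok` below is a hypothetical stand-in for `Tokenizer`, and `java.util.function.Predicate` stands in for Guava's `Predicate`. This is an illustration of the documented behavior, not the library's implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class TokenizerSketch {
    // Hypothetical stand-in for the library's Tokenizer interface.
    interface Tok {
        List<String> tokenize(String input);
    }

    // chain: feed each token produced by the first tokenizer into the second.
    static Tok chain(Tok first, Tok second) {
        return input -> {
            List<String> out = new ArrayList<>();
            for (String token : first.tokenize(input)) {
                out.addAll(second.tokenize(token));
            }
            return out;
        };
    }

    // filter: keep only the tokens matching the predicate.
    static Tok filter(Tok delegate, Predicate<String> predicate) {
        return input -> {
            List<String> out = new ArrayList<>();
            for (String token : delegate.tokenize(input)) {
                if (predicate.test(token)) out.add(token);
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Tok whitespace = input -> Arrays.asList(input.trim().split("\\s+"));
        Tok nonEmpty = filter(whitespace, s -> !s.isEmpty());
        System.out.println(nonEmpty.tokenize("the quick  fox")); // → [the, quick, fox]
    }
}
```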
| Modifier and Type | Method and Description |
|---|---|
| `static Tokenizer` | `chain(List<Tokenizer> tokenizers)` Chains tokenizers together. |
| `static Tokenizer` | `chain(Tokenizer tokenizer, Tokenizer... tokenizers)` Chains tokenizers together. |
| `static Tokenizer` | `filter(Tokenizer tokenizer, com.google.common.base.Predicate<String> predicate)` Constructs a new filtering tokenizer. |
| `static Tokenizer` | `pattern(Pattern pattern)` Returns a tokenizer that splits a string into tokens around the pattern, as if calling `pattern.split(input, -1)`. |
| `static Tokenizer` | `pattern(String regex)` Returns a tokenizer that splits a string into tokens around the pattern, as if calling `Pattern.compile(regex).split(input, -1)`. |
| `static Tokenizer` | `qGram(int q)` Returns a basic q-gram tokenizer for a variable q. |
| `static Tokenizer` | `qGramWithFilter(int q)` Returns a basic q-gram tokenizer for a variable q. |
| `static Tokenizer` | `qGramWithPadding(int q)` Returns a basic q-gram tokenizer for a variable q. The input is padded with q-1 special characters before being tokenized. |
| `static Tokenizer` | `qGramWithPadding(int q, String padding)` Returns a basic q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding. |
| `static Tokenizer` | `qGramWithPadding(int q, String startPadding, String endPadding)` Returns a basic q-gram tokenizer for a variable q. The q-gram is extended beyond the length of the string with padding. |
| `static Tokenizer` | `transform(Tokenizer tokenizer, com.google.common.base.Function<String,String> function)` Constructs a new transforming tokenizer. |
| `static Tokenizer` | `whitespace()` Returns a tokenizer that splits a string into tokens around whitespace. |
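As a concrete illustration of the q-gram variants listed above, the padded form can be sketched in plain Java. The sketch mirrors the documented behavior (pad both ends with q-1 copies of the padding string, `#` by default, then take every window of length q); it is an assumption-laden sketch, not the library's code:

```java
import java.util.ArrayList;
import java.util.List;

public class QGramSketch {
    // Sketch of qGramWithPadding(q, padding): pad both ends with q-1 copies
    // of the padding string, then slide a window of size q over the result.
    static List<String> qGramWithPadding(int q, String padding, String input) {
        String pad = padding.repeat(q - 1); // q-1 padding characters per side
        String padded = pad + input + pad;
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + q <= padded.length(); i++) {
            tokens.add(padded.substring(i, i + q));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "cat" is padded to "#cat#"; the boundary bigrams mark word edges.
        System.out.println(qGramWithPadding(2, "#", "cat")); // → [#c, ca, at, t#]
    }
}
```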
public static Tokenizer chain(List<Tokenizer> tokenizers)
Parameters:
tokenizers - a non-empty list of tokenizers

public static Tokenizer chain(Tokenizer tokenizer, Tokenizer... tokenizers)
Parameters:
tokenizer - the first tokenizer
tokenizers - the other tokenizers

public static Tokenizer filter(Tokenizer tokenizer, com.google.common.base.Predicate<String> predicate)
Parameters:
tokenizer - the delegate tokenizer
predicate - the predicate for tokens to keep

public static Tokenizer pattern(Pattern pattern)
Returns a tokenizer that splits a string into tokens around the pattern, as if calling pattern.split(input, -1).
Parameters:
pattern - the pattern to split the string around

public static Tokenizer pattern(String regex)
Returns a tokenizer that splits a string into tokens around the pattern, as if calling Pattern.compile(regex).split(input, -1).
Parameters:
regex - the regex to split the string around

public static Tokenizer qGram(int q)
Parameters:
q - size of the tokens

public static Tokenizer qGramWithFilter(int q)
Parameters:
q - size of the tokens

public static Tokenizer qGramWithPadding(int q)
Uses # as the default padding.
Parameters:
q - size of the tokens

public static Tokenizer qGramWithPadding(int q, String padding)
Parameters:
q - size of the tokens
padding - padding to pad the start and end of the string with

public static Tokenizer qGramWithPadding(int q, String startPadding, String endPadding)
Parameters:
q - size of the tokens
startPadding - padding to pad the start of the string with
endPadding - padding to pad the end of the string with

public static Tokenizer transform(Tokenizer tokenizer, com.google.common.base.Function<String,String> function)
Parameters:
tokenizer - the delegate tokenizer
function - the function to transform tokens

public static Tokenizer whitespace()
Returns a tokenizer that splits a string into tokens around whitespace. To create a tokenizer that also returns leading and trailing empty tokens, use Tokenizers.pattern("\\s+").
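The distinction matters because splitting with limit -1 keeps leading and trailing empty strings, which, per the note above, whitespace() does not return. A plain-JDK demonstration of the documented split(input, -1) behavior:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    public static void main(String[] args) {
        String input = " one two ";
        // Limit -1 keeps leading and trailing empty strings, so a string
        // with surrounding whitespace yields empty boundary tokens.
        String[] withEmpties = Pattern.compile("\\s+").split(input, -1);
        System.out.println(Arrays.toString(withEmpties));  // → [, one, two, ]
        // The default limit of 0 drops trailing empty strings.
        String[] defaultSplit = Pattern.compile("\\s+").split(input);
        System.out.println(Arrays.toString(defaultSplit)); // → [, one, two]
    }
}
```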
Copyright © 2014–2018. All rights reserved.