RussianLetterTokenizer
public class RussianLetterTokenizer extends CharTokenizer
A RussianLetterTokenizer is a tokenizer that extends the behavior of LetterTokenizer by additionally looking up letters
in a given "russian charset". The problem with LetterTokenizer is that it uses the Character.isLetter() method,
which doesn't know how to detect letters in encodings like CP1252 and KOI8
(well-known problems with the 0xD7 and 0xF7 chars). |
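As a small illustration of the problem described above, the following sketch checks what Character.isLetter() reports for the char values 0xD7 and 0xF7. The class and method names are hypothetical and not part of the library. The sketch assumes those values reach the tokenizer as raw char codes, which Unicode defines as the multiplication and division signs rather than letters; a Cyrillic letter is included only for contrast.

```java
// Hypothetical helper, not part of the library: probes Character.isLetter()
// for the two char values named in the class description.
final class IsLetterGapSketch {

    // Treats the given value as a single UTF-16 code unit and asks the JDK
    // whether it counts as a letter.
    static boolean isLetterCodeUnit(int value) {
        return Character.isLetter((char) value);
    }

    public static void main(String[] args) {
        // 0xD7 and 0xF7 are the values from the description;
        // 0x0427 is a Cyrillic capital letter, added here for comparison.
        int[] samples = {0xD7, 0xF7, 0x0427};
        for (int value : samples) {
            System.out.printf("U+%04X isLetter=%b%n", value, isLetterCodeUnit(value));
        }
    }
}
```

Under that assumption, both 0xD7 and 0xF7 are rejected by Character.isLetter(), which is the gap the extra charset is meant to cover.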
Fields Summary |
---|
private char[] | charset
Construct a new LetterTokenizer. |
Constructors Summary |
---|
public RussianLetterTokenizer(Reader in, char[] charset)
{
    super(in);
    this.charset = charset;
}
|
Methods Summary |
---|
protected boolean | isTokenChar(char c)
Collects only characters which satisfy
{@link Character#isLetter(char)} or which appear in the given charset.
{
    if (Character.isLetter(c))
        return true;
    // Fall back to the supplied charset for characters isLetter() rejects.
    for (int i = 0; i < charset.length; i++)
    {
        if (c == charset[i])
            return true;
    }
    return false;
}
|
|
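To show how the isTokenChar check affects tokenization, here is a minimal, library-free sketch. The class `ExtraCharsetTokenizerSketch` and its `tokenize` method are hypothetical stand-ins: they only approximate how a CharTokenizer groups consecutive token characters, and they leave out anything else the real base class may do (such as token length limits or normalization).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, library-free approximation of the tokenizer's behavior.
final class ExtraCharsetTokenizerSketch {

    // Same decision as isTokenChar above: accept JDK letters, then fall back
    // to a linear scan of the caller-supplied extra characters.
    static boolean isTokenChar(char c, char[] charset) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char extra : charset) {
            if (c == extra) {
                return true;
            }
        }
        return false;
    }

    // Splits the text into maximal runs of token characters.
    // This loop is an assumption about the base class, not its actual code.
    static List<String> tokenize(String text, char[] charset) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, charset)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "ab\u00D7cd 12 ef\u00F7";
        char[] extra = {'\u00D7', '\u00F7'};
        // With the extra charset, 0xD7 and 0xF7 stay inside their tokens.
        System.out.println(tokenize(text, extra));
        // Without it, they act as separators and split or truncate tokens.
        System.out.println(tokenize(text, new char[0]));
    }
}
```

The linear scan over the charset mirrors the original method; for a large charset a lookup set would be faster, but for a few dozen characters the difference is unlikely to matter.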