File: RussianLetterTokenizer.java
Doc: API Doc
Category: Apache Lucene 2.1.0
Size: 2029
Date: Wed Feb 14 10:46:28 GMT 2007
Package: org.apache.lucene.analysis.ru

RussianLetterTokenizer

public class RussianLetterTokenizer extends CharTokenizer
A RussianLetterTokenizer is a tokenizer that extends the behavior of LetterTokenizer by additionally looking up letters in a given "Russian charset". The problem with LetterTokenizer is that it relies on the Character.isLetter() method, which cannot detect letters in encodings such as CP1252 and KOI8 (a well-known problem with the 0xD7 and 0xF7 chars).
author
Boris Okner, b.okner@rogers.com
version
$Id: RussianLetterTokenizer.java 472959 2006-11-09 16:21:50Z yonik $
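
The 0xD7 and 0xF7 problem described above can be reproduced with the standard library alone. The sketch below assumes the two bytes reach the tokenizer decoded as ISO-8859-1, which is only one way the mismatch can arise; the class name IsLetterDemo and the helper latin1 are hypothetical and not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

class IsLetterDemo {
    // Hypothetical helper: decodes a single byte as ISO-8859-1,
    // which maps byte value 0xNN to the code point U+00NN.
    static char latin1(int b) {
        return new String(new byte[] {(byte) b}, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        char d7 = latin1(0xD7); // U+00D7, the multiplication sign
        char f7 = latin1(0xF7); // U+00F7, the division sign

        // Both are math symbols in Unicode, so isLetter rejects them.
        System.out.println("0xD7 isLetter: " + Character.isLetter(d7));
        System.out.println("0xF7 isLetter: " + Character.isLetter(f7));

        // The Cyrillic letters CP1251 assigns to those bytes are letters.
        System.out.println("U+0427 isLetter: " + Character.isLetter('\u0427'));
        System.out.println("U+0447 isLetter: " + Character.isLetter('\u0447'));
    }
}
```

A tokenizer that relies on Character.isLetter() alone would therefore split a word at either of these two characters, which is presumably why the class accepts an extra charset to consult.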

Fields Summary
private char[]
charset
Additional characters that are treated as token characters, as supplied to the constructor.
Constructors Summary
public RussianLetterTokenizer(Reader in, char[] charset)

        super(in);
        this.charset = charset;
    
Methods Summary
protected boolean isTokenChar(char c)
Collects characters which satisfy {@link Character#isLetter(char)} or which appear in the charset passed to the constructor.

        if (Character.isLetter(c))
            return true;
        for (int i = 0; i < charset.length; i++)
        {
            if (c == charset[i])
                return true;
        }
        return false;
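
The membership test above can be tried outside Lucene with a small stand-alone sketch. The class CharsetLetterSketch and its tokenize method are hypothetical names introduced for illustration only; tokenize is a simplified stand-in for the token collection done by CharTokenizer and does not reproduce its buffering or any other behavior.

```java
import java.util.ArrayList;
import java.util.List;

class CharsetLetterSketch {
    private final char[] charset;

    CharsetLetterSketch(char[] charset) {
        this.charset = charset;
    }

    // Same test as isTokenChar in the listing: a Unicode letter,
    // or any character found in the supplied charset.
    boolean isTokenChar(char c) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char extra : charset) {
            if (c == extra) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: splits text into maximal runs of token characters.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumption: U+00D7 and U+00F7 are the extra characters to accept.
        CharsetLetterSketch sketch =
                new CharsetLetterSketch(new char[] {'\u00D7', '\u00F7'});
        System.out.println(sketch.tokenize("ab\u00D7cd 12 \u00F7e"));
    }
}
```

The linear scan keeps the sketch close to the listing; for a large charset, a sorted array with binary search or a set lookup would be a reasonable alternative.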