File | Doc | Category | Size | Date | Package
RussianAnalyzer.java | API Doc | Apache Lucene 2.1.0 | 7333 | Wed Feb 14 10:46:28 GMT 2007 | org.apache.lucene.analysis.ru

RussianAnalyzer

public final class RussianAnalyzer extends Analyzer
Analyzer for the Russian language. Supports an external list of stopwords (words that will not be indexed at all). A default set of stopwords is used unless an alternative list is specified.
author
Boris Okner, b.okner@rogers.com
version
$Id: RussianAnalyzer.java 472959 2006-11-09 16:21:50Z yonik $

Fields Summary
private static final char
A
private static final char
B
private static final char
V
private static final char
G
private static final char
D
private static final char
E
private static final char
ZH
private static final char
Z
private static final char
I
private static final char
I_
private static final char
K
private static final char
L
private static final char
M
private static final char
N
private static final char
O
private static final char
P
private static final char
R
private static final char
S
private static final char
T
private static final char
U
private static final char
X
private static final char
CH
private static final char
SH
private static final char
SHCH
private static final char
Y
private static final char
SOFT
private static final char
AE
private static final char
IU
private static final char
IA
private static char[][]
RUSSIAN_STOP_WORDS
List of typical Russian stopwords.
private Set
stopSet
Contains the stopwords used with the StopFilter.
private char[]
charset
Charset for Russian letters. Represents the encoding of the 32 lowercase Russian letters. Predefined charsets can be taken from the RussianCharsets class.
Constructors Summary
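public RussianAnalyzer()
Builds an analyzer with the Unicode Russian charset (RussianCharsets.UnicodeRussian) and the default stopwords.
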
        charset = RussianCharsets.UnicodeRussian;
        stopSet = StopFilter.makeStopSet(
                    makeStopWords(RussianCharsets.UnicodeRussian));
    
public RussianAnalyzer(char[] charset)
Builds an analyzer with the given charset and the default stopwords.

        this.charset = charset;
        stopSet = StopFilter.makeStopSet(makeStopWords(charset));
    
public RussianAnalyzer(char[] charset, String[] stopwords)
Builds an analyzer with the given stop words.

        this.charset = charset;
        stopSet = StopFilter.makeStopSet(stopwords);
    
public RussianAnalyzer(char[] charset, Hashtable stopwords)
Builds an analyzer with the given stop words.

todo
create a Set version of this ctor

        this.charset = charset;
        stopSet = new HashSet(stopwords.keySet());
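
The Hashtable constructor copies only the table's keys into the stop set, so the values are ignored. A minimal standalone sketch of that behavior follows; the class HashtableStopSetSketch and the method toStopSet are hypothetical names used for illustration, and the generic types are an addition (the listing above uses raw types).

```java
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

// Hypothetical helper, not part of Lucene: mirrors "new HashSet(stopwords.keySet())".
final class HashtableStopSetSketch {

    // Copies the keys of the table into a fresh set; values are never read.
    static Set<String> toStopSet(Hashtable<String, ?> stopwords) {
        return new HashSet<String>(stopwords.keySet());
    }

    public static void main(String[] args) {
        Hashtable<String, String> table = new Hashtable<String, String>();
        table.put("a", "value is ignored");
        table.put("b", "value is ignored");
        // Only the keys "a" and "b" end up in the stop set.
        System.out.println(toStopSet(table));
    }
}
```

Because the set is a copy, later changes to the Hashtable do not affect the stop set built from it.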
    
Methods Summary
private static java.lang.String[] makeStopWords(char[] charset)

        String[] res = new String[RUSSIAN_STOP_WORDS.length];
        for (int i = 0; i < res.length; i++)
        {
            char[] theStopWord = RUSSIAN_STOP_WORDS[i];
            // translate the word, using the charset
            StringBuffer theWord = new StringBuffer();
            for (int j = 0; j < theStopWord.length; j++)
            {
                theWord.append(charset[theStopWord[j]]);
            }
            res[i] = theWord.toString();
        }
        return res;
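
The loop above treats each stored stopword as a sequence of indices into the charset array, which is how one stopword table can serve several encodings. The following is a minimal sketch of that idea under stated assumptions: the three-letter charset, the index values of A, B and V, and the two sample words are invented for illustration and are not the values used by RussianAnalyzer.

```java
import java.util.Arrays;

// Hypothetical stand-in for the analyzer's private tables; all values are illustrative.
final class StopWordTranslationSketch {

    // Assumed letter indices, playing the role of the private constants A, B, V, ...
    static final char A = 0;
    static final char B = 1;
    static final char V = 2;

    // Assumed toy charset: position i holds the encoded character for letter index i.
    // \u0430, \u0431, \u0432 are the Cyrillic lowercase letters a, be, ve.
    static final char[] TOY_CHARSET = {'\u0430', '\u0431', '\u0432'};

    // Words are stored as letter indices, not as text.
    static final char[][] TOY_STOP_WORDS = { {V, A}, {B, A, B, A} };

    // Translates every indexed word into a String by looking up each index in the charset.
    static String[] makeStopWords(char[] charset, char[][] indexedWords) {
        String[] res = new String[indexedWords.length];
        for (int i = 0; i < res.length; i++) {
            StringBuilder word = new StringBuilder();
            for (char letterIndex : indexedWords[i]) {
                word.append(charset[letterIndex]);
            }
            res[i] = word.toString();
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(makeStopWords(TOY_CHARSET, TOY_STOP_WORDS)));
    }
}
```

Passing a different charset array with the same index layout yields the same words in another encoding, without touching the stopword table.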
    
public org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
Creates a TokenStream which tokenizes all the text in the provided Reader.

return
A TokenStream built from a RussianLetterTokenizer filtered with RussianLowerCaseFilter, StopFilter, and RussianStemFilter.

        TokenStream result = new RussianLetterTokenizer(reader, charset);
        result = new RussianLowerCaseFilter(result, charset);
        result = new StopFilter(result, stopSet);
        result = new RussianStemFilter(result, charset);
        return result;
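
The order of the chain matters: tokens are lowercased before the stop filter sees them, and stemming happens last, so stopwords are compared against lowercased, unstemmed tokens. The sketch below imitates the first three stages in plain Java to make that ordering visible. It is an analogy only, not the Lucene API: AnalysisChainSketch and analyze are hypothetical names, Character.isLetter stands in for the charset-based letter test, and the stemming stage is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical plain-Java imitation of tokenizer -> lowercase -> stop filter.
final class AnalysisChainSketch {

    static List<String> analyze(String text, Set<String> stopSet) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        // Iterate one position past the end so the final token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            boolean isLetter = i < text.length() && Character.isLetter(text.charAt(i));
            if (isLetter) {
                // Stages 1 and 2: collect letters and lowercase them.
                current.append(Character.toLowerCase(text.charAt(i)));
            } else if (current.length() > 0) {
                String token = current.toString();
                current.setLength(0);
                // Stage 3: drop the token if it is a stopword.
                if (!stopSet.contains(token)) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stop set containing the single Russian conjunction "\u0438" (and).
        Set<String> stopSet = new HashSet<String>(Arrays.asList("\u0438"));
        // Input text: three words, the middle one being the stopword.
        String text = "\u041A\u043E\u0442 \u0438 \u0414\u043E\u043C";
        System.out.println(analyze(text, stopSet));
    }
}
```

One consequence of this ordering is that a custom stopword list should contain lowercase, unstemmed forms; an uppercase entry would never match.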