File: ChineseFilter.java
Doc: API Doc
Category: Apache Lucene 1.9
Size: 2866
Date: Mon Feb 20 09:18:48 GMT 2006
Package: org.apache.lucene.analysis.cn

ChineseFilter

java.lang.Object
- org.apache.lucene.analysis.TokenStream
  - org.apache.lucene.analysis.TokenFilter

public final class ChineseFilter extends TokenFilter

Title: ChineseFilter
Description: Filter with a stop word table
Rules:
1. No digits are allowed.
2. An English word/token must be longer than 1 character.
3. Each Chinese character is treated as one Chinese word.
TO DO:
1. Add Chinese stop words, such as \ue400
2. Dictionary-based Chinese word extraction
3. Intelligent Chinese word extraction
Copyright: Copyright (c) 2001
Company:
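
The rules above depend on the Unicode general category of a token's first character, as reported by `java.lang.Character.getType`. The following is a minimal sketch of that classification, not part of Lucene: the class `CharTypeDemo` and its `classify` method are hypothetical names introduced for illustration only.

```java
// Hypothetical helper, not part of Lucene: labels a token by the Unicode
// general category of its first character.
class CharTypeDemo {
    static String classify(String text) {
        switch (Character.getType(text.charAt(0))) {
            case Character.LOWERCASE_LETTER:
            case Character.UPPERCASE_LETTER:
                return "cased letter";   // e.g. English words
            case Character.OTHER_LETTER:
                return "other letter";   // e.g. CJK ideographs
            case Character.DECIMAL_DIGIT_NUMBER:
                return "digit";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("lucene")); // cased letter
        System.out.println(classify("\u4e2d")); // other letter (a CJK ideograph)
        System.out.println(classify("2006"));   // digit
    }
}
```

Because a digit falls into neither letter category, a filter that only accepts the two letter groups drops numeric tokens without testing for them explicitly.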

author: Yiyi Sun
version: 1.0

Fields Summary
public static final String[] STOP_WORDS
private Hashtable stopTable
Constructors Summary
public ChineseFilter(TokenStream in)
    super(in);

    // Map each stop word to itself so that lookups in next() are a simple get().
    stopTable = new Hashtable(STOP_WORDS.length);
    for (int i = 0; i < STOP_WORDS.length; i++)
        stopTable.put(STOP_WORDS[i], STOP_WORDS[i]);
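
The constructor stores each stop word as both key and value, so the table is used only for membership tests. As a sketch of the same idea with a set, assuming a hypothetical class `StopTableSketch` and an illustrative word list (the actual contents of `STOP_WORDS` are not shown on this page):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of Lucene: a stop-word membership test
// backed by a HashSet instead of a Hashtable that maps words to themselves.
class StopTableSketch {
    // Illustrative entries only; the real STOP_WORDS array is not listed here.
    static final String[] SAMPLE_STOP_WORDS = { "and", "the", "of" };

    static final Set<String> STOP_SET =
            new HashSet<>(Arrays.asList(SAMPLE_STOP_WORDS));

    static boolean isStopWord(String text) {
        return STOP_SET.contains(text); // exact, case-sensitive match
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("the")); // true
        System.out.println(isStopWord("The")); // false
    }
}
```

As with the `Hashtable` lookup, the match is exact and case-sensitive, so it only catches stop words whose case matches the table entries.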
Methods Summary
public final org.apache.lucene.analysis.Token next()
    for (Token token = input.next(); token != null; token = input.next()) {
        String text = token.termText();

        // why not key off token type here assuming ChineseTokenizer comes first?
        if (stopTable.get(text) == null) {
            switch (Character.getType(text.charAt(0))) {

            case Character.LOWERCASE_LETTER:
            case Character.UPPERCASE_LETTER:

                // An English word/token must be longer than 1 character.
                if (text.length() > 1) {
                    return token;
                }
                break;

            case Character.OTHER_LETTER:

                // One Chinese character as one Chinese word.
                // Chinese word extraction to be added later here.
                return token;
            }
        }
    }

    // Input is exhausted; digits and other token types were skipped above.
    return null;
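
To try the filtering logic without a Lucene dependency, the loop can be approximated over plain strings. This is a sketch under stated assumptions, not the library's implementation: `FilterLoopSketch` and `filter` are hypothetical names, the stop words are supplied by the caller, and the empty-string guard is an addition that the original method does not have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone approximation of the next() loop, working on
// strings instead of Lucene Token objects.
class FilterLoopSketch {
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String text : tokens) {
            // The empty-string guard is an assumption added for safety.
            if (text.isEmpty() || stopWords.contains(text)) {
                continue;
            }
            switch (Character.getType(text.charAt(0))) {
                case Character.LOWERCASE_LETTER:
                case Character.UPPERCASE_LETTER:
                    if (text.length() > 1) {
                        kept.add(text); // English token longer than 1 character
                    }
                    break;
                case Character.OTHER_LETTER:
                    kept.add(text);     // one Chinese character as one word
                    break;
                default:
                    break;              // digits, punctuation, etc. are dropped
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        List<String> in = Arrays.asList("the", "a", "lucene", "2006", "\u4e2d", "\u6587");
        // Keeps "lucene" and the two Chinese characters; drops the rest.
        System.out.println(filter(in, stop));
    }
}
```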