File | Doc | Category | Size | Date | Package
CharTokenizer.java | API Doc | Apache Lucene 2.1.0 | 2936 | Wed Feb 14 10:46:38 GMT 2007 | org.apache.lucene.analysis

CharTokenizer

public abstract class CharTokenizer extends Tokenizer
An abstract base class for simple, character-oriented tokenizers.

Fields Summary
private int offset
private int bufferIndex
private int dataLen
private static final int MAX_WORD_LEN
private static final int IO_BUFFER_SIZE
private final char[] buffer
private final char[] ioBuffer
Constructors Summary
public CharTokenizer(Reader input)

    super(input);
  
Methods Summary
protected abstract boolean isTokenChar(char c)
Returns true iff a character should be included in a token. This tokenizer emits each maximal run of adjacent characters that satisfy this predicate as one token. Characters for which the predicate is false define token boundaries and are not included in tokens.
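
The predicate contract can be illustrated without any Lucene classes. The following is a minimal standalone sketch, not the library's code: the class name `PredicateSplitSketch`, the `split` helper, and the use of `IntPredicate` are assumptions made for illustration, and the code uses language features newer than those available when Lucene 2.1.0 was released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class PredicateSplitSketch {

    /** Collects maximal runs of characters that satisfy isTokenChar (hypothetical helper). */
    static List<String> split(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(c);               // inside a token
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // a boundary character ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());      // flush the last token at end of input
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Letters are token characters; punctuation, spaces and digits are boundaries.
        System.out.println(split("foo, bar-baz 42", Character::isLetter)); // prints [foo, bar, baz]
    }
}
```

Boundary characters never appear in the output, and consecutive boundary characters do not produce empty tokens.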

public final org.apache.lucene.analysis.Token next()
Returns the next token in the stream, or null at EOS.

    int length = 0;
    int start = offset;
    while (true) {
      final char c;

      offset++;
      if (bufferIndex >= dataLen) {
        dataLen = input.read(ioBuffer);
        bufferIndex = 0;
      }
      if (dataLen == -1) {
        if (length > 0)
          break;
        else
          return null;
      } else
        c = ioBuffer[bufferIndex++];

      if (isTokenChar(c)) {               // if it's a token char

        if (length == 0)                 // start of token
          start = offset - 1;

        buffer[length++] = normalize(c); // buffer it, normalized

        if (length == MAX_WORD_LEN)      // token buffer is full
          break;                         // emit what we have so far

      } else if (length > 0)             // at a non-token char with chars buffered
        break;                           // return them

    }

    return new Token(new String(buffer, 0, length), start, start + length);
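
To make the offset bookkeeping and the two buffers easier to follow, here is a standalone sketch of the same loop shape. It is an illustration under stated assumptions, not the Lucene implementation: the class `OffsetLoopSketch`, the `tokenize` helper, the string output format, and the deliberately tiny values chosen for `MAX_WORD_LEN` and `IO_BUFFER_SIZE` are all hypothetical, and the predicate is hard-coded to `Character.isLetter`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class OffsetLoopSketch {
    static final int MAX_WORD_LEN = 4;   // assumed tiny value, to show the overflow break
    static final int IO_BUFFER_SIZE = 3; // assumed tiny value, to force several refills

    private final Reader input;
    private final char[] buffer = new char[MAX_WORD_LEN];
    private final char[] ioBuffer = new char[IO_BUFFER_SIZE];
    private int offset = 0, bufferIndex = 0, dataLen = 0;

    OffsetLoopSketch(Reader input) { this.input = input; }

    /** Returns "text[start,end)" for the next run of letters, or null at end of stream. */
    String next() throws IOException {
        int length = 0;
        int start = offset;
        while (true) {
            if (bufferIndex >= dataLen) {            // I/O buffer exhausted: refill it
                dataLen = input.read(ioBuffer);
                bufferIndex = 0;
            }
            if (dataLen == -1) {                     // end of stream
                if (length > 0) break;               // emit the pending token first
                return null;
            }
            char c = ioBuffer[bufferIndex++];
            offset++;                                // here counted only for consumed chars
            if (Character.isLetter(c)) {
                if (length == 0) start = offset - 1; // first character of a new token
                buffer[length++] = c;
                if (length == MAX_WORD_LEN) break;   // token buffer full: emit early
            } else if (length > 0) {
                break;                               // boundary character ends the token
            }
        }
        return new String(buffer, 0, length) + "[" + start + "," + (start + length) + ")";
    }

    /** Convenience wrapper (hypothetical) that drains the stream into a list. */
    static List<String> tokenize(String text) {
        OffsetLoopSketch t = new OffsetLoopSketch(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            for (String tok = t.next(); tok != null; tok = t.next()) out.add(tok);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ab[0,2), abcd[3,7), efg[7,10), x[11,12)]
        System.out.println(tokenize("ab abcdefg x"));
    }
}
```

With these values the run `abcdefg` is longer than the token buffer, so it comes back as two tokens, `abcd` and `efg`, whose offsets are contiguous. Tokens also span I/O buffer refills without any special handling, because the refill check runs before every character is consumed.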
  
protected char normalize(char c)
Called on each token character to normalize it before it is added to the token. The default implementation returns the character unchanged. Subclasses may override this to, e.g., lowercase tokens.

    return c;
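
How the two hooks combine in a subclass can be sketched with a small stand-in for the base class. Everything named here (`NormalizeHookSketch`, `MiniCharTokenizer`, `LowerCaseLetters`, and the `tokenize` method) is hypothetical and exists only for illustration; a real subclass would extend `CharTokenizer` and inherit `next()` instead.

```java
import java.util.ArrayList;
import java.util.List;

public class NormalizeHookSketch {

    /** Hypothetical stand-in for the base class: two hooks and one fixed loop. */
    static abstract class MiniCharTokenizer {
        protected abstract boolean isTokenChar(char c);

        /** Default hook: identity, matching the listing's default. */
        protected char normalize(char c) { return c; }

        final List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (char c : text.toCharArray()) {
                if (isTokenChar(c)) {
                    current.append(normalize(c)); // predicate sees the raw char, token gets the normalized one
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) tokens.add(current.toString());
            return tokens;
        }
    }

    /** Keeps letters and lowercases them. */
    static class LowerCaseLetters extends MiniCharTokenizer {
        @Override protected boolean isTokenChar(char c) { return Character.isLetter(c); }
        @Override protected char normalize(char c) { return Character.toLowerCase(c); }
    }

    public static void main(String[] args) {
        System.out.println(new LowerCaseLetters().tokenize("Hello, WORLD x2")); // prints [hello, world, x]
    }
}
```

As in the listing, the predicate is applied to the original character and normalization happens afterwards, so `normalize` cannot change which characters count as part of a token.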