File | Doc | Category | Size | Date | Package
GermanStemmer.java | API Doc | Apache Lucene 1.4.3 | 9248 | Sun May 30 22:24:20 BST 2004 | org.apache.lucene.analysis.de

GermanStemmer

public class GermanStemmer extends Object
A stemmer for German words. The algorithm is based on the report "A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (joerg.caumanns@isst.fhg.de).
author
Gerhard Schwarz
version
$Id: GermanStemmer.java,v 1.11 2004/05/30 20:24:20 otis Exp $

Fields Summary
private StringBuffer
sb
Buffer for the terms while stemming them.
private int
substCount
Number of characters removed by substitute() while stemming.
Constructors Summary
Methods Summary
private boolean isStemmable(java.lang.String term)
Checks if a term could be stemmed.

return
true if, and only if, the given term consists only of letters.

      for ( int c = 0; c < term.length(); c++ ) {
        if ( !Character.isLetter( term.charAt( c ) ) )
          return false;
      }
      return true;
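
Because the method is private, it cannot be called directly from outside the class. The following is a minimal standalone sketch of the same check, wrapped in a hypothetical class named StemmableCheck (not part of Lucene), to show how umlauts, digits, hyphens, and the empty string are treated:

```java
// Hypothetical demo class; only the loop body is taken from the listing above.
class StemmableCheck {

    // Same logic as isStemmable(): every character must be a letter.
    static boolean isStemmable(String term) {
        for (int c = 0; c < term.length(); c++) {
            if (!Character.isLetter(term.charAt(c))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // \u00e4 is the umlaut a-diaeresis; Character.isLetter accepts it.
        System.out.println(isStemmable("H\u00e4user")); // true
        System.out.println(isStemmable("A4"));          // false, contains a digit
        System.out.println(isStemmable("E-Mail"));      // false, contains a hyphen
        System.out.println(isStemmable(""));            // true, the loop never runs
    }
}
```

Note that an empty term passes the check, so callers that need a non-empty term would have to test for that separately.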
    
private void optimize(java.lang.StringBuffer buffer)
Does some optimizations on the term. These optimizations are contextual.

      // Additional step for female plurals of professions and inhabitants.
      if ( buffer.length() > 5 && buffer.substring( buffer.length() - 5, buffer.length() ).equals( "erin*" ) ) {
        buffer.deleteCharAt( buffer.length() -1 );
        strip( buffer );
      }
      // Additional step for irregular plural nouns like "Matrizen -> Matrix".
      if ( buffer.charAt( buffer.length() - 1 ) == ( 'z' ) ) {
        buffer.setCharAt( buffer.length() - 1, 'x' );
      }
      }
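
The second step can be tried in isolation. The sketch below is a hypothetical standalone helper (the class OptimizeSketch and the method replaceTrailingZ are illustrative names, not Lucene API). It leaves out the "erin*" branch because that branch depends on strip(), which is not shown in this section, and it adds an empty-buffer guard that the listing does not have. The input "matriz" is an assumed intermediate form, chosen only to exercise the replacement.

```java
// Hypothetical demo class; covers only the trailing 'z' -> 'x' step.
class OptimizeSketch {

    // Replaces a trailing 'z' with 'x'. The length check is an addition
    // for safety; the original listing calls charAt() unconditionally.
    static void replaceTrailingZ(StringBuffer buffer) {
        if (buffer.length() > 0 && buffer.charAt(buffer.length() - 1) == 'z') {
            buffer.setCharAt(buffer.length() - 1, 'x');
        }
    }

    public static void main(String[] args) {
        StringBuffer term = new StringBuffer("matriz"); // assumed intermediate form
        replaceTrailingZ(term);
        System.out.println(term); // matrix

        StringBuffer other = new StringBuffer("haus");  // no trailing 'z'
        replaceTrailingZ(other);
        System.out.println(other); // haus
    }
}
```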
    
private void removeParticleDenotion(java.lang.StringBuffer buffer)
Removes a particle denotion ("ge") from a term.

      if ( buffer.length() > 4 ) {
        for ( int c = 0; c < buffer.length() - 3; c++ ) {
          if ( buffer.substring( c, c + 4 ).equals( "gege" ) ) {
            buffer.delete( c, c + 2 );
            return;
          }
        }
      }
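
As written, the loop only fires on the doubled sequence "gege" and removes its first two characters; a single "ge" is left alone. A minimal sketch, using a hypothetical wrapper class named ParticleSketch (not part of Lucene) around a copy of the loop above, makes this visible:

```java
// Hypothetical demo class; the method body is copied from the listing above.
class ParticleSketch {

    // Deletes the first "ge" of the first "gege" found, then stops.
    static void removeParticleDenotion(StringBuffer buffer) {
        if (buffer.length() > 4) {
            for (int c = 0; c < buffer.length() - 3; c++) {
                if (buffer.substring(c, c + 4).equals("gege")) {
                    buffer.delete(c, c + 2);
                    return;
                }
            }
        }
    }

    // Convenience wrapper so the examples read as String -> String.
    static String apply(String term) {
        StringBuffer buffer = new StringBuffer(term);
        removeParticleDenotion(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("gegessen"));  // gessen
        System.out.println(apply("angegeben")); // angeben
        System.out.println(apply("gemacht"));   // gemacht, no "gege" present
        System.out.println(apply("gege"));      // gege, length is not > 4
    }
}
```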
    
private void resubstitute(java.lang.StringBuffer buffer)
Undoes the changes made by substitute(), namely the character pairs and character combinations. Umlauts will remain as their corresponding vowel, as "Ã