File: GermanStemmer.java
Doc: API Doc
Category: Apache Lucene 1.9
Size: 9354
Date: Mon Feb 20 09:18:54 GMT 2006
Package: org.apache.lucene.analysis.de

GermanStemmer

public class GermanStemmer extends Object
A stemmer for German words. The algorithm is based on the report "A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (joerg.caumanns at isst.fhg.de).
author
Gerhard Schwarz
version
$Id: GermanStemmer.java 178239 2005-05-24 18:44:20Z dnaber $

Fields Summary
private StringBuffer
sb
Buffer for the terms while stemming them.
private int
substCount
Number of characters that are removed with substitute() while stemming.
Constructors Summary
Methods Summary
private boolean isStemmable(java.lang.String term)
Checks if a term could be stemmed.

return
true if, and only if, the given term consists only of letters.

      for ( int c = 0; c < term.length(); c++ ) {
        if ( !Character.isLetter( term.charAt( c ) ) )
          return false;
      }
      return true;
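
The check above can be tried outside the class with a minimal standalone sketch. The class name StemmableCheck is hypothetical and exists only for illustration; the real method is private to GermanStemmer.

```java
public class StemmableCheck {
    // Mirrors the loop above: a term is stemmable only if every character is a letter.
    static boolean isStemmable(String term) {
        for (int c = 0; c < term.length(); c++) {
            if (!Character.isLetter(term.charAt(c))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "H\u00e4user" is "Häuser"; umlauts count as letters.
        System.out.println(isStemmable("H\u00e4user")); // prints true
        System.out.println(isStemmable("Haus2"));       // prints false (contains a digit)
        System.out.println(isStemmable(""));            // prints true (the loop never runs)
    }
}
```

Note that an empty string passes the check, because the loop body is never entered.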
    
private void optimize(java.lang.StringBuffer buffer)
Does some optimizations on the term. These optimizations are contextual.

      // Additional step for female plurals of professions and inhabitants.
      if ( buffer.length() > 5 && buffer.substring( buffer.length() - 5, buffer.length() ).equals( "erin*" ) ) {
        buffer.deleteCharAt( buffer.length() -1 );
        strip( buffer );
      }
      // Additional step for irregular plural nouns like "Matrizen -> Matrix".
      if ( buffer.charAt( buffer.length() - 1 ) == ( 'z' ) ) {
        buffer.setCharAt( buffer.length() - 1, 'x' );
      }
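
The final step of optimize() can be sketched in isolation as follows. This is a simplified illustration, not the class's own code: the class OptimizeSketch and the helper replaceTrailingZ are hypothetical names, the "erin*" branch is omitted because it depends on strip(), and the input is assumed to have had its plural suffix removed already.

```java
public class OptimizeSketch {
    // Hypothetical helper: only the trailing "z" -> "x" step of optimize().
    static String replaceTrailingZ(String term) {
        StringBuilder buffer = new StringBuilder(term);
        // The length guard is an addition for this sketch: charAt() on an
        // empty buffer would throw StringIndexOutOfBoundsException.
        if (buffer.length() > 0 && buffer.charAt(buffer.length() - 1) == 'z') {
            buffer.setCharAt(buffer.length() - 1, 'x');
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Assumes "matrizen" was already reduced to "matriz" by an earlier step.
        System.out.println(replaceTrailingZ("matriz")); // prints matrix
        System.out.println(replaceTrailingZ("haus"));   // prints haus (unchanged)
    }
}
```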
    
private void removeParticleDenotion(java.lang.StringBuffer buffer)
Removes a particle denotion ("ge") from a term.

      if ( buffer.length() > 4 ) {
        for ( int c = 0; c < buffer.length() - 3; c++ ) {
          if ( buffer.substring( c, c + 4 ).equals( "gege" ) ) {
            buffer.delete( c, c + 2 );
            return;
          }
        }
      }
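
As written, the loop above removes a "ge" only where it is immediately followed by a second "ge", and only at the first such position. A minimal standalone sketch of that behavior, using a hypothetical wrapper class ParticleSketch that takes and returns a String instead of mutating a shared buffer:

```java
public class ParticleSketch {
    // Same logic as the method above, applied to a local StringBuilder.
    static String removeParticleDenotion(String term) {
        StringBuilder buffer = new StringBuilder(term);
        if (buffer.length() > 4) {
            for (int c = 0; c < buffer.length() - 3; c++) {
                if (buffer.substring(c, c + 4).equals("gege")) {
                    buffer.delete(c, c + 2); // drop the first "ge" of the pair
                    break;                   // only the first match is handled
                }
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeParticleDenotion("gegeben"));   // prints geben
        System.out.println(removeParticleDenotion("angegeben")); // prints angeben
        System.out.println(removeParticleDenotion("gelaufen"));  // prints gelaufen (no "gege")
    }
}
```

Terms of four characters or fewer are returned unchanged because of the length check.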
    
private void resubstitute(java.lang.StringBuffer buffer)
Undoes the changes made by substitute(). These are character pairs and character combinations. Umlauts will remain as their corresponding vowel, as "Ã