GermanStemmer
public class GermanStemmer extends Object
A stemmer for German words. The algorithm is based on the report
"A Fast and Simple Stemming Algorithm for German Words" by Jörg
Caumanns (joerg.caumanns@isst.fhg.de). |
Fields Summary |
---|
private StringBuffer | sb
Buffer for the terms while stemming them.
| private int | substCount
Number of characters that are removed with substitute() while stemming. |
Methods Summary |
---|
private boolean | isStemmable(java.lang.String term)
Checks if a term could be stemmed.
for ( int c = 0; c < term.length(); c++ ) {
if ( !Character.isLetter( term.charAt( c ) ) )
return false;
}
return true;
| private void | optimize(java.lang.StringBuffer buffer)
Does some optimizations on the term. These optimizations are
contextual.
// Additional step for female plurals of professions and inhabitants.
if ( buffer.length() > 5 && buffer.substring( buffer.length() - 5, buffer.length() ).equals( "erin*" ) ) {
buffer.deleteCharAt( buffer.length() -1 );
strip( buffer );
}
// Additional step for irregular plural nouns like "Matrizen -> Matrix".
if ( buffer.charAt( buffer.length() - 1 ) == ( 'z' ) ) {
buffer.setCharAt( buffer.length() - 1, 'x' );
}
| private void | removeParticleDenotion(java.lang.StringBuffer buffer)
Removes a participle prefix ("ge") from a term.
if ( buffer.length() > 4 ) {
for ( int c = 0; c < buffer.length() - 3; c++ ) {
if ( buffer.substring( c, c + 4 ).equals( "gege" ) ) {
buffer.delete( c, c + 2 );
return;
}
}
}
| private void | resubstitute(java.lang.StringBuffer buffer)
Undoes the changes made by substitute(). These changes affect character pairs and
character combinations. Umlauts will remain as their corresponding vowel. |
|
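The private helpers listed above cannot be called from outside the class, so the following is a minimal standalone sketch of how three of them behave. The class name GermanStemmerSketch, the method name replaceTrailingZ, and all sample inputs are hypothetical and used only for illustration. The method bodies copy the logic shown in the summary, except that replaceTrailingZ covers only the last step of optimize() and adds an empty-buffer guard that the listing does not have.

```java
public class GermanStemmerSketch {

    // Same check as isStemmable(): every character must be a letter.
    static boolean isStemmable(String term) {
        for (int c = 0; c < term.length(); c++) {
            if (!Character.isLetter(term.charAt(c))) {
                return false;
            }
        }
        return true;
    }

    // Same logic as removeParticleDenotion(): find the first "gege"
    // and delete its first two characters.
    static void removeParticleDenotion(StringBuffer buffer) {
        if (buffer.length() > 4) {
            for (int c = 0; c < buffer.length() - 3; c++) {
                if (buffer.substring(c, c + 4).equals("gege")) {
                    buffer.delete(c, c + 2);
                    return;
                }
            }
        }
    }

    // Only the final step of optimize(): a trailing 'z' becomes 'x'.
    // The length guard is an addition for this sketch.
    static void replaceTrailingZ(StringBuffer buffer) {
        if (buffer.length() > 0 && buffer.charAt(buffer.length() - 1) == 'z') {
            buffer.setCharAt(buffer.length() - 1, 'x');
        }
    }

    public static void main(String[] args) {
        System.out.println(isStemmable("Haus"));  // true
        System.out.println(isStemmable("Haus2")); // false, '2' is not a letter

        StringBuffer participle = new StringBuffer("gegeben");
        removeParticleDenotion(participle);
        System.out.println(participle); // geben

        StringBuffer plural = new StringBuffer("matriz"); // hypothetical intermediate form
        replaceTrailingZ(plural);
        System.out.println(plural); // matrix
    }
}
```

Note that removeParticleDenotion() only acts on a doubled "gege"; a term containing a single "ge" is left unchanged.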