File: SimilarityQueries.java
Doc: API Doc
Category: Apache Lucene 2.0.0
Size: 3744
Date: Fri May 26 09:54:02 BST 2006
Package: org.apache.lucene.search.similar

SimilarityQueries

java.lang.Object

public final class SimilarityQueries extends Object

Simple similarity measures.

see: MoreLikeThis

Fields Summary
Constructors Summary
private SimilarityQueries()

Methods Summary
public static org.apache.lucene.search.Query formSimilarQuery(java.lang.String body, org.apache.lucene.analysis.Analyzer a, java.lang.String field, java.util.Set stop)
Simple similarity query generators. Takes every unique word and forms a boolean query where all words are optional. After you get this you'll use it to query your {@link IndexSearcher} for similar docs. The only caveat is that the first hit returned should be your source document, which you'll then need to ignore.
So, if you have a code fragment like this:
Query q = formSimilarQuery( "I use Lucene to search fast. Fast searchers are good", new StandardAnalyzer(), "contents", null);
The query returned, in string form, will be '(i use lucene to search fast searchers are good)'.
The philosophy behind this method is "two documents are similar if they share lots of words". Note that behind the scenes, Lucene's scoring algorithm will tend to give two documents a higher similarity score if they share more uncommon words.
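The preference for uncommon words comes from inverse document frequency, which is one factor in the score. The following is a minimal sketch, assuming the formula idf = 1 + ln(numDocs / (docFreq + 1)) as used by Lucene's default similarity; the class IdfSketch and the document counts are hypothetical and only illustrate the trend.

```java
// Hypothetical helper illustrating inverse document frequency; not a Lucene class.
class IdfSketch {

    // Assumption: idf = 1 + ln(numDocs / (docFreq + 1)).
    // docFreq is the number of documents containing the word.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1000; // illustrative index size
        // A word found in most documents contributes little weight.
        System.out.println("common word (900 docs): " + idf(900, numDocs));
        // A word found in few documents contributes much more.
        System.out.println("rare word (4 docs): " + idf(4, numDocs));
    }
}
```

Because every clause in the generated query is optional, a document matching a few rare words can outscore one matching many common words.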
This method is fail-safe: if a long 'body' is passed in and {@link BooleanQuery#add BooleanQuery.add()} (used internally) throws {@link org.apache.lucene.search.BooleanQuery.TooManyClauses BooleanQuery.TooManyClauses}, the query built up to that point is returned.
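The word-collection and fail-safe behaviour described above can be tried without Lucene on the classpath. The following is a rough sketch under stated assumptions: the class UniqueWordsSketch, its method, and the maxClauses parameter are hypothetical, and lowercasing plus splitting on non-letters is only a crude stand-in for a real Analyzer (which may also drop its own stop words).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the word-collection step; not part of Lucene.
class UniqueWordsSketch {

    // Returns the unique words of body in first-seen order, skipping stop words
    // and keeping at most maxClauses words (mimicking the TooManyClauses fail-safe).
    static List<String> uniqueWords(String body, Set<String> stop, int maxClauses) {
        Set<String> already = new LinkedHashSet<>(); // drops duplicates, keeps order
        // Assumption: this simple tokenization approximates the Analyzer.
        for (String word : body.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) continue;                      // split may yield an empty token
            if (stop != null && stop.contains(word)) continue; // ignore optional stop words
            if (already.contains(word)) continue;              // ignore duplicates
            if (already.size() >= maxClauses) break;           // fail-safe: keep what we have
            already.add(word);
        }
        return new ArrayList<>(already);
    }

    public static void main(String[] args) {
        String body = "I use Lucene to search fast. Fast searchers are good";
        // All unique words, with a cap large enough not to matter here.
        System.out.println(uniqueWords(body, null, 1024));
        // With a stop word and a small cap, only the first three words survive.
        System.out.println(uniqueWords(body, Collections.singleton("to"), 3));
    }
}
```

In the real method each surviving word becomes an optional TermQuery clause on the chosen field, so the cap translates directly into the number of clauses in the returned query.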
param: body - the body of the document you want to find similar documents to
param: a - the analyzer to use to parse the body
param: field - the field you want to search on, probably something like "contents" or "body"
param: stop - optional set of stop words to ignore
return: a query with all unique words in 'body'
throws: IOException - this can't happen...
TokenStream ts = a.tokenStream( field, new StringReader( body));
org.apache.lucene.analysis.Token t;
BooleanQuery tmp = new BooleanQuery();
Set already = new HashSet(); // ignore dups
while ( (t = ts.next()) != null) {
    String word = t.termText();
    // ignore opt stop words
    if ( stop != null && stop.contains( word)) continue;
    // ignore dups
    if ( ! already.add( word)) continue;
    // add to query
    TermQuery tq = new TermQuery( new Term( field, word));
    try {
        tmp.add( tq, BooleanClause.Occur.SHOULD);
    } catch( BooleanQuery.TooManyClauses too) {
        // fail-safe, just return what we have, not the end of the world
        break;
    }
}
return tmp;