Simple similarity query generators.
Takes every unique word and forms a boolean query where all words are optional.
Once you have the query, use it to search your {@link IndexSearcher} for similar docs.
The only caveat is that the first hit returned should be your source document, so
you will need to ignore it.
So, if you have a code fragment like this:
Query q = formSimilarQuery( "I use Lucene to search fast. Fast searchers are good", new StandardAnalyzer(), "contents", null);
The query returned, in string form, will be '(i use lucene to search fast searchers are good)'.
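The word handling described above can be sketched in plain Java, without any Lucene dependency. This is only an illustration: the class `SimilarWordsSketch`, the method `uniqueWords`, and the regex-based splitting are assumptions standing in for whatever the analyzer actually does, and a real analyzer may tokenize or filter differently.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: mimics "every unique word, all optional".
final class SimilarWordsSketch {

    /**
     * Returns the unique lowercased words of body in first-seen order,
     * skipping any word found in the optional stop set (which may be null).
     */
    static String uniqueWords(String body, Set<String> stop) {
        Set<String> seen = new LinkedHashSet<>(); // keeps insertion order, drops dups
        // Assumption: split on runs of anything that is not a letter or digit.
        for (String raw : body.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            String word = raw.toLowerCase(Locale.ROOT);
            if (stop != null && stop.contains(word)) {
                continue; // skip optional stop words
            }
            seen.add(word);
        }
        return String.join(" ", seen);
    }

    public static void main(String[] args) {
        System.out.println(
            uniqueWords("I use Lucene to search fast. Fast searchers are good", null));
    }
}
```

With a null stop set, the repeated word "fast" appears once in the result, which mirrors the string form quoted above.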
The philosophy behind this method is "two documents are similar if they share lots of words".
Note that behind the scenes, Lucene's scoring algorithm will tend to give two documents a higher similarity score if they share more uncommon words.
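To see why uncommon words weigh more, here is a minimal sketch of an inverse-document-frequency weight. It assumes the classic TF-IDF style formula `1 + ln(numDocs / (docFreq + 1))`; the class `IdfSketch` and its method are hypothetical, and the similarity implementation configured in a given Lucene version may compute this differently.

```java
// Hypothetical illustration, not Lucene code: a classic IDF-style term weight.
final class IdfSketch {

    /**
     * Weight of a term that occurs in docFreq of numDocs documents.
     * Assumed formula: 1 + ln(numDocs / (docFreq + 1)).
     */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // A rare term (1 document) versus a common term (500 documents).
        System.out.println("rare:   " + idf(1, numDocs));
        System.out.println("common: " + idf(500, numDocs));
    }
}
```

Under this assumption a term found in only a handful of documents contributes a much larger weight than one found in half the index, so two documents that share rare words score as more similar.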
This method is fail-safe: if a long 'body' is passed in and
{@link BooleanQuery#add BooleanQuery.add()} (used internally)
throws
{@link org.apache.lucene.search.BooleanQuery.TooManyClauses BooleanQuery.TooManyClauses}, the
query built up to that point is returned.
TokenStream ts = a.tokenStream( field, new StringReader( body));
org.apache.lucene.analysis.Token t;
BooleanQuery tmp = new BooleanQuery();
Set already = new HashSet(); // ignore dups
while ( (t = ts.next()) != null)
{
    String word = t.termText();
    // ignore opt stop words
    if ( stop != null &&
         stop.contains( word)) continue;
    // ignore dups
    if ( ! already.add( word)) continue;
    // add to query
    TermQuery tq = new TermQuery( new Term( field, word));
    try
    {
        tmp.add( tq, BooleanClause.Occur.SHOULD);
    }
    catch( BooleanQuery.TooManyClauses too)
    {
        // fail-safe, just return what we have, not the end of the world
        break;
    }
}
return tmp;
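The fail-safe behaviour in the loop above can be tried out without Lucene using a small stand-in. Everything here is hypothetical: `CappedQuery` and `TooManyClausesException` only imitate a query that rejects clauses past a limit, and the limit is passed in explicitly so the effect is easy to see.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins, not Lucene classes.
final class FailSafeQuerySketch {

    /** Imitates the exception thrown when a query has too many clauses. */
    static final class TooManyClausesException extends RuntimeException {
    }

    /** Imitates a boolean query that refuses clauses beyond maxClauses. */
    static final class CappedQuery {
        private final int maxClauses;
        private final List<String> clauses = new ArrayList<>();

        CappedQuery(int maxClauses) {
            this.maxClauses = maxClauses;
        }

        void add(String term) {
            if (clauses.size() >= maxClauses) {
                throw new TooManyClausesException();
            }
            clauses.add(term);
        }

        List<String> clauses() {
            return clauses;
        }
    }

    /** Adds each unique word as a clause; on overflow, keeps what was built so far. */
    static CappedQuery build(Iterable<String> words, int maxClauses) {
        CappedQuery query = new CappedQuery(maxClauses);
        Set<String> already = new HashSet<>(); // ignore dups
        for (String word : words) {
            if (!already.add(word)) {
                continue;
            }
            try {
                query.add(word);
            } catch (TooManyClausesException too) {
                break; // fail-safe: stop adding and return the partial query
            }
        }
        return query;
    }
}
```

For example, building from the words a, b, a, c, d with a limit of 3 yields a query holding a, b, and c: the duplicate is skipped and the fourth unique word triggers the early exit rather than an error for the caller.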