MoreLikeThis.java (Apache Lucene 2.1.0 API Doc, package org.apache.lucene.search.similar)

MoreLikeThis

public final class MoreLikeThis extends Object
Generate "more like this" similarity queries. Based on this mail:
Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the docFreq() of every term in the document is
probably too slow.

You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.

It all depends on what you're trying to do. If you're trying to eke out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to
provide a "more like this" button on a search results page that does a decent job and has
good performance, such techniques might be useful.

An efficient, effective "more-like-this" query generator would be a great contribution, if
anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text), an Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.

Doug

Initial Usage

This class has lots of options to try to make it efficient and flexible. See the body of {@link #main main()} below in the source for real code, or if you want pseudo code, the simplest possible usage is as follows. The fragment that creates and uses MoreLikeThis is the part specific to this class.

IndexReader ir = ...
IndexSearcher is = ...

MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ... // orig source of doc you want to find similarities to
Query query = mlt.like( target);

Hits hits = is.search(query);
// now the usual iteration thru 'hits' - the only thing to watch for is to make sure
// you ignore the doc if it matches your 'target' document, as it should be similar to itself

Thus you:
  1. do your normal, Lucene setup for searching,
  2. create a MoreLikeThis,
  3. get the text of the doc you want to find similarities to,
  4. then call one of the like() calls to generate a similarity query,
  5. call the searcher to find the similar docs.
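
Spelled out as compilable code, the same flow might look like the sketch below. This is only a sketch: the index path, the target file name, and the "title" field are placeholder assumptions, not part of this class.

import java.io.FileReader;
import java.io.Reader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class MltExample {
    public static void main(String[] args) throws Exception {
        IndexReader ir = IndexReader.open("/path/to/index"); // placeholder index path
        IndexSearcher is = new IndexSearcher(ir);

        MoreLikeThis mlt = new MoreLikeThis(ir);
        Reader target = new FileReader("target.txt"); // placeholder source document
        Query query = mlt.like(target);

        Hits hits = is.search(query);
        for (int i = 0; i < hits.length(); i++) {
            // remember to skip the hit if it is the target document itself
            System.out.println(hits.score(i) + " : " + hits.doc(i).get("title")); // "title" is an assumed stored field
        }
    }
}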

More Advanced Usage

You may want to use {@link #setFieldNames setFieldNames(...)} so you can examine multiple fields (e.g. body and title) for similarity.

Depending on the size of your index and the size and makeup of your documents you may want to call the other set methods to control how the similarity queries are generated (see the sketch after the list):

  • {@link #setMinTermFreq setMinTermFreq(...)}
  • {@link #setMinDocFreq setMinDocFreq(...)}
  • {@link #setMinWordLen setMinWordLen(...)}
  • {@link #setMaxWordLen setMaxWordLen(...)}
  • {@link #setMaxQueryTerms setMaxQueryTerms(...)}
  • {@link #setMaxNumTokensParsed setMaxNumTokensParsed(...)}
  • {@link #setStopWords setStopWords(...)}
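
For example, a configuration sketch continuing the mlt instance from the Initial Usage example. The analyzer choice, field names, and threshold values below are illustrative assumptions, not recommended defaults (StandardAnalyzer is in org.apache.lucene.analysis.standard):

mlt.setAnalyzer(new StandardAnalyzer());           // assumed analyzer choice
mlt.setFieldNames(new String[] {"title", "body"}); // assumed field names
mlt.setMinTermFreq(2);           // ignore terms occurring fewer than 2 times in the source doc
mlt.setMinDocFreq(5);            // ignore terms appearing in fewer than 5 docs in the index
mlt.setMinWordLen(3);            // skip words shorter than 3 characters
mlt.setMaxWordLen(0);            // 0 = no maximum word length
mlt.setMaxQueryTerms(25);        // cap the generated query at 25 terms
mlt.setMaxNumTokensParsed(5000); // bound tokenization for fields without term vectors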

Changes: Mark Harwood 29/02/04
Some bugfixing, some refactoring, some optimisation.
- bugfix: retrieveTerms(int docNum) was not working for indexes without a termvector - added missing code
- bugfix: no significant terms were being created for fields with a termvector, because
only one occurrence per term/field pair was counted in the calculations (i.e. frequency info from the TermVector was not included)
- refactor: moved common code into isNoiseWord()
- optimise: when no termvector support is available, use maxNumTokensParsed to limit the amount of tokenization
author
David Spencer
author
Bruce Ritchie
author
Mark Harwood

Fields Summary
public static final int
DEFAULT_MAX_NUM_TOKENS_PARSED
Default maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
public static final Analyzer
DEFAULT_ANALYZER
Default analyzer to parse source doc with.
public static final int
DEFAULT_MIN_TERM_FREQ
Ignore terms with less than this frequency in the source doc.
public static final int
DEFAULT_MIN_DOC_FREQ
Ignore words which do not occur in at least this many docs.
public static final boolean
DEFAULT_BOOST
Boost terms in query based on score.
public static final String[]
DEFAULT_FIELD_NAMES
Default field names. Null is used to specify that the field names should be looked up at runtime from the provided reader.
public static final int
DEFAULT_MIN_WORD_LENGTH
Ignore words shorter than this length; if 0, this has no effect.
public static final int
DEFAULT_MAX_WORD_LENGTH
Ignore words longer than this length; if 0, this has no effect.
public static final Set
DEFAULT_STOP_WORDS
Default set of stopwords. If null, stop words are allowed.
private Set
stopWords
Current set of stop words.
public static final int
DEFAULT_MAX_QUERY_TERMS
Return a Query with no more than this many terms.
private Analyzer
analyzer
Analyzer that will be used to parse the doc.
private int
minTermFreq
Ignore words less frequent than this.
private int
minDocFreq
Ignore words which do not occur in at least this many docs.
private boolean
boost
Should we apply a boost to the Query based on the scores?
private String[]
fieldNames
Field names we'll analyze.
private int
maxNumTokensParsed
The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
private int
minWordLen
Ignore words shorter than this length.
private int
maxWordLen
Ignore words longer than this length.
private int
maxQueryTerms
Don't return a query longer than this.
private Similarity
similarity
For idf() calculations.
private final IndexReader
ir
IndexReader to use
Constructors Summary
public MoreLikeThis(IndexReader ir)
Constructor requiring an IndexReader.

        this.ir = ir;
Methods Summary
private void addTermFrequencies(java.util.Map termFreqMap, org.apache.lucene.index.TermFreqVector vector)
Adds terms and frequencies found in vector into the Map termFreqMap

param
termFreqMap a Map of terms and their frequencies
param
vector List of terms and their frequencies for a doc/field

		String[] terms = vector.getTerms();
		int[] freqs = vector.getTermFrequencies();
		for (int j = 0; j < terms.length; j++) {
			String term = terms[j];
			if (isNoiseWord(term)) {
				continue;
			}
			// increment frequency
			Int cnt = (Int) termFreqMap.get(term);
			if (cnt == null) {
				cnt = new Int();
				termFreqMap.put(term, cnt);
				cnt.x = freqs[j];
			}
			else {
				cnt.x += freqs[j];
			}
		}
	
private void addTermFrequencies(java.io.Reader r, java.util.Map termFreqMap, java.lang.String fieldName)
Adds term frequencies found by tokenizing text from reader into the Map words

param
r a source of text to be tokenized
param
termFreqMap a Map of terms and their frequencies
param
fieldName Used by analyzer for any special per-field analysis

			TokenStream ts = analyzer.tokenStream(fieldName, r);
			org.apache.lucene.analysis.Token token;
			int tokenCount = 0;
			while ((token = ts.next()) != null) { // for every token
				String word = token.termText();
				tokenCount++;
				if(tokenCount>maxNumTokensParsed)
				{
					break;
				}
				if(isNoiseWord(word)){
					continue;
				}
				
				// increment frequency
				Int cnt = (Int) termFreqMap.get(word);
				if (cnt == null) {
					termFreqMap.put(word, new Int());
				}
				else {
					cnt.x++;
				}
			}
	
private org.apache.lucene.search.Query createQuery(org.apache.lucene.util.PriorityQueue q)
Create the "more like this" query from a PriorityQueue.

        BooleanQuery query = new BooleanQuery();
        Object cur;
        int qterms = 0;
        float bestScore = 0;

        while (((cur = q.pop()) != null)) {
            Object[] ar = (Object[]) cur;
            TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));

            if (boost) {
                if (qterms == 0) {
                    bestScore = ((Float) ar[2]).floatValue();
                }
                float myScore = ((Float) ar[2]).floatValue();

                tq.setBoost(myScore / bestScore);
            }

            try {
                query.add(tq, BooleanClause.Occur.SHOULD);
            }
            catch (BooleanQuery.TooManyClauses ignore) {
                break;
            }

            qterms++;
            if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
                break;
            }
        }

        return query;
    
private org.apache.lucene.util.PriorityQueue createQueue(java.util.Map words)
Create a PriorityQueue from a word->tf map.

param
words a map of words keyed on the word (String) with Int objects as the values.

        // have collected all words in doc and their freqs
        int numDocs = ir.numDocs();
        FreqQ res = new FreqQ(words.size()); // will order words by score

        Iterator it = words.keySet().iterator();
        while (it.hasNext()) { // for every word
            String word = (String) it.next();

            int tf = ((Int) words.get(word)).x; // term freq in the source doc
            if (minTermFreq > 0 && tf < minTermFreq) {
                continue; // filter out words that don't occur enough times in the source
            }

            // go through all the fields and find the largest document frequency
            String topField = fieldNames[0];
            int docFreq = 0;
            for (int i = 0; i < fieldNames.length; i++) {
                int freq = ir.docFreq(new Term(fieldNames[i], word));
                topField = (freq > docFreq) ? fieldNames[i] : topField;
                docFreq = (freq > docFreq) ? freq : docFreq;
            }

            if (minDocFreq > 0 && docFreq < minDocFreq) {
                continue; // filter out words that don't occur in enough docs
            }

            if (docFreq == 0) {
                continue; // index update problem?
            }

            float idf = similarity.idf(docFreq, numDocs);
            float score = tf * idf;
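            // e.g. with the stock DefaultSimilarity, idf = ln(numDocs / (docFreq + 1)) + 1;
            // tf=3, docFreq=10, numDocs=1000 would give idf of about 5.5 and a score of about 16.5
            // (illustrative only; the actual value depends on the Similarity implementation in use)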

            // only really need 1st 3 entries, other ones are for troubleshooting
            res.insert(new Object[]{word,                   // the word
                                    topField,               // the top field
                                    new Float(score),       // overall score
                                    new Float(idf),         // idf
                                    new Integer(docFreq),   // freq in all docs
                                    new Integer(tf)
            });
        }
        return res;
    
public java.lang.String describeParams()
Describe the parameters that control how the "more like this" query is formed.

        StringBuffer sb = new StringBuffer();
        sb.append("\t" + "maxQueryTerms  : " + maxQueryTerms + "\n");
        sb.append("\t" + "minWordLen     : " + minWordLen + "\n");
        sb.append("\t" + "maxWordLen     : " + maxWordLen + "\n");
        sb.append("\t" + "fieldNames     : \"");
        String delim = "";
        for (int i = 0; i < fieldNames.length; i++) {
            String fieldName = fieldNames[i];
            sb.append(delim).append(fieldName);
            delim = ", ";
        }
        sb.append("\n");
        sb.append("\t" + "boost          : " + boost + "\n");
        sb.append("\t" + "minTermFreq    : " + minTermFreq + "\n");
        sb.append("\t" + "minDocFreq     : " + minDocFreq + "\n");
        return sb.toString();
    
public org.apache.lucene.analysis.Analyzer getAnalyzer()
Returns the analyzer that will be used to parse the source doc. The default analyzer is the {@link #DEFAULT_ANALYZER}.

return
the analyzer that will be used to parse the source doc.
see
#DEFAULT_ANALYZER

        return analyzer;
    
public java.lang.String[] getFieldNames()
Returns the field names that will be used when generating the 'More Like This' query. The default field names that will be used are {@link #DEFAULT_FIELD_NAMES}.

return
the field names that will be used when generating the 'More Like This' query.

        return fieldNames;
    
public int getMaxNumTokensParsed()

return
The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
see
#DEFAULT_MAX_NUM_TOKENS_PARSED

		return maxNumTokensParsed;
	
public int getMaxQueryTerms()
Returns the maximum number of query terms that will be included in any generated query. The default is {@link #DEFAULT_MAX_QUERY_TERMS}.

return
the maximum number of query terms that will be included in any generated query.

        return maxQueryTerms;
    
public int getMaxWordLen()
Returns the maximum word length above which words will be ignored. Set this to 0 for no maximum word length. The default is {@link #DEFAULT_MAX_WORD_LENGTH}.

return
the maximum word length above which words will be ignored.

        return maxWordLen;
    
public int getMinDocFreq()
Returns the minimum document frequency: words that do not occur in at least this many docs are ignored. The default frequency is {@link #DEFAULT_MIN_DOC_FREQ}.

return
the minimum document frequency; words that do not occur in at least this many docs are ignored.

        return minDocFreq;
    
public int getMinTermFreq()
Returns the frequency below which terms will be ignored in the source doc. The default frequency is the {@link #DEFAULT_MIN_TERM_FREQ}.

return
the frequency below which terms will be ignored in the source doc.

        return minTermFreq;
    
public int getMinWordLen()
Returns the minimum word length below which words will be ignored. Set this to 0 for no minimum word length. The default is {@link #DEFAULT_MIN_WORD_LENGTH}.

return
the minimum word length below which words will be ignored.

        return minWordLen;
    
public java.util.Set getStopWords()
Get the current stop words being used.

see
#setStopWords

		return stopWords;
	
public boolean isBoost()
Returns whether to boost terms in query based on "score" or not. The default is {@link #DEFAULT_BOOST}.

return
whether to boost terms in query based on "score" or not.
see
#setBoost

        return boost;
    
private boolean isNoiseWord(java.lang.String term)
determines if the passed term is likely to be of interest in "more like" comparisons

param
term The word being considered
return
true if the term should be ignored, false if it should be used in further analysis

		int len = term.length();
		if (minWordLen > 0 && len < minWordLen) {
			return true;
		}
		if (maxWordLen > 0 && len > maxWordLen) {
			return true;
		}
		if (stopWords != null && stopWords.contains( term)) {
			return true;
		}
		return false;
	
public org.apache.lucene.search.Query like(int docNum)
Return a query that will return docs like the passed lucene document ID.

param
docNum the documentID of the lucene doc to generate the 'More Like This' query for.
return
a query that will return docs like the passed lucene document ID.

        if (fieldNames == null) {
            // gather list of valid fields from lucene
            Collection fields = ir.getFieldNames( IndexReader.FieldOption.INDEXED);
            fieldNames = (String[]) fields.toArray(new String[fields.size()]);
        }

        return createQuery(retrieveTerms(docNum));
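
A usage sketch, assuming the document ID came from an earlier search; someQuery is a placeholder, and is and mlt carry over from the usage examples above:

Hits seed = is.search(someQuery);   // some earlier query (assumption)
int docNum = seed.id(0);            // Lucene document ID of the top hit
Query mltQuery = mlt.like(docNum);  // no Analyzer is needed for this variant
Hits similar = is.search(mltQuery); // remember to skip docNum in the results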
    
public org.apache.lucene.search.Query like(java.io.File f)
Return a query that will return docs like the passed file.

return
a query that will return docs like the passed file.

        if (fieldNames == null) {
            // gather list of valid fields from lucene
            Collection fields = ir.getFieldNames( IndexReader.FieldOption.INDEXED);
            fieldNames = (String[]) fields.toArray(new String[fields.size()]);
        }

        return like(new FileReader(f));
    
public org.apache.lucene.search.Query like(java.net.URL u)
Return a query that will return docs like the passed URL.

return
a query that will return docs like the passed URL.

        return like(new InputStreamReader(u.openConnection().getInputStream()));
    
public org.apache.lucene.search.Query like(java.io.InputStream is)
Return a query that will return docs like the passed stream.

return
a query that will return docs like the passed stream.

        return like(new InputStreamReader(is));
    
public org.apache.lucene.search.Query like(java.io.Reader r)
Return a query that will return docs like the passed Reader.

return
a query that will return docs like the passed Reader.

        return createQuery(retrieveTerms(r));
    
public static void main(java.lang.String[] a)
Test driver. Pass in "-i INDEX" and then either "-f FILE" or "-url URL".

        String indexName = "localhost_index";
        String fn = "c:/Program Files/Apache Group/Apache/htdocs/manual/vhosts/index.html.en";
        URL url = null;
        for (int i = 0; i < a.length; i++) {
            if (a[i].equals("-i")) {
                indexName = a[++i];
            }
            else if (a[i].equals("-f")) {
                fn = a[++i];
            }
            else if (a[i].equals("-url")) {
                url = new URL(a[++i]);
            }
        }

        PrintStream o = System.out;
        IndexReader r = IndexReader.open(indexName);
        o.println("Open index " + indexName + " which has " + r.numDocs() + " docs");

        MoreLikeThis mlt = new MoreLikeThis(r);

        o.println("Query generation parameters:");
        o.println(mlt.describeParams());
        o.println();

        Query query = null;
        if (url != null) {
            o.println("Parsing URL: " + url);
            query = mlt.like(url);
        }
        else if (fn != null) {
            o.println("Parsing file: " + fn);
            query = mlt.like(new File(fn));
        }

        o.println("q: " + query);
        o.println();
        IndexSearcher searcher = new IndexSearcher(indexName);

        Hits hits = searcher.search(query);
        int len = hits.length();
        o.println("found: " + len + " documents matching");
        o.println();
        for (int i = 0; i < Math.min(25, len); i++) {
            Document d = hits.doc(i);
            String summary = d.get("summary");
            o.println("score  : " + hits.score(i));
            o.println("url    : " + d.get("url"));
            o.println("\ttitle  : " + d.get("title"));
            if (summary != null)
                o.println("\tsummary: " + summary);
            o.println();
        }
    
public java.lang.String[] retrieveInterestingTerms(java.io.Reader r)
Convenience routine to make it easy to return the most interesting words in a document. More advanced users will call {@link #retrieveTerms(java.io.Reader) retrieveTerms()} directly.

param
r the source document
return
the most interesting words in the document
see
#retrieveTerms(java.io.Reader)
see
#setMaxQueryTerms

		ArrayList al = new ArrayList(maxQueryTerms);
		PriorityQueue pq = retrieveTerms(r);
		Object cur;
		int lim = maxQueryTerms; // have to be careful, retrieveTerms returns all words, but that's probably not useful to our caller...
		// we just want to return the top words
		while ((cur = pq.pop()) != null && lim-- > 0) {
			Object[] ar = (Object[]) cur;
			al.add(ar[0]); // the 1st entry is the interesting word
		}
		String[] res = new String[al.size()];
		return (String[]) al.toArray(res);
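
A usage sketch; the mlt instance carries over from the earlier examples, docText is a placeholder, and StringReader is java.io.StringReader:

String[] interesting = mlt.retrieveInterestingTerms(new StringReader(docText));
for (int i = 0; i < interesting.length; i++) {
    System.out.println(interesting[i]); // best-scoring words come first
}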
	
private org.apache.lucene.util.PriorityQueue retrieveTerms(int docNum)
Find words for a more-like-this query former.

param
docNum the id of the lucene document from which to find terms

        Map termFreqMap = new HashMap();
        for (int i = 0; i < fieldNames.length; i++) {
            String fieldName = fieldNames[i];
            TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName);

            // field does not store term vector info
            if (vector == null) {
                Document d = ir.document(docNum);
                String[] text = d.getValues(fieldName);
                if (text != null) {
                    for (int j = 0; j < text.length; j++) {
                        addTermFrequencies(new StringReader(text[j]), termFreqMap, fieldName);
                    }
                }
            }
            else {
                addTermFrequencies(termFreqMap, vector);
            }
        }

        return createQueue(termFreqMap);
    
public org.apache.lucene.util.PriorityQueue retrieveTerms(java.io.Reader r)
Find words for a more-like-this query former. The result is a priority queue of arrays with one entry for every word in the document. Each array has 6 elements. The elements are:
  1. The word (String)
  2. The top field that this word comes from (String)
  3. The score for this word (Float)
  4. The IDF value (Float)
  5. The frequency of this word in the index (Integer)
  6. The frequency of this word in the source document (Integer)
This is a somewhat "advanced" routine, and in general only the 1st entry in the array is of interest. This method is exposed so that you can identify the "interesting words" in a document. For an easier method to call see {@link #retrieveInterestingTerms retrieveInterestingTerms()}.

param
r the reader that has the content of the document
return
the most interesting words in the document ordered by score, with the highest scoring, or best entry, first
see
#retrieveInterestingTerms

        Map words = new HashMap();
        for (int i = 0; i < fieldNames.length; i++) {
            String fieldName = fieldNames[i];
			addTermFrequencies(r, words, fieldName);
        }
        return createQueue(words);
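
For example, draining the returned queue; as noted above, usually only the first entry of each array matters. This is a sketch: reader and mlt carry over from earlier examples, and PriorityQueue is org.apache.lucene.util.PriorityQueue:

PriorityQueue pq = mlt.retrieveTerms(reader);
Object cur;
while ((cur = pq.pop()) != null) {
    Object[] ar = (Object[]) cur;
    System.out.println(ar[0] + " (field=" + ar[1] + ", score=" + ar[2] + ")");
}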
    
public void setAnalyzer(org.apache.lucene.analysis.Analyzer analyzer)
Sets the analyzer to use. An analyzer is not required for generating a query with the {@link #like(int)} method; all other 'like' methods require an analyzer.

param
analyzer the analyzer to use to tokenize text.

        this.analyzer = analyzer;
    
public void setBoost(boolean boost)
Sets whether to boost terms in query based on "score" or not.

param
boost true to boost terms in query based on "score", false otherwise.
see
#isBoost

        this.boost = boost;
    
public void setFieldNames(java.lang.String[] fieldNames)
Sets the field names that will be used when generating the 'More Like This' query. Set this to null for the field names to be determined at runtime from the IndexReader provided in the constructor.

param
fieldNames the field names that will be used when generating the 'More Like This' query.

        this.fieldNames = fieldNames;
    
public void setMaxNumTokensParsed(int i)

param
i The maximum number of tokens to parse in each example doc field that is not stored with TermVector support

		maxNumTokensParsed = i;
	
public void setMaxQueryTerms(int maxQueryTerms)
Sets the maximum number of query terms that will be included in any generated query.

param
maxQueryTerms the maximum number of query terms that will be included in any generated query.

        this.maxQueryTerms = maxQueryTerms;
    
public void setMaxWordLen(int maxWordLen)
Sets the maximum word length above which words will be ignored.

param
maxWordLen the maximum word length above which words will be ignored.

        this.maxWordLen = maxWordLen;
    
public void setMinDocFreq(int minDocFreq)
Sets the minimum document frequency: words that do not occur in at least this many docs are ignored.

param
minDocFreq the minimum document frequency; words that do not occur in at least this many docs are ignored.

        this.minDocFreq = minDocFreq;
    
public void setMinTermFreq(int minTermFreq)
Sets the frequency below which terms will be ignored in the source doc.

param
minTermFreq the frequency below which terms will be ignored in the source doc.

        this.minTermFreq = minTermFreq;
    
public void setMinWordLen(int minWordLen)
Sets the minimum word length below which words will be ignored.

param
minWordLen the minimum word length below which words will be ignored.

        this.minWordLen = minWordLen;
    
public void setStopWords(java.util.Set stopWords)
Set the set of stopwords. Any word in this set is considered "uninteresting" and ignored. Even if your Analyzer allows stopwords, you might want to tell the MoreLikeThis code to ignore them, as for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".

param
stopWords set of stopwords; if null, stop words are allowed
see
org.apache.lucene.analysis.StopFilter#makeStopSet StopFilter.makeStopSet()
see
#getStopWords

		this.stopWords = stopWords;
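
For example, a sketch using StopFilter.makeStopSet(); the word list is illustrative only, and mlt carries over from the earlier examples:

Set stopWords = StopFilter.makeStopSet(new String[] {"a", "an", "the", "of", "and"});
mlt.setStopWords(stopWords);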