File: MemoryIndex.java | API Doc | Apache Lucene 1.9 | 37009 bytes | Mon Feb 20 09:18:36 GMT 2006 | Package: org.apache.lucene.index.memory

MemoryIndex

public class MemoryIndex extends Object
High-performance single-document main memory Apache Lucene fulltext search index.

Overview

This class is a replacement/substitute for a large subset of {@link org.apache.lucene.store.RAMDirectory} functionality. It is designed for maximum efficiency in on-the-fly matchmaking that combines structured and fuzzy fulltext search in realtime streaming applications, such as Nux XQuery based XML message queues, publish-subscribe systems for blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), this class targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search), for example as in float score = search(String text, Query query).

Each instance can hold at most one Lucene "document", with a document containing zero or more "fields", each field having a name and a fulltext value. The fulltext value is tokenized (split and transformed) into zero or more index terms (aka words) on addField(), according to the policy implemented by an Analyzer. For example, Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. For details, see Lucene Analyzer Intro.
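For instance, one can inspect what an analyzer emits by iterating over its token stream directly, via the same Lucene 1.9 token API that addField() uses internally (a minimal sketch; the analyzer choice and sample text are arbitrary):

Analyzer analyzer = new SimpleAnalyzer();
TokenStream stream = analyzer.tokenStream("content", new java.io.StringReader("Readings about Salmons"));
Token token;
while ((token = stream.next()) != null) { // next() throws IOException; handle or declare it
    System.out.println(token.termText()); // prints "readings", "about", "salmons"
}
stream.close();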

Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules. Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization.

For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.

Example Usage

Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
//Analyzer analyzer = new SimpleAnalyzer();
MemoryIndex index = new MemoryIndex();
index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
index.addField("author", "Tales of James", analyzer);
float score = index.search(QueryParser.parse("+author:james +salmon~ +fish* manual~", "content", analyzer));
if (score > 0.0f) {
    System.out.println("it's a match");
} else {
    System.out.println("no match found");
}
System.out.println("indexData=" + index.toString());

Example XQuery Usage

(: An XQuery that finds all books authored by James that have something to do with "salmon fishing manuals", sorted by relevance :)
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)

for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book

No thread safety guarantees

An instance can be queried multiple times with the same or different queries, but it is not thread-safe. If desired, use idioms such as:
MemoryIndex index = ...
synchronized (index) {
    // read and/or write index (i.e. add fields and/or query)
}

Performance Notes

Internally there's a new data structure geared towards efficient indexing and searching, plus the necessary support code to seamlessly plug into the Lucene framework.

This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB) and everything in between. Typically, it is about 10-100 times faster than RAMDirectory. Note that RAMDirectory has particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case and O(N log N) in the worst case. Memory consumption is probably larger than for RAMDirectory.
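To verify these figures on your own data, a throwaway micro-benchmark along these lines can help (an illustrative sketch, not from the original source; texts, analyzer and query stand in for your own sample data, analyzer and pre-parsed Lucene Query):

long start = System.currentTimeMillis();
int runs = 10000;
for (int i = 0; i < runs; i++) {
    MemoryIndex index = new MemoryIndex(); // fresh single-document index per cycle
    index.addField("content", texts[i % texts.length], analyzer);
    index.search(query);
}
System.out.println(runs + " index+search cycles in " + (System.currentTimeMillis() - start) + " ms");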

To locate bottlenecks, run Java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).

author
whoschek.AT.lbl.DOT.gov

Fields Summary
private final HashMap fields
	info for each field: a Map keyed by field name, holding an Info per field
private transient Map.Entry[] sortedFields
	fields sorted ascending by fieldName; lazily computed on demand
private final int stride
	pos: positions[3*i], startOffset: positions[3*i + 1], endOffset: positions[3*i + 2]
private static final long serialVersionUID
private static final boolean DEBUG
private static final Comparator termComparator
	Sorts term entries into ascending order; also works for Arrays.binarySearch() and Arrays.sort()
private static final Term MATCH_ALL_TERM
Constructors Summary
public MemoryIndex()
Constructs an empty instance.


		this(false);
	
private MemoryIndex(boolean storeOffsets)
Constructs an empty instance that can optionally store the start and end character offset of each token term in the text. This can be useful for highlighting of hit locations with the Lucene highlighter package. Private until the highlighter package matures, so that this can actually be meaningfully integrated.

param
storeOffsets whether or not to store the start and end character offset of each token term in the text

		this.stride = storeOffsets ? 3 : 1;
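		// Layout (mirrors the stride field above): with storeOffsets == true,
		// occurrence i of a term occupies three consecutive ints in its positions list:
		//   positions[3*i]     = token position
		//   positions[3*i + 1] = token.startOffset()
		//   positions[3*i + 2] = token.endOffset()
		// With stride == 1, only the token position is stored.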
	
Methods Summary
public void addField(java.lang.String fieldName, java.lang.String text, org.apache.lucene.analysis.Analyzer analyzer)
Convenience method; Tokenizes the given field text and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, non-keyword Lucene {@link org.apache.lucene.document.Field}.

param
fieldName a name to be associated with the text
param
text the text to tokenize and index.
param
analyzer the analyzer to use for tokenization

		if (fieldName == null)
			throw new IllegalArgumentException("fieldName must not be null");
		if (text == null)
			throw new IllegalArgumentException("text must not be null");
		if (analyzer == null)
			throw new IllegalArgumentException("analyzer must not be null");
		
		TokenStream stream;
		if (analyzer instanceof PatternAnalyzer) {
			stream = ((PatternAnalyzer) analyzer).tokenStream(fieldName, text);
		} else {
			stream = analyzer.tokenStream(fieldName, 
					new PatternAnalyzer.FastStringReader(text));
		}
		addField(fieldName, stream);
	
public void addField(java.lang.String fieldName, org.apache.lucene.analysis.TokenStream stream)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene {@link org.apache.lucene.document.Field}. Finally closes the token stream. Note that untokenized keywords can be added with this method via {@link #keywordTokenStream(Collection)}, the Lucene contrib KeywordTokenizer or similar utilities.

param
fieldName a name to be associated with the text
param
stream the token stream to retrieve tokens from.

		/*
		 * Note that this method signature avoids having a user call new
		 * o.a.l.d.Field(...) which would be much too expensive due to the
		 * String.intern() usage of that class.
		 * 
		 * More often than not, String.intern() leads to serious performance
		 * degradations rather than improvements! If you're curious why, check
		 * out the JDK's native code, see how it oscillates multiple times back
		 * and forth between Java code and native code on each intern() call,
		 * only to end up using a plain vanilla java.util.HashMap on the Java
		 * heap for its interned strings! String.equals() has a small cost
		 * compared to String.intern(), trust me. Application level interning
		 * (e.g. a HashMap per Directory/Index) typically leads to better
		 * solutions than frequent hidden low-level calls to String.intern().
		 * 
		 * Perhaps with some luck, Lucene's Field.java (and Term.java) and
		 * cousins could be fixed to not use String.intern(). Sigh :-(
		 */
		try {
			if (fieldName == null)
				throw new IllegalArgumentException("fieldName must not be null");
			if (stream == null)
				throw new IllegalArgumentException("token stream must not be null");
			if (fields.get(fieldName) != null)
				throw new IllegalArgumentException("field must not be added more than once");
			
			HashMap terms = new HashMap();
			int numTokens = 0;
			int pos = -1;
			Token token;
			
			while ((token = stream.next()) != null) {
				String term = token.termText();
				if (term.length() == 0) continue; // nothing to do
//				if (DEBUG) System.err.println("token='" + term + "'");
				numTokens++;
				pos += token.getPositionIncrement();
				
				ArrayIntList positions = (ArrayIntList) terms.get(term);
				if (positions == null) { // term not seen before
					positions = new ArrayIntList(stride);
					terms.put(term, positions);
				}
				if (stride == 1) {
					positions.add(pos);
				} else {
					positions.add(pos, token.startOffset(), token.endOffset());
				}
			}
			
			// ensure infos.numTokens > 0 invariant; needed for correct operation of terms()
			if (numTokens > 0) {
				fields.put(fieldName, new Info(terms, numTokens));
				sortedFields = null;    // invalidate sorted view, if any
			}
		} catch (IOException e) { // can never happen
			throw new RuntimeException(e);
		} finally {
			try {
				if (stream != null) stream.close();
			} catch (IOException e2) {
				throw new RuntimeException(e2);
			}
		}
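As a usage sketch (assuming Lucene's stock WhitespaceTokenizer and LowerCaseFilter; the field name and text are arbitrary, and java.io.StringReader supplies the characters):

MemoryIndex index = new MemoryIndex();
TokenStream stream = new LowerCaseFilter(
    new WhitespaceTokenizer(new StringReader("Salmon Fishing Manuals")));
index.addField("content", stream); // addField() closes the stream when done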
	
public org.apache.lucene.search.IndexSearcher createSearcher()
Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits.

return
a searcher

		MemoryIndexReader reader = new MemoryIndexReader();
		IndexSearcher searcher = new IndexSearcher(reader); // ensures no auto-close !!
		reader.setSearcher(searcher); // to later get hold of searcher.getSimilarity()
		return searcher;
	
public int getMemorySize()
Returns a reasonable approximation of the main memory [bytes] consumed by this instance. Useful for smart memory-sensitive caches/pools. Assumes fieldNames are interned, whereas tokenized terms are memory-overlaid. For simplicity, assumes no VM word boundary alignment of instance vars.

return
the main memory consumption

		// for example usage in a smart cache see nux.xom.pool.Pool
		int HEADER = 12; // object header of any java object
		int PTR = 4; // pointer on 32 bit VMs
		int ARR = HEADER + 4;
		int STR = HEADER + 3*4 + PTR + ARR; // string
		int INTARRLIST = HEADER + 4 + PTR + ARR;
		int HASHMAP = HEADER + 4*PTR + 4*4 + ARR;
		
		int size = 0;
		size += HEADER + 2*PTR + 4; // memory index
		if (sortedFields != null) size += ARR + PTR * sortedFields.length;
		
		size += HASHMAP + fields.size() * (PTR + HEADER + 3*PTR + 4); // Map.entries
		Iterator iter = fields.entrySet().iterator();
		while (iter.hasNext()) { // for each Field Info
			Map.Entry entry = (Map.Entry) iter.next();			
			Info info = (Info) entry.getValue();
			size += HEADER + 4 + PTR + PTR + PTR; // Info instance vars
			if (info.sortedTerms != null) size += ARR + PTR * info.sortedTerms.length;
			
			int len = info.terms.size();
			size += HASHMAP + len * (PTR + HEADER + 3*PTR + 4); // Map.entries
			Iterator iter2 = info.terms.entrySet().iterator();
			while (--len >= 0) { // for each term
				Map.Entry e = (Map.Entry) iter2.next();
				size += STR - ARR; // assumes substring() memory overlay
//				size += STR + 2 * ((String) e.getKey()).length();
				ArrayIntList positions = (ArrayIntList) e.getValue();
				size += INTARRLIST + 4*positions.size();
			}
		}
		return size;
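For example, a memory-sensitive cache could consult the approximation before admitting an instance (a hypothetical sketch; cachedBytes, maxBytes and evictOldest() are made-up names, not part of this class):

int bytes = index.getMemorySize();
if (cachedBytes + bytes > maxBytes) {
    evictOldest(); // hypothetical eviction helper of the surrounding cache
}
cachedBytes += bytes; // running total of pooled index sizes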
	
public org.apache.lucene.analysis.TokenStream keywordTokenStream(java.util.Collection keywords)
Convenience method; Creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis. The resulting token stream can be fed into {@link #addField(String, TokenStream)}, perhaps wrapped into another {@link org.apache.lucene.analysis.TokenFilter}, as desired.

param
keywords the keywords to generate tokens for
return
the corresponding token stream

		// TODO: deprecate & move this method into AnalyzerUtil?
		if (keywords == null)
			throw new IllegalArgumentException("keywords must not be null");
		
		return new TokenStream() {
			private Iterator iter = keywords.iterator();
			private int start = 0;
			public Token next() {
				if (!iter.hasNext()) return null;
				
				Object obj = iter.next();
				if (obj == null) 
					throw new IllegalArgumentException("keyword must not be null");
				
				String term = obj.toString();
				Token token = new Token(term, start, start + term.length());
				start += term.length() + 1; // separate words by 1 (blank) character
				return token;
			}
		};
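A usage sketch (the field name and keywords are arbitrary): keywords indexed verbatim this way can then be matched with exact TermQuery lookups:

MemoryIndex index = new MemoryIndex();
index.addField("category", index.keywordTokenStream(
    Arrays.asList(new String[] { "fish", "alaska" })));
float score = index.search(new TermQuery(new Term("category", "fish"))); // exact match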
	
private int numPositions(org.apache.lucene.index.memory.MemoryIndex.ArrayIntList positions)

		return positions.size() / stride;
	
public float search(org.apache.lucene.search.Query query)
Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression.

param
query an arbitrary Lucene query to run against this index
return
the relevance score of the matchmaking; a number in the range [0.0 .. 1.0], with 0.0 indicating no match. The higher the number, the better the match.
see
org.apache.lucene.queryParser.QueryParser#parse(String)

		if (query == null) 
			throw new IllegalArgumentException("query must not be null");
		
		Searcher searcher = createSearcher();
		try {
			final float[] scores = new float[1]; // inits to 0.0f (no match)
			searcher.search(query, new HitCollector() {
				public void collect(int doc, float score) {
					scores[0] = score;
				}
			});
			float score = scores[0];
			return score;
		} catch (IOException e) { // can never happen (RAMDirectory)
			throw new RuntimeException(e);
		} finally {
			// searcher.close();
			/*
			 * Note that it is harmless and important for good performance to
			 * NOT close the index reader!!! This avoids all sorts of
			 * unnecessary baggage and locking in the Lucene IndexReader
			 * superclass, all of which is completely unnecessary for this main
			 * memory index data structure without thread-safety claims.
			 * 
			 * Wishing IndexReader would be an interface...
			 * 
			 * Actually with the new tight createSearcher() API auto-closing is now
			 * made impossible, hence searcher.close() would be harmless...
			 */
		}		
	
private static java.util.Map.Entry[] sort(java.util.HashMap map)
returns a view of the given map's entries, sorted ascending by key

		int size = map.size();
		Map.Entry[] entries = new Map.Entry[size];
		
		Iterator iter = map.entrySet().iterator();
		for (int i=0; i < size; i++) {
			entries[i] = (Map.Entry) iter.next();
		}
		
		if (size > 1) Arrays.sort(entries, termComparator);
		return entries;
	
private void sortFields()
sorts into ascending order (on demand), reusing memory along the way

		if (sortedFields == null) sortedFields = sort(fields);
	
public java.lang.String toString()
Returns a String representation of the index data for debugging purposes.

return
the string representation

		StringBuffer result = new StringBuffer(256);		
		sortFields();		
		int sumChars = 0;
		int sumPositions = 0;
		int sumTerms = 0;
		
		for (int i=0; i < sortedFields.length; i++) {
			Map.Entry entry = sortedFields[i];
			String fieldName = (String) entry.getKey();
			Info info = (Info) entry.getValue();
			info.sortTerms();
			result.append(fieldName + ":\n");
			
			int numChars = 0;
			int numPositions = 0;
			for (int j=0; j < info.sortedTerms.length; j++) {
				Map.Entry e = info.sortedTerms[j];
				String term = (String) e.getKey();
				ArrayIntList positions = (ArrayIntList) e.getValue();
				result.append("\t'" + term + "':" + numPositions(positions) + ":");
				result.append(positions.toString(stride)); // ignore offsets
				result.append("\n");
				numPositions += numPositions(positions);
				numChars += term.length();
			}
			
			result.append("\tterms=" + info.sortedTerms.length);
			result.append(", positions=" + numPositions);
			result.append(", Kchars=" + (numChars/1000.0f));
			result.append("\n");
			sumPositions += numPositions;
			sumChars += numChars;
			sumTerms += info.sortedTerms.length;
		}
		
		result.append("\nfields=" + sortedFields.length);
		result.append(", terms=" + sumTerms);
		result.append(", positions=" + sumPositions);
		result.append(", Kchars=" + (sumChars/1000.0f));
		return result.toString();