IndexWriter

public class IndexWriter extends Object

An IndexWriter creates and maintains an index.
The third argument to the
constructor
determines whether a new index is created, or whether an existing index is
opened for the addition of new documents.
In either case, documents are added with the addDocument method.
When finished adding documents, close should be called.
If an index will not have more documents added for a while and optimal search
performance is desired, then the optimize
method should be called before the index is closed.
Opening an IndexWriter creates a lock file for the directory in use. Trying to open
another IndexWriter on the same directory will lead to an IOException. The IOException
is also thrown if an IndexReader on the same directory is used to delete documents
from the index.
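For example, a typical indexing session follows the pattern described above. The sketch below is a minimal illustration rather than part of the original documentation: the index path is hypothetical, and the Field construction style varies between Lucene releases.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexingExample {
  public static void main(String[] args) throws Exception {
    // true = create a new index at this path, replacing any existing one
    IndexWriter writer = new IndexWriter("/tmp/myindex", new StandardAnalyzer(), true);

    Document doc = new Document();
    // Field.Text stores, indexes and tokenizes the value (older Field API)
    doc.add(Field.Text("contents", "an example document body"));
    writer.addDocument(doc);

    writer.optimize();   // merge all segments for optimal search performance
    writer.close();      // flush changes and release the write lock
  }
}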
Fields Summary
public static final long WRITE_LOCK_TIMEOUT
Default value is 1,000.

public static final long COMMIT_LOCK_TIMEOUT
Default value is 10,000.

public static final String WRITE_LOCK_NAME

public static final String COMMIT_LOCK_NAME

public static final int DEFAULT_MERGE_FACTOR
Default value is 10. Change using {@link #setMergeFactor(int)}.

public static final int DEFAULT_MAX_BUFFERED_DOCS
Default value is 10. Change using {@link #setMaxBufferedDocs(int)}.

public static final int DEFAULT_MIN_MERGE_DOCS

public static final int DEFAULT_MAX_MERGE_DOCS
Default value is {@link Integer#MAX_VALUE}. Change using {@link #setMaxMergeDocs(int)}.

public static final int DEFAULT_MAX_FIELD_LENGTH
Default value is 10,000. Change using {@link #setMaxFieldLength(int)}.

public static final int DEFAULT_TERM_INDEX_INTERVAL
Default value is 128. Change using {@link #setTermIndexInterval(int)}.

private Directory directory

private Analyzer analyzer

private Similarity similarity

private SegmentInfos segmentInfos

private final Directory ramDirectory

private Lock writeLock

private int termIndexInterval

private boolean useCompoundFile
Use compound file setting. Defaults to true, minimizing the number of files used. Setting this to false may improve indexing performance, but may also cause file handle problems.

private boolean closeDir

public int maxFieldLength
The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field.

public int mergeFactor
Determines how often segment indices are merged by addDocument(). With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained. This must never be less than 2. The default value is 10.

public int minMergeDocs
Determines the minimal number of documents required before the buffered in-memory documents are merged and a new segment is created. Since documents are merged in a {@link org.apache.lucene.store.RAMDirectory}, a larger value gives faster indexing. At the same time, mergeFactor limits the number of files open in an FSDirectory. The default value is 10.

public int maxMergeDocs
Determines the largest number of documents ever merged by addDocument(). Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches. The default value is {@link Integer#MAX_VALUE}.

public PrintStream infoStream
If non-null, information about merges will be printed to this.
Constructors Summary
private IndexWriter(Directory d, Analyzer a, boolean create, boolean closeDir)
this.closeDir = closeDir;
directory = d;
analyzer = a;
Lock writeLock = directory.makeLock(IndexWriter.WRITE_LOCK_NAME);
if (!writeLock.obtain(WRITE_LOCK_TIMEOUT)) // obtain write lock
throw new IOException("Index locked for write: " + writeLock);
this.writeLock = writeLock; // save it
synchronized (directory) { // in- & inter-process sync
new Lock.With(directory.makeLock(IndexWriter.COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
public Object doBody() throws IOException {
if (create)
segmentInfos.write(directory);
else
segmentInfos.read(directory);
return null;
}
}.run();
}
public IndexWriter(String path, Analyzer a, boolean create)
Constructs an IndexWriter for the index in path. Text will be analyzed with a. If create is true, then a new, empty index will be created in path, replacing the index already there, if any.
this(FSDirectory.getDirectory(path, create), a, create, true);
public IndexWriter(File path, Analyzer a, boolean create)
Constructs an IndexWriter for the index in path. Text will be analyzed with a. If create is true, then a new, empty index will be created in path, replacing the index already there, if any.
this(FSDirectory.getDirectory(path, create), a, create, true);
public IndexWriter(Directory d, Analyzer a, boolean create)
Constructs an IndexWriter for the index in d. Text will be analyzed with a. If create is true, then a new, empty index will be created in d, replacing the index already there, if any.
this(d, a, create, false);
Methods Summary
public void addDocument(org.apache.lucene.document.Document doc)
Adds a document to this index. If the document contains more than {@link #setMaxFieldLength(int)} terms for a given field, the remainder are discarded.
addDocument(doc, analyzer);
public void addDocument(org.apache.lucene.document.Document doc, org.apache.lucene.analysis.Analyzer analyzer)
Adds a document to this index, using the provided analyzer instead of the value of {@link #getAnalyzer()}. If the document contains more than {@link #setMaxFieldLength(int)} terms for a given field, the remainder are discarded.
DocumentWriter dw =
new DocumentWriter(ramDirectory, analyzer, this);
dw.setInfoStream(infoStream);
String segmentName = newSegmentName();
dw.addDocument(segmentName, doc);
synchronized (this) {
segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
maybeMergeSegments();
}
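As a sketch of the two-argument form, assuming an open IndexWriter named writer (as in the earlier example) and the stock WhitespaceAnalyzer; any Analyzer implementation could be substituted:

Document plain = new Document();
plain.add(Field.Text("contents", "analyzed with the writer's default analyzer"));
writer.addDocument(plain);

Document special = new Document();
special.add(Field.Text("contents", "split only on whitespace for this document"));
writer.addDocument(special, new WhitespaceAnalyzer());  // per-document analyzer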
public synchronized void addIndexes(org.apache.lucene.store.Directory[] dirs)
Merges all segments from an array of indexes into this index.
This may be used to parallelize batch indexing. A large document
collection can be broken into sub-collections. Each sub-collection can be
indexed in parallel, on a different thread, process or machine. The
complete index can then be created by merging sub-collection indexes
with this method.
After this completes, the index is optimized.
optimize(); // start with zero or 1 seg
int start = segmentInfos.size();
for (int i = 0; i < dirs.length; i++) {
SegmentInfos sis = new SegmentInfos(); // read infos from dir
sis.read(dirs[i]);
for (int j = 0; j < sis.size(); j++) {
segmentInfos.addElement(sis.info(j)); // add each info
}
}
// merge newly added segments in log(n) passes
while (segmentInfos.size() > start+mergeFactor) {
for (int base = start+1; base < segmentInfos.size(); base++) {
int end = Math.min(segmentInfos.size(), base+mergeFactor);
if (end-base > 1)
mergeSegments(base, end);
}
}
optimize(); // final cleanup
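A sketch of the parallel-indexing pattern described above; the directory paths are hypothetical, and each part is assumed to have been built separately by its own worker:

// Open the partial indexes produced by separate threads, processes or machines.
Directory[] parts = new Directory[] {
    FSDirectory.getDirectory("/tmp/index-part1", false),
    FSDirectory.getDirectory("/tmp/index-part2", false)
};

// Merge them into one index; addIndexes optimizes before and after merging.
IndexWriter merged = new IndexWriter("/tmp/index-merged", new StandardAnalyzer(), true);
merged.addIndexes(parts);
merged.close();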
public synchronized void addIndexes(org.apache.lucene.index.IndexReader[] readers)
Merges the provided indexes into this index.
After this completes, the index is optimized.
The provided IndexReaders are not closed.
optimize(); // start with zero or 1 seg
final String mergedName = newSegmentName();
SegmentMerger merger = new SegmentMerger(this, mergedName);
final Vector segmentsToDelete = new Vector();
IndexReader sReader = null;
if (segmentInfos.size() == 1){ // add existing index, if any
sReader = SegmentReader.get(segmentInfos.info(0));
merger.add(sReader);
segmentsToDelete.addElement(sReader); // queue segment for deletion
}
for (int i = 0; i < readers.length; i++) // add new indexes
merger.add(readers[i]);
int docCount = merger.merge(); // merge 'em
segmentInfos.setSize(0); // pop old infos & add new
segmentInfos.addElement(new SegmentInfo(mergedName, docCount, directory));
if(sReader != null)
sReader.close();
synchronized (directory) { // in- & inter-process sync
new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
public Object doBody() throws IOException {
segmentInfos.write(directory); // commit changes
deleteSegments(segmentsToDelete); // delete now-unused segments
return null;
}
}.run();
}
if (useCompoundFile) {
final Vector filesToDelete = merger.createCompoundFile(mergedName + ".tmp");
synchronized (directory) { // in- & inter-process sync
new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
public Object doBody() throws IOException {
// make compound file visible for SegmentReaders
directory.renameFile(mergedName + ".tmp", mergedName + ".cfs");
// delete now unused files of segment
deleteFiles(filesToDelete);
return null;
}
}.run();
}
}
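A corresponding sketch for the IndexReader variant, with hypothetical paths; since the provided readers are not closed by this method, the caller closes them afterwards:

IndexReader[] readers = new IndexReader[] {
    IndexReader.open("/tmp/index-part1"),
    IndexReader.open("/tmp/index-part2")
};

IndexWriter merged = new IndexWriter("/tmp/index-merged", new StandardAnalyzer(), true);
merged.addIndexes(readers);
merged.close();

for (int i = 0; i < readers.length; i++)
    readers[i].close();   // addIndexes leaves the readers open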
public synchronized void close()
Flushes all changes to an index and closes all associated files.
flushRamSegments();
ramDirectory.close();
if (writeLock != null) {
writeLock.release(); // release write lock
writeLock = null;
}
if(closeDir)
directory.close();
private final void deleteFiles(java.util.Vector files)
Vector deletable = new Vector();
deleteFiles(readDeleteableFiles(), deletable); // try to delete deleteable
deleteFiles(files, deletable); // try to delete our files
writeDeleteableFiles(deletable); // note files we can't delete
private final void deleteFiles(java.util.Vector files, org.apache.lucene.store.Directory directory)
for (int i = 0; i < files.size(); i++)
directory.deleteFile((String)files.elementAt(i));
private final void deleteFiles(java.util.Vector files, java.util.Vector deletable)
for (int i = 0; i < files.size(); i++) {
String file = (String)files.elementAt(i);
try {
directory.deleteFile(file); // try to delete each file
} catch (IOException e) { // if delete fails
if (directory.fileExists(file)) {
if (infoStream != null)
infoStream.println(e.toString() + "; Will re-try later.");
deletable.addElement(file); // add to deletable
}
}
}
private final void deleteSegments(java.util.Vector segments)
Vector deletable = new Vector();
deleteFiles(readDeleteableFiles(), deletable); // try to delete deleteable
for (int i = 0; i < segments.size(); i++) {
SegmentReader reader = (SegmentReader)segments.elementAt(i);
if (reader.directory() == this.directory)
deleteFiles(reader.files(), deletable); // try to delete our files
else
deleteFiles(reader.files(), reader.directory()); // delete other files
}
writeDeleteableFiles(deletable); // note files we can't delete
public synchronized int docCount()
Returns the number of documents currently in this index.
int count = 0;
for (int i = 0; i < segmentInfos.size(); i++) {
SegmentInfo si = segmentInfos.info(i);
count += si.docCount;
}
return count;
protected void finalize()
Release the write lock, if needed.
if (writeLock != null) {
writeLock.release(); // release write lock
writeLock = null;
}
private final void flushRamSegments()
Merges all RAM-resident segments.
int minSegment = segmentInfos.size()-1;
int docCount = 0;
while (minSegment >= 0 &&
(segmentInfos.info(minSegment)).dir == ramDirectory) {
docCount += segmentInfos.info(minSegment).docCount;
minSegment--;
}
if (minSegment < 0 || // add one FS segment?
(docCount + segmentInfos.info(minSegment).docCount) > mergeFactor ||
!(segmentInfos.info(segmentInfos.size()-1).dir == ramDirectory))
minSegment++;
if (minSegment >= segmentInfos.size())
return; // none to merge
mergeSegments(minSegment);
public org.apache.lucene.analysis.Analyzer getAnalyzer()
Returns the analyzer used by this index.
return analyzer;
public org.apache.lucene.store.Directory getDirectory()
Returns the Directory used by this index.
return directory;
public java.io.PrintStream getInfoStream()
return infoStream;
public int getMaxBufferedDocs()
return minMergeDocs;
public int getMaxFieldLength()
return maxFieldLength;
public int getMaxMergeDocs()
return maxMergeDocs;
public int getMergeFactor()
return mergeFactor;
final int getSegmentsCounter()
return segmentInfos.counter;
public org.apache.lucene.search.Similarity getSimilarity()
Expert: Return the Similarity implementation used by this IndexWriter.
This defaults to the current value of {@link Similarity#getDefault()}.
return this.similarity;
public int getTermIndexInterval()
Expert: Return the interval between indexed terms.
return termIndexInterval;
public boolean getUseCompoundFile()
Get the current setting of whether to use the compound file format.
Note that this just returns the value you set with setUseCompoundFile(boolean)
or the default. You cannot use this to query the status of an existing index.
return useCompoundFile;
private final void maybeMergeSegments()
Incremental segment merger.
long targetMergeDocs = minMergeDocs;
while (targetMergeDocs <= maxMergeDocs) {
// find segments smaller than current target size
int minSegment = segmentInfos.size();
int mergeDocs = 0;
while (--minSegment >= 0) {
SegmentInfo si = segmentInfos.info(minSegment);
if (si.docCount >= targetMergeDocs)
break;
mergeDocs += si.docCount;
}
if (mergeDocs >= targetMergeDocs) // found a merge to do
mergeSegments(minSegment+1);
else
break;
targetMergeDocs *= mergeFactor; // increase target size
}
private final void mergeSegments(int minSegment)
Pops segments off of segmentInfos stack down to minSegment, merges them,
and pushes the merged index onto the top of the segmentInfos stack.
mergeSegments(minSegment, segmentInfos.size());
private final void mergeSegments(int minSegment, int end)
Merges the named range of segments, replacing them in the stack with a
single segment.
final String mergedName = newSegmentName();
if (infoStream != null) infoStream.print("merging segments");
SegmentMerger merger = new SegmentMerger(this, mergedName);
final Vector segmentsToDelete = new Vector();
for (int i = minSegment; i < end; i++) {
SegmentInfo si = segmentInfos.info(i);
if (infoStream != null)
infoStream.print(" " + si.name + " (" + si.docCount + " docs)");
IndexReader reader = SegmentReader.get(si);
merger.add(reader);
if ((reader.directory() == this.directory) || // if we own the directory
(reader.directory() == this.ramDirectory))
segmentsToDelete.addElement(reader); // queue segment for deletion
}
int mergedDocCount = merger.merge();
if (infoStream != null) {
infoStream.println(" into "+mergedName+" ("+mergedDocCount+" docs)");
}
for (int i = end-1; i >= minSegment; i--) // remove old infos & add new
segmentInfos.remove(i);
segmentInfos.addElement(new SegmentInfo(mergedName, mergedDocCount,
directory));
// close readers before we attempt to delete now-obsolete segments
merger.closeReaders();
synchronized (directory) { // in- & inter-process sync
new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
public Object doBody() throws IOException {
segmentInfos.write(directory); // commit before deleting
deleteSegments(segmentsToDelete); // delete now-unused segments
return null;
}
}.run();
}
if (useCompoundFile) {
final Vector filesToDelete = merger.createCompoundFile(mergedName + ".tmp");
synchronized (directory) { // in- & inter-process sync
new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
public Object doBody() throws IOException {
// make compound file visible for SegmentReaders
directory.renameFile(mergedName + ".tmp", mergedName + ".cfs");
// delete now unused files of segment
deleteFiles(filesToDelete);
return null;
}
}.run();
}
}
private final synchronized java.lang.String newSegmentName()
return "_" + Integer.toString(segmentInfos.counter++, Character.MAX_RADIX);
public synchronized void optimize()
Merges all segments together into a single segment, optimizing an index
for search.
flushRamSegments();
while (segmentInfos.size() > 1 ||
(segmentInfos.size() == 1 &&
(SegmentReader.hasDeletions(segmentInfos.info(0)) ||
segmentInfos.info(0).dir != directory ||
(useCompoundFile &&
(!SegmentReader.usesCompoundFile(segmentInfos.info(0)) ||
SegmentReader.hasSeparateNorms(segmentInfos.info(0))))))) {
int minSegment = segmentInfos.size() - mergeFactor;
mergeSegments(minSegment < 0 ? 0 : minSegment);
}
private final java.util.Vector readDeleteableFiles()
Vector result = new Vector();
if (!directory.fileExists(IndexFileNames.DELETABLE))
return result;
IndexInput input = directory.openInput(IndexFileNames.DELETABLE);
try {
for (int i = input.readInt(); i > 0; i--) // read file names
result.addElement(input.readString());
} finally {
input.close();
}
return result;
public void setInfoStream(java.io.PrintStream infoStream)
If non-null, information about merges and a message when
maxFieldLength is reached will be printed to this.
this.infoStream = infoStream;
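For example, to watch merge activity during a large batch build (a minimal sketch, assuming an open IndexWriter named writer):

writer.setInfoStream(System.out);  // merge progress and maxFieldLength messages go to stdout
// ... addDocument calls ...
writer.setInfoStream(null);        // silence the output again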
public void setMaxBufferedDocs(int maxBufferedDocs)
Determines the minimal number of documents required before the buffered in-memory documents are merged and a new segment is created. Since documents are merged in a {@link org.apache.lucene.store.RAMDirectory}, a larger value gives faster indexing. At the same time, mergeFactor limits the number of files open in an FSDirectory. The default value is 10.
if (maxBufferedDocs < 2)
throw new IllegalArgumentException("maxBufferedDocs must at least be 2");
this.minMergeDocs = maxBufferedDocs;
public void setMaxFieldLength(int maxFieldLength)
The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field.
this.maxFieldLength = maxFieldLength;
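For instance, when indexing very large documents the limit can be raised; the value below is purely illustrative, assuming an open IndexWriter named writer:

// Allow up to 1,000,000 terms per field instead of the default 10,000;
// terms beyond the limit are silently dropped from the index.
writer.setMaxFieldLength(1000000);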
public void setMaxMergeDocs(int maxMergeDocs)
Determines the largest number of documents ever merged by addDocument().
Small values (e.g., less than 10,000) are best for interactive indexing,
as this limits the length of pauses while indexing to a few seconds.
Larger values are best for batched indexing and speedier searches.
The default value is {@link Integer#MAX_VALUE}.
this.maxMergeDocs = maxMergeDocs;
public void setMergeFactor(int mergeFactor)
Determines how often segment indices are merged by addDocument(). With
smaller values, less RAM is used while indexing, and searches on
unoptimized indices are faster, but indexing speed is slower. With larger
values, more RAM is used during indexing, and while searches on unoptimized
indices are slower, indexing is faster. Thus larger values (> 10) are best
for batch index creation, and smaller values (< 10) for indices that are
interactively maintained.
This must never be less than 2. The default value is 10.
if (mergeFactor < 2)
throw new IllegalArgumentException("mergeFactor cannot be less than 2");
this.mergeFactor = mergeFactor;
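A sketch of typical tuning for a one-off batch build versus an interactively maintained index; the writer names and specific numbers are illustrative assumptions, not recommendations from the original documentation:

// Batch build: trade RAM and unoptimized-search speed for indexing throughput.
batchWriter.setMergeFactor(50);
batchWriter.setMaxBufferedDocs(1000);
batchWriter.setUseCompoundFile(false);   // fewer file copies, but more open file handles

// Interactive index: keep indexing pauses short and unoptimized searches fast.
interactiveWriter.setMergeFactor(2);     // the minimum allowed value
interactiveWriter.setMaxMergeDocs(10000);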
public void setSimilarity(org.apache.lucene.search.Similarity similarity)
Expert: Set the Similarity implementation used by this IndexWriter.
this.similarity = similarity;
public void setTermIndexInterval(int interval)
Expert: Set the interval between indexed terms. Large values cause less
memory to be used by IndexReader, but slow random-access to terms. Small
values cause more memory to be used by an IndexReader, and speed
random-access to terms.
This parameter determines the amount of computation required per query
term, regardless of the number of documents that contain that term. In
particular, it is the maximum number of other terms that must be
scanned before a term is located and its frequency and position information
may be processed. In a large index with user-entered query terms, query
processing time is likely to be dominated not by term lookup but rather
by the processing of frequency and positional data. In a small index
or when many uncommon query terms are generated (e.g., by wildcard
queries) term lookup may become a dominant cost.
In particular, numUniqueTerms/interval terms are read into
memory by an IndexReader, and, on average, interval/2 terms
must be scanned for each random term access.
this.termIndexInterval = interval;
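As a worked example of the trade-off described above, with a hypothetical term count:

int numUniqueTerms = 1000000;      // hypothetical number of unique terms in the index
int interval = 128;                // DEFAULT_TERM_INDEX_INTERVAL

int termsHeldInMemory = numUniqueTerms / interval;   // 7,812 index terms kept in memory per IndexReader
int avgTermsScanned   = interval / 2;                // about 64 terms scanned per random lookup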
public void setUseCompoundFile(boolean value)
Setting to turn on usage of a compound file. When on, multiple files
for each segment are merged into a single file once the segment creation
is finished. This is done regardless of what directory is in use.
useCompoundFile = value;
private final void writeDeleteableFiles(java.util.Vector files)
IndexOutput output = directory.createOutput("deleteable.new");
try {
output.writeInt(files.size());
for (int i = 0; i < files.size(); i++)
output.writeString((String)files.elementAt(i));
} finally {
output.close();
}
directory.renameFile("deleteable.new", IndexFileNames.DELETABLE);