File | Doc | Category | Size | Date | Package
BreakIterator.java | API Doc | Android 1.5 API | 25053 | Wed May 06 22:41:06 BST 2009 | java.text

BreakIterator

public abstract class BreakIterator extends Object implements Cloneable
Locates boundaries in text. This class defines a protocol for objects that break up a piece of natural-language text according to a set of criteria. Instances or subclasses of {@code BreakIterator} can be provided, for example, to break a piece of text into words, sentences, or logical characters according to the conventions of some language or group of languages. We provide four built-in types of {@code BreakIterator}:
  • {@link #getSentenceInstance()} returns a {@code BreakIterator} that locates boundaries between sentences. This is useful for triple-click selection, for example.
  • {@link #getWordInstance()} returns a {@code BreakIterator} that locates boundaries between words. This is useful for double-click selection or "find whole words" searches. This type of {@code BreakIterator} makes sure there is a boundary position at the beginning and end of each legal word (numbers count as words, too). Whitespace and punctuation are kept separate from real words.
  • {@code getLineInstance()} returns a {@code BreakIterator} that locates positions where it is legal for a text editor to wrap lines. This is similar to word breaking, but not the same: punctuation and whitespace are generally kept with words (you don't want a line to start with whitespace, for example), and some special characters can force a position to be considered a line break position or prevent a position from being a line break position.
  • {@code getCharacterInstance()} returns a {@code BreakIterator} that locates boundaries between logical characters. Because of the structure of the Unicode encoding, a logical character may be stored internally as more than one Unicode code point. (A with an umlaut may be stored as an a followed by a separate combining umlaut character, for example, but the user still thinks of it as one character.) This iterator allows various processes (especially text editors) to treat as characters the units of text that a user would think of as characters, rather than the units of text that the computer sees as "characters".
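
As a rough illustration of how the built-in types differ, the sketch below lists the boundary positions reported by a character iterator and a word iterator. The class name `BreakKindsSketch` and the helper `boundaries` are made up for this example, and the exact positions depend on the break rules shipped with the runtime.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class BreakKindsSketch {
    /** Collects every boundary position the iterator reports for the text. */
    public static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> positions = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        // 'a' followed by a combining diaeresis (U+0308), then 'b':
        // three chars in the string, but two characters to the user.
        String accented = "a\u0308b";
        System.out.println(boundaries(BreakIterator.getCharacterInstance(), accented));

        // Word iteration reports boundaries around the space as well.
        System.out.println(boundaries(BreakIterator.getWordInstance(), "Hi there"));
    }
}
```

On a standard JDK the character iterator should report no boundary between the `a` and its combining mark, which is the behavior the description of logical characters above refers to.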
{@code BreakIterator}'s interface follows an "iterator" model (hence the name), meaning it has a concept of a "current position" and methods like {@code first()}, {@code last()}, {@code next()}, and {@code previous()} that update the current position. All {@code BreakIterator}s uphold the following invariants:
  • The beginning and end of the text are always treated as boundary positions.
  • The current position of the iterator is always a boundary position (random-access methods move the iterator to the nearest boundary position before or after the specified position, not to the specified position).
  • {@code DONE} is used as a flag to indicate when iteration has stopped. {@code DONE} is only returned when the current position is the end of the text and the user calls {@code next()}, or when the current position is the beginning of the text and the user calls {@code previous()}.
  • Break positions are numbered by the positions of the characters that follow them. Thus, under normal circumstances, the position before the first character is 0, the position after the first character is 1, and the position after the last character is 1 plus the index of the last character (that is, the length of the string).
  • The client can change the position of an iterator, or the text it analyzes, at will, but cannot change its behavior. A client that wants different behavior must instantiate a new iterator.
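
The invariants listed above can be checked directly. This is only a small sketch; `BoundaryInvariantsSketch` and `holdsFor` are names invented for the example, and it uses a word iterator, although any of the built-in types should behave the same way at the two ends of the text.

```java
import java.text.BreakIterator;

public class BoundaryInvariantsSketch {
    /** Returns true if the start/end and DONE invariants hold for the text. */
    public static boolean holdsFor(String text) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);

        boolean ok = it.first() == 0;                 // beginning is a boundary
        ok &= it.previous() == BreakIterator.DONE;    // nothing before the beginning
        ok &= it.current() == 0;                      // position did not move
        ok &= it.last() == text.length();             // end is a boundary
        ok &= it.next() == BreakIterator.DONE;        // nothing after the end
        ok &= it.current() == text.length();          // position did not move
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(holdsFor("Boundaries at both ends."));
    }
}
```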

{@code BreakIterator} accesses the text it analyzes through a {@link CharacterIterator}, which makes it possible to use {@code BreakIterator} to analyze text in any text-storage vehicle that provides a {@code CharacterIterator} interface.
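
For instance, text can be handed over through a `StringCharacterIterator` instead of a `String`. The sketch below is one way to do that; the class and method names are illustrative only.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class CharacterIteratorSketch {
    /** Counts the word boundaries found in text supplied as a CharacterIterator. */
    public static int countBoundaries(CharacterIterator text) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);
        int count = 0;
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Any CharacterIterator implementation could be passed here.
        CharacterIterator source = new StringCharacterIterator("ab cd");
        System.out.println(countBoundaries(source));
    }
}
```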

Note: Some types of {@code BreakIterator} can take a long time to create, and instances of {@code BreakIterator} are not currently cached by the system. For optimal performance, keep instances of {@code BreakIterator} around as long as it makes sense. For example, when word-wrapping a document, don't create and destroy a new {@code BreakIterator} for each line. Create one break iterator for the whole document (or whatever stretch of text you're wrapping) and use it to do the whole job of wrapping the text.
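
One way to follow this advice is to create the iterator once and call `setText` for each new stretch of text. The sketch below counts possible line-break positions per paragraph with a single shared iterator; `ReuseIteratorSketch` and `countBreakPositions` are hypothetical names.

```java
import java.text.BreakIterator;

public class ReuseIteratorSketch {
    /** Counts the break positions after the start of each paragraph. */
    public static int[] countBreakPositions(String[] paragraphs) {
        // Created once, outside the loop, and reused for every paragraph.
        BreakIterator lineBreaks = BreakIterator.getLineInstance();
        int[] counts = new int[paragraphs.length];
        for (int i = 0; i < paragraphs.length; i++) {
            lineBreaks.setText(paragraphs[i]);   // resets the position to the start
            lineBreaks.first();
            while (lineBreaks.next() != BreakIterator.DONE) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countBreakPositions(new String[] {"abc", "aa bb cc"});
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```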

Examples:

Creating and using text boundaries:

public static void main(String[] args) {
    if (args.length == 1) {
        String stringToExamine = args[0];
        // print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);
        // print each sentence in reverse order
        boundary = BreakIterator.getSentenceInstance(Locale.US);
        boundary.setText(stringToExamine);
        printEachBackward(boundary, stringToExamine);
        printFirst(boundary, stringToExamine);
        printLast(boundary, stringToExamine);
    }
}

Print each element in order:

public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}

Print each element in reverse order:

public static void printEachBackward(BreakIterator boundary, String source) {
    int end = boundary.last();
    for (int start = boundary.previous(); start != BreakIterator.DONE; end = start, start = boundary.previous()) {
        System.out.println(source.substring(start, end));
    }
}

Print the first element:

public static void printFirst(BreakIterator boundary, String source) {
    int start = boundary.first();
    int end = boundary.next();
    System.out.println(source.substring(start, end));
}

Print the last element:

public static void printLast(BreakIterator boundary, String source) {
    int end = boundary.last();
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}

Print the element at a specified position:

public static void printAt(BreakIterator boundary, int pos, String source) {
    int end = boundary.following(pos);
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}

Find the next word:

public static int nextWordStartAfter(int pos, String text) {
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int last = wb.following(pos);
    int current = wb.next();
    while (current != BreakIterator.DONE) {
        // a run between two boundaries is a word if it contains a letter
        for (int p = last; p < current; p++) {
            if (Character.isLetter(text.charAt(p))) {
                return last;
            }
        }
        last = current;
        current = wb.next();
    }
    return BreakIterator.DONE;
}

The iterator returned by {@code BreakIterator.getWordInstance()} is unique in that the break positions it returns don't represent both the start and end of the thing being iterated over. That is, a sentence-break iterator returns breaks that each represent the end of one sentence and the beginning of the next. With the word-break iterator, the characters between two boundaries might be a word, or they might be the punctuation or whitespace between two words. The above code uses a simple heuristic to determine which boundary is the beginning of a word: if the characters between this boundary and the next boundary include at least one letter (this can be an alphabetical letter, a CJK ideograph, a Hangul syllable, a Kana character, etc.), then the text between this boundary and the next is a word; otherwise, it's the material between words.
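
The same heuristic can be used to collect only the words of a string. The sketch below is a variation rather than the code above: it tests for letters or digits, on the assumption that numbers should count as words too, and the names `WordExtractionSketch` and `words` are invented for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class WordExtractionSketch {
    /** Returns the runs between word boundaries that contain a letter or digit. */
    public static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        List<String> result = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            if (containsLetterOrDigit(text, start, end)) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    /** Heuristic: a run is a word if any char in [start, end) is a letter or digit. */
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int p = start; p < end; p++) {
            if (Character.isLetterOrDigit(text.charAt(p))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

Punctuation and whitespace runs are skipped because they contain no letter or digit.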

see
CharacterIterator
since
Android 1.0

Fields Summary
public static final int
DONE
This constant is returned by iteration methods like {@code previous()} or {@code next()} once they have returned all valid boundaries.
private static final int
LONG_LENGTH
private static final int
INT_LENGTH
private static final int
SHORT_LENGTH
com.ibm.icu4jni.text.BreakIterator
wrapped
Constructors Summary
protected BreakIterator()
Default constructor, just for invocation by a subclass.

since
Android 1.0


    /*
     * -----------------------------------------------------------------------
     * constructors
     * -----------------------------------------------------------------------
     */
                     
      
        super();
    
BreakIterator(com.ibm.icu4jni.text.BreakIterator iterator)

        wrapped = iterator;
    
Methods Summary
public java.lang.Object clone()
Creates a copy of this iterator. All status information, including the current position, is kept the same.

return
a copy of this iterator.
since
Android 1.0

        try {
            BreakIterator cloned = (BreakIterator) super.clone();
            cloned.wrapped = (com.ibm.icu4jni.text.BreakIterator) wrapped.clone();
            return cloned;
        } catch (CloneNotSupportedException e) {
            throw new InternalError(e.getMessage());
        }
    
public abstract int current()
Returns this iterator's current position.

return
this iterator's current position.
since
Android 1.0

public abstract int first()
Sets this iterator's current position to the first boundary and returns that position.

return
the position of the first boundary.
since
Android 1.0

public abstract int following(int offset)
Sets this iterator's current position to the first boundary following the given offset and returns that position. Returns {@code DONE} if there is no boundary after the given offset.

param
offset the given position to be searched for.
return
the position of the first boundary following the given offset.
since
Android 1.0

public static java.util.Locale[] getAvailableLocales()
Returns all supported locales in an array.

return
all supported locales.
since
Android 1.0

        return com.ibm.icu4jni.text.BreakIterator.getAvailableLocales();
    
public static java.text.BreakIterator getCharacterInstance()
Returns a new instance of {@code BreakIterator} to iterate over characters using the default locale.

return
a new instance of {@code BreakIterator} using the default locale.
since
Android 1.0

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getCharacterInstance());
    
public static java.text.BreakIterator getCharacterInstance(java.util.Locale where)
Returns a new instance of {@code BreakIterator} to iterate over characters using the given locale.

param
where the given locale.
return
a new instance of {@code BreakIterator} using the given locale.
throws
NullPointerException if {@code where} is {@code null}.
since
Android 1.0

        if (where == null) {
            throw new NullPointerException();
        }

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getCharacterInstance(where));
    
protected static int getInt(byte[] buf, int offset)
Gets an int value from the given byte array, starting from the given offset.

param
buf the bytes to be converted.
param
offset the start position of the conversion.
return
the converted int value.
throws
NullPointerException if {@code buf} is {@code null}.
throws
ArrayIndexOutOfBoundsException if {@code offset < 0} or {@code offset + INT_LENGTH} is greater than the length of {@code buf}.
since
Android 1.0

        if (null == buf) {
            throw new NullPointerException();
        }
        if (offset < 0 || buf.length - INT_LENGTH < offset) {
            throw new ArrayIndexOutOfBoundsException();
        }
        int result = 0;
        for (int i = offset; i < offset + INT_LENGTH; i++) {
            result = (result << 8) | (buf[i] & 0xff);
        }
        return result;
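
The loop above reads the bytes most-significant first (big-endian). Because `getInt` is protected, the following standalone sketch repeats the same logic for illustration; `BigEndianDecodeSketch` and `readInt` are hypothetical names, and the value 4 for `INT_LENGTH` is an assumption, since the listing does not show the constant's value.

```java
public class BigEndianDecodeSketch {
    // Assumed width of an int in bytes; the listing does not show the constant's value.
    private static final int INT_LENGTH = 4;

    /** Decodes INT_LENGTH bytes starting at offset, most significant byte first. */
    public static int readInt(byte[] buf, int offset) {
        if (buf == null) {
            throw new NullPointerException();
        }
        if (offset < 0 || buf.length - INT_LENGTH < offset) {
            throw new ArrayIndexOutOfBoundsException();
        }
        int result = 0;
        for (int i = offset; i < offset + INT_LENGTH; i++) {
            // Shift what has been read so far and append the next byte, unsigned.
            result = (result << 8) | (buf[i] & 0xff);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = {0x12, 0x34, 0x56, 0x78};
        System.out.println(Integer.toHexString(readInt(data, 0)));
    }
}
```

The `& 0xff` mask matters because Java bytes are signed; without it, a byte such as `(byte) 0xff` would sign-extend and overwrite the higher bits already accumulated.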
    
public static java.text.BreakIterator getLineInstance()
Returns a new instance of {@code BreakIterator} to iterate over line breaks using the default locale.

return
a new instance of {@code BreakIterator} using the default locale.
since
Android 1.0

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getLineInstance());
    
public static java.text.BreakIterator getLineInstance(java.util.Locale where)
Returns a new instance of {@code BreakIterator} to iterate over line breaks using the given locale.

param
where the given locale.
return
a new instance of {@code BreakIterator} using the given locale.
throws
NullPointerException if {@code where} is {@code null}.
since
Android 1.0

        if (where == null) {
            throw new NullPointerException();
        }

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getLineInstance(where));
    
protected static long getLong(byte[] buf, int offset)
Gets a long value from the given byte array, starting from the given offset.

param
buf the bytes to be converted.
param
offset the start position of the conversion.
return
the converted long value.
throws
NullPointerException if {@code buf} is {@code null}.
throws
ArrayIndexOutOfBoundsException if {@code offset < 0} or {@code offset + LONG_LENGTH} is greater than the length of {@code buf}.
since
Android 1.0

        if (null == buf) {
            throw new NullPointerException();
        }
        if (offset < 0 || buf.length - offset < LONG_LENGTH) {
            throw new ArrayIndexOutOfBoundsException();
        }
        long result = 0;
        for (int i = offset; i < offset + LONG_LENGTH; i++) {
            result = (result << 8) | (buf[i] & 0xff);
        }
        return result;
    
public static java.text.BreakIterator getSentenceInstance()
Returns a new instance of {@code BreakIterator} to iterate over sentence-breaks using the default locale.

return
a new instance of {@code BreakIterator} using the default locale.
since
Android 1.0

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getSentenceInstance());
    
public static java.text.BreakIterator getSentenceInstance(java.util.Locale where)
Returns a new instance of {@code BreakIterator} to iterate over sentence-breaks using the given locale.

param
where the given locale.
return
a new instance of {@code BreakIterator} using the given locale.
throws
NullPointerException if {@code where} is {@code null}.
since
Android 1.0

        if (where == null) {
            throw new NullPointerException();
        }

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getSentenceInstance(where));
    
protected static short getShort(byte[] buf, int offset)
Gets a short value from the given byte array, starting from the given offset.

param
buf the bytes to be converted.
param
offset the start position of the conversion.
return
the converted short value.
throws
NullPointerException if {@code buf} is {@code null}.
throws
ArrayIndexOutOfBoundsException if {@code offset < 0} or {@code offset + SHORT_LENGTH} is greater than the length of {@code buf}.
since
Android 1.0

        if (null == buf) {
            throw new NullPointerException();
        }
        if (offset < 0 || buf.length - SHORT_LENGTH < offset) {
            throw new ArrayIndexOutOfBoundsException();
        }
        short result = 0;
        for (int i = offset; i < offset + SHORT_LENGTH; i++) {
            result = (short) ((result << 8) | (buf[i] & 0xff));
        }
        return result;
    
public abstract java.text.CharacterIterator getText()
Returns a {@code CharacterIterator} which represents the text being analyzed. Please note that the returned value is probably the internal iterator used by this object. If the invoker wants to modify the status of the returned iterator, it is recommended to first create a clone of the iterator returned.

return
a {@code CharacterIterator} which represents the text being analyzed.
since
Android 1.0

public static java.text.BreakIterator getWordInstance()
Returns a new instance of {@code BreakIterator} to iterate over word-breaks using the default locale.

return
a new instance of {@code BreakIterator} using the default locale.
since
Android 1.0

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getWordInstance());
    
public static java.text.BreakIterator getWordInstance(java.util.Locale where)
Returns a new instance of {@code BreakIterator} to iterate over word-breaks using the given locale.

param
where the given locale.
return
a new instance of {@code BreakIterator} using the given locale.
throws
NullPointerException if {@code where} is {@code null}.
since
Android 1.0

        if (where == null) {
            throw new NullPointerException();
        }

        return new RuleBasedBreakIterator(com.ibm.icu4jni.text.BreakIterator
                .getWordInstance(where));
    
public boolean isBoundary(int offset)
Indicates whether the given offset is a boundary position. If this method returns true, the current iteration position is set to the given position; if the function returns false, the current iteration position is set as though {@link #following(int)} had been called.

param
offset the given offset to check.
return
{@code true} if the given offset is a boundary position; {@code false} otherwise.
since
Android 1.0

        return wrapped.isBoundary(offset);
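
A short sketch of how `isBoundary` might be used with a word iterator follows. The class and method names are illustrative, and the comments about which offsets are boundaries assume the usual word-break rules for plain English text.

```java
import java.text.BreakIterator;

public class IsBoundarySketch {
    /** Reports whether offset falls on a word boundary of text. */
    public static boolean isWordBoundary(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        return wb.isBoundary(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(isWordBoundary(text, 5)); // end of "Hello"
        System.out.println(isWordBoundary(text, 2)); // inside "Hello"
    }
}
```

Because the call also moves the iterator, code that needs the original position afterwards should save `current()` first or work on a clone.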
    
public abstract int last()
Sets this iterator's current position to the last boundary and returns that position.

return
the position of the last boundary.
since
Android 1.0

public abstract int next()
Sets this iterator's current position to the next boundary after the current position, and returns this position. Returns {@code DONE} if no boundary was found after the current position.

return
the position of the next boundary, or {@code DONE} if there is none.
since
Android 1.0

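public abstract int next(int n)
Moves this iterator's current position by {@code n} boundaries and returns the new position: forward if {@code n} is positive, backward if {@code n} is negative, and not at all if {@code n} is zero. Returns {@code DONE} if the first or last boundary is reached before {@code n} boundaries have been traversed.

param
n the number of boundaries to move; a negative value moves backward.
return
the position of the resulting boundary, or {@code DONE}.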
since
Android 1.0
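
The sketch below assumes the standard `java.text` behavior, in which the argument of `next(int)` is a count of boundaries relative to the current position rather than a text offset. `NextNSketch` and `moveFromStart` are names made up for the example.

```java
import java.text.BreakIterator;

public class NextNSketch {
    /** Positions a word iterator at the start of text, then moves n boundaries. */
    public static int moveFromStart(String text, int n) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        wb.first();
        return wb.next(n);
    }

    public static void main(String[] args) {
        // Word boundaries of "Hello world" fall at 0, 5, 6 and 11.
        String text = "Hello world";
        System.out.println(moveFromStart(text, 2));
        System.out.println(moveFromStart(text, 10) == BreakIterator.DONE);
    }
}
```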

public int preceding(int offset)
Sets this iterator's current position to the last boundary preceding the given offset and returns that position. Returns {@code DONE} if the given offset is the starting position of the text, so that no boundary precedes it.

param
offset the given start position to be searched for.
return
the position of the last boundary preceding the given offset.
since
Android 1.0

        return wrapped.preceding(offset);
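
`following(int)` and `preceding(int)` together make it possible to find the boundaries on either side of an arbitrary offset. The following is a minimal sketch with invented names (`RandomAccessSketch`, `boundaryAfter`, `boundaryBefore`), using a word iterator.

```java
import java.text.BreakIterator;

public class RandomAccessSketch {
    /** First word boundary strictly after offset, or DONE. */
    public static int boundaryAfter(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        return wb.following(offset);
    }

    /** Last word boundary strictly before offset, or DONE. */
    public static int boundaryBefore(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        return wb.preceding(offset);
    }

    public static void main(String[] args) {
        // Word boundaries of "Hello world" fall at 0, 5, 6 and 11.
        String text = "Hello world";
        System.out.println(boundaryAfter(text, 3));
        System.out.println(boundaryBefore(text, 8));
        System.out.println(boundaryBefore(text, 0) == BreakIterator.DONE);
    }
}
```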
    
public abstract int previous()
Sets this iterator's current position to the previous boundary before the current position and returns that position. Returns {@code DONE} if no boundary was found before the current position.

return
the position of the previous boundary, or {@code DONE} if there is none.
since
Android 1.0

public void setText(java.lang.String newText)
Sets the new text string to be analyzed. The current position is reset to the beginning of the new string, and the old string is discarded.

param
newText the new text string to be analyzed.
since
Android 1.0

        wrapped.setText(newText);
    
public abstract void setText(java.text.CharacterIterator newText)
Sets the new text to be analyzed by the given {@code CharacterIterator}. The position will be reset to the beginning of the new text, and other status information of this iterator will be kept.

param
newText the {@code CharacterIterator} referring to the text to be analyzed.
since
Android 1.0