Locates boundaries in text. This class defines a protocol for objects that
break up a piece of natural-language text according to a set of criteria.
Instances or subclasses of {@code BreakIterator} can be provided, for
example, to break a piece of text into words, sentences, or logical
characters according to the conventions of some language or group of
languages. We provide four built-in types of {@code BreakIterator}:
- {@link #getSentenceInstance()} returns a {@code BreakIterator} that
locates boundaries between sentences. This is useful for triple-click
selection, for example.
- {@link #getWordInstance()} returns a {@code BreakIterator} that locates
boundaries between words. This is useful for double-click selection or "find
whole words" searches. This type of {@code BreakIterator} makes sure there is
a boundary position at the beginning and end of each legal word (numbers
count as words, too). Whitespace and punctuation are kept separate from real
words.
- {@link #getLineInstance()} returns a {@code BreakIterator} that locates
positions where it is legal for a text editor to wrap lines. This is similar
to word breaking, but not the same: punctuation and whitespace are generally
kept with words (you don't want a line to start with whitespace, for
example), and some special characters can force a position to be considered a
line break position or prevent a position from being a line break position.
- {@link #getCharacterInstance()} returns a {@code BreakIterator} that
locates boundaries between logical characters. Because of the structure of
the Unicode encoding, a logical character may be stored internally as more
than one Unicode code point. (A with an umlaut may be stored as an a followed
by a separate combining umlaut character, for example, but the user still
thinks of it as one character.) This iterator allows various processes
(especially text editors) to treat as characters the units of text that a
user would think of as characters, rather than the units of text that the
computer sees as "characters".
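As a rough illustration of how the four built-in types differ, the sketch below runs each one over a short sample and collects the text between consecutive boundaries. The class name {@code BreakTypesDemo}, the helper {@code segments}, and the sample strings are assumptions made for this example only; the exact boundaries depend on the locale data of the runtime.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakTypesDemo {
    // Hypothetical helper: returns the pieces of text between consecutive boundaries.
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hi there. How are you?";
        // Sentence breaking: each piece is one sentence, including trailing whitespace.
        System.out.println(segments(BreakIterator.getSentenceInstance(Locale.US), text));
        // Word breaking: whitespace and punctuation come back as their own pieces.
        System.out.println(segments(BreakIterator.getWordInstance(Locale.US), text));
        // Line breaking: whitespace and punctuation stay attached to the preceding word.
        System.out.println(segments(BreakIterator.getLineInstance(Locale.US), text));
        // Character breaking: "a" + U+0308 (combining diaeresis) is two code points
        // but is treated as a single logical character.
        System.out.println(segments(BreakIterator.getCharacterInstance(Locale.US), "a\u0308b"));
    }
}
```

Passing an explicit {@code Locale} keeps the output independent of the default locale of the machine running the sketch.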
{@code BreakIterator}'s interface follows an "iterator" model (hence
the name), meaning it has a concept of a "current position" and methods like
{@code first()}, {@code last()}, {@code next()}, and {@code previous()} that
update the current position. All {@code BreakIterator}s uphold the following
invariants:
- The beginning and end of the text are always treated as boundary
positions.
- The current position of the iterator is always a boundary position
(random-access methods move the iterator to the nearest boundary position
before or after the specified position, not to the specified
position).
- {@code DONE} is used as a flag to indicate when iteration has stopped.
{@code DONE} is only returned when the current position is the end of the
text and the user calls {@code next()}, or when the current position is the
beginning of the text and the user calls {@code previous()}.
- Break positions are numbered by the positions of the characters that
follow them. Thus, under normal circumstances, the position before the first
character is 0, the position after the first character is 1, and the position
after the last character is equal to the length of the string (that is, 1 plus
the index of the last character).
- The client can change the position of an iterator, or the text it
analyzes, at will, but cannot change its behavior. A client that wants
different behavior must instantiate a new iterator.
{@code BreakIterator} accesses the text it analyzes through a
{@link CharacterIterator}, which makes it possible to use {@code
BreakIterator} to analyze text in any text-storage vehicle that provides a
{@code CharacterIterator} interface.
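A minimal sketch of analyzing text through a {@code CharacterIterator}: here a {@code StringCharacterIterator} exposes only a slice of a larger buffer, and the break iterator is handed that slice instead of a {@code String}. The class name {@code CharacterIteratorDemo}, the helper {@code boundaries}, and the buffer contents are assumptions for this example. In this sketch the boundaries come back as indices into the underlying buffer rather than offsets relative to the slice.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CharacterIteratorDemo {
    // Hypothetical helper: lists all boundaries found in the range the iterator exposes.
    static List<Integer> boundaries(BreakIterator it, CharacterIterator text) {
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        String buffer = ">> Hello world <<";
        // Expose only indices 3 (inclusive) to 14 (exclusive), i.e. "Hello world".
        CharacterIterator slice = new StringCharacterIterator(buffer, 3, 14, 3);
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        for (int pos : boundaries(words, slice)) {
            System.out.println(pos);
        }
    }
}
```

Any other storage (a gap buffer or a rope, for instance) could be analyzed the same way by supplying its own {@code CharacterIterator} implementation.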
Note: Some types of {@code BreakIterator} can take a long time to
create, and instances of {@code BreakIterator} are not currently cached by
the system. For optimal performance, keep instances of {@code BreakIterator}
around as long as it makes sense. For example, when word-wrapping a document,
don't create and destroy a new {@code BreakIterator} for each line. Create
one break iterator for the whole document (or whatever stretch of text you're
wrapping) and use it to do the whole job of wrapping the text.
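The advice above might look like the following sketch, which creates one line-break iterator and re-targets it with {@code setText} for each paragraph instead of constructing a new iterator each time. The class name {@code WrapDemo}, the method {@code wrapAll}, and the greedy width rule are assumptions for illustration, not a complete line-wrapping algorithm; in particular, width is counted in {@code char}s and includes trailing whitespace.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WrapDemo {
    // Hypothetical greedy wrapper: one iterator serves every paragraph.
    static List<String> wrapAll(List<String> paragraphs, int width) {
        BreakIterator lines = BreakIterator.getLineInstance(Locale.US); // created once
        List<String> out = new ArrayList<>();
        for (String p : paragraphs) {
            lines.setText(p); // re-target the existing iterator
            int lineStart = lines.first();
            int lastBreak = lineStart;
            for (int brk = lines.next(); brk != BreakIterator.DONE; brk = lines.next()) {
                // If extending to this break would exceed the width,
                // end the line at the previous legal break.
                if (brk - lineStart > width && lastBreak > lineStart) {
                    out.add(p.substring(lineStart, lastBreak).trim());
                    lineStart = lastBreak;
                }
                lastBreak = brk;
            }
            if (lineStart < p.length()) {
                out.add(p.substring(lineStart).trim()); // remainder of the paragraph
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : wrapAll(List.of("aaa bbb ccc ddd", "one two"), 8)) {
            System.out.println(line);
        }
    }
}
```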
Examples:
Creating and using text boundaries:
    // Requires: import java.text.BreakIterator; import java.util.Locale;
    public static void main(String[] args) {
        if (args.length == 1) {
            String stringToExamine = args[0];
            // Print each word in order
            BreakIterator boundary = BreakIterator.getWordInstance();
            boundary.setText(stringToExamine);
            printEachForward(boundary, stringToExamine);
            // Print each sentence in reverse order
            boundary = BreakIterator.getSentenceInstance(Locale.US);
            boundary.setText(stringToExamine);
            printEachBackward(boundary, stringToExamine);
            printFirst(boundary, stringToExamine);
            printLast(boundary, stringToExamine);
        }
    }
Print each element in order:
    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next();
                end != BreakIterator.DONE;
                start = end, end = boundary.next()) {
            System.out.println(source.substring(start, end));
        }
    }
Print each element in reverse order:
    public static void printEachBackward(BreakIterator boundary, String source) {
        int end = boundary.last();
        for (int start = boundary.previous();
                start != BreakIterator.DONE;
                end = start, start = boundary.previous()) {
            System.out.println(source.substring(start, end));
        }
    }
Print the first element:
    public static void printFirst(BreakIterator boundary, String source) {
        int start = boundary.first();
        int end = boundary.next();
        System.out.println(source.substring(start, end));
    }
Print the last element:
    public static void printLast(BreakIterator boundary, String source) {
        int end = boundary.last();
        int start = boundary.previous();
        System.out.println(source.substring(start, end));
    }
Print the element at a specified position:
    public static void printAt(BreakIterator boundary, int pos, String source) {
        int end = boundary.following(pos);
        int start = boundary.previous();
        System.out.println(source.substring(start, end));
    }
Find the next word:
    public static int nextWordStartAfter(int pos, String text) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        int last = wb.following(pos);
        int current = wb.next();
        while (current != BreakIterator.DONE) {
            // A segment containing at least one letter is treated as a word.
            for (int p = last; p < current; p++) {
                if (Character.isLetter(text.charAt(p))) {
                    return last;
                }
            }
            last = current;
            current = wb.next();
        }
        return BreakIterator.DONE;
    }
The iterator returned by {@code BreakIterator.getWordInstance()} is unique in
that the break positions it returns don't represent both the start and end of
the thing being iterated over. A sentence-break iterator, by contrast, returns
breaks that each represent the end of one sentence and the beginning of the
next. With the word-break iterator, the characters between two boundaries
might be a word, or they might be the punctuation or whitespace between two
words. The above code uses a simple heuristic to determine which boundary is
the beginning of a word: if the characters between this boundary and the next
boundary include at least one letter (this can be an alphabetical letter, a
CJK ideograph, a Hangul syllable, a Kana character, etc.), then the text
between this boundary and the next is a word; otherwise, it's the material
between words.
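The same heuristic can be used to collect the words themselves rather than the start of the next one. The sketch below is a variation, not the code above: it tests code points with {@code Character.isLetterOrDigit} so that numbers are also kept as words, where the heuristic described above tests only for letters. The class name {@code WordExtractor} and the method {@code words} are assumptions for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordExtractor {
    // Hypothetical helper: keeps only segments that contain a letter or digit.
    static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        List<String> out = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String segment = text.substring(start, end);
            // Segments made up only of whitespace or punctuation are skipped.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42"));
    }
}
```

Checking code points rather than individual {@code char}s means supplementary characters are classified as a whole instead of as two surrogate halves.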