Concrete implementation class for Collation.
The collation table is composed of a list of collation rules, where each
rule takes one of three forms:
< modifier >
< relation > < text-argument >
< reset > < text-argument >
RuleBasedCollator has the following restrictions for efficiency
(other subclasses may be used for more complex languages):
- If a French secondary ordering is specified it applies to the whole
collator object.
- All non-mentioned Unicode characters are at the end of the collation
order.
- If a character is not located in the RuleBasedCollator, the default
Unicode Collation Algorithm (UCA) rulebased table is automatically
searched as a backup.
The following demonstrates how to create your own collation rules:
- Text-Argument: A text-argument is any sequence of
characters, excluding special characters (that is, common whitespace
characters [0009-000D, 0020] and rule syntax characters [0021-002F,
003A-0040, 005B-0060, 007B-007E]). If those characters are desired,
you can put them in single quotes (e.g. ampersand => '&'). Note that
unquoted white space characters are ignored; e.g. "b c" is treated as
"bc".
- Modifier: There is a single modifier which is used
to specify that all accents (secondary differences) are backwards.
'@' : Indicates that accents are sorted backwards, as in French.
- Relation: The relations are the following:
- '<' : Greater, as a letter difference (primary)
- ';' : Greater, as an accent difference (secondary)
- ',' : Greater, as a case difference (tertiary)
- '=' : Equal
- Reset: There is a single reset which is used
primarily for contractions and expansions, but which can also be used
to add a modification at the end of a set of rules.
'&' : Indicates that the next rule follows the position to where
the reset text-argument would be sorted.
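As a rough illustration of the relation levels listed above, the sketch below builds a tiny rule set and compares strings at different strengths. It assumes the JDK's java.text.RuleBasedCollator as a stand-in for the class described here; the class name RelationStrengthDemo and the helper compareAt are hypothetical names used only for this example.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; assumes the JDK's java.text.RuleBasedCollator.
class RelationStrengthDemo {

    // Compares two strings under a tiny rule set at the given strength.
    static int compareAt(int strength, String left, String right) {
        try {
            // 'a' and 'A' differ only by case (','); 'b' is a different letter ('<').
            RuleBasedCollator collator = new RuleBasedCollator("< a , A < b , B");
            collator.setStrength(strength);
            return collator.compare(left, right);
        } catch (ParseException e) {
            throw new IllegalStateException("Rule string was rejected", e);
        }
    }

    public static void main(String[] args) {
        // Tertiary strength: the case difference between "a" and "A" counts.
        System.out.println(compareAt(Collator.TERTIARY, "a", "A"));
        // Secondary strength: case differences are no longer considered.
        System.out.println(compareAt(Collator.SECONDARY, "a", "A"));
        // Primary strength: only letter differences such as "A" vs "b" remain.
        System.out.println(compareAt(Collator.PRIMARY, "A", "b"));
    }
}
```

The strength setting decides how many of the levels declared in the rules take part in a comparison, so the same rule string can serve both case-sensitive and case-insensitive sorting.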
This sounds more complicated than it is in practice. For example, the
following are equivalent ways of expressing the same thing:
a < b < c
a < b & b < c
a < c & a < b
Notice that the order is important, as the subsequent item goes immediately
after the text-argument. The following are not equivalent:
a < b & a < c
a < c & a < b
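The effect of reset order can be checked directly. The following is a minimal sketch, assuming the JDK's java.text.RuleBasedCollator (which requires the leading "<" that the fragments above omit); ResetOrderDemo and compareUnder are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; assumes the JDK's java.text.RuleBasedCollator.
class ResetOrderDemo {

    // Builds a collator from the rules and compares the two strings with it.
    static int compareUnder(String rules, String left, String right) {
        try {
            return new RuleBasedCollator(rules).compare(left, right);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Three spellings that should all place b before c.
        String[] equivalent = { "< a < b < c", "< a < b & b < c", "< a < c & a < b" };
        for (String rules : equivalent) {
            System.out.println(rules + " -> " + compareUnder(rules, "b", "c"));
        }
        // Here c is inserted immediately after a, which pushes b behind c.
        String different = "< a < b & a < c";
        System.out.println(different + " -> " + compareUnder(different, "b", "c"));
    }
}
```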
Either the text-argument must already be present in the sequence, or some
initial substring of the text-argument must be present. (e.g. "a < b & ae <
e" is valid since "a" is present in the sequence before "ae" is reset). In
this latter case, "ae" is not entered and treated as a single character;
instead, "e" is sorted as if it were expanded to two characters: "a"
followed by an "e". This difference appears in natural languages: in
traditional Spanish "ch" is treated as though it contracts to a single
character (expressed as "c < ch < d"), while in traditional German a-umlaut
is treated as though it expanded to two characters (expressed as "a,A < b,B
... & ae;\u00E4 & AE;\u00C4"). [\u00E4 and \u00C4 are, of course, the escape
sequences for a-umlaut.]
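The difference between a contraction and an expansion can be sketched with two small rule sets. This example assumes the JDK's java.text.RuleBasedCollator; the class name, the constants, and the reduced alphabets are hypothetical and chosen only to keep the rules short.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; assumes the JDK's java.text.RuleBasedCollator.
class ContractionExpansionDemo {

    // Contraction: "ch" is entered as a single unit between c and d.
    static final String SPANISH_LIKE = "< c < ch < d < h < z";
    // Same letters without the contraction, for contrast.
    static final String PLAIN = "< c < d < h < z";
    // Expansion: a-umlaut sorts like "a" followed by "e", with a secondary difference.
    static final String GERMAN_LIKE = "< a < b < c < d < e < f & ae ; \u00E4";

    static int compare(String rules, String left, String right) {
        try {
            return new RuleBasedCollator(rules).compare(left, right);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // With the contraction, "ch" sorts after every "c..." string but before "d".
        System.out.println(compare(SPANISH_LIKE, "ch", "cz"));
        System.out.println(compare(SPANISH_LIKE, "ch", "d"));
        // Without it, "ch" is just c followed by h.
        System.out.println(compare(PLAIN, "ch", "cz"));
        // With the expansion, a-umlaut falls between "ad" and "af".
        System.out.println(compare(GERMAN_LIKE, "\u00E4", "ad"));
        System.out.println(compare(GERMAN_LIKE, "\u00E4", "af"));
    }
}
```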
Ignorable Characters
For ignorable characters, the first rule must start with a relation (the
examples we have used above are really fragments; "a < b" really should be
"< a < b"). If, however, the first relation is not "<", then all the
text-arguments up to the first "<" are ignorable. For example, ", '-' < a < b"
makes "-" an ignorable character, as we saw earlier in the word
"black-birds". In the samples for different languages, you see that most
accents are ignorable.
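A minimal sketch of an ignorable character, assuming the JDK's java.text.RuleBasedCollator: the hyphen is listed before the first "<" (and quoted, since it is a rule syntax character), and the comparison is made at primary strength so that only letter differences count. IgnorableDemo, RULES, and comparePrimary are hypothetical names.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; assumes the JDK's java.text.RuleBasedCollator.
class IgnorableDemo {

    // The quoted hyphen precedes the first '<', so it has no primary weight.
    static final String RULES = ", '-' < a < b < c < d < i < k < l < r < s";

    // Compares at primary strength, where ignorable characters play no role.
    static int comparePrimary(String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(Collator.PRIMARY);
            return collator.compare(left, right);
        } catch (ParseException e) {
            throw new IllegalStateException("Rule string was rejected", e);
        }
    }

    public static void main(String[] args) {
        // The hyphen is skipped, so both spellings compare as the same letters.
        System.out.println(comparePrimary("black-birds", "blackbirds"));
        // Real letter differences still decide the order.
        System.out.println(comparePrimary("black-birds", "blackbirs"));
    }
}
```

At the default (tertiary) strength the hyphen may still act as a tie-breaker between otherwise equal strings, which is why the sketch lowers the strength before comparing.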
Normalization and Accents
RuleBasedCollator automatically processes its rule table to
include both pre-composed and combining-character versions of accented
characters. Even if the provided rule string contains only base characters
and separate combining accent characters, the pre-composed accented
characters matching all canonical combinations of characters from the rule
string will be entered in the table.
This allows you to use a RuleBasedCollator to compare accented strings even
when the collator is set to NO_DECOMPOSITION. However, if the strings to be
collated contain combining sequences that may not be in canonical order, you
should set the collator to CANONICAL_DECOMPOSITION to enable sorting of
combining sequences.
For more information, see
The Unicode Standard, Version 3.0.
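The decomposition setting can be exercised as follows. This is a sketch under the assumption that the JDK's java.text.RuleBasedCollator is used; DecompositionDemo and compareCanonical are hypothetical names, and the rule string is deliberately tiny.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; assumes the JDK's java.text.RuleBasedCollator.
class DecompositionDemo {

    // Compares two strings with canonical decomposition switched on explicitly.
    static int compareCanonical(String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator("< a < e < z");
            collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
            return collator.compare(left, right);
        } catch (ParseException e) {
            throw new IllegalStateException("Rule string was rejected", e);
        }
    }

    public static void main(String[] args) {
        // Pre-composed e-acute versus 'e' followed by a combining acute accent.
        System.out.println(compareCanonical("\u00E9", "e\u0301"));
        // The same two combining marks typed in different orders:
        // acute then dot below, versus dot below then acute.
        System.out.println(compareCanonical("a\u0301\u0323", "a\u0323\u0301"));
    }
}
```

With canonical decomposition enabled, both inputs of each pair are brought to the same canonical form before their collation elements are compared.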
Errors
The following are errors:
- A text-argument contains unquoted punctuation symbols
(e.g. "a < b-c < d").
- A relation or reset character not followed by a text-argument
(e.g. "a < , b").
- A reset where the text-argument (or an initial substring of the
text-argument) is not already in the sequence or allocated in the
default UCA table.
(e.g. "a < b & e < f")
If you produce one of these errors, a RuleBasedCollator throws
a ParseException.
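One way to guard against these errors is to validate a rule string before using it. The sketch below assumes the JDK's java.text.RuleBasedCollator; RuleValidationDemo and isValid are hypothetical names, and the leading "<" is added to the fragments quoted above because the JDK parser requires it.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; assumes the JDK's java.text.RuleBasedCollator.
class RuleValidationDemo {

    // Returns true when the rule string can be turned into a collator.
    static boolean isValid(String rules) {
        try {
            new RuleBasedCollator(rules);
            return true;
        } catch (ParseException e) {
            // The exception message and offset describe where parsing failed.
            System.err.println("Rejected at offset " + e.getErrorOffset() + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("< a < b < c"));     // well-formed
        System.out.println(isValid("< a < b-c < d"));   // unquoted punctuation
        System.out.println(isValid("< a < , b"));       // relation without a text-argument
        System.out.println(isValid("< a < b'-'c < d")); // punctuation quoted
    }
}
```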
Examples
Simple: "< a < b < c < d"
Norwegian: "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J
< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T
< u,U< v,V< w,W< x,X< y,Y< z,Z
< \u00E5=a\u030A,\u00C5=A\u030A
;aa,AA< \u00E6,\u00C6< \u00F8,\u00D8"
Normally, to create a rule-based Collator object, you will use
Collator's factory method getInstance.
However, to create a rule-based Collator object with specialized rules
tailored to your needs, you construct the RuleBasedCollator
with the rules contained in a String object. For example:
String Simple = "< a < b < c < d";
RuleBasedCollator mySimple = new RuleBasedCollator(Simple);
Or:
String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
"< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
"< u,U< v,V< w,W< x,X< y,Y< z,Z" +
"< \u00E5=a\u030A,\u00C5=A\u030A" +
";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
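Once constructed, a rule-based collator can be used like any other comparator. The following sketch, which assumes the JDK's java.text.RuleBasedCollator, sorts a list with a deliberately reversed rule set so the effect of the custom rules is easy to see; CustomSortDemo and sortWith are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; assumes the JDK's java.text.RuleBasedCollator.
class CustomSortDemo {

    // Returns a sorted copy of the words, ordered by the given collation rules.
    static List<String> sortWith(String rules, List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator); // Collator implements Comparator<Object>
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Reversed alphabet: c sorts first, a sorts last.
        System.out.println(sortWith("< c < b < a", Arrays.asList("a", "b", "c")));
    }
}
```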
Combining Collators is as simple as concatenating strings.
Here's an example that combines two Collators from two
different locales:
// Create an en_US Collator object
RuleBasedCollator en_USCollator = (RuleBasedCollator)
Collator.getInstance(new Locale("en", "US", ""));
// Create a da_DK Collator object
RuleBasedCollator da_DKCollator = (RuleBasedCollator)
Collator.getInstance(new Locale("da", "DK", ""));
// Combine the two
// First, get the collation rules from en_USCollator
String en_USRules = en_USCollator.getRules();
// Second, get the collation rules from da_DKCollator
String da_DKRules = da_DKCollator.getRules();
RuleBasedCollator newCollator =
new RuleBasedCollator(en_USRules + da_DKRules);
// newCollator has the combined rules
A more interesting example is to modify an existing table to create a new
Collator object. For example, append "& C < ch, cH, Ch, CH" to the rules
of the en_USCollator object to create your own:
// Create a new Collator object with additional rules
String addRules = "& C < ch, cH, Ch, CH";
RuleBasedCollator myCollator =
new RuleBasedCollator(en_USCollator.getRules() + addRules);
// myCollator contains the new rules
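A self-contained version of this tailoring step might look like the sketch below. It assumes the JDK's java.text classes and that Collator.getInstance returns a RuleBasedCollator for the requested locale, which is checked at run time rather than taken for granted; TailoringDemo and tailor are hypothetical names.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical demo class; assumes the JDK's java.text classes.
class TailoringDemo {

    // Appends extra rules to the rules of the locale's default collator.
    static RuleBasedCollator tailor(Locale locale, String extraRules) {
        Collator base = Collator.getInstance(locale);
        if (!(base instanceof RuleBasedCollator)) {
            // Assumption: the platform collator is rule based; fail loudly otherwise.
            throw new IllegalStateException("No rule-based collator for " + locale);
        }
        String baseRules = ((RuleBasedCollator) base).getRules();
        try {
            return new RuleBasedCollator(baseRules + extraRules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid extra rules: " + extraRules, e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator tailored = tailor(Locale.US, "& C < ch, cH, Ch, CH");
        // After tailoring, "ch" is a letter of its own between c and d.
        System.out.println(tailored.compare("ch", "cz"));
        System.out.println(tailored.compare("ch", "d"));
        // The untailored collator still treats "ch" as c followed by h.
        System.out.println(Collator.getInstance(Locale.US).compare("ch", "cz"));
    }
}
```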
The following example demonstrates how to change the order of
non-spacing accents:
// old rule
String oldRules = "=\u00A8;\u00AF;\u00BF" // main accents Diaeresis 00A8, Macron 00AF
// Acute 00BF
+ "< a , A ; ae, AE ; \u00E6 , \u00C6"
+ "< b , B < c, C < e, E & C < d, D";
// change the order of accent characters
String addOn = "& \u00BF;\u00AF;\u00A8"; // Acute 00BF, Macron 00AF, Diaeresis 00A8
RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);
The last example shows how to put new primary ordering in before the
default setting. For example, in a Japanese Collator, you
can sort English characters either before or after Japanese characters:
// get en_US Collator rules
RuleBasedCollator en_USCollator =
(RuleBasedCollator)Collator.getInstance(Locale.US);
// add a few Japanese characters to sort before English characters
// suppose the last character before the first base letter 'a' in
// the English collation rule is \u30A2
String jaString = "& \\u30A2 , \\u30FC < \\u30C8";
RuleBasedCollator myJapaneseCollator = new
RuleBasedCollator(en_USCollator.getRules() + jaString);