A concrete implementation class for {@code Collator}.
{@code RuleBasedCollator} has the following restrictions for efficiency
(other subclasses may be used for more complex languages):
- If a French secondary ordering is specified it applies to the whole
collator object.
- All non-mentioned Unicode characters are at the end of the collation
order.
- If a character is not found in the rules of the {@code RuleBasedCollator},
the default Unicode Collation Algorithm (UCA) rule-based table is
automatically searched as a backup.
The collation table is composed of a list of collation rules, where each rule
takes one of three forms:
<modifier>
<relation> <text-argument>
<reset> <text-argument>
The rule elements are defined as follows:
- Text-Argument: A text-argument is any sequence of
characters, excluding special characters (that is, common whitespace
characters [0009-000D, 0020] and rule syntax characters [0021-002F,
003A-0040, 005B-0060, 007B-007E]). If those characters are desired, you can
put them in single quotes (for example, use '&' for ampersand). Note that
unquoted white space characters are ignored; for example, {@code b c} is
treated as {@code bc}.
- Modifier: There is a single modifier which is used to
specify that all accents (secondary differences) are backwards.
'@' : Indicates that accents are sorted backwards, as in French.
- Relation: The relations are the following:
- '<' : Greater, as a letter difference (primary)
- ';' : Greater, as an accent difference (secondary)
- ',' : Greater, as a case difference (tertiary)
- '=' : Equal
- Reset: There is a single reset which is used primarily
for contractions and expansions, but which can also be used to add a
modification at the end of a set of rules.
'&' : Indicates that the next rule follows the position to where the reset
text-argument would be sorted.
This sounds more complicated than it is in practice. For example, the
following are equivalent ways of expressing the same thing:
a < b < c
a < b & b < c
a < c & a < b
Notice that the order is important, as the subsequent item goes immediately
after the text-argument. The following are not equivalent:
a < b & a < c
a < c & a < b
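The effect of the reset can be checked directly. The following is a minimal sketch, assuming a standard JDK {@code java.text.RuleBasedCollator}, where a rule string must begin with a relation (hence the leading {@code <}). The class name {@code ResetOrderDemo} and its helpers are hypothetical names used only for illustration.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ResetOrderDemo {
    // Builds a collator; the checked ParseException is wrapped for brevity.
    static RuleBasedCollator build(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid rules: " + rules, e);
        }
    }

    // Sorts the letters a, b and c under the given rules and joins the result.
    static String order(String rules) {
        List<String> letters = new ArrayList<>(Arrays.asList("c", "b", "a"));
        letters.sort(build(rules)); // a Collator is a Comparator<Object>
        return String.join("", letters);
    }

    public static void main(String[] args) {
        // Three equivalent spellings of the same order.
        System.out.println(order("< a < b < c"));     // abc
        System.out.println(order("< a < b & b < c")); // abc
        System.out.println(order("< a < c & a < b")); // abc
        // Not equivalent: c is placed immediately after a, ahead of b.
        System.out.println(order("< a < b & a < c")); // acb
    }
}
```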
Either the text-argument must already be present in the sequence, or some
initial substring of the text-argument must be present. For example
{@code "a < b & ae < e"} is valid since "a" is present in the sequence before
"ae" is reset. In this latter case, "ae" is not entered and treated as a
single character; instead, "e" is sorted as if it were expanded to two
characters: "a" followed by an "e". This difference appears in natural
languages: in traditional Spanish "ch" is treated as if it contracts to a
single character (expressed as {@code "c < ch < d"}), while in traditional
German a-umlaut is treated as if it expands to two characters (expressed as
{@code "a,A < b,B ... & ae;\u00e4 & AE;\u00c4"}, where \u00e4 and \u00c4
are the escape sequences for a-umlaut).
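The contraction case can be sketched as follows. This is an illustration only, assuming a standard JDK {@code java.text.RuleBasedCollator}; the rule string is a small fragment with {@code z} added so that every compared character is mentioned, and {@code ContractionDemo} is a hypothetical class name.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class ContractionDemo {
    // Fragment in the style of traditional Spanish: "ch" is one unit
    // that sorts after every other string starting with "c".
    static final String RULES = "< c < ch < d < z";

    static int compare(String left, String right) {
        try {
            return new RuleBasedCollator(RULES).compare(left, right);
        } catch (ParseException e) {
            throw new IllegalStateException("unexpected rule error", e);
        }
    }

    public static void main(String[] args) {
        // "cz" is c followed by z, while "ch" is a single element after c.
        System.out.println(compare("cz", "ch") < 0); // true
        // The contraction still sorts before d.
        System.out.println(compare("ch", "d") < 0);  // true
    }
}
```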
Ignorable Characters
For ignorable characters, the first rule must start with a relation (the
examples we have used above are really fragments; {@code "a < b"} really
should be {@code "< a < b"}). If, however, the first relation is not
{@code "<"}, then all text-arguments up to the first {@code "<"} are
ignorable. For example, {@code ", '-' < a < b"} makes {@code "-"} an ignorable
character (the hyphen is quoted because it is a rule syntax character).
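A minimal sketch of an ignorable character, assuming a standard JDK {@code java.text.RuleBasedCollator}. The hyphen is quoted in the rule string because it falls in the rule syntax range described earlier, and the comparison is made at primary strength, where an ignorable character has no effect. {@code IgnorableDemo} and {@code comparePrimary} are hypothetical names.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

class IgnorableDemo {
    // The hyphen appears before the first "<", so it is ignorable.
    static final String RULES = ", '-' < a < b < c";

    // Compares two strings looking at primary (letter) differences only.
    static int comparePrimary(String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(Collator.PRIMARY);
            return collator.compare(left, right);
        } catch (ParseException e) {
            throw new IllegalStateException("unexpected rule error", e);
        }
    }

    public static void main(String[] args) {
        // The hyphen is skipped, so both strings have the same letters.
        System.out.println(comparePrimary("a-b", "ab"));     // 0
        // The letters still decide the order: b sorts before c.
        System.out.println(comparePrimary("a-b", "ac") < 0); // true
    }
}
```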
Normalization and Accents
{@code RuleBasedCollator} automatically processes its rule table to include
both pre-composed and combining-character versions of accented characters.
Even if the provided rule string contains only base characters and separate
combining accent characters, the pre-composed accented characters matching
all canonical combinations of characters from the rule string will be entered
in the table.
This allows you to use a RuleBasedCollator to compare accented strings even
when the collator is set to NO_DECOMPOSITION. However, if the strings to be
collated contain combining sequences that may not be in canonical order, you
should set the collator to CANONICAL_DECOMPOSITION to enable sorting of
combining sequences. For more information, see The Unicode Standard, Version 3.0.
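The decomposition mode is set with {@code setDecomposition}. The sketch below, which assumes a standard JDK {@code java.text.RuleBasedCollator}, compares a pre-composed a-ring with its combining sequence under {@code CANONICAL_DECOMPOSITION}. {@code DecompositionDemo} and {@code compareDecomposed} are hypothetical names.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

class DecompositionDemo {
    // Compares two strings with canonical decomposition enabled.
    static int compareDecomposed(String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator("< a < b");
            collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
            return collator.compare(left, right);
        } catch (ParseException e) {
            throw new IllegalStateException("unexpected rule error", e);
        }
    }

    public static void main(String[] args) {
        String precomposed = "\u00e5"; // a with ring above, one code point
        String combining = "a\u030a";  // a followed by combining ring above
        // Canonically equivalent strings compare as equal.
        System.out.println(compareDecomposed(precomposed, combining)); // 0
    }
}
```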
Errors
The following rules are not valid:
- A text-argument contains unquoted punctuation symbols, for example
{@code "a < b-c < d"}.
- A relation or reset character is not followed by a text-argument, for
example {@code "a < , b"}.
- A reset where the text-argument (or an initial substring of the
text-argument) is not already in the sequence or allocated in the default UCA
table, for example {@code "a < b & e < f"}.
If you produce one of these errors, {@code RuleBasedCollator} throws a
{@code ParseException}.
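One way to probe a rule string before using it is to catch the exception. This is a sketch assuming a standard JDK {@code java.text.RuleBasedCollator}; {@code RuleValidationDemo} and {@code isValid} are hypothetical names, not part of the API. An implementation that falls back to the UCA table may accept the third rule string, so no result is claimed for it here.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class RuleValidationDemo {
    // Returns true if the rule string can be compiled into a collator.
    static boolean isValid(String rules) {
        try {
            new RuleBasedCollator(rules);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("< a < b"));         // true
        System.out.println(isValid("< a < b-c < d"));   // false: unquoted '-'
        System.out.println(isValid("< a < , b"));       // false: '<' has no text-argument
        // Reset to a text-argument that is not in the sequence.
        System.out.println(isValid("< a < b & e < f"));
    }
}
```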
Examples
Normally, to create a rule-based collator object, you will use
{@code Collator}'s factory method {@code getInstance}. However, to create a
rule-based collator object with specialized rules tailored to your needs, you
construct the {@code RuleBasedCollator} with the rules contained in a
{@code String} object. For example:
String Simple = "< a < b < c < d";
RuleBasedCollator mySimple = new RuleBasedCollator(Simple);
Or:
String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I"
+ "< j,J< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R"
+ "< s,S< t,T< u,U< v,V< w,W< x,X< y,Y< z,Z"
+ "< \u00E5=a\u030A,\u00C5=A\u030A"
+ ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
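Once constructed, a rule-based collator can be passed anywhere a {@code Comparator} is expected. The following usage sketch assumes a standard JDK {@code java.text.RuleBasedCollator} and uses a three-letter rule fragment in the same style as the rules above; {@code SortDemo} and {@code sorted} are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortDemo {
    // Lowercase sorts before uppercase as a tertiary (case) difference.
    static final String RULES = "< a , A < b , B < c , C";

    // Returns a sorted copy of the input using the rules above.
    static List<String> sorted(List<String> input) {
        try {
            List<String> copy = new ArrayList<>(input);
            copy.sort(new RuleBasedCollator(RULES));
            return copy;
        } catch (ParseException e) {
            throw new IllegalStateException("unexpected rule error", e);
        }
    }

    public static void main(String[] args) {
        List<String> letters = Arrays.asList("B", "a", "C", "b", "A", "c");
        System.out.println(sorted(letters)); // [a, A, b, B, c, C]
    }
}
```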
Combining {@code Collator}s is as simple as concatenating strings. Here is
an example that combines two {@code Collator}s from two different locales:
// Create an en_US Collator object
RuleBasedCollator en_USCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("en", "US", ""));
// Create a da_DK Collator object
RuleBasedCollator da_DKCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("da", "DK", ""));
// Combine the two collators
// First, get the collation rules from en_USCollator
String en_USRules = en_USCollator.getRules();
// Second, get the collation rules from da_DKCollator
String da_DKRules = da_DKCollator.getRules();
RuleBasedCollator newCollator = new RuleBasedCollator(en_USRules + da_DKRules);
// newCollator has the combined rules
The next example shows how to make changes on an existing table to create a new
{@code Collator} object. For example, add {@code "& C < ch, cH, Ch, CH"} to
the {@code en_USCollator} object to create your own:
// Create a new Collator object with additional rules
String addRules = "& C < ch, cH, Ch, CH";
RuleBasedCollator myCollator = new RuleBasedCollator(en_USCollator.getRules() + addRules);
// myCollator contains the new rules
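The same pattern can be tried on a small table where the result is easy to see. This sketch assumes a standard JDK {@code java.text.RuleBasedCollator}, where {@code getRules()} returns the rule string the collator was built from; the base rules and the names {@code TailoringDemo}, {@code base} and {@code tailored} are hypothetical.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class TailoringDemo {
    // Builds a collator; the checked ParseException is wrapped for brevity.
    static RuleBasedCollator build(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid rules: " + rules, e);
        }
    }

    // A small stand-in for an existing collator.
    static RuleBasedCollator base() {
        return build("< a < b < c");
    }

    // Appends a reset that moves c to the position right after a.
    static RuleBasedCollator tailored() {
        return build(base().getRules() + " & a < c");
    }

    public static void main(String[] args) {
        System.out.println(base().compare("b", "c") < 0);     // true
        System.out.println(tailored().compare("c", "b") < 0); // true
    }
}
```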
The following example demonstrates how to change the order of non-spacing
accents:
// old rule
String oldRules = "= \u00a8 ; \u00af ; \u00bf" + "< a , A ; ae, AE ; \u00e6 , \u00c6"
+ "< b , B < c, C < e, E & C < d, D";
// change the order of accent characters
String addOn = "& \u00bf ; \u00af ; \u00a8";
RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);
The last example shows how to put a new primary ordering before the default
setting. For example, in the Japanese {@code Collator}, you can sort English
characters either before or after Japanese characters:
// get en_US Collator rules
RuleBasedCollator en_USCollator = (RuleBasedCollator)
Collator.getInstance(Locale.US);
// add a few Japanese characters to sort before English characters
// suppose the last character before the first base letter 'a' in
// the English collation rule is \u30A2
String jaString = "& \u30A2 , \u30FC < \u30C8";
RuleBasedCollator myJapaneseCollator =
new RuleBasedCollator(en_USCollator.getRules() + jaString);