File Doc Category Size Date Package
RuleBasedCollator.java API Doc Android 1.5 API 18366 Wed May 06 22:41:06 BST 2009 java.text

RuleBasedCollator

java.lang.Object
- java.text.Collator

public class RuleBasedCollator extends Collator

A concrete implementation class for {@code Collator}.

{@code RuleBasedCollator} has the following restrictions for efficiency (other subclasses may be used for more complex languages):

If a French secondary ordering is specified it applies to the whole collator object.
All non-mentioned Unicode characters are at the end of the collation order.
If a character is not located in the {@code RuleBasedCollator}, the default Unicode Collation Algorithm (UCA) rule-based table is automatically searched as a backup.

The collation table is composed of a list of collation rules, where each rule takes one of three forms:

<modifier>
<relation> <text-argument>
<reset> <text-argument>

The rule elements are defined as follows:

Text-Argument: A text-argument is any sequence of characters, excluding special characters (that is, common whitespace characters [0009-000D, 0020] and rule syntax characters [0021-002F, 003A-0040, 005B-0060, 007B-007E]). If those characters are desired, you can put them in single quotes (for example, use '&' for ampersand). Note that unquoted white space characters are ignored; for example, {@code b c} is treated as {@code bc}.
Modifier: There is a single modifier which is used to specify that all accents (secondary differences) are backwards.
'@' : Indicates that accents are sorted backwards, as in French.
Relation: The relations are the following:
- '<' : Greater, as a letter difference (primary)
- ';' : Greater, as an accent difference (secondary)
- ',' : Greater, as a case difference (tertiary)
- '=' : Equal
Reset: There is a single reset which is used primarily for contractions and expansions, but which can also be used to add a modification at the end of a set of rules.
'&' : Indicates that the next rule follows the position to where the reset text-argument would be sorted.
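
The three relation levels can be observed by changing the collator's strength. The following is a minimal sketch, not part of the original documentation: the class name and rule string are made up for illustration, and it assumes a desktop JDK's {@code java.text.RuleBasedCollator}, which expects a complete rule string that starts with a relation.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class RelationStrengthSketch {
    // Illustrative rules (an assumption, not from the documentation):
    // 'A' is a case variant of 'a' (tertiary), the ae ligature U+00E6 is an
    // accent-level variant (secondary), and 'b' is a different letter (primary).
    static final String RULES = "< a , A ; \u00e6 , \u00c6 < b , B";

    /** Returns -1, 0 or 1 for left versus right at the given strength. */
    static int sign(int strength, String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(strength);
            return Integer.signum(collator.compare(left, right));
        } catch (ParseException e) {
            throw new IllegalStateException("rule string failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // TERTIARY: case differences count.
        System.out.println(sign(Collator.TERTIARY, "a", "A"));
        // SECONDARY: case is ignored, accent-level differences still count.
        System.out.println(sign(Collator.SECONDARY, "a", "A"));
        System.out.println(sign(Collator.SECONDARY, "a", "\u00e6"));
        // PRIMARY: only letter differences count.
        System.out.println(sign(Collator.PRIMARY, "a", "\u00e6"));
        System.out.println(sign(Collator.PRIMARY, "a", "b"));
    }
}
```

Lowering the strength makes more pairs compare as equal, which is usually what a case-insensitive or accent-insensitive search wants.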

This sounds more complicated than it is in practice. For example, the following are equivalent ways of expressing the same thing:

a < b < c
a < b & b < c
a < c & a < b

Notice that the order is important, as the subsequent item goes immediately after the text-argument. The following are not equivalent:

a < b & a < c
a < c & a < b
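
The effect of reset order can be checked by sorting the same three strings under each rule string. This is a minimal sketch under stated assumptions: the class and helper names are hypothetical, each rule string is given a leading {@code "<"} (the form a desktop JDK requires), and the code runs against the JDK's {@code java.text} implementation.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ResetOrderSketch {
    /** Sorts the given items with a collator built from the given rules. */
    static List<String> sortWith(String rules, String... items) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(Arrays.asList(items));
            sorted.sort(collator); // Collator is a Comparator<Object>
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Three equivalent spellings of a < b < c.
        System.out.println(sortWith("< a < b < c", "c", "b", "a"));
        System.out.println(sortWith("< a < b & b < c", "c", "b", "a"));
        System.out.println(sortWith("< a < c & a < b", "c", "b", "a"));
        // Not equivalent: the reset places c immediately after a, ahead of b.
        System.out.println(sortWith("< a < b & a < c", "c", "b", "a"));
    }
}
```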

Either the text-argument must already be present in the sequence, or some initial substring of the text-argument must be present. For example, {@code "a < b & ae < e"} is valid since "a" is present in the sequence before "ae" is reset. In this latter case, "ae" is not entered and treated as a single character; instead, "e" is sorted as if it were expanded to two characters: "a" followed by an "e". This difference appears in natural languages: in traditional Spanish "ch" is treated as if it contracts to a single character (expressed as {@code "c < ch < d"}), while in traditional German a-umlaut is treated as if it expands to two characters (expressed as {@code "a,A < b,B ... & ae;\u00e4 & AE;\u00c4"}, where \u00e4 and \u00c4 are the escape sequences for a-umlaut).
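
The Spanish-style contraction can be seen by sorting a few strings with and without the {@code "ch"} entry. This is a small sketch, assuming a desktop JDK's {@code java.text.RuleBasedCollator}; the class name, helper and the reduced alphabet in the rule strings are hypothetical and chosen only to keep the example short.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ContractionSketch {
    // Reduced alphabets for illustration only (assumptions, not real locale data).
    static final String WITH_CH = "< a < c < ch < d < h < z";
    static final String WITHOUT_CH = "< a < c < d < h < z";

    /** Sorts the given items with a collator built from the given rules. */
    static List<String> sortWith(String rules, String... items) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(Arrays.asList(items));
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // "ch" is one collation element that sorts after every other "c..." string.
        System.out.println(sortWith(WITH_CH, "cz", "d", "ch", "ca"));
        // Without the contraction, "ch" is just c followed by h.
        System.out.println(sortWith(WITHOUT_CH, "cz", "d", "ch", "ca"));
    }
}
```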

Ignorable Characters

For ignorable characters, the first rule must start with a relation (the examples we have used above are really fragments; {@code "a < b"} really should be {@code "< a < b"}). If, however, the first relation is not {@code "<"}, then all text-arguments up to the first {@code "<"} are ignorable. For example, {@code ", - < a < b"} makes {@code "-"} an ignorable character.
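
A hedged sketch of an ignorable hyphen follows. It assumes a desktop JDK's {@code java.text.RuleBasedCollator} and quotes the hyphen, since the Text-Argument definition above lists it among the rule syntax characters. The class name and rule string are hypothetical. The comparison is done at {@code PRIMARY} strength so that only letter differences are considered.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class IgnorableSketch {
    // The hyphen appears before the first "<", so it has no primary weight.
    // It is quoted because '-' is a rule syntax character.
    static final String RULES = ", '-' < a < b";

    /** Returns -1, 0 or 1 for left versus right, looking at letters only. */
    static int signAtPrimary(String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(Collator.PRIMARY);
            return Integer.signum(collator.compare(left, right));
        } catch (ParseException e) {
            throw new IllegalStateException("rule string failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // The hyphen is skipped, so these two compare as equal.
        System.out.println(signAtPrimary("a-b", "ab"));
        // After skipping the hyphen, b is compared with a.
        System.out.println(signAtPrimary("a-b", "aa"));
    }
}
```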

Normalization and Accents

{@code RuleBasedCollator} automatically processes its rule table to include both pre-composed and combining-character versions of accented characters. Even if the provided rule string contains only base characters and separate combining accent characters, the pre-composed accented characters matching all canonical combinations of characters from the rule string will be entered in the table.

This allows you to use a RuleBasedCollator to compare accented strings even when the collator is set to NO_DECOMPOSITION. However, if the strings to be collated contain combining sequences that may not be in canonical order, you should set the collator to CANONICAL_DECOMPOSITION to enable sorting of combining sequences. For more information, see The Unicode Standard, Version 3.0.

Errors

The following rules are not valid:

A text-argument contains unquoted punctuation symbols, for example {@code "a < b-c < d"}.
A relation or reset character is not followed by a text-argument, for example {@code "a < , b"}.
A reset where the text-argument (or an initial substring of the text-argument) is not already in the sequence or allocated in the default UCA table, for example {@code "a < b & e < f"}.

If you produce one of these errors, {@code RuleBasedCollator} throws a {@code ParseException}.
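
One way to probe a rule string before using it is to catch the {@code ParseException}. The helper below is a hypothetical sketch, assuming a desktop JDK's {@code java.text.RuleBasedCollator}; the rule strings carry a leading {@code "<"} so that the only problem in each invalid string is the one being demonstrated. The reset case is left out because its outcome depends on whether the implementation falls back to a default UCA table.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class RuleSyntaxCheck {
    /** Returns true if the rules parse, false if a ParseException is thrown. */
    static boolean isValid(String rules) {
        try {
            new RuleBasedCollator(rules);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("< a < b < c"));   // well-formed
        System.out.println(isValid("< a < b-c < d")); // unquoted punctuation
        System.out.println(isValid("< a < , b"));     // relation without text-argument
    }
}
```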

Examples

Normally, to create a rule-based collator object, you will use {@code Collator}'s factory method {@code getInstance}. However, to create a rule-based collator object with specialized rules tailored to your needs, you construct the {@code RuleBasedCollator} with the rules contained in a {@code String} object. For example:

String Simple = "< a < b < c < d";

RuleBasedCollator mySimple = new RuleBasedCollator(Simple);

Or:

String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I"
+ "< j,J< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R"
+ "< s,S< t,T< u,U< v,V< w,W< x,X< y,Y< z,Z"
+ "< \u00E5=a\u030A,\u00C5=A\u030A"
+ ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";

RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);

Combining {@code Collator}s is as simple as concatenating strings. Here is an example that combines two {@code Collator}s from two different locales:

// Create an en_US Collator object
RuleBasedCollator en_USCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("en", "US", ""));

// Create a da_DK Collator object
RuleBasedCollator da_DKCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("da", "DK", ""));

// Combine the two collators
// First, get the collation rules from en_USCollator
String en_USRules = en_USCollator.getRules();

// Second, get the collation rules from da_DKCollator
String da_DKRules = da_DKCollator.getRules();

RuleBasedCollator newCollator = new RuleBasedCollator(en_USRules + da_DKRules);
// newCollator has the combined rules

The next example shows how to modify an existing table to create a new {@code Collator} object. For example, add {@code "& C < ch, cH, Ch, CH"} to the rules of the {@code en_USCollator} object to create your own:

// Create a new Collator object with additional rules
String addRules = "& C < ch, cH, Ch, CH";

RuleBasedCollator myCollator = new RuleBasedCollator(en_USCollator.getRules() + addRules);
// myCollator contains the new rules

The following example demonstrates how to change the order of non-spacing accents:

// old rule
String oldRules = "= \u00a8 ; \u00af ; \u00bf" + "< a , A ; ae, AE ; \u00e6 , \u00c6"
+ "< b , B < c, C < e, E & C < d, D";

// change the order of accent characters
String addOn = "& \u00bf ; \u00af ; \u00a8";

RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);

The last example shows how to put a new primary ordering in before the default setting. For example, in the Japanese {@code Collator}, you can sort English characters either before or after Japanese characters:

// get en_US Collator rules
RuleBasedCollator en_USCollator = (RuleBasedCollator)
Collator.getInstance(Locale.US);

// add a few Japanese characters to sort before English characters
// suppose the last character before the first base letter 'a' in
// the English collation rule is \u30A2
String jaString = "& \u30A2 , \u30FC < \u30C8";

RuleBasedCollator myJapaneseCollator =
new RuleBasedCollator(en_USCollator.getRules() + jaString);

since: Android 1.0

Fields Summary
Constructors Summary
RuleBasedCollator(com.ibm.icu4jni.text.Collator wrapper)
super(wrapper);
public RuleBasedCollator(String rules)
Constructs a new instance of {@code RuleBasedCollator} using the specified {@code rules}. The {@code rules} are usually either hand-written based on the {@link RuleBasedCollator class description} or the result of a former {@link #getRules()} call.
Note that the {@code rules} are actually interpreted as a delta to the standard Unicode Collation Algorithm (UCA). Hence, an empty {@code rules} string results in the default UCA rules being applied. This differs slightly from other implementations which work with full {@code rules} specifications and may result in different behavior.
param
rules the collation rules.
throws
NullPointerException if {@code rules} is {@code null}.
throws
ParseException if {@code rules} contains rules with invalid collation rule syntax.
since
Android 1.0
if (rules == null) {
    throw new NullPointerException();
}
// BEGIN android-removed
// if (rules.length() == 0) {
//     // text.06=Build rules empty
//     throw new ParseException(Messages.getString("text.06"), 0); //$NON-NLS-1$
// }
// END android-removed
try {
    this.icuColl = new com.ibm.icu4jni.text.RuleBasedCollator(rules);
    // BEGIN android-added
    this.icuColl.setDecomposition(
            com.ibm.icu4jni.text.Collator.CANONICAL_DECOMPOSITION);
    // END android-added
} catch (Exception e) {
    if (e instanceof ParseException) {
        throw (ParseException) e;
    }
    /*
     * -1 means it's not a ParseException. Maybe IOException thrown when
     * an error occurred while reading internal data.
     */
    throw new ParseException(e.getMessage(), -1);
}
Methods Summary
public java.lang.Object clone()
Returns a new collator with the same collation rules, decomposition mode and strength value as this collator.
return
a shallow copy of this collator.
see
java.lang.Cloneable
since
Android 1.0
RuleBasedCollator clone = (RuleBasedCollator) super.clone();
return clone;
public int compare(java.lang.String source, java.lang.String target)
Compares the {@code source} text to the {@code target} text according to the collation rules, strength and decomposition mode for this {@code RuleBasedCollator}. See the {@code Collator} class description for an example of use.
General recommendation: If the same strings are to be compared multiple times, it is more efficient to generate {@code CollationKey} objects for the strings and use {@code CollationKey.compareTo(CollationKey)} for the comparisons. If each string is compared only once, using {@code RuleBasedCollator.compare(String, String)} has better performance.
param
source the source text.
param
target the target text.
return
an integer which may be a negative value, zero, or else a positive value depending on whether {@code source} is less than, equivalent to, or greater than {@code target}.
since
Android 1.0
if (source == null || target == null) {
    // text.08=one of arguments is null
    throw new NullPointerException(Messages.getString("text.08")); //$NON-NLS-1$
}
return this.icuColl.compare(source, target);
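
The general recommendation above can be sketched as follows: each string is turned into a {@code CollationKey} once, and the sort then compares keys rather than strings. The class name, helper names and the rule string are hypothetical, and the sketch assumes a desktop JDK's {@code java.text} classes.

```java
import java.text.CollationKey;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CollationKeySortSketch {
    /** Builds a small collator; the rule string is an illustrative assumption. */
    static RuleBasedCollator simpleCollator() {
        try {
            return new RuleBasedCollator("< a < b < c");
        } catch (ParseException e) {
            throw new IllegalStateException("rule string failed to parse", e);
        }
    }

    /** Sorts words by precomputed collation keys instead of repeated compare() calls. */
    static List<String> sortByKeys(RuleBasedCollator collator, List<String> words) {
        List<CollationKey> keys = new ArrayList<>();
        for (String word : words) {
            keys.add(collator.getCollationKey(word)); // computed once per string
        }
        Collections.sort(keys); // uses CollationKey.compareTo(CollationKey)
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortByKeys(simpleCollator(), Arrays.asList("cab", "abc", "bca")));
    }
}
```

Keys are only comparable when they come from the same collator, so the sketch builds all of them from one instance.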
public boolean equals(java.lang.Object obj)
Compares the specified object with this {@code RuleBasedCollator} and indicates if they are equal. In order to be equal, {@code obj} must be an instance of {@code Collator} with the same collation rules and the same attributes.
param
obj the object to compare with this object.
return
{@code true} if the specified object is equal to this {@code RuleBasedCollator}; {@code false} otherwise.
see
#hashCode
since
Android 1.0
if (!(obj instanceof Collator)) {
    return false;
}
return super.equals(obj);
public java.text.CollationElementIterator getCollationElementIterator(java.text.CharacterIterator source)
Obtains a {@code CollationElementIterator} for the given {@code CharacterIterator}. The source iterator's integrity will be preserved since a new copy will be created for use.
param
source the source character iterator.
return
a {@code CollationElementIterator} for {@code source}.
since
Android 1.0
if (source == null) {
    throw new NullPointerException();
}
return new CollationElementIterator(
        ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl)
                .getCollationElementIterator(source));
public java.text.CollationElementIterator getCollationElementIterator(java.lang.String source)
Obtains a {@code CollationElementIterator} for the given string.
param
source the source string.
return
the {@code CollationElementIterator} for {@code source}.
since
Android 1.0
if (source == null) {
    throw new NullPointerException();
}
return new CollationElementIterator(
        ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl)
                .getCollationElementIterator(source));
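
As a rough illustration of what the returned iterator exposes, the sketch below counts the collation elements produced for a string. With a contraction such as {@code "ch"} in the rules, the two characters yield a single element. The class name, helper and rule strings are hypothetical, and the sketch assumes a desktop JDK's {@code java.text} implementation.

```java
import java.text.CollationElementIterator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class ElementCountSketch {
    /** Counts the collation elements the given rules produce for the text. */
    static int countElements(String rules, String text) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            CollationElementIterator elements =
                    collator.getCollationElementIterator(text);
            int count = 0;
            // next() returns NULLORDER once the end of the text is reached.
            while (elements.next() != CollationElementIterator.NULLORDER) {
                count++;
            }
            return count;
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // "ch" is declared as a contraction, so it forms one element.
        System.out.println(countElements("< a < c < ch < d < h", "ch"));
        // Without the contraction, c and h are separate elements.
        System.out.println(countElements("< a < c < d < h", "ch"));
    }
}
```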
public java.text.CollationKey getCollationKey(java.lang.String source)
Returns the {@code CollationKey} for the given source text.
param
source the specified source text.
return
the {@code CollationKey} for the given source text.
since
Android 1.0
com.ibm.icu4jni.text.CollationKey icuKey = this.icuColl
        .getCollationKey(source);
if (icuKey == null) {
    return null;
}
return new CollationKey(source, icuKey);
public java.lang.String getRules()
Returns the collation rules of this collator. These {@code rules} can be fed into the {@link #RuleBasedCollator(String)} constructor.
Note that the {@code rules} are actually interpreted as a delta to the standard Unicode Collation Algorithm (UCA). Hence, an empty {@code rules} string results in the default UCA rules being applied. This differs slightly from other implementations which work with full {@code rules} specifications and may result in different behavior.
return
the collation rules.
since
Android 1.0
return ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl).getRules();
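
Feeding {@code getRules()} back into the constructor can be sketched as a round trip. The class and helper names are hypothetical, and the sketch assumes a desktop JDK's {@code java.text.RuleBasedCollator}, where a collator rebuilt this way compares as equal to the original. The exact rule text returned on other implementations may differ.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class RulesRoundTripSketch {
    /** Builds a collator, converting the checked exception for brevity. */
    static RuleBasedCollator build(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid rules: " + rules, e);
        }
    }

    /** Creates a second collator from the rules reported by the first. */
    static RuleBasedCollator copyOf(RuleBasedCollator original) {
        return build(original.getRules());
    }

    public static void main(String[] args) {
        RuleBasedCollator original = build("< a < b < c");
        RuleBasedCollator copy = copyOf(original);
        System.out.println(copy.equals(original));
        System.out.println(copy.hashCode() == original.hashCode());
    }
}
```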
public int hashCode()
return ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl).getRules()
        .hashCode();