File: RuleBasedCollator.java
Doc: API Doc
Category: Android 1.5 API
Size: 18366
Date: Wed May 06 22:41:06 BST 2009
Package: java.text

RuleBasedCollator

public class RuleBasedCollator extends Collator
A concrete implementation class for {@code Collator}.

{@code RuleBasedCollator} has the following restrictions for efficiency (other subclasses may be used for more complex languages):

  1. If a French secondary ordering is specified it applies to the whole collator object.
  2. All non-mentioned Unicode characters are at the end of the collation order.
  3. If a character is not located in the {@code RuleBasedCollator}, the default Unicode Collation Algorithm (UCA) rule-based table is automatically searched as a backup.

The collation table is composed of a list of collation rules, where each rule takes one of three forms:

<modifier>
<relation> <text-argument>
<reset> <text-argument>

The rule elements are defined as follows:

  • Text-Argument: A text-argument is any sequence of characters, excluding special characters (that is, common whitespace characters [0009-000D, 0020] and rule syntax characters [0021-002F, 003A-0040, 005B-0060, 007B-007E]). If those characters are desired, you can put them in single quotes (for example, use '&' for ampersand). Note that unquoted white space characters are ignored; for example, {@code b c} is treated as {@code bc}.
  • Modifier: There is a single modifier which is used to specify that all accents (secondary differences) are backwards.

    '@' : Indicates that accents are sorted backwards, as in French.

  • Relation: The relations are the following:
    • '<' : Greater, as a letter difference (primary)
    • ';' : Greater, as an accent difference (secondary)
    • ',' : Greater, as a case difference (tertiary)
    • '=' : Equal
  • Reset: There is a single reset which is used primarily for contractions and expansions, but which can also be used to add a modification at the end of a set of rules.

    '&' : Indicates that the next rule follows the position to where the reset text-argument would be sorted.

This sounds more complicated than it is in practice. For example, the following are equivalent ways of expressing the same thing:

a < b < c
a < b & b < c
a < c & a < b

Notice that the order is important, as the subsequent item goes immediately after the text-argument. The following are not equivalent:

a < b & a < c
a < c & a < b
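
The effect of reset order can be checked directly. The following is a minimal sketch, assuming the standard {@code java.text} implementation; the class and method names ({@code ResetOrderDemo}, {@code bSortsBeforeC}) are illustrative only, and the rule strings carry the leading relation described under Ignorable Characters.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustrative demo class; the names here are not part of the API.
class ResetOrderDemo {

    // Returns true if "b" sorts before "c" under the given rules.
    static boolean bSortsBeforeC(String rules) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            return collator.compare("b", "c") < 0;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // "& a < c" places c immediately after a, ahead of the existing b.
        System.out.println(bSortsBeforeC("< a < b & a < c"));
        // "& a < b" places b immediately after a, ahead of the existing c.
        System.out.println(bSortsBeforeC("< a < c & a < b"));
    }
}
```

The first rule string yields the order a, c, b and the second yields a, b, c, which is why the two are not equivalent.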

Either the text-argument must already be present in the sequence, or some initial substring of the text-argument must be present. For example, {@code "a < b & ae < e"} is valid since "a" is present in the sequence before "ae" is reset. In this latter case, "ae" is not entered and treated as a single character; instead, "e" is sorted as if it were expanded to two characters: "a" followed by an "e". This difference appears in natural languages: in traditional Spanish "ch" is treated as if it contracts to a single character (expressed as {@code "c < ch < d"}), while in traditional German a-umlaut is treated as if it expands to two characters (expressed as {@code "a,A < b,B ... & ae;\u00e4 & AE;\u00c4"}, where \u00e4 and \u00c4 are the escape sequences for a-umlaut).
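
The contraction case can be illustrated with a small sketch. It assumes the standard {@code java.text} implementation and uses a deliberately tiny alphabet; the class name {@code ContractionDemo} and its helper are hypothetical. It covers only the Spanish-style contraction, not the expansion.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustrative demo class; the names here are not part of the API.
class ContractionDemo {

    // Compares a and b with a collator built from the given rules.
    static int compare(String rules, String a, String b) {
        try {
            return new RuleBasedCollator(rules).compare(a, b);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // With the contraction, "ch" is one element between c and d.
        String withContraction = "< a < c < ch < d < h < z";
        // Without it, "ch" is just c followed by h.
        String withoutContraction = "< a < c < d < h < z";

        // "ch" sorts after "cz" only when the contraction is present.
        System.out.println(compare(withContraction, "ch", "cz") > 0);
        System.out.println(compare(withoutContraction, "ch", "cz") < 0);
        // The contraction still sorts before d.
        System.out.println(compare(withContraction, "ch", "d") < 0);
    }
}
```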

Ignorable Characters

For ignorable characters, the first rule must start with a relation (the examples we have used above are really fragments; {@code "a < b"} really should be {@code "< a < b"}). If, however, the first relation is not {@code "<"}, then all text-arguments up to the first {@code "<"} are ignorable. For example, {@code ", '-' < a < b"} makes {@code "-"} an ignorable character (the hyphen is quoted because it is a rule syntax character).
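
A minimal sketch of an ignorable character follows, assuming the standard {@code java.text} implementation. The hyphen is written in single quotes because it falls in the rule syntax character range, and the comparison is made at {@code PRIMARY} strength, where a character with no primary weight is skipped. The class and method names are illustrative only.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustrative demo class; the names here are not part of the API.
class IgnorableDemo {

    // Compares a and b at PRIMARY strength under the given rules.
    static int comparePrimary(String rules, String a, String b) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            collator.setStrength(Collator.PRIMARY);
            return collator.compare(a, b);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // The hyphen precedes the first "<", so it has no primary weight.
        System.out.println(comparePrimary(", '-' < a < b", "a-b", "ab") == 0);
        // Here the hyphen is an ordinary letter sorted after b.
        System.out.println(comparePrimary("< a < b < '-'", "a-b", "ab") != 0);
    }
}
```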

Normalization and Accents

{@code RuleBasedCollator} automatically processes its rule table to include both pre-composed and combining-character versions of accented characters. Even if the provided rule string contains only base characters and separate combining accent characters, the pre-composed accented characters matching all canonical combinations of characters from the rule string will be entered in the table.

This allows you to use a RuleBasedCollator to compare accented strings even when the collator is set to NO_DECOMPOSITION. However, if the strings to be collated contain combining sequences that may not be in canonical order, you should set the collator to CANONICAL_DECOMPOSITION to enable sorting of combining sequences. For more information, see The Unicode Standard, Version 3.0.
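
As a rough illustration of the decomposition setting, the sketch below compares canonically equivalent strings with {@code CANONICAL_DECOMPOSITION} enabled. It assumes the standard {@code java.text} implementation and an en_US collator; the class name {@code DecompositionDemo} is hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative demo class; the names here are not part of the API.
class DecompositionDemo {

    // Compares a and b with canonical decomposition switched on.
    static int compareDecomposed(String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Pre-composed e-acute versus e followed by a combining acute.
        System.out.println(compareDecomposed("\u00e9", "e\u0301") == 0);
        // The same two combining marks in different orders: acute then
        // dot below, versus dot below then acute (the canonical order).
        System.out.println(
                compareDecomposed("a\u0301\u0323", "a\u0323\u0301") == 0);
    }
}
```

Both comparisons report equality because each pair normalizes to the same decomposed sequence before collation.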

Errors

The following rules are not valid:

  • A text-argument contains unquoted punctuation symbols, for example {@code "a < b-c < d"}.
  • A relation or reset character is not followed by a text-argument, for example {@code "a < , b"}.
  • A reset where the text-argument (or an initial substring of the text-argument) is not already in the sequence or allocated in the default UCA table, for example {@code "a < b & e < f"}.

If you produce one of these errors, {@code RuleBasedCollator} throws a {@code ParseException}.
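
A small validity check can be sketched as follows, assuming the standard {@code java.text} implementation. The helper {@code isValid} is hypothetical; it simply reports whether the constructor threw a {@code ParseException}. Only the first two error kinds are exercised, since whether the reset example fails depends on whether the implementation falls back to the UCA table.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustrative demo class; the names here are not part of the API.
class RuleValidationDemo {

    // Returns true if the rules parse, false on a ParseException.
    static boolean isValid(String rules) {
        try {
            new RuleBasedCollator(rules);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Well-formed rules.
        System.out.println(isValid("< a < b < c < d"));
        // Unquoted punctuation inside a text-argument.
        System.out.println(isValid("< a < b-c < d"));
        // A relation that is not followed by a text-argument.
        System.out.println(isValid("< a < , b"));
    }
}
```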

Examples

Normally, to create a rule-based collator object, you will use {@code Collator}'s factory method {@code getInstance}. However, to create a rule-based collator object with specialized rules tailored to your needs, you construct the {@code RuleBasedCollator} with the rules contained in a {@code String} object. For example:

String Simple = "< a < b < c < d";

RuleBasedCollator mySimple = new RuleBasedCollator(Simple);

Or:

String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I"
+ "< j,J< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R"
+ "< s,S< t,T< u,U< v,V< w,W< x,X< y,Y< z,Z"
+ "< \u00E5=a\u030A,\u00C5=A\u030A"
+ ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";

RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
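
Once constructed, such a collator can be used as a {@code Comparator} for sorting. The sketch below assumes the standard {@code java.text} implementation; the class {@code CustomOrderDemo}, the helper {@code sortWith} and the sample words are illustrative only.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative demo class; the names here are not part of the API.
class CustomOrderDemo {

    // Returns a sorted copy of words, ordered by the given rules.
    static List<String> sortWith(String rules, List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(words);
            // Collator implements Comparator<Object>, so it can be
            // passed to List.sort directly.
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cab", "abc", "bad", "dab");
        // Usual alphabetical order for these four letters.
        System.out.println(sortWith("< a < b < c < d", words));
        // The same letters with the primary order reversed.
        System.out.println(sortWith("< d < c < b < a", words));
    }
}
```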

Combining {@code Collator}s is as simple as concatenating strings. Here is an example that combines two {@code Collator}s from two different locales:

// Create an en_US Collator object
RuleBasedCollator en_USCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("en", "US", ""));

// Create a da_DK Collator object
RuleBasedCollator da_DKCollator = (RuleBasedCollator)Collator
.getInstance(new Locale("da", "DK", ""));

// Combine the two collators
// First, get the collation rules from en_USCollator
String en_USRules = en_USCollator.getRules();

// Second, get the collation rules from da_DKCollator
String da_DKRules = da_DKCollator.getRules();

RuleBasedCollator newCollator = new RuleBasedCollator(en_USRules + da_DKRules);
// newCollator has the combined rules

The next example shows how to make changes to an existing table to create a new {@code Collator} object. For example, add {@code "& C < ch, cH, Ch, CH"} to the rules of the {@code en_USCollator} object to create your own:

// Create a new Collator object with additional rules
String addRules = "& C < ch, cH, Ch, CH";

RuleBasedCollator myCollator = new RuleBasedCollator(en_USCollator.getRules() + addRules);
// myCollator contains the new rules
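
The effect of such an added rule can be observed by comparing strings before and after. This is a sketch only, assuming the standard {@code java.text} implementation, where {@code Collator.getInstance} returns a {@code RuleBasedCollator} for en_US; the class name {@code TailoringDemo} and the helper {@code withExtraRules} are hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Illustrative demo class; the names here are not part of the API.
class TailoringDemo {

    // Builds a collator from the en_US rules plus the given extra rules.
    static RuleBasedCollator withExtraRules(String extraRules) {
        RuleBasedCollator base =
                (RuleBasedCollator) Collator.getInstance(Locale.US);
        try {
            return new RuleBasedCollator(base.getRules() + extraRules);
        } catch (ParseException e) {
            throw new IllegalArgumentException(
                    "Invalid rules: " + extraRules, e);
        }
    }

    public static void main(String[] args) {
        Collator base = Collator.getInstance(Locale.US);
        RuleBasedCollator tailored = withExtraRules("& C < ch, cH, Ch, CH");

        // Untailored: "ch" is c followed by h, so it precedes "cz".
        System.out.println(base.compare("ch", "cz") < 0);
        // Tailored: "ch" is a separate letter after c, so it follows "cz".
        System.out.println(tailored.compare("ch", "cz") > 0);
    }
}
```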

The following example demonstrates how to change the order of non-spacing accents:

// old rule
String oldRules = "= \u00a8 ; \u00af ; \u00bf" + "< a , A ; ae, AE ; \u00e6 , \u00c6"
+ "< b , B < c, C < e, E & C < d, D";

// change the order of accent characters
String addOn = "& \u00bf ; \u00af ; \u00a8";

RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);

The last example shows how to put new primary ordering in before the default setting. For example, in the Japanese {@code Collator}, you can either sort English characters before or after Japanese characters:

// get en_US Collator rules
RuleBasedCollator en_USCollator = (RuleBasedCollator)
Collator.getInstance(Locale.US);

// add a few Japanese characters to sort before English characters
// suppose the last character before the first base letter 'a' in
// the English collation rule is \u30A2
String jaString = "& \u30A2 , \u30FC < \u30C8";

RuleBasedCollator myJapaneseCollator =
new RuleBasedCollator(en_USCollator.getRules() + jaString);
since
Android 1.0

Fields Summary
Constructors Summary
RuleBasedCollator(com.ibm.icu4jni.text.Collator wrapper)

        super(wrapper);
    
public RuleBasedCollator(String rules)
Constructs a new instance of {@code RuleBasedCollator} using the specified {@code rules}. The {@code rules} are usually either hand-written based on the {@link RuleBasedCollator class description} or the result of a former {@link #getRules()} call.

Note that the {@code rules} are actually interpreted as a delta to the standard Unicode Collation Algorithm (UCA). Hence, an empty {@code rules} string results in the default UCA rules being applied. This differs slightly from other implementations which work with full {@code rules} specifications and may result in different behavior.

param
rules the collation rules.
throws
NullPointerException if {@code rules} is {@code null}.
throws
ParseException if {@code rules} contains rules with invalid collation rule syntax.
since
Android 1.0

        if (rules == null) {
            throw new NullPointerException();
        }
        // BEGIN android-removed
        // if (rules.length() == 0) {
        //     // text.06=Build rules empty
        //     throw new ParseException(Messages.getString("text.06"), 0); //$NON-NLS-1$
        // }
        // END android-removed

        try {
            this.icuColl = new com.ibm.icu4jni.text.RuleBasedCollator(rules);
            // BEGIN android-added
            this.icuColl.setDecomposition(
                    com.ibm.icu4jni.text.Collator.CANONICAL_DECOMPOSITION);
            // END android-added
        } catch (Exception e) {
            if (e instanceof ParseException) {
                throw (ParseException) e;
            }
            /*
             * -1 means it's not a ParseException. Maybe IOException thrown when
             * an error occurred while reading internal data.
             */
            throw new ParseException(e.getMessage(), -1);
        }
    
Methods Summary
public java.lang.Object clone()
Returns a new collator with the same collation rules, decomposition mode and strength value as this collator.

return
a shallow copy of this collator.
see
java.lang.Cloneable
since
Android 1.0

        RuleBasedCollator clone = (RuleBasedCollator) super.clone();
        return clone;
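
The following sketch shows that a clone starts out equal to the original and can then be reconfigured independently. It assumes the standard {@code java.text} implementation; the class {@code CloneDemo} and its method are illustrative only.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustrative demo class; the names here are not part of the API.
class CloneDemo {

    // Returns true if changing the clone leaves the original untouched.
    static boolean cloneIsIndependent() {
        RuleBasedCollator original;
        try {
            original = new RuleBasedCollator("< a , A < b , B");
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
        int strengthBefore = original.getStrength();

        RuleBasedCollator copy = (RuleBasedCollator) original.clone();
        boolean equalAtFirst = copy.equals(original);

        // Relax the copy so that case differences are ignored.
        copy.setStrength(Collator.PRIMARY);

        boolean originalUnchanged = original.getStrength() == strengthBefore
                && original.compare("a", "A") != 0;
        boolean copyChanged = copy.compare("a", "A") == 0;
        return equalAtFirst && originalUnchanged && copyChanged;
    }

    public static void main(String[] args) {
        System.out.println(cloneIsIndependent());
    }
}
```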
    
public int compare(java.lang.String source, java.lang.String target)
Compares the {@code source} text to the {@code target} text according to the collation rules, strength and decomposition mode for this {@code RuleBasedCollator}. See the {@code Collator} class description for an example of use.

General recommendation: If comparisons are to be done with the same strings multiple times, it is more efficient to generate {@code CollationKey} objects for the strings and use {@code CollationKey.compareTo(CollationKey)} for the comparisons. If each string is compared only once, using {@code RuleBasedCollator.compare(String, String)} has better performance.

param
source the source text.
param
target the target text.
return
an integer which may be a negative value, zero, or else a positive value depending on whether {@code source} is less than, equivalent to, or greater than {@code target}.
since
Android 1.0

        if (source == null || target == null) {
            // text.08=one of arguments is null
            throw new NullPointerException(Messages.getString("text.08")); //$NON-NLS-1$
        }
        return this.icuColl.compare(source, target);
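
How the strength setting changes the result of {@code compare} can be sketched as follows. The example assumes the standard {@code java.text} implementation and an en_US collator; the class {@code StrengthDemo} and the helper {@code compareAt} are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative demo class; the names here are not part of the API.
class StrengthDemo {

    // Compares a and b with an en_US collator at the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        // An accent is a secondary difference: invisible at PRIMARY.
        System.out.println(compareAt(Collator.PRIMARY, "e", eAcute) == 0);
        System.out.println(compareAt(Collator.SECONDARY, "e", eAcute) < 0);
        // Case is a tertiary difference: invisible at SECONDARY.
        System.out.println(compareAt(Collator.SECONDARY, "e", "E") == 0);
        System.out.println(compareAt(Collator.TERTIARY, "e", "E") < 0);
    }
}
```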
    
public boolean equals(java.lang.Object obj)
Compares the specified object with this {@code RuleBasedCollator} and indicates if they are equal. In order to be equal, {@code obj} must be an instance of {@code Collator} with the same collation rules and the same attributes.

param
obj the object to compare with this object.
return
{@code true} if the specified object is equal to this {@code RuleBasedCollator}; {@code false} otherwise.
see
#hashCode
since
Android 1.0

        if (!(obj instanceof Collator)) {
            return false;
        }
        return super.equals(obj);
    
public java.text.CollationElementIterator getCollationElementIterator(java.text.CharacterIterator source)
Obtains a {@code CollationElementIterator} for the given {@code CharacterIterator}. The source iterator's integrity will be preserved since a new copy will be created for use.

param
source the source character iterator.
return
a {@code CollationElementIterator} for {@code source}.
since
Android 1.0

        if (source == null) {
            throw new NullPointerException();
        }
        return new CollationElementIterator(
                ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl)
                        .getCollationElementIterator(source));
    
public java.text.CollationElementIterator getCollationElementIterator(java.lang.String source)
Obtains a {@code CollationElementIterator} for the given string.

param
source the source string.
return
the {@code CollationElementIterator} for {@code source}.
since
Android 1.0

        if (source == null) {
            throw new NullPointerException();
        }
        return new CollationElementIterator(
                ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl)
                        .getCollationElementIterator(source));
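
A typical use of the returned iterator is to walk the collation elements of a string. The sketch below assumes the standard {@code java.text} implementation and collects the primary order of each element; the class {@code ElementIteratorDemo} and the helper {@code primaryOrders} are illustrative only.

```java
import java.text.CollationElementIterator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

// Illustrative demo class; the names here are not part of the API.
class ElementIteratorDemo {

    // Returns the primary order of each collation element in text.
    static List<Integer> primaryOrders(String rules, String text) {
        RuleBasedCollator collator;
        try {
            collator = new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
        CollationElementIterator it =
                collator.getCollationElementIterator(text);
        List<Integer> orders = new ArrayList<>();
        int element;
        // next() returns NULLORDER once the text is exhausted.
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            orders.add(CollationElementIterator.primaryOrder(element));
        }
        return orders;
    }

    public static void main(String[] args) {
        // One element per letter; the values rise in rule order.
        System.out.println(primaryOrders("< a < b < c", "abc"));
    }
}
```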
    
public java.text.CollationKey getCollationKey(java.lang.String source)
Returns the {@code CollationKey} for the given source text.

param
source the specified source text.
return
the {@code CollationKey} for the given source text.
since
Android 1.0

        com.ibm.icu4jni.text.CollationKey icuKey = this.icuColl
                .getCollationKey(source);
        if (icuKey == null) {
            return null;
        }
        return new CollationKey(source, icuKey);
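
When the same strings are compared many times, as in a sort, keys can be computed once and compared directly. The following is a minimal sketch, assuming the standard {@code java.text} implementation; the class {@code CollationKeyDemo} and the helper {@code sortByKeys} are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative demo class; the names here are not part of the API.
class CollationKeyDemo {

    // Sorts words by computing one CollationKey per word up front.
    static List<String> sortByKeys(Collator collator, List<String> words) {
        List<CollationKey> keys = new ArrayList<>();
        for (String word : words) {
            keys.add(collator.getCollationKey(word));
        }
        // CollationKey implements Comparable<CollationKey>.
        Collections.sort(keys);

        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        List<String> words = Arrays.asList("banana", "cherry", "apple");
        System.out.println(sortByKeys(collator, words));
    }
}
```

Keys are only comparable with other keys produced by the same collator.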
    
public java.lang.String getRules()
Returns the collation rules of this collator. These {@code rules} can be fed into the {@link #RuleBasedCollator(String)} constructor.

Note that the {@code rules} are actually interpreted as a delta to the standard Unicode Collation Algorithm (UCA). Hence, an empty {@code rules} string results in the default UCA rules being applied. This differs slightly from other implementations which work with full {@code rules} specifications and may result in different behavior.

return
the collation rules.
since
Android 1.0

        return ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl).getRules();
    
public int hashCode()

        return ((com.ibm.icu4jni.text.RuleBasedCollator) this.icuColl).getRules()
                .hashCode();