FileDocCategorySizeDatePackage
RuleBasedCollator.javaAPI DocAndroid 1.5 API25915Wed May 06 22:41:04 BST 2009com.ibm.icu4jni.text

RuleBasedCollator

public final class RuleBasedCollator extends Collator
Concrete implementation class for Collation.

The collation table is composed of a list of collation rules, where each rule is of three forms:

< modifier >
< relation > < text-argument >
< reset > < text-argument >

RuleBasedCollator has the following restrictions for efficiency (other subclasses may be used for more complex languages) :

  1. If a French secondary ordering is specified it applies to the whole collator object.
  2. All non-mentioned Unicode characters are at the end of the collation order.
  3. If a character is not located in the RuleBasedCollator, the default Unicode Collation Algorithm (UCA) rulebased table is automatically searched as a backup.
The following demonstrates how to create your own collation rules:
  • Text-Argument: A text-argument is any sequence of characters, excluding special characters (that is, common whitespace characters [0009-000D, 0020] and rule syntax characters [0021-002F, 003A-0040, 005B-0060, 007B-007E]). If those characters are desired, you can put them in single quotes (e.g. ampersand => '&'). Note that unquoted white space characters are ignored; e.g. b c is treated as bc.
  • Modifier: There is a single modifier which is used to specify that all accents (secondary differences) are backwards.

    '@' : Indicates that accents are sorted backwards, as in French.

  • Relation: The relations are the following:
    • '<' : Greater, as a letter difference (primary)
    • ';' : Greater, as an accent difference (secondary)
    • ',' : Greater, as a case difference (tertiary)
    • '=' : Equal
  • Reset: There is a single reset which is used primarily for contractions and expansions, but which can also be used to add a modification at the end of a set of rules.

    '&' : Indicates that the next rule follows the position to where the reset text-argument would be sorted.

This sounds more complicated than it is in practice. For example, the following are equivalent ways of expressing the same thing:

a < b < c
a < b & b < c
a < c & a < b
Notice that the order is important, as the subsequent item goes immediately after the text-argument. The following are not equivalent:
a < b & a < c
a < c & a < b
Either the text-argument must already be present in the sequence, or some initial substring of the text-argument must be present. (e.g. "a < b & ae < e" is valid since "a" is present in the sequence before "ae" is reset). In this latter case, "ae" is not entered and treated as a single character; instead, "e" is sorted as if it were expanded to two characters: "a" followed by an "e". This difference appears in natural languages: in traditional Spanish "ch" is treated as though it contracts to a single character (expressed as "c < ch < d"), while in traditional German a-umlaut is treated as though it expanded to two characters (expressed as "a,A < b,B ... & ae;? & AE;?"). [? and ? are, of course, the escape sequences for a-umlaut.]

Ignorable Characters

For ignorable characters, the first rule must start with a relation (the examples we have used above are really fragments; "a < b" really should be "< a < b"). If, however, the first relation is not "<", then all the all text-arguments up to the first "<" are ignorable. For example, ", - < a < b" makes "-" an ignorable character, as we saw earlier in the word "black-birds". In the samples for different languages, you see that most accents are ignorable.

Normalization and Accents

RuleBasedCollator automatically processes its rule table to include both pre-composed and combining-character versions of accented characters. Even if the provided rule string contains only base characters and separate combining accent characters, the pre-composed accented characters matching all canonical combinations of characters from the rule string will be entered in the table.

This allows you to use a RuleBasedCollator to compare accented strings even when the collator is set to NO_DECOMPOSITION. However, if the strings to be collated contain combining sequences that may not be in canonical order, you should set the collator to CANONICAL_DECOMPOSITION to enable sorting of combining sequences. For more information, see The Unicode Standard, Version 3.0.)

Errors

The following are errors:

  • A text-argument contains unquoted punctuation symbols (e.g. "a < b-c < d").
  • A relation or reset character not followed by a text-argument (e.g. "a < , b").
  • A reset where the text-argument (or an initial substring of the text-argument) is not already in the sequence or allocated in the default UCA table. (e.g. "a < b & e < f")
If you produce one of these errors, a RuleBasedCollator throws a ParseException.

Examples

Simple: "< a < b < c < d"

Norwegian: "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J < k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T < u,U< v,V< w,W< x,X< y,Y< z,Z < ?=a?,?=A? ;aa,AA< ?,?< ?,?"

Normally, to create a rule-based Collator object, you will use Collator's factory method getInstance. However, to create a rule-based Collator object with specialized rules tailored to your needs, you construct the RuleBasedCollator with the rules contained in a String object. For example:

String Simple = "< a < b < c < d";
RuleBasedCollator mySimple = new RuleBasedCollator(Simple);
Or:
String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
"< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
"< u,U< v,V< w,W< x,X< y,Y< z,Z" +
"< ?=a?,?=A?" +
";aa,AA< ?,?< ?,?";
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);

Combining Collators is as simple as concatenating strings. Here's an example that combines two Collators from two different locales:

// Create an en_US Collator object
RuleBasedCollator en_USCollator = (RuleBasedCollator)
Collator.getInstance(new Locale("en", "US", ""));
// Create a da_DK Collator object
RuleBasedCollator da_DKCollator = (RuleBasedCollator)
Collator.getInstance(new Locale("da", "DK", ""));
// Combine the two
// First, get the collation rules from en_USCollator
String en_USRules = en_USCollator.getRules();
// Second, get the collation rules from da_DKCollator
String da_DKRules = da_DKCollator.getRules();
RuleBasedCollator newCollator =
new RuleBasedCollator(en_USRules + da_DKRules);
// newCollator has the combined rules

Another more interesting example would be to make changes on an existing table to create a new Collator object. For example, add "& C < ch, cH, Ch, CH" to the en_USCollator object to create your own:

// Create a new Collator object with additional rules
String addRules = "& C < ch, cH, Ch, CH";
RuleBasedCollator myCollator =
new RuleBasedCollator(en_USCollator + addRules);
// myCollator contains the new rules

The following example demonstrates how to change the order of non-spacing accents,

// old rule
String oldRules = "=?;?;?" // main accents Diaeresis 00A8, Macron 00AF
// Acute 00BF
+ "< a , A ; ae, AE ; ? , ?"
+ "< b , B < c, C < e, E & C < d, D";
// change the order of accent characters
String addOn = "& ?;?;?;"; // Acute 00BF, Macron 00AF, Diaeresis 00A8
RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);

The last example shows how to put new primary ordering in before the default setting. For example, in Japanese Collator, you can either sort English characters before or after Japanese characters,

// get en_US Collator rules
RuleBasedCollator en_USCollator =
(RuleBasedCollator)Collator.getInstance(Locale.US);
// add a few Japanese character to sort before English characters
// suppose the last character before the first base letter 'a' in
// the English collation rule is ?
String jaString = "& \\u30A2 , \\u30FC < \\u30C8";
RuleBasedCollator myJapaneseCollator = new
RuleBasedCollator(en_USCollator.getRules() + jaString);

author
syn wee quek
stable
ICU 2.4

Fields Summary
private int
m_collator_
C collator
private int
m_hashcode_
Hash code for rules
Constructors Summary
public RuleBasedCollator(String rules)
RuleBasedCollator constructor. This takes the table rules and builds a collation table out of them. Please see RuleBasedCollator class description for more details on the collation rule syntax.

param
rules the collation rules to build the collation table from.
exception
ParseException thrown if rules are empty or a Runtime error if collator can not be created.
stable
ICU 2.4

    // BEGIN android-changed
    if (rules == null) {
      throw new NullPointerException();
    }
    // if (rules.length() == 0)
    //   throw new ParseException("Build rules empty.", 0);
    // END android-changed
    m_collator_ = NativeCollation.openCollatorFromRules(rules,
                              CollationAttribute.VALUE_OFF,
                              CollationAttribute.VALUE_DEFAULT_STRENGTH);
  
RuleBasedCollator()
RuleBasedCollator default constructor. This constructor takes the default locale. The only caller of this class should be Collator.getInstance(). Current implementation createInstance() returns a RuleBasedCollator(Locale) instance. The RuleBasedCollator will be created in the following order,
  • Data from argument locale resource bundle if found, otherwise
  • Data from parent locale resource bundle of arguemtn locale if found, otherwise
  • Data from built-in default collation rules if found, other
  • null is returned

    m_collator_ = NativeCollation.openCollator();
  
public RuleBasedCollator(String rules, int strength)
RuleBasedCollator constructor. This takes the table rules and builds a collation table out of them. Please see RuleBasedCollator class description for more details on the collation rule syntax.

param
rules the collation rules to build the collation table from.
param
strength collation strength
exception
ParseException thrown if rules are empty or a Runtime error if collator can not be created.
see
#PRIMARY
see
#SECONDARY
see
#TERTIARY
see
#QUATERNARY
see
#IDENTICAL
stable
ICU 2.4

    // BEGIN android-changed
    if (rules == null) {
      throw new NullPointerException();
    }
    // if (rules.length() == 0)
    //   throw new ParseException("Build rules empty.", 0);
    // END android-changed
    if (!CollationAttribute.checkStrength(strength))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
      
    m_collator_ = NativeCollation.openCollatorFromRules(rules,
                                CollationAttribute.VALUE_OFF,
                                strength);
  
RuleBasedCollator(Locale locale)
RuleBasedCollator constructor. This constructor takes a locale. The only caller of this class should be Collator.createInstance(). Current implementation createInstance() returns a RuleBasedCollator(Locale) instance. The RuleBasedCollator will be created in the following order,
  • Data from argument locale resource bundle if found, otherwise
  • Data from parent locale resource bundle of arguemtn locale if found, otherwise
  • Data from built-in default collation rules if found, other
  • null is returned

param
locale locale used

    if (locale == null) {
      m_collator_ = NativeCollation.openCollator();
    }
    else {
      m_collator_ = NativeCollation.openCollator(locale.toString());
    }
  
private RuleBasedCollator(int collatoraddress)
Private use constructor. Does not create any instance of the C collator. Accepts argument as the C collator for new instance.

param
collatoraddress address of C collator

  
  // private constructor -----------------------------------------
  
                               
    
  
    m_collator_ = collatoraddress;
  
public RuleBasedCollator(String rules, int normalizationmode, int strength)
RuleBasedCollator constructor. This takes the table rules and builds a collation table out of them. Please see RuleBasedCollator class description for more details on the collation rule syntax.

Note API change starting from release 2.4. Prior to release 2.4, the normalizationmode argument values are from the class com.ibm.icu4jni.text.Normalization. In 2.4, the valid normalizationmode arguments for this API are CollationAttribute.VALUE_ON and CollationAttribute.VALUE_OFF.

param
rules the collation rules to build the collation table from.
param
strength collation strength
param
normalizationmode normalization mode
exception
IllegalArgumentException thrown when constructor error occurs
see
#PRIMARY
see
#SECONDARY
see
#TERTIARY
see
#QUATERNARY
see
#IDENTICAL
see
#CANONICAL_DECOMPOSITION
see
#NO_DECOMPOSITION
stable
ICU 2.4

    // BEGIN android-added
    if (rules == null) {
      throw new NullPointerException();
    }
    // END android-added
    if (!CollationAttribute.checkStrength(strength) || 
        !CollationAttribute.checkNormalization(normalizationmode)) {
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    }
      
    m_collator_ = NativeCollation.openCollatorFromRules(rules,
                                          normalizationmode, strength);
  
Methods Summary
public java.lang.Objectclone()
Makes a complete copy of the current object.

return
a copy of this object if data clone is a success, otherwise null
stable
ICU 2.4

    RuleBasedCollator result = null;
    int collatoraddress = NativeCollation.safeClone(m_collator_);
    result = new RuleBasedCollator(collatoraddress);
    return (Collator)result;
  
public intcompare(java.lang.String source, java.lang.String target)
The comparison function compares the character data stored in two different strings. Returns information about whether a string is less than, greater than or equal to another string.

Example of use:
Collator myCollation = Collator.createInstance(Locale::US); myCollation.setStrength(CollationAttribute.VALUE_PRIMARY); // result would be Collator.RESULT_EQUAL ("abc" == "ABC") // (no primary difference between "abc" and "ABC") int result = myCollation.compare("abc", "ABC",3); myCollation.setStrength(CollationAttribute.VALUE_TERTIARY); // result would be Collation::LESS (abc" <<< "ABC") // (with tertiary difference between "abc" and "ABC") int result = myCollation.compare("abc", "ABC",3);

param
source The source string.
param
target The target string.
return
result of the comparison, Collator.RESULT_EQUAL, Collator.RESULT_GREATER or Collator.RESULT_LESS
stable
ICU 2.4

    return NativeCollation.compare(m_collator_, source, target);
  
public booleanequals(java.lang.Object target)
Checks if argument object is equals to this object.

param
target object
return
true if source is equivalent to target, false otherwise
stable
ICU 2.4

    if (this == target) 
      return true;
    if (target == null) 
      return false;
    if (getClass() != target.getClass()) 
      return false;
    
    RuleBasedCollator tgtcoll = (RuleBasedCollator)target;
    return getRules().equals(tgtcoll.getRules()) && 
           getStrength() == tgtcoll.getStrength() && 
           getDecomposition() == tgtcoll.getDecomposition();
  
protected voidfinalize()
Garbage collection. Close C collator and reclaim memory.

    NativeCollation.closeCollator(m_collator_);
  
public intgetAttribute(int type)
Gets the attribute to be used in comparison or transformation.

param
type the attribute to be set from CollationAttribute
return
value attribute value from CollationAttribute
stable
ICU 2.4

    if (!CollationAttribute.checkType(type))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    return NativeCollation.getAttribute(m_collator_, type);
  
public CollationElementIteratorgetCollationElementIterator(java.lang.String source)
Create a CollationElementIterator object that will iterator over the elements in a string, using the collation rules defined in this RuleBasedCollator

param
source string to iterate over
return
address of C collationelement
exception
IllegalArgumentException thrown when error occurs
stable
ICU 2.4

    CollationElementIterator result = new CollationElementIterator(
         NativeCollation.getCollationElementIterator(m_collator_, source));
    // result.setOwnCollationElementIterator(true);
    return result;
  
public CollationElementIteratorgetCollationElementIterator(java.text.CharacterIterator source)
Create a CollationElementIterator object that will iterator over the elements in a string, using the collation rules defined in this RuleBasedCollator

param
source string to iterate over
return
address of C collationelement
exception
IllegalArgumentException thrown when error occurs
stable
ICU 2.4

    CollationElementIterator result = new CollationElementIterator(
         NativeCollation.getCollationElementIterator(m_collator_, 
                 source.toString()));
    // result.setOwnCollationElementIterator(true);
    return result;
  
public CollationKeygetCollationKey(java.lang.String source)
Get the sort key as an CollationKey object from the argument string. To retrieve sort key in terms of byte arrays, use the method as below

Collator collator = Collator.getInstance(); byte[] array = collator.getSortKey(source);
Byte array result are zero-terminated and can be compared using java.util.Arrays.equals();

param
source string to be processed.
return
the sort key
stable
ICU 2.4

    // BEGIN android-removed
    // return new CollationKey(NativeCollation.getSortKey(m_collator_, source));
    // END android-removed
    // BEGIN android-added
    if(source == null) {
        return null;
    }
    byte[] key = NativeCollation.getSortKey(m_collator_, source);
    if(key == null) {
      return null;
    }
    return new CollationKey(key);
    // END android-added
  
public intgetDecomposition()
Get the normalization mode for this object. The normalization mode influences how strings are compared.

see
#CANONICAL_DECOMPOSITION
see
#NO_DECOMPOSITION
stable
ICU 2.4

    return NativeCollation.getNormalization(m_collator_);
  
public java.lang.StringgetRules()
Get the collation rules of this Collation object The rules will follow the rule syntax.

return
collation rules.
stable
ICU 2.4

    return NativeCollation.getRules(m_collator_);
  
public byte[]getSortKey(java.lang.String source)
Get a sort key for the argument string Sort keys may be compared using java.util.Arrays.equals

param
source string for key to be generated
return
sort key
stable
ICU 2.4

    return NativeCollation.getSortKey(m_collator_, source);
  
public intgetStrength()
Determines the minimum strength that will be use in comparison or transformation.

E.g. with strength == CollationAttribute.VALUE_SECONDARY, the tertiary difference is ignored

E.g. with strength == PRIMARY, the secondary and tertiary difference are ignored.

return
the current comparison level.
see
#PRIMARY
see
#SECONDARY
see
#TERTIARY
see
#QUATERNARY
see
#IDENTICAL
stable
ICU 2.4

    return NativeCollation.getAttribute(m_collator_, 
                                        CollationAttribute.STRENGTH);
  
public inthashCode()
Returns a hash of this collation object Note this method is not complete, it only returns 0 at the moment.

return
hash of this collation object
stable
ICU 2.4

    // since rules do not change once it is created, we can cache the hash
    if (m_hashcode_ == 0) {
      m_hashcode_ = NativeCollation.hashCode(m_collator_);
      if (m_hashcode_ == 0)
        m_hashcode_ = 1;
    }
    return m_hashcode_;
  
public voidsetAttribute(int type, int value)
Sets the attribute to be used in comparison or transformation.

Example of use:
Collator myCollation = Collator.createInstance(Locale::US); myCollation.setAttribute(CollationAttribute.CASE_LEVEL, CollationAttribute.VALUE_ON); int result = myCollation->compare("\\u30C3\\u30CF", "\\u30C4\\u30CF"); // result will be Collator.RESULT_LESS.

param
type the attribute to be set from CollationAttribute
param
value attribute value from CollationAttribute
stable
ICU 2.4

    if (!CollationAttribute.checkAttribute(type, value))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    NativeCollation.setAttribute(m_collator_, type, value);
  
public voidsetDecomposition(int decompositionmode)

Sets the decomposition mode of the Collator object on or off. If the decomposition mode is set to on, string would be decomposed into NFD format where necessary before sorting.

param
decompositionmode the new decomposition mode
see
#CANONICAL_DECOMPOSITION
see
#NO_DECOMPOSITION
stable
ICU 2.4

    if (!CollationAttribute.checkNormalization(decompositionmode)) 
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    NativeCollation.setAttribute(m_collator_, 
                                 CollationAttribute.NORMALIZATION_MODE,
                                 decompositionmode);
  
public voidsetStrength(int strength)
Sets the minimum strength to be used in comparison or transformation.

Example of use:
Collator myCollation = Collator.createInstance(Locale::US); myCollation.setStrength(PRIMARY); // result will be "abc" == "ABC" // tertiary differences will be ignored int result = myCollation->compare("abc", "ABC");

param
strength the new comparison level.
exception
IllegalArgumentException when argument does not belong to any collation strength mode or error occurs while setting data.
see
#PRIMARY
see
#SECONDARY
see
#TERTIARY
see
#QUATERNARY
see
#IDENTICAL
stable
ICU 2.4

    if (!CollationAttribute.checkStrength(strength)) 
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    NativeCollation.setAttribute(m_collator_, CollationAttribute.STRENGTH, 
                                 strength);