

The Collation Demo

The JDK 1.1 release added the Collator class (see the API reference documentation), which performs locale-sensitive string comparison. This class lets you sort lists and search for strings in a locale-sensitive way. To get an understanding of what Collator can do, and how it does it, bring up the following demo applet provided by Taligent and try some of the suggestions below.

Since you can't run the applet, here's a picture of it:

Text collation supports language-sensitive comparison of strings, allowing for text searching and alphabetical sorting. Taligent's collation classes provide a choice of ordering strength (for example, to ignore or not ignore case differences) and handle ignored, expanding, and contracting characters.

Developers don't need to know anything about the collation rules for various languages. Any features requiring collation can use the collation object associated with the current default locale, or with a specific locale (like France or Japan) if appropriate.
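As a minimal sketch of that idea, the following asks java.text.Collator for the collation object of a given locale and uses it as the comparator for a sort. The class name CollatorSortSketch and the helper sortFor are hypothetical names chosen for this illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CollatorSortSketch {

    // Returns a sorted copy of the words, ordered by the collator
    // associated with the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator is a Comparator, so it plugs into sort directly
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Zebra", "apple", "Mango");
        // A specific locale...
        System.out.println(sortFor(Locale.US, words));
        // ...or whatever the current default locale happens to be.
        System.out.println(sortFor(Locale.getDefault(), words));
    }
}
```

Note that a plain String.compareTo sort compares character codes, so it would place "Zebra" before "apple". The collator compares base letters first.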

Collation Basics | Localizable Collation | Searching | Customization | Details


Collation Basics

Correctly sorting strings is tricky, even in English. The results of a sort must be consistent--any differences in strings must always be sorted the same way. The sort assigns relative priorities to different features of the text, based on the characters themselves and on the current ordering strength of the collation object.

To See This...

Do This...

Consistent sorting: In English, uppercase letters always sort after lowercase letters whenever there are no other differences in compared strings.
  1. Click on the Sort Ascending button.
  2. Click on the Sort Descending button.
  3. The relative order of "pat", "Pat", and "PAT" reverses.
Differences in ordering strength: Secondary ordering strength means case differences are disregarded (enabling case-insensitive searching). With primary ordering strength, accents are also ignored, so only base letters are compared.
  1. Select Primary from the Strength menu.
  2. Click alternately on Sort Ascending and Sort Descending.
  3. The relative order of "pat", "Pat", and "PAT" stays the same.
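In code, the ordering strength corresponds to Collator.setStrength. The sketch below compares the same pair of strings at each strength. StrengthSketch and compareAt are hypothetical names, and the US locale is an assumption standing in for the demo's English setting.

```java
import java.text.Collator;
import java.util.Locale;

final class StrengthSketch {

    // Compares two strings with an English collator set to the given strength
    // (Collator.PRIMARY, Collator.SECONDARY, or Collator.TERTIARY).
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Case is a tertiary difference: it is disregarded at the two weaker strengths.
        System.out.println(compareAt(Collator.PRIMARY, "pat", "PAT"));
        System.out.println(compareAt(Collator.SECONDARY, "pat", "PAT"));
        System.out.println(compareAt(Collator.TERTIARY, "pat", "PAT"));
        // An accent is a secondary difference; try it at each strength as well.
        System.out.println(compareAt(Collator.PRIMARY, "peche", "p\u00EAche"));
        System.out.println(compareAt(Collator.SECONDARY, "peche", "p\u00EAche"));
    }
}
```

A result of 0 means the collator treats the two strings as equal at that strength, which is why the relative order of such strings does not change when the sort direction is reversed.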

Other special characters, including accented or grouped characters, add other complications. For example, the hyphen character ("-") in the word "black-bird" is only significant if the other letters in the compared strings are identical.


Localizable Collation

Different collation objects associated with various locales handle the differences required when sorting text strings for different languages.

To See This...

Do This...

In French, accent differences are sorted from the end of the word, so the ordering of "pêche" and "péché" changes from the English ordering.
  1. Select Tertiary from the Strength menu.
  2. Select French (France) from the Locale menu.
In German the ordering of "Töne" changes, because German treats o + umlaut (ö) as if it were oe.
  1. Select German (Germany) from the Locale menu.
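Outside the demo, switching locales amounts to asking for a different collation object. The sketch below sorts one word list with the collators for three locales and prints each result. LocaleSortSketch and sortFor are hypothetical names, and the word list is only an illustration. The orderings you see depend on the collation rules shipped with your Java runtime, so they may not match the demo exactly.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LocaleSortSketch {

    // Returns a sorted copy of the words using the collator for the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.TERTIARY); // same strength the demo steps select
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only:
        // \u00EA is e-circumflex, \u00E9 is e-acute, \u00F6 is o-umlaut.
        List<String> words = List.of(
                "p\u00E9ch\u00E9", "p\u00EAche", "peche",
                "Tofu", "T\u00F6ne", "Tone");
        for (Locale locale : List.of(Locale.US, Locale.FRANCE, Locale.GERMANY)) {
            System.out.println(locale.getDisplayName(Locale.ENGLISH) + ": " + sortFor(locale, words));
        }
    }
}
```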

Searching

The same collation object can also be used for language-sensitive searching.

To See This...

Do This...

The matching strength determines whether you get a loose fit or an exact match.
  1. Select the Search tab.
  2. Select Primary from the Strength menu.
  3. Type "Blackbird" into the Search String field.
  4. Click in the target field, and press the right or left arrow key.
  5. The matching items are highlighted.
Since German treats o + umlaut (ö) as if it were oe, "Töne" and "Toene" will match.
  1. Select German (Germany) from the Locale menu.
  2. Select Secondary from the Strength menu.
  3. Type "Toene" in the Search String field.
  4. Click in the target field, and press the right or left arrow key.
  5. The matching items are highlighted.
A word-break iterator can also be used to restrict matches to whole words.
  1. Check the Match Whole Words Only box at the bottom.
  2. Select Primary from the Strength menu.
  3. Type "blackbird" in the Search String field.
  4. Click in the target field, and press the right or left arrow key.
  5. Only the whole words that match are highlighted.
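The whole-word case can be approximated with the standard java.text classes. The sketch below is not the demo's search code: it simply walks the text with a word-break iterator and asks the collator whether each word equals the search string at the chosen strength. WholeWordSearchSketch and matchingWords are hypothetical names.

```java
import java.text.BreakIterator;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WholeWordSearchSketch {

    // Returns the words of the text that the collator considers equal to the
    // pattern at the given strength.
    static List<String> matchingWords(String text, String pattern, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);

        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        List<String> matches = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // The iterator also returns whitespace and punctuation segments; skip those.
            if (Character.isLetterOrDigit(segment.codePointAt(0)) && collator.equals(segment, pattern)) {
                matches.add(segment);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "A Blackbird sang; blackbirds flew.";
        // Loose fit: case is disregarded at primary strength.
        System.out.println(matchingWords(text, "blackbird", Locale.US, Collator.PRIMARY));
        // Exact match: case counts at tertiary strength.
        System.out.println(matchingWords(text, "blackbird", Locale.US, Collator.TERTIARY));
    }
}
```

Because each candidate is a complete word, "blackbirds" is never reported as a match for "blackbird".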

Customization

You can produce a new collation by adding to or changing an existing one. In the demo, you do this with the Customize tab. This panel shows the rules that make up the collation sequence for the selected language. (At the start of the list are a number of odd-looking items such as "\u0308". These use Java notation for Unicode characters, used here because most browsers are currently unable to display the full range of Unicode characters.)

In all of the following examples, you can cut and paste sample rules or test cases instead of typing them in manually. Paste them at the end of the respective fields.

To See This...

Do This...

You can modify an existing collation. Adding items at the end of a collation overrides earlier information.

For example, you can make the letter P sort at the end of the alphabet.

  1. Select the Customize tab.
  2. Click at the end of the very last rule, and type a Return.
  3. Enter the sample rules at the end of the Collator Rules field.
  4. Hit the Set Rules button.
  5. Select the Sort tab to see the resulting sort order.
Sample Rules:
< p , P

Making P sort at the end may not seem terribly useful, but the same technique is used to modify the sorting sequence for different languages.
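In code, the same customization can be sketched with java.text.RuleBasedCollator by appending the sample rule to an existing rule string. To keep the outcome independent of the rule data shipped with a particular runtime, this sketch assumes a small stand-in alphabet ("< a, A < b, B ..." through "z, Z") instead of the demo's full English rules. MovePToEndSketch, miniAlphabetRules, and sortWithExtraRules are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

final class MovePToEndSketch {

    // Assumption: a miniature base collation, "< a, A < b, B ... < z, Z",
    // standing in for the rules shown in the demo's Customize tab.
    static String miniAlphabetRules() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ").append(Character.toUpperCase(c)).append(' ');
        }
        return rules.toString();
    }

    // Sorts a copy of the words with the base rules plus the extra rules appended.
    static List<String> sortWithExtraRules(String extraRules, List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(miniAlphabetRules() + extraRules);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad collation rules: " + extraRules, e);
        }
    }

    public static void main(String[] args) {
        List<String> words = List.of("pat", "zebra", "apple", "quiz");
        System.out.println(sortWithExtraRules("", words));        // base rules only
        System.out.println(sortWithExtraRules("< p , P", words)); // sample rule appended
    }
}
```

The appended rule repositions p and P after the last rule that precedes it, which in this stand-in alphabet is Z.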

To See This...

Do This...

You can add new rules to an existing collation. For example, you can add CH as a single letter after C, as in traditional Spanish sorting.
  1. Select the Customize tab.
  2. Click at the end of the very last rule, and type a Return.
  3. Enter the sample rules at the end of the Collator Rules field.
  4. Hit the Set Rules button.
  5. Select the Sort tab, and enter the sample test cases into the Text To Sort field.
  6. Click on the Sort Ascending button to see the resulting sort order.
Sample Rules:
& c < ch , cH, Ch, CH
Sample Test Cases:
cat
czar
churo
darn
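The CH rule can be tried in code as well. This sketch appends the sample rule to an assumed miniature alphabet ("< a, A < b, B ..." through "z, Z") rather than the demo's full rule set, then sorts the sample test cases. ContractionSketch and its helper methods are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

final class ContractionSketch {

    // Assumption: a miniature base collation, "< a, A < b, B ... < z, Z".
    static String miniAlphabetRules() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ").append(Character.toUpperCase(c)).append(' ');
        }
        return rules.toString();
    }

    // Sorts a copy of the words with the base rules plus the extra rules appended.
    static List<String> sortWithExtraRules(String extraRules, List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(miniAlphabetRules() + extraRules);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad collation rules: " + extraRules, e);
        }
    }

    public static void main(String[] args) {
        List<String> words = List.of("darn", "churo", "czar", "cat");
        // Without the rule, "ch" is just c followed by h.
        System.out.println(sortWithExtraRules("", words));
        // With the rule, "ch" collates as one letter between c and d.
        System.out.println(sortWithExtraRules("& c < ch , cH, Ch, CH", words));
    }
}
```

With the rule in place, "churo" moves after "czar" because the two-character sequence is weighed as a single letter that follows c.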

As well as adding sequences of characters that act as a single character (this is known as contraction), you can also add characters that act like a sequence of characters (this is known as expansion).

To See This...

Do This...

You can also add other sequences to the collation rules, such as sorting symbols with their alphabetic equivalents.
  1. Select the Customize tab.
  2. Click at the end of the very last rule, and type a Return.
  3. Enter the sample rules at the end of the Collator Rules field.
  4. Hit the Set Rules button.
  5. Select the Sort tab, and enter the sample test cases into the Text To Sort field.
  6. Click on the Sort Ascending button to see the resulting sort order.
Sample Rules:
& Question'-'mark ; '?'
& Hash'-'mark ; '#'
& Ampersand ; '&'
Sample Test Cases:
?
#
&
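A code sketch of the same expansion rules follows, again built on an assumed miniature alphabet rather than the demo's full rules. Each symbol is tied to its spelled-out name, so the symbols sort by those names. ExpansionSketch and sortSymbols are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

final class ExpansionSketch {

    // The sample rules: each symbol is a secondary difference from its name.
    // Single quotes protect characters that are special in the rule syntax.
    static final String SYMBOL_RULES =
            "& Question'-'mark ; '?' "
          + "& Hash'-'mark ; '#' "
          + "& Ampersand ; '&'";

    // Assumption: a miniature base collation, "< a, A < b, B ... < z, Z".
    static String miniAlphabetRules() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ").append(Character.toUpperCase(c)).append(' ');
        }
        return rules.toString();
    }

    // Sorts a copy of the items with the base rules plus the symbol rules.
    static List<String> sortSymbols(List<String> items) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(miniAlphabetRules() + SYMBOL_RULES);
            List<String> sorted = new ArrayList<>(items);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalStateException("Sample rules failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // The symbols sort by the first letters of their names: A, H, Q.
        System.out.println(sortSymbols(List.of("?", "#", "&")));
    }
}
```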

Expansion and contraction can actually be combined.

To See This...

Do This...

In Japanese there is a length character that acts as though it doubles a character in sorting. Using analogous English letters, it would be as though "a-" sorts as "aa", "e-" sorts as "ee", etc.
  1. Select the Customize tab.
  2. Click at the end of the very last rule, and type a Return.
  3. Enter the sample rules at the end of the Collator Rules field.
  4. Hit the Set Rules button.
  5. Select the Sort tab, and enter the sample test cases into the Text To Sort field.
  6. Click on the Sort Ascending button to see the resulting sort order.
Sample Rules:
& aa ; a'-'
& ee ; e'-'
& ii ; i'-'
& oo ; o'-'
& uu ; u'-'
Sample Test Cases:
aardvark
a-rdvark
abbot
coop
co-p
cop
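The combined case can be sketched the same way. Here each vowel followed by a hyphen is a contraction that expands to the doubled vowel. As before, the base rules are an assumed miniature alphabet, and LengthMarkSketch and sortWithLengthMark are hypothetical names.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

final class LengthMarkSketch {

    // The sample rules: "a-" collates like "aa", "e-" like "ee", and so on,
    // differing from the doubled vowel only at the secondary level.
    static final String LENGTH_RULES =
            "& aa ; a'-' & ee ; e'-' & ii ; i'-' & oo ; o'-' & uu ; u'-'";

    // Assumption: a miniature base collation, "< a, A < b, B ... < z, Z".
    static String miniAlphabetRules() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ").append(Character.toUpperCase(c)).append(' ');
        }
        return rules.toString();
    }

    // Sorts a copy of the words with the base rules plus the length-mark rules.
    static List<String> sortWithLengthMark(List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(miniAlphabetRules() + LENGTH_RULES);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalStateException("Sample rules failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // Input deliberately shuffled; each hyphenated word should land next to
        // its doubled-vowel counterpart.
        List<String> words = List.of("cop", "a-rdvark", "coop", "abbot", "co-p", "aardvark");
        System.out.println(sortWithLengthMark(words));
    }
}
```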

For more information on how collation rules are constructed, see Details. You can also type in additional words to see different collation behaviors. Try it out!



The source.


This page incorporates material or code copyrighted by Taligent, Inc. For more information on international resources, see their International Fact Sheet.

