ICU collation keyword field type

The icu_collation_keyword field type stores terms as binary encoded collation keys, enabling language-specific sorting and range queries. Unlike standard string sorting, which uses byte-order comparison, this field type applies collation rules that respect linguistic conventions for a specific language or locale.

This field type is particularly useful when you need to sort documents according to language-specific alphabetical order, handle accented characters correctly, or implement culturally appropriate string comparisons.

Installation

The icu_collation_keyword field type requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

How it works

The icu_collation_keyword field encodes terms directly as binary collation keys in doc values and creates a single indexed token (similar to the standard keyword field). This approach provides:

Language-aware sorting: Applies collation rules specific to a language or locale
Efficient storage: Stores binary collation keys rather than full strings
Range query support: Enables range queries that respect linguistic ordering

By default, the field uses DUCET (Default Unicode Collation Element Table) collation, which provides a language-neutral best-effort sort order.

Parameters

The following table lists the parameters accepted by the icu_collation_keyword field type.

Parameter	Data type	Description
`language`	String	The language code (for example, `de` for German, `fr` for French). Optional.
`country`	String	The country code (for example, `DE` for Germany, `FR` for France). Optional.
`variant`	String	A variant string for additional collation options (for example, `@collation=phonebook` for German phonebook order). Optional.
`strength`	String	The collation strength level. Valid values are `primary`, `secondary`, `tertiary`, `quaternary`, and `identical`. Default is `tertiary`. Optional.
`decomposition`	String	How to handle character normalization. Valid values are `no` and `canonical`. Default is `no`. Optional.
`alternate`	String	How to handle whitespace and punctuation. Valid values are `shifted` and `non-ignorable`. Optional.
`case_level`	Boolean	Whether to consider case differences when `strength` is `primary`. Default is `false`. Optional.
`case_first`	String	Whether uppercase or lowercase sorts first. Valid values are `lower` and `upper`. Optional.
`numeric`	Boolean	Whether to sort numeric substrings by numeric value. For example, `item-9` sorts before `item-21`. Default is `false`. Optional.
`variable_top`	String	Specifies which characters are considered variable for the `alternate` option. Optional.
`hiragana_quaternary_mode`	Boolean	Whether to distinguish between Katakana and Hiragana at `quaternary` strength. Optional.
`doc_values`	Boolean	Whether the field should be stored on disk for sorting and aggregations. Default is `true`. Optional.
`index`	Boolean	Whether the field should be searchable. Default is `true`. Optional.
`null_value`	String	A string value to substitute for explicit `null` values. Default is `null` (field treated as missing). Optional.
`store`	Boolean	Whether to store the field value separately from `_source`. Default is `false`. Optional.
`fields`	Object	Multi-field mappings for indexing the same value in different ways. Optional.

Example: German phonebook sorting

The following example creates an index with a field that sorts German names using phonebook ordering:

PUT /german-names
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "sort": {
            "type": "icu_collation_keyword",
            "language": "de",
            "country": "DE",
            "variant": "@collation=phonebook"
          }
        }
      }
    }
  }
}

Index some German names:

POST /german-names/_bulk
{"index":{"_id":"1"}}
{"name":"Müller"}
{"index":{"_id":"2"}}
{"name":"Möller"}
{"index":{"_id":"3"}}
{"name":"Meyer"}
{"index":{"_id":"4"}}
{"name":"Schneider"}

Search and sort using the collation field:

GET /german-names/_search
{
  "query": {
    "match_all": {}
  },
  "sort": "name.sort"
}

The results are sorted according to German phonebook conventions, where ö and ü are treated as distinct characters in the German alphabet:

{
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "german-names",
        "_id": "3",
        "_score": null,
        "_source": {
          "name": "Meyer"
        }
      },
      {
        "_index": "german-names",
        "_id": "2",
        "_score": null,
        "_source": {
          "name": "Möller"
        }
      },
      {
        "_index": "german-names",
        "_id": "1",
        "_score": null,
        "_source": {
          "name": "Müller"
        }
      },
      {
        "_index": "german-names",
        "_id": "4",
        "_score": null,
        "_source": {
          "name": "Schneider"
        }
      }
    ]
  }
}

Example: French accented character sorting

The following example demonstrates French collation, which treats accented characters according to French linguistic rules:

PUT /french-words
{
  "mappings": {
    "properties": {
      "word": {
        "type": "text",
        "fields": {
          "sort": {
            "type": "icu_collation_keyword",
            "language": "fr",
            "country": "FR",
            "strength": "primary"
          }
        }
      }
    }
  }
}

Index French words with accents:

POST /french-words/_bulk
{"index":{"_id":"1"}}
{"word":"cote"}
{"index":{"_id":"2"}}
{"word":"côte"}
{"index":{"_id":"3"}}
{"word":"coté"}
{"index":{"_id":"4"}}
{"word":"côté"}

Query with sorting:

GET /french-words/_search
{
  "query": {
    "match_all": {}
  },
  "sort": "word.sort"
}

The results follow French alphabetical conventions.

Collation strength levels

The strength parameter determines how strictly the collation compares strings:

primary: Compares base characters only, ignoring accents and case. For example, a, A, á, and Á are considered equal.
secondary: Compares base characters and accents, but ignores case. For example, a and á are different, but a and A are equal.
tertiary (default): Compares base characters, accents, and case. For example, a, A, and á are all different.
quaternary: Adds punctuation and whitespace comparison when alternate is set to shifted.
identical: Performs character-by-character binary comparison.

Performance considerations

The icu_collation_keyword field type uses more disk space than standard keyword fields because it stores binary collation keys. However, this approach provides faster sorting and range queries compared to applying collation at query time.

For optimal performance:

Use icu_collation_keyword as a multi-field on text fields rather than as the primary field type
Set index: false if you only need sorting and don’t require range queries on the collation field
Choose an appropriate strength level—lower strengths produce smaller collation keys

Installation
How it works
Parameters
Example: German phonebook sorting
Example: French accented character sorting
Collation strength levels
Performance considerations
Related documentation

WAS THIS PAGE HELPFUL?

✔ Yes ✖ No

Tell us why

350 characters left

Have a question? Ask us on the OpenSearch forum.

Want to contribute? Edit this page or create an issue.

ICU collation keyword field type

Installation

How it works

Parameters

Example: German phonebook sorting

Example: French accented character sorting

Collation strength levels

Performance considerations

OpenSearch Links

Get Involved

Resources

Contact Us

ICU collation keyword field type

Installation

How it works

Parameters

Example: German phonebook sorting

Example: French accented character sorting

Collation strength levels

Performance considerations

Related documentation