ICU transform token filter

The icu_transform token filter applies ICU text transformations to tokens, enabling operations such as transliteration, case mapping, normalization, and bidirectional text handling. This filter uses transformation rules defined by the ICU Transform framework.

Common use cases include:

Transliteration: Converting text from one script to another (for example, Cyrillic to Latin)
Script conversion: Transforming between different writing systems
Accent removal: Stripping diacritics from base characters
Custom transformations: Applying user-defined transformation rules

Installation

The icu_transform token filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

Parameters

The following table lists the parameters for the icu_transform token filter.

Parameter	Data type	Description
`id`	String	The ICU transform ID specifying which transformation to apply. Can be a single transform ID or a compound ID with multiple transforms separated by semicolons. Default is `Null` (no transformation).
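`dir`	String	The direction in which to apply the transform. Valid values are `forward` and `reverse`. With `reverse`, the inverse of the transform specified by `id` is applied (for example, `Latin-Cyrillic` in reverse converts Cyrillic to Latin). Default is `forward`.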
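
As an illustration of how the two parameters fit into a filter definition, the following Python sketch builds the settings body for an index. The helper name `icu_transform_settings` and the filter name passed to it are hypothetical and chosen only for this example; the sketch only constructs the JSON body and does not send it to a cluster.

```python
import json

# The two values the `dir` parameter accepts, per the table above.
VALID_DIRS = {"forward", "reverse"}


def icu_transform_settings(filter_name, transform_id, direction="forward"):
    """Build index settings that define a single icu_transform filter.

    filter_name is a user-chosen (hypothetical) name; transform_id and
    direction map to the filter's `id` and `dir` parameters.
    """
    if direction not in VALID_DIRS:
        raise ValueError(f"dir must be one of {sorted(VALID_DIRS)}")
    return {
        "settings": {
            "analysis": {
                "filter": {
                    filter_name: {
                        "type": "icu_transform",
                        "id": transform_id,
                        "dir": direction,
                    }
                }
            }
        }
    }


body = icu_transform_settings("latin_transform", "Any-Latin")
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON can be used as the body of an index creation request once an analyzer that references the filter is added.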

Transform IDs

You can specify transformations using standard ICU transform IDs. Common transforms include:

Any-Latin: Transliterates text from any script to Latin characters
Latin-Cyrillic: Converts Latin text to Cyrillic
NFD; [:Nonspacing Mark:] Remove; NFC: Decomposes characters, removes diacritics, then recomposes
Lower: Converts text to lowercase
Upper: Converts text to uppercase
Hiragana-Katakana: Converts Hiragana to Katakana

You can chain multiple transforms by separating them with semicolons.

Example: Transliterating to Latin

The following example demonstrates transliteration of multiple scripts to Latin characters:

PUT /icu-transform-latin
{
  "settings": {
    "analysis": {
      "filter": {
        "latin_transform": {
          "type": "icu_transform",
          "id": "Any-Latin"
        }
      },
      "analyzer": {
        "latin_analyzer": {
          "tokenizer": "keyword",
          "filter": ["latin_transform"]
        }
      }
    }
  }
}

Test the analyzer with text in different scripts:

POST /icu-transform-latin/_analyze
{
  "analyzer": "latin_analyzer",
  "text": "Москва"
}

The Cyrillic text is transliterated to Latin:

{
  "tokens": [
    {
      "token": "Moskva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    }
  ]
}

Test with Japanese text:

POST /icu-transform-latin/_analyze
{
  "analyzer": "latin_analyzer",
  "text": "東京"
}

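Because `Any-Latin` romanizes Han characters using their Mandarin (pinyin) readings, the kanji are transliterated as Chinese rather than as the Japanese reading: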

{
  "tokens": [
    {
      "token": "dōng jīng",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

Example: Removing accents

The following example removes diacritical marks from text:

PUT /icu-transform-no-accents
{
  "settings": {
    "analysis": {
      "filter": {
        "remove_accents": {
          "type": "icu_transform",
          "id": "NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      },
      "analyzer": {
        "accent_removal_analyzer": {
          "tokenizer": "keyword",
          "filter": ["remove_accents"]
        }
      }
    }
  }
}

Test the analyzer:

POST /icu-transform-no-accents/_analyze
{
  "analyzer": "accent_removal_analyzer",
  "text": "Ênrique Iglesias"
}

The accents are removed:

{
  "tokens": [
    {
      "token": "Enrique Iglesias",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
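
To see what the three steps in this transform ID do to the text, the following Python sketch approximates them with the standard library's `unicodedata` module. It is not ICU and is not what the filter runs internally; it assumes that `[:Nonspacing Mark:]` corresponds to the Unicode general category `Mn`. The function name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text):
    # NFD: split each precomposed character into a base character
    # followed by its combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Remove nonspacing marks (Unicode general category "Mn").
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # NFC: recompose whatever sequences remain.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("Ênrique Iglesias"))  # Enrique Iglesias
```

This kind of local check can help confirm the expected token before creating an index.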

Example: Script-to-script conversion

The following example converts Latin text to Cyrillic:

PUT /icu-transform-cyrillic
{
  "settings": {
    "analysis": {
      "filter": {
        "to_cyrillic": {
          "type": "icu_transform",
          "id": "Latin-Cyrillic"
        }
      },
      "analyzer": {
        "cyrillic_analyzer": {
          "tokenizer": "keyword",
          "filter": ["to_cyrillic"]
        }
      }
    }
  }
}

Test with Latin text:

POST /icu-transform-cyrillic/_analyze
{
  "analyzer": "cyrillic_analyzer",
  "text": "Sankt Peterburg"
}

The text is converted to Cyrillic script:

{
  "tokens": [
    {
      "token": "Санкт Петербург",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    }
  ]
}

Compound transformations

You can chain multiple transformations by separating transform IDs with semicolons. The transformations are applied in order from left to right.

For example, the compound ID "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" performs the following steps:

Transliterates to Latin
Applies canonical decomposition (NFD)
Removes non-spacing marks (accents)
Applies canonical composition (NFC)
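
When a compound ID is assembled programmatically, keeping the steps in a list makes the order explicit. The following Python sketch joins individual transform IDs into the compound ID shown above. The helper name `compound_id` is hypothetical, and the sketch does not check that each step is a valid ICU transform ID.

```python
def compound_id(*steps):
    """Join individual transform IDs into one compound ID.

    Steps are kept in the order given, which is the order in which
    the transformations are applied (left to right).
    """
    cleaned = [step.strip() for step in steps if step.strip()]
    if not cleaned:
        raise ValueError("at least one transform ID is required")
    return "; ".join(cleaned)


latin_no_accents = compound_id(
    "Any-Latin",                   # 1. transliterate to Latin
    "NFD",                         # 2. canonical decomposition
    "[:Nonspacing Mark:] Remove",  # 3. remove non-spacing marks
    "NFC",                         # 4. canonical composition
)
print(latin_no_accents)  # Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC
```

The resulting string can be used as the value of the filter's `id` parameter.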
