ICU transform token filter

The icu_transform token filter applies ICU text transformations to tokens, enabling operations such as transliteration, case mapping, normalization, and bidirectional text handling. This filter uses transformation rules defined by the ICU Transform framework.

Common use cases include:

  • Transliteration: Converting text from one script to another (for example, Cyrillic to Latin)
  • Script conversion: Transforming between different writing systems
  • Accent removal: Decomposing characters and stripping their diacritics
  • Custom transformations: Applying user-defined transformation rules

Installation

The icu_transform token filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

Parameters

The following table lists the parameters for the icu_transform token filter.

Parameter Data type Description
id String The ICU transform ID specifying which transformation to apply. Can be a single transform ID or a compound ID with multiple transforms separated by semicolons. Default is Null (no transformation).
dir String The direction in which the transform is applied. Valid values are forward, which applies the transform as named, and reverse, which applies its inverse (for example, Latin-Cyrillic in reverse converts Cyrillic to Latin). Default is forward.
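
As a minimal sketch of how the two parameters combine, the following index definition sets both on one filter. The index name icu-transform-reverse and the filter name from_cyrillic are hypothetical, and the sketch assumes that reverse selects the inverse of the named transform (so Latin-Cyrillic in reverse would behave like Cyrillic-Latin); confirm the behavior against your plugin version:

```json
PUT /icu-transform-reverse
{
  "settings": {
    "analysis": {
      "filter": {
        "from_cyrillic": {
          "type": "icu_transform",
          "id": "Latin-Cyrillic",
          "dir": "reverse"
        }
      }
    }
  }
}
```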

Transform IDs

You can specify transformations using standard ICU transform IDs. Common transforms include:

  • Any-Latin: Transliterates text from any script to Latin characters
  • Latin-Cyrillic: Converts Latin text to Cyrillic
  • NFD; [:Nonspacing Mark:] Remove; NFC: Decomposes characters, removes diacritics, then recomposes
  • Lower: Converts text to lowercase
  • Upper: Converts text to uppercase
  • Hiragana-Katakana: Converts Hiragana to Katakana

You can chain multiple transforms by separating them with semicolons.
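
To try a transform ID without creating an index, you could define the filter inline in an _analyze request. This is a sketch that assumes the analysis-icu plugin is installed; the sample text is illustrative and the response is omitted:

```json
POST /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "icu_transform",
      "id": "Hiragana-Katakana"
    }
  ],
  "text": "ひらがな"
}
```

Swapping the id value for another transform ID, or for a semicolon-separated chain, exercises that transform in the same way.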

Example: Transliterating to Latin

The following example demonstrates transliteration of multiple scripts to Latin characters:

PUT /icu-transform-latin
{
  "settings": {
    "analysis": {
      "filter": {
        "latin_transform": {
          "type": "icu_transform",
          "id": "Any-Latin"
        }
      },
      "analyzer": {
        "latin_analyzer": {
          "tokenizer": "keyword",
          "filter": ["latin_transform"]
        }
      }
    }
  }
}

Test the analyzer with text in different scripts:

POST /icu-transform-latin/_analyze
{
  "analyzer": "latin_analyzer",
  "text": "Москва"
}

The Cyrillic text is transliterated to Latin:

{
  "tokens": [
    {
      "token": "Moskva",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    }
  ]
}

Test with Japanese text:

POST /icu-transform-latin/_analyze
{
  "analyzer": "latin_analyzer",
  "text": "東京"
}

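Because Any-Latin romanizes Han characters using their Mandarin Pinyin readings, the kanji are transliterated as dōng jīng rather than as the Japanese reading Tōkyō: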

{
  "tokens": [
    {
      "token": "dōng jīng",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

Example: Removing accents

The following example removes diacritical marks from text:

PUT /icu-transform-no-accents
{
  "settings": {
    "analysis": {
      "filter": {
        "remove_accents": {
          "type": "icu_transform",
          "id": "NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      },
      "analyzer": {
        "accent_removal_analyzer": {
          "tokenizer": "keyword",
          "filter": ["remove_accents"]
        }
      }
    }
  }
}

Test the analyzer:

POST /icu-transform-no-accents/_analyze
{
  "analyzer": "accent_removal_analyzer",
  "text": "Ênrique Iglesias"
}

The accents are removed:

{
  "tokens": [
    {
      "token": "Enrique Iglesias",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}

Example: Script-to-script conversion

The following example converts Latin text to Cyrillic:

PUT /icu-transform-cyrillic
{
  "settings": {
    "analysis": {
      "filter": {
        "to_cyrillic": {
          "type": "icu_transform",
          "id": "Latin-Cyrillic"
        }
      },
      "analyzer": {
        "cyrillic_analyzer": {
          "tokenizer": "keyword",
          "filter": ["to_cyrillic"]
        }
      }
    }
  }
}

Test with Latin text:

POST /icu-transform-cyrillic/_analyze
{
  "analyzer": "cyrillic_analyzer",
  "text": "Sankt Peterburg"
}

The text is converted to Cyrillic script:

{
  "tokens": [
    {
      "token": "Санкт Петербург",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    }
  ]
}

Compound transformations

You can chain multiple transformations by separating transform IDs with semicolons. The transformations are applied in order from left to right.

For example, the compound ID "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" performs the following steps:

  1. Transliterates to Latin
  2. Applies canonical decomposition (NFD)
  3. Removes non-spacing marks (accents)
  4. Applies canonical composition (NFC)
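
A sketch of how this compound ID might be configured, following the pattern of the earlier examples. The index name icu-transform-compound and the filter and analyzer names are hypothetical:

```json
PUT /icu-transform-compound
{
  "settings": {
    "analysis": {
      "filter": {
        "latin_no_accents": {
          "type": "icu_transform",
          "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      },
      "analyzer": {
        "latin_no_accents_analyzer": {
          "tokenizer": "keyword",
          "filter": ["latin_no_accents"]
        }
      }
    }
  }
}
```

With an analyzer like this, text that Any-Latin alone renders with diacritics, such as dōng jīng for 東京, would be expected to come out with the marks removed.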