
You're viewing version 3.5 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.

ICU normalization character filter

The icu_normalizer character filter converts text into a canonical Unicode form by applying one of the normalization modes defined in Unicode Standard Annex #15. This process standardizes character representations before tokenization, ensuring that equivalent characters are treated consistently.

Installation

The icu_normalizer character filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

Normalization modes

The character filter supports the following Unicode normalization forms:

  • nfc (Canonical Decomposition, followed by Canonical Composition): Decomposes combined characters, then recomposes them in a standard order. This is the most common normalization form.
  • nfd (Canonical Decomposition): Decomposes combined characters into their constituent parts. For example, é becomes e + combining acute accent.
  • nfkc (Compatibility Decomposition, followed by Canonical Composition): Applies compatibility decompositions (converting visually similar or compatibility characters, such as the ligature ﬁ, to a standard form), then canonical composition.
  • nfkc_cf (Default): Applies NFKC normalization with case folding. This mode normalizes both character representations and case.
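The practical differences between these forms can be illustrated with Python's standard unicodedata module (shown here only for illustration; OpenSearch performs the normalization through ICU):

```python
import unicodedata

# NFD: decompose "é" (U+00E9) into "e" plus a combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", "é")
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']

# NFC: recompose the sequence back into the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == "é")  # True

# NFKC: additionally applies compatibility mappings, e.g. the "ﬁ" ligature
# (U+FB01) becomes the two ordinary letters "fi".
print(unicodedata.normalize("NFKC", "ﬁnance"))  # finance
```

Note that unicodedata has no direct equivalent of nfkc_cf; Python's str.casefold() approximates the additional case-folding step.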

Parameters

The following table lists the parameters for the icu_normalizer character filter.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `name` | String | The Unicode normalization form to apply. Valid values are `nfc`, `nfkc`, and `nfkc_cf`. Default is `nfkc_cf`. |
| `mode` | String | The normalization mode. Valid values are `compose` (default) and `decompose`. When `decompose` is specified, `nfc` becomes `nfd` and `nfkc` becomes `nfkd`. |
| `unicode_set_filter` | String | Optional. A UnicodeSet expression specifying which characters to normalize. If not specified, all characters are normalized. |

Example: Default normalization

The following example demonstrates using the default nfkc_cf normalization:

PUT /icu-norm-default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_icu_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["icu_normalizer"]
        }
      }
    }
  }
}

Test the normalizer with text containing ligatures and case variations:

POST /icu-norm-default/_analyze
{
  "analyzer": "default_icu_normalizer",
  "text": "financial AFFAIRS"
}

The response shows normalization and case folding:

{
  "tokens": [
    {
      "token": "financial affairs",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
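The ﬁ ligature (U+FB01) was expanded to the letters "fi" and the uppercase text was folded to lowercase. The same transformation can be approximated outside OpenSearch with Python's unicodedata module and str.casefold() (an approximation of Unicode NFKC_Casefold, shown only to clarify what the filter does):

```python
import unicodedata

text = "ﬁnancial AFFAIRS"  # starts with the ligature U+FB01

# NFKC maps the compatibility character ﬁ to the letters "fi";
# casefold() then lowercases, approximating the nfkc_cf mode.
folded = unicodedata.normalize("NFKC", text).casefold()
print(folded)  # financial affairs
```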

Example: NFD (decomposed) normalization

The following example configures NFD normalization by setting mode to decompose:

PUT /icu-norm-nfd
{
  "settings": {
    "analysis": {
      "char_filter": {
        "nfd_normalizer": {
          "type": "icu_normalizer",
          "name": "nfc",
          "mode": "decompose"
        }
      },
      "analyzer": {
        "nfd_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["nfd_normalizer"]
        }
      }
    }
  }
}

Test with accented characters:

POST /icu-norm-nfd/_analyze
{
  "analyzer": "nfd_analyzer",
  "text": "café"
}

The NFD normalization decomposes the accented character:

{
  "tokens": [
    {
      "token": "café",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

Note: While the visual representation appears the same, the underlying character encoding has changed from a single precomposed character to separate base and combining characters.
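The change is easy to verify in code by comparing code point counts before and after decomposition; a quick check with Python's unicodedata module (illustrative only):

```python
import unicodedata

composed = "caf\u00e9"  # "café" with the precomposed é (4 code points)
decomposed = unicodedata.normalize("NFD", composed)

# NFD replaces é with "e" plus a combining acute accent, adding a code point.
print(len(composed), len(decomposed))  # 4 5
print("\u0301" in decomposed)          # True: combining acute accent present
```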

Example: Selective normalization with unicode_set_filter

You can limit normalization to specific character ranges using the unicode_set_filter parameter:

PUT /icu-norm-selective
{
  "settings": {
    "analysis": {
      "char_filter": {
        "latin_only_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc_cf",
          "unicode_set_filter": "[\\u0000-\\u024F]"
        }
      },
      "analyzer": {
        "selective_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["latin_only_normalizer"]
        }
      }
    }
  }
}

This configuration normalizes only Latin characters (Unicode range U+0000 to U+024F), leaving other scripts unchanged.
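The effect can be sketched in Python as a rough per-character approximation (ICU's actual UnicodeSet filtering is applied inside the normalizer and handles combining sequences correctly, which this sketch does not; the range bound and function name are illustrative):

```python
import unicodedata

LATIN_MAX = 0x024F  # upper bound of the Latin Extended-B block

def selective_nfkc_cf(text: str) -> str:
    # Normalize and case-fold only characters inside the Latin range;
    # characters from other scripts pass through unchanged.
    return "".join(
        unicodedata.normalize("NFKC", ch).casefold() if ord(ch) <= LATIN_MAX else ch
        for ch in text
    )

# Latin É is folded to é; Greek Ω and fullwidth Ａ lie outside the
# range and are left untouched.
print(selective_nfkc_cf("É Ω Ａ"))
```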