
ICU normalization character filter

The icu_normalizer character filter converts text into a canonical Unicode form by applying one of the normalization modes defined in Unicode Standard Annex #15. This process standardizes character representations before tokenization, ensuring that equivalent characters are treated consistently.

Installation

The icu_normalizer character filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

Normalization modes

The character filter supports the following Unicode normalization forms:

  • nfc (Canonical Decomposition, followed by Canonical Composition): Decomposes combined characters, then recomposes them in a standard order. This is the most common normalization form.
  • nfd (Canonical Decomposition): Decomposes combined characters into their constituent parts. For example, é becomes e + combining acute accent.
  • nfkc (Compatibility Decomposition, followed by Canonical Composition): Applies compatibility decompositions (converting visually similar characters to a standard form), then canonical composition.
  • nfkc_cf (Default): Applies NFKC normalization with case folding. This mode normalizes both character representations and case.
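The differences between these forms can be illustrated with Python's standard `unicodedata` module (a rough analogy, not the ICU implementation the filter uses; `nfkc_cf` is approximated here as NFKC followed by case folding):

```python
import unicodedata

s = "\ufb01 \u00c9"  # the "fi" ligature (U+FB01) plus precomposed "É" (U+00C9)

# NFC: canonical composition; the ligature has no canonical decomposition,
# so it is left alone and "É" stays precomposed.
print(unicodedata.normalize("NFC", s))

# NFD: canonical decomposition; "É" splits into "E" + combining acute accent,
# so the string grows by one code point.
print(len(s), len(unicodedata.normalize("NFD", s)))  # 3 4

# NFKC: compatibility decomposition + composition; the ligature becomes "fi".
print(unicodedata.normalize("NFKC", s))  # fi É

# Approximation of nfkc_cf: NFKC followed by case folding.
print(unicodedata.normalize("NFKC", s).casefold())  # fi é
```

Note that true `nfkc_cf` (Unicode NFKC_Casefold) also removes default-ignorable code points, so this two-step approximation is close but not exact.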

Parameters

The following table lists the parameters for the icu_normalizer character filter.

| Parameter | Data type | Description |
| --- | --- | --- |
| `name` | String | The Unicode normalization form to apply. Valid values are `nfc`, `nfkc`, and `nfkc_cf`. Default is `nfkc_cf`. |
| `mode` | String | The normalization mode. Valid values are `compose` (default) and `decompose`. When `decompose` is specified, `nfc` becomes `nfd` and `nfkc` becomes `nfkd`. |
| `unicode_set_filter` | String | Optional. A UnicodeSet expression that specifies which characters to normalize. If not specified, all characters are normalized. |

Example: Default normalization

The following example demonstrates using the default nfkc_cf normalization:

PUT /icu-norm-default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_icu_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["icu_normalizer"]
        }
      }
    }
  }
}

Test the normalizer with text containing ligatures and case variations:

POST /icu-norm-default/_analyze
{
  "analyzer": "default_icu_normalizer",
  "text": "financial AFFAIRS"
}

The response shows normalization and case folding:

{
  "tokens": [
    {
      "token": "financial affairs",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
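The same transformation can be reproduced outside OpenSearch with Python's `unicodedata` module (an approximation: `nfkc_cf` is modeled here as NFKC plus `str.casefold`, which closely matches this example):

```python
import unicodedata

text = "\ufb01nancial AFFAIRS"  # begins with the "fi" ligature (U+FB01): 16 code points

# NFKC expands the compatibility ligature to "fi"; casefold lowercases.
normalized = unicodedata.normalize("NFKC", text).casefold()
print(normalized)                   # financial affairs
print(len(text), len(normalized))   # 16 17
```

This is why the response's `end_offset` is 16: offsets refer to the original input, which is one code point shorter than the normalized token.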

Example: NFD (decomposed) normalization

The following example configures NFD normalization by setting mode to decompose:

PUT /icu-norm-nfd
{
  "settings": {
    "analysis": {
      "char_filter": {
        "nfd_normalizer": {
          "type": "icu_normalizer",
          "name": "nfc",
          "mode": "decompose"
        }
      },
      "analyzer": {
        "nfd_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["nfd_normalizer"]
        }
      }
    }
  }
}

Test with accented characters:

POST /icu-norm-nfd/_analyze
{
  "analyzer": "nfd_analyzer",
  "text": "café"
}

The NFD normalization decomposes the accented character:

{
  "tokens": [
    {
      "token": "café",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

Note: While the visual representation appears the same, the underlying character encoding has changed from a single precomposed character to separate base and combining characters.
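The change becomes visible when you inspect the code points. This Python sketch (using the standard `unicodedata` module, not the ICU filter itself) compares the precomposed and decomposed forms:

```python
import unicodedata

composed = "caf\u00e9"                 # "café" with precomposed é (U+00E9)
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 4 5
print([hex(ord(c)) for c in decomposed])
# ['0x63', '0x61', '0x66', '0x65', '0x301'] -- "e" + combining acute accent

print(composed == decomposed)          # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: canonically equivalent
```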

Example: Selective normalization with unicode_set_filter

You can limit normalization to specific character ranges using the unicode_set_filter parameter:

PUT /icu-norm-selective
{
  "settings": {
    "analysis": {
      "char_filter": {
        "latin_only_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc_cf",
          "unicode_set_filter": "[\\u0000-\\u024F]"
        }
      },
      "analyzer": {
        "selective_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["latin_only_normalizer"]
        }
      }
    }
  }
}

This configuration normalizes only Latin characters (Unicode range U+0000 to U+024F), leaving other scripts unchanged.
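The behavior can be approximated in Python. The sketch below is a hand-rolled analogy, not ICU's UnicodeSet machinery: it normalizes only runs of code points in U+0000–U+024F and passes everything else through unchanged (`latin_nfkc_cf` is a hypothetical helper name, and `nfkc_cf` is again approximated as NFKC plus case folding):

```python
import unicodedata

def in_set(ch: str) -> bool:
    """Membership test mirroring the [\\u0000-\\u024F] UnicodeSet."""
    return ord(ch) <= 0x024F

def latin_nfkc_cf(text: str) -> str:
    """Normalize only runs of characters inside the set (rough analogy;
    ICU's filtered normalizer handles run boundaries more carefully)."""
    out, run, inside = [], [], None
    for ch in text:
        member = in_set(ch)
        if member != inside and run:
            chunk = "".join(run)
            out.append(unicodedata.normalize("NFKC", chunk).casefold() if inside else chunk)
            run = []
        run.append(ch)
        inside = member
    if run:
        chunk = "".join(run)
        out.append(unicodedata.normalize("NFKC", chunk).casefold() if inside else chunk)
    return "".join(out)

# Latin "É" (U+00C9) is inside the set and gets folded;
# fullwidth "Ａ" (U+FF21) is outside and is left alone.
print(latin_nfkc_cf("\u00c9 \uff21"))  # é Ａ
```

One consequence worth noting: the `ﬁ` ligature (U+FB01) lies outside U+0000–U+024F, so this selective configuration would leave it unexpanded even though plain `nfkc_cf` would convert it to `fi`.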