
You're viewing version 3.5 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.

ICU normalization character filter

The icu_normalizer character filter converts text into a canonical Unicode form by applying one of the normalization modes defined in Unicode Standard Annex #15. This process standardizes character representations before tokenization, ensuring that equivalent characters are treated consistently.

Installation

The icu_normalizer character filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

Normalization modes

The character filter supports the following Unicode normalization forms:

  • nfc (Canonical Decomposition, followed by Canonical Composition): Decomposes combined characters, then recomposes them in a standard order. This is the most common normalization form.
  • nfd (Canonical Decomposition): Decomposes combined characters into their constituent parts. For example, é becomes e + combining acute accent.
  • nfkc (Compatibility Decomposition, followed by Canonical Composition): Applies compatibility decompositions (converting visually similar or compatibility characters, such as the ligature ﬁ, to a standard form), then canonical composition.
  • nfkc_cf (Default): Applies NFKC normalization with case folding. This mode normalizes both character representations and case.
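The practical differences between these forms can be illustrated with Python's standard unicodedata module (shown here only for illustration; OpenSearch performs the normalization through ICU):

```python
import unicodedata

# NFD: decompose "é" (U+00E9) into "e" plus a combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", "é")
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']

# NFC: recompose the sequence back into the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == "é")  # True

# NFKC: additionally applies compatibility mappings, e.g. the "ﬁ" ligature
# (U+FB01) becomes the two ordinary letters "fi".
print(unicodedata.normalize("NFKC", "ﬁnance"))  # finance
```

Note that unicodedata has no direct equivalent of nfkc_cf; Python's str.casefold() approximates the additional case-folding step.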

Parameters

The following table lists the parameters for the icu_normalizer character filter.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `name` | String | The Unicode normalization form to apply. Valid values are `nfc`, `nfkc`, and `nfkc_cf`. Default is `nfkc_cf`. |
| `mode` | String | The normalization mode. Valid values are `compose` (default) and `decompose`. When `decompose` is specified, `nfc` becomes `nfd` and `nfkc` becomes `nfkd`. |
| `unicode_set_filter` | String | Optional. A UnicodeSet expression specifying which characters to normalize. If not specified, all characters are normalized. |

Example: Default normalization

The following example demonstrates using the default nfkc_cf normalization:

PUT /icu-norm-default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_icu_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["icu_normalizer"]
        }
      }
    }
  }
}

Test the normalizer with text containing ligatures and case variations:

POST /icu-norm-default/_analyze
{
  "analyzer": "default_icu_normalizer",
  "text": "financial AFFAIRS"
}

The response shows normalization and case folding:

{
  "tokens": [
    {
      "token": "financial affairs",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
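The ﬁ ligature (U+FB01) was expanded to the letters "fi" and the uppercase text was folded to lowercase. The same transformation can be approximated outside OpenSearch with Python's unicodedata module and str.casefold() (an approximation of Unicode NFKC_Casefold, shown only to clarify what the filter does):

```python
import unicodedata

text = "ﬁnancial AFFAIRS"  # starts with the ligature U+FB01

# NFKC maps the compatibility character ﬁ to the letters "fi";
# casefold() then lowercases, approximating the nfkc_cf mode.
folded = unicodedata.normalize("NFKC", text).casefold()
print(folded)  # financial affairs
```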

Example: NFD (decomposed) normalization

The following example configures NFD normalization by setting mode to decompose:

PUT /icu-norm-nfd
{
  "settings": {
    "analysis": {
      "char_filter": {
        "nfd_normalizer": {
          "type": "icu_normalizer",
          "name": "nfc",
          "mode": "decompose"
        }
      },
      "analyzer": {
        "nfd_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["nfd_normalizer"]
        }
      }
    }
  }
}

Test with accented characters:

POST /icu-norm-nfd/_analyze
{
  "analyzer": "nfd_analyzer",
  "text": "café"
}

The NFD normalization decomposes the accented character:

{
  "tokens": [
    {
      "token": "café",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

Note: While the visual representation appears the same, the underlying character encoding has changed from a single precomposed character to separate base and combining characters.
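The change is easy to verify in code by comparing code point counts before and after decomposition; a quick check with Python's unicodedata module (illustrative only):

```python
import unicodedata

composed = "caf\u00e9"  # "café" with the precomposed é (4 code points)
decomposed = unicodedata.normalize("NFD", composed)

# NFD replaces é with "e" plus a combining acute accent, adding a code point.
print(len(composed), len(decomposed))  # 4 5
print("\u0301" in decomposed)          # True: combining acute accent present
```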

Example: Selective normalization with unicode_set_filter

You can limit normalization to specific character ranges using the unicode_set_filter parameter:

PUT /icu-norm-selective
{
  "settings": {
    "analysis": {
      "char_filter": {
        "latin_only_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc_cf",
          "unicode_set_filter": "[\\u0000-\\u024F]"
        }
      },
      "analyzer": {
        "selective_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["latin_only_normalizer"]
        }
      }
    }
  }
}

This configuration normalizes only Latin characters (Unicode range U+0000 to U+024F), leaving other scripts unchanged.
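The effect can be sketched in Python as a rough per-character approximation (ICU's actual UnicodeSet filtering is applied inside the normalizer and handles combining sequences correctly, which this sketch does not; the range bound and function name are illustrative):

```python
import unicodedata

LATIN_MAX = 0x024F  # upper bound of the Latin Extended-B block

def selective_nfkc_cf(text: str) -> str:
    # Normalize and case-fold only characters inside the Latin range;
    # characters from other scripts pass through unchanged.
    return "".join(
        unicodedata.normalize("NFKC", ch).casefold() if ord(ch) <= LATIN_MAX else ch
        for ch in text
    )

# Latin É is folded to é; Greek Ω and fullwidth Ａ lie outside the
# range and are left untouched.
print(selective_nfkc_cf("É Ω Ａ"))
```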