
ICU normalization character filter

The icu_normalizer character filter converts text into a canonical Unicode form by applying one of the normalization modes defined in Unicode Standard Annex #15. This process standardizes character representations before tokenization, ensuring that equivalent characters are treated consistently.

Installation

The icu_normalizer character filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

Normalization modes

The character filter supports the following Unicode normalization forms:

  • nfc (Canonical Decomposition, followed by Canonical Composition): Decomposes combined characters, then recomposes them in a standard order. This is the most common normalization form.
  • nfd (Canonical Decomposition): Decomposes combined characters into their constituent parts. For example, é becomes e + combining acute accent.
  • nfkc (Compatibility Decomposition, followed by Canonical Composition): Applies compatibility decompositions (converting visually similar characters to a standard form), then canonical composition.
  • nfkc_cf (Default): Applies NFKC normalization with case folding. This mode normalizes both character representations and case.
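The differences between these forms can be illustrated with Python's standard `unicodedata` module (a rough analogy, not the ICU implementation the filter uses; `nfkc_cf` is approximated here as NFKC followed by case folding):

```python
import unicodedata

s = "\ufb01 \u00c9"  # the "fi" ligature (U+FB01) plus precomposed "É" (U+00C9)

# NFC: canonical composition; the ligature has no canonical decomposition,
# so it is left alone and "É" stays precomposed.
print(unicodedata.normalize("NFC", s))

# NFD: canonical decomposition; "É" splits into "E" + combining acute accent,
# so the string grows by one code point.
print(len(s), len(unicodedata.normalize("NFD", s)))  # 3 4

# NFKC: compatibility decomposition + composition; the ligature becomes "fi".
print(unicodedata.normalize("NFKC", s))  # fi É

# Approximation of nfkc_cf: NFKC followed by case folding.
print(unicodedata.normalize("NFKC", s).casefold())  # fi é
```

Note that true `nfkc_cf` (Unicode NFKC_Casefold) also removes default-ignorable code points, so this two-step approximation is close but not exact.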

Parameters

The following table lists the parameters for the icu_normalizer character filter.

| Parameter | Data type | Description |
| --- | --- | --- |
| `name` | String | The Unicode normalization form to apply. Valid values are `nfc`, `nfkc`, and `nfkc_cf`. Default is `nfkc_cf`. |
| `mode` | String | The normalization mode. Valid values are `compose` (default) and `decompose`. When `decompose` is specified, `nfc` becomes `nfd` and `nfkc` becomes `nfkd`. |
| `unicode_set_filter` | String | Optional. A UnicodeSet expression that specifies which characters to normalize. If not specified, all characters are normalized. |

Example: Default normalization

The following example demonstrates using the default nfkc_cf normalization:

PUT /icu-norm-default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_icu_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["icu_normalizer"]
        }
      }
    }
  }
}

Test the normalizer with text containing ligatures and case variations:

POST /icu-norm-default/_analyze
{
  "analyzer": "default_icu_normalizer",
  "text": "financial AFFAIRS"
}

The response shows normalization and case folding:

{
  "tokens": [
    {
      "token": "financial affairs",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
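The same transformation can be reproduced outside OpenSearch with Python's `unicodedata` module (an approximation: `nfkc_cf` is modeled here as NFKC plus `str.casefold`, which closely matches this example):

```python
import unicodedata

text = "\ufb01nancial AFFAIRS"  # begins with the "fi" ligature (U+FB01): 16 code points

# NFKC expands the compatibility ligature to "fi"; casefold lowercases.
normalized = unicodedata.normalize("NFKC", text).casefold()
print(normalized)                   # financial affairs
print(len(text), len(normalized))   # 16 17
```

This is why the response's `end_offset` is 16: offsets refer to the original input, which is one code point shorter than the normalized token.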

Example: NFD (decomposed) normalization

The following example configures NFD normalization by setting mode to decompose:

PUT /icu-norm-nfd
{
  "settings": {
    "analysis": {
      "char_filter": {
        "nfd_normalizer": {
          "type": "icu_normalizer",
          "name": "nfc",
          "mode": "decompose"
        }
      },
      "analyzer": {
        "nfd_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["nfd_normalizer"]
        }
      }
    }
  }
}

Test with accented characters:

POST /icu-norm-nfd/_analyze
{
  "analyzer": "nfd_analyzer",
  "text": "café"
}

The NFD normalization decomposes the accented character:

{
  "tokens": [
    {
      "token": "café",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

Note: While the visual representation appears the same, the underlying character encoding has changed from a single precomposed character to separate base and combining characters.
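The change becomes visible when you inspect the code points. This Python sketch (using the standard `unicodedata` module, not the ICU filter itself) compares the precomposed and decomposed forms:

```python
import unicodedata

composed = "caf\u00e9"                 # "café" with precomposed é (U+00E9)
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 4 5
print([hex(ord(c)) for c in decomposed])
# ['0x63', '0x61', '0x66', '0x65', '0x301'] -- "e" + combining acute accent

print(composed == decomposed)          # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: canonically equivalent
```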

Example: Selective normalization with unicode_set_filter

You can limit normalization to specific character ranges using the unicode_set_filter parameter:

PUT /icu-norm-selective
{
  "settings": {
    "analysis": {
      "char_filter": {
        "latin_only_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc_cf",
          "unicode_set_filter": "[\\u0000-\\u024F]"
        }
      },
      "analyzer": {
        "selective_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["latin_only_normalizer"]
        }
      }
    }
  }
}

This configuration normalizes only Latin characters (Unicode range U+0000 to U+024F), leaving other scripts unchanged.
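The behavior can be approximated in Python. The sketch below is a hand-rolled analogy, not ICU's UnicodeSet machinery: it normalizes only runs of code points in U+0000–U+024F and passes everything else through unchanged (`latin_nfkc_cf` is a hypothetical helper name, and `nfkc_cf` is again approximated as NFKC plus case folding):

```python
import unicodedata

def in_set(ch: str) -> bool:
    """Membership test mirroring the [\\u0000-\\u024F] UnicodeSet."""
    return ord(ch) <= 0x024F

def latin_nfkc_cf(text: str) -> str:
    """Normalize only runs of characters inside the set (rough analogy;
    ICU's filtered normalizer handles run boundaries more carefully)."""
    out, run, inside = [], [], None
    for ch in text:
        member = in_set(ch)
        if member != inside and run:
            chunk = "".join(run)
            out.append(unicodedata.normalize("NFKC", chunk).casefold() if inside else chunk)
            run = []
        run.append(ch)
        inside = member
    if run:
        chunk = "".join(run)
        out.append(unicodedata.normalize("NFKC", chunk).casefold() if inside else chunk)
    return "".join(out)

# Latin "É" (U+00C9) is inside the set and gets folded;
# fullwidth "Ａ" (U+FF21) is outside and is left alone.
print(latin_nfkc_cf("\u00c9 \uff21"))  # é Ａ
```

One consequence worth noting: the `ﬁ` ligature (U+FB01) lies outside U+0000–U+024F, so this selective configuration would leave it unexpanded even though plain `nfkc_cf` would convert it to `fi`.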