You're viewing version 3.5 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.
ICU normalization character filter
The icu_normalizer character filter converts text into a canonical Unicode form by applying one of the normalization modes defined in Unicode Standard Annex #15. This process standardizes character representations before tokenization, ensuring that equivalent characters are treated consistently.
Installation
The icu_normalizer character filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.
Normalization modes
The character filter supports the following Unicode normalization forms:
- nfc (Canonical Decomposition, followed by Canonical Composition): Decomposes combined characters, then recomposes them in a standard order. This is the most common normalization form.
- nfd (Canonical Decomposition): Decomposes combined characters into their constituent parts. For example, é becomes e + combining acute accent.
- nfkc (Compatibility Decomposition, followed by Canonical Composition): Applies compatibility decompositions (converting visually similar characters to a standard form), then canonical composition.
- nfkc_cf (default): Applies NFKC normalization with case folding. This mode normalizes both character representations and case.
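Outside OpenSearch, the practical differences between these forms can be sketched with Python's standard unicodedata module (note: Python exposes nfc, nfd, nfkc, and nfkd directly, but not nfkc_cf; case folding must be applied separately):

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a "compatibility" character
lig = "\ufb01"
print(unicodedata.normalize("NFC", lig))   # canonical forms keep the ligature
print(unicodedata.normalize("NFKC", lig))  # compatibility forms expand it: "fi"

# U+00E9, precomposed "e with acute", vs. its NFD decomposition
e_acute = "\u00e9"
decomposed = unicodedata.normalize("NFD", e_acute)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']
```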
Parameters
The following table lists the parameters for the icu_normalizer character filter.
| Parameter | Data type | Description |
|---|---|---|
| name | String | The Unicode normalization form to apply. Valid values are nfc, nfd, nfkc, and nfkc_cf. Default is nfkc_cf. |
| mode | String | The normalization mode. Valid values are compose (default) and decompose. When decompose is specified, nfc becomes nfd and nfkc becomes nfkd. |
| unicode_set_filter | String | Optional. A UnicodeSet expression that specifies which characters to normalize. If not specified, all characters are normalized. |
Example: Default normalization
The following example demonstrates using the default nfkc_cf normalization:
PUT /icu-norm-default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_icu_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["icu_normalizer"]
        }
      }
    }
  }
}
Test the normalizer with text containing a ligature and case variations (the first word begins with the ﬁ ligature, U+FB01):

POST /icu-norm-default/_analyze
{
  "analyzer": "default_icu_normalizer",
  "text": "ﬁnancial AFFAIRS"
}
The response shows normalization and case folding:
{
  "tokens": [
    {
      "token": "financial affairs",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
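The behavior above can be approximated outside OpenSearch. As a rough sketch (assuming the usual definition of nfkc_cf as NFKC normalization combined with Unicode case folding), Python's unicodedata plus str.casefold comes close:

```python
import unicodedata

def nfkc_cf(text: str) -> str:
    """Approximate ICU's nfkc_cf: case folding plus NFKC normalization.

    ICU interleaves the two steps internally; folding first and then
    normalizing matches it for typical text.
    """
    return unicodedata.normalize("NFKC", text.casefold())

# U+FB01 is the fi ligature, as in the analyze request above
print(nfkc_cf("\ufb01nancial AFFAIRS"))  # financial affairs
```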
Example: NFD (decomposed) normalization
The following example configures NFD normalization by setting mode to decompose:
PUT /icu-norm-nfd
{
  "settings": {
    "analysis": {
      "char_filter": {
        "nfd_normalizer": {
          "type": "icu_normalizer",
          "name": "nfc",
          "mode": "decompose"
        }
      },
      "analyzer": {
        "nfd_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["nfd_normalizer"]
        }
      }
    }
  }
}
Test with accented characters:
POST /icu-norm-nfd/_analyze
{
  "analyzer": "nfd_analyzer",
  "text": "café"
}
The NFD normalization decomposes the accented character:
{
  "tokens": [
    {
      "token": "café",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
Note: While the visual representation appears the same, the underlying character encoding has changed from a single precomposed character to separate base and combining characters.
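This difference is easy to verify in Python, where the precomposed and decomposed strings compare unequal even though they render identically (a sketch using the standard unicodedata module):

```python
import unicodedata

precomposed = "caf\u00e9"                               # é as one code point, U+00E9
decomposed = unicodedata.normalize("NFD", precomposed)  # é as e (U+0065) + U+0301

print(precomposed == decomposed)          # False: different code point sequences
print(len(precomposed), len(decomposed))  # 4 5
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: NFC round-trips
```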
Example: Selective normalization with unicode_set_filter
You can limit normalization to specific character ranges using the unicode_set_filter parameter:
PUT /icu-norm-selective
{
  "settings": {
    "analysis": {
      "char_filter": {
        "latin_only_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc_cf",
          "unicode_set_filter": "[\\u0000-\\u024F]"
        }
      },
      "analyzer": {
        "selective_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["latin_only_normalizer"]
        }
      }
    }
  }
}
This configuration normalizes only Latin characters (Unicode range U+0000 to U+024F), leaving other scripts unchanged.
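The effect of the filter can be sketched in Python. This is a hypothetical per-code-point re-implementation, not ICU's code path: ICU's UnicodeSet syntax is far more expressive, and real normalization operates on full combining sequences rather than isolated characters, but the pass-through behavior for out-of-range scripts is the same:

```python
import unicodedata

def selective_nfkc_cf(text: str, lo: int = 0x0000, hi: int = 0x024F) -> str:
    # Normalize only code points inside [lo, hi]; pass all others through.
    out = []
    for ch in text:
        if lo <= ord(ch) <= hi:
            out.append(unicodedata.normalize("NFKC", ch.casefold()))
        else:
            out.append(ch)
    return "".join(out)

# Latin É (U+00C9) is in range and gets folded; Greek Ω (U+03A9) is not.
print(selective_nfkc_cf("CAF\u00c9 \u03a9"))  # café Ω
```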