ICU normalization character filter
The icu_normalizer character filter converts text into a canonical Unicode form by applying one of the normalization modes defined in Unicode Standard Annex #15. This process standardizes character representations before tokenization, ensuring that equivalent characters are treated consistently.
Installation
The icu_normalizer character filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.
Normalization modes
The character filter supports the following Unicode normalization forms:
- `nfc` (Canonical Decomposition, followed by Canonical Composition): Decomposes combined characters, then recomposes them in a standard order. This is the most common normalization form.
- `nfd` (Canonical Decomposition): Decomposes combined characters into their constituent parts. For example, `é` becomes `e` + combining acute accent.
- `nfkc` (Compatibility Decomposition, followed by Canonical Composition): Applies compatibility decompositions (converting visually similar characters to a standard form), then canonical composition.
- `nfkc_cf` (default): Applies NFKC normalization with case folding. This mode normalizes both character representations and case.
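The first three forms can be explored with Python's standard `unicodedata` module. This is an illustration of the Unicode forms themselves, not of the plugin (which uses ICU internally):

```python
import unicodedata

s = "é"  # single precomposed code point, U+00E9

nfd = unicodedata.normalize("NFD", s)    # decomposed: "e" + U+0301 (combining acute)
nfc = unicodedata.normalize("NFC", nfd)  # recomposed back to U+00E9

print(len(s), len(nfd), len(nfc))  # 1 2 1

# Compatibility decomposition also folds visually similar characters,
# e.g. the "ﬁ" ligature (U+FB01) becomes the two letters "fi":
print(unicodedata.normalize("NFKC", "ﬁ"))  # fi
```

`unicodedata` has no built-in `nfkc_cf`; combining `normalize("NFKC", ...)` with `str.casefold()` is a close approximation.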
Parameters
The following table lists the parameters for the icu_normalizer character filter.
| Parameter | Data type | Description |
|---|---|---|
| `name` | String | The Unicode normalization form to apply. Valid values are `nfc`, `nfd`, `nfkc`, and `nfkc_cf`. Default is `nfkc_cf`. |
| `mode` | String | The normalization mode. Valid values are `compose` (default) and `decompose`. When `decompose` is specified, `nfc` becomes NFD and `nfkc` becomes NFKD. |
| `unicode_set_filter` | String | Optional. A UnicodeSet expression that specifies which characters to normalize. If not specified, all characters are normalized. |
Example: Default normalization
The following example demonstrates using the default nfkc_cf normalization:
```json
PUT /icu-norm-default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_icu_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["icu_normalizer"]
        }
      }
    }
  }
}
```
Test the normalizer with text containing ligatures and case variations:
```json
POST /icu-norm-default/_analyze
{
  "analyzer": "default_icu_normalizer",
  "text": "ﬁnancial AFFAIRS"
}
```
The response shows normalization and case folding:
```json
{
  "tokens": [
    {
      "token": "financial affairs",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}
```
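For illustration, the same transformation can be approximated outside the cluster with Python's standard `unicodedata` module, treating `nfkc_cf` as NFKC normalization followed by case folding (an approximation, not the plugin's ICU implementation):

```python
import unicodedata

text = "ﬁnancial AFFAIRS"  # contains the "ﬁ" ligature (U+FB01)

# nfkc_cf ≈ NFKC normalization, then Unicode case folding
normalized = unicodedata.normalize("NFKC", text).casefold()
print(normalized)  # financial affairs
```

The ligature expands to `fi` and the uppercase letters are folded, matching the token in the response above.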
Example: NFD (decomposed) normalization
The following example configures NFD normalization by setting mode to decompose:
```json
PUT /icu-norm-nfd
{
  "settings": {
    "analysis": {
      "char_filter": {
        "nfd_normalizer": {
          "type": "icu_normalizer",
          "name": "nfc",
          "mode": "decompose"
        }
      },
      "analyzer": {
        "nfd_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["nfd_normalizer"]
        }
      }
    }
  }
}
```
Test with accented characters:
```json
POST /icu-norm-nfd/_analyze
{
  "analyzer": "nfd_analyzer",
  "text": "café"
}
```
The NFD normalization decomposes the accented character:
```json
{
  "tokens": [
    {
      "token": "café",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
```
Note: While the visual representation appears the same, the underlying character encoding has changed from a single precomposed character to separate base and combining characters.
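A short Python sketch using the standard `unicodedata` module makes the encoding difference visible:

```python
import unicodedata

nfc = "caf\u00e9"                        # 4 code points: c a f é (precomposed)
nfd = unicodedata.normalize("NFD", nfc)  # 5 code points: c a f e + U+0301

print(nfc == nfd)          # False: the code point sequences differ
print(len(nfc), len(nfd))  # 4 5
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: recomposition round-trips
```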
Example: Selective normalization with unicode_set_filter
You can limit normalization to specific character ranges using the unicode_set_filter parameter:
```json
PUT /icu-norm-selective
{
  "settings": {
    "analysis": {
      "char_filter": {
        "latin_only_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc_cf",
          "unicode_set_filter": "[\\u0000-\\u024F]"
        }
      },
      "analyzer": {
        "selective_normalizer": {
          "tokenizer": "keyword",
          "char_filter": ["latin_only_normalizer"]
        }
      }
    }
  }
}
```
This configuration normalizes only Latin characters (Unicode range U+0000 to U+024F), leaving other scripts unchanged.
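The effect can be sketched in Python by normalizing only the runs of characters that fall inside the set and passing everything else through untouched. This is an illustrative approximation of the filter's behavior, not the plugin's ICU implementation; `selective_nfkc_cf` and `latin` are hypothetical names:

```python
import unicodedata

def selective_nfkc_cf(text, in_set):
    """Apply NFKC + case folding only to characters for which in_set(ch) is True."""
    out, run = [], []
    def flush():
        if run:
            out.append(unicodedata.normalize("NFKC", "".join(run)).casefold())
            run.clear()
    for ch in text:
        if in_set(ch):
            run.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)

# Characters in the Latin range U+0000-U+024F, as in the configuration above
latin = lambda ch: ord(ch) <= 0x024F

# "Ａ" (fullwidth A, U+FF21) is outside the range and is left alone;
# "CAFÉ" is inside the range and is normalized and case-folded.
print(selective_nfkc_cf("Ａ CAFÉ", latin))  # Ａ café
```

Without the filter, full `nfkc_cf` would also fold the fullwidth `Ａ` to a plain `a`.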