
ICU folding token filter

The icu_folding token filter applies Unicode normalization and character folding to tokens, converting them to a form suitable for case-insensitive and accent-insensitive matching. It folds a broader range of characters than the asciifolding token filter because it handles characters from all Unicode scripts, not only those with ASCII equivalents.

The filter implements the character foldings described in Unicode Technical Report #30 (Character Foldings), which include:

  • Converting uppercase letters to lowercase
  • Removing diacritical marks (accents)
  • Converting ligatures to their component letters
  • Normalizing character width (for example, full-width to half-width)
  • Converting certain punctuation and symbols to ASCII equivalents
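To make the list above concrete, the following Python sketch approximates the same kinds of transformations using only the standard unicodedata module. It is an illustration under stated assumptions, not the filter's implementation: the function name approximate_fold is hypothetical, and the results differ from icu_folding in some cases. For example, this sketch leaves Æ as æ because that character has no Unicode decomposition, whereas the filter output shown later on this page folds it to ae.

```python
import unicodedata


def approximate_fold(text: str) -> str:
    """Rough standard-library approximation of the folding steps (not ICU itself)."""
    # Compatibility decomposition: splits ligatures such as U+FB01 into "fi",
    # maps full-width forms to their standard forms, and separates base
    # letters from their accents.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents) produced by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold for case-insensitive comparison, then recompose anything that
    # was decomposed but not removed (for example, Hangul syllables).
    return unicodedata.normalize("NFKC", stripped.casefold())


for sample in ["Café", "RÉSUMÉ", "\ufb01nal", "\uff21\uff22\uff23"]:
    print(repr(sample), "->", repr(approximate_fold(sample)))
```

Each step in the function corresponds to one or more bullets above: the decomposition covers ligatures and width variants, the mark stripping covers diacritics, and the case folding covers uppercase letters.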

Installation

The icu_folding token filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.

Parameters

The following table lists the parameters for the icu_folding token filter.

Parameter | Data type | Description
:--- | :--- | :---
unicode_set_filter | String | A UnicodeSet expression specifying which characters to fold. Characters outside this set are passed through unchanged. Optional. If not specified, all characters are folded.

Example: Basic ICU folding

The following example demonstrates the default icu_folding behavior:

PUT /icu-folding-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_folding_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding"]
        }
      }
    }
  }
}

Test the analyzer with text containing diacritics, ligatures, and mixed case:

POST /icu-folding-index/_analyze
{
  "analyzer": "icu_folding_analyzer",
  "text": "Café RÉSUMÉ Æsop"
}

The response shows normalization and folding:

{
  "tokens": [
    {
      "token": "cafe",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "resume",
      "start_offset": 5,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "aesop",
      "start_offset": 12,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Normalization included

The icu_folding filter already performs Unicode normalization, so you don’t need to add a separate normalization character filter or token filter when using icu_folding.

Example: Preserving specific characters

You can exclude specific characters from folding using the unicode_set_filter parameter. The following example leaves German umlauts and the Eszett character (ß) unfolded:

PUT /icu-folding-german
{
  "settings": {
    "analysis": {
      "filter": {
        "german_folding": {
          "type": "icu_folding",
          "unicode_set_filter": "[^äöüÄÖÜß]"
        }
      },
      "analyzer": {
        "german_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": ["german_folding", "lowercase"]
        }
      }
    }
  }
}

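The unicode_set_filter value lists the characters to fold. The leading ^ negates the set, so [^äöüÄÖÜß] means “fold all characters except these German characters.” Because the excluded characters bypass folding entirely, uppercase Ä, Ö, and Ü are not lowercased by icu_folding; the lowercase filter is added afterward to handle them.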
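The interaction between the negated set and the trailing lowercase filter can be sketched in Python. This is an approximation under stated assumptions, not the plugin's code: whitespace splitting stands in for icu_tokenizer, a standard-library decomposition stands in for ICU folding, and the names PRESERVED, fold_char, fold_except_preserved, and german_analyzer_sketch are hypothetical.

```python
import unicodedata

# Characters excluded from folding, mirroring the set [^äöüÄÖÜß].
# NFC keeps each umlaut as a single precomposed code point.
PRESERVED = set(unicodedata.normalize("NFC", "äöüÄÖÜß"))


def fold_char(ch: str) -> str:
    """Stand-in for ICU folding of one character (assumption: NFKD + casefold)."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def fold_except_preserved(token: str) -> str:
    """Fold every character except those in PRESERVED, which pass through unchanged."""
    token = unicodedata.normalize("NFC", token)
    return "".join(ch if ch in PRESERVED else fold_char(ch) for ch in token)


def german_analyzer_sketch(text: str) -> list[str]:
    """Tokenize on whitespace, fold selectively, then lowercase what was preserved."""
    return [fold_except_preserved(tok).lower() for tok in text.split()]


# After selective folding alone, the preserved uppercase umlaut is still uppercase.
print(fold_except_preserved("MÜNCHEN"))
# The final lowercase step handles it.
print(german_analyzer_sketch("MÜNCHEN Café Größe"))
```

The first call shows why the extra lowercase step is needed: the Ü skipped by the folding stage keeps its case until a later stage lowercases it.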

Test the analyzer:

POST /icu-folding-german/_analyze
{
  "analyzer": "german_analyzer",
  "text": "MÜNCHEN Café Größe"
}

The response preserves German characters while folding others:

{
  "tokens": [
    {
      "token": "münchen",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cafe",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "größe",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Comparison with ASCII folding

While the asciifolding token filter converts non-ASCII characters to ASCII equivalents, icu_folding provides more sophisticated normalization:

  • Broader character support: Handles all Unicode scripts, not just Latin-based characters
  • Script-aware: Applies foldings appropriate to each writing system, independent of language or locale
  • Width normalization: Maps width variants to their standard forms, such as full-width Latin letters and digits to ASCII and half-width katakana to standard katakana (important for CJK text)
  • Ligature handling: Properly decomposes ligatures across all scripts
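As a rough illustration of width normalization, the following Python sketch applies NFKC compatibility normalization from the standard unicodedata module. It does not call ICU or the plugin, and the helper name normalize_width is hypothetical; it only shows the kind of mapping involved.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Apply NFKC, which replaces width variants with their standard forms."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits (U+FF21..U+FF23, U+FF11..U+FF13).
print(normalize_width("\uff21\uff22\uff23\uff11\uff12\uff13"))
# Half-width katakana (U+FF76, U+FF80, U+FF76, U+FF85).
print(normalize_width("\uff76\uff80\uff76\uff85"))
```

Note that the mapping runs in both directions depending on the script: full-width Latin text becomes narrower, while half-width katakana becomes the standard, wider form.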