ICU tokenizer
The icu_tokenizer splits text into words using Unicode text segmentation rules defined in Unicode Standard Annex #29. This tokenizer provides more accurate word boundary detection than the standard tokenizer, particularly for Asian languages that don’t use spaces to separate words.
The icu_tokenizer employs dictionary-based tokenization for Chinese, Japanese, Korean, Thai, and Lao text, and applies specialized rules for segmenting Myanmar and Khmer scripts into syllables.
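To see why whitespace and script boundaries alone are not enough for such text, consider the following minimal Python sketch. It is not ICU's algorithm: the `rough_script` and `script_runs` helpers are hypothetical and approximate a character's script by the first word of its Unicode name. It only illustrates that the Thai runs in a mixed string still need dictionary-based segmentation to be split into words.

```python
import unicodedata
from itertools import groupby


def rough_script(ch: str) -> str:
    # Approximation only: first word of the Unicode character name,
    # for example "THAI" or "LATIN". ICU uses real script properties.
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text: str) -> list[str]:
    # Group consecutive characters that share the same approximate script.
    return ["".join(group) for _, group in groupby(text, key=rough_script)]


text = "สวัสดีOpenSearchเป็นเครื่องมือค้นหา"

# Whitespace splitting sees one long token because the text has no spaces.
print(text.split())

# Script boundaries isolate "OpenSearch", but each Thai run stays unsegmented.
print(script_runs(text))
```

The second Thai run contains several words, and finding the boundaries between them is the part that needs a dictionary.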
Installation
The icu_tokenizer requires the analysis-icu plugin. For installation instructions, see ICU analyzer.
Example
The following example demonstrates how to use the icu_tokenizer:
PUT /icu-tokenizer-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_analyzer_custom": {
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  }
}
Testing the tokenizer
Use the following request to test the icu_tokenizer:
POST /icu-tokenizer-index/_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "สวัสดีOpenSearchเป็นเครื่องมือค้นหา"
}
The tokenizer correctly segments Thai text without spaces:
{
  "tokens": [
    {
      "token": "สวัสดี",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "OpenSearch",
      "start_offset": 6,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "เป็น",
      "start_offset": 16,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "เครื่อง",
      "start_offset": 20,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "มือ",
      "start_offset": 27,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "ค้นหา",
      "start_offset": 30,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
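Each token's start_offset and end_offset point back into the original text. The following Python sketch checks this for the response above. It assumes the offsets can be used directly as Python string indexes, which holds here because every character in the sample is a single code point and a single UTF-16 code unit. The `offsets_match` helper is hypothetical.

```python
text = "สวัสดีOpenSearchเป็นเครื่องมือค้นหา"

# (token, start_offset, end_offset) triples copied from the response above.
tokens = [
    ("สวัสดี", 0, 6),
    ("OpenSearch", 6, 16),
    ("เป็น", 16, 20),
    ("เครื่อง", 20, 27),
    ("มือ", 27, 30),
    ("ค้นหา", 30, 35),
]


def offsets_match(text: str, tokens: list[tuple[str, int, int]]) -> bool:
    # True when slicing the text at each offset pair reproduces the token.
    return all(text[start:end] == token for token, start, end in tokens)


print(offsets_match(text, tokens))
```

For text that contains characters outside the Basic Multilingual Plane, such as many emoji, the offsets and Python indexes may diverge, so this check would need to convert between the two.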
Customizing tokenization rules
Advanced users can customize the icu_tokenizer behavior by specifying per-script rule files written in the Rule-Based Break Iterator (RBBI) syntax. This feature is experimental in Lucene.
To apply custom rules, use the rule_files parameter with a comma-separated list of script:filename pairs. Script codes follow the ISO 15924 four-letter standard.
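The format of a rule_files value can be illustrated with a short Python sketch that parses it into a script-to-file mapping. The `parse_rule_files` helper is hypothetical and is not part of OpenSearch or Lucene. The title-case check on the script code is a stylistic assumption made for this example, and `ThaiRules.rbbi` is an invented file name.

```python
import re

# Four letters in title case, such as "Latn" or "Thai" (assumed convention).
SCRIPT_CODE = re.compile(r"[A-Z][a-z]{3}")


def parse_rule_files(value: str) -> dict[str, str]:
    # Split "Script:file,Script:file" into {"Script": "file", ...}.
    mapping: dict[str, str] = {}
    for pair in value.split(","):
        script, sep, filename = pair.strip().partition(":")
        if not sep or not filename or not SCRIPT_CODE.fullmatch(script):
            raise ValueError(f"expected Script:filename, got {pair!r}")
        mapping[script] = filename
    return mapping


print(parse_rule_files("Latn:CustomRules.rbbi,Thai:ThaiRules.rbbi"))
```

A check like this can catch a malformed value, such as a missing colon, before the index settings are sent to the cluster.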
Example with custom rules
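Save a custom rule file (for example, CustomRules.rbbi) to your OpenSearch config directory. The following rule matches any run of one or more characters as a single token and tags it with rule status 200: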
.+ {200};
Configure an analyzer to use this rule file:
PUT /custom-icu-rules
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "custom_icu_tokenizer": {
          "type": "icu_tokenizer",
          "rule_files": "Latn:CustomRules.rbbi"
        }
      },
      "analyzer": {
        "custom_icu_analyzer": {
          "tokenizer": "custom_icu_tokenizer"
        }
      }
    }
  }
}
Test the custom tokenizer:
POST /custom-icu-rules/_analyze
{
  "analyzer": "custom_icu_analyzer",
  "text": "Custom tokenization rules"
}
Parameters
The following table lists the parameters for the icu_tokenizer.
| Parameter | Data type | Description |
|---|---|---|
| rule_files | String | Comma-separated list of script:rulefile pairs that define custom tokenization rules for specific scripts. Rule files must be placed in the OpenSearch config directory. Optional. |