Hunspell token filter

The hunspell token filter performs dictionary-based stemming and morphological analysis of words in a specific language. It applies Hunspell dictionaries, which are widely used in spell checkers, to reduce each word to its root form.

The Hunspell dictionary files are automatically loaded at startup from the <OS_PATH_CONF>/hunspell/<locale> directory. For example, the en_GB locale requires exactly one .aff file and one or more .dic files in the <OS_PATH_CONF>/hunspell/en_GB/ directory.

Alternatively, you can configure package-based dictionary loading using the ref_path parameter to maintain multiple independent dictionary sets for the same locale. For more information, see Package-based dictionary loading.

You can download Hunspell dictionary files (.aff and .dic) from LibreOffice dictionaries.

Parameters

The hunspell token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`language`/`lang`/`locale` | At least one of the three is required | String | Specifies the language for the Hunspell dictionary. Can contain only alphanumeric characters, hyphens, and underscores (for example, `en_US` or `de_DE`).
`ref_path` | Optional | String | Specifies a package name used to load dictionaries from the `<OS_PATH_CONF>/analyzers/<ref_path>/hunspell/<locale>/` directory instead of the default `<OS_PATH_CONF>/hunspell/<locale>/` directory. When `ref_path` is specified, the `locale` parameter is required. Both `ref_path` and `locale` can contain only alphanumeric characters, hyphens, and underscores. See Package-based dictionary loading.
`dedup` | Optional | Boolean | Determines whether to remove duplicate stems produced for the same token. Default is `true`.
`dictionary` | Optional | Array of strings | Specifies the dictionary files to use. By default, all files in the `<OS_PATH_CONF>/hunspell/<locale>/` directory are used if `ref_path` is not specified, or all files in the `<OS_PATH_CONF>/analyzers/<ref_path>/hunspell/<locale>/` directory if `ref_path` is specified. See Package-based dictionary loading.
`longest_only` | Optional | Boolean | Specifies whether to return only the longest stem of each token. Default is `false`.
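
The effect of `dedup` and `longest_only` can be pictured with a small Python model. This is only a conceptual sketch, not the filter's implementation: `select_stems` is a hypothetical helper, the candidate stems are invented for illustration, and tie-breaking between equally long stems in the real filter may differ.

```python
def select_stems(candidates, dedup=True, longest_only=False):
    """Conceptual model of the dedup and longest_only options.

    `candidates` is a hypothetical list of stems produced for a single
    token; it is not real filter output.
    """
    stems = list(candidates)
    if dedup:
        # Drop repeated stems while keeping first-seen order.
        stems = list(dict.fromkeys(stems))
    if longest_only and stems:
        # Keep a single stem: the one with the most characters.
        # max() returns the first longest item it encounters.
        stems = [max(stems, key=len)]
    return stems


# Hypothetical candidate stems for one token.
candidates = ["move", "mov", "move"]
print(select_stems(candidates, dedup=False))        # ['move', 'mov', 'move']
print(select_stems(candidates))                     # ['move', 'mov']
print(select_stems(candidates, longest_only=True))  # ['move']
```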

Example

The following example request creates a new index named my_index and configures an analyzer with a hunspell filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell_filter": {
          "type": "hunspell",
          "lang": "en_GB",
          "dedup": true,
          "longest_only": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_hunspell_filter"
          ]
        }
      }
    }
  }
}

Package-based dictionary loading

When you specify a ref_path parameter, dictionaries are loaded from a package-specific directory instead of the default directory. This is useful when you need multiple independent dictionary sets for the same locale, for example, when different indexes require different custom dictionaries.

Place dictionary files in the following directory structure:

<OS_PATH_CONF>/analyzers/<ref_path>/hunspell/<locale>/
├── <locale>.aff       (exactly one .aff file required)
├── <locale>.dic       (one or more .dic files)
└── <locale>_custom.dic
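
Before restarting a node, it can help to confirm that a dictionary directory matches this layout. The following Python sketch resolves the expected directory and checks its contents. The helper names `dictionary_dir` and `check_dictionary_dir` are hypothetical, the name pattern assumes ASCII alphanumeric characters, and `/etc/opensearch` is only a placeholder for `<OS_PATH_CONF>`.

```python
import re
from pathlib import Path

# Assumption: "alphanumeric" is read as ASCII letters and digits.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def dictionary_dir(conf_dir, locale, ref_path=None):
    """Return the directory expected to hold the dictionary files."""
    for label, value in (("locale", locale), ("ref_path", ref_path)):
        if value is not None and not _NAME_RE.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    base = Path(conf_dir)
    if ref_path is not None:
        # Package-based layout.
        return base / "analyzers" / ref_path / "hunspell" / locale
    # Default layout.
    return base / "hunspell" / locale


def check_dictionary_dir(path):
    """Return a list of problems found in a dictionary directory."""
    path = Path(path)
    if not path.is_dir():
        return [f"{path} is not a directory"]
    problems = []
    aff_files = sorted(path.glob("*.aff"))
    dic_files = sorted(path.glob("*.dic"))
    if len(aff_files) != 1:
        problems.append(f"expected exactly one .aff file, found {len(aff_files)}")
    if not dic_files:
        problems.append("expected at least one .dic file, found none")
    return problems


# /etc/opensearch is a placeholder for <OS_PATH_CONF>.
target = dictionary_dir("/etc/opensearch", "en_US", ref_path="pkg-1234")
print(target.as_posix())
print(check_dictionary_dir(target))
```

An empty list from `check_dictionary_dir` means the directory contains exactly one .aff file and at least one .dic file. The sketch does not validate the contents of those files.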

The following example loads a Hunspell dictionary from the package directory <OS_PATH_CONF>/analyzers/pkg-1234/hunspell/en_US/:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_custom_hunspell": {
          "type": "hunspell",
          "ref_path": "pkg-1234",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_custom_hunspell"
          ]
        }
      }
    }
  }
}

Multiple indexes can use different packages configured for the same locale. Each package maintains its own independent dictionary cache:

PUT /index_medical
{
  "settings": {
    "analysis": {
      "filter": {
        "medical_hunspell": {
          "type": "hunspell",
          "ref_path": "medical-dict",
          "locale": "en_US"
        }
      }
    }
  }
}

PUT /index_legal
{
  "settings": {
    "analysis": {
      "filter": {
        "legal_hunspell": {
          "type": "hunspell",
          "ref_path": "legal-dict",
          "locale": "en_US"
        }
      }
    }
  }
}
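
The two requests above define only the token filters. As in the earlier examples, a filter takes effect once an analyzer references it. The following Python sketch generates a complete request body for each package. The function `hunspell_index_settings` and the analyzer names `medical_analyzer` and `legal_analyzer` are hypothetical and chosen for illustration.

```python
import json


def hunspell_index_settings(filter_name, analyzer_name, locale, ref_path=None):
    """Build an index-creation body with a hunspell filter and an analyzer using it."""
    token_filter = {"type": "hunspell", "locale": locale}
    if ref_path is not None:
        # Load dictionaries from the package directory instead of the default one.
        token_filter["ref_path"] = ref_path
    return {
        "settings": {
            "analysis": {
                "filter": {filter_name: token_filter},
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


# Same locale, different packages.
medical = hunspell_index_settings("medical_hunspell", "medical_analyzer", "en_US", "medical-dict")
legal = hunspell_index_settings("legal_hunspell", "legal_analyzer", "en_US", "legal-dict")

# Print the body to send with PUT /index_medical.
print(json.dumps(medical, indent=2))
```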

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the turtle moves slowly"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "move",
      "start_offset": 11,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "slow",
      "start_offset": 17,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
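
The offsets in the response refer to the original text, while each `token` value is the stem. The following Python sketch sends the same request from a script and pairs each original word with its stem. It uses only the standard library. The helper names are hypothetical, and the commented-out call assumes an unsecured cluster at `http://localhost:9200`, which is not stated in this page.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST /<index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, index, analyzer, text):
    """Send the request and return the parsed JSON response."""
    request = build_analyze_request(base_url, index, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def stem_pairs(text, response):
    """Pair each original word, located through its offsets, with the emitted token.

    Assumes the text contains only Basic Multilingual Plane characters,
    so that the offsets line up with Python string indexes.
    """
    return [
        (text[t["start_offset"]:t["end_offset"]], t["token"])
        for t in response["tokens"]
    ]


text = "the turtle moves slowly"
# Against a running cluster (assumed URL):
# response = analyze("http://localhost:9200", "my_index", "my_analyzer", text)

# Abbreviated copy of the response shown above.
response = {
    "tokens": [
        {"token": "the", "start_offset": 0, "end_offset": 3},
        {"token": "turtle", "start_offset": 4, "end_offset": 10},
        {"token": "move", "start_offset": 11, "end_offset": 16},
        {"token": "slow", "start_offset": 17, "end_offset": 23},
    ]
}
print(stem_pairs(text, response))
# [('the', 'the'), ('turtle', 'turtle'), ('moves', 'move'), ('slowly', 'slow')]
```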
