
You're viewing version 3.3 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.

Polish analyzer

The Polish language analyzer (polish) tokenizes Polish text, lowercases it, removes Polish stop words, and stems the remaining tokens. The analyzer is provided by the analysis-stempel plugin, which must be installed before you can use it.

Installing the plugin

Before you can use the Polish analyzer, install the analysis-stempel plugin on every node in the cluster by running the following command, and then restart each node:

./bin/opensearch-plugin install analysis-stempel

For more information, see Additional plugins: Complete list of available OpenSearch plugins.
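
To confirm that the plugin is loaded, one option is to list the installed plugins through the CAT plugins API after the node has restarted. This check is a suggestion rather than a required step; each node that has the plugin should report a row whose component is analysis-stempel.

```json
GET _cat/plugins?v
```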

Using the Polish analyzer

To use the Polish analyzer when you map an index, specify the polish value in the analyzer field:

PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "polish"
      }
    }
  }
}
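
As a quick check of the preceding mapping, the following sketch indexes one document and searches it with a match query. The document ID, the sample sentence, and the query text are illustrative assumptions. By default, the polish analyzer is also applied to the query text, so the query and the indexed text are reduced to stems in the same way.

```json
PUT my-index/_doc/1?refresh=true
{
  "content": "Pracuję z OpenSearch w Polsce"
}

GET my-index/_search
{
  "query": {
    "match": {
      "content": "Polsce"
    }
  }
}
```

The query uses the same word form that appears in the document, so it is expected to match. Whether a differently inflected form also matches depends on the stems that Stempel produces for both forms, which you can verify with the _analyze API.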

Configuring a custom Polish analyzer

To customize Polish analysis, define a custom analyzer that combines the Polish token filters provided by the plugin with other analysis components. The default polish analyzer applies the following analysis chain, in order:

  1. Tokenizer: standard
  2. Token filters:
    • lowercase
    • polish_stop (removes Polish stop words)
    • polish_stem (applies Polish stemming)

Example: Custom Polish analyzer

PUT my-polish-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_polish": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "polish_stop",
            "polish_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_polish"
      },
      "content": {
        "type": "text",
        "analyzer": "polish"
      }
    }
  }
}
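
To check how the custom analyzer from the preceding example behaves, you can run the _analyze API against the index that defines it. The sample text is an assumption chosen for illustration.

```json
POST my-polish-index/_analyze
{
  "analyzer": "custom_polish",
  "text": "Jestem programistą w Polsce"
}
```

Because custom_polish lists the same tokenizer and token filters as the default chain, its output should match that of the built-in polish analyzer for the same text. To test the analyzer that is attached to a field instead, replace the analyzer entry with "field": "title".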

Polish token filters

The analysis-stempel plugin provides the following token filters for Polish language processing.

polish_stop token filter

Removes common Polish stop words from the token stream.

polish_stem token filter

Applies the Stempel stemming algorithm to reduce Polish words to their stems. Because the stemming is algorithmic, a stem is not always a dictionary word.
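
Either filter can also be tried in isolation by passing an ad hoc tokenizer and filter list to the _analyze API, without creating an index. The following sketch deliberately omits polish_stop, so stop words such as w and i are expected to remain in the token stream. The sample text is an assumption chosen for illustration.

```json
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "polish_stem"],
  "text": "Jestem programistą w Polsce i pracuję z OpenSearch"
}
```

Comparing this output with the output of the full polish analyzer shows what the polish_stop filter contributes.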

Generated tokens

Use the following request to examine the tokens generated by the polish analyzer:

POST _analyze
{
  "analyzer": "polish",
  "text": "Jestem programistą w Polsce i pracuję z OpenSearch"
}

The response contains the generated tokens. The stop words w, i, and z are removed, which leaves gaps in the position values (2, 4, and 6):

{
  "tokens": [
    {"token": "jest", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0},
    {"token": "prograć", "start_offset": 7, "end_offset": 18, "type": "<ALPHANUM>", "position": 1},
    {"token": "polsce", "start_offset": 21, "end_offset": 27, "type": "<ALPHANUM>", "position": 3},
    {"token": "pracować", "start_offset": 30, "end_offset": 37, "type": "<ALPHANUM>", "position": 5},
    {"token": "opensearch", "start_offset": 40, "end_offset": 50, "type": "<ALPHANUM>", "position": 7}
  ]
}
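
To see which stage of the chain removed or changed each token, the _analyze API accepts an explain flag. The following sketch spells out the default chain as an ad hoc tokenizer and filter list, on the assumption that this makes the response report the tokenizer and each token filter separately rather than only the analyzer as a whole.

```json
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "polish_stop", "polish_stem"],
  "text": "Jestem programistą w Polsce i pracuję z OpenSearch",
  "explain": true
}
```

The detail section of the response lists the tokens after the tokenizer and after each filter in order, which makes it easier to tell whether a token was dropped by polish_stop or altered by polish_stem.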