You're viewing version 3.1 of the OpenSearch documentation. This version is no longer maintained. For the latest version, see the current documentation. For information about OpenSearch version maintenance, see Release Schedule and Maintenance Policy.
Stop token filter
The `stop` token filter is used to remove common words (also known as stopwords) from a token stream during analysis. Stopwords are typically articles and prepositions, such as `a` or `for`. These words are not significantly meaningful in search queries and are often excluded to improve search efficiency and relevance.
The default list of English stopwords includes the following words: `a`, `an`, `and`, `are`, `as`, `at`, `be`, `but`, `by`, `for`, `if`, `in`, `into`, `is`, `it`, `no`, `not`, `of`, `on`, `or`, `such`, `that`, `the`, `their`, `then`, `there`, `these`, `they`, `this`, `to`, `was`, `will`, and `with`.
Parameters
The `stop` token filter can be configured with the following parameters.
Parameter | Required/Optional | Data type | Description
---|---|---|---
`stopwords` | Optional | String or array of strings | Specifies either a custom array of stopwords or a predefined stopword set for a language. Default is `_english_`.
`stopwords_path` | Optional | String | Specifies the file path (absolute or relative to the config directory) of the file containing custom stopwords.
`ignore_case` | Optional | Boolean | If `true`, stopwords are matched regardless of their case. Default is `false`.
`remove_trailing` | Optional | Boolean | If `true`, the last token in the stream is removed if it is a stopword. Default is `true`.
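For instance, `stopwords` can be set to a custom array instead of a predefined set, and combined with `ignore_case` so that capitalized stopwords are also removed. The following is a minimal sketch; the index and filter names `my-custom-stop-index` and `my_custom_stop_filter` are illustrative, not part of the original example:

```json
PUT /my-custom-stop-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_custom_stop_filter": {
          "type": "stop",
          "stopwords": ["and", "is", "the"],
          "ignore_case": true
        }
      }
    }
  }
}
```

With `ignore_case` set to `true`, tokens such as `The` and `THE` are removed in addition to `the`.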
Example
The following example request creates a new index named `my-stopword-index` and configures an analyzer with a `stop` filter that uses the predefined stopword list for the English language:
PUT /my-stopword-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "my_stop_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stop_filter"
          ]
        }
      }
    }
  }
}
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
GET /my-stopword-index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "A quick dog jumps over the turtle"
}
The response contains the generated tokens:
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "jumps",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "over",
      "start_offset": 18,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "turtle",
      "start_offset": 27,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}
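Note that the removed stopwords leave gaps in the `position` values (`A` occupied position 0 and `the` position 5). Because `remove_trailing` defaults to `true`, a stopword at the end of the stream is also removed, which may be undesirable for search-as-you-type queries where the user has not finished typing. A minimal sketch of a filter that keeps trailing stopwords; the index and filter names `my-keep-trailing-index` and `keep_trailing_stop` are illustrative:

```json
PUT /my-keep-trailing-index
{
  "settings": {
    "analysis": {
      "filter": {
        "keep_trailing_stop": {
          "type": "stop",
          "stopwords": "_english_",
          "remove_trailing": false
        }
      }
    }
  }
}
```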
Predefined stopword sets by language
The following is a list of all available predefined stopword sets by language:
_arabic_
_armenian_
_basque_
_bengali_
_brazilian_ (Brazilian Portuguese)
_bulgarian_
_catalan_
_cjk_ (Chinese, Japanese, and Korean)
_czech_
_danish_
_dutch_
_english_
_estonian_
_finnish_
_french_
_galician_
_german_
_greek_
_hindi_
_hungarian_
_indonesian_
_irish_
_italian_
_latvian_
_lithuanian_
_norwegian_
_persian_
_portuguese_
_romanian_
_russian_
_sorani_
_spanish_
_swedish_
_thai_
_turkish_
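Any of these set names can be supplied as the `stopwords` value. The following is a minimal sketch using the predefined French set; the index and filter names `my-french-index` and `french_stop` are illustrative:

```json
PUT /my-french-index
{
  "settings": {
    "analysis": {
      "filter": {
        "french_stop": {
          "type": "stop",
          "stopwords": "_french_"
        }
      }
    }
  }
}
```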