Standard analyzer
The `standard` analyzer is the built-in default analyzer used for general-purpose full-text search in OpenSearch. It is designed to provide consistent, language-agnostic text processing by efficiently breaking down text into searchable terms.
The `standard` analyzer performs the following operations:

- Tokenization: Uses the `standard` tokenizer, which splits text into words based on Unicode text segmentation rules, handling spaces, punctuation, and common delimiters.
- Lowercasing: Applies the `lowercase` token filter to convert all tokens to lowercase, ensuring consistent matching regardless of input case.
This combination makes the `standard` analyzer ideal for indexing a wide range of natural language content without needing language-specific customizations.
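As a quick illustration, you can send text through the `standard` analyzer using the `_analyze` API without creating an index first (the sample text below is arbitrary):

POST /_analyze
{
  "analyzer": "standard",
  "text": "OpenSearch is FAST, scalable, and easy-to-use!"
}

The response should contain the lowercased tokens `opensearch`, `is`, `fast`, `scalable`, `and`, `easy`, `to`, and `use`, with the punctuation and hyphens removed.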
Example: Creating an index with the standard analyzer
You can assign the `standard` analyzer to a text field when creating an index:
PUT /my_standard_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
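After the index is created, you can verify which analyzer the field uses by analyzing text against the field itself (this request assumes the `my_standard_index` index above exists):

GET /my_standard_index/_analyze
{
  "field": "my_field",
  "text": "Searching OpenSearch Indexes"
}

The response should contain the tokens `searching`, `opensearch`, and `indexes`.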
Parameters
The `standard` analyzer supports the following optional parameters.
| Parameter | Data type | Default | Description |
| :--- | :--- | :--- | :--- |
| `max_token_length` | Integer | `255` | The maximum length that a token can be before it is split. |
| `stopwords` | String or list of strings | None | A list of stopwords or a predefined stopword set for a language to remove during analysis, for example, `_english_`. |
| `stopwords_path` | String | None | The path to a file containing stopwords to be used during analysis. |
Use only one of the `stopwords` or `stopwords_path` parameters. If both are specified, no error is returned, but only the `stopwords` parameter is applied.
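For example, instead of listing stopwords manually, you can reference the predefined English stopword set mentioned in the preceding table by passing `_english_` as the `stopwords` value (the index and analyzer names here are illustrative):

PUT /english_articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stop_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}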
Example: Analyzer with parameters
The following example creates an `animals` index and configures the `max_token_length` and `stopwords` parameters:
PUT /animals
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_manual_stopwords_analyzer": {
          "type": "standard",
          "max_token_length": 10,
          "stopwords": [
            "the", "is", "and", "but", "an", "a", "it"
          ]
        }
      }
    }
  }
}
Use the following `_analyze` API request to see how the `my_manual_stopwords_analyzer` processes text:
POST /animals/_analyze
{
  "analyzer": "my_manual_stopwords_analyzer",
  "text": "The Turtle is Large but it is Slow"
}
The returned tokens:
- Have been split on spaces.
- Have been lowercased.
- Have had stopwords removed.
{
  "tokens": [
    {
      "token": "turtle",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "large",
      "start_offset": 14,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "slow",
      "start_offset": 30,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
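Note that the `max_token_length` setting is not exercised by the preceding example because every word in the sample text is shorter than 10 characters. To see it in action, you can analyze a longer word (the sample word is arbitrary):

POST /animals/_analyze
{
  "analyzer": "my_manual_stopwords_analyzer",
  "text": "hippopotamuses"
}

Because the word exceeds the 10-character limit, it should be split at that boundary into the tokens `hippopotam` and `uses`.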