A tokenizer receives a stream of characters and splits the text into individual tokens. A token consists of a term (usually, a word) and metadata about this term. For example, a tokenizer can split text on white space so that the text `Actions speak louder than words.` becomes [`Actions`, `speak`, `louder`, `than`, `words.`].
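The output of a tokenizer is a stream of tokens. Tokenizers also maintain the following metadata about tokens:

- The order or position of each token, which is used for word and phrase proximity queries.
- The starting and ending positions (offsets) of each token in the text, which are used for highlighting search terms.
- The token type. Some tokenizers (for example, `standard`) classify tokens by type, for example, `<ALPHANUM>` or `<NUM>`. Simpler tokenizers (for example, `letter`) only classify tokens as type `word`.

You can use tokenizers to define custom analyzers.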
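
The token metadata described above can be illustrated with a minimal Python sketch. It is not how OpenSearch implements tokenizers: the function name `whitespace_tokenize` and the dictionary keys are hypothetical and chosen only for illustration. The sketch splits text on white space and records a position, character offsets, and a type for each token.

```python
import re
from typing import Dict, List, Union

Token = Dict[str, Union[str, int]]


def whitespace_tokenize(text: str) -> List[Token]:
    """Split text on white space and record metadata for each token.

    Illustrative only: the key names are assumptions, not an OpenSearch API.
    """
    tokens: List[Token] = []
    # \S+ matches each run of non-white-space characters.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "position": position,            # order of the token in the stream
            "start_offset": match.start(),   # index of the first character
            "end_offset": match.end(),       # index just past the last character
            "type": "word",                  # a simple tokenizer uses one type
        })
    return tokens


tokens = whitespace_tokenize("Actions speak louder than words.")
for token in tokens:
    print(token)
```

The offsets point back into the original text, which is what makes it possible to highlight a matched term in its original location.
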
The following tables list the built-in tokenizers that OpenSearch provides.
Word tokenizers parse full text into words.
| Tokenizer | Description | Example |
|---|---|---|
| `standard` | - Parses strings into tokens at word boundaries<br>- Removes most punctuation | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`] |
| `letter` | - Parses strings into tokens on any non-letter character<br>- Removes non-letter characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`It`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `to`, `OpenSearch`] |
| `lowercase` | - Parses strings into tokens on any non-letter character<br>- Removes non-letter characters<br>- Converts terms to lowercase | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`it`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `to`, `opensearch`] |
| `whitespace` | - Parses strings into tokens at white space characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] |
| `uax_url_email` | - Similar to the standard tokenizer<br>- Unlike the standard tokenizer, leaves URLs and email addresses as single terms | `It’s fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com!` becomes [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `opensearch-project@github.com`] |
| `classic` | - Parses strings into tokens on punctuation characters that are followed by a white space character and on hyphens if the term does not contain numbers<br>- Removes punctuation<br>- Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)` becomes [`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`] |
| `thai` | - Parses Thai text into terms | `สวัสดีและยินดีต` becomes [`สวัสดี`, `และ`, `ยินดี`, `ต`] |
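
The behavior of the simpler word tokenizers in the table above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not the Lucene implementation: the function names are hypothetical, and "letter" is approximated as any Unicode word character that is not a digit or an underscore.

```python
import re
from typing import List


def letter_tokenize(text: str) -> List[str]:
    """Approximate the letter tokenizer: keep runs of letters only."""
    # [^\W\d_]+ matches word characters excluding digits and underscores,
    # so punctuation, digits, and white space all act as separators.
    return re.findall(r"[^\W\d_]+", text)


def lowercase_tokenize(text: str) -> List[str]:
    """Approximate the lowercase tokenizer: letter tokenizer plus lowercasing."""
    return [term.lower() for term in letter_tokenize(text)]


def whitespace_tokenize(text: str) -> List[str]:
    """Approximate the whitespace tokenizer: split on white space only."""
    return text.split()


text = "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"
print(letter_tokenize(text))
print(lowercase_tokenize(text))
print(whitespace_tokenize(text))
```

Note how the apostrophe and the hyphen split terms in the first two functions but are preserved by the white space split, mirroring the examples in the table.
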
Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.
| Tokenizer | Description | Example |
|---|---|---|
| `ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo` becomes [`M`, `My`, `y`, `y `, ` `, ` r`, `r`, `re`, `e`, `ep`, `p`, `po`, `o`] because the default n-gram length is 1–2 characters and, by default, the text is not split on any characters, so some n-grams contain the space |
| `edge_ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo` becomes [`M`, `My`] because the default n-gram length is 1–2 characters |
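
The n-gram examples in the table above can be reproduced with a short Python sketch. It assumes the whole input is treated as a single word (no split characters) and an n-gram length of 1–2 characters; the function and parameter names are hypothetical and only mirror the idea of a minimum and maximum length.

```python
from typing import List


def ngrams(text: str, min_gram: int = 1, max_gram: int = 2) -> List[str]:
    """Generate all n-grams of the text, ordered by starting position."""
    grams: List[str] = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            # Skip n-grams that would run past the end of the text.
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 2) -> List[str]:
    """Generate only the n-grams anchored at the beginning of the text."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


print(ngrams("My repo"))
print(edge_ngrams("My repo"))
```

Edge n-grams are a small subset of all n-grams, which is why they are a common choice for prefix matching such as search-as-you-type.
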
Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.
| Tokenizer | Description | Example |
|---|---|---|
| `keyword` | - No-op tokenizer<br>- Outputs the entire string unchanged<br>- Can be combined with token filters, like `lowercase`, to normalize terms | `My repo` becomes `My repo` |
| `pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms<br>- Uses Java regular expressions | `https://opensearch.org/forum` becomes [`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)<br>Can be configured with a regex pattern |
| `simple_pattern` | - Uses a regular expression pattern to return matching text as terms<br>- Uses Lucene regular expressions<br>- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default<br>Must be configured with a pattern because the pattern defaults to an empty string |
| `simple_pattern_split` | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms<br>- Uses Lucene regular expressions<br>- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default<br>Must be configured with a pattern |
| `char_group` | - Parses on a set of configurable characters<br>- Faster than tokenizers that run regular expressions | No-op by default<br>Must be configured with a list of characters |
| `path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three` becomes [`one`, `one/two`, `one/two/three`] |
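
Three of the structured text tokenizers in the table above can be approximated in Python as follows. This is a sketch under assumptions, not the OpenSearch implementation: the function names are hypothetical, and Python's `re` module stands in for Java regular expressions, which differ in some details.

```python
import re
from typing import Iterable, List


def pattern_tokenize(text: str, pattern: str = r"\W+") -> List[str]:
    """Split text on a regular expression, dropping empty terms."""
    return [term for term in re.split(pattern, text) if term]


def char_group_tokenize(text: str, split_chars: Iterable[str]) -> List[str]:
    """Split text on a fixed set of characters without using a regex."""
    separators = set(split_chars)
    terms: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch in separators:
            if current:
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        terms.append("".join(current))
    return terms


def path_hierarchy_tokenize(path: str, delimiter: str = "/") -> List[str]:
    """Return the path to each component, from the root downward."""
    parts = path.split(delimiter)
    prefixes = [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
    # A leading delimiter produces an empty first prefix; drop it.
    return [prefix for prefix in prefixes if prefix]


print(pattern_tokenize("https://opensearch.org/forum"))
print(char_group_tokenize("one-two three", ["-", " "]))
print(path_hierarchy_tokenize("one/two/three"))
```

The character-set version avoids the regular expression engine entirely, which illustrates why a tokenizer of that kind can be faster than one that evaluates a pattern.
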