Semantic field type
Introduced 3.1

The `semantic` field type is a high-level abstraction that simplifies neural search setup in OpenSearch. It can wrap a variety of field types, including all string and binary fields. The `semantic` field type automatically enables semantic indexing and querying based on the configured machine learning (ML) model.
PREREQUISITE
Before using the `semantic` field type, you must configure either a local ML model hosted on your OpenSearch cluster or an externally hosted model connected to your OpenSearch cluster. For more information about local models, see Using ML models within OpenSearch. For more information about externally hosted models, see Connecting to externally hosted models.
Example: Dense embedding model

Once you configure a model, you can use it to create an index with a `semantic` field. This example assumes that you have configured a dense embedding model with the ID `n17yX5cBsaYnPfyOzmQU` in your cluster:
```json
PUT /my-nlp-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "model_id": "n17yX5cBsaYnPfyOzmQU"
      }
    }
  }
}
```
After creating the index, you can retrieve its mapping to verify that a `passage_semantic_info` field was automatically created. The `passage_semantic_info` field contains a `knn_vector` subfield for storing the dense embedding and additional metadata fields for capturing information such as the model ID, model name, and model type:
```json
GET /my-nlp-index/_mapping

{
  "my-nlp-index": {
    "mappings": {
      "properties": {
        "passage": {
          "type": "semantic",
          "model_id": "n17yX5cBsaYnPfyOzmQU",
          "raw_field_type": "text"
        },
        "passage_semantic_info": {
          "properties": {
            "embedding": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "faiss",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            },
            "model": {
              "properties": {
                "id": {
                  "type": "text",
                  "index": false
                },
                "name": {
                  "type": "text",
                  "index": false
                },
                "type": {
                  "type": "text",
                  "index": false
                }
              }
            }
          }
        }
      }
    }
  }
}
```
The `dimension` and `space_type` of the `knn_vector` field are determined by the ML model configuration. For pretrained dense models, this information is included in the default model configuration. For externally hosted dense embedding models, you must explicitly define the `dimension` and `space_type` in the model configuration before using the model with a `semantic` field.

The autogenerated `knn_vector` subfield supports additional settings that are not currently configurable in the `semantic` field. For more information, see Limitations.
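To see the field in action, you can index a document and run a semantic search against it. The following sketch assumes the index created above; the document text and query text are illustrative. Because the `semantic` field stores the model configuration in its mapping, the `neural` query needs only the query text:

```json
PUT /my-nlp-index/_doc/1
{
  "passage": "A ranger patrols the canyon at dawn."
}

GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage": {
        "query_text": "wild west",
        "k": 10
      }
    }
  }
}
```

At query time, OpenSearch generates an embedding from the query text using the configured model and runs a k-NN search against the autogenerated `knn_vector` subfield.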
Example: Sparse encoding model

Once you configure a model, you can use it to create an index with a `semantic` field. This example assumes that you have configured a sparse encoding model with the ID `nF7yX5cBsaYnPfyOq2SG` in your cluster:
```json
PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "model_id": "nF7yX5cBsaYnPfyOq2SG"
      }
    }
  }
}
```
After creating the index, you can retrieve its mapping to verify that a `rank_features` field was automatically created:
```json
GET /my-nlp-index/_mapping

{
  "my-nlp-index": {
    "mappings": {
      "properties": {
        "passage": {
          "type": "semantic",
          "model_id": "nF7yX5cBsaYnPfyOq2SG",
          "raw_field_type": "text"
        },
        "passage_semantic_info": {
          "properties": {
            "embedding": {
              "type": "rank_features"
            },
            "model": {
              "properties": {
                "id": {
                  "type": "text",
                  "index": false
                },
                "name": {
                  "type": "text",
                  "index": false
                },
                "type": {
                  "type": "text",
                  "index": false
                }
              }
            }
          }
        }
      }
    }
  }
}
```
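Searching a sparse `semantic` field follows the same pattern as the dense case. The following sketch uses an illustrative query; because the model is recorded in the field mapping, the `neural` query requires only the query text, and OpenSearch generates sparse token-weight pairs and matches them against the `rank_features` subfield:

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage": {
        "query_text": "wild west"
      }
    }
  }
}
```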
Parameters

The `semantic` field type supports the following parameters.
| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| `type` | String | Required | Must be set to `semantic`. |
| `raw_field_type` | String | Optional | The underlying field type wrapped by the `semantic` field. The raw input is stored as this type at the path of the `semantic` field, allowing it to behave like a standard field of that type. Valid values are `text`, `keyword`, `match_only_text`, `wildcard`, `token_count`, and `binary`. Default is `text`. You can use any parameters supported by the underlying field type; those parameters function as expected. |
| `model_id` | String | Required | The ID of the ML model used to generate embeddings from field values during indexing and from query input during search. |
| `search_model_id` | String | Optional | The ID of the ML model used specifically for query-time embedding generation. If not specified, the `model_id` is used. Cannot be specified together with `semantic_field_search_analyzer`. |
| `semantic_info_field_name` | String | Optional | A custom name for the internal metadata field that stores the embedding and model information. By default, this field name is derived by appending `_semantic_info` to the `semantic` field name. |
| `chunking` | Boolean | Optional | Enables fixed-length token chunking during ingestion. When enabled, the input is split into chunks using a default configuration. See Text chunking. |
| `semantic_field_search_analyzer` | String | Optional | Specifies an analyzer for tokenizing the query input when using a sparse model. Valid values are `standard`, `bert-uncased`, and `mbert-uncased`. Cannot be used together with `search_model_id`. For more information, see Analyzers. |
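The optional parameters can be combined in a single mapping. The following sketch reuses the dense model ID from the earlier example and shows a `semantic` field that stores its raw input as `keyword` and writes its metadata to a custom field name; the custom name is illustrative:

```json
PUT /my-nlp-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "model_id": "n17yX5cBsaYnPfyOzmQU",
        "raw_field_type": "keyword",
        "semantic_info_field_name": "passage_embedding_info"
      }
    }
  }
}
```

With this mapping, the `passage` field behaves like a standard `keyword` field for exact-match queries, while embeddings and model metadata are stored under `passage_embedding_info` instead of the default `passage_semantic_info`.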
Text chunking

By default, text chunking is disabled for `semantic` fields. This is because enabling chunking requires storing each chunk's embedding in a nested object, which can increase search latency. Searching nested objects requires joining child documents to their parent, along with additional scoring and aggregation logic. The more matching child documents there are, the higher the potential latency.

If you're working with long-form text and want to improve search relevance, you can enable chunking by setting the `chunking` parameter for the `semantic` field to `true` when creating an index:
```json
PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "model_id": "nF7yX5cBsaYnPfyOq2SG",
        "chunking": true
      }
    }
  }
}
```
Chunking is performed using the fixed token length algorithm.
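Ingestion is unchanged when chunking is enabled: you index a long document as usual, and OpenSearch splits the text into fixed-length token chunks and generates one embedding per chunk behind the scenes. The document text in this sketch is illustrative:

```json
PUT /my-nlp-index/_doc/1
{
  "passage": "The first section describes the architecture in detail. The second section walks through deployment. The final section covers monitoring and troubleshooting, including common failure modes and how to recover from them."
}
```

At search time, queries against the `passage` field match at the chunk level, which is what improves relevance for long-form text.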
Limitations

Note the following limitations of the `semantic` field:

- When using a `semantic` field with a dense model, the automatically generated `knn_vector` subfield takes the `dimension` and `space_type` values from the model configuration, so you must ensure that this information is defined before using the model. Other `knn_vector` parameters use default values and cannot be customized.
- Text chunking uses a fixed token length algorithm with default settings. You cannot modify the chunking algorithm.
- For sparse models, OpenSearch applies a default prune ratio of `0.1` when generating sparse embeddings. This value is not configurable. Querying a `semantic` field with a sparse model is not supported by the `neural_sparse_two_phase_processor`, which is used to optimize search latency.
- Querying a `semantic` field from a remote cluster is not supported.
Next steps

- Using a `semantic` field with text embedding models for semantic search
- Using a `semantic` field with sparse encoding models for neural sparse search