
Semantic field type

Introduced 3.1

The semantic field type is a high-level abstraction that simplifies neural search setup in OpenSearch. It can wrap a variety of field types, including all string and binary fields. The semantic field type automatically enables semantic indexing and querying based on the configured machine learning (ML) model.

PREREQUISITE
Before using the semantic field type, you must configure either a local ML model hosted on your OpenSearch cluster or an externally hosted model connected to your OpenSearch cluster. For more information about local models, see Using ML models within OpenSearch. For more information about externally hosted models, see Connecting to externally hosted models.
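For example, you can register and deploy one of the OpenSearch pretrained dense models through the ML Commons API. Registration runs asynchronously and returns a task ID that you can poll to obtain the model ID (the model name and version below are one entry from the pretrained model catalog; substitute the model you want to use):

```json
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
```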

Example: Dense embedding model

Once you configure a model, you can use it to create an index with a semantic field. This example assumes that you have configured a dense embedding model with the ID n17yX5cBsaYnPfyOzmQU in your cluster:

PUT /my-nlp-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "model_id": "n17yX5cBsaYnPfyOzmQU"
      }
    }
  }
}

After creating the index, you can retrieve its mapping to verify that a passage_semantic_info field was automatically created. The passage_semantic_info field contains a knn_vector subfield for storing the dense embedding and additional metadata fields for capturing information such as the model ID, model name, and model type:

GET /my-nlp-index/_mapping
{
  "my-nlp-index": {
    "mappings": {
      "properties": {
        "passage": {
          "type": "semantic",
          "model_id": "n17yX5cBsaYnPfyOzmQU",
          "raw_field_type": "text"
        },
        "passage_semantic_info": {
          "properties": {
            "embedding": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "faiss",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            },
            "model": {
              "properties": {
                "id": {
                  "type": "text",
                  "index": false
                },
                "name": {
                  "type": "text",
                  "index": false
                },
                "type": {
                  "type": "text",
                  "index": false
                }
              }
            }
          }
        }
      }
    }
  }
}

The dimension and space_type of the knn_vector field are determined by the ML model configuration. For pretrained dense models, this information is included in the default model configuration. For externally hosted dense embedding models, you must explicitly define the dimension and space_type in the model configuration before using the model with a semantic field.

The autogenerated knn_vector subfield supports additional settings that are not currently configurable in the semantic field. For more information, see Limitations.
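Once the index exists, you can ingest documents and query the semantic field directly with a neural query. The model configured in the mapping is used automatically, so no model ID is required at query time. A minimal sketch (the document text and query text are illustrative):

```json
PUT /my-nlp-index/_doc/1
{
  "passage": "A hiker watches the sunrise from the summit."
}
```

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage": {
        "query_text": "mountain sunrise",
        "k": 10
      }
    }
  }
}
```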

Example: Sparse encoding model

Once you configure a model, you can use it to create an index with a semantic field. This example assumes that you have configured a sparse encoding model with the ID nF7yX5cBsaYnPfyOq2SG in your cluster:

PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "model_id": "nF7yX5cBsaYnPfyOq2SG"
      }
    }
  }
}

After creating the index, you can retrieve its mapping to verify that a passage_semantic_info field containing a rank_features embedding subfield was automatically created:

GET /my-nlp-index/_mapping
{
  "my-nlp-index": {
    "mappings": {
      "properties": {
        "passage": {
          "type": "semantic",
          "model_id": "nF7yX5cBsaYnPfyOq2SG",
          "raw_field_type": "text"
        },
        "passage_semantic_info": {
          "properties": {
            "embedding": {
              "type": "rank_features"
            },
            "model": {
              "properties": {
                "id": {
                  "type": "text",
                  "index": false
                },
                "name": {
                  "type": "text",
                  "index": false
                },
                "type": {
                  "type": "text",
                  "index": false
                }
              }
            }
          }
        }
      }
    }
  }
}
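Querying works the same way as with a dense model: a neural query against the semantic field uses the configured sparse encoding model to expand the query text. A minimal sketch (the query text is illustrative; k is omitted because it applies only to dense vector search):

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage": {
        "query_text": "mountain sunrise"
      }
    }
  }
}
```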

Parameters

The semantic field type supports the following parameters.

| Parameter | Data type | Required/Optional | Description |
| --- | --- | --- | --- |
| type | String | Required | Must be set to semantic. |
| raw_field_type | String | Optional | The underlying field type wrapped by the semantic field. The raw input is stored as this type at the path of the semantic field, allowing it to behave like a standard field of that type. Valid values are text, keyword, match_only_text, wildcard, token_count, and binary. Default is text. You can use any parameters supported by the underlying field type; those parameters function as expected. |
| model_id | String | Required | The ID of the ML model used to generate embeddings from field values during indexing and from query input during search. |
| search_model_id | String | Optional | The ID of the ML model used specifically for query-time embedding generation. If not specified, model_id is used. Cannot be specified together with semantic_field_search_analyzer. |
| semantic_info_field_name | String | Optional | A custom name for the internal metadata field that stores the embedding and model information. By default, this field name is derived by appending _semantic_info to the semantic field name. |
| chunking | Boolean | Optional | Enables fixed-length token chunking during ingestion. When enabled, the input is split into chunks using a default configuration. See Text chunking. |
| semantic_field_search_analyzer | String | Optional | Specifies an analyzer for tokenizing the query input when using a sparse model. Valid values are standard, bert-uncased, and mbert-uncased. Cannot be used together with search_model_id. For more information, see Analyzers. |
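As a sketch of how several optional parameters combine, the following mapping wraps a text field, renames the metadata field, and enables chunking (the model ID is the dense example ID from above; the field names are illustrative):

```json
PUT /my-nlp-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "raw_field_type": "text",
        "model_id": "n17yX5cBsaYnPfyOzmQU",
        "semantic_info_field_name": "passage_embeddings",
        "chunking": true
      }
    }
  }
}
```

Here the raw text remains searchable as a standard text field at passage, while the embeddings and model metadata are stored under passage_embeddings instead of the default passage_semantic_info.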

Text chunking

By default, text chunking is disabled for semantic fields. This is because enabling chunking requires storing each chunk’s embedding in a nested object, which can increase search latency. Searching nested objects requires joining child documents to their parent, along with additional scoring and aggregation logic. The more matching child documents there are, the higher the potential latency.

If you’re working with long-form text and want to improve search relevance, you can enable chunking by setting the chunking parameter for the semantic field to true when creating an index:

PUT /my-nlp-index
{
  "mappings": {
    "properties": {
      "passage": {
        "type": "semantic",
        "model_id": "nF7yX5cBsaYnPfyOq2SG",
        "chunking": true
      }
    }
  }
}

Chunking is performed using the fixed token length algorithm.

Limitations

Note the following limitations of the semantic field:

  • When using a semantic field with a dense model, the automatically generated knn_vector subfield takes the dimension and space_type values from the model configuration, so you must ensure that this information is defined before using the model. Other knn_vector parameters use default values and cannot be customized.

  • Text chunking uses a fixed token length algorithm with default settings. You cannot modify the chunking algorithm.

  • For sparse models, OpenSearch applies a default prune ratio of 0.1 when generating sparse embeddings. This value is not configurable. Querying a semantic field with a sparse model is not supported by the neural_sparse_two_phase_processor, which is used to optimize search latency.

  • Querying a semantic field from a remote cluster is not supported.
