Judgments

A judgment is a relevance rating assigned to a specific document in the context of a particular query. Multiple judgments are grouped together into judgment lists. Typically, judgments are categorized as two types—implicit and explicit:

Implicit judgments are ratings derived from user behavior (for example, what did the user see and select after searching?).
Humans have traditionally produced explicit judgments, but large language models (LLMs) are increasingly used for this task.

Search Relevance Workbench (SRW) supports all types of judgments:

Using LLMs as automated judges (an approach known as LLM-as-a-Judge) to generate judgments by evaluating search results using a prompt.
Generating implicit judgments based on data that adheres to the User Behavior Insights (UBI) schema specification.
Importing judgments that were collected using a process outside of SRW.

Using LLM-as-a-Judge

Generate explicit judgments with an LLM in SRW when you don’t have human annotators available, or you need to scale up the number of judgments beyond what humans can provide.

For step-by-step instructions, see Using LLM-as-a-Judge for search relevance.

Prerequisites

To use LLM-as-a-Judge, configure the following components:

A connector to an LLM to use for generating the judgments. For more information, see Connectors.
A query set: Together with the size parameter, the query set defines the scope for generating judgments. For each query, the top k documents are retrieved from the specified index, in which k is defined by the size parameter.
A search configuration: A search configuration defines how documents are retrieved for use in query-document pairs.

The AI-assisted judgment process consists of the following steps:

For each query, the top k documents are retrieved using the defined search configuration, which includes the index information. The query and each document from the result list create a query-document pair.
The LLM is then called with a predefined prompt to generate a judgment for each query-document pair.
All generated judgments are stored in the judgments cache index for reuse in future experiments.

To create a judgment list, provide the model ID of the LLM, an available query set, and a created search configuration.

The following example uses a generic prompt template with a scale of 0.0 to 1.0. To reduce the volume of data sent to the LLM (and therefore the cost), use the contextFields parameter to specify which fields from each result to include:

PUT _plugins/_search_relevance/judgments
{
    "name":"AI-assisted judgment list",
    "description": "Uses GPT-3.5-turbo to evaluate product search results",
    "type":"LLM_JUDGMENT",
    "modelId":"N8AE1osB0jLkkocYjz7D",
    "querySetId":"5f0115ad-94b9-403a-912f-3e762870ccf6",
    "searchConfigurationList":["2f90d4fd-bd5e-450f-95bb-eabe4a740bd1"],
    "size":5,
    "contextFields": ["title", "description", "category"],
    "llmJudgmentRatingType": "SCORE0_1",
    "promptTemplate": "Rate the relevance of these search results {{hits}} for the query '{{queryText}}' on a scale of 0-1, where 0 is completely irrelevant and 1 is perfectly relevant. Consider the product title, description, and category."
}

Request body fields

The following table lists the parameters for creating LLM-based judgments.

Parameter	Data type	Description
`name`	String	The name of the judgment list.
`description`	String	Optional. A description of the judgment list.
`type`	String	Set to `LLM_JUDGMENT`.
`modelId`	String	The ID of the deployed machine learning (ML) model to use for generating judgments. Must be a remote model connected to an external LLM service.
`querySetId`	String	The ID of the query set containing the queries to evaluate.
`searchConfigurationList`	Array of strings	The list of search configuration IDs to use for retrieving documents to evaluate.
`size`	Integer	The number of top documents to retrieve and evaluate for each query. Default is `10`.
`tokenLimit`	Integer	The maximum number of tokens to send to the LLM in a single request. Used to batch documents when the total content exceeds this limit. Default is `4,000`.
`contextFields`	Array of strings	Optional. Specifies which document fields to include when sending content to the LLM. If not specified, the entire document source is sent. Use this parameter to reduce costs and focus the LLM on relevant fields.
`ignoreFailure`	Boolean	Whether to continue processing other documents if the LLM fails to generate a judgment for some documents. Default is `false`.
`llmJudgmentRatingType`	String	The type of rating scale to use. Valid values are `SCORE0_1` (numeric scale 0–1) and `RELEVANT_IRRELEVANT` (binary relevant/irrelevant). Use `SCORE0_1` for graded relevance metrics such as NDCG. Use `RELEVANT_IRRELEVANT` for binary metrics such as precision and recall.
`promptTemplate`	String	Optional. A custom prompt template for the LLM. Supports `{{queryText}}` and `{{hits}}` placeholders. If not provided, the default template is used.
`overwriteCache`	Boolean	Whether to overwrite existing cached judgments for the same query-document pairs. Default is `false` (reuse cached judgments).

Custom prompt templates

You can customize the prompt template to focus on specific aspects of relevance:

PUT /_plugins/_search_relevance/judgments
{
  "name": "Custom Prompt Judgment",
  "type": "LLM_JUDGMENT",
  "modelId": "MODEL_ID_HERE",
  "querySetId": "QUERY_SET_ID_HERE",
  "searchConfigurationList": ["SEARCH_CONFIGURATION_ID_HERE"],
  "promptTemplate": "As an e-commerce search expert, evaluate how well these products {{hits}} match the user's search for '{{queryText}}'. Consider product relevance, brand reputation, and price competitiveness. Rate each result from 0-1.",
  "llmJudgmentRatingType": "SCORE0_1"
}

Binary relevance judgments

For simpler relevance assessment, you can use binary (relevant/irrelevant) judgments:

PUT /_plugins/_search_relevance/judgments
{
  "name": "Binary LLM Judgment",
  "type": "LLM_JUDGMENT",
  "modelId": "MODEL_ID_HERE",
  "querySetId": "QUERY_SET_ID_HERE",
  "searchConfigurationList": ["SEARCH_CONFIGURATION_ID_HERE"],
  "llmJudgmentRatingType": "RELEVANT_IRRELEVANT",
  "promptTemplate": "Determine if these search results {{hits}} are relevant or irrelevant for the query '{{queryText}}'. Consider exact matches and semantic relevance."
}

Using different LLM providers

You can adapt the connector configuration for other providers.

Amazon Bedrock example

The following example creates a connector for Amazon Bedrock:

POST /_plugins/_ml/connectors/_create
{
  "name": "Amazon Bedrock Connector",
  "description": "Connector to Amazon Bedrock",
  "version": "1",
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "us-east-1",
    "service_name": "bedrock",
    "model": "anthropic.claude-v2"
  },
  "credential": {
    "access_key": "YOUR_ACCESS_KEY",
    "secret_key": "YOUR_SECRET_KEY"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://bedrock-runtime.${parameters.region}.amazonaws.com/model/${parameters.model}/invoke",
      "request_body": "{ \"prompt\": \"${parameters.messages}\", \"max_tokens_to_sample\": 300 }"
    }
  ]
}

Implicit judgments

Implicit judgments are derived from past user interactions. SRW supports the Clicks Over Expected Clicks (COEC) click model, which uses impression and click signals to calculate judgments.

Input data must follow the UBI index schemas. COEC uses every event in the ubi_events index with an action_name of impression or click.

COEC calculates an expected click-through rate (CTR) for each rank by dividing the total number of clicks by the total number of impressions observed at that rank, based on all events in ubi_events. This ratio represents the expected CTR for that position.

For each document displayed in a hit list after a query, the average CTR at that rank serves as the expected value for the query-document pair. COEC calculates the actual CTR for the query-document pair and divides it by this expected rank-based CTR. Consequently, query-document pairs with a higher CTR than the average for that rank have a judgment value greater than 1. Conversely, if the CTR is lower than average, the judgment value is lower than 1.

Depending on the tracking implementation, multiple clicks for a single query can be recorded in the ubi_events index. Consequently, the average CTR can sometimes exceed 1 (or 100%).

For query-document observations that occur at different positions, all impressions and clicks are assumed to have occurred at the lowest (best) position. This aggregation approach biases the final judgment toward lower values, reflecting the common trend that higher-ranked results typically receive higher CTRs.

Example request

The following example creates an implicit judgment list using the COEC click model:

PUT _plugins/_search_relevance/judgments
{
  "name": "Implicit Judgments",
  "clickModel": "coec",
  "type": "UBI_JUDGMENT",
  "maxRank": 20
}

Request body fields

The following table lists the parameters for creating implicit judgments.

Parameter	Data type	Description
`name`	String	The name of the judgment list.
`clickModel`	String	The model used to calculate implicit judgments. Only `coec` (Clicks Over Expected Clicks) is supported.
`type`	String	Set to `UBI_JUDGMENT`.
`maxRank`	Integer	The maximum rank to consider when including events in the judgment calculation.
`startDate`	Date	An optional starting date from which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.
`endDate`	Date	An optional end date until which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.

Importing judgments

You may already have external processes for generating judgments. Regardless of the judgment type or the way they were generated, you can import them into SRW.

Example request

The following example imports a set of judgments for two queries:

PUT _plugins/_search_relevance/judgments
{
  "name": "Imported Judgments",
  "description": "Judgments generated outside SRW",
  "type": "IMPORT_JUDGMENT",
  "judgmentRatings": [
    {
      "query": "red dress",
        "ratings": [
          {
                    "docId": "B077ZJXCTS",
                    "rating": "3.000"
          },
          {
                    "docId": "B071S6LTJJ",
                    "rating": "2.000"
          },
          {
                    "docId": "B01IDSPDJI",
                    "rating": "2.000"
          },
          {
                    "docId": "B07QRCGL3G",
                    "rating": "0.000"
          },
          {
                    "docId": "B074V6Q1DR",
                    "rating": "1.000"
          }
        ]
      },
      {
        "query": "blue jeans",
        "ratings": [
          {
                    "docId": "B07L9V4Y98",
                    "rating": "0.000"
          },
          {
                    "docId": "B01N0DSRJC",
                    "rating": "1.000"
          },
          {
                    "docId": "B001CRAWCQ",
                    "rating": "1.000"
          },
          {
                    "docId": "B075DGJZRM",
                    "rating": "2.000"
          },
          {
                    "docId": "B009ZD297U",
                    "rating": "2.000"
          }
        ]
      }
  ]
}

Request body fields

The following table lists the parameters for importing judgments.

Parameter	Data type	Description
`name`	String	The name of the judgment list.
`description`	String	An optional description of the judgment list.
`type`	String	Set to `IMPORT_JUDGMENT`.
`judgmentRatings`	Array	A list of JSON objects containing the judgments. Judgments are grouped by query, each containing a nested map in which document IDs (`docId`) serve as keys and their floating-point ratings serve as values.

Managing judgment lists

You can retrieve or delete judgment lists using the following APIs.

Viewing a judgment list

Retrieve a judgment list by its ID.

Endpoint

GET _plugins/_search_relevance/judgments/{judgment_list_id}

Path parameters

The following table lists the available path parameters.

Parameter	Data type	Description
`judgment_list_id`	String	The ID of the judgment list to retrieve.

Example request

GET _plugins/_search_relevance/judgments/b54f791a-3b02-49cb-a06c-46ab650b2ade

Example response

Response

{
  "took": 36,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "search-relevance-judgment",
        "_id": "b54f791a-3b02-49cb-a06c-46ab650b2ade",
        "_score": 1,
        "_source": {
          "id": "b54f791a-3b02-49cb-a06c-46ab650b2ade",
          "timestamp": "2025-06-11T06:07:23.766Z",
          "name": "Imported Judgments",
          "status": "COMPLETED",
          "type": "IMPORT_JUDGMENT",
          "metadata": {},
          "judgmentRatings": [
            {
              "query": "red dress",
              "ratings": [
                {
                  "rating": "3.000",
                  "docId": "B077ZJXCTS"
                },
                {
                  "rating": "2.000",
                  "docId": "B071S6LTJJ"
                },
                {
                  "rating": "2.000",
                  "docId": "B01IDSPDJI"
                },
                {
                  "rating": "0.000",
                  "docId": "B07QRCGL3G"
                },
                {
                  "rating": "1.000",
                  "docId": "B074V6Q1DR"
                }
              ]
            },
            {
              "query": "blue jeans",
              "ratings": [
                {
                  "rating": "0.000",
                  "docId": "B07L9V4Y98"
                },
                {
                  "rating": "1.000",
                  "docId": "B01N0DSRJC"
                },
                {
                  "rating": "1.000",
                  "docId": "B001CRAWCQ"
                },
                {
                  "rating": "2.000",
                  "docId": "B075DGJZRM"
                },
                {
                  "rating": "2.000",
                  "docId": "B009ZD297U"
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

Deleting a judgment list

Delete a judgment list by its ID.

Endpoint

DELETE _plugins/_search_relevance/judgments/{judgment_list_id}

Example request

DELETE _plugins/_search_relevance/judgments/b54f791a-3b02-49cb-a06c-46ab650b2ade

Example response

{
  "_index": "search-relevance-judgment",
  "_id": "b54f791a-3b02-49cb-a06c-46ab650b2ade",
  "_version": 3,
  "result": "deleted",
  "forced_refresh": true,
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 156,
  "_primary_term": 1
}

Searching for a judgment list

Search for judgment lists using query domain-specific language (DSL). The response excludes judgmentRatings.ratings by default; to include it, specify the _source field in the query.

Endpoints

GET _plugins/_search_relevance/judgments/_search
POST _plugins/_search_relevance/judgments/_search

Example request

The following example searches for judgment lists that include the exact query red dress:

GET _plugins/_search_relevance/judgments/_search
{
  "query": {
    "nested": {
      "path": "judgmentRatings",
      "query": {
        "match_phrase": {
          "judgmentRatings.query": "red dress"
        }
      }
    }
  }
}

Example response

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 4.5558767,
    "hits": [
      {
        "_index": "search-relevance-judgment",
        "_id": "505d00cf-2fce-422b-bb97-2e3a95ce9446",
        "_score": 4.5558767,
        "_source": {
          "metadata": {},
          "name": "Imported Judgments",
          "judgmentRatings": [
            {
              "query": "red dress"
            },
            {
              "query": "blue jeans"
            }
          ],
          "id": "505d00cf-2fce-422b-bb97-2e3a95ce9446",
          "type": "IMPORT_JUDGMENT",
          "timestamp": "2026-01-28T18:16:44.218Z",
          "status": "COMPLETED"
        }
      }
    ]
  }
}

Automate search relevance evaluation using LLMs

Using LLM-as-a-Judge
Implicit judgments
- Example request
- Request body fields
Importing judgments
- Example request
- Request body fields
Managing judgment lists
Related documentation

WAS THIS PAGE HELPFUL?

✔ Yes ✖ No

Tell us why

350 characters left

Have a question? Ask us on the OpenSearch forum.

Want to contribute? Edit this page or create an issue.

Judgments

Using LLM-as-a-Judge

Prerequisites

Request body fields

Custom prompt templates

Binary relevance judgments

Using different LLM providers

Amazon Bedrock example

Implicit judgments

Example request

Request body fields

Importing judgments

Example request

Request body fields

Managing judgment lists

Viewing a judgment list

Endpoint

Path parameters

Example request

Example response

Deleting a judgment list

Endpoint

Example request

Example response

Searching for a judgment list

Endpoints

Example request

Example response

OpenSearch Links

Get Involved

Resources

Contact Us

Judgments

Using LLM-as-a-Judge

Prerequisites

Request body fields

Custom prompt templates

Binary relevance judgments

Using different LLM providers

Amazon Bedrock example

Implicit judgments

Example request

Request body fields

Importing judgments

Example request

Request body fields

Managing judgment lists

Viewing a judgment list

Endpoint

Path parameters

Example request

Example response

Deleting a judgment list

Endpoint

Example request

Example response

Searching for a judgment list

Endpoints

Example request

Example response

Related documentation