
Evaluating search quality

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated GitHub issue.

Search Relevance Workbench can run pointwise experiments to evaluate search configuration quality using provided queries and relevance judgments.

For more information about creating a query set, see Query sets.

For more information about creating search configurations, see Search configurations.

For more information about creating judgments, see Judgments.

Creating a pointwise experiment

A pointwise experiment compares your search configuration results against provided relevance judgments to evaluate search quality.

Example request

PUT _plugins/_search_relevance/experiments
{
  "querySetId": "a02cedc2-249d-41de-be3e-662f6f221689",
  "searchConfigurationList": ["4f90e474-0806-4dd2-a8dd-0fb8a5f836eb"],
  "judgmentList": ["d3d93bb3-2cf4-4da0-8d31-c298427c2756"],
  "size": 8,
  "type": "POINTWISE_EVALUATION"
}

Request body fields

The following table lists the available input parameters.

Field Data type Description
querySetId String The ID of the query set.
searchConfigurationList List A list of search configuration IDs to use for comparison.
judgmentList Array[String] A list of judgment IDs to use for evaluating search accuracy.
size Integer The number of documents to return in the results.
type String The type of experiment to run. Valid values are PAIRWISE_COMPARISON, HYBRID_OPTIMIZER, and POINTWISE_EVALUATION; each type requires different request body fields. PAIRWISE_COMPARISON compares two search configurations against a query set, HYBRID_OPTIMIZER is used for combining results, and POINTWISE_EVALUATION (used in this example) evaluates a search configuration against judgments.

Example response

{
  "experiment_id": "d707fa0f-3901-4c8b-8645-9a17e690722b",
  "experiment_result": "CREATED"
}

Managing the results

To retrieve experiment results, follow the same process used for comparing query sets in pairwise experiments.
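For example, assuming the same experiment retrieval endpoint described for pairwise experiments, a request of the following form returns a single experiment by the experiment ID returned when the experiment was created:

GET _plugins/_search_relevance/experiments/<experiment_id>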

The following is an example response for a completed experiment:

{
    "took": 140,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": ".plugins-search-relevance-experiment",
                "_id": "bb609dc9-e357-42ec-a956-92b43be0a3ab",
                "_score": 1.0,
                "_source": {
                    "id": "bb609dc9-e357-42ec-a956-92b43be0a3ab",
                    "timestamp": "2025-06-13T08:06:46.046Z",
                    "type": "POINTWISE_EVALUATION",
                    "status": "COMPLETED",
                    "querySetId": "a02cedc2-249d-41de-be3e-662f6f221689",
                    "searchConfigurationList": [
                        "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb"
                    ],
                    "judgmentList": [
                        "d3d93bb3-2cf4-4da0-8d31-c298427c2756"
                    ],
                    "size": 8,
                    "results": [
                        {
                            "evaluationId": "10c60fee-11ca-49b0-9e8a-82cb7b2c044b",
                            "searchConfigurationId": "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb",
                            "queryText": "tv"
                        },
                        {
                            "evaluationId": "c03a5feb-8dc2-4f7f-9d31-d99bfb392116",
                            "searchConfigurationId": "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb",
                            "queryText": "led tv"
                        }
                    ]
                }
            }
        ]
    }
}

The results include an evaluation result ID for each search configuration. To view detailed results, query the search-relevance-evaluation-result index using this ID.
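For example, one way to do this is a standard search request with an ids query against the index, using the evaluation ID returned in the experiment results above:

GET search-relevance-evaluation-result/_search
{
  "query": {
    "ids": {
      "values": ["10c60fee-11ca-49b0-9e8a-82cb7b2c044b"]
    }
  }
}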

The following is an example of the detailed results:

{
    "took": 59,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "search-relevance-evaluation-result",
                "_id": "10c60fee-11ca-49b0-9e8a-82cb7b2c044b",
                "_score": 1.0,
                "_source": {
                    "id": "10c60fee-11ca-49b0-9e8a-82cb7b2c044b",
                    "timestamp": "2025-06-13T08:06:40.869Z",
                    "searchConfigurationId": "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb",
                    "searchText": "tv",
                    "judgmentIds": [
                        "d3d93bb3-2cf4-4da0-8d31-c298427c2756"
                    ],
                    "documentIds": [
                        "B07Q7VGW4Q",
                        "B00GXD4NWE",
                        "B07VML1CY1",
                        "B07THVCJK3",
                        "B07RKSV7SW",
                        "B010EAW8UK",
                        "B07FPP6TB5",
                        "B073G9ZD33"
                    ],
                    "metrics": [
                        {
                            "metric": "Coverage@8",
                            "value": 0.0
                        },
                        {
                            "metric": "Precision@8",
                            "value": 0.0
                        },
                        {
                            "metric": "MAP@8",
                            "value": 0.0
                        },
                        {
                            "metric": "NDCG@8",
                            "value": 0.0
                        }
                    ]
                }
            }
        ]
    }
}

The results include the original request parameters along with the following metric values:

  • Coverage@k: The proportion of scored documents from the judgment set, calculated as the number of documents with scores divided by the total number of documents.

  • Precision@k: The proportion of documents with nonzero judgment scores out of k (or out of the total number of returned documents, if lower).

  • MAP@k: The Mean Average Precision, which calculates the average precision across all documents. For more information, see Average precision.

  • NDCG@k: The Normalized Discounted Cumulative Gain, which compares the actual ranking of results against a perfect ranking, with higher weights given to top results. This measures the quality of result ordering.
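As a simple hypothetical illustration of the first two metrics, if 4 of the 8 returned documents have judgment scores and 2 of those scores are nonzero, then Coverage@8 = 4/8 = 0.5 and Precision@8 = 2/8 = 0.25. In the example response above, Coverage@8 is 0.0, indicating that none of the returned documents have judgment scores, which is why the remaining metrics are also 0.0.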
