Evaluating search quality
Search Relevance Workbench can run pointwise experiments to evaluate search configuration quality using provided queries and relevance judgments.
For more information about creating a query set, see Query sets.
For more information about creating search configurations, see Search Configurations.
For more information about creating judgments, see Judgments.
Creating a pointwise experiment
A pointwise experiment compares your search configuration results against provided relevance judgments to evaluate search quality.
Example request
PUT _plugins/_search_relevance/experiments
{
  "querySetId": "a02cedc2-249d-41de-be3e-662f6f221689",
  "searchConfigurationList": ["4f90e474-0806-4dd2-a8dd-0fb8a5f836eb"],
  "judgmentList": ["d3d93bb3-2cf4-4da0-8d31-c298427c2756"],
  "size": 8,
  "type": "POINTWISE_EVALUATION"
}
Request body fields
The following table lists the available input parameters.
| Field | Data type | Description | 
|---|---|---|
| querySetId | String | The ID of the query set. | 
| searchConfigurationList | Array[String] | A list of search configuration IDs to use for comparison. | 
| judgmentList | Array[String] | A list of judgment IDs to use for evaluating search accuracy. | 
| size | Integer | The number of documents to return in the results. | 
| type | String | The type of experiment to run. Valid values are PAIRWISE_COMPARISON, HYBRID_OPTIMIZER, or POINTWISE_EVALUATION. Depending on the experiment type, you must provide different body fields in the request. PAIRWISE_COMPARISON compares two search configurations against a query set. HYBRID_OPTIMIZER is for combining results. POINTWISE_EVALUATION evaluates a search configuration against judgments and is the type used on this page. | 
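The request body fields can be assembled programmatically before sending the request. The following Python sketch is one possible way to do that. The helper name `build_pointwise_experiment_body` and its validation rules (non-empty ID lists, positive `size`) are assumptions made for illustration and are not part of the plugin API. The sketch only builds the JSON body and does not send it.

```python
import json


def build_pointwise_experiment_body(query_set_id, search_configuration_ids,
                                    judgment_ids, size):
    """Assemble a pointwise experiment request body (hypothetical helper)."""
    # These checks are assumptions for illustration, not documented server rules.
    if not search_configuration_ids:
        raise ValueError("at least one search configuration ID is required")
    if not judgment_ids:
        raise ValueError("at least one judgment ID is required")
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return {
        "querySetId": query_set_id,
        "searchConfigurationList": list(search_configuration_ids),
        "judgmentList": list(judgment_ids),
        "size": size,
        "type": "POINTWISE_EVALUATION",
    }


body = build_pointwise_experiment_body(
    "a02cedc2-249d-41de-be3e-662f6f221689",
    ["4f90e474-0806-4dd2-a8dd-0fb8a5f836eb"],
    ["d3d93bb3-2cf4-4da0-8d31-c298427c2756"],
    8,
)
# Serialize the body for use with PUT _plugins/_search_relevance/experiments.
print(json.dumps(body, indent=2))
```

The printed JSON matches the example request body and can be sent with any HTTP client.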
Example response
{
  "experiment_id": "d707fa0f-3901-4c8b-8645-9a17e690722b",
  "experiment_result": "CREATED"
}
Managing the results
To retrieve experiment results, follow the same process used for comparing query sets in pairwise experiments.
The following is an example response for a completed experiment:
{
    "took": 140,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": ".plugins-search-relevance-experiment",
                "_id": "bb609dc9-e357-42ec-a956-92b43be0a3ab",
                "_score": 1.0,
                "_source": {
                    "id": "bb609dc9-e357-42ec-a956-92b43be0a3ab",
                    "timestamp": "2025-06-13T08:06:46.046Z",
                    "type": "POINTWISE_EVALUATION",
                    "status": "COMPLETED",
                    "querySetId": "a02cedc2-249d-41de-be3e-662f6f221689",
                    "searchConfigurationList": [
                        "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb"
                    ],
                    "judgmentList": [
                        "d3d93bb3-2cf4-4da0-8d31-c298427c2756"
                    ],
                    "size": 8,
                    "results": [
                        {
                            "evaluationId": "10c60fee-11ca-49b0-9e8a-82cb7b2c044b",
                            "searchConfigurationId": "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb",
                            "queryText": "tv"
                        },
                        {
                            "evaluationId": "c03a5feb-8dc2-4f7f-9d31-d99bfb392116",
                            "searchConfigurationId": "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb",
                            "queryText": "led tv"
                        }
                    ]
                }
            }
        ]
    }
}
The results include an evaluation ID (evaluationId) for each combination of query and search configuration. To view detailed results for an evaluation, query the search-relevance-evaluation-result index using this ID.
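As a minimal sketch of this lookup, the following Python code collects the evaluation IDs from an experiment response and builds a search body for one of them. The helper names are hypothetical. The sketch assumes that each evaluation result document uses the evaluation ID as its document `_id`, as in the examples on this page, so that a standard `ids` query can match it. The resulting body would be sent to `GET search-relevance-evaluation-result/_search`.

```python
def evaluation_ids(experiment_response):
    """Collect (queryText, evaluationId) pairs from an experiment search response."""
    pairs = []
    for hit in experiment_response["hits"]["hits"]:
        for result in hit["_source"].get("results", []):
            pairs.append((result["queryText"], result["evaluationId"]))
    return pairs


def evaluation_result_query(evaluation_id):
    """Build a search body matching one evaluation result by document ID.

    Assumes the document _id equals the evaluation ID.
    """
    return {"query": {"ids": {"values": [evaluation_id]}}}


# Abbreviated version of the completed experiment response shown above.
experiment_response = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "status": "COMPLETED",
                    "results": [
                        {
                            "evaluationId": "10c60fee-11ca-49b0-9e8a-82cb7b2c044b",
                            "queryText": "tv",
                        },
                        {
                            "evaluationId": "c03a5feb-8dc2-4f7f-9d31-d99bfb392116",
                            "queryText": "led tv",
                        },
                    ],
                }
            }
        ]
    }
}

for query_text, evaluation_id in evaluation_ids(experiment_response):
    print(query_text, evaluation_result_query(evaluation_id))
```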
The following is an example of the detailed results:
{
    "took": 59,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "search-relevance-evaluation-result",
                "_id": "10c60fee-11ca-49b0-9e8a-82cb7b2c044b",
                "_score": 1.0,
                "_source": {
                    "id": "10c60fee-11ca-49b0-9e8a-82cb7b2c044b",
                    "timestamp": "2025-06-13T08:06:40.869Z",
                    "searchConfigurationId": "4f90e474-0806-4dd2-a8dd-0fb8a5f836eb",
                    "searchText": "tv",
                    "judgmentIds": [
                        "d3d93bb3-2cf4-4da0-8d31-c298427c2756"
                    ],
                    "documentIds": [
                        "B07Q7VGW4Q",
                        "B00GXD4NWE",
                        "B07VML1CY1",
                        "B07THVCJK3",
                        "B07RKSV7SW",
                        "B010EAW8UK",
                        "B07FPP6TB5",
                        "B073G9ZD33"
                    ],
                    "metrics": [
                        {
                            "metric": "Coverage@8",
                            "value": 0.0
                        },
                        {
                            "metric": "Precision@8",
                            "value": 0.0
                        },
                        {
                            "metric": "MAP@8",
                            "value": 0.0
                        },
                        {
                            "metric": "NDCG@8",
                            "value": 0.0
                        }
                    ]
                }
            }
        ]
    }
}
The results include the original request parameters along with the following metric values:
-  Coverage@k: The proportion of scored documents from the judgment set, calculated as the number of documents with scores divided by the total number of documents.
-  Precision@k: The proportion of documents with nonzero judgment scores out of k (or out of the total number of returned documents, if lower).
-  MAP@k: The Mean Average Precision, which calculates the average precision across all documents. For more information, see Average precision.
-  NDCG@k: The Normalized Discounted Cumulative Gain, which compares the actual ranking of results against a perfect ranking, with higher weights given to top results. This measures the quality of result ordering.
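The metric definitions above can be illustrated for a single query with a short Python sketch. This is not the plugin's implementation, and several details are assumptions: `judgments` is a hypothetical mapping from document ID to judgment score, coverage is computed over the returned documents, average precision is normalized by the number of relevant documents retrieved, and NDCG uses linear gain with a base-2 logarithmic discount. The plugin may use different conventions.

```python
import math


def coverage_at_k(doc_ids, judgments, k):
    """Share of the top-k returned documents that have any judgment score."""
    top = doc_ids[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in judgments) / len(top)


def precision_at_k(doc_ids, judgments, k):
    """Share of the top-k returned documents with a nonzero judgment score."""
    top = doc_ids[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if judgments.get(d, 0) > 0) / len(top)


def average_precision_at_k(doc_ids, judgments, k):
    """Mean of precision values at each rank where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, d in enumerate(doc_ids[:k], start=1):
        if judgments.get(d, 0) > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def ndcg_at_k(doc_ids, judgments, k):
    """DCG of the actual ranking divided by the DCG of an ideal ranking."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = [judgments.get(d, 0) for d in doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical data: four returned documents, three judged documents.
returned = ["a", "b", "c", "d"]
judgments = {"a": 3, "c": 1, "x": 2}
for name, fn in [("Coverage", coverage_at_k), ("Precision", precision_at_k),
                 ("MAP", average_precision_at_k), ("NDCG", ndcg_at_k)]:
    print(f"{name}@4 = {fn(returned, judgments, 4):.4f}")
```

Under these assumptions, all four functions return 0.0 when none of the returned documents appear in the judgments, which is one way that a result like the example above, with every metric at 0.0, can arise.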
To review these results visually, see Exploring search evaluation results.