Using LLM-as-a-Judge for search relevance
LLM-as-a-Judge is a technique that uses large language models (LLMs) to automatically evaluate search result relevance. Manually annotating search results is time-consuming and inconsistent across annotators. LLM-as-a-Judge automates this process, enabling frequent and repeatable evaluation of search quality.
After completing this tutorial, you can run an experiment to evaluate search quality using the LLM-generated judgments.
Prerequisites
For this tutorial, you need an API key for an external LLM provider, such as OpenAI or Amazon Bedrock.
Using an external LLM incurs API costs based on the number of queries and results evaluated.
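To get a rough sense of scale before running anything, the following minimal sketch computes an upper bound on the number of query-document pairs that may be sent for rating. It assumes the bound is the number of queries multiplied by the number of results requested per query (the `size` value used in Step 5) and by the number of search configurations. The function name `max_rated_pairs` is hypothetical, and the real count can be lower, for example when a query returns fewer hits than `size`.

```python
def max_rated_pairs(num_queries: int, size: int, num_configs: int = 1) -> int:
    """Upper bound on query-document pairs submitted for LLM rating.

    Assumption: each query contributes at most `size` hits per search
    configuration. This counts pairs only; it does not estimate tokens or price.
    """
    if num_queries < 0 or size < 0 or num_configs < 0:
        raise ValueError("all arguments must be non-negative")
    return num_queries * size * num_configs


# The query set in this tutorial has 3 queries and Step 5 requests 10 results each.
pairs = max_rated_pairs(num_queries=3, size=10)
print(pairs)
```

Because the example index holds only five documents, the actual number of pairs rated in this tutorial stays below this bound.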
Enable the Search Relevance Workbench and configure the following settings:
PUT /_cluster/settings
{
  "persistent": {
    "plugins.search_relevance.workbench_enabled": true,
    "plugins.ml_commons.only_run_on_ml_node": "false",
    "plugins.ml_commons.model_access_control_enabled": "true",
    "plugins.ml_commons.allow_registering_model_via_url": "true"
  }
}
Step 1: Configure a model
First, create a connector to an externally hosted LLM. This tutorial uses OpenAI, but you can adapt it for other providers such as Amazon Bedrock. Replace YOUR_API_KEY with your OpenAI API key:
POST /_plugins/_ml/connectors/_create
{
  "name": "OpenAI Chat Connector",
  "description": "Connector to OpenAI Chat API for LLM judgments",
  "version": "1",
  "protocol": "http",
  "parameters": {
    "endpoint": "api.openai.com",
    "model": "gpt-3.5-turbo"
  },
  "credential": {
    "openAI_key": "YOUR_API_KEY"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://api.openai.com/v1/chat/completions",
      "headers": {
        "Authorization": "Bearer ${credential.openAI_key}",
        "Content-Type": "application/json"
      },
      "request_body": "{ \"model\": \"${parameters.model}\", \"messages\": ${parameters.messages}, \"temperature\": 0 }"
    }
  ]
}
Then register and deploy the model. Replace {connector_id} with the ID returned in the previous response:
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "openai_gpt-3.5-turbo",
  "function_name": "remote",
  "description": "External LLM model via OpenAI",
  "connector_id": "{connector_id}"
}
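This is an asynchronous operation: the register request returns a task_id rather than the model itself. To verify the task status, call the Get ML task API with that task_id. Once the task state is COMPLETED, the response contains the model_id that you’ll use in the following steps.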
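The polling loop implied here can be sketched as follows. This is a minimal illustration, not part of OpenSearch: the function `wait_for_model_id` and its parameters are hypothetical, the task lookup is passed in as a callable so the sketch runs without a cluster, and the failure state names are assumptions. In practice, the callable would issue `GET /_plugins/_ml/tasks/{task_id}` and return the parsed JSON response.

```python
import time
from typing import Any, Callable, Dict


def wait_for_model_id(
    fetch_task: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a task until its state is COMPLETED, then return its model_id.

    `fetch_task` is a caller-supplied function that returns the task
    document as a dict (for example, the parsed Get ML task response).
    """
    for _ in range(max_attempts):
        task = fetch_task()
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        # Assumed terminal failure states; adjust to what your cluster reports.
        if state in ("FAILED", "COMPLETED_WITH_ERROR"):
            raise RuntimeError(f"model registration ended in state {state}")
        sleep(interval_s)
    raise TimeoutError("task did not complete within the allowed attempts")


# Demonstration with canned responses instead of a live cluster.
_responses = iter([
    {"state": "CREATED"},
    {"state": "RUNNING"},
    {"state": "COMPLETED", "model_id": "example-model-id"},
])
model_id = wait_for_model_id(lambda: next(_responses), sleep=lambda s: None)
print(model_id)
```

Injecting `fetch_task` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to test.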
Step 2: Create a search index
Create a products index:
PUT /products
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "category": { "type": "keyword" },
      "brand": { "type": "keyword" },
      "price": { "type": "float" }
    }
  }
}
Index example documents into the index:
POST /products/_bulk
{"index":{"_id":"1"}}
{"title":"Samsung 55-inch 4K Smart TV","description":"Ultra HD Smart TV with HDR and built-in streaming apps","category":"Electronics","brand":"Samsung","price":599.99}
{"index":{"_id":"2"}}
{"title":"LG 65-inch OLED TV","description":"Premium OLED display with perfect blacks and vibrant colors","category":"Electronics","brand":"LG","price":1299.99}
{"index":{"_id":"3"}}
{"title":"Sony Wireless Headphones","description":"Noise-canceling over-ear headphones with 30-hour battery","category":"Electronics","brand":"Sony","price":199.99}
{"index":{"_id":"4"}}
{"title":"Apple MacBook Pro 14-inch","description":"Professional laptop with M2 chip and Retina display","category":"Computers","brand":"Apple","price":1999.99}
{"index":{"_id":"5"}}
{"title":"Dell Gaming Monitor 27-inch","description":"High refresh rate gaming monitor with G-Sync support","category":"Computers","brand":"Dell","price":399.99}
Step 3: Create a search configuration
A search configuration defines a search strategy to evaluate. The %SearchText% placeholder is replaced with each query from the query set during evaluation:
PUT /_plugins/_search_relevance/search_configurations
{
  "name": "baseline",
  "query": "{\"query\":{\"multi_match\":{\"query\":\"%SearchText%\",\"fields\":[\"title\",\"description\",\"category\",\"brand\"]}}}",
  "index": "products"
}
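To make the placeholder substitution concrete, the following sketch shows what the stored query string becomes for one query from the query set. It only illustrates the idea: `render_query` is a hypothetical helper, and the JSON escaping of the query text is a choice made in this sketch, not a documented description of how the plugin performs the replacement.

```python
import json

PLACEHOLDER = "%SearchText%"


def render_query(template: str, search_text: str) -> dict:
    """Substitute the placeholder and parse the result as a query body."""
    # json.dumps escapes quotes and backslashes; strip its surrounding quotes
    # because the template already wraps the placeholder in quotes.
    escaped = json.dumps(search_text)[1:-1]
    return json.loads(template.replace(PLACEHOLDER, escaped))


# The same query string as in the "baseline" search configuration.
template = (
    '{"query":{"multi_match":{"query":"%SearchText%",'
    '"fields":["title","description","category","brand"]}}}'
)

body = render_query(template, "smart tv")
print(json.dumps(body, indent=2))
```

The resulting dictionary is an ordinary `multi_match` query body that could be sent to the `products` index with the Search API.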
Step 4: Create a query set
Create a query set containing test queries for evaluation:
PUT /_plugins/_search_relevance/query_sets
{
  "name": "Electronics Queries",
  "description": "Test queries for electronics products",
  "sampling": "manual",
  "querySetQueries": [
    {"queryText": "smart tv"},
    {"queryText": "laptop computer"},
    {"queryText": "wireless headphones"}
  ]
}
Step 5: Generate LLM judgments
Create an LLM judgment that uses your deployed model to evaluate search results. Replace {model_id}, {query_set_id}, and {search_configuration_id} with the IDs returned in previous steps:
PUT /_plugins/_search_relevance/judgments
{
  "name": "LLM Judgment via OpenAI",
  "description": "Uses GPT-3.5-turbo to evaluate product search results",
  "type": "LLM_JUDGMENT",
  "modelId": "{model_id}",
  "querySetId": "{query_set_id}",
  "searchConfigurationList": ["{search_configuration_id}"],
  "size": 10,
  "tokenLimit": 4000,
  "contextFields": ["title", "description", "category"],
  "ignoreFailure": false,
  "llmJudgmentRatingType": "SCORE0_1",
  "promptTemplate": "Rate the relevance of these search results {{hits}} for the query '{{queryText}}' on a scale of 0-1, where 0 is completely irrelevant and 1 is perfectly relevant. Consider the product title, description, and category.",
  "overwriteCache": false
}
For a description of all request body parameters, see Judgments.
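To see how `promptTemplate` and `contextFields` might interact, the following sketch fills the `{{queryText}}` and `{{hits}}` variables for a single hit. It is an approximation under stated assumptions: `render_prompt` is a hypothetical helper, and serializing the hits as a JSON list restricted to the context fields is a guess at the general shape, not the plugin's documented behavior.

```python
import json
from typing import Any, Dict, List


def render_prompt(
    template: str,
    query_text: str,
    hits: List[Dict[str, Any]],
    context_fields: List[str],
) -> str:
    """Fill {{queryText}} and {{hits}}, keeping only the context fields."""
    trimmed = []
    for hit in hits:
        source = hit["_source"]
        doc = {"id": hit["_id"]}
        # Copy only the fields listed in contextFields (assumed behavior).
        doc.update({f: source[f] for f in context_fields if f in source})
        trimmed.append(doc)
    prompt = template.replace("{{queryText}}", query_text)
    return prompt.replace("{{hits}}", json.dumps(trimmed))


# One hit shaped like a search response entry for document 1 of this tutorial.
hits = [{
    "_id": "1",
    "_source": {
        "title": "Samsung 55-inch 4K Smart TV",
        "description": "Ultra HD Smart TV with HDR and built-in streaming apps",
        "category": "Electronics",
        "brand": "Samsung",
        "price": 599.99,
    },
}]
template = (
    "Rate the relevance of these search results {{hits}} "
    "for the query '{{queryText}}' on a scale of 0-1."
)
prompt = render_prompt(template, "smart tv", hits, ["title", "description", "category"])
print(prompt)
```

In this sketch, `brand` and `price` never reach the model because they are not listed in `contextFields`, which is one way to keep prompts within `tokenLimit`.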
The judgment process runs asynchronously. To verify the status, retrieve the judgment by its ID:
GET /search-relevance-judgment/_doc/{judgment_id}
When the status field is COMPLETED, the judgmentRatings array contains the generated relevance scores for each query-document pair.
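Once the judgment is complete, the ratings can be flattened into rows for inspection. The sketch below assumes each `judgmentRatings` entry has a `query` string and a `ratings` list of objects with `docId` and `rating`; these inner field names, the helper `flatten_ratings`, and the sample values are assumptions for illustration, so check them against the document your cluster actually returns.

```python
from typing import Any, Dict, List, Tuple


def flatten_ratings(judgment: Dict[str, Any]) -> List[Tuple[str, str, float]]:
    """Return (query, doc_id, rating) rows from a completed judgment.

    `judgment` is the judgment document body (for a Get document response,
    this would be its `_source`). Returns an empty list until the status
    is COMPLETED.
    """
    if judgment.get("status") != "COMPLETED":
        return []
    rows = []
    for entry in judgment.get("judgmentRatings", []):
        for item in entry.get("ratings", []):
            # Ratings may be stored as strings, so convert to float.
            rows.append((entry["query"], item["docId"], float(item["rating"])))
    return rows


# Hypothetical judgment document; the scores are made up for the example.
sample = {
    "status": "COMPLETED",
    "judgmentRatings": [
        {"query": "smart tv", "ratings": [
            {"docId": "1", "rating": "0.9"},
            {"docId": "3", "rating": "0.1"},
        ]},
    ],
}
rows = flatten_ratings(sample)
for query, doc_id, rating in rows:
    print(f"{query}\t{doc_id}\t{rating:.2f}")
```

Flat rows like these are convenient for spot-checking a few query-document pairs by hand before relying on the judgments in an experiment.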
Next steps
You are now ready to run an experiment to evaluate search quality with the LLM-generated judgments. The search configuration and query set that you created during this tutorial can serve as inputs for your first evaluation.