Choosing a workload
The opensearch-benchmark-workloads repository contains a list of workloads that you can use to run your benchmarks. Using a workload similar to your cluster’s use cases can save you time and effort when assessing your cluster’s performance.
For example, suppose you're a system architect at a rideshare company. Your company collects and stores data about trip times, locations, and other details for each ride. Instead of building a custom workload and using your own data, which requires additional time, effort, and cost, you can use the nyc_taxis workload to benchmark your cluster because its data is similar to the data that you collect.
Criteria for choosing a workload
Consider the following criteria when deciding which workload would work best for benchmarking your cluster:
- The cluster’s use case and the size of the cluster. Small clusters usually contain 1–10 nodes and are suitable for development environments. Medium clusters usually contain 11–50 nodes and are used for testing environments that more closely resemble a production cluster.
- The data types that your cluster uses compared to the data structure of the documents contained in the workload. Each workload contains an example document so that you can compare data types, or you can view the index mappings and data types in the `index.json` file.
- The query types most commonly used inside your cluster. The `operations/default.json` file contains information about the query types and workload operations. For a list of common operations, see Common operations.
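Before settling on a workload, you can list the workloads that your OpenSearch Benchmark installation knows about. The following is a minimal sketch, assuming a recent OpenSearch Benchmark version:

```bash
# Show the available workloads along with basic metadata such as
# their descriptions, so you can compare them against your use case.
opensearch-benchmark list workloads
```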
General search use cases: nyc_taxis
For benchmarking clusters built for general search use cases, start with the nyc_taxis workload. It contains the following:
- Data type: Ride data from yellow taxis in New York City in 2015.
- Cluster requirements: Suitable for small- to medium-sized clusters.
This workload tests the following queries and search functions:
- Range queries
- Term queries on various fields
- Geodistance queries
- Aggregations
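As an example, the following sketch runs the nyc_taxis workload against an existing cluster. The host, credentials, and use of `--test-mode` (which runs a reduced data set for a quick smoke test) are assumptions you should adapt to your environment:

```bash
# Run the nyc_taxis workload against an already-running cluster
# (benchmark-only pipeline). --test-mode uses a small data subset
# so that you can validate the setup before a full run.
opensearch-benchmark execute-test \
  --workload=nyc_taxis \
  --target-hosts=https://localhost:9200 \
  --pipeline=benchmark-only \
  --client-options="basic_auth_user:'admin',basic_auth_password:'<password>',verify_certs:false" \
  --test-mode
```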
Vector data: vectorsearch
The vectorsearch workload is designed to benchmark vector search capabilities, including performance and accuracy. It contains the following:
- Data type: High-dimensional vector data, often representing embeddings of text or images.
- Cluster requirements: Requires a cluster with vector search capabilities enabled.
This workload tests the following queries and search functions:
- k-NN vector searches
- Hybrid searches combining vector similarity with metadata filtering
- Indexing performance for high-dimensional vector data
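The vectorsearch workload is parameterized, so you typically pass a parameter file describing the target index, vector dimension, and data set. The following is a sketch only; the file path is a placeholder, and the expected parameter names should be taken from the workload's documentation:

```bash
# Run the vectorsearch workload with a parameter file that defines
# the target index, vector field, dimension, and data set locations.
# /path/to/params.json is a placeholder; consult the workload's
# README for the parameters your version expects.
opensearch-benchmark execute-test \
  --workload=vectorsearch \
  --target-hosts=https://localhost:9200 \
  --pipeline=benchmark-only \
  --workload-params=/path/to/params.json
```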
Comprehensive search solutions: big5
The big5 workload is a comprehensive benchmark suite that tests multiple aspects of search engine performance across a range of use cases. It contains the following:
- Data type: A mix of different data types, including text, numeric, and structured data.
- Cluster requirements: Suitable for medium to large clusters because it’s designed to stress test various components.
This workload tests the following queries and search functions:
- Full-text search performance
- Aggregation performance
- Complex Boolean queries
- Sorting and pagination
- Indexing performance for various data types
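Because big5 is intended for medium to large clusters, you'll often run it against a remote, production-like cluster rather than a local one. A hedged example, assuming a security-enabled cluster reachable over TLS; the host name and credentials are placeholders:

```bash
# Run the big5 workload against a remote, security-enabled cluster.
# The host name and credentials below are placeholders.
opensearch-benchmark execute-test \
  --workload=big5 \
  --target-hosts=https://benchmark-cluster.example.com:9200 \
  --pipeline=benchmark-only \
  --client-options="use_ssl:true,verify_certs:true,basic_auth_user:'admin',basic_auth_password:'<password>'"
```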
Percolator queries: percolator
The percolator workload is designed to test the performance of the percolator query type. It contains the following:
- Data type: A set of stored queries and documents to be matched against those queries.
- Cluster requirements: Suitable for clusters that make heavy use of the percolator feature.
This workload tests the following queries and search functions:
- Indexing performance for storing queries
- Matching performance for percolator queries
- Scalability with increasing numbers of stored queries
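As with the other workloads, you can run percolator end to end or restrict a run to a subset of its tasks using `--include-tasks`. The task name below is hypothetical; check the workload's `operations/default.json` file for the actual task names:

```bash
# Run only selected tasks from the percolator workload.
# "percolator-query-task" is a hypothetical task name used for
# illustration; check the workload definition for real task names.
opensearch-benchmark execute-test \
  --workload=percolator \
  --target-hosts=https://localhost:9200 \
  --pipeline=benchmark-only \
  --include-tasks="percolator-query-task"
```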
Log data: http_logs
For benchmarking clusters built for indexing and search using log data, use the http_logs workload. It contains the following:
- Data type: HTTP access logs from the 1998 World Cup website.
- Cluster requirements: Suitable for clusters optimized for time-series data and log analytics.
This workload tests the following queries and search functions:
- Time range queries
- Term queries on fields like `status-code` or `user-agent`
- Aggregations for metrics like request count and average response size
- Cardinality aggregations on fields like `ip-address`
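Log analytics benchmarks are often tuned through workload parameters such as the number of replicas or the number of bulk indexing clients. The following sketch passes them inline; the parameter names are assumptions that may differ between workload versions:

```bash
# Run the http_logs workload with inline workload parameters.
# number_of_replicas and bulk_indexing_clients are assumed parameter
# names; verify them in the workload's README before relying on them.
opensearch-benchmark execute-test \
  --workload=http_logs \
  --target-hosts=https://localhost:9200 \
  --pipeline=benchmark-only \
  --workload-params="number_of_replicas:0,bulk_indexing_clients:8"
```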
Creating a custom workload
If you can’t find an official workload that suits your needs, you can create a custom workload. For more information, see Creating custom workloads.