Synthetic data generation

Introduced 2.0

OpenSearch Benchmark provides a built-in synthetic data generator that can create datasets tailored to your use case and sized to your benchmarking needs. It currently supports two generation methods:

Random data generation produces fields with randomized values. This is useful for stress testing and evaluating system performance under load.
Rule-based data generation creates data according to user-defined rules. This is helpful for testing specific scenarios, benchmarking query behavior, or simulating domain-specific patterns.
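The difference between the two methods can be illustrated with a minimal Python sketch. It does not use OpenSearch Benchmark's own API. The field names (`user_id`, `latency_ms`, `status`), the value ranges, and the rule that server errors are slow are all assumptions chosen for illustration.

```python
import random


def random_document(rng):
    """Random generation: every field is drawn independently of the others."""
    return {
        "user_id": rng.randint(1, 1_000_000),
        "latency_ms": rng.uniform(0.0, 5000.0),
        "status": rng.choice([200, 404, 500]),
    }


def rule_based_document(rng):
    """Rule-based generation: values follow user-defined rules."""
    # Hypothetical rule 1: most responses succeed (weights are relative).
    status = rng.choices([200, 404, 500], weights=[90, 8, 2])[0]
    # Hypothetical rule 2: server errors are slow, other responses are fast.
    if status == 500:
        latency = rng.uniform(1000.0, 5000.0)
    else:
        latency = rng.uniform(5.0, 300.0)
    return {
        "user_id": rng.randint(1, 1_000_000),
        "latency_ms": latency,
        "status": status,
    }


rng = random.Random(42)  # fixed seed so repeated runs produce the same data
random_docs = [random_document(rng) for _ in range(1000)]
rule_docs = [rule_based_document(rng) for _ in range(1000)]
```

The random documents exercise the system with arbitrary values, while the rule-based documents preserve a relationship between fields that a query benchmark could depend on.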

Data generation methods

OpenSearch Benchmark currently supports the following data generation methods.

Generate data using index mappings

Create synthetic data based on your OpenSearch index mappings.
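As a rough sketch of the idea, the following Python walks an index mapping and emits one value per field according to its declared type. This is not OpenSearch Benchmark's implementation. The helper `generate_value`, the sample mapping, the `WORDS` list, and the value ranges are hypothetical, and only a handful of field types are handled.

```python
import random
from datetime import datetime, timedelta, timezone

WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]


def generate_value(field_def, rng):
    """Return a synthetic value for one mapping entry (hypothetical helper)."""
    # Object fields and the mapping root carry nested "properties".
    if "properties" in field_def:
        return {
            name: generate_value(sub, rng)
            for name, sub in field_def["properties"].items()
        }
    ftype = field_def.get("type")
    if ftype == "keyword":
        return "kw_" + str(rng.randint(0, 9999))
    if ftype == "text":
        return " ".join(rng.choice(WORDS) for _ in range(rng.randint(3, 8)))
    if ftype in ("integer", "long", "short"):
        return rng.randint(0, 10_000)
    if ftype in ("float", "double"):
        return rng.random() * 100
    if ftype == "boolean":
        return rng.random() < 0.5
    if ftype == "date":
        base = datetime(2024, 1, 1, tzinfo=timezone.utc)
        offset = timedelta(seconds=rng.randint(0, 365 * 24 * 3600 - 1))
        return (base + offset).isoformat()
    raise ValueError(f"unsupported field type: {ftype}")


# Sample mapping in the shape OpenSearch uses for index mappings.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "price": {"type": "float"},
            "in_stock": {"type": "boolean"},
            "created_at": {"type": "date"},
            "seller": {"properties": {"name": {"type": "keyword"}}},
        }
    }
}

rng = random.Random(7)
doc = generate_value(mapping["mappings"], rng)
```

Because the generator is driven entirely by the mapping, adding a field to the mapping adds it to every generated document without any other changes.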

Generate data using custom logic

Build synthetic data using your own scripts or domain-specific rules.

For advanced synthetic data generation capabilities, explore vector generation.

Generating vectors

Generate synthetic dense and sparse vectors with configurable parameters for realistic AI/ML benchmarking scenarios.
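To make the distinction concrete, here is a minimal Python sketch of the two vector shapes. It is an illustration only, not the generator's actual interface. The function names, the parameters `dim`, `vocab_size`, and `nonzero`, and the weight range are assumptions. The sparse vector is represented as a token-to-weight map, which is one common representation.

```python
import random


def dense_vector(dim, rng):
    """Dense vector: one float for every dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sparse_vector(vocab_size, nonzero, rng):
    """Sparse vector: only a few token IDs carry a weight; the rest are implicit zeros."""
    token_ids = rng.sample(range(vocab_size), nonzero)
    return {str(token): round(rng.uniform(0.1, 3.0), 4) for token in token_ids}


rng = random.Random(0)
dense = dense_vector(dim=128, rng=rng)
sparse = sparse_vector(vocab_size=30_000, nonzero=10, rng=rng)
```

The dense vector stores all 128 components, while the sparse vector stores 10 entries out of a 30,000-token vocabulary, which is why the two are typically configured with different parameters.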

Tips and best practices

Find practical guidance and best practices for optimizing your synthetic data generation workflows.

