Synthetic data generation

Introduced 2.0

OpenSearch Benchmark provides a built-in synthetic data generator that can create datasets for any use case at any scale. It currently supports two generation methods:

  • Random data generation produces fields with randomized values. This is useful for stress testing and evaluating system performance under load.
  • Rule-based data generation creates data according to user-defined rules. This is helpful for testing specific scenarios, benchmarking query behavior, or simulating domain-specific patterns.
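As a rough illustration of the difference between the two methods, the following sketch uses only the Python standard library. It does not use the OpenSearch Benchmark generator or its API. The field names (`user_id`, `latency_ms`, `status`) and the rule tying server errors to high latency are hypothetical, chosen only to show how rule-based output differs from purely random output.

```python
import random


def random_document(rng):
    """Random generation: each field gets an independent random value."""
    return {
        "user_id": rng.randint(1, 1_000_000),
        "latency_ms": rng.uniform(0.0, 5000.0),
        "status": rng.choice([200, 301, 404, 500]),
    }


def rule_based_document(rng):
    """Rule-based generation: values follow user-defined rules."""
    # Hypothetical rule 1: most requests succeed (weighted status codes).
    status = rng.choices([200, 404, 500], weights=[90, 7, 3])[0]
    # Hypothetical rule 2: server errors are slow, everything else is fast.
    if status == 500:
        latency_ms = rng.uniform(1000.0, 5000.0)
    else:
        latency_ms = rng.uniform(5.0, 300.0)
    return {
        "user_id": rng.randint(1, 1000),
        "latency_ms": latency_ms,
        "status": status,
    }


# A fixed seed keeps the generated dataset reproducible between runs.
rng = random.Random(42)
random_docs = [random_document(rng) for _ in range(1000)]
rule_docs = [rule_based_document(rng) for _ in range(1000)]
```

In the random dataset, latency and status are unrelated, which is adequate for load testing. In the rule-based dataset, a query that filters on `status` also changes the latency distribution it sees, which is closer to what query benchmarks typically need.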

Data generation methods

OpenSearch Benchmark currently supports the following data generation methods.

Generate data using index mappings

Create synthetic data based on your OpenSearch index mappings.
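The idea of mapping-driven generation can be sketched as follows: walk the `properties` of an index mapping and emit a value that matches each field's declared type. This is a minimal conceptual sketch, not the OpenSearch Benchmark implementation. The helper names (`generate_value`, `generate_from_mapping`), the example mapping, the value ranges, and the small set of supported field types are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical vocabulary used to fill "text" fields.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]


def generate_value(field_type, rng):
    """Return a random value that fits a (small subset of) mapping field types."""
    if field_type == "keyword":
        return "kw-" + str(rng.randint(0, 9999))
    if field_type == "text":
        return " ".join(rng.choice(WORDS) for _ in range(5))
    if field_type in ("integer", "long"):
        return rng.randint(0, 10_000)
    if field_type in ("float", "double"):
        return rng.uniform(0.0, 100.0)
    if field_type == "boolean":
        return rng.random() < 0.5
    if field_type == "date":
        base = datetime(2024, 1, 1, tzinfo=timezone.utc)
        offset = timedelta(seconds=rng.randint(0, 365 * 24 * 3600))
        return (base + offset).isoformat()
    raise ValueError("unsupported field type in this sketch: " + field_type)


def generate_from_mapping(properties, rng):
    """Build one document from the 'properties' section of an index mapping."""
    doc = {}
    for name, spec in properties.items():
        if "properties" in spec:
            # Object fields nest their own 'properties'; recurse into them.
            doc[name] = generate_from_mapping(spec["properties"], rng)
        else:
            doc[name] = generate_value(spec["type"], rng)
    return doc


# Hypothetical example mapping in the usual OpenSearch layout.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "sku": {"type": "keyword"},
            "price": {"type": "float"},
            "in_stock": {"type": "boolean"},
            "created_at": {"type": "date"},
            "seller": {"properties": {"id": {"type": "integer"}}},
        }
    }
}

doc = generate_from_mapping(mapping["mappings"]["properties"], random.Random(7))
```

Each generated document has the same field names and nesting as the mapping, so it can be indexed without mapping conflicts for the types this sketch covers.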

Generate data using custom logic

Build synthetic data using your own scripts or domain-specific rules.

For advanced synthetic data generation capabilities, explore vector generation.

Generating vectors

Generate synthetic dense and sparse vectors with configurable parameters for realistic AI/ML benchmarking scenarios.
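To make the distinction concrete, the sketch below builds one dense and one sparse vector using only the Python standard library. It is not the OpenSearch Benchmark vector generator. The function names, the value ranges, and the choice to represent a sparse vector as a map from token ID to weight are assumptions for illustration. The configurable parameters in the actual feature may differ.

```python
import random


def dense_vector(dimension, rng):
    """Dense vector: one float for every dimension."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dimension)]


def sparse_vector(vocab_size, nonzero, rng):
    """Sparse vector: only a few token IDs carry a (positive) weight.

    Assumed representation: a dict mapping token ID (as a string) to weight.
    """
    # sample() draws without replacement, so token IDs are unique.
    token_ids = rng.sample(range(vocab_size), nonzero)
    return {str(token_id): rng.uniform(0.1, 3.0) for token_id in token_ids}


rng = random.Random(123)
dense = dense_vector(128, rng)
sparse = sparse_vector(vocab_size=30_000, nonzero=20, rng=rng)
```

The dense vector stores a value for all 128 dimensions, while the sparse vector stores only 20 of 30,000 possible entries. These are the kinds of shape parameters (dimension, vocabulary size, number of nonzero entries) a vector benchmark usually needs to control.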

Tips and best practices

Learn practical guidance and best practices to optimize your synthetic data generation workflows.
