Generating vectors

You can generate synthetic dense and sparse vectors from mappings using OpenSearch Benchmark’s synthetic data generator.

Dense vectors

Dense vectors (represented by the knn_vector field type in OpenSearch) are numerical representations of data, such as text or images, in which most or all dimensions have non-zero values. These vectors typically contain floating-point numbers between -1.0 and 1.0, with each dimension contributing to the overall meaning.

Example embedding for the word “dog”:

{
  "embedding": [0.234, -0.567, 0.123, 0.891, -0.234, 0.456, ..., 0.789]
}

Sparse vectors

Sparse vectors (represented by the sparse_vector field type in OpenSearch) are vectors in which most dimensions are zero. Only the non-zero entries are stored, as key-value pairs of token IDs and their weights.

Example text: “Korean jindos are hunting dogs that have a reputation for being loyal, independent, and confident.”

Sparse vector representation of example text:

{
  "5432": 0.85,   // "korean" - very important (specific descriptor)
  "7821": 0.78,   // "jindos" - very important (breed name)
  "2": 0.45,      // "dog" - moderately important (general category)
  "9999": 0.32,   // "loyal" - somewhat important (characteristic)
  "1111": 0.12    // "things" - less important (common word)
}
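The relationship between the two representations can be sketched in a few lines of Python. This is only an illustration: the to_sparse helper is hypothetical and is not part of OpenSearch or OpenSearch Benchmark. It keeps the non-zero entries of a mostly-zero list and uses each position as the token ID.

```python
def to_sparse(dense, token_ids=None):
    """Keep only the non-zero entries of a dense list as {token_id: weight}."""
    if token_ids is None:
        # Assumption for this sketch: the position in the list is the token ID.
        token_ids = range(len(dense))
    return {str(tid): w for tid, w in zip(token_ids, dense) if w != 0.0}


dense = [0.0, 0.0, 0.45, 0.0, 0.0, 0.85, 0.0, 0.0]
sparse = to_sparse(dense)
print(sparse)  # {'2': 0.45, '5': 0.85}
```

Only two of the eight positions are stored, which is why sparse vectors stay compact even when the vocabulary is very large.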

Basic usage

The following examples show how to generate vectors with minimal configuration using only OpenSearch index mappings.

Generating dense vectors

Generate random 128-dimensional vectors with minimal configuration.

1. Create a mapping file (simple-knn-mapping.json):

{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "title": {"type": "text"},
      "my_embedding": {
        "type": "knn_vector",
        "dimension": 128
      }
    }
  }
}

2. Generate data:

opensearch-benchmark generate-data \
  --index-name my-vectors \
  --index-mappings simple-knn-mapping.json \
  --output-path ./output \
  --total-size 1

Generated output

In each of the generated documents, the my_embedding field might appear as follows:

{
  "title": "Sample text 42",
  "my_embedding": [0.234, -0.567, 0.123, ..., 0.891]  // 128 random floats [-1.0, 1.0]
}
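To make the shape of this output concrete, the following Python sketch produces a vector with the same properties (128 floats drawn uniformly from [-1.0, 1.0]). It is an approximation for illustration only, not OpenSearch Benchmark's actual implementation, and random_dense_vector is a hypothetical helper name.

```python
import random


def random_dense_vector(dimension, seed=None):
    """Return `dimension` floats drawn uniformly from [-1.0, 1.0]."""
    rng = random.Random(seed)  # A seeded generator keeps the example reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dimension)]


vector = random_dense_vector(128, seed=42)
print(len(vector))                            # 128
print(all(-1.0 <= v <= 1.0 for v in vector))  # True
```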

Generating sparse vectors

Generate sparse vectors with the default configuration (10 tokens).

1. Create a mapping file (simple-sparse-mapping.json):

{
  "mappings": {
    "properties": {
      "content": {"type": "text"},
      "sparse_embedding": {
        "type": "sparse_vector"
      }
    }
  }
}

2. Generate data (same command pattern):

opensearch-benchmark generate-data \
  --index-name my-sparse \
  --index-mappings simple-sparse-mapping.json \
  --output-path ./output \
  --total-size 1

Generated output

In each of the generated documents, the sparse_embedding field might appear as follows:

{
  "content": "Sample text content",
  "sparse_embedding": {
    "1000": 0.3421,
    "1100": 0.5234,
    "1200": 0.7821,
    "1300": 0.1523,
    "1400": 0.9102,
    "1500": 0.4567,
    "1600": 0.2341,
    "1700": 0.6789,
    "1800": 0.8123,
    "1900": 0.3456
  }
}

Using only an OpenSearch index mapping, OpenSearch Benchmark can generate synthetic dense and sparse vectors. However, these vectors are basic: values are random and do not form clusters. For more realistic distributions and clustering, we recommend configuring the parameters described in the following sections.

Dense vector (k-NN vector) parameters

The following are parameters that you can add to your synthetic data generation configuration file (YAML configuration) to fine-tune the generation of dense vectors. These parameters are used in the field_overrides section with the generate_knn_vector generator. For complete configuration details, see Advanced configuration.

dimension

This parameter specifies the number of dimensions in the vector. Optional.

How to specify: The dimension must be defined in your OpenSearch index mapping file. You can optionally override this value in your YAML configuration using the dimension parameter in field_overrides.

Impact:

Memory: Higher dimensions = more storage (raw values only, assuming 4-byte floats)
- 128D ≈ 0.5 KB per vector
- 768D ≈ 3 KB per vector
- 1536D ≈ 6 KB per vector
Performance: More dimensions = slower indexing and search
Quality: Must match your actual embedding model’s output
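The per-vector sizes listed above follow from simple arithmetic. The sketch below assumes each dimension is stored as a 4-byte float and counts raw values only; index structures add overhead that is not modeled here. The helper name raw_vector_kb is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # Assumption: each dimension is stored as a 4-byte float


def raw_vector_kb(dimension):
    """Raw size of one vector in KB (1 KB = 1,024 bytes), excluding index overhead."""
    return dimension * BYTES_PER_FLOAT32 / 1024


for dim in (128, 768, 1536):
    print(f"{dim}D = {raw_vector_kb(dim):.1f} KB per vector")
# 128D = 0.5 KB per vector
# 768D = 3.0 KB per vector
# 1536D = 6.0 KB per vector
```

Multiplying the result by the planned document count gives a rough lower bound on the raw vector data in a generated corpus.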

The following table shows common dimension values and their typical use cases.

Dimension	Use case	Example models
128	Lightweight, custom models	Custom embeddings, fast search
384	General purpose	sentence-transformers/all-MiniLM-L6-v2
768	Standard NLP	BERT-Base, DistilBERT, MPNet
1,024	High-quality NLP	BERT-Large
1,536	OpenAI standard	text-embedding-ada-002, text-embedding-3-small
3,072	OpenAI premium	text-embedding-3-large

Example:

field_overrides:
  my_embedding:
    generator: generate_knn_vector
    params:
      dimension: 768  # Override mapping dimension if needed

Best practice: This parameter must match your embedding model’s dimension.

sample_vectors

This parameter provides base vectors to which the generator adds noise, creating realistic variations and clusters. Optional but highly recommended.

Without sample vectors, OpenSearch Benchmark’s synthetic data generator produces uniformly random vectors spread across the entire space. Such data has no cluster structure, so it is unrealistic and a poor basis for evaluating search quality. Providing sample vectors allows the generator to create more realistic, natural clusters around them.

After you prepare a list of sample vectors, insert them as a list of lists, in which each inner list is a complete vector. The following example provides sample vectors in the synthetic data generation configuration file:

field_overrides:
  product_embedding:
    generator: generate_knn_vector
    params:
      dimension: 768
      sample_vectors:
        - [0.12, -0.34, 0.56, ..., 0.23]  # Vector 1 (768 values)
        - [-0.23, 0.45, -0.12, ..., -0.15]  # Vector 2 (768 values)
        - [0.34, 0.21, -0.45, ..., 0.42]  # Vector 3 (768 values)

Use the following guidelines to determine the number of vectors that you provide:

Minimum: 3–5 for basic clustering
Recommended: 5–10 for realistic distribution
Advanced: 20 or more for complex multi-cluster scenarios

How to obtain sample vectors:

Option 1 (Recommended): Using actual embeddings from your domain: Select embeddings that represent different semantic clusters in your data. Random generation without sample vectors produces unrealistic data that is unsuitable for search quality testing.

Option 2: Using sentence-transformers in Python:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Create representative texts from different categories
texts = [
    "Electronics and gadgets",
    "Clothing and fashion",
    "Home and kitchen appliances",
    "Books and literature",
    "Sports and outdoor equipment"
]

embeddings = model.encode(texts)
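# Copy the output into sample_vectors in your synthetic data generation
# configuration file (YAML). all-MiniLM-L6-v2 produces 384-dimensional vectors,
# so set dimension to 384 when using this model.
print(embeddings.tolist())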
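Pasting a raw Python list into YAML by hand is error-prone, so it can help to print the vectors already formatted as list items. The following sketch uses only the standard library. The helper to_yaml_sample_vectors, its indent and precision parameters, and the three-value example vectors are hypothetical and exist only for illustration. Rounding shortens the output but slightly alters the vectors.

```python
import json


def to_yaml_sample_vectors(vectors, indent=8, precision=4):
    """Format a list of vectors as YAML list items, one flow-style list per line."""
    pad = " " * indent
    lines = []
    for vec in vectors:
        rounded = [round(float(v), precision) for v in vec]
        # A JSON array is also a valid YAML flow sequence
        lines.append(f"{pad}- {json.dumps(rounded)}")
    return "\n".join(lines)


# Tiny stand-in vectors; pass embeddings.tolist() here for real model output
example = [[0.12, -0.34, 0.56], [-0.23, 0.45, -0.12]]
print("      sample_vectors:")
print(to_yaml_sample_vectors(example))
```

The default indentation matches the nesting of sample_vectors under params in the configuration examples on this page. Adjust it if your file is indented differently.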

distribution_type

This parameter specifies the type of noise distribution. Optional. Default is gaussian.

Valid values:

gaussian: Normal distribution with mean 0 and standard deviation noise_factor
- Most realistic (natural variation with occasional outliers)
- Produces smooth clusters
- Some values can extend beyond expected range
uniform: Uniform distribution [-noise_factor, +noise_factor]
- Bounded variation (no extreme outliers)
- More predictable results
- Flat probability across range
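The difference between the two distributions can be sketched as follows. This is an approximation under stated assumptions rather than OpenSearch Benchmark's actual code: add_noise is a hypothetical helper that adds independent noise to every dimension of a base vector.

```python
import random


def add_noise(base, noise_factor=0.1, distribution_type="gaussian", rng=None):
    """Return a copy of `base` with per-dimension noise added."""
    if rng is None:
        rng = random.Random()
    if distribution_type == "gaussian":
        # Mean 0, standard deviation noise_factor: unbounded, occasional outliers
        return [x + rng.gauss(0.0, noise_factor) for x in base]
    if distribution_type == "uniform":
        # Every offset lies within [-noise_factor, +noise_factor]
        return [x + rng.uniform(-noise_factor, noise_factor) for x in base]
    raise ValueError(f"Unknown distribution_type: {distribution_type}")


rng = random.Random(7)
base = [0.12, -0.34, 0.56, 0.23]
uniform_vec = add_noise(base, 0.1, "uniform", rng)
gaussian_vec = add_noise(base, 0.1, "gaussian", rng)

# Uniform offsets never exceed noise_factor (small tolerance for float rounding)
print(max(abs(u - b) for u, b in zip(uniform_vec, base)) <= 0.1 + 1e-12)
```

With gaussian noise, no such bound holds: most offsets fall within a few standard deviations, but individual values can land farther from the base vector.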

Configuration:

field_overrides:
  realistic_embedding:
    generator: generate_knn_vector
    params:
      sample_vectors: [...]
      noise_factor: 0.1
      distribution_type: gaussian  # More realistic

  controlled_embedding:
    generator: generate_knn_vector
    params:
      sample_vectors: [...]
      noise_factor: 0.1
      distribution_type: uniform   # More predictable

Best practice: Use gaussian for production-like benchmarks.

noise_factor

This parameter controls the amount of noise added to base vectors:

For gaussian: Standard deviation of normal distribution
For uniform: Range of uniform distribution (±noise_factor)

Optional. Default is 0.1.

The following table shows how different noise_factor values impact the generated data.

`noise_factor`	Effect	Use case
0.01–0.05	Tight clustering, minimal variation	Duplicate detection, near-exact matches
0.1–0.2	Natural variation within topic	General semantic search, recommendations
0.3–0.5	Wide dispersion, diverse concepts	Broad topic matching, discovery
> 0.5	Very scattered, overlapping clusters	Testing edge cases, stress testing
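One way to build intuition for these ranges is to measure how similar noisy vectors remain to their base vector. The following sketch assumes gaussian noise and a random 128-dimensional base vector; it is not taken from OpenSearch Benchmark, and the helper names cosine and mean_similarity_to_base are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity_to_base(base, noise_factor, samples=200, seed=0):
    """Average cosine similarity between `base` and noisy copies of it."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        noisy = [x + rng.gauss(0.0, noise_factor) for x in base]
        total += cosine(base, noisy)
    return total / samples


rng = random.Random(1)
base = [rng.uniform(-1.0, 1.0) for _ in range(128)]
for noise_factor in (0.01, 0.1, 0.3, 0.5):
    print(noise_factor, round(mean_similarity_to_base(base, noise_factor), 3))
```

The average similarity drops as noise_factor grows, which corresponds to the progression from tight clusters to scattered, overlapping ones in the table above.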

Configuration:

field_overrides:
  tight_clustering:
    generator: generate_knn_vector
    params:
      sample_vectors: [...]
      noise_factor: 0.05  # Tight clusters

  diverse_results:
    generator: generate_knn_vector
    params:
      sample_vectors: [...]
      noise_factor: 0.2   # More variation

Best practice: Start with 0.1, then adjust based on search recall or precision requirements.

normalize

This parameter normalizes vectors after noise addition, making their magnitude (length) exactly 1.0. Optional. Default is false.

The following table shows when to set normalize to true based on your index configuration.

`space_type` in the index mapping	`normalize` value	Explanation
`cosinesimil`	`true`	Cosine similarity depends only on vector direction. Pre-normalizing improves performance because the dot product directly represents cosine similarity.
`l2`	`false`	L2 distance relies on vector magnitude. Normalizing removes magnitude information and reduces accuracy.
`innerproduct`	`false`	Inner product incorporates vector magnitude into the similarity score, so normalization would change the intended scoring behavior.
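The reasoning in the cosinesimil row can be checked with a short example. The sketch below is illustrative only (l2_normalize is a hypothetical helper): it scales two vectors to unit length and shows that their dot product then equals the cosine similarity of the original vectors.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(a)  # [0.6, 0.8]

# For unit vectors, the dot product is the cosine similarity:
# (3*4 + 4*3) / (5 * 5) = 0.96 for the original vectors
dot = sum(x * y for x, y in zip(a, b))
print(round(dot, 6))  # 0.96
```

Normalization discards the original magnitudes (both vectors had length 5.0), which is why it is not recommended for the l2 and innerproduct space types.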

Real-world model guidance:

OpenAI embeddings: These vectors are pre-normalized, so set normalize to true.
sentence-transformers: Many models output normalized vectors. Review the model documentation; in most cases, normalize should be set to true.
BERT (raw output): Raw BERT embeddings are not normalized. Set normalize to false and rely on the index configuration to perform normalization if needed.

Configuration:

field_overrides:
  # For cosine similarity search
  cosine_embedding:
    generator: generate_knn_vector
    params:
      dimension: 384
      sample_vectors: [...]
      normalize: true  # Unit-length vectors, recommended for cosine similarity

  # For L2 distance search
  l2_embedding:
    generator: generate_knn_vector
    params:
      dimension: 768
      sample_vectors: [...]
      normalize: false  # Keep original magnitudes

Best practice: Match your OpenSearch index’s space_type setting.

Sparse vector parameters

The following are parameters that you can add to your synthetic data generation configuration file to fine-tune how sparse vectors are generated. These parameters are used in the field_overrides section with the generate_sparse_vector generator. For complete configuration details, see Advanced configuration.

num_tokens

This parameter specifies the number of token-weight pairs to generate per vector. Optional. Default is 10.

Impact:

Low (5–10): Very sparse, fast search; may miss some relevant documents
Medium (10–25): Balanced performance and recall
High (50–100): Dense sparse representation; comprehensive but slower

The following table shows typical num_tokens values for different models and approaches.

Model/Approach	Typical `num_tokens`	Use case
SPLADE v1	10–15	Standard sparse neural search
SPLADE v2	15–25	Improved recall
DeepImpact	8–12	Efficient sparse search
Custom/Hybrid	20–50	Rich representations

Configuration:

field_overrides:
  sparse_standard:
    generator: generate_sparse_vector
    params:
      num_tokens: 15  # Standard SPLADE-like

  sparse_rich:
    generator: generate_sparse_vector
    params:
      num_tokens: 30  # Richer representation

Best practice: Start with 10–15; increase if recall is insufficient.

min_weight and max_weight

These parameters define the range of token importance weights. Optional. Default min_weight is 0.01; default max_weight is 1.0.

Impact:

min_weight: Excludes low-importance tokens from generation. Tokens with weights below this value are not included.
max_weight: Limits the upper bound of token influence to prevent any single token from dominating the vector.

The following table shows common weight range configurations and their use cases.

Configuration	`min_weight`	`max_weight`	Use case
Standard SPLADE	`0.01`	`1.0`	Default, balanced importance
Narrow range	`0.1`	`0.9`	More uniform importance
Wide range	`0.01`	`2.0`	Strong importance signals
High threshold	`0.05`	`1.0`	Filters low-confidence tokens

Configuration:

field_overrides:
  sparse_balanced:
    generator: generate_sparse_vector
    params:
      num_tokens: 15
      min_weight: 0.01
      max_weight: 1.0

  sparse_uniform:
    generator: generate_sparse_vector
    params:
      num_tokens: 20
      min_weight: 0.2   # Higher minimum
      max_weight: 0.8   # Lower maximum

Constraints:

min_weight must be > 0.0 (OpenSearch requires positive weights).
max_weight must be > min_weight.
Weights are rounded to 4 decimal places.
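These constraints can be expressed as a small validation-and-generation routine. The sketch below is not OpenSearch Benchmark's implementation. The helper random_weights is hypothetical, and uniform sampling between the two bounds is an assumption, because the distribution used by the generator is not specified here.

```python
import random


def random_weights(num_tokens=10, min_weight=0.01, max_weight=1.0, seed=None):
    """Validate the weight bounds, then draw weights rounded to 4 decimal places."""
    if min_weight <= 0.0:
        raise ValueError("min_weight must be > 0.0")
    if max_weight <= min_weight:
        raise ValueError("max_weight must be > min_weight")
    rng = random.Random(seed)
    # Assumption: weights are sampled uniformly between the two bounds
    return [round(rng.uniform(min_weight, max_weight), 4) for _ in range(num_tokens)]


weights = random_weights(num_tokens=15, seed=3)
print(len(weights))                              # 15
print(all(0.01 <= w <= 1.0 for w in weights))    # True
```

Checking the bounds before generating data surfaces an invalid configuration immediately rather than partway through a long run.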

Best practice: Keep min_weight small (0.01–0.05) to allow nuanced weighting.

token_id_start and token_id_step

These parameters define how token IDs are assigned during vector generation:

token_id_start: Sets the starting token ID in the generated sequence. Default is 1000.
token_id_step: Specifies the increment applied between each consecutive token ID. Default is 100.

Generated sequence: start, start+step, start+2*step, ...

Example with start=1000, step=100, num_tokens=5:

{
  "1000": 0.3421,  // token_id_start
  "1100": 0.5234,  // start + 1*step
  "1200": 0.7821,  // start + 2*step
  "1300": 0.1523,  // start + 3*step
  "1400": 0.9102   // start + 4*step
}
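The sequence rule above (start, start+step, start+2*step, ...) can be reproduced directly. The following sketch uses a hypothetical token_ids helper with the documented defaults. It also checks that two fields with different token_id_start values produce non-overlapping ID ranges.

```python
def token_ids(num_tokens=10, token_id_start=1000, token_id_step=100):
    """Return the arithmetic sequence start, start+step, start+2*step, ..."""
    return [token_id_start + i * token_id_step for i in range(num_tokens)]


print(token_ids(5))                                     # [1000, 1100, 1200, 1300, 1400]
print(token_ids(3, token_id_start=0, token_id_step=1))  # [0, 1, 2]

# Two fields with different starting IDs and the default step of 100
field1 = token_ids(token_id_start=1000)  # 1000 through 1900
field2 = token_ids(token_id_start=5000)  # 5000 through 5900
print(set(field1).isdisjoint(field2))    # True
```

When choosing starting IDs for multiple fields, the last ID of one field (start + (num_tokens - 1) * step) should stay below the starting ID of the next.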

The following table shows different token ID configurations and their use cases.

Configuration	`token_id_start`	`token_id_step`	Use case
Default testing	`1000`	`100`	Helps visually distinguish generated token ranges.
Realistic vocabulary	`0`	`1`	Aligns token IDs with a real model’s vocabulary indexes.
Multi-field generation	`1000`, `5000`, `10000`	`1`	Keeps token ID ranges separate across different fields.
Large vocabulary simulation	`0`	`1`	Supports generation scenarios with vocabularies of `50,000`+ tokens.

Configuration:

field_overrides:
  # Default: easy debugging
  sparse_debug:
    generator: generate_sparse_vector
    params:
      num_tokens: 10
      token_id_start: 1000
      token_id_step: 100

  # Realistic: actual vocab indices
  sparse_realistic:
    generator: generate_sparse_vector
    params:
      num_tokens: 15
      token_id_start: 0
      token_id_step: 1

  # Multiple fields: separate ranges
  sparse_field1:
    generator: generate_sparse_vector
    params:
      token_id_start: 1000

  sparse_field2:
    generator: generate_sparse_vector
    params:
      token_id_start: 5000

Note: Token IDs in the generated data are evenly spaced (and strictly sequential when token_id_step is 1). In real sparse vectors, token IDs are typically irregularly spaced because they depend on the model's vocabulary. This difference does not impact OpenSearch indexing or search functionality.

Best practice: Use a larger token_id_step (for example, 100) for debugging, and set token_id_step to 1 for production-like data.

Choosing simple or complex generation approaches

The following table outlines when to use simple generation versus a more complex, configurable approach based on your testing goals.

Scenario	Recommended approach	Rationale
Learning or quick testing	Simple generation (no additional configuration)	Provides the fastest setup and is sufficient for basic validation.
Load testing	Simple generation	Prioritizes data volume and throughput over vector realism.
Realistic benchmarks	Complex generation (with configuration)	Requires realistic vector clustering and distributions to reflect real-world behavior.
Production simulation	Complex generation	Needs vector characteristics that closely match those produced by the actual embedding model.
Search quality testing	Complex generation	Requires meaningful vector clusters to evaluate recall and precision accurately.

Recommendation: For search quality testing or algorithm comparisons, use a complex configuration with sample vectors to ensure realistic data distributions.
