Generating data using index mappings

You can use OpenSearch index mappings to generate synthetic data. This approach offers a balance between automation and customization.

To use this method, save your OpenSearch index mappings to a JSON file:

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text"
      },
      "price": {
        "type": "float"
      },
      "created_at": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "is_available": {
        "type": "boolean"
      },
      "category_id": {
        "type": "integer"
      },
      "tags": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

OpenSearch Benchmark works with any valid index mappings, regardless of complexity. You can provide more complex mappings similar to the following:

Mappings

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "user": {
        "type": "object",
        "properties": {
          "id": {
            "type": "keyword"
          },
          "email": {
            "type": "keyword"
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              },
              "completion": {
                "type": "completion"
              }
            },
            "analyzer": "standard"
          },
          "address": {
            "type": "object",
            "properties": {
              "street": {
                "type": "text"
              },
              "city": {
                "type": "keyword"
              },
              "state": {
                "type": "keyword"
              },
              "zip": {
                "type": "keyword"
              },
              "location": {
                "type": "geo_point"
              }
            }
          },
          "preferences": {
            "type": "object",
            "dynamic": true
          }
        }
      },
      "orders": {
        "type": "nested",
        "properties": {
          "id": {
            "type": "keyword"
          },
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "amount": {
            "type": "scaled_float",
            "scaling_factor": 100
          },
          "status": {
            "type": "keyword"
          },
          "items": {
            "type": "nested",
            "properties": {
              "product_id": {
                "type": "keyword"
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword"
                  }
                }
              },
              "quantity": {
                "type": "short"
              },
              "price": {
                "type": "float"
              },
              "categories": {
                "type": "keyword"
              }
            }
          },
          "shipping_address": {
            "type": "object",
            "properties": {
              "street": {
                "type": "text"
              },
              "city": {
                "type": "keyword"
              },
              "state": {
                "type": "keyword"
              },
              "zip": {
                "type": "keyword"
              },
              "location": {
                "type": "geo_point"
              }
            }
          }
        }
      },
      "activity_log": {
        "type": "nested",
        "properties": {
          "timestamp": {
            "type": "date"
          },
          "action": {
            "type": "keyword"
          },
          "ip_address": {
            "type": "ip"
          },
          "details": {
            "type": "object",
            "enabled": false
          }
        }
      },
      "metadata": {
        "type": "object",
        "properties": {
          "created_at": {
            "type": "date"
          },
          "updated_at": {
            "type": "date"
          },
          "tags": {
            "type": "keyword"
          },
          "source": {
            "type": "keyword"
          },
          "version": {
            "type": "integer"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "standard": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      },
      "ranking_scores": {
        "type": "object",
        "properties": {
          "popularity": {
            "type": "float"
          },
          "relevance": {
            "type": "float"
          },
          "quality": {
            "type": "float"
          }
        }
      },
      "permissions": {
        "type": "nested",
        "properties": {
          "user_id": {
            "type": "keyword"
          },
          "role": {
            "type": "keyword"
          },
          "granted_at": {
            "type": "date"
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

Generating data

To generate synthetic data using index mappings, use the generate-data subcommand and provide the required index mappings file, index name, output path, and total amount of data to generate:

osb generate-data --index-name <NAME_OF_DATA_CORPORA> --index-mappings <PATH_TO_INDEX_MAPPINGS> --output-path <DESIRED_OUTPUT_PATH> --total-size <TOTAL_SIZE_OF_DATA_CORPORA_GENERATED_IN_GB>

For a complete list of available parameters and their descriptions, see the generate-data command reference.

Example output

The following is an example output when generating 100 GB of data:

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/


[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status]
[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard.
[NOTE] Example of localhost command for SSH port forwarding (tunneling) from an AWS EC2 instance:
ssh -i <PEM_FILEPATH> -N -L localhost:8787:localhost:8787 ec2-user@<DNS>

Total GB to generate: [1]
Average document size in bytes: [412]
Max file size in GB: [40]

100%|███████████████████████████████████████████████████████████████████| 100.07G/100.07G [3:35:29<00:00, 3.98MB/s]

Generated 24271844660 docs in 12000 seconds. Total dataset size is 100.21GB.
✅ Visit the following path to view synthetically generated data: /home/ec2-user/

-----------------------------------
[INFO] ✅ SUCCESS (took 272 seconds)
-----------------------------------

Advanced configuration

You can control how synthetic data is generated by creating a YAML configuration file. The following is an example configuration file that defines custom rules in the MappingGenerationValues parameter:

MappingGenerationValues:
  # For users who want more granular control over how data is generated when providing an OpenSearch mapping
  generator_overrides:
    # Overrides all instances of generators with these settings. Specify type and params
    integer:
      min: 0
      max: 20
    long:
      min: 0
      max: 1000
    float:
      min: 0.0
      max: 1.0
    double:
      min: 0.0
      max: 2000.0
    date:
      start_date: "2020-01-01"
      end_date: "2023-01-01"
      format: "yyyy-mm-dd"
    text:
      must_include: ["lorem", "ipsum"]
    keyword:
      choices: ["alpha", "beta", "gamma"]

  field_overrides:
    # Specify field name as key of dict. For its values, specify generator and its params. Params must adhere to existing params for each generator
    # For nested fields, use dot notation: Example preferences.allergies if allergies is a subfield of preferences object
    title:
      generator: generate_keyword
      params:
        choices: ["Helly R", "Mark S", "Irving B"]

    promo_codes:
      generator: generate_keyword
      params:
        choices: ["HOT_SUMMER", "TREATSYUM!"]

    # Nested fields, use dot notation
    orders.items.product_id:
      generator: generate_keyword
      params:
        choices: ["Python", "English"]

MappingGenerationValues supports the following parameters.

Parameter	Description
`generator_overrides`	Defines custom generator rules for specific OpenSearch field types. Any field that uses the corresponding type will follow these rules. See Generator overrides parameters.
`field_overrides`	Defines generator rules for individual fields by field name. These apply only to the fields explicitly listed. For nested fields, use dot notation (for example, `orders.items.product_id`). See Field overrides parameters.

If both generator_overrides and field_overrides are present, field_overrides take precedence.

Generator overrides parameters

The following parameters are available for each OpenSearch field type in generator_overrides.

Field type	Parameters
`integer`, `long`, `short`, `byte`	`min`, `max`
`float`, `double`	`min`, `max`, `round` (the number of decimal places to round to)
`date`	`start_date`, `end_date`, `format`
`text`	`must_include` (array of terms to include in generated text)
`keyword`	`choices` (array of keywords to randomly select from)

Field overrides parameters

The following generators and their parameters are available for use in field_overrides.

Generator	Parameters
`generate_text`	`must_include` (array of terms to include in generated text)
`generate_keyword`	`choices` (array of keywords to randomly select from)
`generate_integer`	`min`, `max`
`generate_long`	`min`, `max`
`generate_short`	`min`, `max`
`generate_byte`	`min`, `max`
`generate_float`	`min`, `max`, `round` (the number of decimal places to round to)
`generate_double`	`min`, `max`
`generate_boolean`	N/A
`generate_date`	`format`, `start_date`, `end_date`
`generate_ip`	N/A
`generate_geo_point`	N/A
`generate_knn_vector`	`dimension`, `sample_vectors`, `noise_factor`, `distribution_type`, `normalize`. See Generating vectors.
`generate_sparse_vector`	`num_tokens`, `min_weight`, `max_weight`, `token_id_start`, `token_id_step`. See Generating vectors.

Using the configuration

To use your configuration file, provide its full path in the --custom-config parameter:

osb generate-data --index-name <NAME_OF_DATA_CORPORA> --index-mappings <PATH_TO_INDEX_MAPPINGS> --output-path <DESIRED_OUTPUT_PATH> --total-size <TOTAL_SIZE_OF_DATA_CORPORA_GENERATED_IN_GB> --custom-config ~/Desktop/sdg-config.yml

Generating data
Example output
Advanced configuration
- Using the configuration
Related documentation

WAS THIS PAGE HELPFUL?

✔ Yes ✖ No

Tell us why

350 characters left

Have a question? Ask us on the OpenSearch forum.

Want to contribute? Edit this page or create an issue.

Generating data using index mappings

Generating data

Example output

Advanced configuration

Generator overrides parameters

Field overrides parameters

Using the configuration

OpenSearch Links

Get Involved

Resources

Contact Us

Generating data using index mappings

Generating data

Example output

Advanced configuration

Generator overrides parameters

Field overrides parameters

Using the configuration

Related documentation