Link Search Menu Expand Document Documentation Menu

Generating data using index mappings

You can use OpenSearch index mappings to generate synthetic data. This approach offers a balance between automation and customization.

To use this method, save your OpenSearch index mappings to a JSON file:

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text"
      },
      "price": {
        "type": "float"
      },
      "created_at": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "is_available": {
        "type": "boolean"
      },
      "category_id": {
        "type": "integer"
      },
      "tags": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

OpenSearch Benchmark works with any valid index mappings, regardless of complexity. You can provide more complex mappings similar to the following:

Mappings
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "user": {
        "type": "object",
        "properties": {
          "id": {
            "type": "keyword"
          },
          "email": {
            "type": "keyword"
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              },
              "completion": {
                "type": "completion"
              }
            },
            "analyzer": "standard"
          },
          "address": {
            "type": "object",
            "properties": {
              "street": {
                "type": "text"
              },
              "city": {
                "type": "keyword"
              },
              "state": {
                "type": "keyword"
              },
              "zip": {
                "type": "keyword"
              },
              "location": {
                "type": "geo_point"
              }
            }
          },
          "preferences": {
            "type": "object",
            "dynamic": true
          }
        }
      },
      "orders": {
        "type": "nested",
        "properties": {
          "id": {
            "type": "keyword"
          },
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "amount": {
            "type": "scaled_float",
            "scaling_factor": 100
          },
          "status": {
            "type": "keyword"
          },
          "items": {
            "type": "nested",
            "properties": {
              "product_id": {
                "type": "keyword"
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword"
                  }
                }
              },
              "quantity": {
                "type": "short"
              },
              "price": {
                "type": "float"
              },
              "categories": {
                "type": "keyword"
              }
            }
          },
          "shipping_address": {
            "type": "object",
            "properties": {
              "street": {
                "type": "text"
              },
              "city": {
                "type": "keyword"
              },
              "state": {
                "type": "keyword"
              },
              "zip": {
                "type": "keyword"
              },
              "location": {
                "type": "geo_point"
              }
            }
          }
        }
      },
      "activity_log": {
        "type": "nested",
        "properties": {
          "timestamp": {
            "type": "date"
          },
          "action": {
            "type": "keyword"
          },
          "ip_address": {
            "type": "ip"
          },
          "details": {
            "type": "object",
            "enabled": false
          }
        }
      },
      "metadata": {
        "type": "object",
        "properties": {
          "created_at": {
            "type": "date"
          },
          "updated_at": {
            "type": "date"
          },
          "tags": {
            "type": "keyword"
          },
          "source": {
            "type": "keyword"
          },
          "version": {
            "type": "integer"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          },
          "standard": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      },
      "ranking_scores": {
        "type": "object",
        "properties": {
          "popularity": {
            "type": "float"
          },
          "relevance": {
            "type": "float"
          },
          "quality": {
            "type": "float"
          }
        }
      },
      "permissions": {
        "type": "nested",
        "properties": {
          "user_id": {
            "type": "keyword"
          },
          "role": {
            "type": "keyword"
          },
          "granted_at": {
            "type": "date"
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

Generating data

To generate synthetic data using index mappings, use the generate-data subcommand and provide the required index mappings file, index name, output path, and total amount of data to generate:

osb generate-data --index-name <NAME_OF_DATA_CORPORA> --index-mappings <PATH_TO_INDEX_MAPPINGS> --output-path <DESIRED_OUTPUT_PATH> --total-size <TOTAL_SIZE_OF_DATA_CORPORA_GENERATED_IN_GB>

For a complete list of available parameters and their descriptions, see the generate-data command reference.

Example output

The following is an example output when generating 100 GB of data:

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/


[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status]
[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard.
[NOTE] Example of localhost command for SSH port forwarding (tunneling) from an AWS EC2 instance:
ssh -i <PEM_FILEPATH> -N -L localhost:8787:localhost:8787 ec2-user@<DNS>

Total GB to generate: [1]
Average document size in bytes: [412]
Max file size in GB: [40]

100%|███████████████████████████████████████████████████████████████████| 100.07G/100.07G [3:35:29<00:00, 3.98MB/s]

Generated 24271844660 docs in 12000 seconds. Total dataset size is 100.21GB.
✅ Visit the following path to view synthetically generated data: /home/ec2-user/

-----------------------------------
[INFO] ✅ SUCCESS (took 272 seconds)
-----------------------------------

Advanced configuration

You can control how synthetic data is generated by creating a YAML configuration file. The following is an example configuration file that defines custom rules in the MappingGenerationValues parameter:

MappingGenerationValues:
  # For users who want more granular control over how data is generated when providing an OpenSearch mapping
  generator_overrides:
    # Overrides all instances of generators with these settings. Specify type and params
    integer:
      min: 0
      max: 20
    long:
      min: 0
      max: 1000
    float:
      min: 0.0
      max: 1.0
    double:
      min: 0.0
      max: 2000.0
    date:
      start_date: "2020-01-01"
      end_date: "2023-01-01"
      format: "yyyy-mm-dd"
    text:
      must_include: ["lorem", "ipsum"]
    keyword:
      choices: ["alpha", "beta", "gamma"]

  field_overrides:
    # Specify field name as key of dict. For its values, specify generator and its params. Params must adhere to existing params for each generator
    # For nested fields, use dot notation: Example preferences.allergies if allergies is a subfield of preferences object
    title:
      generator: generate_keyword
      params:
        choices: ["Helly R", "Mark S", "Irving B"]

    promo_codes:
      generator: generate_keyword
      params:
        choices: ["HOT_SUMMER", "TREATSYUM!"]

    # Nested fields, use dot notation
    orders.items.product_id:
      generator: generate_keyword
      params:
        choices: ["Python", "English"]

MappingGenerationValues supports the following parameters.

Parameter Description
generator_overrides Defines custom generator rules for specific OpenSearch field types. Any field that uses the corresponding type will follow these rules. See Generator overrides parameters.
field_overrides Defines generator rules for individual fields by field name. These apply only to the fields explicitly listed. For nested fields, use dot notation (for example, orders.items.product_id). See Field overrides parameters.

If both generator_overrides and field_overrides are present, field_overrides take precedence.

Generator overrides parameters

The following parameters are available for each OpenSearch field type in generator_overrides.

Field type Parameters
integer, long, short, byte min, max
float, double min, max, round (the number of decimal places to round to)
date start_date, end_date, format
text must_include (array of terms to include in generated text)
keyword choices (array of keywords to randomly select from)

Field overrides parameters

The following generators and their parameters are available for use in field_overrides.

Generator Parameters
generate_text must_include (array of terms to include in generated text)
generate_keyword choices (array of keywords to randomly select from)
generate_integer min, max
generate_long min, max
generate_short min, max
generate_byte min, max
generate_float min, max, round (the number of decimal places to round to)
generate_double min, max
generate_boolean N/A
generate_date format, start_date, end_date
generate_ip N/A
generate_geo_point N/A
generate_knn_vector dimension, sample_vectors, noise_factor, distribution_type, normalize. See Generating vectors.
generate_sparse_vector num_tokens, min_weight, max_weight, token_id_start, token_id_step. See Generating vectors.

Using the configuration

To use your configuration file, provide its full path in the --custom-config parameter:

osb generate-data --index-name <NAME_OF_DATA_CORPORA> --index-mappings <PATH_TO_INDEX_MAPPINGS> --output-path <DESIRED_OUTPUT_PATH> --total-size <TOTAL_SIZE_OF_DATA_CORPORA_GENERATED_IN_GB> --custom-config ~/Desktop/sdg-config.yml

350 characters left

Have a question? .

Want to contribute? or .