Gretel Amplify

Statistical model that supports high volumes of tabular data generation.

The Gretel Amplify model API is designed to rapidly generate large volumes of synthetic data using statistical models and a hyper-efficient multi-processing implementation. While Amplify is effective at learning and recreating distributions and correlations, it typically shows a 10-15% drop in synthetic data accuracy versus Gretel's deep learning-based models for tabular data.

Use Cases

The Gretel Amplify model can generate large quantities of data from real-world or synthetic data. Use cases include:

  • Creating large amounts of synthetic data to load test an application.

  • Mimicking real-world data for pre-production environments.

  • Generating synthetic examples to test an ML model's ability to generalize to new data.

Model creation

The configuration below contains the options for training a Gretel Amplify model, with the default values displayed.

schema_version: '1.0'

models:
  - amplify:
      data_source: __tmp__
      num_records: null
      target_size_mb: null
      sample_size: 50000
      process_count_cpu: 0
      auto_transform_datetime: false
  • data_source (str, required) - Use __tmp__ or point to a valid and accessible file in CSV, JSON, or JSONL format.

  • num_records (int, optional, defaults to null) - Target number of records to generate

  • target_size_mb (int, optional, defaults to null) - Target file size of the generated data in megabytes, with a maximum value of 5000 (5GB)

  • sample_size (int, optional, defaults to 50000) - The number of records to generate, per core, per iteration

  • process_count_cpu (float, optional, defaults to 0) - The number of processes to use for data generation.

    • If 0 is provided, all available CPUs minus one (N-1) will be used.

    • A number between 0 and 1 (exclusive) is interpreted as the fraction of system CPUs to use.

    • A number greater than 1 sets an explicit number of processes to use.

  • auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column is analyzed to determine whether it contains DateTime values. For each detected column, Amplify automatically converts the DateTimes to Unix timestamps (epoch seconds) for model training, then converts them back into DateTime strings after sampling.
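The process_count_cpu rules above can be sketched as a small helper. This is an illustrative sketch of the documented behavior, not Amplify's actual implementation:

```python
import os


def resolve_process_count(process_count_cpu: float, total_cpus: int = None) -> int:
    """Illustrative sketch of the process_count_cpu rules described above."""
    n = total_cpus if total_cpus is not None else os.cpu_count()
    if process_count_cpu == 0:
        # 0 means "use all available CPUs minus one" (N-1)
        return max(1, n - 1)
    if 0 < process_count_cpu < 1:
        # A fraction between 0 and 1 selects that share of system CPUs
        return max(1, int(n * process_count_cpu))
    # A number greater than 1 is an explicit process count
    return int(process_count_cpu)


print(resolve_process_count(0, total_cpus=8))    # 7
print(resolve_process_count(0.5, total_cpus=8))  # 4
print(resolve_process_count(4, total_cpus=8))    # 4
```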

If both num_records and target_size_mb parameters are null, the model will generate a synthetic dataset of the same size and shape as the training data.

Set either num_records or target_size_mb to a positive integer, or leave both null. If both parameters are set to non-null values, the configuration will be invalid.

Both num_records and target_size_mb are generation targets. Amplify works by generating N*sample_size records until the target output size is reached, where N is the number of CPUs used. This means that the number of records or megabytes generated will always be at least the target size specified, and may occasionally be higher.
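The overshoot described above can be worked through with a quick arithmetic sketch (hypothetical numbers; not Gretel code):

```python
import math


def records_generated(target_records: int, sample_size: int, cpus: int) -> int:
    """Records produced when each iteration generates cpus * sample_size
    records and iterations repeat until the target is met. The result is
    always at least the target, and may be higher."""
    per_iteration = cpus * sample_size
    iterations = math.ceil(target_records / per_iteration)
    return iterations * per_iteration


# Example: a 1,000,000-record target on 7 worker processes with the
# default sample_size of 50,000 overshoots to 1,050,000 records.
print(records_generated(1_000_000, 50_000, 7))  # 1050000
```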

Data generation

Example CLI to generate 1000 additional records from a trained Amplify model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param num_records 1000 \
  --output .

Example CLI to generate 1 GB of data from a trained Amplify model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param target_size_mb 1000 \
  --output .

Smart seeding

When generating new data, an Amplify model can be given a seed dataset, which is used for conditional data generation. Amplify can be seeded on any fields that were present in the training dataset, and it supports seeding on multiple fields at the same time.

When a seed file is provided, the output of Amplify will contain the same number of records as the seed dataset. However, if a seed record is invalid (e.g., a value for a categorical column that was not present in the training dataset), the model will not generate a synthetic record for it.
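The seed-validity rule can be illustrated with a short sketch that drops seed records whose categorical values never appeared in training. This is a toy illustration of the documented behavior, not Amplify's internals:

```python
def valid_seeds(seed_records, training_categories):
    """Keep only seed records whose categorical values were seen in
    training; invalid seeds produce no synthetic record."""
    return [
        rec for rec in seed_records
        if all(
            rec[col] in seen
            for col, seen in training_categories.items()
            if col in rec
        )
    ]


# Hypothetical training data contained only these state values
training_categories = {"state": {"CA", "NY", "TX"}}
seeds = [{"state": "CA"}, {"state": "ZZ"}, {"state": "NY"}]
print(valid_seeds(seeds, training_categories))  # [{'state': 'CA'}, {'state': 'NY'}]
```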

Example CLI to generate records from a trained Amplify model using a seed file:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --in-data my_seed_file.csv \
  --output .

Minimum requirements

Amplify's speed scales roughly with the number of CPUs because it uses multi-processing, so machines with 8-12 cores offer optimal speed.

If running this system in local mode (on-premises), the following instance types are recommended.

CPU: Minimum 4 cores, 32GB RAM.

No GPU is required for Amplify.

Limitations and Biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture, and likely repeat, any biases that exist in the training set. We recommend having a human review the dataset used to train models before using it in production.
