Gretel Amplify

Statistical model that supports high volumes of tabular data generation.

The Gretel Amplify model API is designed to rapidly generate large volumes of synthetic data using statistical models and a hyper-efficient multi-processing implementation. While Amplify is effective at learning and recreating distributions and correlations, it typically shows a 10-15% drop in synthetic data accuracy versus Gretel's deep learning-based models for tabular data.

Use Cases

The Gretel Amplify model can generate large quantities of data from real-world or synthetic data. Use cases include:

  • Creating large amounts of synthetic data to load test an application.

  • Mimicking real-world data for pre-production environments.

  • Generating synthetic examples to test an ML model's ability to generalize to new data.

Model creation

The configuration below contains the options for training a Gretel Amplify model, with the default values displayed.

schema_version: '1.0'

models:
  - amplify:
      data_source: __tmp__
      num_records: null
      target_size_mb: null
      sample_size: 50000
      process_count_cpu: 0
      auto_transform_datetime: false
  • data_source (str, required) - Use __tmp__ or point to a valid and accessible file in CSV, JSON, or JSONL format.

  • num_records (int, optional, defaults to null) - Target number of records to generate

  • target_size_mb (int, optional, defaults to null) - Target file size of the generated data in megabytes, with a maximum value of 5000 (5GB)

  • sample_size (int, optional, defaults to 50000) - The number of records to generate, per core, per iteration

  • process_count_cpu (float, optional, defaults to 0) - The number of processes to use for data generation.

    • If 0 is provided, all available CPUs minus one (N-1) will be used.

    • A number between 0 and 1 (exclusive) is interpreted as the fraction of system CPUs to use.

    • A number greater than 1 sets an explicit number of processes to use.

  • auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column is analyzed to determine whether it contains DateTime values. For each detected column, Amplify automatically converts the DateTimes to Unix timestamps (epoch seconds) for model training, then converts them back into DateTime strings after sampling.
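The process_count_cpu rules above can be sketched as a small helper. This is an illustrative sketch of the documented behavior, not Amplify's actual implementation:

```python
import os


def resolve_process_count(process_count_cpu: float, total_cpus: int = None) -> int:
    """Illustrative sketch of the process_count_cpu rules described above."""
    n = total_cpus if total_cpus is not None else os.cpu_count()
    if process_count_cpu == 0:
        # 0 means "use all available CPUs minus one" (N-1)
        return max(1, n - 1)
    if 0 < process_count_cpu < 1:
        # A fraction between 0 and 1 selects that share of system CPUs
        return max(1, int(n * process_count_cpu))
    # A number greater than 1 is an explicit process count
    return int(process_count_cpu)


print(resolve_process_count(0, total_cpus=8))    # 7
print(resolve_process_count(0.5, total_cpus=8))  # 4
print(resolve_process_count(4, total_cpus=8))    # 4
```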

If both num_records and target_size_mb parameters are null, the model will generate a synthetic dataset of the same size and shape as the training data.

Set either num_records or target_size_mb to a positive integer, or leave both null. If both parameters are set to non-null values, the configuration will be invalid.

Both num_records and target_size_mb are generation targets. Amplify works by generating N*sample_size records until the target output size is reached, where N is the number of CPUs used. This means that the number of records or megabytes generated will always be at least the target size specified, and may occasionally be higher.
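The overshoot described above can be worked through with a quick arithmetic sketch (hypothetical numbers; not Gretel code):

```python
import math


def records_generated(target_records: int, sample_size: int, cpus: int) -> int:
    """Records produced when each iteration generates cpus * sample_size
    records and iterations repeat until the target is met. The result is
    always at least the target, and may be higher."""
    per_iteration = cpus * sample_size
    iterations = math.ceil(target_records / per_iteration)
    return iterations * per_iteration


# Example: a 1,000,000-record target on 7 worker processes with the
# default sample_size of 50,000 overshoots to 1,050,000 records.
print(records_generated(1_000_000, 50_000, 7))  # 1050000
```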

Data generation

Example CLI to generate 1000 additional records from a trained Amplify model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param num_records 1000 \
  --output .

Example CLI to generate 1 GB of data from a trained Amplify model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param target_size_mb 1000 \
  --output .

Smart seeding

When generating new data, an Amplify model can be given a seed dataset, which is used for conditional data generation. Amplify can be seeded on any fields that were present in the training dataset, and it supports seeding on multiple fields at the same time.

When a seed file is provided, the output of Amplify will contain the same number of records as the seed dataset. However, if a seed record is invalid (e.g., a value for a categorical column that was not present in the training dataset), the model will not generate a synthetic record for it.
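The seed-validity rule can be illustrated with a short sketch that drops seed records whose categorical values never appeared in training. This is a toy illustration of the documented behavior, not Amplify's internals:

```python
def valid_seeds(seed_records, training_categories):
    """Keep only seed records whose categorical values were seen in
    training; invalid seeds produce no synthetic record."""
    return [
        rec for rec in seed_records
        if all(
            rec[col] in seen
            for col, seen in training_categories.items()
            if col in rec
        )
    ]


# Hypothetical training data contained only these state values
training_categories = {"state": {"CA", "NY", "TX"}}
seeds = [{"state": "CA"}, {"state": "ZZ"}, {"state": "NY"}]
print(valid_seeds(seeds, training_categories))  # [{'state': 'CA'}, {'state': 'NY'}]
```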

Example CLI to generate records from a trained Amplify model using a seed file:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --in-data my_seed_file.csv \
  --output .

Minimum requirements

Amplify's speed scales roughly with the number of CPUs because it uses multi-processing, so machines with 8-12 cores offer optimal speed.

If running this system in local mode (on-premises), the following instance types are recommended.

CPU: Minimum 4 cores, 32GB RAM.

No GPU is required for Amplify.

Limitations and Biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture, and likely repeat, any biases that exist in the training set. We recommend having a human review the dataset used to train models before using it in production.
