Gretel Amplify
Statistical model that supports high volumes of tabular data generation.
The Gretel Amplify model API is designed to rapidly generate large volumes of synthetic data using statistical models and a hyper-efficient multi-processing implementation. While Amplify is effective at learning and recreating distributions and correlations, it typically shows a 10-15% drop in synthetic data accuracy compared with Gretel's deep learning-based models for tabular data.
The Gretel Amplify model is able to generate large quantities of data from real-world or synthetic data. Use cases include:
Creating large amounts of synthetic data to load test an application.
Mimicking real-world data for pre-production environments.
Generating synthetic examples to test an ML model's ability to generalize to new data.
This model can be selected using the amplify model tag. Below is an example configuration that may be used to create a Gretel Amplify model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to .
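A minimal sketch of such a configuration (the exact schema may vary slightly by gretel-client version; compare against the config templates that ship with your client):

```yaml
# Minimal Amplify config sketch; "my-amplify-model" is a hypothetical name.
schema_version: "1.0"
name: my-amplify-model
models:
  - amplify:
      data_source: __tmp__
```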
The configuration below contains additional options for training a Gretel Amplify model, with the default options displayed.
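A sketch of the same configuration with each option spelled out at its default value (the nesting of these options under params is an assumption; consult the template for your client version):

```yaml
# Amplify config sketch with defaults; nesting under `params` is an assumption.
schema_version: "1.0"
name: my-amplify-model
models:
  - amplify:
      data_source: __tmp__
      params:
        num_records: null          # leave null to mirror the training data size
        target_size_mb: null       # target output size in MB, max 5000
        sample_size: 50000         # records generated per core, per iteration
        process_count_cpu: 0       # 0 => N-1 CPUs; (0,1) => fraction of CPUs; >1 => explicit count
        auto_transform_datetime: false
```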
data_source (str, required) - Either __tmp__ or a path to a valid and accessible file in CSV, JSON, or JSONL format.
num_records (int, optional, defaults to null) - Target number of records to generate.
target_size_mb (int, optional, defaults to null) - Target file size of the generated data in megabytes, with a maximum value of 5000 (5 GB).
sample_size (int, optional, defaults to 50000) - The number of records to generate, per core, per iteration.
process_count_cpu (float, optional, defaults to 0) - The number of processes to use for data generation. If 0 is provided, N-1 of the system's CPUs will be used. A number between 0 and 1 (exclusive) is interpreted as the fraction of system CPUs to use. A number greater than 1 sets the explicit number of processes.
auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column is analyzed to determine whether it consists of DateTime values. For each detected column, Amplify automatically converts DateTimes to Unix timestamps (epoch seconds) for model training, then converts them back into DateTime strings after sampling.
If both the num_records and target_size_mb parameters are null, the model will generate a synthetic dataset of the same size and shape as the training data.
Set either num_records or target_size_mb to a positive integer, or leave both null. If both parameters are set to non-null values, the configuration will be invalid.
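For instance, a params block along these lines (a sketch) is valid because only one target is non-null, while setting both num_records and target_size_mb to positive integers would be rejected:

```yaml
params:
  num_records: 5000      # explicit record-count target
  target_size_mb: null   # must stay null while num_records is set
```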
Both num_records and target_size_mb are generation targets. Amplify works by generating N * sample_size records per iteration until the target output size is reached, where N is the number of CPUs used. This means the number of records or megabytes generated will always be at least the target specified, and may occasionally be higher; for example, with 8 CPUs and the default sample_size of 50000, each iteration produces 400000 records, so the output is effectively rounded up to a multiple of that batch size.
Example CLI to generate 1000 additional records from a trained Amplify model:
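As a sketch only; the subcommand and flags below are assumptions that differ across gretel-client CLI versions, so confirm with gretel models run --help:

```bash
# Flag names are assumptions -- verify against `gretel models run --help`.
gretel models run \
  --model-id <model_id> \
  --runner cloud \
  --param num_records 1000 \
  --output .
```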
Example CLI to generate 1 GB from a trained Amplify model:
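Again a sketch, under the same assumptions about the CLI surface:

```bash
# 1 GB = 1000 MB target; flag names are assumptions.
gretel models run \
  --model-id <model_id> \
  --runner cloud \
  --param target_size_mb 1000 \
  --output .
```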
When generating new data, the Amplify model can be given a seed dataset, which is used for conditional data generation. Amplify can be seeded on any fields that were present in the training dataset, and it supports seeding on multiple fields at the same time.
When a seed file is provided, the output of Amplify will contain the same number of records as the seed dataset. However, if a seed record is invalid (for example, a value for a categorical column that was not present in the training dataset), the model will not generate a synthetic record for it.
Example CLI to generate records from a trained Amplify model using a seed file:
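A sketch under the same CLI assumptions; seeds.csv is a hypothetical file whose columns must all appear in the training dataset:

```bash
# Conditional generation: pass the seed file as input data; flags are assumptions.
gretel models run \
  --model-id <model_id> \
  --runner cloud \
  --in-data seeds.csv \
  --output .
```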
Amplify's speed is roughly proportional to the number of CPUs because it employs multi-processing; machines with 8-12 cores therefore deliver the best speed.
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: minimum 4 cores; RAM: 32 GB.
With Amplify, no GPU is required.
This model is trained entirely on the examples provided in the training dataset and will therefore capture, and likely repeat, any biases that exist in the training set. We recommend having a human review the dataset used to train models before using them in production.
Also see the example on how to .