Synthetic Models

Gretel currently has five synthetic models, with more models and supported datatypes coming soon.
  • Gretel LSTM - deep learning model that supports tabular, time-series, and natural language text data.
  • Gretel ACTGAN - adversarial model that supports tabular data, structured numerical data, and high column count data.
  • Gretel Amplify - statistical model that supports high volumes of tabular data generation.
  • Gretel DGAN - adversarial model for time series data.
  • Gretel GPT - generative pre-trained transformer for natural language text generation.

Model Configuration

All Gretel Synthetics models follow a similar configuration file structure. Here is an example model-config.yaml:
schema_version: "1.0"
name: "my-model"
models:
  - [model_id]:
      data_source: __tmp__
      params:
        [param_name]: [param_value]
  • [model_id] is replaced with the type of model you wish to train (e.g. synthetics, gpt_x, actgan, timeseries_dgan, amplify).
  • data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
    • Supported storage locations include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem.
      • Some models have specific data source format requirements.
    • data_source: __tmp__ can be used when the source file is specified elsewhere using:
      • the --in-data parameter via the CLI,
      • the data_source parameter via the SDK,
      • the dataset button via the Console.
  • The params object contains key-value pairs that represent the available parameters that will be used to train a synthetic data model on the data_source.
    • Parameters are specific to each model type. See the full list of supported parameters on each model's page.
Gretel has configuration templates that may be helpful as starting points for creating your model.
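The configuration rules above can be sketched in Python. This is a minimal illustration, not part of the Gretel SDK: build_config and needs_in_data are hypothetical helper names, the models and params keys reflect an assumed nesting of the YAML layout, and the extension check is a simplification of the CSV, JSON, or JSONL requirement.

```python
import os

# File formats that data_source may point to, per the list above.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".jsonl"}
# Placeholder meaning "the data is supplied elsewhere (CLI, SDK, or Console)".
PLACEHOLDER = "__tmp__"


def build_config(model_id, data_source=PLACEHOLDER, params=None, name="my-model"):
    """Return a dict mirroring the model-config.yaml layout (hypothetical helper)."""
    if data_source != PLACEHOLDER:
        extension = os.path.splitext(data_source)[1].lower()
        if extension not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"unsupported data_source format: {data_source!r}")
    return {
        "schema_version": "1.0",
        "name": name,
        "models": [
            {model_id: {"data_source": data_source, "params": dict(params or {})}}
        ],
    }


def needs_in_data(config):
    """True when the config still uses the __tmp__ placeholder."""
    model_block = next(iter(config["models"][0].values()))
    return model_block["data_source"] == PLACEHOLDER
```

A config for which needs_in_data returns True must receive its training data at submission time, for example through the CLI or the SDK.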

Create and Train a Model

Use the following CLI command to create and train a synthetic model.
gretel models create \
--config [config_file_path] \
--name [model_name] \
--runner cloud \
--in-data [data_source] > my-model.json
  • --in-data is optional if data_source is specified in the config, and can be used to override the value in the config.
  • --in-data is required if data_source: __tmp__ is used in the config.
  • --name is optional, and can be used to override the name specified in the config.
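When the create command is driven from a script, the optional flags described above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the command; build_create_command and create_model are hypothetical helper names, and the gretel CLI is assumed to be installed and configured before create_model is called.

```python
import subprocess


def build_create_command(config_path, name=None, in_data=None, runner="cloud"):
    """Assemble the argument list for `gretel models create` (hypothetical helper)."""
    cmd = ["gretel", "models", "create", "--config", config_path, "--runner", runner]
    if name is not None:
        # Optional: overrides the name specified in the config.
        cmd += ["--name", name]
    if in_data is not None:
        # Required when the config uses data_source: __tmp__.
        cmd += ["--in-data", in_data]
    return cmd


def create_model(config_path, output_path="my-model.json", **kwargs):
    """Run the command and save stdout, like `> my-model.json` in the shell."""
    with open(output_path, "w") as out:
        subprocess.run(
            build_create_command(config_path, **kwargs), stdout=out, check=True
        )
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.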

Designate project

from gretel_client.projects import create_or_get_unique_project
project = create_or_get_unique_project(name="project-name")

Create model object and submit for training

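from gretel_client.helpers import poll
model = project.create_model_obj(model_config=[config], data_source=[training_data])
model.submit_cloud()  # submit the training job to Gretel Cloud
poll(model)  # wait for training to finish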
During training, the following model artifacts are created:
  • A preview of your synthetic dataset in CSV format.
  • Log output from the synthetic worker that is helpful for debugging.
  • An HTML report that offers deep insight into the quality of the synthetic model.
  • A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically.
Note: Not all models produce a Synthetic Data Quality Report. See the models page for more details.

Generate data from a model

Use the gretel models run command to generate data from a synthetic model.
gretel models run --model-id my-model.json \
--runner cloud \
--param num_records [num] \
--in-data [prompts.csv] \
--output .
  • --model-id accepts either a model uid or the JSON file that gretel models create outputs.
  • There are many different --param options, depending on the model.
    • num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
  • --in-data is optional and is used for conditional data generation when supported by the model.
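Because each generation parameter is passed as its own --param key value pair, a small helper can expand a dictionary of parameters into the argument list. This is a sketch under the assumption that the flags behave exactly as in the command shown above; build_run_command is a hypothetical name and is not part of the Gretel CLI or SDK.

```python
def build_run_command(model_id, params=None, in_data=None, output=".", runner="cloud"):
    """Assemble the argument list for `gretel models run` (hypothetical helper)."""
    cmd = ["gretel", "models", "run", "--model-id", model_id, "--runner", runner]
    for key, value in (params or {}).items():
        # One --param flag per parameter, e.g. --param num_records 100.
        cmd += ["--param", key, str(value)]
    if in_data is not None:
        # Optional: prompts for conditional generation, when the model supports it.
        cmd += ["--in-data", in_data]
    cmd += ["--output", output]
    return cmd


command = build_run_command("my-model.json", params={"num_records": 100})
```

The resulting list can be handed to subprocess.run once the gretel CLI is installed and configured.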

Create and submit record handler

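# Generate more records from the model
record_handler = model.create_record_handler_obj(
    params={"num_records": 100}
)
record_handler.submit_cloud()  # submit the generation job to Gretel Cloud
poll(record_handler)  # wait for generation to finish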
There are many different params options, depending on the model.
  • num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.

View results

import pandas as pd
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
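A quick sanity check after generation is to confirm that the number of rows matches the requested num_records. The sketch below uses only the standard library and assumes the artifact has already been downloaded as gzip-compressed CSV bytes, which matches the compression="gzip" setting above. count_records is a hypothetical helper, and the sample bytes are stand-in data, not real model output.

```python
import csv
import gzip
import io


def count_records(gzipped_csv: bytes) -> int:
    """Count the data rows (excluding the header) in a gzip-compressed CSV."""
    text = gzip.decompress(gzipped_csv).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return sum(1 for _ in reader)


# Stand-in for a downloaded artifact: a header line plus three records.
sample = gzip.compress(b"id,value\n1,a\n2,b\n3,c\n")
record_count = count_records(sample)
```

With a real artifact, the count can be compared with the num_records value passed to the record handler.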
Check out our GitHub repository, gretelai/gretel-synthetics, for research, source code, and examples, including our core synthetic data generation library for structured and unstructured text, featuring differentially private learning.