
Synthetics

This section covers the model training and generation APIs shared across all Gretel models.

Synthetic Models

Synthetic data models supported by Gretel APIs.
  • Gretel ACTGAN - Adversarial model for tabular, structured numerical, high column count data.
  • Gretel Tabular DP - Graph-based model for tabular data with differential privacy.
  • Gretel GPT - Generative pre-trained transformer for natural language text.
  • Gretel DGAN - Adversarial model for time-series data.
  • Gretel Amplify - Statistical model for high volume tabular data.
  • Gretel LSTM - Language model for tabular, time series, text data.

Supported Features

This section compares features of different generative data models supported by Gretel APIs.
✅ = Supported
✖️ = Not yet supported
|       | LSTM | ACTGAN | Amplify | DGAN | GPT | Tabular DP |
|-------|------|--------|---------|------|-----|------------|
| Tag   | synthetics | actgan | amplify | timeseries_dgan | gpt_x | tabular_dp |
| Type  | Language Model | Generative Adversarial Network | Statistical | Generative Adversarial Network | Language Model | Statistical |
| Model | LSTM | GAN | Statistical | GAN | Pre-trained Transformer | Probabilistic Graphical Model |
Privacy filters
✖️
✖️
✖️
Differential privacy
✖️
✖️
✖️
✖️
✖️
✖️
Tabular
✖️
Time-series
✖️
✖️
✖️
✖️
Natural language
✖️
✖️
✖️
✖️
Conditional generation
✖️
✖️
Pre-trained
✖️
✖️
✖️
✖️
✖️
Gretel cloud
Hybrid cloud
Requires GPU
✖️
✖️
GitHub - gretelai/gretel-synthetics: Synthetic data generators for structured and unstructured text, featuring differentially private learning.
GitHub
Check out our GitHub for research, source code and examples including our core synthetic data generation library.
Need help choosing the right synthetic model? Check out our Benchmark Report for a detailed model comparison based on real world datasets.

Model Configuration

All Gretel Synthetics models share a similar configuration file structure. Here is an example model-config.yaml:
schema_version: "1.0"
name: "my-model"
models:
  - [model_id]:
      data_source: __tmp__
      params:
        [param_name]: [param_value]
  • [model_id] is replaced with the type of model you wish to train (e.g. synthetics, gpt_x, actgan, timeseries_dgan, amplify, tabular_dp).
  • data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
    • Supported storage locations include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem.
      • Some models have specific data source format requirements.
    • data_source: __tmp__ can be used when the source file is specified elsewhere using:
      • the --in-data parameter via the CLI,
      • the data_source parameter via the SDK,
      • the dataset button via the Console.
  • The params object contains key-value pairs that represent the parameters used to train a synthetic data model on the data_source.
    • Parameters are specific to each model type. See the full list of supported parameters on each model's page.
Gretel has configuration templates that may be helpful as starting points for creating your model.
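As a rough illustration of the structure described above, the following Python sketch builds the same configuration as a dictionary. The helper name build_model_config and the example_param key are hypothetical and exist only for this example; the set of model IDs is copied from the list of tags in this section.

```python
import json

# Model tags listed in this section, treated here as the valid model IDs.
MODEL_IDS = {"synthetics", "gpt_x", "actgan", "timeseries_dgan", "amplify", "tabular_dp"}


def build_model_config(model_id, name="my-model", data_source="__tmp__", params=None):
    """Return a dict shaped like the model-config.yaml example (hypothetical helper)."""
    if model_id not in MODEL_IDS:
        raise ValueError(f"unknown model_id: {model_id!r}")
    return {
        "schema_version": "1.0",
        "name": name,
        "models": [
            {model_id: {"data_source": data_source, "params": dict(params or {})}}
        ],
    }


# "example_param" is a placeholder, not a real model parameter.
config = build_model_config("actgan", params={"example_param": 1})
print(json.dumps(config, indent=2))
```

Leaving data_source at __tmp__ mirrors the example config, so the actual training data would still have to be supplied through the CLI, SDK, or Console.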

Create and Train a Model

Use the following CLI command to create and train a synthetic model.
gretel models create \
  --config [config_file_path] \
  --name [model_name] \
  --runner cloud \
  --in-data [data_source] > my-model.json
  • --in-data is optional if data_source is specified in the config, and can be used to override the value in the config.
  • --in-data is required if data_source: __tmp__ is used in the config.
  • --name is optional, and can be used to override the name specified in the config.
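The option rules above can be sketched as a small Python helper that assembles the argument list for gretel models create. This is only an illustration: build_create_command is a hypothetical name, the file names are placeholders, and the flags are taken from the command shown in this section (using the --in-data spelling from that command).

```python
import shlex


def build_create_command(config_path, config_data_source, in_data=None, name=None):
    """Assemble argv for `gretel models create` (hypothetical helper)."""
    # --in-data is mandatory when the config only carries the __tmp__ placeholder.
    if config_data_source == "__tmp__" and in_data is None:
        raise ValueError("--in-data is required when the config uses data_source: __tmp__")
    argv = ["gretel", "models", "create", "--config", config_path, "--runner", "cloud"]
    if name is not None:
        argv += ["--name", name]  # overrides the name in the config
    if in_data is not None:
        argv += ["--in-data", in_data]  # overrides data_source in the config
    return argv


# Placeholder file names, for illustration only.
argv = build_create_command("model-config.yaml", "__tmp__", in_data="train.csv")
print(shlex.join(argv))
```

The resulting list could be passed to subprocess.run, with standard output redirected to a file such as my-model.json as in the command above.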
With the SDK, first designate a project:
from gretel_client.projects import create_or_get_unique_project
project = create_or_get_unique_project(name="project-name")
Create model object and submit for training
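from gretel_client.helpers import poll

# [config] is the model configuration; [training_data] is the training data source.
model = project.create_model_obj(model_config=[config], data_source=[training_data])
model.submit_cloud()  # submit the training job to Gretel Cloud
poll(model)  # wait for the job to finish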
During training, the following model artifacts are created:
| Filename | Description |
|----------|-------------|
| data_preview.gz | A preview of your synthetic dataset in CSV format. |
| logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
| report.html.gz* | HTML report that offers deep insight into the quality of the synthetic model. |
| report-json.json.gz* | A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically. |
*Not all models produce a Synthetic Data Quality Report. See the models page for more details.
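For programmatic validation, a downloaded report-json.json.gz could be read with the Python standard library, roughly as follows. This is a sketch under assumptions: load_json_artifact is a hypothetical helper, and the stand-in file written here uses a made-up example_field key because the real report schema is not described in this section.

```python
import gzip
import json
import os
import tempfile


def load_json_artifact(path):
    """Read a gzip-compressed JSON artifact such as report-json.json.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Write a stand-in artifact so the example runs without a trained model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report-json.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump({"example_field": 1}, fh)  # placeholder content, not the real schema
    report = load_json_artifact(path)
    print(sorted(report))
```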

Generate data from a model

Use the gretel models run command to generate data from a synthetic model.
gretel models run --model-id my-model.json \
  --runner cloud \
  --param num_records [num] \
  --in-data [prompts.csv] \
  --output .
  • --model-id accepts either a model UID or the JSON file written by gretel models create.
  • There are many different --param options, depending on the model.
    • The num_records param is supported by all synthetic models and tells the model how many new rows to generate.
  • --in-data is optional and is used for conditional data generation when supported by the model.
With the SDK, create and submit a record handler:
# Generate more records from the model
record_handler = model.create_record_handler_obj(
    params={"num_records": 100}
)
record_handler.submit_cloud()
poll(record_handler)
There are many different params options, depending on the model.
  • num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
View results
import pandas as pd

# Load the gzip-compressed synthetic data into a DataFrame
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
synthetic_df
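One simple check after generation is that the number of rows matches the requested num_records. The sketch below does this with the standard library on gzip-compressed CSV bytes. It assumes the artifact has a single header row, uses a hypothetical count_csv_rows helper, and runs on made-up sample data so that no Gretel job is needed.

```python
import csv
import gzip
import io


def count_csv_rows(gz_bytes):
    """Count data rows (excluding the header) in gzip-compressed CSV bytes."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row, if any
        return sum(1 for _ in reader)


# Made-up sample standing in for a downloaded "data" artifact.
sample = gzip.compress(b"id,value\n1,a\n2,b\n3,c\n")
num_records = 3
print(count_csv_rows(sample) == num_records)
```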