Synthetics
This section covers the model training and generation APIs shared across all Gretel models.
Synthetic Models
Synthetic data models supported by Gretel APIs.
Gretel Navigator Fine Tuning - LLM-based AI system supporting tabular, time-series, JSON, and natural language text data.
Gretel ACTGAN - Adversarial model for tabular, structured numerical, high column count data.
Gretel Tabular DP - Graph-based model for tabular data with differential privacy.
Gretel GPT - Generative pre-trained transformer for natural language text.
Gretel DGAN - Adversarial model for time-series data.
Gretel Amplify - Statistical model for high volume tabular data.
Gretel LSTM - Language model for tabular, time-series, and text data.
Supported Features
This section compares features of different generative data models supported by Gretel APIs.
✅ = Supported
✖️ = Not yet supported
Feature | Navigator Fine Tuning | ACTGAN | GPT | Tabular DP | DGAN | LSTM | Amplify |
---|---|---|---|---|---|---|---|
Tag | | | | | | | |
Type | Language Model | Generative Adversarial Network | Language Model | Statistical | Generative Adversarial Network | Language Model | Statistical |
Model | Pre-trained Transformer | GAN | Pre-trained Transformer | Probabilistic Graphical Model | GAN | LSTM | Statistical |
Privacy filters | ✖️ | ✅ | ✖️ | ✖️ | ✖️ | ✅ | ✅ |
Privacy metrics | ✅ | ✅ | ✖️ | ✅ | ✖️ | ✅ | ✅ |
Differential privacy | ✖️ | ✖️ | ✅ | ✅ | ✖️ | ✖️ | ✖️ |
 | ✅ | ✅ | ✅ | ✅ | ✖️ | ✅ | ✅ |
 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Tabular | ✅ | ✅ | ✖️ | ✅ | ✅ | ✅ | ✅ |
Time-series | ✅ | ✖️ | ✖️ | ✖️ | ✅ | ✅ | ✖️ |
Natural language | ✅ | ✖️ | ✅ | ✖️ | ✖️ | ✅ | ✖️ |
Conditional generation | ✖️ | ✅ | ✅ | ✖️ | ✖️ | ✅ | ✅ |
Pre-trained | ✅ | ✖️ | ✅ | ✖️ | ✖️ | ✖️ | ✖️ |
Gretel cloud | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Hybrid cloud | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Requires GPU | ✅ | ✅ | ✅ | ✖️ | ✅ | ✅ | ✖️ |
Need help choosing the right synthetic model? Check out our Benchmark Report for a detailed model comparison based on real world datasets.
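As a rough aid for reading the comparison table, the sketch below encodes a few of its rows as Python sets and filters models by required features. The feature keys (`tabular`, `time_series`, and so on) and the helper `models_supporting` are hypothetical names chosen for this example; they are not part of any Gretel API.

```python
# Support sets transcribed from selected rows of the comparison table.
# Feature keys and function names are illustrative, not Gretel identifiers.
ALL_MODELS = {
    "Navigator Fine Tuning", "ACTGAN", "GPT", "Tabular DP",
    "DGAN", "LSTM", "Amplify",
}

SUPPORT = {
    "tabular": ALL_MODELS - {"GPT"},
    "time_series": {"Navigator Fine Tuning", "DGAN", "LSTM"},
    "natural_language": {"Navigator Fine Tuning", "GPT", "LSTM"},
    "differential_privacy": {"GPT", "Tabular DP"},
    "conditional_generation": {"ACTGAN", "GPT", "LSTM", "Amplify"},
}

# Per the "Requires GPU" row, only Tabular DP and Amplify run without a GPU.
REQUIRES_GPU = ALL_MODELS - {"Tabular DP", "Amplify"}


def models_supporting(*features, allow_gpu=True):
    """Return the sorted model names that support every requested feature."""
    candidates = set(ALL_MODELS)
    for feature in features:
        candidates &= SUPPORT[feature]
    if not allow_gpu:
        candidates -= REQUIRES_GPU
    return sorted(candidates)


print(models_supporting("tabular", "differential_privacy"))
print(models_supporting("time_series", "natural_language"))
```

The table remains the reference; the sets above would need updating whenever it changes.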
Model Configuration
All Gretel Synthetics models follow a similar configuration file structure. In a model-config.yaml file:
[model_id] is replaced with the type of model you wish to train (e.g. synthetics, gpt_x, actgan, timeseries_dgan, amplify, tabular_dp).
data_source must point to a valid and accessible file in CSV, JSON, or JSONL format. Supported storage formats include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.
Some models have specific data source format requirements.
data_source: __tmp__ can be used when the source file is specified elsewhere: using the --in_data parameter via CLI, a parameter via SDK, or the dataset button via Console.
The params object contains key-value pairs that represent the available parameters used to train a synthetic data model on the data_source. Parameters are specific to each model type. See a full list of supported parameters in each of the model pages.
Gretel has configuration templates that may be helpful as starting points for creating your model.
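The configuration layout described above can be sketched as a small Python helper that renders the YAML text. This is a minimal sketch under stated assumptions: the top-level `schema_version` key and the `models` list wrapper are assumed here rather than taken from this page, the `epochs` parameter is only an illustrative placeholder, and `render_config` is a hypothetical helper name. Check a Gretel configuration template for the authoritative layout.

```python
# Model identifiers listed in the text above.
KNOWN_MODEL_IDS = {
    "synthetics", "gpt_x", "actgan", "timeseries_dgan", "amplify", "tabular_dp",
}


def render_config(model_id, name, params, data_source="__tmp__"):
    """Return model-config.yaml text for one model (layout is an assumption)."""
    if model_id not in KNOWN_MODEL_IDS:
        raise ValueError(f"unknown model id: {model_id}")
    lines = [
        'schema_version: "1.0"',  # assumed top-level key
        f"name: {name}",
        "models:",  # assumed wrapper list
        f"  - {model_id}:",
        f"      data_source: {data_source}",
        "      params:",
    ]
    # Each parameter becomes one key-value line nested under params.
    lines += [f"        {key}: {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


# "epochs" is a placeholder parameter name used only for illustration.
print(render_config("actgan", "my-model", {"epochs": 100}))
```

Leaving `data_source` at `__tmp__` defers the choice of training file to the CLI, SDK, or Console, as described above.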
Create and Train a Model
Use the gretel models create CLI command to create and train a synthetic model.
--in_data is optional if data_source is specified in the config, and can be used to override the value in the config.
--in_data is required if data_source: __tmp__ is used in the config.
--name is optional, and can be used to override the name specified in the config.
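The option rules above can be expressed as a small command builder. This is a sketch, not the official invocation: the `--config` flag and the helper name `build_create_command` are assumptions, and the file names are placeholders. The function only assembles the argument list; running it (for example with `subprocess.run`) requires the Gretel CLI to be installed and configured.

```python
import shlex

TMP_SOURCE = "__tmp__"


def build_create_command(config_path, config_data_source, in_data=None, name=None):
    """Assemble a `gretel models create` argument list.

    config_data_source is the data_source value found in the config file.
    """
    # --in_data is required when the config defers the source with __tmp__.
    if config_data_source == TMP_SOURCE and in_data is None:
        raise ValueError("--in_data is required when data_source is __tmp__")
    command = ["gretel", "models", "create", "--config", config_path]  # --config is assumed
    if in_data is not None:
        command += ["--in_data", in_data]  # overrides data_source from the config
    if name is not None:
        command += ["--name", name]  # overrides name from the config
    return command


# Placeholder file and model names for illustration.
command = build_create_command(
    "model-config.yaml", TMP_SOURCE, in_data="train.csv", name="my-model"
)
print(shlex.join(command))
```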
During training, the following model artifacts are created:
Filename | Description |
---|---|
data_preview.gz | A preview of your synthetic dataset in CSV format. |
logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
report.html.gz* | HTML report that offers deep insight into the quality of the synthetic model. |
report-json.json.gz* | A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically. |
*Not all models produce a Synthetic Data Quality Report. See the model pages for more details.
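Because the artifacts are gzip-compressed, they can be inspected with the Python standard library once downloaded. The sketch below assumes the files have been saved locally under the names in the table; it makes no assumption about the fields inside the JSON report, and the helper names are hypothetical.

```python
import csv
import gzip
import itertools
import json


def load_json_report(path):
    """Read a gzip-compressed JSON report into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def preview_rows(path, limit=5):
    """Return up to `limit` rows of a gzip-compressed CSV preview as dicts."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        return list(itertools.islice(reader, limit))
```

For example, `load_json_report("report-json.json.gz")` returns the parsed report, whose top-level keys can then be listed before any quality checks are written against them.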
Generate data from a model
Use the gretel models run command to generate data from a synthetic model.
--model-id supports both a model uid and the JSON that models create outputs.
There are many different --param options, depending on the model. The num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
--in_data is optional and is used for conditional data generation when supported by the model.
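The generation options above can be assembled the same way. This is a hedged sketch: it assumes each `--param` takes a key followed by a value as two separate arguments, and the model uid and helper name `build_run_command` are placeholders. It only builds the argument list and does not invoke the CLI.

```python
def build_run_command(model_id, num_records, in_data=None, extra_params=None):
    """Assemble a `gretel models run` argument list."""
    command = ["gretel", "models", "run", "--model-id", model_id]
    # num_records is supported by all synthetic models; other params vary by model.
    params = {"num_records": num_records, **(extra_params or {})}
    for key, value in params.items():
        command += ["--param", key, str(value)]  # key/value pair form is assumed
    if in_data is not None:
        command += ["--in_data", in_data]  # seed data for conditional generation
    return command


# "abc123" stands in for a real model uid.
print(" ".join(build_run_command("abc123", 100)))
```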