Synthetics
This section covers the model training and generation APIs shared across all Gretel models.
Synthetic Models
Gretel offers the following synthetics models:
Tabular Fine-Tuning - Gretel’s flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.
Data types: Numeric, categorical, text, JSON, event-based
Differential privacy: Optional
Formerly known as: Navigator Fine Tuning
Text Fine-Tuning - Gretel’s model for generating privacy-preserving synthetic text using your choice of top performing open-source models.
Data types: Text
Differential privacy: Optional
Formerly known as: GPT
Tabular GAN - Gretel’s model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.
Data types: Numeric, categorical
Differential privacy: NOT supported
Formerly known as: ACTGAN
Tabular DP - Gretel’s model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.
Data types: Numeric, categorical
Differential privacy: Required; you cannot run without differential privacy
Supported Features
This section compares features of different generative data models supported by Gretel APIs.
✅ = Supported
✖️ = Not yet supported
| Tag | navigator_ft | gpt_x | actgan | tabular_dp | timeseries_dgan |
| --- | --- | --- | --- | --- | --- |
| Type | Language Model | Language Model | Generative Adversarial Network | Statistical | Generative Adversarial Network |
| Model | Pre-trained Transformer | Pre-trained Transformer | GAN | Probabilistic Graphical Model | GAN |
| Privacy filters | ✖️ | ✖️ | ✅ | ✖️ | ✖️ |
| Privacy metrics | ✅ | ✖️ | ✅ | ✅ | ✖️ |
| Differential privacy | ✖️ | ✅ | ✖️ | ✅ | ✖️ |
| | ✅ | ✅ | ✅ | ✅ | ✖️ |
| Tabular | ✅ | ✖️ | ✅ | ✅ | ✅ |
| Time-series | ✅ | ✖️ | ✖️ | ✖️ | ✅ |
| Natural language | ✅ | ✅ | ✖️ | ✖️ | ✖️ |
| Conditional generation | ✖️ | ✅ | ✅ | ✖️ | ✖️ |
| Pre-trained | ✅ | ✅ | ✖️ | ✖️ | ✖️ |
| Gretel cloud | ✅ | ✅ | ✅ | ✅ | ✅ |
| Hybrid cloud | ✅ | ✅ | ✅ | ✅ | ✅ |
| Requires GPU | ✅ | ✅ | ✅ | ✖️ | ✅ |
Need help choosing the right synthetic model? Check out our Benchmark Report for a detailed model comparison based on real-world datasets.
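The feature comparison can also be checked in code. The sketch below copies a few rows of the table into a Python dictionary and filters model tags by required features. The names `MODELS`, `FEATURES`, and `models_with` are illustrative and not part of any Gretel API.

```python
# Column order follows the feature table.
MODELS = ["navigator_ft", "gpt_x", "actgan", "tabular_dp", "timeseries_dgan"]

# A few rows transcribed from the table (True = supported).
FEATURES = {
    "Tabular": [True, False, True, True, True],
    "Time-series": [True, False, False, False, True],
    "Natural language": [True, True, False, False, False],
    "Conditional generation": [False, True, True, False, False],
    "Requires GPU": [True, True, True, False, True],
}


def models_with(*features):
    """Return the model tags that are marked as supported for every feature."""
    return [
        tag
        for i, tag in enumerate(MODELS)
        if all(FEATURES[name][i] for name in features)
    ]


print(models_with("Tabular", "Conditional generation"))
print(models_with("Natural language"))
```

Passing several feature names narrows the result to models that support all of them, which is a quick way to shortlist candidates before reading the individual model pages.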
Model Configuration
All Gretel Synthetics models follow a similar configuration file structure. In a configuration file such as model-config.yaml:
[model_id] is a placeholder for the type of model you wish to train (e.g. navigator_ft, gpt_x, actgan, tabular_dp).
data_source must point to a valid and accessible file in CSV, JSON, or JSONL format. Supported storage locations include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem. Some models have specific data source format requirements.
data_source: __tmp__ can be used when the source file is specified elsewhere: using the --in_data parameter via the CLI, a parameter via the SDK, or the dataset button via the Console.
The params object contains the key-value pairs used to train a synthetic data model on the data_source. Parameters are specific to each model type; see the full list of supported parameters on each model's page.
Gretel has configuration templates that may be helpful as starting points for creating your model.
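As a rough illustration of the structure described above, the sketch below assembles a config as a Python dictionary and prints it as JSON. The `schema_version` and `models` keys, the default name, and the parameter in the example call are assumptions made for illustration; confirm the exact layout and parameter names against a Gretel configuration template. YAML 1.2 is a superset of JSON, so most YAML parsers should accept the printed text.

```python
import json


def build_config(model_id, data_source="__tmp__", params=None, name="my-model"):
    """Assemble a model config as a plain dictionary.

    The schema_version and models keys are assumptions about the file
    layout, not a confirmed specification.
    """
    if not model_id:
        raise ValueError("model_id is required, e.g. 'navigator_ft' or 'actgan'")
    return {
        "schema_version": "1.0",  # assumed value
        "name": name,
        "models": [
            {
                model_id: {
                    "data_source": data_source,
                    "params": dict(params or {}),
                }
            }
        ],
    }


# "your_param_name" is a hypothetical placeholder, not a real model parameter.
config = build_config("actgan", params={"your_param_name": "your_value"})
print(json.dumps(config, indent=2))
```

Leaving `data_source` at `__tmp__` defers the choice of input file to the CLI, SDK, or Console.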
Create and Train a Model
Use the gretel models create CLI command to create and train a synthetic model.
--in_data is optional if data_source is specified in the config, and can be used to override the value in the config.
--in_data is required if data_source: __tmp__ is used in the config.
--name is optional, and can be used to override the name specified in the config.
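The option rules above can be sketched as a small helper that builds the command line without running it. The `--config` flag and the file names are assumptions made for illustration; `create_command` is a hypothetical helper, not part of the Gretel CLI or SDK.

```python
import shlex


def create_command(config_path, config_data_source, in_data=None, name=None):
    """Build the argument list for `gretel models create`.

    config_data_source is the data_source value written in the config file.
    """
    if config_data_source == "__tmp__" and in_data is None:
        raise ValueError("--in_data is required when data_source is __tmp__")
    # --config is an assumed flag name for passing the config file.
    args = ["gretel", "models", "create", "--config", config_path]
    if in_data is not None:
        args += ["--in_data", in_data]  # overrides data_source from the config
    if name is not None:
        args += ["--name", name]  # overrides name from the config
    return args


cmd = create_command("model-config.yaml", "__tmp__", in_data="train.csv", name="my-model")
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.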
During training, the following model artifacts are created:
data_preview.gz
A preview of your synthetic dataset in CSV format.
logs.json.gz
Log output from the synthetic worker that is helpful for debugging.
report.html.gz
HTML report that offers deep insight into the quality of the synthetic model.
report-json.json.gz
A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically.
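Because these artifacts are gzip-compressed, they can be read with the Python standard library once downloaded. The sketch below shows one way to load the JSON report and peek at the CSV preview. The stand-in file contents are made up for the demo and do not reflect the real report schema or real preview columns.

```python
import csv
import gzip
import json
import os
import tempfile


def load_json_artifact(path):
    """Read a gzip-compressed JSON artifact such as report-json.json.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def load_preview_rows(path, limit=5):
    """Read up to `limit` rows of a gzip-compressed CSV such as data_preview.gz."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        return [row for _, row in zip(range(limit), reader)]


# Demo with stand-in files; real artifacts come from a trained model.
workdir = tempfile.mkdtemp()
report_path = os.path.join(workdir, "report-json.json.gz")
preview_path = os.path.join(workdir, "data_preview.gz")

with gzip.open(report_path, "wt", encoding="utf-8") as handle:
    json.dump({"made_up_field": 1}, handle)  # not the real report schema
with gzip.open(preview_path, "wt", encoding="utf-8", newline="") as handle:
    handle.write("age,city\n34,Lyon\n51,Oslo\n")  # made-up preview rows

report = load_json_artifact(report_path)
rows = load_preview_rows(preview_path, limit=1)
print(sorted(report))  # top-level keys of the report
print(rows)
```

Listing the top-level keys first is a reasonable way to explore the report before writing checks against specific fields.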
Generate Data from a Model
Use the gretel models run command to generate data from a synthetic model.
--model-id supports both a model uid and the JSON that models create outputs.
There are many different --param options, depending on the model. The num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
--in_data is optional and used for conditional data generation when supported by the model.
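A generation command built from the options above might look like the following sketch. It assumes that `--param` takes a name followed by a value as two separate arguments; `run_command` is a hypothetical helper and the model ID and file name are placeholders.

```python
import shlex


def run_command(model_id, num_records, in_data=None):
    """Build the argument list for `gretel models run`.

    model_id may be a model uid or the path to the JSON that
    `models create` outputs.
    """
    if num_records <= 0:
        raise ValueError("num_records must be a positive integer")
    args = [
        "gretel", "models", "run",
        "--model-id", model_id,
        # Assumed form: parameter name and value as separate arguments.
        "--param", "num_records", str(num_records),
    ]
    if in_data is not None:
        args += ["--in_data", in_data]  # seed data for conditional generation
    return args


cmd = run_command("my-model-uid", 100)
print(shlex.join(cmd))
```

Only pass `in_data` for models that support conditional generation; other models have no use for seed data.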