Synthetics

This section covers the model training and generation APIs shared across all Gretel models.

Synthetic Models

Synthetic data models supported by Gretel APIs.

  • Gretel ACTGAN - Adversarial model for tabular, structured numerical, high column count data.

  • Gretel Tabular DP - Graph-based model for tabular data with differential privacy.

  • Gretel GPT - Generative pre-trained transformer for natural language text.

  • Gretel DGAN - Adversarial model for time-series data.

  • Gretel Amplify - Statistical model for high volume tabular data.

  • Gretel LSTM - Language model for tabular, time series, text data.

Supported Features

This section compares features of different generative data models supported by Gretel APIs.

✅ = Supported

✖️ = Not yet supported

| | LSTM | ACTGAN | Amplify | DGAN | GPT | Tabular DP |
|---|---|---|---|---|---|---|
| Tag | synthetics | actgan | amplify | timeseries_dgan | gpt_x | tabular_dp |
| Type | Language Model | Generative Adversarial Network | Statistical | Generative Adversarial Network | Language Model | Statistical |
| Model | LSTM | GAN | Statistical | GAN | Pre-trained Transformer | Probabilistic Graphical Model |

Privacy filters

✖️

✖️

✖️

Differential privacy

✖️

✖️

✖️

✖️

✖️

✖️

Tabular

✖️

Time-series

✖️

✖️

✖️

✖️

Natural language

✖️

✖️

✖️

✖️

Conditional generation

✖️

✖️

Pre-trained

✖️

✖️

✖️

✖️

✖️

Gretel cloud

Hybrid cloud

Requires GPU

✖️

✖️

Need help choosing the right synthetic model? Check out our Benchmark Report for a detailed model comparison based on real-world datasets.

Model Configuration

All Gretel Synthetics models share a similar configuration file structure. Here is an example model-config.yaml:

schema_version: "1.0"
name: "my-model"

models:
  - [model_id]:
      data_source: __tmp__
      params:
          [param_name]: [param_value]
      
  • [model_id] is replaced with the type of model you wish to train (e.g. synthetics, gpt_x, actgan, timeseries_dgan, amplify, tabular_dp).

  • data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.

    • Supported storage locations include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem.

      • Some models have specific data source format requirements.

    • data_source: __tmp__ can be used when the source file is specified elsewhere using:

      • the --in-data parameter via CLI,

      • the corresponding parameter via SDK,

      • the dataset button via Console.

  • The params object contains the key-value pairs used to train a synthetic data model on the data_source.

    • Parameters are specific to each model type. See the full list of supported parameters on each model's page.
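As an illustration of the template above, the following sketch writes a filled-in model-config.yaml from the shell. It assumes the actgan model type and a single parameter named epochs; both the parameter name and its value are placeholders for illustration and should be checked against the ACTGAN model page before use.

```shell
# Write an example config for an ACTGAN model.
# The model name and the epochs parameter are illustrative assumptions.
cat > model-config.yaml <<'EOF'
schema_version: "1.0"
name: "my-actgan-model"

models:
  - actgan:
      data_source: __tmp__   # actual file is supplied at training time
      params:
        epochs: 100          # assumed parameter; see the model's page
EOF
```

Leaving data_source as __tmp__ keeps the config reusable across datasets, since the training file is then chosen when the model is created.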

Gretel has configuration templates that may be helpful as starting points for creating your model.

Create and Train a Model

Use the following CLI command to create and train a synthetic model.

gretel models create \
  --config [config_file_path] \
  --name [model_name] \
  --runner cloud \
  --in-data [data_source] > my-model.json

  • --in-data is optional if data_source is specified in the config, and can be used to override the value in the config.

  • --in-data is required if data_source: __tmp__ is used in the config.

  • --name is optional, and can be used to override the name specified in the config.
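The rule for when the input data flag is required can be checked before launching a job. The sketch below is one possible pre-flight helper, not part of the Gretel CLI: the function names config_needs_in_data and create_command are hypothetical, the check is a simple text match rather than a full YAML parse, and the command is only printed, not executed.

```shell
# Succeeds (exit 0) when the config uses the __tmp__ placeholder,
# which means the data file must be passed with --in-data.
config_needs_in_data() {
  grep -Eq '^[[:space:]]*data_source:[[:space:]]*"?__tmp__"?[[:space:]]*(#.*)?$' "$1"
}

# Print the create command for a config and an optional data file.
# Fails with an error if the config needs --in-data and none was given.
create_command() {
  config="$1"
  data="${2:-}"
  if config_needs_in_data "$config" && [ -z "$data" ]; then
    echo "error: $config uses data_source: __tmp__, so --in-data is required" >&2
    return 1
  fi
  if [ -n "$data" ]; then
    printf 'gretel models create --config %s --runner cloud --in-data %s\n' "$config" "$data"
  else
    printf 'gretel models create --config %s --runner cloud\n' "$config"
  fi
}
```

For example, `create_command model-config.yaml train.csv` prints a command that includes --in-data, while calling it without a data file on a __tmp__ config reports an error instead.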

During training, the following model artifacts are created:

| Filename | Description |
|---|---|
| data_preview.gz | A preview of your synthetic dataset in CSV format. |
| logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
| report.html.gz* | HTML report that offers deep insight into the quality of the synthetic model. |
| report-json.json.gz* | A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically. |

*Not all models produce a Synthetic Data Quality Report. See each model's page for more details.
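Because the artifacts are gzip-compressed, they can be inspected without unpacking them to disk. The helper below is a minimal sketch that assumes the artifacts have already been downloaded to a local directory; the function name peek_artifact is hypothetical.

```shell
# Print the first N lines (default 5) of a gzip-compressed text artifact,
# such as data_preview.gz or logs.json.gz, without writing a decompressed copy.
peek_artifact() {
  gzip -dc "$1" | head -n "${2:-5}"
}
```

For example, `peek_artifact data_preview.gz 10` shows the CSV header and the first nine synthetic rows. The same decompression step applies to report-json.json.gz, whose output can be piped into any JSON tool.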

Generate data from a model

Use the gretel models run command to generate data from a synthetic model.

gretel models run --model-id my-model.json \
  --runner cloud \
  --param num_records [num] \
  --in-data [prompts.csv] \
  --output .

  • --model-id accepts either a model UID or the JSON file that models create outputs.

  • There are many different --param options, depending on the model.

    • The num_records param is supported by all synthetic models and tells the model how many new rows to generate.

  • --in-data is optional and is used for conditional data generation when supported by the model.
