Synthetics

This section covers the model training and generation APIs shared across all Gretel models.

Synthetic Models

Gretel offers the following synthetic data models:

  1. Tabular Fine-Tuning - Gretel’s flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.

    1. Data types: Numeric, categorical, text, JSON, event-based

    2. Differential privacy: Optional

    3. Formerly known as: Navigator Fine Tuning

  2. Text Fine-Tuning - Gretel’s model for generating privacy-preserving synthetic text using your choice of top-performing open-source models.

    1. Data types: Text

    2. Differential privacy: Optional

    3. Formerly known as: GPT

  3. Tabular GAN - Gretel’s model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.

    1. Data types: Numeric, categorical

    2. Differential privacy: NOT supported

    3. Formerly known as: ACTGAN

  4. Tabular DP - Gretel’s model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.

    1. Data types: Numeric, categorical

    2. Differential privacy: Required; you cannot run without differential privacy
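The guidance in the list above can be read as a rough decision rule. The following is a minimal sketch of that rule, not part of any Gretel API: `suggest_model` and its arguments are hypothetical names introduced here, the returned strings are the model tags listed under Supported Features, and the 50-column threshold is taken from the descriptions above as an approximation.

```python
from typing import Set

# Model tags as listed in the feature comparison on this page.
TABULAR_FT = "navigator_ft"
TEXT_FT = "gpt_x"
TABULAR_GAN = "actgan"
TABULAR_DP = "tabular_dp"


def suggest_model(data_types: Set[str], num_columns: int,
                  needs_low_epsilon_dp: bool = False) -> str:
    """Hypothetical helper: map dataset traits to a model tag.

    This only encodes the prose guidance above and is a starting
    point, not an official selection rule.
    """
    simple_types = {"numeric", "categorical"}
    if data_types == {"text"}:
        # Text-only data: Text Fine-Tuning.
        return TEXT_FT
    if data_types <= simple_types:
        if needs_low_epsilon_dp:
            # Very low epsilon, basic analytics, runs on CPU.
            return TABULAR_DP
        if num_columns > 50:
            # High-dimensional numeric/categorical data.
            return TABULAR_GAN
    # Mixed types (text, JSON, event-based) or up to ~50 columns.
    return TABULAR_FT
```

For example, `suggest_model({"numeric", "categorical"}, 120)` returns `"actgan"`, while adding a text column to the same dataset falls back to `"navigator_ft"`.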

Supported Features

This section compares features of different generative data models supported by Gretel APIs.

✅ = Supported

✖️ = Not yet supported

| | Tabular Fine-Tuning | Text Fine-Tuning | Tabular GAN | Tabular DP | DGAN |
| --- | --- | --- | --- | --- | --- |
| Tag | navigator_ft | gpt_x | actgan | tabular_dp | timeseries_dgan |
| Type | Language Model | Language Model | Generative Adversarial Network | Statistical | Generative Adversarial Network |
| Model | Pre-trained Transformer | Pre-trained Transformer | GAN | Probabilistic Graphical Model | GAN |

Privacy filters

✖️

✖️

✖️

✖️

Privacy metrics

✖️

✖️

Differential privacy

✖️

✖️

✖️

✖️

Tabular

✖️

Time-series

✖️

✖️

✖️

Natural language

✖️

✖️

✖️

Conditional generation

✖️

✖️

✖️

Pre-trained

✖️

✖️

✖️

Gretel cloud

Hybrid cloud

Requires GPU

✖️

Model Configuration

All Gretel Synthetics models share a similar configuration file structure. Here is an example model-config.yaml:

schema_version: "1.0"
name: "my-model"

models:
  - [model_id]:
      data_source: __tmp__
      params:
        [param_name]: [param_value]
      
  • [model_id] is replaced with the type of model you wish to train (e.g. navigator_ft, gpt_x, actgan, tabular_dp).

  • data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.

    • Supported storage locations include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, and the local filesystem.

      • Some models have specific data source format requirements.

    • data_source: __tmp__ can be used when the source file is specified elsewhere using:

      • --in-data parameter via CLI,

      • parameter via SDK,

      • dataset button via Console.

  • The params object contains key-value pairs for the parameters used to train a synthetic data model on the data_source.

    • Parameters are specific to each model type. See the full list of supported parameters on each of the model pages.
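The structural rules above can be checked before a config is submitted. The sketch below is an illustration only: `check_config` and `KNOWN_MODEL_IDS` are hypothetical names, not part of the Gretel SDK, and the checks cover only what this page states (a models list, one model tag per entry, a data_source value, and a params mapping). It assumes the YAML file has already been parsed into a Python dict.

```python
from typing import Any, Dict, List

# Tags listed on this page; an assumption, not an exhaustive registry.
KNOWN_MODEL_IDS = {"navigator_ft", "gpt_x", "actgan", "tabular_dp", "timeseries_dgan"}


def check_config(config: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of structural problems (empty if none)."""
    models = config.get("models")
    if not isinstance(models, list) or not models:
        return ["'models' must be a non-empty list"]

    problems: List[str] = []
    for entry in models:
        # Each entry is a single-key mapping: {model_id: {...}}.
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append("each 'models' entry must have exactly one model id key")
            continue
        model_id, body = next(iter(entry.items()))
        if model_id not in KNOWN_MODEL_IDS:
            problems.append(f"unknown model id: {model_id}")
        if not isinstance(body, dict):
            problems.append(f"{model_id}: settings must be a mapping")
            continue
        if not body.get("data_source"):
            problems.append(f"{model_id}: 'data_source' is missing (use __tmp__ to supply it later)")
        params = body.get("params")
        if params is not None and not isinstance(params, dict):
            problems.append(f"{model_id}: 'params' must be a mapping of key-value pairs")
    return problems


# Mirrors the example config above, with empty params as a placeholder.
example = {
    "schema_version": "1.0",
    "name": "my-model",
    "models": [{"tabular_dp": {"data_source": "__tmp__", "params": {}}}],
}
```

Calling `check_config(example)` returns an empty list. Parameter names and values are deliberately left out, since they differ per model type.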

Create and Train a Model

Use the following CLI command to create and train a synthetic model.

gretel models create \
  --config [config_file_path] \
  --name [model_name] \
  --runner cloud \
  --in-data [data_source] > my-model.json
  • --in-data is optional if data_source is specified in the config; when given, it overrides the value in the config.

  • --in-data is required if data_source: __tmp__ is used in the config.

  • --name is optional and overrides the name specified in the config.
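When the command is assembled from a script, the flag rules above can be enforced up front. This is a minimal sketch under stated assumptions: `build_create_command` is a hypothetical helper, it uses only the flags shown in the command above, and the caller passes in the data_source value read from the config. It builds the argument list but does not run anything.

```python
from typing import List, Optional


def build_create_command(config_path: str,
                         config_data_source: str,
                         name: Optional[str] = None,
                         in_data: Optional[str] = None,
                         runner: str = "cloud") -> List[str]:
    """Hypothetical helper: build the `gretel models create` argument list."""
    # data_source: __tmp__ means the data must be supplied on the command line.
    if config_data_source == "__tmp__" and in_data is None:
        raise ValueError("--in-data is required when data_source is __tmp__")

    cmd = ["gretel", "models", "create", "--config", config_path, "--runner", runner]
    if name is not None:
        cmd += ["--name", name]        # overrides the name in the config
    if in_data is not None:
        cmd += ["--in-data", in_data]  # overrides data_source in the config
    return cmd


# To run it and capture the model JSON, as the shell example does with `>`:
#   import subprocess
#   with open("my-model.json", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.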

During training, the following model artifacts are created:

| Filename | Description |
| --- | --- |
| data_preview.gz | A preview of your synthetic dataset in CSV format. |
| logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
| report.html.gz | HTML report that offers deep insight into the quality of the synthetic model. |
| report-json.json.gz | A JSON version of the synthetic quality report that is useful for validating synthetic data model quality programmatically. |
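Once the artifacts have been downloaded, they can be inspected with the Python standard library. The sketch below assumes each .gz file is a plain gzip stream containing a single UTF-8 file (JSON for the report, CSV with a header row for the preview). The function names are hypothetical, and the report's field names are not documented on this page, so the code does not assume any.

```python
import csv
import gzip
import json
from typing import Any, Dict, List


def load_report(path: str) -> Any:
    """Read a gzip-compressed JSON file such as report-json.json.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def load_preview(path: str, limit: int = 5) -> List[Dict[str, str]]:
    """Read up to `limit` rows from a gzip-compressed CSV such as data_preview.gz."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        reader = csv.DictReader(fh)
        # zip with range() stops after `limit` rows without reading the rest.
        return [row for _, row in zip(range(limit), reader)]


# Example usage, assuming the files are in the current directory:
#   report = load_report("report-json.json.gz")
#   print(sorted(report))            # list the top-level keys
#   for row in load_preview("data_preview.gz"):
#       print(row)
```

Listing the top-level keys first is a safe way to discover the report layout before writing any programmatic quality checks against it.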

Generate Data from a Model

Use the gretel models run command to generate data from a synthetic model.

gretel models run --model-id my-model.json \
  --runner cloud \
  --param num_records [num] \
  --in-data [prompts.csv] \
  --output .
  • --model-id accepts either a model UID or the JSON file that gretel models create outputs.

  • The available --param options depend on the model.

    • num_records is supported by all synthetic models and sets how many new rows to generate.

  • --in-data is optional and is used for conditional data generation when the model supports it.
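The generation command can be assembled the same way, with one `--param` pair per parameter. This is a sketch, not an official wrapper: `build_run_command` is a hypothetical helper that uses only the flags shown in the command above, and num_records is the only parameter name this page confirms.

```python
from typing import Dict, List, Optional, Union


def build_run_command(model_id: str,
                      params: Dict[str, Union[str, int, float]],
                      in_data: Optional[str] = None,
                      output: str = ".",
                      runner: str = "cloud") -> List[str]:
    """Hypothetical helper: build the `gretel models run` argument list."""
    # model_id may be a model UID or the path to the JSON from `models create`.
    cmd = ["gretel", "models", "run", "--model-id", model_id, "--runner", runner]
    for key, value in params.items():
        # Each parameter becomes its own `--param key value` triple.
        cmd += ["--param", key, str(value)]
    if in_data is not None:
        # Optional prompts for conditional generation, when the model supports it.
        cmd += ["--in-data", in_data]
    cmd += ["--output", output]
    return cmd
```

For example, `build_run_command("my-model.json", {"num_records": 500})` produces the arguments for generating 500 rows into the current directory.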
