Gretel-CTGAN
Model type: adversarial model that supports tabular data, structured numerical data, and high-column-count data.
The Gretel CTGAN model API provides access to a generative data model that works with any language or character set. Gretel CTGAN supports advanced features such as conditional data generation, and works well with datasets that are primarily numeric or have high column counts.

Model creation

This model can be selected using the ctgan model tag. Below is an example configuration that may be used to create a Gretel CTGAN model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example to train a model.
The configuration below contains the model-specific options for training a Gretel CTGAN model, with commonly used values shown.
schema_version: "1.0"
models:
  - ctgan:
      data_source: __tmp__
      params:
        epochs: 100
        generator_dim: [256, 256]
        discriminator_dim: [256, 256]
        generator_lr: 2e-4
        discriminator_lr: .00001
        batch_size: 500
        verbose: true
      # Gretel validation pre-/post-processing
      validators:
        use_numeric_iqr: true
        in_set_count: 50
      # Gretel privacy filtering
      privacy_filters:
        outliers: null
        similarity: high
  • embedding_dim (int, required, defaults to 128) - Size of the random sample passed to the Generator (z vector).
  • generator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the Residuals. Adding more numbers to this list will create more Residuals, one for each number. This is equivalent to increasing the depth of the Generator.
  • discriminator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the discriminator linear layers. A new Linear layer will be created for each number added to this list.
  • generator_lr (float, required, defaults to 2e-4) - Learning rate for the Generator.
  • generator_decay (float, required, defaults to 1e-6) - Weight decay for the Generator's Adam optimizer.
  • discriminator_lr (float, required, defaults to 2e-4) - Learning rate for the discriminator.
  • discriminator_decay (float, required, defaults to 1e-6) - Weight decay for the discriminator's Adam optimizer.
  • batch_size (int, required, defaults to 500) - Number of examples the model sees at each training step. Importantly, this must be a multiple of 10, as specified by the CTGAN training scheme.
  • epochs (int, required, defaults to 300) - Number of training iterations the model will undergo during training. A larger number will result in longer training times, but potentially higher quality synthetic data.
  • discriminator_steps (int, required, defaults to 1) - The discriminator and Generator may take different numbers of steps per batch. The original WGAN paper used 5 discriminator steps for each Generator step; we default to 1, following the original CTGAN implementation.
  • log_frequency (bool, required, defaults to True) - Whether to use the log frequency of categorical counts during conditional sampling. In some cases, setting this to False improves performance.
  • verbose (bool, required, defaults to False) - Whether to print training progress during training.
  • pac (int, required, defaults to 10) - Number of samples to group together when applying the discriminator.
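The parameters above that are not shown in the earlier example configuration can be set the same way. A sketch of a config that exercises them (the values are illustrative, not tuned recommendations):

```yaml
schema_version: "1.0"
models:
  - ctgan:
      data_source: __tmp__
      params:
        epochs: 300                         # default; longer training, potentially higher quality
        embedding_dim: 128                  # size of the z vector passed to the Generator
        generator_dim: [256, 256, 256]      # three Residuals = a deeper Generator
        discriminator_dim: [256, 256, 256]  # three Linear layers in the discriminator
        discriminator_steps: 1              # CTGAN default; the original WGAN used 5 per Generator step
        batch_size: 500                     # must be a multiple of 10
        pac: 10                             # samples grouped per discriminator application
        log_frequency: false                # switching to false can improve performance in some cases
```

Note that batch_size must remain a multiple of 10 when adjusting it alongside pac.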

Differential Privacy

Differential privacy is currently not supported for the Gretel CTGAN model.

Smart seeding

To use conditional data generation (smart seeding), provide an input CSV containing the columns and values you want to seed with during data generation; no changes are needed at model creation time. Column names in the input file must be a subset of the column names in the training data used for model creation.
Example CLI command to seed the data generation from a trained CTGAN model:
gretel models run \
--project <project-name> \
--model-id <model-id> \
--runner cloud \
--in-data seed.csv \
--output .
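For illustration, a seed file for a hypothetical training dataset containing state and plan_type columns (both names are made up for this example) could be created like this:

```shell
# Hypothetical seed file: the column names must be a subset of the
# columns in the training data; the values are what to condition on.
cat > seed.csv <<'EOF'
state,plan_type
CA,premium
CA,basic
NY,premium
EOF
```

Each row of the seed file asks the model for a synthetic record conditioned on those values.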

Data generation

Example CLI to generate 1000 additional records from a trained CTGAN model:
gretel models run \
--project <project-name> \
--model-id <model-id> \
--runner cloud \
--param num_records 1000 \
--output .
Also see the reference command line example for data generation.

Automated validators

Validators are not currently supported in CTGAN.

Model information

The underlying model is a Conditional Tabular Generative Adversarial Network (CTGAN): a Generator and a Discriminator are trained adversarially. The model is initialized from random weights and trained on the customer-provided dataset. More details about the underlying model can be found in the original paper: https://arxiv.org/abs/1907.00503
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
@inproceedings{xu2019modeling,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.
CPU: minimum 4 cores, 32 GB RAM.
GPU (required): minimum NVIDIA T4 or similar CUDA-compatible GPU with 16 GB+ RAM.
In general, this model trains faster in wall-clock time than comparable LSTMs, but often performs worse on text or high cardinality categorical variables.

Limitations and Biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture, and likely repeat, any biases that exist in the training set. We recommend having a human review the dataset used to train models before using them in production.
CTGAN technical limitations:
  • Multiple high-cardinality categorical fields on a large dataset can lead to out-of-memory errors. Consider using the Gretel-LSTM model when datasets contain highly unique or free-text fields.
  • Conditional generation may not produce a record for every seeded row; for example, a seed file with 100 rows might return only 90 records with smart seeding.
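Because of this, it is worth comparing row counts after a seeded run. A minimal illustration with stand-in files (toy_seed.csv and toy_output.csv are hypothetical; your real seed file and downloaded output will have different names):

```shell
# Illustrative only: check how many of the seeded rows actually came back.
# These two files stand in for a real seed file and the model's output.
printf 'state\nCA\nCA\nNY\nNY\n' > toy_seed.csv
printf 'state\nCA\nNY\nNY\n' > toy_output.csv

# Subtract 1 from each line count to exclude the CSV header row.
seeded=$(( $(wc -l < toy_seed.csv) - 1 ))
returned=$(( $(wc -l < toy_output.csv) - 1 ))
echo "seeded=$seeded returned=$returned"   # prints "seeded=4 returned=3"
```

If the returned count is lower than expected, re-running generation with the unfulfilled seed rows is one way to top up the output.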