
Gretel ACTGAN

Model type: Adversarial Model that supports tabular data, structured numerical data, and high column count data.
The Gretel ACTGAN model API provides access to a generative data model for tabular data. The Gretel ACTGAN supports advanced features such as conditional data generation. ACTGAN works well with datasets featuring primarily numeric data, high column counts, and highly unique categorical fields.

Model creation

This model can be selected using the actgan model tag. Below is an example configuration that may be used to create a Gretel ACTGAN model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example to train a model.
The configuration below contains additional options for training a Gretel ACTGAN model, with the default options displayed.
schema_version: "1.0"
models:
  - actgan:
      data_source: __tmp__
      params:
        epochs: 100
        generator_dim: [256, 256]
        discriminator_dim: [256, 256]
        generator_lr: 2e-4
        discriminator_lr: .00001
        batch_size: 500
        verbose: true
        binary_encoder_cutoff: 150
      # Gretel privacy filtering
      privacy_filters:
        outliers: null
        similarity: high
  • data_source (str, required) - Either __tmp__ or a path to a valid and accessible file in CSV, JSON, or JSONL format.
  • embedding_dim (int, required, defaults to 128) - Size of the random sample passed to the Generator (z vector).
  • generator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the Residuals. Adding more numbers to this list will create more Residuals, one for each number. This is equivalent to increasing the depth of the Generator.
  • discriminator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the discriminator linear layers. A new Linear layer will be created for each number added to this list.
  • generator_lr (float, required, defaults to 2e-4) - Learning rate for the Generator.
  • generator_decay (float, required, defaults to 1e-6) - Weight decay for the Generator's Adam optimizer.
  • discriminator_lr (float, required, defaults to 2e-4) - Learning rate for the discriminator.
  • discriminator_decay (float, required, defaults to 1e-6) - Weight decay for the discriminator's Adam optimizer.
  • batch_size (int, required, defaults to 500) - Determines the number of examples the model sees at each step. Importantly, this must be a multiple of 10, as specified by the ACTGAN training scheme.
  • epochs (int, required, defaults to 300) - Number of training iterations the model will undergo. A larger number will result in longer training times, but potentially higher quality synthetic data.
  • binary_encoder_cutoff (int, required, defaults to 150) - Number of unique categorical values in a column before encoding switches from One Hot to Binary Encoding for the specific column. Decrease this number if you have Out of Memory issues. A lower cutoff also results in faster training times, with a potential loss in performance in a few select cases.
  • binary_encoder_nan_handler (str, optional, defaults to mode) - Method for handling invalid generated binary encodings. When generating data, the model may output binary encodings that do not map to a real category. This parameter specifies what value to use in this case. Possible choices are: "mode". Note that this will not replace all NaNs, and the generated data can have NaNs if the training data has NaNs.
  • discriminator_steps (int, required, defaults to 1) - Number of discriminator steps taken for each Generator step. The original WGAN paper took 5 discriminator steps for each Generator step. Here the default is 1, which follows the original ACTGAN implementation.
  • log_frequency (bool, required, defaults to True) - Determines the use of log frequency of categorical counts during conditional sampling. In some cases, switching to False improves performance.
  • verbose (bool, required, defaults to False) - Whether to print training progress during training.
  • pac (int, required, defaults to 10) - Number of samples to group together when applying the discriminator.
  • data_upsample_limit (int, optional, defaults to 100) - If the training data has fewer than this many records, the data will be automatically upsampled to the specified limit. Setting this to 0 will disable upsampling.
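A few of the constraints in the list above can be checked locally before a config is submitted. The sketch below is illustrative only: `check_params`, `encoded_width`, and `unused_binary_codes` are hypothetical helpers, not part of any Gretel SDK. It assumes that the multiple-of-10 rule for batch_size is tied to the default pac of 10, that a column switches to binary encoding once its unique-value count exceeds binary_encoder_cutoff, and that binary encoding uses the minimum number of bits needed to index the categories.

```python
from __future__ import annotations


def check_params(params: dict) -> list[str]:
    """Return human-readable problems found in an ACTGAN `params` dict."""
    problems = []
    batch_size = params.get("batch_size", 500)
    pac = params.get("pac", 10)

    # Documented rule: batch_size must be a multiple of 10.
    if batch_size % 10 != 0:
        problems.append(f"batch_size={batch_size} is not a multiple of 10")

    # Assumption: the discriminator groups `pac` samples together, so the
    # batch must also split evenly into groups of size `pac`.
    if not isinstance(pac, int) or pac <= 0:
        problems.append(f"pac={pac!r} must be a positive int")
    elif batch_size % pac != 0:
        problems.append(f"batch_size={batch_size} is not a multiple of pac={pac}")

    for key in ("generator_dim", "discriminator_dim"):
        dims = params.get(key, [256, 256])
        if not dims or any(not isinstance(d, int) or d <= 0 for d in dims):
            problems.append(f"{key} must be a non-empty list of positive ints")
    return problems


def encoded_width(n_unique: int, cutoff: int = 150) -> tuple[str, int]:
    """Estimate the encoding and number of encoded columns for one category column."""
    # Assumption: the switch happens when n_unique exceeds the cutoff.
    if n_unique <= cutoff:
        return ("one_hot", n_unique)
    # Assumption: minimal-width binary code for indices 0..n_unique-1.
    bits = max(1, (n_unique - 1).bit_length())
    return ("binary", bits)


def unused_binary_codes(n_unique: int) -> int:
    """Number of bit patterns that do not correspond to any real category."""
    bits = max(1, (n_unique - 1).bit_length())
    return 2 ** bits - n_unique


if __name__ == "__main__":
    print(check_params({"batch_size": 505}))
    print(encoded_width(40), encoded_width(1000))
    print(unused_binary_codes(1000))
```

Under these assumptions, a column with 1000 unique values needs 10 binary columns instead of 1000 one-hot columns, which is where the memory saving comes from. It also leaves 24 bit patterns that map to no category, which is the situation binary_encoder_nan_handler is meant to cover.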

Differential Privacy

Differential privacy is currently not supported for the Gretel ACTGAN model.

Smart seeding

To use conditional data generation (smart seeding), provide an input CSV file containing the columns and values you want to seed with during data generation. (No changes are needed at model creation time.) Column names in the input file must be a subset of the column names in the training data used for model creation.
Example CLI command to seed the data generation from a trained ACTGAN model:
gretel models run \
    --project <project-name> \
    --model-id <model-id> \
    --runner cloud \
    --in-data seed.csv \
    --output .
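The seed file passed to --in-data can be built with any tool that writes CSV. As a minimal sketch using only the Python standard library, the hypothetical helper below checks the column-subset rule before producing the CSV text. The column names (`state`, `age`, `income`) are made up for illustration and do not come from any real dataset.

```python
import csv
import io


def seed_csv_text(seed_rows, training_columns):
    """Return CSV text for the seed rows after checking the column-subset rule.

    seed_rows: list of dicts, one per record to seed.
    training_columns: column names of the data the model was trained on.
    """
    if not seed_rows:
        raise ValueError("seed_rows is empty")

    seed_columns = list(seed_rows[0].keys())
    unknown = [c for c in seed_columns if c not in training_columns]
    if unknown:
        raise ValueError(f"seed columns not in training data: {unknown}")

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=seed_columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(seed_rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical training schema and seed values.
    text = seed_csv_text(
        [{"state": "CA"}, {"state": "NY"}],
        training_columns=["state", "age", "income"],
    )
    with open("seed.csv", "w", newline="") as f:
        f.write(text)
```

Validating the column names locally catches a mistyped seed column before a generation job is started.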

Data generation

Example CLI to generate 1000 additional records from a trained ACTGAN model:
gretel models run \
    --project <project-name> \
    --model-id <model-id> \
    --runner cloud \
    --param num_records 1000 \
    --output .
Also see the reference command line example for data generation.

Automated validators

Validators are not currently supported in ACTGAN.

Model information

The underlying model used is an Anyway Conditional Tabular Generative Adversarial Network (ACTGAN). It consists of Generator and Discriminator models that are trained adversarially. The model is initialized from random weights and trained on the customer-provided dataset. This model is an extension of the popular CTGAN model, with algorithmic extensions that improve speed, accuracy, and memory usage. More details about the original CTGAN model can be found in the authors' excellent paper: https://arxiv.org/abs/1907.00503
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
@inproceedings{xu2019modeling,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required). Minimum Nvidia T4 or similar CUDA compliant GPU with 16GB+ RAM is required to run basic language models.
In general, this model trains faster in wall-clock time than comparable LSTMs, but often performs worse on text or high cardinality categorical variables.

Limitations and Biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using them in production.
ACTGAN technical limitations:
  • Conditional generation may not produce a record for every seeded row. For example, a seed file with 100 records might return only 90 records when used with smart seeding.
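Because some seeded rows may come back without a record, it can help to compare the seed file against the generated output. The sketch below is one possible approach, not a Gretel feature: `missing_seeds` is a hypothetical helper, and it assumes that generated records carry the seeded values unchanged and in the same type (for example, strings read from CSV).

```python
from collections import Counter


def missing_seeds(seed_rows, generated_rows, seed_columns):
    """Count seeded value combinations that did not get a generated record.

    Returns a Counter mapping each seed value tuple to the number of
    requested records that are missing from the generated output.
    """
    def key(row):
        return tuple(row[c] for c in seed_columns)

    requested = Counter(key(r) for r in seed_rows)
    produced = Counter(key(r) for r in generated_rows)
    # Counter subtraction keeps only the positive differences.
    return requested - produced


if __name__ == "__main__":
    # Hypothetical seed and output rows for illustration.
    seeds = [{"state": "CA"}] * 3 + [{"state": "NY"}] * 2
    generated = [
        {"state": "CA", "age": "30"},
        {"state": "CA", "age": "41"},
        {"state": "NY", "age": "25"},
        {"state": "NY", "age": "52"},
    ]
    print(missing_seeds(seeds, generated, ["state"]))
```

The missing combinations could then be written to a new seed file and submitted again, although there is no guarantee that a second run fills every gap.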