Gretel ACTGAN

Adversarial model that supports tabular data, structured numerical data, and high column count data.

The Gretel ACTGAN model API provides access to a generative data model for tabular data. Gretel ACTGAN supports advanced features such as conditional data generation. It works well with datasets that feature primarily numeric data, high column counts, and highly unique categorical fields.

Model creation

This model can be selected using the actgan model tag. Below is an example configuration that may be used to create a Gretel ACTGAN model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.

The configuration below contains additional options for training a Gretel ACTGAN model, with the default options displayed.

# Default configuration for ACTGAN.

schema_version: "1.0"
name: "tabular-actgan"
models:
  - actgan:
        data_source: __tmp__
        params:
            epochs: auto
            generator_dim: [1024, 1024]
            discriminator_dim: [1024, 1024]
            generator_lr: 0.0001
            discriminator_lr: 0.00033
            batch_size: auto
        generate:
            num_records: 5000
        privacy_filters:
            outliers: null
            similarity: auto
  • data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV, JSON, or JSONL format.

  • embedding_dim (int, required, defaults to 128) - Size of the random sample passed to the Generator (z vector).

  • generator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the Residuals. Adding more numbers to this list will create more Residuals, one for each number. This is equivalent to increasing the depth of the Generator.

  • discriminator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the discriminator linear layers. A new Linear layer will be created for each number added to this list.

  • generator_lr (float, required, defaults to 2e-4) - Learning rate for the Generator.

  • generator_decay (float, required, defaults to 1e-6) - Weight decay for the Generator's Adam optimizer.

  • discriminator_lr (float, required, defaults to 2e-4) - Learning rate for the discriminator.

  • discriminator_decay (float, required, defaults to 1e-6) - Weight decay for the discriminator's Adam optimizer.

  • batch_size (int, required, defaults to 500) - Determines the number of examples the model sees at each step. Importantly, this must be a multiple of 10 as specified by the ACTGAN training scheme.

  • epochs (int, required, defaults to 300) - Number of training iterations the model will undergo during training. A larger number will result in longer training times, but potentially higher quality synthetic data.

  • binary_encoder_cutoff (int, required, defaults to 150) - Number of unique categorical values in a column before encoding switches from One Hot to Binary Encoding for the specific column. Decrease this number if you have Out of Memory issues. Will result in faster training times with a potential loss in performance in a few select cases.

  • binary_encoder_nan_handler (str, optional, defaults to mode) - Method for handling invalid generated binary encodings. When generating data, the model may output binary encodings that do not map to a real category. This parameter specifies what value to use in this case. Possible choices are: "mode". Note that this will not replace all NaN values; the generated data can contain NaN values if the training data does.

  • cbn_sample_size (int, optional, defaults to 250,000) - If set, clustering for continuous-valued columns is performed on a sample of the data records. This option can significantly reduce training time on large datasets with only negligible impact on performance. When setting this option to 0 or to a value larger than the data size, no subsetting will be performed.

  • discriminator_steps (int, required, defaults to 1) - Number of discriminator steps taken for each Generator step. The discriminator and Generator can take a different number of steps per batch. The original WGAN paper took 5 discriminator steps for each Generator step. In this case we default to 1, which follows the original CTGAN implementation.

  • log_frequency (bool, required, defaults to True) - Determines the use of log frequency of categorical counts during conditional sampling. In some cases, switching to False improves performance.

  • verbose (bool, required, defaults to False) - Whether to print training progress during training.

  • pac (int, required, defaults to 10) - Number of samples to group together when applying the discriminator. Must evenly divide batch_size.

  • data_upsample_limit (int, optional, defaults to 100) - If the training data has fewer than this many records, the data will be automatically upsampled to the specified limit. Setting this to 0 will disable upsampling.

  • auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column will be analyzed to determine if it is made up of DateTime objects. For each column that is detected, ACTGAN will automatically convert DateTimes to Unix Timestamps (epoch seconds) for model training and then after sampling convert them back into a DateTime string.

  • conditional_vector_type (str, required, defaults to single_discrete) - Controls conditional vector usage in model architecture which influences the effectiveness and flexibility of the native conditional generation. Possible choices are: "single_discrete", "anyway". single_discrete is the original CTGAN architecture. anyway will improve efficiency of conditional generation by guiding the model towards the requested seed values.

  • conditional_select_mean_columns (float, optional) - Target number of columns to select for conditioning during training. Only used when conditional_vector_type=anyway. Use this if the typical number of seed columns required for conditional generation is known. The model will be better at conditional generation when using approximately this many seed columns. If set, conditional_select_column_prob must be empty.

  • conditional_select_column_prob (float, optional) - Probability of selecting a column for conditioning during training. Only used when conditional_vector_type=anyway. If set, conditional_select_mean_columns must be empty.

  • reconstruction_loss_coef (float, required, defaults to 1.0) - Multiplier on reconstruction loss. Higher values should provide more efficient conditional generation. Only used when conditional_vector_type=anyway.

  • force_conditioning (bool or auto, required, defaults to auto) - When True, skips rejection sampling and directly sets the requested seed values in generated data. Conditional generation will be faster when enabled, but the quality of generated data may be reduced. If True with single_discrete, all correlation between seed columns and generated columns is lost! auto chooses a preferred value for force_conditioning based on the other configured parameters; the logs will show which value was chosen.
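Several of the parameters above constrain one another: batch_size must be a multiple of 10, pac must evenly divide batch_size, and the two conditional_select_* options are mutually exclusive and only apply when conditional_vector_type=anyway. As a minimal sketch, assuming the params section has already been loaded into a plain Python dict, these rules could be checked before submitting a config. The helper name check_actgan_params is hypothetical and is not part of any Gretel client; the fallback values are the defaults listed above.

```python
def check_actgan_params(params):
    """Return a list of problems found in a params dict (hypothetical helper)."""
    problems = []
    batch_size = params.get("batch_size", 500)
    pac = params.get("pac", 10)

    # "auto" leaves the choice to the service, so only explicit integers are checked.
    if isinstance(batch_size, int):
        if batch_size % 10 != 0:
            problems.append("batch_size must be a multiple of 10")
        if isinstance(pac, int) and pac > 0 and batch_size % pac != 0:
            problems.append("pac must evenly divide batch_size")

    mean_cols = params.get("conditional_select_mean_columns")
    col_prob = params.get("conditional_select_column_prob")
    if mean_cols is not None and col_prob is not None:
        problems.append(
            "set only one of conditional_select_mean_columns "
            "and conditional_select_column_prob"
        )

    # Both options are documented as only used with the "anyway" vector type.
    vector_type = params.get("conditional_vector_type", "single_discrete")
    if vector_type != "anyway" and (mean_cols is not None or col_prob is not None):
        problems.append(
            "conditional_select_* options are only used when "
            "conditional_vector_type=anyway"
        )
    return problems


# Example: 505 is not a multiple of 10, and pac=10 does not divide it.
for problem in check_actgan_params({"batch_size": 505, "pac": 10}):
    print(problem)
```

An empty list means none of these documented constraints were violated; it does not guarantee that the service will accept the configuration.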

Differential privacy is currently not supported for the Gretel ACTGAN model.

Smart seeding

To use conditional data generation (smart seeding), provide an input CSV file containing the columns and values you want to seed with during data generation. No changes are needed at model creation time. Column names in the input file should be a subset of the column names in the training data used for model creation. When conditional_vector_type=anyway, seed columns of all data types (string, int, float) are supported and conditional generation is more efficient, so that setting is preferred when conditional generation is a priority. When conditional_vector_type=single_discrete, conditional generation is available only with seed columns of string data type.
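As an illustration of the seed file described above, the following sketch builds the CSV text for a seed file and first checks that every seed column also appears in the training data. The function name build_seed_csv, the column names, and the seed values are all hypothetical; only the Python standard library is used.

```python
import csv
import io


def build_seed_csv(training_columns, seed_rows):
    """Return CSV text for a seed file after checking the column names."""
    seed_columns = list(seed_rows[0].keys())
    unknown = [c for c in seed_columns if c not in training_columns]
    if unknown:
        raise ValueError(f"Seed columns not in training data: {unknown}")

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=seed_columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(seed_rows)
    return buffer.getvalue()


# Hypothetical training columns and seed values: one output record is
# requested per seed row.
training_columns = ["state", "age", "income"]
seed_rows = [{"state": "CA"}, {"state": "NY"}, {"state": "CA"}]
seed_text = build_seed_csv(training_columns, seed_rows)
print(seed_text)
```

Writing seed_text to a file named seed.csv would produce an input of the kind passed with --in-data in the command below.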

Example CLI command to seed the data generation from a trained ACTGAN model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --in-data seed.csv \
  --output .

Data generation

Example CLI command to generate 1000 additional records from a trained ACTGAN model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param num_records 1000 \
  --output .

Automated validators

Validators are not currently supported in ACTGAN.

Model information

The underlying model is an Anyway Conditional Tabular Generative Adversarial Network (ACTGAN). Its Generator and Discriminator models are trained adversarially. The model is initialized from random weights and trained on the customer-provided dataset. This model is an extension of the popular CTGAN model. These algorithmic extensions improve speed, accuracy, memory usage, and conditional generation.

More details about the original underlying model can be found in the CTGAN paper: https://arxiv.org/abs/1907.00503

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.

@inproceedings{xu2019modeling,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.

CPU: Minimum 4 cores, 32GB RAM.

GPU (required): Minimum NVIDIA T4 or similar CUDA-compliant GPU with 16GB+ RAM is required to run this model.

In general, this model trains faster in wall-clock time than comparable LSTMs, but often performs worse on text or high-cardinality categorical variables.

Limitations and biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in that dataset. We recommend having a human review the dataset used to train models before using them in production.

ACTGAN technical limitations:

  • When force_conditioning=False (the default with conditional_vector_type=single_discrete), conditional generation may not produce a record for every seeded row. For example, a seed file with 100 records might return only 90 generated records. Use conditional_vector_type=anyway to increase the likelihood of generating all requested seed rows. Setting force_conditioning=True guarantees that a row is generated for every seed row, but the data quality may be lower.
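The trade-off above can be illustrated with a toy example. This is not Gretel's implementation: toy_generator is a made-up stand-in for a trained model, and max_tries is an assumed retry budget. The sketch only shows why rejection sampling can return fewer records than seed rows, while directly overwriting the seed columns always returns one record per seed but breaks the link between the seed value and the other columns.

```python
import random


def toy_generator(rng):
    """Made-up stand-in for a trained model: returns one random record."""
    state = rng.choice(["CA", "CA", "CA", "NY"])  # "NY" is the rarer value
    base = 90 if state == "CA" else 70  # income is correlated with state
    return {"state": state, "income": base + rng.randint(-10, 10)}


def conditional_generate(seeds, rng, force_conditioning=False, max_tries=3):
    """Toy conditional generation over a list of seed dicts."""
    records = []
    for seed in seeds:
        if force_conditioning:
            # Overwrite the seed columns directly: always succeeds, but the
            # other columns were sampled without regard to the seed value.
            record = toy_generator(rng)
            record.update(seed)
            records.append(record)
            continue
        # Rejection sampling: keep a candidate only if it matches the seed.
        for _ in range(max_tries):
            candidate = toy_generator(rng)
            if all(candidate.get(k) == v for k, v in seed.items()):
                records.append(candidate)
                break
    return records


seeds = [{"state": "NY"} for _ in range(5)]
rejected = conditional_generate(seeds, random.Random(0))
forced = conditional_generate(seeds, random.Random(0), force_conditioning=True)
print(len(rejected), "of", len(seeds), "seed rows filled by rejection sampling")
print(len(forced), "of", len(seeds), "seed rows filled with forced conditioning")
```

In the forced case some "NY" records carry an income drawn for "CA", which mirrors the loss of correlation between seed columns and generated columns noted for force_conditioning=True with single_discrete.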
