Gretel Tabular GAN
Adversarial model that supports tabular data, structured numerical data, and high column count data.
The Gretel Tabular GAN model API provides access to a generative data model for tabular data. Gretel Tabular GAN supports advanced features such as conditional data generation. Tabular GAN works well with datasets featuring primarily numeric data, high column counts, and highly unique categorical fields.
Step configuration
The config below shows the full default configuration for Tabular GAN. Descriptions of all parameters are listed below.
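The YAML sketch below reconstructs that default config from the parameter defaults documented on this page. The wrapper keys (schema_version, models, actgan, data_source) are assumptions about Gretel's standard model config layout and should be checked against an official blueprint before use.

```yaml
# Sketch of a default Tabular GAN config, assembled from the documented defaults.
# Wrapper keys (schema_version, models, actgan, data_source) are assumed, not confirmed.
schema_version: "1.0"
models:
  - actgan:
      data_source: __tmp__
      params:
        embedding_dim: 128
        generator_dim: [256, 256]
        discriminator_dim: [256, 256]
        generator_lr: 0.0002
        generator_decay: 0.000001
        discriminator_lr: 0.0002
        discriminator_decay: 0.000001
        batch_size: 500            # must be a multiple of 10 and divisible by pac
        epochs: 300
        binary_encoder_cutoff: 150
        binary_encoder_nan_handler: mode
        cbn_sample_size: 250000
        discriminator_steps: 1
        log_frequency: true
        verbose: false
        pac: 10
        data_upsample_limit: 100
        auto_transform_datetime: false
        conditional_vector_type: single_discrete
        reconstruction_loss_coef: 1.0
        force_conditioning: auto
      generate:
        num_records: 5000
```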
Train parameters
params - Parameters that control the model training process:

embedding_dim (int, required, defaults to 128) - Size of the random sample passed to the Generator (z vector).

generator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the Residuals. Adding more numbers to this list creates more Residuals, one for each number, which is equivalent to increasing the depth of the Generator.

discriminator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the discriminator linear layers. A new Linear layer is created for each number added to this list.

generator_lr (float, required, defaults to 2e-4) - Learning rate for the Generator.

generator_decay (float, required, defaults to 1e-6) - Weight decay for the Generator's Adam optimizer.

discriminator_lr (float, required, defaults to 2e-4) - Learning rate for the discriminator.

discriminator_decay (float, required, defaults to 1e-6) - Weight decay for the discriminator's Adam optimizer.

batch_size (int, required, defaults to 500) - Number of examples the model sees each step. This must be a multiple of 10, as required by the Tabular GAN training scheme.

epochs (int, required, defaults to 300) - Number of training iterations the model undergoes. A larger number results in longer training times but potentially higher quality synthetic data.

binary_encoder_cutoff (int, required, defaults to 150) - Number of unique categorical values in a column before encoding switches from one-hot to binary encoding for that column. Decrease this number if you run into out-of-memory issues; doing so results in faster training times, with a potential loss in performance in a few select cases.

binary_encoder_nan_handler (str, optional, defaults to "mode") - Method for handling invalid generated binary encodings. When generating data, the model may output binary encodings that do not map to a real category; this parameter specifies what value to use in that case. Possible choices are: "mode". Note that this does not replace all NaNs: the generated data can still contain NaNs if the training data has NaNs.

cbn_sample_size (int, optional, defaults to 250000) - If set, clustering for continuous-valued columns is performed on a sample of the data records. This option can significantly reduce training time on large datasets with only negligible impact on performance. Setting this option to 0 or to a value larger than the data size disables subsetting.

discriminator_steps (int, required, defaults to 1) - Number of discriminator steps taken per Generator step. The original WGAN paper used 5 discriminator steps for each Generator step; the default of 1 follows the original Tabular GAN implementation.

log_frequency (bool, required, defaults to True) - Whether to use log frequency of categorical counts during conditional sampling. In some cases, switching to False improves performance.

verbose (bool, required, defaults to False) - Whether to print training progress during training.

pac (int, required, defaults to 10) - Number of samples to group together when applying the discriminator. Must evenly divide batch_size.

data_upsample_limit (int, optional, defaults to 100) - If the training data has fewer than this many records, the data is automatically upsampled to the specified limit. Setting this to 0 disables upsampling.

auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column is analyzed to determine whether it contains DateTime values. For each detected column, Tabular GAN automatically converts DateTimes to Unix timestamps (epoch seconds) for model training and converts them back into DateTime strings after sampling.

conditional_vector_type (str, required, defaults to "single_discrete") - Controls how the conditional vector is used in the model architecture, which influences the effectiveness and flexibility of native conditional generation. Possible choices are: "single_discrete", "anyway". single_discrete is the original CTGAN architecture; anyway improves the efficiency of conditional generation by guiding the model toward the requested seed values (see the example config after this list).

conditional_select_mean_columns (float, optional) - Target number of columns to select for conditioning during training. Only used when conditional_vector_type=anyway. Use this if the typical number of seed columns required for conditional generation is known; the model will be better at conditional generation when using approximately this many seed columns. If set, conditional_select_column_prob must be empty.

conditional_select_column_prob (float, optional) - Probability of selecting a column for conditioning during training. Only used when conditional_vector_type=anyway. If set, conditional_select_mean_columns must be empty.

reconstruction_loss_coef (float, required, defaults to 1.0) - Multiplier on the reconstruction loss. Higher values should provide more efficient conditional generation. Only used when conditional_vector_type=anyway.

force_conditioning (bool or "auto", required, defaults to "auto") - When True, skips rejection sampling and directly sets the requested seed values in the generated data. Conditional generation is faster when enabled, but the quality of generated data may be reduced. If True with single_discrete, all correlation between seed columns and generated columns is lost! "auto" chooses a preferred value for force_conditioning based on the other configured parameters; the logs show which value was chosen.
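To illustrate the conditional-generation options above, the sketch below shows a params block (using the same assumed wrapper layout as the config near the top of the page) that switches to the anyway conditional vector type. Only one of conditional_select_mean_columns and conditional_select_column_prob may be set, so the other is omitted; the specific values are illustrative only.

```yaml
# Hypothetical params override for conditional generation with the "anyway" architecture.
params:
  conditional_vector_type: anyway
  conditional_select_mean_columns: 4.0   # illustrative target; leave conditional_select_column_prob unset
  reconstruction_loss_coef: 1.0          # raise to push harder toward requested seed values
  force_conditioning: auto               # logs will show which value was chosen
```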
Generate parameters
num_records (int, required, defaults to 5000) - Number of records to generate.
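As a minimal sketch (same assumed wrapper as the config above), overriding the record count looks like this:

```yaml
generate:
  num_records: 10000   # illustrative override of the 5000-record default
```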
Model information
The underlying model used is an Anyway Conditional Tabular Generative Adversarial Network (ACTGAN). There are Generator and Discriminator models that are trained adversarially. The model is initialized from random weights and trained on the customer provided dataset. This model is an extension of the popular CTGAN model. These algorithmic extensions improve speed, accuracy, memory usage, and conditional generation.
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
Limitations and Biases
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using it in production.