Gretel Tabular GAN
Adversarial model that supports tabular data, structured numerical data, and high column count data.
The Gretel Tabular GAN model API provides access to a generative data model for tabular data. Gretel Tabular GAN supports advanced features such as conditional data generation. Tabular GAN works well with datasets featuring primarily numeric data, high column counts, and highly unique categorical fields.
Step configuration
The config below shows the full default configuration for Tabular GAN. Descriptions of all parameters are listed below.
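The YAML sketch below reconstructs that default config from the parameter defaults documented on this page. The wrapper keys (schema_version, models, actgan, data_source) are assumptions about Gretel's standard model config layout and should be checked against an official blueprint before use.

```yaml
# Sketch of a default Tabular GAN config, assembled from the documented defaults.
# Wrapper keys (schema_version, models, actgan, data_source) are assumed, not confirmed.
schema_version: "1.0"
models:
  - actgan:
      data_source: __tmp__
      params:
        embedding_dim: 128
        generator_dim: [256, 256]
        discriminator_dim: [256, 256]
        generator_lr: 0.0002
        generator_decay: 0.000001
        discriminator_lr: 0.0002
        discriminator_decay: 0.000001
        batch_size: 500            # must be a multiple of 10 and divisible by pac
        epochs: 300
        binary_encoder_cutoff: 150
        binary_encoder_nan_handler: mode
        cbn_sample_size: 250000
        discriminator_steps: 1
        log_frequency: true
        verbose: false
        pac: 10
        data_upsample_limit: 100
        auto_transform_datetime: false
        conditional_vector_type: single_discrete
        reconstruction_loss_coef: 1.0
        force_conditioning: auto
      generate:
        num_records: 5000
```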
Train parameters
params - Parameters that control the model training process:

embedding_dim (int, required, defaults to 128) - Size of the random sample passed to the Generator (z vector).

generator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the Residuals. Adding more numbers to this list creates more Residuals, one for each number, which is equivalent to increasing the depth of the Generator.

discriminator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the discriminator linear layers. A new Linear layer is created for each number added to this list.

generator_lr (float, required, defaults to 2e-4) - Learning rate for the Generator.

generator_decay (float, required, defaults to 1e-6) - Weight decay for the Generator's Adam optimizer.

discriminator_lr (float, required, defaults to 2e-4) - Learning rate for the discriminator.

discriminator_decay (float, required, defaults to 1e-6) - Weight decay for the discriminator's Adam optimizer.

batch_size (int, required, defaults to 500) - Number of examples the model sees each step. This must be a multiple of 10, as required by the Tabular GAN training scheme.

epochs (int, required, defaults to 300) - Number of training iterations the model undergoes. A larger number results in longer training times but potentially higher quality synthetic data.

binary_encoder_cutoff (int, required, defaults to 150) - Number of unique categorical values in a column before encoding switches from one-hot to binary encoding for that column. Decrease this number if you run into out-of-memory issues; doing so results in faster training times, with a potential loss in performance in a few select cases.

binary_encoder_nan_handler (str, optional, defaults to "mode") - Method for handling invalid generated binary encodings. When generating data, the model may output binary encodings that do not map to a real category; this parameter specifies what value to use in that case. Possible choices are: "mode". Note that this does not replace all NaNs: the generated data can still contain NaNs if the training data has NaNs.

cbn_sample_size (int, optional, defaults to 250000) - If set, clustering for continuous-valued columns is performed on a sample of the data records. This option can significantly reduce training time on large datasets with only negligible impact on performance. Setting this option to 0 or to a value larger than the data size disables subsetting.

discriminator_steps (int, required, defaults to 1) - Number of discriminator steps taken per Generator step. The original WGAN paper used 5 discriminator steps for each Generator step; the default of 1 follows the original Tabular GAN implementation.

log_frequency (bool, required, defaults to True) - Whether to use log frequency of categorical counts during conditional sampling. In some cases, switching to False improves performance.

verbose (bool, required, defaults to False) - Whether to print training progress during training.

pac (int, required, defaults to 10) - Number of samples to group together when applying the discriminator. Must evenly divide batch_size.

data_upsample_limit (int, optional, defaults to 100) - If the training data has fewer than this many records, the data is automatically upsampled to the specified limit. Setting this to 0 disables upsampling.

auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column is analyzed to determine whether it contains DateTime values. For each detected column, Tabular GAN automatically converts DateTimes to Unix timestamps (epoch seconds) for model training and converts them back into DateTime strings after sampling.

conditional_vector_type (str, required, defaults to "single_discrete") - Controls how the conditional vector is used in the model architecture, which influences the effectiveness and flexibility of native conditional generation. Possible choices are: "single_discrete", "anyway". single_discrete is the original CTGAN architecture; anyway improves the efficiency of conditional generation by guiding the model toward the requested seed values (see the example config after this list).

conditional_select_mean_columns (float, optional) - Target number of columns to select for conditioning during training. Only used when conditional_vector_type=anyway. Use this if the typical number of seed columns required for conditional generation is known; the model will be better at conditional generation when using approximately this many seed columns. If set, conditional_select_column_prob must be empty.

conditional_select_column_prob (float, optional) - Probability of selecting a column for conditioning during training. Only used when conditional_vector_type=anyway. If set, conditional_select_mean_columns must be empty.

reconstruction_loss_coef (float, required, defaults to 1.0) - Multiplier on the reconstruction loss. Higher values should provide more efficient conditional generation. Only used when conditional_vector_type=anyway.

force_conditioning (bool or "auto", required, defaults to "auto") - When True, skips rejection sampling and directly sets the requested seed values in the generated data. Conditional generation is faster when enabled, but the quality of generated data may be reduced. If True with single_discrete, all correlation between seed columns and generated columns is lost! "auto" chooses a preferred value for force_conditioning based on the other configured parameters; the logs show which value was chosen.
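To illustrate the conditional-generation options above, the sketch below shows a params block (using the same assumed wrapper layout as the config near the top of the page) that switches to the anyway conditional vector type. Only one of conditional_select_mean_columns and conditional_select_column_prob may be set, so the other is omitted; the specific values are illustrative only.

```yaml
# Hypothetical params override for conditional generation with the "anyway" architecture.
params:
  conditional_vector_type: anyway
  conditional_select_mean_columns: 4.0   # illustrative target; leave conditional_select_column_prob unset
  reconstruction_loss_coef: 1.0          # raise to push harder toward requested seed values
  force_conditioning: auto               # logs will show which value was chosen
```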
Generate parameters
num_records (int, required, defaults to 5000) - Number of records to generate.
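As a minimal sketch (same assumed wrapper as the config above), overriding the record count looks like this:

```yaml
generate:
  num_records: 10000   # illustrative override of the 5000-record default
```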
Model information
The underlying model used is an Anyway Conditional Tabular Generative Adversarial Network (ACTGAN). There are Generator and Discriminator models that are trained adversarially. The model is initialized from random weights and trained on the customer provided dataset. This model is an extension of the popular CTGAN model. These algorithmic extensions improve speed, accuracy, memory usage, and conditional generation.
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
Limitations and Biases
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using it in production.