Gretel Tabular DP
Statistical model for synthetic data generation with strong differential privacy guarantees.
The Gretel Tabular DP model API provides access to a probabilistic graphical model for generating synthetic tabular data with strong differential privacy guarantees. Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables.
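As a rough pre-check against the guidance above, the following Python sketch counts the columns in a CSV and the unique values in each column, then flags anything outside the stated ranges. The function name check_suitability and its warning format are hypothetical and not part of any Gretel API. The thresholds are taken from the text and are treated here as guidelines, not hard limits.

```python
import csv
import io

MAX_VARIABLES = 100    # guideline from the text: "under 100 variables"
MAX_CARDINALITY = 100  # guideline from the text: "<100 unique categories per variable"


def check_suitability(csv_text):
    """Return a list of human-readable warnings for a CSV supplied as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []

    # Collect the distinct raw values seen in every column.
    unique_values = {name: set() for name in columns}
    for row in reader:
        for name in columns:
            unique_values[name].add(row[name])

    warnings = []
    if len(columns) >= MAX_VARIABLES:
        warnings.append(
            f"{len(columns)} variables (guideline: under {MAX_VARIABLES})"
        )
    for name, values in unique_values.items():
        if len(values) >= MAX_CARDINALITY:
            warnings.append(
                f"{name}: {len(values)} unique values "
                f"(guideline: under {MAX_CARDINALITY})"
            )
    return warnings
```

Because the sketch compares raw strings, continuous numeric columns will usually be flagged as high cardinality as well. Treat a warning as a prompt to look at the column, not as a verdict.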
Model creation
This model can be selected using the tabular_dp model tag. Below is an example configuration to create a Gretel Tabular DP model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.
data_source (str, required) - Use __tmp__ or point to a valid and accessible file in CSV format.
epsilon (float, required, defaults to 1) - Privacy loss parameter for differential privacy.
delta (float or auto, required, defaults to auto) - Probability of accidentally leaking information. It is typically set to be less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.5. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
infer_domain (bool, required, defaults to True) - Whether to determine the data domain (i.e. min/max for continuous attributes, number of categories for categorical attributes) exactly using the training data. If False, the domain parameter must be specified in the config.
domain - Domain of each attribute in the dataset. For numeric variables, only the min and max should be specified (int or float). For categorical variables, only the number of categories should be specified (int). See below for an example of a configuration with domain specified for a dataset containing three variables - state, age and capital gains.
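The parameter rules above can be sketched in Python as a small validation helper, together with the arithmetic behind the delta examples. This is an illustration only: the helper names, the flat params dictionary, the [min, max] list form for numeric domains, and the concrete domain values for state, age and capital_gains are all assumptions made for the example, not the documented Gretel config schema.

```python
def suggested_delta(n_records, exponent=1.5):
    """Return 1 / n^exponent; exponent=2 gives the stricter setting from the text."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return 1.0 / (n_records ** exponent)


def validate_params(params):
    """Check the rules described above; raise ValueError on the first violation."""
    if params.get("epsilon", 1) <= 0:
        raise ValueError("epsilon must be positive")

    delta = params.get("delta", "auto")
    if delta != "auto" and not (0 < delta < 1):
        raise ValueError("delta must be 'auto' or a probability in (0, 1)")

    # With infer_domain left at True, the domain comes from the training data.
    if params.get("infer_domain", True):
        return

    domain = params.get("domain")
    if not domain:
        raise ValueError("domain is required when infer_domain is False")
    for name, spec in domain.items():
        if isinstance(spec, int) and not isinstance(spec, bool):
            # Categorical variable: number of categories.
            if spec < 1:
                raise ValueError(f"{name}: category count must be at least 1")
        elif isinstance(spec, (list, tuple)) and len(spec) == 2:
            # Numeric variable: [min, max] (assumed representation).
            low, high = spec
            if low > high:
                raise ValueError(f"{name}: min must not exceed max")
        else:
            raise ValueError(f"{name}: expected a category count or [min, max]")


# Illustrative values only; they are not taken from a real dataset.
example_params = {
    "epsilon": 1,
    "delta": suggested_delta(500, exponent=2),  # 1 / 500^2 = 0.000004
    "infer_domain": False,
    "domain": {
        "state": 50,
        "age": [18, 90],
        "capital_gains": [0, 100000],
    },
}
validate_params(example_params)
```

Supplying the domain by hand avoids reading it exactly from the training data, at the cost of having to know plausible bounds and category counts in advance.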
Including in a workflow
To reference the default tabular-dp configuration in a workflow, use the following, e.g.
Data generation
Example CLI script to generate 1000 additional records from a trained Tabular DP model:
Model information
The underlying model is a probabilistic graphical model (PGM), which is estimated using low dimensional distributions measured with differential privacy. This model follows three steps:
1. Automatically select a subset of correlated pairs of variables using a differentially private algorithm.
2. Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.
3. Estimate a PGM that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
More details about the model can be found in the paper Winning the NIST Contest: A scalable and general approach to differentially private synthetic data.
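To make the idea of a noisy marginal concrete, here is a minimal Python sketch of the measurement step for a single pair of variables: it builds the contingency table and adds Gaussian noise to every cell. It is not Gretel's implementation. The function name and arguments are hypothetical, and the noise scale sigma is passed in directly; calibrating sigma to a target epsilon and delta, selecting pairs privately, and fitting the PGM are all out of scope here.

```python
import random
from collections import Counter
from itertools import product


def noisy_marginal(records, col_a, col_b, domain_a, domain_b, sigma, rng=None):
    """Return {(a, b): noisy_count} over the full cross product of both domains."""
    rng = rng or random.Random()

    # Exact contingency table of the two columns.
    counts = Counter((record[col_a], record[col_b]) for record in records)

    # Noise is added to every cell, including empty ones, so the set of
    # released cells does not reveal which combinations occur in the data.
    return {
        cell: counts.get(cell, 0) + rng.gauss(0.0, sigma)
        for cell in product(domain_a, domain_b)
    }


records = [
    {"state": "CA", "sex": "F"},
    {"state": "CA", "sex": "F"},
    {"state": "NY", "sex": "M"},
]
marginal = noisy_marginal(
    records, "state", "sex", ["CA", "NY"], ["F", "M"],
    sigma=1.0, rng=random.Random(0),
)
```

Noisy counts can be negative or fractional, which is expected: the estimation step works with these noisy tables and does not require them to be valid counts.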
Minimum requirements
If running this system in local mode (on-premises), the following instance type is recommended. Note that a GPU is not required.
CPU: Minimum 4 cores, 16GB RAM.
Limitations and biases
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using them in production.
Conditional generation is not supported.
Privacy Filters are not supported. This is because privacy filters directly utilize training records to provide privacy protections. The process does not involve any addition of calibrated noise. Hence, enabling privacy filters would invalidate the differential privacy guarantee.
Gretel Tabular DP is not appropriate for time series data where maintaining correlations across sequential records is important, as the underlying model has an assumption of independence between records.
Gretel Tabular DP is not appropriate for text data if novel text is desired in the synthetic data. Use Gretel GPT to generate differentially private synthetic text.