Gretel Tabular DP
Statistical model for synthetic data generation with strong differential privacy guarantees.
The Gretel Tabular DP model API provides access to a probabilistic graphical model for generating synthetic tabular data with strong differential privacy guarantees. Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables.
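As a rough pre-check against the guidance above, the sketch below counts variables and unique categories in a small in-memory dataset. The function name `check_suitability` and the rule "a column is categorical if any value is a string" are assumptions made for illustration. They are not part of any Gretel API, and the thresholds simply restate the numbers in the paragraph above.

```python
def check_suitability(rows, max_categories=100, max_variables=100):
    """Return human-readable warnings for a list of row dicts.

    Hypothetical helper: the thresholds mirror the guidance in the text
    (under 100 variables, fewer than 100 unique categories per variable).
    """
    warnings = []
    columns = list(rows[0]) if rows else []
    if len(columns) >= max_variables:
        warnings.append(f"{len(columns)} variables (guidance: under {max_variables})")
    for column in columns:
        values = [row[column] for row in rows]
        # Assumption: a column counts as categorical if any value is a string.
        if any(isinstance(v, str) for v in values):
            unique = len(set(values))
            if unique >= max_categories:
                warnings.append(
                    f"{column}: {unique} categories (guidance: under {max_categories})"
                )
    return warnings


# Toy data: 'user_id' has 150 distinct string values, 'color' has 3.
rows = [{"user_id": f"u{i}", "color": ["red", "green", "blue"][i % 3]} for i in range(150)]
for message in check_suitability(rows):
    print(message)
```

An empty result only means the dataset falls inside the suggested ranges. It says nothing about the quality of the resulting synthetic data.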
This model can be selected using the tabular_dp model tag. Below is an example configuration to create a Gretel Tabular DP model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to .
data_source (str, required) - Either __tmp__ or a path to a valid and accessible file in CSV format.
epsilon (float, required, defaults to 1) - Privacy loss parameter for differential privacy.
delta (float or auto, required, defaults to auto) - Probability of accidentally leaking information. It is typically set to be less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.5. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
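The arithmetic behind these delta values is easy to check by hand. The sketch below uses a hypothetical helper, `delta_for`, that evaluates 1/n^k for a given number of training records. It only computes the bounds named above. How the auto setting picks a value at or below 1/n^1.5 depends on the dataset and is not reproduced here.

```python
def delta_for(n_records: int, exponent: float = 1.5) -> float:
    """Return 1 / n^exponent for n training records (hypothetical helper)."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return 1.0 / (n_records ** exponent)


n = 500
print(f"1/n     = {delta_for(n, 1.0):.6g}")  # common upper limit for delta
print(f"1/n^1.5 = {delta_for(n, 1.5):.6g}")  # bound used by the auto setting
print(f"1/n^2   = {delta_for(n, 2.0):.6g}")  # stricter choice from the example
```

For a 500-record dataset the last line evaluates to 0.000004, matching the value quoted in the parameter description.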
infer_domain (bool, required, defaults to True) - Whether to determine the data domain (i.e. min/max for continuous attributes, number of categories for categorical attributes) exactly using the training data. If False, the domain parameter must be specified in the config.
domain - Domain of each attribute in the dataset. For numeric variables, only the min and max should be specified (int or float). For categorical variables, only the number of categories should be specified (int). See below for an example of a configuration with domain specified for a dataset containing three variables - state, age and capital gains.
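To make the notion of a domain concrete, here is a minimal Python sketch that summarizes a few toy records in the terms used above: min and max for numeric variables, and the number of categories for categorical ones. This is not the Gretel configuration format. The function name `describe_domain` and the layout of the returned dictionary are assumptions chosen for illustration only.

```python
from typing import Dict, List, Union

Value = Union[int, float, str]


def describe_domain(rows: List[Dict[str, Value]]) -> Dict[str, dict]:
    """Summarize each column: min/max if numeric, category count otherwise."""
    if not rows:
        raise ValueError("rows must not be empty")
    domain = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            domain[column] = {"min": min(values), "max": max(values)}
        else:
            domain[column] = {"categories": len(set(values))}
    return domain


# Toy records using the three variables named in the text.
rows = [
    {"state": "CA", "age": 34, "capital_gains": 0.0},
    {"state": "NY", "age": 51, "capital_gains": 2500.0},
    {"state": "CA", "age": 27, "capital_gains": 120.5},
]
print(describe_domain(rows))
```

A summary like this shows which numbers the domain parameter expects for each variable when infer_domain is False.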
To reference the default tabular-dp configuration in a workflow, use the following, e.g.
Example CLI script to generate 1000 additional records from a trained Tabular DP model:
The underlying model is a probabilistic graphical model (PGM), which is estimated using low dimensional distributions measured with differential privacy. This model follows three steps:
1. Automatically select a subset of correlated pairs of variables using a differentially private algorithm.
2. Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.
3. Estimate a PGM that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
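The measurement step can be illustrated with a small sketch that builds a contingency table for one pair of variables and adds noise to every cell. The text above does not say which noise mechanism the model uses or how the privacy budget is split across pairs, so the Laplace mechanism with scale 1/epsilon is an assumption made purely for illustration, as are the function names. Pair selection and PGM estimation are not shown.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(records, col_a, col_b, domain_a, domain_b, epsilon, seed=None):
    """Return a noisy contingency table over every (a, b) cell of the domain.

    Assumption: adding or removing one record changes exactly one cell by 1,
    so Laplace noise with scale 1/epsilon is used for this single marginal.
    """
    rng = random.Random(seed)
    counts = Counter((r[col_a], r[col_b]) for r in records)
    scale = 1.0 / epsilon
    # Noise is added to empty cells too, so absent combinations are not revealed.
    return {
        cell: counts[cell] + laplace_noise(scale, rng)
        for cell in product(domain_a, domain_b)
    }


records = [
    {"state": "CA", "employed": "yes"},
    {"state": "CA", "employed": "no"},
    {"state": "NY", "employed": "yes"},
    {"state": "CA", "employed": "yes"},
]
table = noisy_marginal(
    records, "state", "employed", ["CA", "NY"], ["yes", "no"], epsilon=1.0, seed=0
)
for cell, value in table.items():
    print(cell, round(value, 2))
```

Smaller epsilon values give a larger noise scale, which is why the counts drift further from the true contingency table as the privacy guarantee is tightened.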
More details about the model can be found in the paper Winning the NIST Contest: A scalable and general approach to differentially private synthetic data.
If running this system in local mode (on-premises), the following instance type is recommended. Note that a GPU is not required.
CPU: Minimum 4 cores, 16GB RAM.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using them in production.
Conditional generation is not supported.
Privacy Filters are not supported. This is because privacy filters directly utilize training records to provide privacy protections. The process does not involve any addition of calibrated noise. Hence, enabling privacy filters would invalidate the differential privacy guarantee.
Gretel Tabular DP is not appropriate for time series data where maintaining correlations across sequential records is important, as the underlying model has an assumption of independence between records.
Gretel Tabular DP is not appropriate for text data if novel text is desired in the synthetic data. Use Gretel GPT to generate differentially private synthetic text.