# Gretel Tabular DP

Statistical model for synthetic data generation with strong differential privacy guarantees.

The Gretel Tabular DP model API provides access to a probabilistic graphical model for generating synthetic tabular data with strong differential privacy guarantees. Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables.
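As a rough pre-check of the guidance above, the sketch below counts the columns and the distinct values per column in a CSV file. The helper name `suitability_report` and its report keys are hypothetical and not part of any Gretel API; the thresholds of 100 simply mirror the figures quoted above. Every column is treated as categorical, so continuous numeric columns will be flagged and should be judged separately.

```python
import csv
import io


def suitability_report(csv_file, max_columns=100, max_categories=100):
    """Summarize column count and per-column cardinality for a CSV file object.

    Hypothetical helper for illustration only; not part of any Gretel API.
    """
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    distinct = {col: set() for col in columns}
    for row in reader:
        for col in columns:
            distinct[col].add(row[col])
    return {
        "num_columns": len(columns),
        "too_many_columns": len(columns) >= max_columns,
        # Columns whose distinct-value count reaches the cardinality threshold.
        "high_cardinality": {
            col: len(values)
            for col, values in distinct.items()
            if len(values) >= max_categories
        },
    }


# Tiny in-memory example; real use would pass an open file handle instead.
# The threshold is lowered to 2 only so that the toy data triggers a flag.
sample = io.StringIO("state,sex\nCA,F\nNY,M\nCA,M\n")
report = suitability_report(sample, max_categories=2)
print(report)
```

An empty `high_cardinality` entry and `too_many_columns: False` suggest the dataset falls within the ranges described above; it does not check whether the variables are primarily categorical.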

This model can be selected using the `tabular_dp` model tag. Below is an example configuration to create a Gretel Tabular DP model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.

```yaml
# Default configuration for Gretel Tabular DP to generate synthetic data with
# differential privacy guarantees
schema_version: "1.0"
name: "tabular-dp"
models:
  - tabular_dp:
      data_source: __tmp__
      params:
        epsilon: 1
        delta: auto
        infer_domain: True
        domain: null
```

- `data_source` (str, *required*) - `__tmp__` or point to a valid and accessible file in CSV format.
- `delta` (float or `auto`, *required*, defaults to `auto`) - Probability of accidentally leaking information. It is typically set to be less than `1/n`, where `n` is the number of training records. By default, `delta` is automatically set based on the characteristics of your dataset to be less than or equal to `1/n^1.5`. You can also choose your own value for `delta`. Decreasing `delta` (for example to `1/n^2`, which corresponds to `delta: 0.000004` for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
- `infer_domain` (bool, *required*, defaults to `True`) - Whether to determine the data domain (i.e. min/max for continuous attributes, number of categories for categorical attributes) exactly using the training data. If `False`, the `domain` parameter must be specified in the config.
- `domain` - Domain of each attribute in the dataset. For numeric variables, only the min and max should be specified (int or float). For categorical variables, only the number of categories should be specified (int). See below for an example of a configuration with domain specified for a dataset containing three variables: state, age and capital gains.

```yaml
# Configuration for Gretel Tabular DP with domain specified for each variable
schema_version: "1.0"
name: "tabular-dp-with-domain"
models:
  - tabular_dp:
      data_source: __tmp__
      params:
        epsilon: 1.0
        delta: auto
        infer_domain: False
        domain:
          state:
            num_categories: 50
          age:
            min: 0
            max: 99
          capital_gains:
            min: -10000.50
            max: 1999999.99
```
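To make the `delta` arithmetic concrete, here is a small Python sketch that evaluates the three thresholds mentioned above for a given number of training records. The helper name `delta_thresholds` is hypothetical. How `auto` actually picks a value at or below `1/n^1.5` is not described here, so the sketch only computes the bounds.

```python
def delta_thresholds(n):
    """Return the delta thresholds discussed above for n training records.

    Hypothetical helper for illustration only; not part of any Gretel API.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of training records")
    return {
        "1/n": 1.0 / n,        # typical upper limit for delta
        "1/n^1.5": n ** -1.5,  # upper bound used when delta is `auto`
        "1/n^2": n ** -2.0,    # stricter choice mentioned above
    }


# The 500-record example from the parameter description.
thresholds = delta_thresholds(500)
for label, value in thresholds.items():
    print(f"{label}: {value:.6g}")
```

For 500 records the `1/n^2` entry matches the `delta: 0.000004` value quoted above, and each threshold is smaller than the one before it.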

Example CLI script to generate 1000 additional records from a trained Tabular DP model:

```bash
gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param num_records 1000 \
  --output .
```
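When the same command needs to be issued from a script, one option is to assemble the argument list in Python and hand it to `subprocess`. The sketch below only reuses the flags shown above; the helper name `build_run_command` and the values `my-project` and `my-model-id` are placeholders, not real identifiers.

```python
import shlex


def build_run_command(project, model_id, num_records, output_dir=".", runner="cloud"):
    """Build the argument list for the `gretel models run` command shown above.

    Hypothetical helper; it only assembles flags that appear in the CLI example.
    """
    return [
        "gretel", "models", "run",
        "--project", project,
        "--model-id", model_id,
        "--runner", runner,
        "--param", "num_records", str(num_records),
        "--output", output_dir,
    ]


# Placeholder project and model identifiers.
command = build_run_command("my-project", "my-model-id", 1000)
print(shlex.join(command))

# To execute it, assuming the gretel CLI is installed and configured:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting issues if a project name contains spaces or special characters.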

The underlying model is a probabilistic graphical model (PGM), which is estimated from low-dimensional distributions measured with differential privacy. Training follows three steps:

1. Automatically select a subset of correlated pairs of variables using a differentially private algorithm.
2. Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.
3. Estimate a PGM that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
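The second step can be illustrated with a minimal Python sketch: count every combination of two columns, then add Gaussian noise to each cell. This is not Gretel's implementation. The function name `noisy_marginal`, the explicit `sigma` argument, and the sample records are assumptions for illustration; in a real differentially private mechanism the noise scale is calibrated from `epsilon` and `delta`, which this sketch does not do. Pair selection and PGM estimation are omitted.

```python
import random
from collections import Counter
from itertools import product


def noisy_marginal(records, col_a, col_b, domain_a, domain_b, sigma, rng):
    """Return a noisy two-way contingency table for columns col_a and col_b.

    Illustrative only: sigma is supplied by the caller and is NOT derived
    from epsilon and delta, so this offers no formal privacy guarantee.
    """
    counts = Counter((row[col_a], row[col_b]) for row in records)
    # Iterate over the full domain so that empty cells also receive noise;
    # values outside the declared domains are ignored.
    return {
        cell: counts[cell] + rng.gauss(0.0, sigma)
        for cell in product(domain_a, domain_b)
    }


# Made-up sample records for demonstration.
records = [
    {"state": "CA", "sex": "F"},
    {"state": "CA", "sex": "F"},
    {"state": "NY", "sex": "F"},
    {"state": "CA", "sex": "M"},
]
table = noisy_marginal(
    records, "state", "sex", ["CA", "NY"], ["F", "M"],
    sigma=1.0, rng=random.Random(0),
)
for cell, value in sorted(table.items()):
    print(cell, round(value, 2))
```

The domains are passed in explicitly rather than read from the records, which mirrors the role of the `domain` parameter: the set of cells in the table should not itself depend on the private data.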

More details about the model can be found in the paper "Winning the NIST Contest: A scalable and general approach to differentially private synthetic data".

If running this system in local mode (on-premises), the following instance type is recommended. Note that a GPU is not required.

CPU: Minimum 4 cores, 16GB RAM.

This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in that dataset. We recommend having a human review the dataset used to train models before using them in production.

- Conditional generation is not supported.
- Privacy Filters are not supported. This is because privacy filters directly utilize training records to provide privacy protections. The process does not involve any addition of calibrated noise. Hence, enabling privacy filters would invalidate the differential privacy guarantee.
- Gretel Tabular DP is not appropriate for time series data where maintaining correlations across sequential records is important, as the underlying model has an assumption of independence between records.
- Gretel Tabular DP is not appropriate for text data if novel text is desired in the synthetic data.
