Gretel Tabular DP

Statistical model for synthetic data generation with strong differential privacy guarantees.

The Gretel Tabular DP model API provides access to a probabilistic graphical model for generating synthetic tabular data with strong differential privacy guarantees. Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables.

Model creation

This model can be selected using the tabular_dp model tag. Below is an example configuration to create a Gretel Tabular DP model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.

# Default configuration for Gretel Tabular DP to generate synthetic data with
# differential privacy guarantees

schema_version: "1.0"
name: "tabular-dp"
models:
  - tabular_dp:
      data_source: __tmp__
      params:
        epsilon: 1
        delta: auto
        infer_domain: True
        domain: null
  • data_source (str, required) - Either __tmp__ or a path to a valid, accessible file in CSV format.

  • epsilon (float, required, defaults to 1) - Privacy loss parameter for differential privacy.

  • delta (float or auto, required, defaults to auto) - Probability of accidentally leaking information. It is typically set to be less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.5. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.

  • infer_domain (bool, required, defaults to True) - Whether to determine the data domain (i.e. min/max for continuous attributes, number of categories for categorical attributes) exactly from the training data. If False, the domain parameter must be specified in the config.

  • domain - Domain of each attribute in the dataset. For numeric variables, only the min and max should be specified (int or float). For categorical variables, only the number of categories should be specified (int). See below for an example of a configuration with domain specified for a dataset containing three variables - state, age and capital gains.

    # Configuration for Gretel Tabular DP with domain specified for each variable

    schema_version: "1.0"
    name: "tabular-dp-with-domain"
    models:
      - tabular_dp:
          data_source: __tmp__
          params:
            epsilon: 1.0
            delta: auto
            infer_domain: False
            domain:
              state:
                num_categories: 50
              age:
                min: 0
                max: 99
              capital_gains:
                min: -10000.50
                max: 1999999.99
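
The guidance for delta above is simple arithmetic on the number of training records n. The following Python sketch reproduces the thresholds mentioned in the delta description. The helper name delta_bound is hypothetical and is not part of any Gretel library; it only illustrates how a manual value for delta could be chosen.

```python
def delta_bound(n_records: int, exponent: float = 1.5) -> float:
    """Return 1 / n**exponent for a dataset with n_records training records.

    Hypothetical helper for illustration only; not part of the Gretel SDK.
    """
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return 1.0 / (n_records ** exponent)


n = 500  # number of training records, matching the example in the text

# 1/n: delta is typically set below this value
print(f"1/n     = {delta_bound(n, 1.0):.6f}")
# 1/n^1.5: upper bound on the value chosen when delta is set to auto
print(f"1/n^1.5 = {delta_bound(n, 1.5):.8f}")
# 1/n^2: a stricter, manually chosen value
print(f"1/n^2   = {delta_bound(n, 2.0):.6f}")
```

For n = 500, the last line gives the 0.000004 quoted in the delta description, which can be written into the config as delta: 0.000004.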

Including in a workflow

To use the default tabular-dp configuration in a workflow, reference it from a gretel_model action, for example:

actions:
  # s3-crawl omitted for brevity
  - name: model-train-run
    type: gretel_model
    input: s3-crawl
    config:
      project_id: proj_1
      model: synthetics/tabular-differential-privacy
      run_params:
        params:
          num_records_multiplier: 1.0
      training_data: "{outputs.s3-crawl.dataset.files.data}"

Data generation

Example CLI script to generate 1000 additional records from a trained Tabular DP model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param num_records 1000 \
  --output .

Model information

The underlying model is a probabilistic graphical model (PGM), which is estimated from low-dimensional distributions measured with differential privacy. Training and generation follow three steps:

  1. Automatically select a subset of correlated pairs of variables using a differentially private algorithm.

  2. Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.

  3. Estimate a PGM that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
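
To make step 2 more concrete, the following Python sketch measures one noisy two-way marginal. It is an illustration under stated assumptions, not Gretel's implementation: the function name noisy_marginal, the toy rows, and the use of Gaussian noise with a hand-picked standard deviation sigma are all assumptions. In a real differentially private algorithm the noise scale is derived from the privacy budget (epsilon and delta) rather than chosen by hand.

```python
import itertools
import random
from collections import Counter


def noisy_marginal(rows, col_a, col_b, domain_a, domain_b, sigma, rng=None):
    """Return a noisy contingency table for one pair of columns.

    Every cell of domain_a x domain_b gets a count, including cells that
    never occur in the data. Independent Gaussian noise with standard
    deviation sigma is added to each count. Rows whose values fall outside
    the given domains are ignored.
    """
    rng = rng or random.Random()
    exact = Counter((row[col_a], row[col_b]) for row in rows)
    return {
        cell: exact.get(cell, 0) + rng.gauss(0.0, sigma)
        for cell in itertools.product(domain_a, domain_b)
    }


# Toy training data (hypothetical).
rows = [
    {"state": "CA", "employed": "yes"},
    {"state": "CA", "employed": "no"},
    {"state": "NY", "employed": "yes"},
    {"state": "CA", "employed": "yes"},
]

table = noisy_marginal(
    rows, "state", "employed",
    domain_a=["CA", "NY", "TX"], domain_b=["yes", "no"],
    sigma=1.0, rng=random.Random(0),
)
for cell, count in sorted(table.items()):
    print(cell, round(count, 2))
```

The table covers the full domain rather than only the observed combinations, so the set of cells does not itself reveal which values appear in the training data. The noisy counts can be negative or fractional; reconciling them into a consistent distribution is the job of the PGM estimation in step 3.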

More details about the model can be found in the paper "Winning the NIST Contest: A scalable and general approach to differentially private synthetic data".

Minimum requirements

If running this system in local mode (on-premises), the following instance type is recommended. Note that a GPU is not required.

CPU: Minimum 4 cores, 16GB RAM.

Limitations and biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in it. We recommend having a human review the dataset used to train models before using them in production.

  • Conditional generation is not supported.

  • Privacy Filters are not supported. This is because privacy filters directly utilize training records to provide privacy protections. The process does not involve any addition of calibrated noise. Hence, enabling privacy filters would invalidate the differential privacy guarantee.

  • Gretel Tabular DP is not appropriate for time series data where maintaining correlations across sequential records is important, as the underlying model has an assumption of independence between records.

  • Gretel Tabular DP is not appropriate for text data if novel text is desired in the synthetic data. Use Gretel GPT to generate differentially private synthetic text.
