Gretel Tabular DP

Statistical model for synthetic data generation with strong differential privacy guarantees.

The Gretel Tabular DP model API provides access to a probabilistic graphical model for generating synthetic tabular data with strong differential privacy guarantees. Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables.

Model creation

This model can be selected using the tabular_dp model tag. Below is an example configuration to create a Gretel Tabular DP model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to create and train a model.

# Default configuration for Gretel Tabular DP to generate synthetic data with
# differential privacy guarantees

schema_version: "1.0"
name: "tabular-dp"
models:
  - tabular_dp:
      data_source: __tmp__
      params:
        epsilon: 1
        delta: auto
        infer_domain: True
        domain: null
  • data_source (str, required) - __tmp__ or a path to a valid and accessible file in CSV format.

  • epsilon (float, required, defaults to 1) - Privacy loss parameter for differential privacy.

  • delta (float or auto, required, defaults to auto) - Probability of accidentally leaking information. It is typically set to be less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.5. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality. A short sketch of this arithmetic appears after this list.

  • infer_domain (bool, required, defaults to True) - Whether to determine the data domain (i.e. min/max for continuous attributes, number of categories for categorical attributes) directly from the training data. If False, the domain parameter must be specified in the config.

  • domain - Domain of each attribute in the dataset. For numeric variables, only the min and max should be specified (int or float). For categorical variables, only the number of categories should be specified (int). See below for an example of a configuration with domain specified for a dataset containing three variables - state, age and capital gains.

    # Configuration for Gretel Tabular DP with domain specified for each variable

    schema_version: "1.0"
    name: "tabular-dp-with-domain"
    models:
      - tabular_dp:
          data_source: __tmp__
          params:
            epsilon: 1.0
            delta: auto
            infer_domain: False
            domain:
              state:
                num_categories: 50
              age:
                min: 0
                max: 99
              capital_gains:
                min: -10000.50
                max: 1999999.99
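
For a concrete sense of how delta scales with dataset size, here is a minimal sketch of the arithmetic referenced in the delta bullet above. It is illustrative only and does not show Gretel's internal auto-selection logic; the record count is a made-up example.

# Illustrative delta arithmetic (not Gretel's internal logic)
n = 500                        # number of training records (example value)

auto_upper_bound = 1 / n**1.5  # "auto" keeps delta <= 1/n^1.5
stricter_delta = 1 / n**2      # 1/n^2 = 0.000004 for n = 500

print(f"auto upper bound (1/n^1.5): {auto_upper_bound:.2e}")  # ~8.94e-05
print(f"stricter choice (1/n^2):    {stricter_delta}")        # 4e-06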

Including in a workflow

To reference the default tabular-dp configuration in a workflow, use an action like the following:

actions:
  # s3-crawl omitted for brevity
  - name: model-train-run
    type: gretel_model
    input: s3-crawl
    config:
      project_id: proj_1
      model: synthetics/tabular-differential-privacy
      run_params:
        params:
          num_records_multiplier: 1.0
      training_data: "{outputs.s3-crawl.dataset.files.data}"

Data generation

Example CLI script to generate 1000 additional records from a trained Tabular DP model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param num_records 1000 \
  --output .
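
If you prefer the Python SDK over the CLI, the sketch below shows one way to train a Tabular DP model and then generate records with the Gretel Python client. The project name, training file, and record count are placeholder values, and exact client calls can vary by SDK version, so treat this as a sketch rather than a drop-in script.

from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config

configure_session(api_key="prompt", cache="yes", validate=True)

# Placeholder project name and training file
project = create_or_get_unique_project(name="my-tabular-dp")
config = read_model_config("synthetics/tabular-differential-privacy")

# Train the model on your CSV data
model = project.create_model_obj(model_config=config, data_source="train.csv")
model.submit_cloud()
poll(model)

# Generate 1000 additional records from the trained model
record_handler = model.create_record_handler_obj(params={"num_records": 1000})
record_handler.submit_cloud()
poll(record_handler)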

Model information

The underlying model is a probabilistic graphical model (PGM), which is estimated using low dimensional distributions measured with differential privacy. This model follows three steps:

  1. Automatically select a subset of correlated pairs of variables using a differentially private algorithm.

  2. Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.

  3. Estimate a PGM that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
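
To make step 2 more concrete, the sketch below perturbs a two-way contingency table with Gaussian noise. This is a simplified illustration of measuring a noisy marginal, not Gretel's implementation: the columns, values, and noise scale are invented for the example, whereas the real model calibrates its noise to the configured epsilon and delta budget before fitting the PGM.

import numpy as np
import pandas as pd

# Toy training data; columns and values are invented for illustration
df = pd.DataFrame({
    "state": np.random.choice(["CA", "NY", "TX"], size=500),
    "employed": np.random.choice(["yes", "no"], size=500),
})

# Step 2 (simplified): measure a pairwise marginal as a contingency table...
marginal = pd.crosstab(df["state"], df["employed"]).astype(float)

# ...and add noise to each cell count. Here the noise scale is hard-coded;
# the real model derives it from the overall privacy budget.
noisy_marginal = marginal + np.random.normal(0, 5.0, size=marginal.shape)
print(noisy_marginal.round(1))

# Step 3 would then fit a probabilistic graphical model to the noisy
# marginals and sample synthetic records from it.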

Minimum requirements

If running this system in local mode (on-premises), the following instance type is recommended. Note that a GPU is not required.

CPU: Minimum 4 cores, 16GB RAM.

Limitations and biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using it in production.

  • Conditional generation is not supported.

  • Gretel Tabular DP is not appropriate for time series data where maintaining correlations across sequential records is important, as the underlying model has an assumption of independence between records.

  • Gretel Tabular DP is not appropriate for text data if novel text is desired in the synthetic data. Use Gretel GPT to generate differentially private synthetic text.

More details about the model can be found in the paper Winning the NIST Contest: A scalable and general approach to differentially private synthetic data.

Privacy Filters are not supported. Privacy filters directly utilize training records to provide privacy protections, without adding any calibrated noise, so enabling them would invalidate the differential privacy guarantee.
