Privacy Protection

Use Gretel's privacy protection mechanisms to prevent adversarial attacks and better meet your data sharing needs.


In addition to the privacy inherent in synthetic data itself, Gretel's privacy mechanisms add supplemental protection. These configuration settings help ensure that the generated data is safe from adversarial attacks.

Primary Protection Mechanisms

There are three privacy protection mechanisms:

Differential Privacy: Differential Privacy is supported with Tabular Fine-Tuning (numeric, categorical, and free text data), Text Fine-Tuning (free text data only), and Tabular DP (numeric and categorical data only, when a very small ε < 5 is required). To enable Differential Privacy for Tabular Fine-Tuning and Text Fine-Tuning, set dp: true. Tabular DP always runs with differential privacy.

Similarity Filters: Similarity filters ensure that no synthetic record is overly similar to a training record. Synthetic records that closely resemble training records pose a severe privacy risk, as adversarial attacks commonly exploit such records to gain insights into the original data. Similarity Filtering is enabled by the privacy_filters.similarity configuration setting. Similarity filters are available for Gretel Tabular GAN.

Allowed values are null, auto, medium, and high. A value of medium will filter out any synthetic record that is an exact duplicate of a training record, while high will filter out any synthetic record that is 99% similar or more to a training record. auto is equivalent to medium for most datasets, but can fall back to null if the similarity filter prevents the synthetic model from generating the requested number of records. However, if differential privacy is enabled, auto similarity filters will always be equivalent to null.

Outlier Filters: Outlier filters ensure that no synthetic record is an outlier with respect to the training dataset. Outliers revealed in the synthetic dataset can be exploited by Membership Inference Attacks, Attribute Inference, and a wide variety of other adversarial attacks. They are a serious privacy risk. Outlier Filtering is enabled by the privacy_filters.outliers configuration setting. Outlier filters are available for Gretel Tabular GAN.

Allowed values are null, auto, medium, and high. A value of medium will filter out any synthetic record that has a very high likelihood of being an outlier, while high will filter out any synthetic record that has a medium to high likelihood of being an outlier. auto is equivalent to medium for most datasets, but can fall back to null if the outlier filter prevents the synthetic model from generating the requested number of records. However, if differential privacy is enabled, auto outlier filters will always be equivalent to null.
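To make the filtering behavior concrete, here is a toy sketch of how a similarity filter might screen synthetic records. This is illustrative only, not Gretel's implementation; the record format and the mapping of medium/high to thresholds are assumptions for the demo:

```python
def field_similarity(a, b):
    """Fraction of fields on which two records agree exactly."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def similarity_filter(synthetic, training, threshold=1.0):
    """Keep only synthetic records whose maximum similarity to any
    training record is below the threshold.

    threshold=1.0 mimics 'medium' (drop exact duplicates);
    threshold=0.99 mimics 'high' (drop records that are 99%+ similar)."""
    kept = []
    for rec in synthetic:
        max_sim = max(field_similarity(rec, t) for t in training)
        if max_sim < threshold:
            kept.append(rec)
    return kept

training = [("alice", 34, "NY"), ("bob", 51, "CA")]
synthetic = [("alice", 34, "NY"),   # exact duplicate of a training record: dropped
             ("carol", 34, "NY"),   # partially similar: kept
             ("dave", 29, "TX")]    # dissimilar: kept

print(similarity_filter(synthetic, training))  # medium-style filtering
```

An outlier filter works analogously, except the score being thresholded is an outlier likelihood with respect to the training distribution rather than a record-to-record similarity.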

Model Configuration

Synthetic model training and generation are driven by a configuration file. Here is an example configuration with differential privacy enabled for Tabular Fine-Tuning.

schema_version: "1.0"
name: default
task:
  name: tabular_ft
  config:
    train:
      privacy_params:
        dp: true
        epsilon: 8.0
        per_sample_max_grad_norm: 1.0

Here is an example configuration with privacy filters set for Gretel Tabular GAN.

schema_version: "1.0"
name: default
task:
  name: tabular_gan
  config:
    train:
      privacy_filters:
        outliers: medium
        similarity: medium

Understanding your Data Privacy Score

Your Data Privacy Score is calculated by measuring the protection of your data against simulated adversarial attacks.

Values can range from Excellent to Poor, and we provide a list detailing whether your Data Privacy Score is sufficient for a given data-sharing use case.

We provide a summary of the protection level against Membership Inference Attacks and Attribute Inference Attacks.

For each metric, we provide a breakdown of the attack results that contributed to the score.

Membership Inference Protection is a measure of how well-protected your data is from membership inference attacks. A membership inference attack is a type of privacy attack on machine learning models where an adversary aims to determine whether a particular data sample was part of the model's training dataset. By exploiting the differences in the model's responses to data points from its training set versus those it has never seen before, an attacker can attempt to infer membership. This type of attack can have critical privacy implications, as it can reveal whether specific individuals' data was used to train the model. To simulate this attack, we take a 5% holdout from the training data prior to training the model. Based on directly analyzing the synthetic output, a high score indicates that your training data is well-protected from this type of attack. The score is based on 360 simulated attacks, and the percentages indicate how many fell into each protection level.
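The simulated attack described above can be illustrated with a toy distance-based membership inference test. This is a hypothetical sketch, not Gretel's evaluation code; the data, the attacker's nearest-neighbor decision rule, and the threshold are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: "members" were used to train the model, "holdout" was not.
members = rng.normal(0.0, 1.0, size=(200, 4))
holdout = rng.normal(0.0, 1.0, size=(200, 4))

# A private synthetic set resembles the distribution, not individual members.
synthetic = rng.normal(0.0, 1.0, size=(200, 4))

def nearest_distance(records, reference):
    """Distance from each record to its nearest reference record."""
    diffs = records[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

# Attacker's rule: guess "member" when the nearest synthetic record is
# closer than the median nearest-neighbor distance.
all_records = np.vstack([members, holdout])
threshold = np.median(nearest_distance(all_records, synthetic))
guesses_members = nearest_distance(members, synthetic) < threshold
guesses_holdout = nearest_distance(holdout, synthetic) < threshold

# Accuracy near 50% means the attacker cannot distinguish members from
# non-members, i.e. membership is well protected.
accuracy = (guesses_members.mean() + (1 - guesses_holdout.mean())) / 2
print(f"attack accuracy: {accuracy:.2f}")
```

Because the synthetic records here are drawn from the same distribution but do not memorize any member, the attack accuracy hovers near chance, which is what a high Membership Inference Protection score reflects.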

Attribute Inference Protection is a measure of how well-protected your data is from attribute inference attacks. An attribute inference attack is a type of privacy attack on machine learning models where an adversary seeks to infer missing attributes or sensitive information about individuals from their data that was used to train the model. By leveraging the model's output, the attacker can attempt to predict unknown attributes of a data sample. This type of attack poses significant privacy risks, as it can uncover sensitive details about individuals that were not intended to be revealed by the data owners. Based on directly analyzing the synthetic output, an overall high score indicates that your training data is well-protected from this type of attack. For a specific attribute, a high score indicates that even when other attributes are known, that specific attribute is difficult to predict.
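A minimal sketch of the attribute inference idea follows. This is illustrative only, not Gretel's evaluation code; the record layout, attribute names, and the majority-vote predictor are assumptions for the demo:

```python
from collections import Counter

# Hypothetical records: (age_band, zip_prefix) are known attributes,
# the last field is a sensitive attribute the attacker tries to infer.
synthetic = [
    (("30s", "100"), "A"), (("30s", "100"), "A"),
    (("40s", "200"), "B"), (("40s", "200"), "B"),
    (("30s", "200"), "A"),
]
training = [
    (("30s", "100"), "A"), (("40s", "200"), "B"), (("30s", "200"), "B"),
]

def majority_predictor(data):
    """Predict the most common sensitive value seen for the known attributes,
    falling back to the overall majority for unseen combinations."""
    by_key = {}
    for known, sensitive in data:
        by_key.setdefault(known, Counter())[sensitive] += 1
    overall = Counter(s for _, s in data)
    def predict(known):
        counts = by_key.get(known, overall)
        return counts.most_common(1)[0][0]
    return predict

# The attacker fits on synthetic data, then attacks the training records.
predict = majority_predictor(synthetic)
hits = sum(predict(known) == sensitive for known, sensitive in training)
print(f"attribute inference accuracy: {hits}/{len(training)}")
```

The higher the attacker's accuracy on the real training records, the greater the attribute inference risk; a high protection score corresponds to low attack accuracy on each sensitive attribute.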

[Figure: Data Privacy and Privacy Configuration Scores in the Gretel Synthetic Report]
[Figure: Membership Inference Protection graph]
[Figure: Attribute Inference Protection graph]