Data Privacy 101

Overview of concepts and tips on optimizing Privacy and Accuracy with Synthetic Data


Last updated 5 months ago


Basic Concepts & Approaches

Q: What levels of privacy protection does Gretel offer and when should each be used?

A: Gretel offers three levels of privacy protection:

  1. Level 1 (Basic) - Data masking using Transform API

  • Best for: Initial PII removal, development environments

  • Use case: Internal analytics with low-sensitivity data

  • Limitations: Vulnerable to data linkage, membership inference, and re-identification attacks

  • Practical example: consider a healthcare record after using Transform to remove PII attributes:

    Original: {name: "Jane Smith", age: 34, condition: "diabetes", zip: 90210, height: 5'4"}
    Masked: {age: 34, condition: "diabetes", zip: 90210, height: 5'4"}

    While direct PII is removed, an attacker with access to a voter database could still identify Jane by matching the combination of age, zip code, and height. This is why Transform alone isn't sufficient for sensitive data.

  • See: Gretel Transform

  2. Level 2 - Synthetic data generation

  • Best for: Development environments, internal testing

  • Use case: Training non-production models, data exploration

  • Features: Generates new records that maintain the statistical properties of the source data, with the option to alter distributions

  • See: Gretel Navigator Fine-Tuning

  3. Level 3 - Differential privacy-enabled synthetic data

  • Best for: Production data, external sharing, regulated industries

  • Use case: Training production models, sharing data with partners

  • Features: Mathematical privacy guarantees, protection against inference attacks
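The linkage risk described in the Level 1 example can be sketched in a few lines of Python. This is an illustration only: the records, the auxiliary "registry", and the `link_records` helper are all hypothetical and are not part of any Gretel API. It shows how an exact match on quasi-identifiers (age, zip, height) can re-attach a name to a masked record.

```python
# Hypothetical records for illustration; none of this data is real.
QUASI_IDENTIFIERS = ("age", "zip", "height")

# Records after direct PII (the name) has been removed.
masked_records = [
    {"age": 34, "condition": "diabetes", "zip": 90210, "height": "5'4\""},
    {"age": 51, "condition": "asthma", "zip": 10001, "height": "5'9\""},
]

# Auxiliary data an attacker might hold (for example, a public registry).
auxiliary_records = [
    {"name": "Jane Smith", "age": 34, "zip": 90210, "height": "5'4\""},
    {"name": "John Doe", "age": 47, "zip": 60614, "height": "6'0\""},
]


def link_records(masked, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return (name, masked_record) pairs where the quasi-identifiers
    match exactly one person in the auxiliary data."""
    index = {}
    for person in auxiliary:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for record in masked:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record))
    return matches


reidentified = link_records(masked_records, auxiliary_records)
for name, record in reidentified:
    print(f"{name} -> {record['condition']}")  # prints: Jane Smith -> diabetes
```

Removing or generalizing quasi-identifiers reduces this risk, but it is hard to anticipate every auxiliary dataset, which is the motivation for Levels 2 and 3.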

Q: What privacy parameters does Gretel use and how do they compare to other organizations?

A: Gretel uses:

  • Delta (δ): Set automatically to 1/n^1.2, where n is the number of examples in the dataset

For comparison:

  • US Census Bureau uses ε = 17.14 for US person data

  • Google uses ε = 6.92 and δ = 10^-5 for next-word prediction on Android

  • Apple's Safari browser uses ε values between 8 and 16
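To make the delta formula above concrete, here is a small sketch that evaluates 1/n^1.2 for a few dataset sizes. The function name `default_delta` is hypothetical; it only restates the formula and is not a Gretel SDK call.

```python
def default_delta(num_examples: int) -> float:
    """Delta as described above: 1 / n^1.2 for a dataset of n examples."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return 1.0 / (num_examples ** 1.2)


for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}  delta={default_delta(n):.2e}")
```

Because the exponent is greater than 1, delta shrinks faster than 1/n as the dataset grows, which is in line with the common guidance that δ should be smaller than the inverse of the dataset size.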

Technical Implementation

Q: What is the technical implementation behind Gretel's differential privacy?

A: Gretel uses:

  • Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm

  • Calibrated noise added to gradients during optimization

  • Per-example gradient clipping to bound each record's influence and limit memorization

  • Fine-tuning of only ~1% of total model weights
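The clipping and noise steps listed above are the core of DP-SGD. The following is a minimal, framework-free sketch of one noisy gradient computation, assuming per-example gradients are already available as plain lists. The function names, clip norm, and noise multiplier are illustrative assumptions and do not reflect Gretel's actual training code or settings.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale a gradient vector down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Average of clipped per-example gradients plus Gaussian noise."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, clip_norm)):
            total[i] += g
    # Noise is scaled to the clip norm, i.e. the most any one example can contribute.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]  # hypothetical per-example gradients
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
private = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(noiseless)
print(private)
```

With the noise multiplier set to zero, the result is just the mean of the clipped gradients, which makes the effect of clipping easy to inspect: the first gradient (norm 5) is scaled down to norm 1, while the second (norm 0.5) is left unchanged.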

Q: What models does Gretel use for synthetic data generation?

A: For differentially private synthetic data, Gretel uses:

  • Open small language models, such as Phi or Llama, as the base

  • LoRA for efficient fine-tuning

  • Training on Gretel community cloud, or on customer infrastructure for hybrid deployments
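As a rough illustration of why LoRA trains only a small share of the weights, the sketch below compares the size of a rank-r adapter with the weight matrix it adapts. The layer dimensions and rank are hypothetical values chosen for the example, not Gretel's configuration, and `lora_trainable_fraction` is not a library function.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Ratio of LoRA adapter parameters to the frozen weight's parameters.

    LoRA keeps the original d_out x d_in weight frozen and trains two
    low-rank factors, B (d_out x rank) and A (rank x d_in).
    """
    frozen = d_out * d_in
    adapter = rank * (d_out + d_in)
    return adapter / frozen


# Hypothetical layer size and rank, chosen only for illustration.
fraction = lora_trainable_fraction(d_out=4096, d_in=4096, rank=16)
print(f"{fraction:.2%} of this layer's weight count is trainable")  # prints 0.78%
```

Fewer trainable parameters also means fewer gradients to clip and perturb during differentially private training.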

Q: What are the minimum data requirements for using differential privacy?

A: Gretel recommends:

  • At least a few thousand examples for effective DP implementation

  • 5,000-8,000 records can provide reasonable performance

  • Smaller datasets may require larger epsilon values (8-10) to balance privacy and utility

Regulatory Compliance & Industry Applications

Q: What are the regulatory considerations for using differential privacy?

A: Differential privacy is strongly recommended for:

  • GDPR compliance when processing EU citizen data

  • HIPAA compliance for healthcare data

  • CCPA compliance for California consumer data

  • Any regulated industry where data privacy is paramount

Q: What industries benefit most from differential privacy?

A: Key industry applications include:

Healthcare & Life Sciences:

  • Sharing electronic health records (EHR)

  • Patient diagnosis and symptom analysis

  • Treatment research without compromising patient privacy

Financial Services:

  • Fraud detection system development

  • Customer service chatbot training

  • Analysis of customer interactions

Customer Support:

  • Training data for support systems

  • Analysis of customer feedback

  • Call center log processing

Q: What types of text data can benefit from differential privacy?

A: Common text data types include:

  • Customer feedback transcripts

  • Call center logs

  • Internal reports and documents

  • Customer reviews

  • Chat logs

  • Product feedback

  • Medical records and patient descriptions

Privacy Measurements & Protection

Q: How does Gretel measure and verify privacy protection?

A: Gretel provides comprehensive privacy measurements:

PII Replay Detection:

  • Identifies sensitive information from training data in synthetic output

  • Measures unique value overlap between original and synthetic data

  • Provides column-level analysis of PII exposure
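A simple way to picture the unique-value overlap described above is to compute, per column, the share of distinct training values that reappear verbatim in the synthetic output. This is a simplified sketch under that assumption; the `replay_rate` helper and the sample columns are hypothetical, and Gretel's report may define and compute the metric differently.

```python
def replay_rate(training_values, synthetic_values):
    """Share of unique training values that reappear verbatim in the synthetic data."""
    train_unique = set(training_values)
    if not train_unique:
        return 0.0
    replayed = train_unique & set(synthetic_values)
    return len(replayed) / len(train_unique)


# Hypothetical columns for illustration.
training = {
    "first_name": ["Ana", "Ben", "Cara", "Ben"],
    "email": ["ana@example.com", "ben@example.com", "cara@example.com", "ben2@example.com"],
}
synthetic = {
    "first_name": ["Ben", "Dev", "Ana", "Eli"],
    "email": ["x1@example.com", "x2@example.com", "x3@example.com", "x4@example.com"],
}

report = {col: replay_rate(training[col], synthetic[col]) for col in training}
for col, rate in report.items():
    print(f"{col}: {rate:.0%} of unique training values replayed")
```

In this toy data, two of the three distinct first names reappear while no email does, which matches the pattern one would hope to see: some overlap in common fields and none in unique identifiers.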

Privacy Attack Protection:

  • Membership inference attack simulation (360 scenarios)

  • Attribute inference attack simulation

  • Direct data leakage detection

  • Privacy scores (optimal range: 60-90)

Q: How should organizations interpret PII replay metrics?

A: Context is crucial when interpreting PII replay:

Expected Replay Rates:

  • Common fields (first names, states): Some replay is normal and expected

  • Unique identifiers (emails, SSNs): Should see zero replay

  • Sensitive combinations (full names, age + zip code): Should show significantly reduced replay rates

Example Interpretation:

  • First names: 30-40% replay may be acceptable (limited name pool)

  • Full names: <1% replay indicates good privacy protection

  • Location data: High replay for common fields (states) is expected

Best Practices & Implementation

Q: What are the best practices for minimizing privacy risks while maintaining data utility?

A: Key recommendations include:

  1. Pre-processing:

  • Run Transform before synthetic generation for privacy-centric use cases

  • Remove unnecessary sensitive columns

  2. Model Selection:

  • Use Navigator Fine-Tuning for better privacy

  • Enable differential privacy for sensitive data

  3. Validation:

  • Monitor PII replay metrics

  • Test downstream task performance using Gretel Evaluate API

Q: What performance can be expected from differentially private synthetic data?

A: Based on Gretel's testing on the Yelp Restaurant Reviews dataset:

  • Achieves downstream accuracy within 1% of non-private models on larger datasets (100k+ examples), and within 10% on smaller datasets (10k+ examples)

  • Synthetic Quality Score (SQS) of 86 out of 100

  • Text semantic similarity score of 94/100 compared to real-world data

  • Processing 1M reviews (632 MB) takes approximately 40 hours of fine-tuning on a single A10G in Gretel Cloud, or 12 hours on a single A100

Q: How should organizations balance privacy and utility?

A: Organizations should:

  • Target privacy and utility scores in the range of 60 to 95+

  • Adjust epsilon values based on sensitivity requirements

  • Consider downstream use cases when setting privacy parameters

  • Use privacy reports to verify protection levels

  • Test synthetic data in actual applications to ensure utility

  • Collaborate on privacy and evaluation settings with your compliance and InfoSec teams

For Level 3 (differential privacy-enabled synthetic data), see Gretel Navigator Fine-Tuning or Gretel GPT with differential privacy.

Epsilon (ε): Configurable based on privacy use cases and utility requirements. We recommend 1.0 (for formal guarantees) or 8.0 (balanced); values up to 20 offer practical protection with reduced formal guarantees.

Model architectures: Different model architectures are supported for different needs (e.g., Navigator Fine-Tuning across all text modalities, Gretel GPT for purely text, or Gretel Tabular DP for categorical data).