Data Privacy 101
Overview of concepts and tips on optimizing Privacy and Accuracy with Synthetic Data
Q: What levels of privacy protection does Gretel offer?
A: Gretel offers three levels of privacy protection:
Level 1 (Basic) - Data masking using Transform API
Best for: Initial PII removal, development environments
Use case: Internal analytics with low-sensitivity data
Limitations: Vulnerable to data linkage, membership inference, and re-identification attacks
As a practical example, consider a healthcare record after using Transform to remove PII attributes:
Original: {name: "Jane Smith", age: 34, condition: "diabetes", zip: 90210, height: 5'4"}
Masked: {age: 34, condition: "diabetes", zip: 90210, height: 5'4"}
While direct PII is removed, an attacker with access to a voter database could still identify Jane by matching the combination of age, zip code, and height, which demonstrates why Transform alone isn't sufficient for sensitive data.
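The linkage attack above can be sketched in a few lines of Python. This is a toy illustration with made-up records; the voter roll and field names are hypothetical, not a Gretel API:

```python
# Toy linkage attack: the masked record keeps quasi-identifiers
# (age, zip, height) that uniquely match one row in a public dataset.
masked_record = {"age": 34, "condition": "diabetes", "zip": 90210, "height": "5'4\""}

# Hypothetical auxiliary data an attacker might already hold
voter_roll = [
    {"name": "Jane Smith", "age": 34, "zip": 90210, "height": "5'4\""},
    {"name": "John Doe",   "age": 51, "zip": 90210, "height": "6'0\""},
    {"name": "Ann Lee",    "age": 34, "zip": 10001, "height": "5'4\""},
]

quasi_identifiers = ("age", "zip", "height")
matches = [
    row for row in voter_roll
    if all(row[k] == masked_record[k] for k in quasi_identifiers)
]

if len(matches) == 1:
    # A unique match re-identifies the "anonymous" patient
    print(f"Re-identified: {matches[0]['name']} has {masked_record['condition']}")
```

Only when the quasi-identifier combination is unique in the auxiliary data does the attack succeed, which is exactly why rare combinations (age + zip + height) are riskier than any single field.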
Level 2 - Synthetic data generation
Best for: Development environments, internal testing
Use case: Training non-production models, data exploration
Features: Maintains statistical properties while generating entirely new records, and can intentionally alter distributions when needed
Level 3 - Differential privacy-enabled synthetic data
Best for: Production data, external sharing, regulated industries
Use case: Training production models, sharing data with partners
Features: Mathematical privacy guarantees, protection against inference attacks
Q: What epsilon (ε) and delta (δ) values does Gretel use?
A: Gretel uses:
Epsilon (ε) = Configurable based on privacy use cases and utility requirements: 1.0 for formal guarantees, 8.0 for a balanced setting, or up to 20 for practical protections with reduced formal guarantees
Delta (δ) = Automatically set to 1/n^1.2, where n is the number of examples in the dataset
For comparison:
US Census Bureau uses ε = 17.14 for US person data
Google uses ε = 6.92 and δ = 10^-5 for next-word prediction on Android
Apple's Safari browser uses ε values between 8 and 16
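The stated default for delta can be checked directly; the helper below simply evaluates 1/n^1.2 for a few dataset sizes (illustrative code, not part of the Gretel SDK):

```python
def default_delta(n: int) -> float:
    """Gretel's documented default: delta = 1 / n**1.2,
    where n is the number of training examples."""
    return 1.0 / n ** 1.2

# Larger datasets automatically get a smaller (stronger) delta
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}: delta = {default_delta(n):.2e}")
```

Because the exponent is greater than 1, delta shrinks faster than 1/n, so the per-record leakage bound tightens as the dataset grows.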
Q: How does Gretel enforce differential privacy during training?
A: Gretel uses:
Differentially Private Stochastic Gradient Descent (DP-SGD)
Noise addition during optimization
Gradient clipping to prevent memorization
Fine-tuning of only ~1% of total model weights
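A minimal sketch of the DP-SGD aggregation step described above, using NumPy. This is a simplified illustration of gradient clipping plus Gaussian noise, not Gretel's actual training loop:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One simplified DP-SGD aggregation step:
    clip each per-example gradient to L2 norm <= clip_norm,
    sum, add Gaussian noise scaled to the clip norm, then average."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

# Two per-example gradients with L2 norms 5.0 and 0.5: only the first is clipped
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
update = dp_sgd_step(grads)
```

Clipping caps any single example's influence on the update, which is what makes the added Gaussian noise sufficient to bound memorization.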
Q: What models and infrastructure does Gretel use for differentially private synthetic data?
A: For differentially private synthetic data, Gretel uses:
Open small language models such as Phi or Llama as the base
LoRA for efficient fine-tuning
Training on the Gretel community cloud, or on customer infrastructure for hybrid deployments
Different model architectures for different needs (e.g., Navigator Fine Tuning across all text modalities, Gretel GPT for purely text, or other options for categorical data)
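To see why LoRA touches only a small fraction of weights (the "~1% of total model weights" noted above), the parameter count of a rank-r adapter can be computed directly; the matrix sizes below are illustrative, not Gretel's actual configuration:

```python
def lora_trainable_fraction(d: int, k: int, r: int) -> float:
    """Fraction of parameters trained when a frozen d x k weight matrix W
    is adapted as W + B @ A, with only B (d x r) and A (r x k) trainable."""
    return r * (d + k) / (d * k)

# e.g. a hypothetical 4096 x 4096 projection with rank-8 adapters: ~0.4%
frac = lora_trainable_fraction(4096, 4096, 8)
```

Because r is tiny relative to d and k, the trainable fraction r(d + k)/(d·k) stays well under 1%, which also shrinks the surface DP-SGD has to add noise to.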
Q: How much data is needed for differentially private training?
A: Gretel recommends:
At least a few thousand examples for effective DP implementation
5,000-8,000 records can provide reasonable performance
Smaller datasets may require larger epsilon values (8-10) to balance privacy and utility
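The sizing guidance above could be encoded as a simple helper; the thresholds here are illustrative interpretations of the recommendations, not official Gretel defaults:

```python
def suggested_epsilon(n_examples: int) -> float:
    """Illustrative heuristic mapping dataset size to an epsilon budget,
    following the sizing guidance above (thresholds are examples only)."""
    if n_examples < 3_000:
        # "at least a few thousand examples" recommended for effective DP
        raise ValueError("dataset too small for effective DP training")
    if n_examples < 10_000:
        return 8.0  # smaller datasets: larger epsilon (8-10) to preserve utility
    return 1.0      # larger datasets can target stronger formal guarantees
```

The point is the direction of the trade-off: with fewer records, the noise needed for a small epsilon overwhelms the signal, so either the budget loosens or utility drops.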
Q: When is differential privacy recommended for compliance?
A: Differential privacy is strongly recommended for:
GDPR compliance when processing EU citizen data
HIPAA compliance for healthcare data
CCPA compliance for California consumer data
Any regulated industry where data privacy is paramount
Q: What are the key industry applications?
A: Key industry applications include:
Healthcare & Life Sciences:
Sharing electronic health records (EHR)
Patient diagnosis and symptom analysis
Treatment research without compromising patient privacy
Financial Services:
Fraud detection system development
Customer service chatbot training
Analysis of customer interactions
Customer Support:
Training data for support systems
Analysis of customer feedback
Call center log processing
Q: What types of text data are commonly synthesized?
A: Common text data types include:
Customer feedback transcripts
Call center logs
Internal reports and documents
Customer reviews
Chat logs
Product feedback
Medical records and patient descriptions
Q: How does Gretel measure privacy protection?
A: Gretel provides comprehensive privacy measurements:
PII Replay Detection:
Identifies sensitive information from training data in synthetic output
Measures unique value overlap between original and synthetic data
Provides column-level analysis of PII exposure
Privacy Attack Protection:
Membership inference attack simulation (360 scenarios)
Attribute inference attack simulation
Direct data leakage detection
Privacy scores (optimal range: 60-90)
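A toy version of a membership inference attack, the kind of scenario such reports simulate: the attacker claims a record was in the training set if the synthetic output contains a near-duplicate. All records and thresholds below are hypothetical:

```python
def nearest_distance(record, synthetic_rows):
    """Smallest L1 distance from `record` to any synthetic row."""
    return min(sum(abs(a - b) for a, b in zip(record, row))
               for row in synthetic_rows)

def infer_membership(record, synthetic_rows, threshold=0.5):
    """Naive attack: claim the record was in the training set if a
    synthetic row is suspiciously close to it."""
    return nearest_distance(record, synthetic_rows) <= threshold

# Hypothetical (age, zip) records
synthetic = [(34.0, 90210.0), (51.0, 10001.0)]
assert infer_membership((34.0, 90210.0), synthetic)      # exact copy leaks membership
assert not infer_membership((40.0, 30301.0), synthetic)  # distant record gives no signal
```

Real attack simulations are far more sophisticated, but the intuition is the same: verbatim or near-verbatim replay is the signal an attacker exploits, and DP training bounds how strong that signal can be.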
Q: How should PII replay results be interpreted?
A: Context is crucial when interpreting PII replay:
Expected Replay Rates:
Common fields (first names, states): Some replay is normal and expected
Unique identifiers (email, SSNs): Should see zero replay
Sensitive combinations (full names, age + zip code): Should show significantly reduced replay rates
Example Interpretation:
First names: 30-40% replay may be acceptable (limited name pool)
Full names: <1% replay indicates good privacy protection
Location data: High replay for common fields (states) is expected
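Unique-value overlap, one common way to quantify replay per column, is straightforward to compute; this is an illustrative sketch with made-up columns, not Gretel's exact metric:

```python
def replay_rate(original_values, synthetic_values):
    """Share of unique synthetic values that appear verbatim in the
    original column (one way to measure unique-value overlap)."""
    orig, synth = set(original_values), set(synthetic_values)
    return len(orig & synth) / len(synth) if synth else 0.0

# Hypothetical columns: some first-name replay is expected, email replay is not
first_names_orig  = ["Ana", "Ben", "Carla", "Dev"]
first_names_synth = ["Ana", "Ben", "Eli", "Finn", "Gia"]
emails_orig  = ["ana@example.com", "ben@example.com"]
emails_synth = ["zed@example.com", "yun@example.com"]

print(replay_rate(first_names_orig, first_names_synth))  # common field: some overlap OK
print(replay_rate(emails_orig, emails_synth))            # unique IDs: expect zero
```

The same number means very different things per column: 40% overlap on first names is unremarkable given a limited name pool, while any nonzero overlap on emails or SSNs is a leak.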
Q: What are the best practices for generating private synthetic data?
A: Key recommendations include:
Pre-processing:
Run Transform before synthetic generation for privacy-centric use cases
Remove unnecessary sensitive columns
Model Selection:
Use Navigator Fine Tuning for better privacy
Enable differential privacy for sensitive data
Validation:
Monitor PII replay metrics
Test downstream task performance using Gretel Evaluate API
Q: What accuracy can be expected from differentially private synthetic text?
A: Based on Gretel's testing on the Yelp Restaurant Reviews dataset:
Achieves downstream accuracy within 1% of non-private models on scaled datasets (100k+ examples), and within 10% of the accuracy for smaller datasets (10k+ examples)
Synthetic Quality Score (SQS) of 86 out of 100
Text semantics similarity score of 94/100 compared to real-world data
Processing 1M reviews (632 MB) takes approximately 40 hours of fine-tuning time on a single A10G in Gretel Cloud, or about 12 hours on a single A100.
Q: How should organizations balance privacy and utility?
A: Organizations should:
Target privacy and utility scores in the 60-95+ range
Adjust epsilon values based on sensitivity requirements
Consider downstream use cases when setting privacy parameters
Use privacy reports to verify protection levels
Test synthetic data in actual applications to ensure utility
Collaborate on privacy and evaluation settings with your compliance and InfoSec teams