Privacy Protection
Fine-tune Gretel's privacy protection filters to prevent adversarial attacks and better meet your data sharing needs.
In addition to the privacy inherent in the use of synthetic data, we can add supplemental protection by means of Gretel's privacy filters. These configuration file settings help to ensure that the generated data is safe from adversarial attacks.

Primary Protection Filters

There are four privacy protection mechanisms:
Overfitting Prevention: This mechanism ensures that the synthetic model will stop training before it has a chance to overfit. When a model is overfit, it will start to memorize the training data as opposed to learning generalized patterns in the data. This is a severe privacy risk as overfit models are commonly exploited by adversaries seeking to gain insights into the original data. Overfitting prevention is enabled using the validation_split and early_stopping configuration settings.
Both of these settings are booleans. Setting validation_split to True will automatically set aside a random 20% of the training data as validation data to prevent overfitting.
We recommend keeping validation_split enabled, except in the following scenarios:
  1. Time-series data: information leakage could occur, making the validation set less useful.
  2. Anomaly detection use cases: you may wish to ensure that the model is trained on all positive samples.
  3. Small datasets: if your training dataset is very small, you may need to ensure all samples are present for model training.
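The two overfitting controls can be illustrated with a minimal sketch. It shows a random 20% hold-out and a simple "stop when validation loss stops improving" rule. The function names and the patience parameter are hypothetical and chosen for illustration only; the source does not describe the exact stopping criterion Gretel uses.

```python
import random


def split_validation(records, holdout=0.2, seed=0):
    """Randomly set aside a fraction of the records as validation data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * holdout)
    # The first n_val shuffled records become validation data, the rest training data.
    return shuffled[n_val:], shuffled[:n_val]


def stop_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once validation loss has failed to improve for
    `patience` consecutive epochs (an assumed, illustrative rule).
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1


train, val = split_validation(range(100))
print(len(train), len(val))  # 80 20
print(stop_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 1.1]))  # 4
```

A rising validation loss while training loss keeps falling is the usual sign that a model has started to memorize its training data, which is why the hold-out set is needed for this check.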
Similarity Filters: Similarity filters ensure that no synthetic record is overly similar to a training record. Overly similar synthetic records can be a severe privacy risk, as adversarial attacks commonly exploit such records to gain insights into the original data. Similarity filtering is enabled by the privacy_filters.similarity configuration setting.
Allowed values are null, medium, and high. A value of similarity: medium will filter out any synthetic record that is an exact duplicate of a training record. A value of similarity: high will filter out any synthetic record that is 99% similar or more to a training record.
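As a rough sketch of how the two similarity levels differ, the example below drops synthetic records whose similarity to any training record reaches a threshold: 1.0 for medium (exact duplicates) and 0.99 for high. The similarity measure used here, the fraction of matching fields, is an assumption for illustration; the source does not say how Gretel measures similarity.

```python
# Thresholds taken from the text: medium = exact duplicate, high = 99% similar or more.
THRESHOLDS = {"medium": 1.0, "high": 0.99}


def field_similarity(a, b):
    """Fraction of fields on which two equal-length records agree (assumed metric)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_filter(synthetic, training, level):
    """Keep only synthetic records that stay below the threshold for every training record."""
    if level is None:  # filter disabled
        return list(synthetic)
    threshold = THRESHOLDS[level]
    return [
        s for s in synthetic
        if all(field_similarity(s, t) < threshold for t in training)
    ]


original = tuple(range(200))
duplicate = tuple(range(200))            # identical to the training record
near_copy = (-1,) + tuple(range(1, 200))  # differs in 1 of 200 fields (99.5% similar)
distinct = tuple([-1] * 200)             # shares no field values

synthetic = [duplicate, near_copy, distinct]
kept_medium = similarity_filter(synthetic, [original], "medium")
kept_high = similarity_filter(synthetic, [original], "high")
```

With these inputs, medium removes only the exact duplicate, while high also removes the near copy.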
Outlier Filters: Outlier filters ensure that no synthetic record is an outlier with respect to the training dataset. Outliers revealed in the synthetic dataset can be exploited by Membership Inference Attacks, Attribute Inference, and a wide variety of other adversarial attacks. They are a serious privacy risk. Outlier Filtering is enabled by the privacy_filters.outliers configuration setting.
Allowed values are null, medium, and high. A value of outliers: medium will filter out any synthetic record that has a very high likelihood of being an outlier. A value of outliers: high will filter out any synthetic record that has a medium to high likelihood of being an outlier.
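The difference between the two outlier levels can be sketched with a simple score and two cutoffs. Everything specific here is an assumption: the z-score on a single numeric column stands in for whatever outlier likelihood Gretel computes, and the cutoff values 4.0 and 2.5 are hypothetical. The only point carried over from the text is that high filters more aggressively than medium.

```python
import statistics

# Hypothetical cutoffs: "medium" removes only extreme records,
# "high" also removes moderately unusual ones.
CUTOFFS = {"medium": 4.0, "high": 2.5}


def outlier_filter(synthetic_values, training_values, level):
    """Drop synthetic values whose z-score against the training data reaches the cutoff."""
    if level is None:  # filter disabled
        return list(synthetic_values)
    mean = statistics.mean(training_values)
    std = statistics.pstdev(training_values)
    cutoff = CUTOFFS[level]
    return [v for v in synthetic_values if abs(v - mean) / std < cutoff]


training = [8, 9, 10, 11, 12]
synthetic = [10, 14, 20]

kept_medium = outlier_filter(synthetic, training, "medium")
kept_high = outlier_filter(synthetic, training, "high")
print(kept_medium)  # [10, 14]
print(kept_high)    # [10]
```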
Differential Privacy: We provide an experimental implementation of DP-SGD that modifies the optimizer to offer provable guarantees of privacy, enabling safe training on private data. Differential Privacy can cause a hit to utility, often requiring larger datasets to work well, but it uniquely provides privacy guarantees against both known and unknown attacks on data. Differential Privacy can be enabled by setting dp: True and can be modified using the associated configuration settings: dp_noise_multiplier, dp_l2_norm_clip and dp_microbatches. These settings can be used to adjust the privacy vs accuracy balance of the synthetic dataset.
If Differential Privacy is disabled, the dp_noise_multiplier, dp_l2_norm_clip, and dp_microbatches values will be ignored.
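To make the role of the three DP settings more concrete, here is a minimal sketch of one gradient aggregation step in the general DP-SGD recipe: gradients are averaged per microbatch, each microbatch gradient is clipped to an L2 norm bound, the clipped gradients are summed, and Gaussian noise scaled by the noise multiplier times the clip bound is added. The function names are hypothetical, and how Gretel's implementation maps its settings onto these steps is an assumption.

```python
import math
import random


def clip_l2(vector, l2_norm_clip):
    """Scale a vector down so that its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(v * v for v in vector))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [v * scale for v in vector]


def dp_average_gradient(per_example_grads, l2_norm_clip, noise_multiplier,
                        microbatches, rng=None):
    """Clip per-microbatch gradients, sum them, add Gaussian noise, and average.

    Assumes the batch size is divisible by the number of microbatches.
    """
    rng = rng or random.Random()
    size = len(per_example_grads) // microbatches
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for m in range(microbatches):
        group = per_example_grads[m * size:(m + 1) * size]
        mean = [sum(g[i] for g in group) / len(group) for i in range(dim)]
        clipped = clip_l2(mean, l2_norm_clip)
        total = [t + c for t, c in zip(total, clipped)]
    # Noise standard deviation grows with both the multiplier and the clip bound.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [t + rng.gauss(0.0, stddev) for t in total]
    return [v / microbatches for v in noisy]


grads = [[3.0, 4.0], [0.0, 0.0]]
# With the noise multiplier set to 0 the result is deterministic,
# which isolates the effect of clipping.
no_noise = dp_average_gradient(grads, l2_norm_clip=1.0,
                               noise_multiplier=0.0, microbatches=2)
```

In this sketch, a larger noise multiplier or a smaller clip bound strengthens privacy at the cost of a noisier update, which is the privacy versus accuracy trade-off the settings control.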

Model Configuration

Synthetic model training and generation are driven by a configuration file. Here is an example configuration with commonly used privacy settings.
schema_version: "1.0"

models:
  - synthetics:
      data_source: __tmp__
      params:
        early_stopping: True
        validation_split: True
        dp: False
        dp_noise_multiplier: 0.001
        dp_l2_norm_clip: 5.0
        dp_microbatches: 1
      privacy_filters:
        outliers: medium
        similarity: medium
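The allowed values described in this section can be checked before a configuration is submitted. The sketch below is a hypothetical helper, not part of any Gretel library: it takes the training settings and the privacy filter settings as plain dictionaries (how they are grouped in the configuration file is assumed) and reports values that fall outside what the text allows.

```python
ALLOWED_FILTER_LEVELS = (None, "medium", "high")
BOOLEAN_SETTINGS = ("early_stopping", "validation_split", "dp")
DP_SETTINGS = ("dp_noise_multiplier", "dp_l2_norm_clip", "dp_microbatches")


def check_privacy_settings(params, privacy_filters):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name in BOOLEAN_SETTINGS:
        if name in params and not isinstance(params[name], bool):
            problems.append(f"{name} must be a boolean")
    for name in ("outliers", "similarity"):
        if privacy_filters.get(name) not in ALLOWED_FILTER_LEVELS:
            problems.append(f"{name} must be null, medium, or high")
    return problems


def ignored_dp_settings(params):
    """List the DP settings that have no effect because dp is disabled."""
    if params.get("dp", False):
        return []
    return [name for name in DP_SETTINGS if name in params]


params = {
    "early_stopping": True,
    "validation_split": True,
    "dp": False,
    "dp_noise_multiplier": 0.001,
    "dp_l2_norm_clip": 5.0,
    "dp_microbatches": 1,
}
privacy_filters = {"outliers": "medium", "similarity": "medium"}

problems = check_privacy_settings(params, privacy_filters)
ignored = ignored_dp_settings(params)
```

For the example values above, no problems are reported, and all three dp_* settings are listed as ignored because dp is False.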

Understanding Privacy Protection Levels

The Privacy Protection Level (PPL) is calculated based on the enabled privacy mechanisms and displayed in the Gretel Performance Report. The top of the report displays a gauge showing the score for the generated synthetic data.
Privacy Protection Level in the Gretel Synthetic Report
Values can range from Excellent to Poor, and we provide a matrix with the recommended Privacy Protection Levels for a given data sharing use case.
Data sharing use case
Very Good
Internally, within the same team
Internally, across different teams
Externally, with trusted partners
Externally, public availability
We also provide a summary of available and enabled privacy protections.
Privacy Settings At A Glance