Synthetic Data Quality
Assess the accuracy and privacy of your synthetic data.
The Data Quality Report gives both a summary and detailed overview of the quality of generated synthetic data. It takes as input both the original training data as well as the new synthetic data and assesses how well the statistical integrity of the training data was maintained and the privacy level of the synthetic data.
At the very top of the report, a Synthetic Quality Score is shown, which represents the quality of the generated synthetic data. In the example above, the score of 80 is excellent.
If you click the question mark on the right, you will see the following description of the Synthetic Quality Score:
The quality score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, the Synthetic Quality Score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. If you do not require statistical symmetry, as might be the case in a testing or demo environment, a lower score may be just as acceptable.
Additionally, you will see a table, shown below, detailing common synthetic data use cases, and the SQS level recommended for each.
For any score level other than Excellent, you can try to improve your model with our Tips to Improve Synthetic Data Quality. If your score is Very Poor, there is some inherent problem with your data that makes it unsuitable for synthetics. While this is rare in our experience, significant tuning may still be able to put you back in the ballpark.
Your Data Privacy Score analyzes the synthetic output to measure how well protected your original data is from adversarial attacks. It combines results from two common attacks: Membership Inference and Attribute Inference.
Membership Inference Protection measures how well you are protected from an adversary attempting to determine if specific data points were part of the training set. Attribute Inference Protection measures how well you are protected from an adversary trying to predict sensitive attributes of the data used in training, given other attributes.
Your Privacy Configuration Score is determined by the privacy mechanisms you've enabled in the synthetic configuration. Synthetic data is inherently more private than real-world data.
The table below details the recommended data-sharing use cases based on the Data Privacy and Privacy Configuration Scores.
For use cases that require especially high levels of privacy, we recommend applying the following techniques or filters to try to increase the score:
- Use the Outlier Filter to ensure that no synthetic record is an outlier with respect to the training space. You can enable this filter by setting `privacy_filters.outliers` to `medium` or `high` (see the configuration sketch after this list).
- Use the Similarity Filter to ensure that no synthetic record is overly similar to a training record. You can enable this filter by setting `privacy_filters.similarity` to `medium` or `high`.
- Set `privacy_filters.outliers` to `auto`, which will try `medium` and fall back to turning the filter off if it prevents the synthetic model from generating the requested number of records.
- Underfit the model to generate output that is less similar to the input. In all model types, you can reduce `epochs` to underfit or prevent overfitting. In LSTM, you can also set `validation_split: True` and `early_stopping: True` in the configuration.
- Apply Differential Privacy, or reduce `epsilon` if Differential Privacy is already applied.
- Increase your training dataset size to reduce the influence of individual data points on the overall model.
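For reference, here is a minimal sketch of how these settings might appear in a synthetics model configuration. The overall layout (schema version, model block, data source) is representative rather than authoritative; the `privacy_filters` and training parameters are the ones described above.

```yaml
# Sketch of a synthetics configuration with privacy protections enabled.
# The surrounding structure is illustrative; consult the configuration
# reference for your model type for the exact layout.
schema_version: "1.0"
models:
  - synthetics:
      data_source: __tmp__
      params:
        epochs: 50              # fewer epochs underfits to reduce memorization
        validation_split: True  # LSTM only: hold out data during training
        early_stopping: True    # LSTM only: stop before overfitting
      privacy_filters:
        outliers: auto          # try "medium"; fall back to off if generation stalls
        similarity: medium      # block records too similar to training rows
```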
You can learn more about the Data Privacy and Privacy Configuration Scores on our Privacy Protection page.
The Synthetic Quality Score is computed by taking a weighted combination of three individual quality metrics (described in more detail below): Field Distribution Stability, Field Correlation Stability, and Deep Structure Stability. Towards the top of the report, summary statistics are given showing how well the synthetic data scored within each metric. The summary statistics also include row and column counts for the training and synthetic data, as well as whether any training lines were duplicated.
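Conceptually, the composite score is a weighted average of the three metric scores. The equal weights below are purely hypothetical placeholders for illustration; the report applies its own internal weighting.

```python
# Hypothetical illustration of a weighted composite score. Equal
# weights are assumed here for demonstration only; the report's
# actual weights are internal.
weights = {"field_distribution": 1 / 3,
           "field_correlation": 1 / 3,
           "deep_structure": 1 / 3}
scores = {"field_distribution": 85,
          "field_correlation": 78,
          "deep_structure": 77}

sqs = sum(weights[m] * scores[m] for m in scores)
print(f"Synthetic Quality Score: {sqs:.0f}")  # -> 80
```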
The row count is the number of records or lines in the training (or synthetic) dataset. The column count is the number of fields in the dataset. The number of training rows used can directly impact the quality of the synthetic data created: the more examples available when training a model, the easier it is for the model to accurately learn the distributions and correlations in the data. Always strive for a minimum of 3000 training examples; 5000 or even 50,000 is better still.
The more synthetic rows generated, the easier it is to deduce whether the statistical integrity of the data remains intact. If your Synthetic Data Quality Score isn't as high as you'd like it to be, make sure you’ve generated at least 5000 synthetic data records.
The Training Lines Duplicated value is an important way of ensuring the privacy of the generated synthetic data. In almost all situations, this value should be 0. The only exception would be if the training data itself contained a multitude of duplicate rows. If this is the situation, simply remove the duplicate rows before training.
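For example, with pandas you can drop exact duplicates before training (the file names below are hypothetical):

```python
import pandas as pd

# Drop exact duplicate rows before training so that duplicates
# cannot surface in the synthetic output.
df = pd.read_csv("training_data.csv")  # hypothetical input file
deduped = df.drop_duplicates()
print(f"Removed {len(df) - len(deduped)} duplicate rows")
deduped.to_csv("training_data_deduped.csv", index=False)
```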
The Privacy Configuration Score is based on how you configured your privacy settings. The report provides a summary of the privacy protections, indicating the setting you selected for each in the report. The Protections include Outlier Filtering, Similarity Filtering, Overfitting Prevention, and Differential Privacy.
The Data Privacy Score is computed by averaging the scores from Membership Inference Protection and Attribute Inference Protection. Towards the top of the report, summary statistics are given showing how well the synthetic data scored within each metric.
At the end of the report, you can see the breakdown of how the Membership Inference Protection and Attribute Inference Protection metrics are calculated.
Personally Identifiable Information (PII) Replay is only supported in Evaluate reports because it requires comparing original data to the final synthetic output, not intermediate results. Thus, you will not see it in an SQS report that is automatically generated after a model run.
PII Replay counts the number of total and unique instances of personally identifiable information (PII) from your reference data that show up in your synthetic output. Lower counts under "Synthetic Data PII Replay" indicate higher privacy.
Note that some PII replay is often expected (and even desirable). In general, you can expect entities that are rarer (and therefore have many possible values), like full address or full name, to have lower amounts of PII replay than entities that are more common (and therefore have fewer possible values), like first name or US state. To reduce the amount of PII replay, run Gretel Transform prior to synthetics.
Results appear in a table, with one row per dataset column analyzed for PII Replay. For each dataset column, the table shows total instances of PII in the training data, how many were unique, and how many were replayed in the synthetic output—making it easy to spot privacy risks at a glance.
Let’s look at the example image to understand how to interpret the results. You can see from the header that the reference data had 2850 rows while the output data for which we are measuring PII Replay had 1000 rows.
In the first row, we can see the dataset column named FirstName was found to be of PII Type first_name. In the reference data, all 2850 rows were labeled with this entity type. Of those 2850 values, 1563 were unique.
Comparing to the output data, we see that 786 of the 1000 output rows had a value that matched one of the 1563 unique values from the original. Of those 1563 unique values, 170 were replayed (11%).
The rest of the rows can be read in the same way.
Following the Privacy Protection Summary section, you’ll see an overview of all the fields in the training data (example below). The high-level Field Distribution Stability score is computed by taking the average of the individual Field Distribution Stability scores, shown in the rightmost column below. To better understand a field's Distribution Stability score, click on the field name to be taken to a graph comparing the training and synthetic distributions.
The training field overview table also shows the count of unique and missing field values, the average length of each field, as well as its datatype. When a dataset contains a large number of highly unique fields or a large amount of missing data, these characteristics can impede the model's ability to accurately learn the statistical structure of the data. Exceptionally long fields can also have the same impact. Read Tips to Improve Synthetic Data Quality for advice on how best to handle fields like these.
To measure Field Correlation Stability, the correlation between every pair of fields is computed first in the training data, and then in the synthetic data. The absolute difference between these values is then computed and averaged across all fields. The lower this average value is, the higher the Field Correlation Stability quality score will be.
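As a rough illustration (not the report's exact implementation), the metric can be sketched with pandas by computing a correlation matrix for each dataset and averaging the absolute element-wise differences:

```python
import pandas as pd

def correlation_instability(train: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Sketch of the comparison described above: average absolute
    difference between the pairwise correlation matrices of the
    training and synthetic data. Numeric columns only here; the
    actual report also handles categorical fields."""
    train_corr = train.corr(numeric_only=True)
    synth_corr = synth.corr(numeric_only=True)
    diff = (train_corr - synth_corr).abs()
    return diff.mean().mean()  # lower means higher stability
```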
To aid in the comparison of field correlations, the report shows heatmaps for both the training data and the synthetic data, as well as for the computed difference of correlation values. To view the details of what each square in a heatmap refers to, simply hover over the square with your cursor. The hover text will show you the two fields involved, as well as the correlation in the training data, the correlation in the synthetic data, and the difference between the two.
To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a Principal Component Analysis (PCA) computed first on the original data, then again on the synthetic data. The idea behind PCA is to capture, in just a few features, the essential shape of all the features. These new features are what is referred to as the Principal Components.
Gretel computes a synthetic quality score by comparing the distributional distance between the principal components in the original data and those in the synthetic data. The closer the principal components are, the higher the synthetic quality score will be. As PCA is a very common approach used in machine learning for both dimensionality reduction and visualization, this metric gives immediate feedback as to the utility of the synthetic data for machine learning purposes.
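A conceptual sketch of this comparison with scikit-learn follows. This is not the report's exact method: projecting onto the training data's principal components and using Jensen-Shannon distance per component are assumptions made for illustration, and numeric features are assumed throughout.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import jensenshannon

def deep_structure_distance(train: np.ndarray, synth: np.ndarray,
                            n_components: int = 3, bins: int = 20) -> float:
    """Sketch: project both (numeric) datasets onto the training
    data's principal components, then compare each component's
    distribution. A lower distance implies a higher quality score."""
    pca = PCA(n_components=n_components).fit(train)
    train_pc = pca.transform(train)
    synth_pc = pca.transform(synth)

    distances = []
    for i in range(n_components):
        # Histogram both projections on shared bin edges.
        lo = min(train_pc[:, i].min(), synth_pc[:, i].min())
        hi = max(train_pc[:, i].max(), synth_pc[:, i].max())
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(train_pc[:, i], bins=edges)
        q, _ = np.histogram(synth_pc[:, i], bins=edges)
        distances.append(jensenshannon(p, q))
    return float(np.mean(distances))
```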
Field Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those in the original data. For each field we use a common approach for comparing two distributions referred to as the Jensen-Shannon Distance. The lower the JS Distance score is on average across all fields, the higher the Field Distribution Stability quality score will be.
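As a sketch of the per-field comparison (shown here for a categorical field, using value frequencies aligned on the union of observed values):

```python
import pandas as pd
from scipy.spatial.distance import jensenshannon

def field_js_distance(train_col: pd.Series, synth_col: pd.Series) -> float:
    """Sketch: Jensen-Shannon distance between the value-frequency
    distributions of a single field. Lower distances across fields
    mean a higher Field Distribution Stability score."""
    train_freq = train_col.value_counts(normalize=True)
    synth_freq = synth_col.value_counts(normalize=True)
    values = train_freq.index.union(synth_freq.index)
    p = train_freq.reindex(values, fill_value=0.0)
    q = synth_freq.reindex(values, fill_value=0.0)
    return float(jensenshannon(p, q))
```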
To aid in the comparison of original versus synthetic field distributions, the report shows a bar chart or histogram for each field. To view the details of what each bar represents, simply hover over the bar with your cursor.
Membership Inference Protection is a measure of how well-protected your data is from membership inference attacks. A membership inference attack is a type of privacy attack on machine learning models where an adversary aims to determine whether a particular data sample was part of the model's training dataset. By exploiting the differences in the model's responses to data points from its training set versus those it has never seen before, an attacker can attempt to infer membership. This type of attack can have critical privacy implications, as it can reveal whether specific individuals' data was used to train the model. To simulate this attack, we take a 5% holdout from the training data prior to training the model. Based on directly analyzing the synthetic output, a high score indicates that your training data is well-protected from this type of attack. The score is based on 360 simulated attacks, and the percentages indicate how many fell into each protection level.
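To make the idea concrete, here is a conceptual, distance-based sketch of such an attack. It is not Gretel's implementation: the nearest-neighbor attack strategy and the median threshold are assumptions for illustration, and numeric features are assumed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def membership_attack_accuracy(train, holdout, synth):
    """Sketch of a distance-based membership inference attack: the
    attacker guesses that records lying unusually close to some
    synthetic record were training members. Accuracy near 0.5 (no
    better than chance) indicates strong protection."""
    nn = NearestNeighbors(n_neighbors=1).fit(synth)
    train_dist, _ = nn.kneighbors(train)      # members the attacker probes
    holdout_dist, _ = nn.kneighbors(holdout)  # non-members (the 5% holdout)

    # Attack rule: guess "member" if closer to synthetic data than the median.
    threshold = np.median(np.vstack([train_dist, holdout_dist]))
    correct = (train_dist.ravel() <= threshold).sum() \
            + (holdout_dist.ravel() > threshold).sum()
    return correct / (len(train) + len(holdout))
```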
Attribute Inference Protection is a measure of how well-protected your data is from attribute inference attacks. An attribute inference attack is a type of privacy attack on machine learning models where an adversary seeks to infer missing attributes or sensitive information about individuals from their data that was used to train the model. By leveraging the model's output, the attacker can attempt to predict unknown attributes of a data sample. This type of attack poses significant privacy risks, as it can uncover sensitive details about individuals that were not intended to be revealed by the data owners. Based on directly analyzing the synthetic output, an overall high score indicates that your training data is well-protected from this type of attack. For a specific attribute, a high score indicates that even when other attributes are known, that specific attribute is difficult to predict.
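A conceptual sketch of this attack follows; it is not Gretel's implementation. The attacker fits a simple model on the synthetic data to predict one sensitive attribute from a set of quasi-identifiers, then applies it to real records (numeric quasi-identifiers are assumed for the classifier used here).

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def attribute_attack_accuracy(synth: pd.DataFrame, real: pd.DataFrame,
                              quasi_ids: list, target: str) -> float:
    """Sketch of an attribute inference attack: predict a sensitive
    attribute of real records from quasi-identifiers, using only the
    synthetic data for training. Low accuracy implies high Attribute
    Inference Protection for that attribute."""
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(synth[quasi_ids], synth[target])
    predictions = model.predict(real[quasi_ids])
    return (predictions == real[target]).mean()
```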
By default, the privacy metrics are turned on and the quasi-identifier count used in the simulated attribute inference attacks is 3. You can adjust the quasi-identifier count or turn off the privacy metrics by editing the config, as sketched below. If you set `skip: true`, it overrides both `skip_mia` and `skip_aia`.
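For illustration, the relevant configuration section might look like the sketch below. The `skip`, `skip_mia`, and `skip_aia` parameters are the ones named above; the section name and the key for the quasi-identifier count are assumptions for this sketch, so check the configuration reference for the exact names.

```yaml
# Sketch only: skip, skip_mia, and skip_aia come from the text above;
# the section name and quasi-identifier key are assumed and may differ
# from the actual configuration reference.
privacy_metrics:
  skip: false                # true disables all privacy metrics,
                             # overriding skip_mia and skip_aia
  skip_mia: false            # skip only Membership Inference Protection
  skip_aia: false            # skip only Attribute Inference Protection
  quasi_identifier_count: 3  # quasi-identifiers per simulated attack
```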
When running Evaluate, you can also turn PII Replay on or off by setting the `skip` parameter for `pii_replay`, as in the sketch below. You can specify which entities to search for by including them in a list under the `entities` parameter. You can find the supported entities under the Transform model documentation, as this is what PII Replay uses to classify PII.
The Gretel Synthetic Data Quality Report can be used as a quick data quality summary simply by viewing the graphics at the top of the report. It can also be used for more in-depth analysis of the integrity of specific distributions and correlations, as well as information about the privacy protection of your data. If your use case requires statistical symmetry and you’d prefer a higher synthetic data quality score, read Tips to Improve Synthetic Data Quality for a multitude of ideas for improving your model.