Synthetic Data Quality

Assess the accuracy and privacy of your synthetic data.

The Data Quality Report gives both a summary and a detailed overview of the quality of generated synthetic data. It takes as input both the original training data and the newly generated synthetic data, and assesses how well the statistical integrity of the training data was maintained, as well as the privacy level of the synthetic data.

Synthetic Quality Score (SQS)

At the very top of the report, a Synthetic Quality Score is shown which represents the quality of the synthetic data that was generated. For example, a score of 80 is considered excellent.

If you click on the question mark on the right, you will see the following description of what the Synthetic Quality Score is:

The quality score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, the Synthetic Quality Score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. If you do not require statistical symmetry, as might be the case in a testing or demo environment, a lower score may be just as acceptable.

Additionally, the report includes a table detailing common synthetic data use cases and the SQS level recommended for each.

With any score other than Excellent, you can always try to improve your model with our Tips to Improve Synthetic Data Quality. If your score is Very Poor, there is some inherent problem with your data that makes it unsuitable for synthetic data generation. While this is rare in our experience, significant tuning may still be able to put you back in the ballpark.

Data Privacy and Privacy Configuration Scores

Your Data Privacy Score analyzes the synthetic output to measure how well protected your original data is from adversarial attacks. It combines results from two common attacks: Membership Inference and Attribute Inference.

Membership Inference Protection measures how well you are protected from an adversary attempting to determine if specific data points were part of the training set. Attribute Inference Protection measures how well you are protected from an adversary trying to predict sensitive attributes of the data used in training, given other attributes.

Your Privacy Configuration Score is determined by the privacy mechanisms you've enabled in the synthetic configuration. Synthetic data is inherently more private than real-world data.

The table below details the recommended data-sharing use cases based on the Data Privacy and Privacy Configuration Scores.

For use cases that require especially high levels of privacy, we recommend applying the following techniques or filters to try to increase the score:

  • Use the Outlier Filter to ensure that no synthetic record is an outlier with respect to the training space. You can enable this filter by setting privacy_filters.outliers to medium or high (see the configuration sketch after this list).

  • Use the Similarity Filter to ensure that no synthetic record is overly similar to a training record. You can enable this filter by setting privacy_filters.similarity to medium or high.

  • Set privacy_filters.outliers to auto, which will try medium and fall back to turning the filter off if it prevents the synthetic model from generating the requested number of records.

  • Underfit the model to generate output that is less similar to the input. In all model types, you can reduce epochs to underfit or prevent overfitting. In LSTM, you can also set validation_split: True and early_stopping: True in the configuration.

  • Apply Differential Privacy, or reduce epsilon if Differential Privacy is applied.

  • Increase your training dataset size to reduce the influence of individual data points on the overall model.
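As a concrete starting point, the sketch below assembles the privacy-related settings named above into a configuration snippet. It is only an illustration: the nesting of the keys is inferred from the dotted names used on this page, and the epochs value is purely hypothetical, so consult your model's configuration reference for the exact schema.

import yaml

# Privacy-related settings discussed above. The key nesting is inferred from the
# dotted names on this page and may differ from your model's exact schema.
privacy_settings = {
    "privacy_filters": {
        "outliers": "auto",      # tries medium, falls back to off if generation stalls
        "similarity": "medium",  # or "high" for stricter filtering
    },
    "params": {
        "epochs": 50,            # hypothetical value; fewer epochs underfits the model
    },
}

print(yaml.safe_dump(privacy_settings, sort_keys=False))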

You can learn more about the Data Privacy and Privacy Configuration Scores on our Privacy Protection page.

Synthetic Quality Summary Statistics

The Synthetic Quality Score is computed by taking a weighted combination of three individual quality metrics (described in more detail below): Field Distribution Stability, Field Correlation Stability, and Deep Structure Stability. Towards the top of the report, summary statistics are given showing how well the synthetic data scored within each metric. The summary statistics also include row and column counts for the training and synthetic data, as well as whether any training lines were duplicated.

The row count is the number of records or lines in the training (or synthetic) dataset. The column count is the number of fields in the dataset. The number of training rows used can directly impact the quality of the synthetic data created. The more examples available when training a model, the easier it is for the model to accurately learn the distributions and correlations in the data. Always strive to have a minimum of 3,000 training examples; increasing that to 5,000 or even 50,000 is better still.

The more synthetic rows generated, the easier it is to deduce whether the statistical integrity of the data remains intact. If your Synthetic Data Quality Score isn't as high as you'd like it to be, make sure you've generated at least 5,000 synthetic data records.

The Training Lines Duplicated value is an important check for ensuring the privacy of the generated synthetic data. In almost all situations, this value should be 0. The only exception would be if the training data itself contained a multitude of duplicate rows. If that is the case, simply remove the duplicate rows before training.
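If you do find duplicates, removing them before training is straightforward. Below is a minimal sketch, assuming the training data is in a pandas DataFrame and using a hypothetical file name:

import pandas as pd

train = pd.read_csv("training_data.csv")  # hypothetical file name

# Count and drop exact duplicate rows so Training Lines Duplicated stays at 0.
print("duplicate rows found:", train.duplicated().sum())
deduped = train.drop_duplicates().reset_index(drop=True)
deduped.to_csv("training_data_deduped.csv", index=False)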

Privacy Configuration

The Privacy Configuration Score is based on how you configured your privacy settings. The report provides a summary of the privacy protections, indicating the setting you selected for each. The protections include Outlier Filtering, Similarity Filtering, Overfitting Prevention, and Differential Privacy.

Data Privacy Summary

The Data Privacy Score is computed by averaging the scores from Membership Inference Protection and Attribute Inference Protection. Towards the top of the report, summary statistics are given showing how well the synthetic data scored within each metric.

At the end of the report, you can see the breakdown of how these metrics are calculated.

Training Field Overview

Following the Privacy Protection Summary section, you'll see an overview of all the fields in the training data. The high-level Field Distribution Stability score is computed by taking the average of the individual Field Distribution Stability scores, shown in the rightmost column of the table. To better understand a field's Distribution Stability score, click on the field name to be taken to a graph comparing the training and synthetic distributions.

The training field overview table also shows the count of unique and missing field values, the average length of each field, as well as its datatype. When a dataset contains a large number of highly unique fields or a large amount of missing data, these characteristics can impede the model's ability to accurately learn the statistical structure of the data. Exceptionally long fields can also have the same impact. Read Tips to Improve Synthetic Data Quality for advice on how best to handle fields like these.
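If you want to reproduce similar per-field statistics yourself before training, a rough pandas sketch such as the one below can help you spot highly unique, sparse, or very long fields ahead of time. This is an approximation, not the report's exact computation:

import pandas as pd

def field_overview(df: pd.DataFrame) -> pd.DataFrame:
    # One row per field: unique values, missing values, average length, and datatype.
    return pd.DataFrame({
        "unique_values": df.nunique(),
        "missing_values": df.isna().sum(),
        "avg_length": df.astype(str).apply(lambda col: col.str.len().mean()),
        "datatype": df.dtypes.astype(str),
    })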

Synthetic Data Quality Metrics

Field Correlation Stability

To measure Field Correlation Stability, the correlation between every pair of fields is computed first in the training data, and then in the synthetic data. The absolute difference between these values is then computed and averaged across all field pairs. The lower this average value is, the higher the Field Correlation Stability quality score will be.
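The sketch below shows the general idea, not Gretel's exact implementation. It assumes numeric fields (categorical fields would need to be encoded first) and uses Pearson correlation:

import numpy as np
import pandas as pd

def correlation_stability_gap(train: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Pairwise correlations in each dataset, then the average absolute difference.
    train_corr = train.corr(numeric_only=True)
    synth_corr = synth.corr(numeric_only=True)
    diff = (train_corr - synth_corr).abs()
    # Average over the upper triangle so each field pair is counted once.
    pair_mask = np.triu(np.ones(diff.shape, dtype=bool), k=1)
    return float(diff.values[pair_mask].mean())

A lower gap corresponds to a higher Field Correlation Stability score.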

To aid in the comparison of field correlations, the report shows heatmaps for both the training data and the synthetic data, as well as for the computed difference of correlation values. To view the details of what each square in a heatmap refers to, simply hover over the square with your cursor. The hover text will show you the two fields involved, as well as the correlation in the training data, the correlation in the synthetic data, and the difference between the two.

Deep Structure Stability

To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a Principal Component Analysis (PCA) computed first on the original data, then again on the synthetic data. The idea behind PCA is to capture the essential shape of all the features in just a few new features. These new features are what are referred to as the principal components.

Gretel computes a synthetic quality score by comparing the distributional distance between the principal components in the original data and those in the synthetic data. The closer the principal components are, the higher the synthetic quality score will be. As PCA is a very common approach used in machine learning for both dimensionality reduction and visualization, this metric gives immediate feedback as to the utility of the synthetic data for machine learning purposes.
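As a simplified sketch of that comparison (not Gretel's actual scoring), you can fit PCA on the original data, project both datasets onto the same components, and compare each component's distribution; the choice of the one-dimensional Wasserstein distance and the number of components below is purely illustrative:

import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.decomposition import PCA

def deep_structure_distance(train: np.ndarray, synth: np.ndarray, n_components: int = 3) -> float:
    # Fit the principal components on the original data, then project both
    # datasets into that shared space.
    pca = PCA(n_components=n_components).fit(train)
    train_pc = pca.transform(train)
    synth_pc = pca.transform(synth)
    # Compare the distribution of each component; smaller means closer structure.
    return float(np.mean([
        wasserstein_distance(train_pc[:, i], synth_pc[:, i])
        for i in range(n_components)
    ]))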

Field Distribution Stability

Field Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those in the original data. For each field, we compare the two distributions using a common measure referred to as the Jensen-Shannon (JS) Distance. The lower the JS Distance is on average across all fields, the higher the Field Distribution Stability quality score will be.
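A minimal sketch of the per-field comparison is shown below, assuming the data lives in pandas Series; the binning strategy is illustrative rather than Gretel's exact approach:

import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon

def field_js_distance(train_col: pd.Series, synth_col: pd.Series) -> float:
    if pd.api.types.is_numeric_dtype(train_col):
        # Numeric field: histogram both datasets on shared bins.
        bins = np.histogram_bin_edges(pd.concat([train_col, synth_col]).dropna(), bins=20)
        p, _ = np.histogram(train_col.dropna(), bins=bins)
        q, _ = np.histogram(synth_col.dropna(), bins=bins)
    else:
        # Categorical field: align category frequencies on a shared set of categories.
        categories = sorted(set(train_col.dropna()) | set(synth_col.dropna()))
        p = train_col.value_counts().reindex(categories, fill_value=0).to_numpy()
        q = synth_col.value_counts().reindex(categories, fill_value=0).to_numpy()
    return float(jensenshannon(p, q))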

To aid in the comparison of original versus synthetic field distributions, the report shows a bar chart or histogram for each field. To view the details of what each bar represents, simply hover over the bar with your cursor.

Data Privacy Metrics

Membership Inference Protection

Membership Inference Protection is a measure of how well-protected your data is from membership inference attacks. A membership inference attack is a type of privacy attack on machine learning models where an adversary aims to determine whether a particular data sample was part of the model's training dataset. By exploiting the differences in the model's responses to data points from its training set versus those it has never seen before, an attacker can attempt to infer membership. This type of attack can have critical privacy implications, as it can reveal whether specific individuals' data was used to train the model. To simulate this attack, we take a 5% holdout from the training data prior to training the model. Based on directly analyzing the synthetic output, a high score indicates that your training data is well-protected from this type of attack. The score is based on 360 simulated attacks, and the percentages indicate how many fell into each protection level.
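The attack simulation itself is internal to Gretel, but the following sketch illustrates the general idea under simple assumptions (records encoded as numeric vectors): if records that were part of the training set sit measurably closer to the synthetic output than held-out records do, membership starts to leak.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def membership_distance_gap(members: np.ndarray, holdout: np.ndarray, synthetic: np.ndarray) -> float:
    # Distance from each real record to its nearest synthetic neighbor.
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    member_dist, _ = nn.kneighbors(members)
    holdout_dist, _ = nn.kneighbors(holdout)
    # A gap near zero means training members are no closer to the synthetic data
    # than unseen records, so proximity alone reveals little about membership.
    return float(holdout_dist.mean() - member_dist.mean())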

Attribute Inference Protection

Attribute Inference Protection is a measure of how well-protected your data is from attribute inference attacks. An attribute inference attack is a type of privacy attack on machine learning models where an adversary seeks to infer missing attributes or sensitive information about individuals from their data that was used to train the model. By leveraging the model's output, the attacker can attempt to predict unknown attributes of a data sample. This type of attack poses significant privacy risks, as it can uncover sensitive details about individuals that were not intended to be revealed by the data owners. Based on directly analyzing the synthetic output, an overall high score indicates that your training data is well-protected from this type of attack. For a specific attribute, a high score indicates that even when other attributes are known, that specific attribute is difficult to predict.
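As a toy illustration only (not Gretel's implementation), an attribute inference attack against synthetic output can be simulated roughly as follows: for each training record, the attacker matches its known quasi-identifiers against the synthetic data and reads the sensitive attribute off the most similar synthetic record. The function parameters here are placeholders.

import pandas as pd

def attribute_inference_accuracy(train: pd.DataFrame, synth: pd.DataFrame,
                                 quasi_identifiers: list, target: str) -> float:
    hits = 0
    for _, row in train.iterrows():
        # Score each synthetic record by how many quasi-identifier values it matches.
        matches = (synth[quasi_identifiers] == row[quasi_identifiers]).sum(axis=1)
        best_guess = synth.loc[matches.idxmax(), target]
        hits += int(best_guess == row[target])
    # Lower accuracy means the sensitive attribute is better protected.
    return hits / len(train)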

By default, the privacy metrics are turned on and the quasi-identifier count used in the simulated attribute inference attacks is 3. You can adjust the quasi-identifier count or turn off the privacy metrics by editing the config. If you set skip: true, it overrides both skip_mia and skip_aia.

privacy_metrics:
  quasi_identifier_count: 3  # number of quasi-identifiers used in the simulated attribute inference attacks (default: 3)
  skip: true                 # skip all privacy metrics; overrides skip_mia and skip_aia
  skip_mia: true             # skip the Membership Inference Protection calculation
  skip_aia: true             # skip the Attribute Inference Protection calculation

Conclusion

The Gretel Synthetic Data Quality Report can be used as a quick data quality summary simply by viewing the graphics at the top of the report. It can also be used for more in-depth analysis of the integrity of specific distributions and correlations, as well as information about the privacy protection of your data. If your use case requires statistical symmetry and you’d prefer a higher synthetic data quality score, read Tips to Improve Synthetic Data Quality for a multitude of ideas for improving your model.
