Synthetic Data Quality

Assess the accuracy and privacy of your synthetic data.

Introduction

The Data Quality Report gives both a summary and a detailed overview of the quality of the generated synthetic data. It takes as input both the original training data and the new synthetic data, and assesses how well the statistical integrity of the training data was maintained as well as the privacy level of the synthetic data.

Synthetic Data Quality Score (SQS)

At the very top of the report, a Synthetic Data Quality Score is shown, which represents the quality of the synthetic data that was generated. Above, we show a score of 93, which is excellent.

If you click on the question mark on the right, you will see the following description of what the Synthetic Data Quality Score is:

The quality score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, the Synthetic Data Quality Score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. If you do not require statistical symmetry, as might be the case in a testing or demo environment, a lower score may be just as acceptable.

Additionally, you will see a table, shown below, detailing common synthetic data use cases and the SQS level recommended for each.

For any score other than Excellent, you can always try to improve your model with our Tips to Improve Synthetic Data Quality. If your score is Very Poor, there is likely some inherent problem with your data that makes it unsuitable for synthetics. While this is rare in our experience, significant tuning may still be able to put you back in the ballpark.

Privacy Protection Level (PPL)

Your Privacy Protection Level (PPL) is determined by the privacy mechanisms you've enabled in the synthetic configuration. Synthetic data is inherently more private than real-world data. The table below details the PPL recommended for different data sharing use cases.

For use cases that require especially high levels of privacy, these mechanisms help ensure that your synthetic data is safe from adversarial attacks. There are four primary protection mechanisms you can add to the creation of synthetic data for additional privacy protection; a conceptual sketch of how a similarity check might operate follows the list below.

  • The Outlier Filter ensures that no synthetic record is an outlier with respect to the training space. This filter is enabled in the configuration by setting privacy_filters.outliers: [medium, high].

  • The Similarity Filter ensures that no synthetic record is overly similar to a training record. This filter is enabled in the configuration by setting privacy_filters.similarity: [medium, high].

  • You can also set privacy_filters.outliers to auto, which will try medium first and fall back to turning the filter off if the filter prevents the synthetic model from generating the requested number of records.

  • Overfitting Prevention ensures that model training stops before it has a chance to overfit and is enabled using the validation_split: True and early_stopping: True configuration settings.

  • Differential Privacy is available when using the Tabular DP model.
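
As a conceptual illustration only (this is not Gretel's implementation), a check in the spirit of the Similarity Filter could drop any synthetic record that falls within a chosen distance of its nearest training record. The function name and threshold below are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def similarity_filter_sketch(train: np.ndarray, synth: np.ndarray, min_distance: float):
    """Conceptual sketch only, not Gretel's implementation: keep a synthetic
    record only if its nearest training record is at least min_distance away."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    distances, _ = nn.kneighbors(synth)
    keep = distances[:, 0] >= min_distance
    return synth[keep]
```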

You can learn more about PPL on our Privacy Protection page.

Data Summary Statistics

The Synthetic Data Quality Score is computed by taking a weighted combination of three individual quality metrics (described in more detail below): Field Distribution Stability, Field Correlation Stability and Deep Structure Stability. Towards the top of the report, summary statistics are given showing how well the synthetic data scored within each metric. The summary statistics also include row and column counts for the training and synthetic data, as well as whether any training lines were duplicated.
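As a rough sketch of how such a weighted combination could be formed, assuming the three metric scores are already available (the weights below are placeholders, not Gretel's actual weights):

```python
def synthetic_quality_score(field_distribution: float,
                            field_correlation: float,
                            deep_structure: float,
                            weights=(0.4, 0.3, 0.3)) -> float:
    # Illustrative weighted average of the three metric scores.
    # The weights are placeholders, not Gretel's actual weighting.
    scores = (field_distribution, field_correlation, deep_structure)
    return sum(w * s for w, s in zip(weights, scores))

print(synthetic_quality_score(95, 92, 90))  # -> 92.6
```
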

The row count is the number of records or lines in the training (or synthetic) dataset. The column count is the number of fields in the dataset. The number of training rows used can directly impact the quality of the synthetic data created. The more examples available when training a model, the easier it is for the model to accurately learn the distributions and correlations in the data. Always strive to have a minimum of 3,000 training examples; increasing that to 5,000 or even 50,000 is better still.

The more synthetic rows generated, the easier it is to determine whether the statistical integrity of the data remains intact. If your Synthetic Data Quality Score isn't as high as you'd like it to be, make sure you've generated at least 5,000 synthetic data records.

The Training Lines Duplicated value is an important way of ensuring the privacy of the generated synthetic data. In almost all situations, this value should be 0. The only exception would be if the training data itself contained a multitude of duplicate rows. If this is the situation, simply remove the duplicate rows before training.
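
For example, one way to check for and remove duplicate rows with pandas before training (the file names are placeholders):

```python
import pandas as pd

# Load the training data and drop exact duplicate rows before training,
# so the Training Lines Duplicated value in the report stays at 0.
train = pd.read_csv("training_data.csv")  # placeholder file name

duplicate_count = train.duplicated().sum()
print(f"Duplicate training rows: {duplicate_count}")

train = train.drop_duplicates().reset_index(drop=True)
train.to_csv("training_data_deduped.csv", index=False)
```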

Training Field Overview

Following the Privacy Protection Summary section, you’ll see an overview of all the fields in the training data (example below). The high-level Field Distribution Stability score is computed by taking the average of the individual Field Distribution Stability scores, shown in the rightmost column below. To better understand a field's Distribution Stability score, click on the field name to be taken to a graph comparing the training and synthetic distributions.

The training field overview table also shows the count of unique and missing field values, the average length of each field, as well as its datatype. When a dataset contains a large number of highly unique fields, or a large amount of missing data, these characteristics can impede the model's ability to accurately learn the statistical structure of the data. Exceptionally long fields can also have the same impact. Read Tips to Improve Synthetic Data Quality for advice on how best to handle fields like these.
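
If you want to compute a similar per-field overview on your own training data, a rough pandas sketch might look like the following (the file name is a placeholder, and average length here treats every value as a string, so it is only approximate):

```python
import pandas as pd

train = pd.read_csv("training_data.csv")  # placeholder file name

# Rough per-field summary in the spirit of the training field overview:
# unique values, missing values, average value length, and data type.
overview = pd.DataFrame({
    "unique_values": train.nunique(),
    "missing_values": train.isna().sum(),
    "avg_length": train.astype(str).apply(lambda col: col.str.len().mean()),
    "dtype": train.dtypes.astype(str),
})
print(overview)
```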

Synthetic Data Quality Metrics

Field Correlation Stability

To measure Field Correlation Stability, the correlation between every pair of fields is computed first in the training data, and then in the synthetic data. The absolute difference between these values is then computed and averaged across all field pairs. The lower this average value is, the higher the Field Correlation Stability quality score will be.
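
A minimal sketch of this computation for numeric fields, assuming the training and synthetic data are pandas DataFrames (Gretel's actual scoring also covers non-numeric fields and maps the result onto a quality score):

```python
import pandas as pd

def correlation_stability(train: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Sketch: mean absolute difference between the pairwise correlation
    matrices of the training and synthetic data (numeric fields only).
    Smaller values indicate better correlation stability."""
    train_corr = train.corr(numeric_only=True)
    synth_corr = synth.corr(numeric_only=True)
    diff = (train_corr - synth_corr).abs().values
    # Average over the off-diagonal field pairs (the diagonal is always 0).
    n = len(diff)
    return float((diff.sum() - diff.trace()) / (n * (n - 1)))
```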

To aid in the comparison of field correlations, the report shows heatmaps for both the training data and the synthetic data, as well as for the computed difference of correlation values. To view the details of what each square in a heatmap refers to, simply hover over the square with your cursor. The hover text will show you the two fields involved, as well as the correlation in the training data, the correlation in the synthetic data and the difference between the two.

Deep Structure Stability

To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a Principal Component Analysis (PCA) computed first on the original data, then again on the synthetic data. The idea behind PCA is to capture the essential shape of all the features in just a few new features. These new features are referred to as the Principal Components.

Gretel computes a synthetic quality score by comparing the distributional distance between the principal components in the original data and those in the synthetic data. The closer the principal components are, the higher the synthetic quality score will be. As PCA is a very common approach used in machine learning for both dimensionality reduction and visualization, this metric gives immediate feedback as to the utility of the synthetic data for machine learning purposes.
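
A conceptual sketch of this comparison (not Gretel's implementation), using scikit-learn for the PCA and a histogram-based Jensen-Shannon distance per component:

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import jensenshannon

def deep_structure_sketch(train: np.ndarray, synth: np.ndarray, n_components: int = 3) -> float:
    """Conceptual sketch: fit PCA on the training data, project both datasets,
    and compare the distribution of each principal component between the two.
    Smaller distances suggest the deeper structure has been preserved."""
    pca = PCA(n_components=n_components).fit(train)
    train_pcs = pca.transform(train)
    synth_pcs = pca.transform(synth)

    distances = []
    for i in range(n_components):
        # Histogram each component over a shared range so the bins line up.
        lo = min(train_pcs[:, i].min(), synth_pcs[:, i].min())
        hi = max(train_pcs[:, i].max(), synth_pcs[:, i].max())
        t_hist, _ = np.histogram(train_pcs[:, i], bins=20, range=(lo, hi), density=True)
        s_hist, _ = np.histogram(synth_pcs[:, i], bins=20, range=(lo, hi), density=True)
        distances.append(jensenshannon(t_hist, s_hist))
    return float(np.mean(distances))
```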

Field Distribution Stability

Field Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those in the original data. For each field, we use a common approach for comparing two distributions referred to as the Jensen-Shannon (JS) Distance. The lower the JS Distance is on average across all fields, the higher the Field Distribution Stability quality score will be.
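
For a single categorical field, a rough sketch of this comparison using scipy might look like the following (Gretel's actual per-field scoring may bin and weight values differently):

```python
import pandas as pd
from scipy.spatial.distance import jensenshannon

def field_js_distance(train_col: pd.Series, synth_col: pd.Series) -> float:
    """Sketch: align the value frequencies of a categorical field in the
    training and synthetic data, then compute the Jensen-Shannon distance
    between them (0 means the distributions are identical)."""
    categories = sorted(set(train_col.dropna()) | set(synth_col.dropna()))
    p = train_col.value_counts(normalize=True).reindex(categories, fill_value=0)
    q = synth_col.value_counts(normalize=True).reindex(categories, fill_value=0)
    return float(jensenshannon(p, q))
```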

To aid in the comparison of original versus synthetic field distributions, the report shows a bar chart or histogram for each field. To view the details of what each bar represents, simply hover over the bar with your cursor.

Conclusion

The Gretel Synthetic Data Quality Report can be used as a quick data quality summary simply by viewing the graphics at the top of the report. It can also be used for more in-depth analysis of the integrity of specific distributions and correlations. If your use case requires statistical symmetry and you’d prefer a higher synthetic data quality score, read Tips to Improve Synthetic Data Quality for a multitude of ideas for improving your model.