Synthetic Quality & Privacy Report

Assess the accuracy and privacy of your synthetic data.

The Synthetic Quality & Privacy Report gives both a summary and detailed overview of the quality of generated synthetic data. It takes as input both the original training data as well as the new synthetic data and assesses how well the statistical integrity of the training data was maintained and how well-protected the sensitive information of the training data is.

Synthetic Quality Score (SQS)

At the top of the report, a Synthetic Quality Score is shown which represents the quality of the synthetic data that was generated. Above, we show a score of 9.0 out of 10 which is excellent.

If you click on the Information icon for the SQS section, you will get the following description of what the Synthetic Quality Score is:

The SQS (Synthetic Quality Score) is a measure of how well the output data matches the reference data. A higher SQS indicates a better match. SQS is comprised of five metrics. Column Correlation Stability, Deep Structure Stability, and Column Distribution Stability are calculated and then averaged for any numeric and categorical columns in the datasets. Text Structure Similarity and Text Semantic Similarity are calculated and then averaged for any free text columns in the datasets. Finally, to calculate the overall SQS, a weighted average of the numeric and categorical metrics and the text metrics is taken based on the number of numeric, categorical, and free text columns in the datasets. If there are no columns of the required type, the metrics will show up as N/A.

Additionally, you will see a list, shown below, detailing common synthetic data use cases, and whether your SQS is high enough for that use case. The green checkmark indicates your data quality is sufficient.

With any use case other than Excellent, you can always try and improve your model with our Tips to Improve Synthetic Data Quality. If your score is Very Poor then there is some inherent problem with your data that makes it unsuitable for synthetics. While this is rare in our experience, significant tuning may still be able to put you back in the ballpark.

Synthetic Quality Subscores

The Synthetic Quality Score is computed by taking a weighted combination of five individual quality metrics (described in more detail below): Column Distribution Stability, Deep Structure Stability, Column Correlation Stability, Text Structure Similarity, and Text Semantic Similarity.

Column Correlation Stability

To measure Column Correlation Stability, the correlation between every pair of columns is computed first in the reference data, and then in the output data. The absolute difference between these values is then computed and averaged across all fields. The lower this average value is, the higher the Columnld Correlation Stability quality score will be.

To aid in the comparison of column correlations, the report shows heatmaps for both the training data and the synthetic data, as well as for the computed difference of correlation values. To view the details of what each square in a heatmap refers to, simply hover over the square with your cursor. The hover text will show you the two fields involved, as well as the correlation in the training data, the correlation in the synthetic data, and the difference between the two.

Deep Structure Stability

To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a Principal Component Analysis (PCA) computed first on the reference data, then again on the output data. The idea behind PCA is to reduce the dimensionality of the data while capturing maximum variability of the dataset. The goal is to capture in just a few features the essential shape of all the features. These new features are the Principal Components.

Reducing dataset dimensions helps ML algorithms train faster and better. It also makes it feasible to now graph the dataset and get a sense of it's shape. As a synthetic data quality metric, you can think of it as both a deep distribution and deep correlation stability check, as well as an overall visual quality check. It is becoming increasingly more common to go beyond simple field histogram checks and look as well at the distributional stability of multi-term combinations. As the principal components are derived from all the features, you can think of their distribution as a kind of "all feature" distribution. You can also think of them as an "all feature" correlation check, as the principal components are essentially the eigenvectors of the all feature covariance matrix.

Gretel computes a score by comparing the distributional distance between the principal components in the original data and those in the synthetic data. The closer the principal components are, the higher the synthetic quality score will be. As PCA is a very common approach used in machine learning for both dimensionality reduction and visualization, this metric gives immediate feedback as to the utility of the synthetic data for machine learning purposes.

To understand PCA better, we'll take the example of a dataset with just two columns as graphed below. You can think of PCA to be like fitting an ellipsoid to the data where the axis of the ellipsoid (e.g. the directions in the data of most variability) represent the principal components.

Now imagine holding in your hand a more complex multidimensional object; the goal of PCA is to keep rotating that object until the view you see has maximal width. The line stretching from one end of that maximal width to the other is our first principal component. Now holding the object horizontally steady, rotate the object toward you until in your view is the maximal height. That axis will be our second principal component. Hence the idea is to keep finding the axis with maximum variability that is always perpendicular to the previous axis chosen. As a result, the new dimensions you have chosen have zero correlation with each other, which is another reason why PCA can often improve not just the speed but the accuracy of some ML algorithms.

The approach to creating a PCA score in Gretel code is first to ensure the training and synthetic datasets are of equal size. If they aren’t, then sampling is done. Then, after normalizing the two datasets, the sklearn PCA function is used to compute the first two principal components in both the training and then in the synthetic datasets. We set a random seed to ensure any recall to the reporting metric will always return the same result. We compute the Jensen-Shannon Distributional Distance score between the first principal components in each dataset, and then between the second principal components. The average of these two scores is the final raw metric score. It will range between 0 and 1, lower meaning better. The metric raw score is then converted to the 0 to 10 score that you see in the report, higher meaning better.

Column Distribution Stability

Column Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those in the original data. For each field we use a common approach for comparing two distributions referred to as the Jensen-Shannon Distance. The lower the JS Distance score is on average across all columns, the higher the Column Distribution Stability score will be.

To aid in the comparison of reference versus output column distributions, the report shows a bar chart or histogram for each column. To view the details of what each bar represents, simply hover over the bar with your cursor.

Text Structure Similarity

Text Structure Similarity is a measure of how well the output data matches the reference data in terms of the structure of the text for any free text columns that are present. A higher similarity indicates a better match. The text structure similarity is calculated using the cosine similarity between the word embeddings of the reference and output data for any free text columns.

Text Semantic Similarity

Text Semantic Similarity is a measure of how well any free text columns in the output data match the reference data in terms of the meaning of the text. A higher similarity indicates a better match (less replaying the text in the reference data). The text semantic similarity is calculated using the cosine similarity between the sentence embeddings of the reference and output data. To penalize replaying of the reference texts, we take a 5% holdout from the training data prior to training the model. Text Semantic Similarity is more accurate with larger sample sizes of reference and output datasets. Text Semantic Similarity is not available when no holdout set is provided or when the sample sizes are too small.

Data Privacy Score (DPS)

Your Data Privacy Score analyzes the synthetic output to measure how well protected your original data is from adversarial attacks. It combines results from two common attacks: Membership Inference and Attribute Inference. It returns a score out of 10, where higher is better.

Membership Inference Protection measures how well you are protected from an adversary attempting to determine if specific data points were part of the training set. Attribute Inference Protection measures how well you are protected from an adversary trying to predict sensitive attributes of the data used in training, given other attributes.

The list below details the recommended data-sharing use cases based on the Data Privacy Score.

For use cases that require especially high levels of privacy, we recommend applying the following techniques or filters to try to increase the score:

Underfit the model to generate output that is less similar to the input. In all model types, you can reduce epochs to underfit or prevent overfitting.
Apply Differential Privacy, or reduce epsilon if Differential Privacy is applied.
Increase your training dataset size to reduce the influence of individual data points on the overall model.

You can learn more about the Data Privacy Score on our Privacy Protection page.

Data Privacy Subscores

The Data Privacy Score is computed by averaging the scores from Membership Inference Protection and Attribute Inference Protection. Towards the top of the report, summary statistics are given showing how well the synthetic data scored within each metric.

Membership Inference Protection

Membership Inference Protection is a measure of how well-protected your data is from membership inference attacks. A membership inference attack is a type of privacy attack on machine learning models where an adversary aims to determine whether a particular data sample was part of the model's training dataset. By exploiting the differences in the model's responses to data points from its training set versus those it has never seen before, an attacker can attempt to infer membership. This type of attack can have critical privacy implications, as it can reveal whether specific individuals' data was used to train the model. To simulate this attack, we take a 5% holdout from the training data prior to training the model. Based on directly analyzing the synthetic output, a high score indicates that your training data is well-protected from this type of attack. The score is based on 360 simulated attacks, and the percentages indicate how many fell into each protection level.

Attribute Inference Protection

Attribute Inference Protection is a measure of how well-protected your data is from attribute inference attacks. An attribute inference attack is a type of privacy attack on machine learning models where an adversary seeks to infer missing attributes or sensitive information about individuals from their data that was used to train the model. By leveraging the model's output, the attacker can attempt to predict unknown attributes of a data sample. This type of attack poses significant privacy risks, as it can uncover sensitive details about individuals that were not intended to be revealed by the data owners. Based on directly analyzing the synthetic output, an overall high score indicates that your training data is well-protected from this type of attack. For a specific attribute, a high score indicates that even when other attributes are known, that specific attribute is difficult to predict.

PII Replay

Personally Identifiable Information (PII) Replay is only supported in Evaluate reports because it requires comparing original data to the final synthetic output, not intermediary results. Thus, you will not see it in an SQS report that gets automatically generated after a model run.

PII Replay counts the number of total and unique instances of personally identifiable information (PII) from your reference data that show up in your synthetic output. Lower counts under "Synthetic Data PII Replay" indicate higher privacy.

Note that some PII replay is often expected (and even desirable). In general, you can expect entities that are rarer (and therefore have many possible values), like full address or full name, to have lower amounts of PII replay than entities that are more common (and therefore have fewer possible values), like first name or US state. To reduce the amount of PII replay, run Gretel Transform prior to synthetics.

Results appear in a table, with one row per dataset column analyzed for PII Replay. For each dataset column, the table shows total instances of PII in the training data, how many were unique, and how many were replayed in the synthetic output—making it easy to spot privacy risks at a glance.

Let’s look at the example image to understand how to interpret the results. You can see from the header that the reference data had 2850 rows while the output data for which we are measuring PII Replay had 1000 rows.

In the first row, we can see the dataset column named FirstName was found to be of PII Type first_name. In the reference data, all 2850 rows were labeled with this entity type. Of those 2850 values, 1563 were unique.

Comparing to the output data, we see that 786 rows out of the 1000 in the output data had a value that matched one of the 1583 from the original. Of those 1583 unique values, 170 were replayed (11%).

The rest of the rows can be read in the same way.

Privacy metrics in the configuration

By default, the privacy metrics are turned on and the quasi-identifier count used in the simulated attribute inference attacks is 3. You can adjust the quasi-identifier count or turn off the privacy metrics by editing the config. If you set skip: true, it overrides both skip_mia and skip_aia.

When running Evaluate, you can also choose to turn on or off PII Replay by setting the skip parameter for pii_replay. You can specify which entities to search for by including them in a list under the entities parameter. You can find the supported entities under the Transform model documentation, as this is what PII Replay uses to classify PII.

privacy_metrics:
  quasi_identifier_count: 3
  skip: false
  skip_mia: false
  skip_aia: false
pii_replay:
  skip: false
  entities:
    - first_name
    - last_name
    - ...

Conclusion

The Gretel Synthetic Quality & Privacy Report can be used as to quickly evaluate your synthetic output simply by viewing the graphics at the top of the report. It can also be used for more in-depth analysis of the integrity of specific distributions and correlations, as well as information about the privacy protection of your data. If your use case requires statistical symmetry and you need a higher Synthetic Quality Score, read Tips to Improve Synthetic Data Quality for a multitude of ideas for improving your model. If your use case requires the highest levels of privacy and you need to increase the Data Privacy Score, read Privacy Protection for mechanisms to tweak.

PreviousEvaluate NextTips to Improve Synthetic Data Quality

Last updated 2 months ago

Was this helpful?