The Gretel Synthetic Report gives both a summary and detailed overview of the quality of generated synthetic data. It takes as input both the original training data as well as the new synthetic data and assesses how well the statistical integrity of the training data was maintained.
At the very top of the Gretel Synthetic Report, a Synthetic Data Quality Score is shown, which represents the quality of the synthetic data that was generated. Here, we show a score of 94, which is excellent.
The quality score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, the Synthetic Data Quality Score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. If you do not require statistical symmetry, as might be the case in a testing or demo environment, a lower score may be just as acceptable.
If you click the question mark on the right, you'll see the above description of the Synthetic Data Quality Score. If you click the question mark on the left, you'll see the below table showing how to interpret your score. When your score is Excellent or Good, any of the listed use cases is viable for your synthetic data. When your score is Moderate, the viable use cases are more limited. With any score other than Excellent, you can always try to improve your model with our tips and advice here. If your score is Very Poor, there is some inherent problem with your data that makes it unsuitable for synthetics. While this is rare in our experience, significant tuning may still be able to put you back in the ballpark.
The Synthetic Data Quality Score is computed by taking a weighted combination of the individual quality metrics: Field Distribution Stability, Field Correlation Stability and Deep Structure Stability. Towards the top of the report, summary statistics are given showing how well the synthetic data scored within each metric. The summary statistics also include row and column counts for the training and synthetic data, as well as whether any training lines were duplicated.
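The idea of a weighted combination can be sketched as follows. Note that the actual weights Gretel assigns to each metric are not stated here, so the equal weighting below is purely illustrative:

```python
# Illustrative sketch of combining individual metric scores into an overall
# quality score. The weights shown are hypothetical, not Gretel's actual values.
def combine_scores(scores: dict, weights: dict) -> float:
    """Weighted average of individual quality metric scores (0-100)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {
    "field_distribution_stability": 95.0,
    "field_correlation_stability": 93.0,
    "deep_structure_stability": 94.0,
}
weights = {  # hypothetical equal weighting
    "field_distribution_stability": 1.0,
    "field_correlation_stability": 1.0,
    "deep_structure_stability": 1.0,
}
print(combine_scores(scores, weights))  # 94.0
```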
The row count is the number of records or lines in the training (or synthetic) dataset. The column count is the number of fields in the dataset. The number of training rows used can directly impact the quality of the synthetic data created. The more examples available when training a model, the easier it is for the model to accurately learn the distributions and correlations in the data. Strive for a minimum of 3,000 training examples; 5,000 or even 50,000 is better still.
The more synthetic rows generated, the easier it is to deduce whether the statistical integrity of the data remains intact. If your Synthetic Data Quality Score isn't as high as you'd like it to be, make sure you’ve generated at least 5000 synthetic data records.
The Training Lines Duplicated value is an important way of ensuring the privacy of the generated synthetic data. In almost all situations, this value should be 0. The only exception would be if the training data itself contained a multitude of duplicate rows. If this is the situation, simply remove the duplicate rows before training.
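If your training data does contain duplicate rows, dropping them is a one-liner with pandas (the column names below are made up for illustration):

```python
import pandas as pd

# Example training data containing one exact duplicate row.
df = pd.DataFrame({
    "age": [34, 34, 51, 28],
    "city": ["Austin", "Austin", "Boston", "Denver"],
})

# Remove duplicate rows before training the synthetic model.
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), len(deduped))  # 4 3
```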
Following the Data Summary Statistics section, you'll see an overview of all of the fields in the training data (example below). The high-level Field Distribution Stability score is computed by taking the average of the individual Field Distribution Stability scores, shown in the rightmost column below. To better understand a field's Distribution Stability score, click on the field name to be taken to a graph comparing the training and synthetic distributions.
The training field overview table also shows the count of unique and missing field values, the average length of each field, as well as its datatype. When a dataset contains a large number of highly unique fields, or a large amount of missing data, these characteristics can impede the model's ability to accurately learn the statistical structure of the data. Exceptionally long fields can also have the same impact. Read here for advice on how best to handle fields like these.
To measure Field Correlation Stability, the correlation between every pair of fields is computed first in the training data, and then in the synthetic data. The absolute difference between these values is then computed and averaged across all field pairs. The lower this average value is, the higher the Field Correlation Stability quality score will be.
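The computation described above can be sketched in a few lines. This is a minimal illustration of the idea, not Gretel's exact implementation:

```python
import numpy as np
import pandas as pd

# Sketch of the Field Correlation Stability idea: compute pairwise
# correlations in both datasets, then average the absolute differences.
def correlation_stability_gap(train: pd.DataFrame, synth: pd.DataFrame) -> float:
    corr_train = train.corr()
    corr_synth = synth.corr()
    diff = (corr_train - corr_synth).abs()
    # Average over the upper triangle only, excluding the diagonal
    # (self-correlations are always 1 and carry no information).
    upper = np.triu_indices_from(diff.values, k=1)
    return float(diff.values[upper].mean())

# Toy data: "synthetic" data is the training data plus a little noise,
# so the pairwise correlations should be nearly preserved.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
synth = train + rng.normal(scale=0.1, size=train.shape)

gap = correlation_stability_gap(train, synth)
print(round(gap, 3))  # a small gap implies a high stability score
```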
To aid in the comparison of field correlations, a heatmap is shown for both the training data and the synthetic data, as well as a heatmap for the computed difference of correlation values. To view the details of what each square in the heatmap refers to, simply hover over the square with your cursor. The hover text will show you the two fields involved, as well as the correlation in the training data, the correlation in the synthetic data and the difference between the two. To more easily match squares between matrices, choose the “Toggle Spike Lines” option in the Plotly modebar in the upper right region of the graph.
To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a Principal Component Analysis (PCA) computed first on the original data, then again on the synthetic data. The idea behind PCA is to capture in just a few features the essential shape of all the features. These new features are what is referred to as the Principal Components.
To understand PCA better, we'll take the example of a dataset with just two columns as graphed below. You can think of PCA as fitting an ellipsoid to the data, where the axes of the ellipsoid (i.e., the directions of greatest variability in the data) represent the principal components.
Now imagine holding in your hand a more complex multidimensional object; the goal of PCA is to keep rotating that object until the view you see has maximal width. The line stretching from one end of that maximal width to the other is our first principal component. Now, holding the object horizontally steady, rotate the object toward you until the view you see has maximal height. That axis will be our second principal component. Hence the idea is to keep finding the axis with maximum variability that is always perpendicular to the previous axis chosen. As a result, the new dimensions that are created capture the essential shape of the data.
Gretel computes a synthetic quality score by comparing the distributional distance between the principal components in the original data and those in the synthetic data. The closer the principal components are, the higher the synthetic quality score will be. An example principal component comparison is shown below. As PCA is a very common approach used in machine learning for both dimensionality reduction and visualization, this metric gives immediate feedback as to the utility of the synthetic data for machine learning purposes.
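The comparison described above can be sketched with scikit-learn. This is an illustrative version of the idea, assuming a simple distributional comparison of the projected components; it is not Gretel's exact method:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of the Deep Structure Stability idea: fit PCA on the original data,
# project both datasets into that space, and compare the resulting
# component-wise distributions.
rng = np.random.default_rng(42)
cov = [[3.0, 1.0], [1.0, 1.0]]
original = rng.multivariate_normal([0, 0], cov, size=1000)
synthetic = rng.multivariate_normal([0, 0], cov, size=1000)  # stands in for model output

pca = PCA(n_components=2).fit(original)
orig_pc = pca.transform(original)
synth_pc = pca.transform(synthetic)

# One simple distributional distance: the average gap between the
# per-component standard deviations. Closer components -> higher score.
gap = np.abs(orig_pc.std(axis=0) - synth_pc.std(axis=0)).mean()
print(gap < 0.5)  # True: data drawn from the same distribution stays close
```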
Field Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those in the original data. For each field we use a common approach for comparing two distributions referred to as the Jensen-Shannon Distance. The lower the JS Distance score is on average across all fields, the higher the Field Distribution Stability quality score will be.
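A per-field comparison along these lines can be done with SciPy's Jensen-Shannon distance. The binning strategy below is an illustrative choice, not necessarily the one Gretel uses:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Sketch of comparing one field's distribution between training and
# synthetic data: bin both the same way, then take the JS distance
# between the two histograms (0 = identical, larger = more different).
rng = np.random.default_rng(7)
train_field = rng.normal(loc=0.0, scale=1.0, size=5000)
synth_field = rng.normal(loc=0.05, scale=1.0, size=5000)  # nearly the same

# Shared bin edges so the two histograms are directly comparable.
edges = np.histogram_bin_edges(np.concatenate([train_field, synth_field]), bins=20)
p, _ = np.histogram(train_field, bins=edges, density=True)
q, _ = np.histogram(synth_field, bins=edges, density=True)

distance = jensenshannon(p, q)
print(distance < 0.2)  # True: similar fields yield a small JS distance
```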
To aid in the comparison of original versus synthetic field distributions, a bar chart or histogram is shown for each field. To view the details of what each bar in a graph refers to, simply hover over the bar with your cursor. To more easily compare the training and synthetic values associated with each bar, choose the "Compare data on hover" option in the Plotly modebar in the upper right region of the graph. By enabling this option, the hover text will show both values for the closest pair of bars as opposed to just the value for the individual closest bar.
The Gretel Synthetic Report can be used as a quick data quality summary simply by viewing the graphics at the top of the report. It can also be used for more in-depth analysis of the integrity of specific distributions and correlations. If your use case requires statistical symmetry and you'd prefer a higher synthetic data quality score, read here for a multitude of ideas for improving your model.