Evaluating Data Sampling Methods with a Synthetic Quality Score

An evaluation of the effect of sampling procedures on the quality of synthetic tabular data using Gretel.ai's Synthetic Quality Score (SQS).

Introduction

While much recent research in probabilistic generative modeling has focused on visual domains such as images, a large portion of synthetic data use cases revolve around structured data such as CSV files and databases. At Gretel, we have built generative modeling tools to learn distributions from customer data and generate realistic synthetic data from those distributions.

As with any ML project, a key aspect is measuring a model's performance. We have developed a tool called the **S**ynthetic **Q**uality **S**core (SQS) to ensure that the tabular output of our generative models maintains important statistical information found in the ground-truth data. In this post, we'll walk through how we calculate and interpret our synthetic quality score, and how you can use it to evaluate the effect that sampling methods have on the quality of synthetically generated tabular data.

Calculating Gretel's Synthetic Quality Score (SQS)

Figure 1: Comparing correlations of the ground truth dataset and the synthetic dataset.

The SQS Formula

Using the distances between correlation values, PCAs, and feature distributions of the ground-truth and synthetic datasets, we construct a single scalar for each of these three measures. These real scalar values lie in the range [0, 100] and are smooth. Both properties are ensured by fitting polynomials over a large number of real tabular datasets. The SQS is then calculated from these scalars and normalized. An SQS of 100 means the synthetic data is a perfect match, and 0 means the model fit was quite poor.
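To make the idea of a "distance" between real and synthetic data concrete, here is a minimal sketch of two raw distances: one between pairwise correlations and one between a single feature's distributions. It is an illustration only, not Gretel's SQS implementation. The function names `correlation_distance` and `js_distance` are hypothetical, the choice of Pearson correlation and Jensen-Shannon distance is an assumption, and the fitted polynomials that map distances onto the [0, 100] scale are not reproduced.

```python
import numpy as np


def correlation_distance(real: np.ndarray, synth: np.ndarray) -> float:
    """Mean absolute difference between pairwise Pearson correlations.

    Both inputs are (rows, columns) numeric arrays with the same columns.
    Hypothetical helper; not the formula used by SQS.
    """
    real_corr = np.corrcoef(real, rowvar=False)
    synth_corr = np.corrcoef(synth, rowvar=False)
    # The strictly upper triangle holds each column pair exactly once.
    pairs = np.triu_indices_from(real_corr, k=1)
    return float(np.mean(np.abs(real_corr[pairs] - synth_corr[pairs])))


def js_distance(real_col: np.ndarray, synth_col: np.ndarray, bins: int = 20) -> float:
    """Jensen-Shannon distance (base 2, in [0, 1]) between two histograms."""
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    # Shared bin edges so the two histograms are comparable.
    p, _ = np.histogram(real_col, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth_col, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(1000, 3))
    synth = rng.normal(size=(1000, 3))  # stand-in for model output
    print("correlation distance:", correlation_distance(real, synth))
    print("JS distance, column 0:", js_distance(real[:, 0], synth[:, 0]))
```

Both distances are 0 when the synthetic data matches the real data exactly and grow as the two diverge, which is the opposite orientation to a quality score where 100 is best.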

With this measure in place, we can now explore how probabilistic sampling changes the quality of synthetically generated data. We use a decoder-only Transformer with 1.5 billion parameters, fine-tuned on the UCI Adult dataset.

The Transformer is trained to model a categorical distribution over a finite vocabulary, which consists of the tokens used to encode the dataset. There are many ways to sample from this distribution. The first is to **sample** directly from the temperature-adjusted distribution, which performs very well due to the alignment between training and testing. Alternatively, we can restrict the categorical distribution to the **top k** tokens, to the **nucleus** of top-p probability mass, or to the set of tokens whose information content is **typical**, that is, close to the entropy of the distribution. These methods perform slightly worse, which is surprising given their prevalence in many NLP generative modeling tasks. Worst of all is **beam** search, which likely fails due to the sharpness of the categorical distribution.
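The truncation strategies described above can be sketched as filters applied to a next-token probability vector before drawing a sample. This is a minimal NumPy illustration under stated assumptions, not the code used in the experiments: all function names are hypothetical, the example logits are made up, and beam search is omitted because it is a search procedure rather than a filter on a single distribution.

```python
import numpy as np


def softmax(logits, temperature: float = 1.0) -> np.ndarray:
    """Temperature-adjusted categorical distribution over the vocabulary."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def _keep_prefix(probs: np.ndarray, order: np.ndarray, mass: float) -> np.ndarray:
    """Keep the shortest prefix of `order` whose total mass reaches `mass`."""
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, mass)) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()  # renormalize over the surviving tokens


def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Restrict the distribution to the k most probable tokens."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()


def nucleus_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Restrict to the most probable tokens covering mass p (top-p)."""
    order = np.argsort(probs)[::-1]  # most probable first
    return _keep_prefix(probs, order, p)


def typical_filter(probs: np.ndarray, mass: float) -> np.ndarray:
    """Restrict to tokens whose surprisal is closest to the entropy."""
    safe = np.where(probs > 0, probs, 1.0)  # avoid log(0)
    surprisal = -np.log(safe)
    entropy = float(np.sum(probs * surprisal))
    # Zero-probability tokens get infinite deviation so they sort last.
    deviation = np.where(probs > 0, np.abs(surprisal - entropy), np.inf)
    order = np.argsort(deviation)  # most typical first
    return _keep_prefix(probs, order, mass)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = [2.0, 1.5, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
    probs = softmax(logits, temperature=0.8)
    for name, dist in [
        ("sample", probs),
        ("top k", top_k_filter(probs, k=2)),
        ("nucleus", nucleus_filter(probs, p=0.9)),
        ("typical", typical_filter(probs, mass=0.9)),
    ]:
        token = rng.choice(len(dist), p=dist)
        print(name, np.round(dist, 3), "->", token)
```

Each filter zeroes out part of the vocabulary and renormalizes, so the model ends up sampling from a distribution that differs from the one it was trained to produce. Direct sampling avoids that mismatch, which is consistent with the alignment argument above.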

Our method of choice is an ensemble of sampling methods, which performs as well as sampling directly from the categorical distribution while reducing the variance of the resulting SQS.
