Synthetic Text Data Quality

Measure the semantic and structural similarity of your synthetic text data.

Introduction

The Synthetic Text Data Quality Score (or Text SQS) is an assessment of how well generated synthetic data maintains the same semantic and structural properties as the original dataset.

Below we describe each component of the report.

Text Synthetic Data Quality Score (Text SQS)

The Text Synthetic Data Quality Score (Text SQS) appears at the very top of the report and represents the overall quality of the synthetic text data.

Text SQS is computed by taking a weighted combination of the individual quality metrics: Text Semantic Similarity and Text Structure Similarity. Each score is discussed in more detail in its own section below.
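As a rough illustration of a weighted combination, the sketch below averages the two component scores. The function name `text_sqs` and the equal default weights are assumptions made for this example; the report does not state the actual weights it uses.

```python
def text_sqs(semantic_score: float, structure_score: float,
             semantic_weight: float = 0.5) -> float:
    """Combine two 0-100 scores into a single 0-100 score.

    The weights are hypothetical: the report's real weighting is not documented here.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    structure_weight = 1.0 - semantic_weight
    return semantic_weight * semantic_score + structure_weight * structure_score


if __name__ == "__main__":
    print(text_sqs(80, 60))        # equal weights
    print(text_sqs(80, 60, 0.75))  # semantic similarity weighted more heavily
```

Because the weights sum to 1, the combined score stays in the same 0-100 range as its inputs.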

The score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead.

The 50+ languages supported by the report are: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.

One way to interpret the Text SQS is to take a look at what use cases the generated synthetic data would be suitable for. If the score is “Poor” or “Very Poor”, read our tips for a multitude of ideas to improve the score.

If you don't require semantic or structural similarity, as might be the case in a testing or demo environment, a lower score may be acceptable.

Data Summary Statistics

The report also shows statistics about the training data and the generated synthetic data in the Data Summary section.

The row count is the number of records or lines in the training (or synthetic) dataset. The column count is the number of fields in the dataset. The number of training rows used can directly impact the quality of the synthetic data created. The more examples available when training a model, the easier it is for the model to accurately learn the distributions in the data.

The Training Lines Duplicated value is an important check on the privacy of the generated synthetic data. In almost all situations, this value should be 0. The only exception would be if the training data itself contained a multitude of duplicate rows. If that is the case, simply remove the duplicate rows before training.
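One plausible way to compute such a value is to count the distinct training records that reappear verbatim in the synthetic data. This is an assumed definition for illustration, and `training_lines_duplicated` is a hypothetical helper, not the report's implementation.

```python
from typing import Sequence


def training_lines_duplicated(training: Sequence[str],
                              synthetic: Sequence[str]) -> int:
    """Count distinct training records that appear exactly in the synthetic data.

    Assumed definition: exact string match, no normalization of case or whitespace.
    """
    return len(set(training) & set(synthetic))


if __name__ == "__main__":
    train = ["the cat sat", "hello there", "good morning"]
    synth = ["the cat sat", "a new sentence", "the cat sat"]
    print(training_lines_duplicated(train, synth))
```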

Missing values is the number of records that are empty strings. Unique values is the count of unique records in the data. The average character, word, and sentence counts are calculated across all records in the dataset. These attributes give you a sense of the shape and size of the training set and the synthetic data.
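The statistics above can be approximated with a few lines of Python. This is a minimal sketch: `summarize_text_records` is a hypothetical name, words are split on whitespace, and sentences are split on `.`, `!`, and `?`, which is a simplification that the report's own tokenization may not share.

```python
import re
from typing import Dict, Sequence


def summarize_text_records(records: Sequence[str]) -> Dict[str, float]:
    """Compute simple summary statistics over a list of text records."""
    if not records:
        raise ValueError("records must not be empty")

    def word_count(text: str) -> int:
        return len(text.split())

    def sentence_count(text: str) -> int:
        # Naive sentence split on terminal punctuation (an assumption).
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

    n = len(records)
    return {
        "row_count": n,
        "missing_values": sum(1 for r in records if r == ""),
        "unique_values": len(set(records)),
        "avg_characters": sum(len(r) for r in records) / n,
        "avg_words": sum(word_count(r) for r in records) / n,
        "avg_sentences": sum(sentence_count(r) for r in records) / n,
    }


if __name__ == "__main__":
    print(summarize_text_records(["Hello world. How are you?", "", "Hi!"]))
```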

To compute the text semantic and structural similarity scores, the training and synthetic records are each downsampled to 80 rows, or to the number of training rows if that is smaller. This sample size is chosen to be statistically robust while keeping the NLP models efficient. Downsampling does not affect the number of records used to train the language model that generates the synthetic records.
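The downsampling rule can be sketched as follows. The helper name `downsample_pair`, the fixed seed, and the use of uniform random sampling are assumptions for illustration; the sketch also caps the synthetic sample at the number of synthetic records available, which the text does not address.

```python
import random
from typing import List, Sequence, Tuple


def downsample_pair(training: Sequence[str], synthetic: Sequence[str],
                    max_rows: int = 80, seed: int = 0
                    ) -> Tuple[List[str], List[str]]:
    """Sample both datasets down to min(max_rows, number of training rows)."""
    n = min(max_rows, len(training))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    train_sample = rng.sample(list(training), n)
    # Assumption: never request more synthetic rows than exist.
    synth_sample = rng.sample(list(synthetic), min(n, len(synthetic)))
    return train_sample, synth_sample


if __name__ == "__main__":
    train = [f"train {i}" for i in range(200)]
    synth = [f"synth {i}" for i in range(500)]
    a, b = downsample_pair(train, synth)
    print(len(a), len(b))
```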

Text Semantic Similarity Score

The Text Semantic Similarity Score is a value in the range of 0–100 that indicates whether the real and synthetic texts have the same meaning, evaluated across all of the text data. An embedding model is used to encode each text record as a one-dimensional vector of size 512. The embedded vectors are averaged across all records of the training texts and, separately, of the synthetic texts; the cosine similarity of the two average vectors is then calculated, and a score is assigned based on that similarity.
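The core of this calculation, the cosine similarity of two mean embedding vectors, can be sketched in plain Python. The embeddings are assumed to have been produced already by some embedding model (not shown), and the function names are hypothetical. How the similarity is mapped to the final 0-100 score is not specified here, so the sketch stops at the raw cosine similarity.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_embedding_similarity(train_embeddings, synth_embeddings) -> float:
    """Cosine similarity between the mean training and mean synthetic embeddings."""
    return cosine_similarity(mean_vector(train_embeddings),
                             mean_vector(synth_embeddings))


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for 512-dimensional embeddings.
    train = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
    synth = [[0.7, 0.3, 0.0], [0.9, 0.0, 0.2]]
    print(mean_embedding_similarity(train, synth))
```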

A higher score gives the user more confidence that the synthetic text data can be used in place of the original text samples in downstream semantic text classification tasks.

Principal Component Analysis for Semantic Similarity

The report also includes Principal Component Analysis (PCA) plots to demonstrate semantic similarity. In the off-diagonal scatter plots, we observe the relation across the first four principal components of the average embedded vectors in the real and synthetic text, along with the variance ratio explained by each component. The diagonal plots, on the other hand, show the distribution of each principal component for the real and synthetic texts plotted on top of each other. Similar real and synthetic scatter matrices and distribution plots indicate a higher semantic similarity score, which gives a user more confidence in replacing the original text with the synthetic text for semantic text classification tasks.
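A projection like the one behind these plots can be sketched with NumPy. This example assumes that PCA is fitted on the combined real and synthetic record embeddings and that both sets are then projected onto the same four components; the report's exact procedure may differ, and `pca_project` is a hypothetical helper.

```python
import numpy as np


def pca_project(real, synthetic, n_components: int = 4):
    """Project real and synthetic embeddings onto shared principal components.

    Returns (real_projection, synthetic_projection, explained_variance_ratio).
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    combined = np.vstack([real, synthetic])
    mean = combined.mean(axis=0)

    # SVD of the centered data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(combined - mean, full_matrices=False)
    components = vt[:n_components]

    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()

    real_proj = (real - mean) @ components.T
    synth_proj = (synthetic - mean) @ components.T
    return real_proj, synth_proj, explained_ratio


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(20, 8))    # stand-in for 512-dimensional embeddings
    synth = rng.normal(size=(15, 8))
    real_proj, synth_proj, ratio = pca_project(real, synth)
    print(real_proj.shape, synth_proj.shape, ratio)
```

Plotting each pair of projected columns as a scatter plot, and each single column as overlaid histograms, would give a scatter matrix of the kind described above.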

Text Structure Similarity Score

The Text Structure Similarity Score measures how closely the distributions of sentence count, words per sentence, and characters per word in the synthetic data mirror those in the original data. We use Jensen-Shannon divergence to calculate the distance between the real and synthetic distributions of each of these three statistics across the entire text dataset. Similar real and synthetic distributions result in a higher Text Structure Similarity Score, which is the scaled average of the three distance values. For better structure similarity, you can also change the maximum number of generated tokens in the config, which is 100 by default.
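The sketch below shows one way to turn per-record statistics into a score with Jensen-Shannon divergence. The binning into shared histograms, the base-2 logarithm (which bounds the divergence between 0 and 1), and the scaling `100 * (1 - average divergence)` are all assumptions made for illustration; the report's actual binning and scaling are not documented here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def shared_histograms(real: Sequence[float], synthetic: Sequence[float],
                      bins: int = 10) -> Tuple[List[float], List[float]]:
    """Bin both samples on the same edges and return normalized histograms."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def hist(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    return hist(real), hist(synthetic)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def structure_similarity(real_stats: Dict[str, Sequence[float]],
                         synth_stats: Dict[str, Sequence[float]]) -> float:
    """Hypothetical scaling: 100 * (1 - mean JS divergence over the statistics)."""
    divergences = [
        js_divergence(*shared_histograms(real_stats[name], synth_stats[name]))
        for name in real_stats
    ]
    return 100.0 * (1.0 - sum(divergences) / len(divergences))


if __name__ == "__main__":
    real = {"sentences": [1, 2, 2, 3], "words_per_sentence": [8, 10, 12, 9],
            "chars_per_word": [4.1, 4.8, 5.0, 4.4]}
    synth = {"sentences": [1, 2, 3, 3], "words_per_sentence": [7, 10, 11, 9],
             "chars_per_word": [4.0, 4.7, 5.2, 4.5]}
    print(structure_similarity(real, synth))
```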

Try it out!

You can generate and dig into a report yourself by selecting the “Generate natural language text using GPT” card on the Gretel Console. If you're looking to generate text from a prompt, use the Gretel Model Playground to prompt Gretel GPT.
