Tips to Improve Synthetic Data Quality
As shown in the outline below, there are many ways you can improve your quality score if it isn’t as high as you’d like it to be.
The synthetic data quality score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. If your use case depends on these properties, higher scores generally imply higher utility.
The number of training records used can directly impact the quality of the synthetic data created. The more examples available when training a model, the easier it is for the model to accurately learn the distributions and correlations in the data. Always strive for a minimum of 3,000 training examples; increasing that to 5,000 or even 50,000 is better.
The more synthetic records you generate, the easier it is to judge whether the statistical integrity of the data remains intact. If your Synthetic Data Quality Score isn't as high as you'd like and you've generated fewer than 5,000 synthetic records, generating more is a good way to determine whether there really is a quality issue. If there is, read on for a multitude of ideas for improving your model.
As in any statistical or machine learning analysis, the first step is to clean up your data. Tending to the following issues can be vital in creating quality synthetic data:
Assess the extent of your missing values. A moderate amount of missing data can be easily handled by the synthetic model. An excessive amount can lead to difficulties in accurately learning the statistical structure of the data. Decide whether it's appropriate to drop columns or rows with missing data, or whether it's better to fill in the missing fields using, for example, the column median or techniques such as KNN imputation.
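As a minimal sketch of this triage (the DataFrame and the 50% drop threshold are illustrative assumptions, not a prescription):

```python
import pandas as pd

# Hypothetical training data with missing values
df = pd.DataFrame({
    "age": [25, None, 41, 33, None],
    "income": [50000, 62000, None, 58000, 71000],
})

# Report the fraction of missing values per column
missing_frac = df.isna().mean()

# Drop columns that are mostly missing (the 0.5 cutoff is a judgment call)
df = df.loc[:, missing_frac < 0.5]

# Fill remaining numeric gaps with the column median
df = df.fillna(df.median(numeric_only=True))
```

For more sophisticated imputation, scikit-learn's `KNNImputer` is one option, at the cost of extra compute on wide datasets.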
Study the correlation matrix in the Synthetic Performance Report, and remove unnecessary, highly correlated fields. This is particularly important when the dataset contains a large number of columns.
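One common way to find redundant columns is to scan the upper triangle of the absolute correlation matrix and drop any column that is nearly a copy of an earlier one (the synthetic data and the 0.95 threshold here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": x * 2 + 0.01 * rng.normal(size=200),  # almost perfectly correlated with x
    "y": rng.normal(size=200),
})

# Absolute correlation matrix; keep only the upper triangle to avoid counting pairs twice
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column whose correlation with an earlier column exceeds 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```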
A large number of highly unique fields, or just one highly unique field that is exceptionally long, such as an ID, can cause the model to struggle in its attempt to learn the patterns in the data. If possible, consider removing the field or fields before synthetic data generation, and adding them back in afterwards.
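The remove-then-reattach pattern can be sketched as follows (the column names and the sequential ID scheme are hypothetical; the model-training step is elided):

```python
import pandas as pd

df = pd.DataFrame({
    "record_id": ["A-0001", "A-0002", "A-0003"],  # highly unique field
    "value": [10, 20, 30],
})

# Set the highly unique ID aside before training
train_df = df.drop(columns=["record_id"])

# ... train the synthetic model on train_df and generate synth_df ...
synth_df = train_df.copy()  # stand-in for the generated records

# Reattach fresh IDs afterwards (here: simple sequential IDs)
synth_df["record_id"] = [f"A-{i:04d}" for i in range(1, len(synth_df) + 1)]
```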
If training records are duplicated in error, any statistical analysis of the data will be impacted. A large number of duplicated records can also lead the model to treat the duplication as a pattern it needs to learn, potentially replaying private information in the generated synthetic data.
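Checking for and removing exact duplicates is a one-liner in pandas (the data here is a toy example):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ana", "bob", "ana"],
    "zip": ["10001", "94110", "10001"],
})

# Count exact duplicate rows, then drop them
n_dupes = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
```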
Assess whether there are anomalies in the data, and if so, whether they are errors or true data points. In general, the synthetic model is robust to anomalies, but when it isn't, the replay of an anomaly in the synthetic data can potentially constitute a serious breach of privacy.
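One simple, robust way to surface candidate anomalies for manual review is the interquartile-range (IQR) fence; this is a generic screening technique, not a specific Gretel feature, and the data below is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 12.0, 9.0, 11.0, 10.5, 500.0]})

# IQR fences flag values far outside the bulk of the distribution
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)
anomalies = df[mask]
```

Whether flagged rows should be dropped, corrected, or kept is a domain decision, but each one is worth a look before training.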
If a long categorical field can be replaced with a simpler set of integer labels, do so. If a long numeric field can be grouped into a smaller number of discrete bins, do so. If a floating point value has excessive levels of precision, remove them. This step is rarely needed, but if you find the model is struggling it may help improve performance.
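All three simplifications can be done with standard pandas operations (the columns, bin edges, and labels below are assumptions for the sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "income": [48250.1234, 61875.5678, 99400.9999, 23100.4321],
})

# Replace a categorical field with integer labels
df["color"] = df["color"].astype("category").cat.codes

# Group a wide numeric field into a small number of discrete bins
df["income_bin"] = pd.cut(df["income"], bins=[0, 30000, 60000, 120000],
                          labels=["low", "mid", "high"])

# Trim excessive floating point precision
df["income"] = df["income"].round(2)
```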
Test reading your training CSV file into a Pandas DataFrame. Sanity check that the columns are as expected and that there are no warnings about columns with mixed data types.
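A minimal version of that sanity check (an in-memory CSV stands in for your real file, which you would read with `pd.read_csv("train.csv")`):

```python
import io
import pandas as pd

# Stand-in for your actual training file
csv_text = "age,city\n34,Austin\n28,Boston\n"
df = pd.read_csv(io.StringIO(csv_text))

# Expected columns present, and numeric columns parsed as numbers, not "object"
assert list(df.columns) == ["age", "city"]
print(df.dtypes)
```

A column that should be numeric but comes back with dtype `object` usually signals stray strings or mixed types worth cleaning before training.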
Fields with an average length of more than 30 characters can cause the model to struggle. Consider whether the field is really necessary. If so, consider anonymizing that field separately with our Blueprint that automatically finds PII in text and uses atomic transformations to anonymize. The remaining fields can be generated with a synthetic model, and the troublesome field added back in afterwards.
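A quick way to find such fields is to compute the average string length per column (the free-text column below is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({"notes": [
    "Patient reported mild headache after the second dose.",
    "No adverse events were observed during follow-up.",
    "Follow-up visit scheduled for next month.",
]})

# Flag free-text columns whose average length exceeds ~30 characters
avg_len = df["notes"].str.len().mean()
too_long = avg_len > 30
```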
When one field is a derivation of other fields, the integrity of that relationship may not be maintained in the synthetic data. One example is if a field is the sum of two or more other fields. Another is if one date is always a certain number of days away from another date. If maintaining the relationship is important, remove the derived field from the training dataset, and recompute it after synthetic data generation.
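For the sum example, the pattern looks like this (column names and the training step are illustrative stand-ins):

```python
import pandas as pd

# "total" is derived: always subtotal + tax, so drop it before training
df = pd.DataFrame({
    "subtotal": [100.0, 250.0],
    "tax": [8.0, 20.0],
    "total": [108.0, 270.0],
})
train_df = df.drop(columns=["total"])

# ... train the model and generate; synth_df stands in for the output ...
synth_df = train_df.copy()

# Recompute the derived field so the relationship holds exactly
synth_df["total"] = synth_df["subtotal"] + synth_df["tax"]
```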
When the distribution of one field is vital to the dataset, such as a date in a time series, and exactness of the distribution is not maintained in the synthetic data, consider using that field as a "seed" for generating the remaining fields. Refer to our blog on smart seeding for more information.
When a field is highly unbalanced and you wish to mediate that, refer to our blog on smart seeding for how to use seeds to balance your data. If the synthetic data is to be used to build an AI system and the field in question is unbalanced due to demographic bias, refer to our blog on automatically reducing AI bias. If the field in question is also the target of a machine learning task, refer to our blog for boosting a massively imbalanced dataset that uses a SMOTE like technique to steer the synthetic model to offset the imbalance.
The synthetic model actually thrives on structure; however, if you have a field with an excessively long, complicated structure, consider dividing that field into multiple easier fields.
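For example, a composite code with a fixed delimiter can be split into simpler component fields (the `"<state>-<year>-<sequence>"` format here is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({"case_code": ["TX-2021-00417", "CA-2019-00032"]})

# Split the composite field into simpler fields the model can learn independently
df[["state", "year", "seq"]] = df["case_code"].str.split("-", expand=True)
df = df.drop(columns=["case_code"])
```

The pieces can be concatenated back together after generation if the original format is needed downstream.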
The following tips are more targeted at improving synthetic data quality for text or natural language:
Making your training data more consistent in length can help with better model training and results.
Check that your training data is natural language. Most large language models are trained primarily on natural language, so if your training data contains non-text content, consider cleaning it first.
You can substitute a different pretrained model in Gretel's GPTx model configuration. A larger or newer model may generate better results because of more training data and/or other model improvements, such as a newer architecture.
If your training data is in a non-English language, consider substituting a model in Gretel's GPTx configuration that has been trained on that language (e.g., a Korean version of GPT). Or, if your training data serves a particular use case, there may be pretrained models more specific to it that generate better results (e.g., a biomedical text model or an instruction model).