Tips to Improve Synthetic Data Quality
As shown in the outline below, there are many ways you can improve your quality score if it isn’t as high as you’d like it to be.
The synthetic data quality score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. If your use case depends on these statistical properties, higher scores imply higher utility.
Gretel's default synthetic data configuration is designed to work with a variety of datasets. That said, if your model does not train correctly with your source dataset, or you would like to improve your score, we recommend starting by trying one of our other managed configurations on GitHub.
Increase your training data
The number of training records used can directly impact the quality of the synthetic data created. The more examples available when training a model, the easier it is for the model to accurately learn the distributions and correlations in the data. Strive for a minimum of 3000 training examples; 5000 or even 50,000 is better still.
Increase your synthetic data
The more synthetic records generated, the easier it is to assess whether the statistical integrity of the data remains intact. If your Synthetic Data Quality Score isn't as high as you'd like and you've generated fewer than 5000 synthetic records, generating more is a good way to determine whether there really is a quality issue. If there is, read below for a multitude of ideas for improving your model.
Clean your data first
As in any statistical or machine learning analysis, the first step is to clean up your data. Tending to the following issues can be vital in creating quality synthetic data:
Handle missing values
Assess the extent of your missing values. A moderate amount of missing data can be easily handled by the synthetic model. An excessive amount can lead to difficulties in accurately learning the statistical structure of the data. Decide if it's appropriate to drop columns or rows with missing data, or if it's more appropriate to fill in the missing fields using, for example, the median value or a technique such as KNN imputation.
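As a minimal sketch of both approaches using pandas and scikit-learn (the file and column names here are hypothetical):

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Drop columns where more than half the values are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Simple option: fill a numeric column with its median.
df["age"] = df["age"].fillna(df["age"].median())  # "age" is a hypothetical column

# Alternative: KNN imputation across all numeric columns.
numeric_cols = df.select_dtypes("number").columns
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```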
Remove redundant fields
Study the correlation matrix in the Synthetic Performance Report, and remove unnecessary, highly correlated fields. This is particularly important when the dataset contains a large number of columns.
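A common recipe for dropping one column from each highly correlated pair, sketched with pandas (the 0.95 threshold and file name are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Absolute pairwise correlations between numeric columns.
corr = df.select_dtypes("number").corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair correlated above 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```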
Consider removing highly unique fields
A large number of highly unique fields, or just one highly unique field that is exceptionally long, such as an ID, can cause the model to struggle in its attempt to learn the patterns in the data. If possible, consider removing the field or fields before synthetic data generation, and adding them back in afterwards.
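A minimal sketch of setting such a field aside with pandas (the column and file names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Set aside a highly unique field before training the synthetic model.
ids = df.pop("transaction_id")  # "transaction_id" is a hypothetical column

# ... train the synthetic model on df and generate synthetic_df ...
# Afterwards, reattach IDs by position (or generate fresh IDs instead,
# if reusing the originals would leak sensitive values):
# synthetic_df["transaction_id"] = ids.iloc[:len(synthetic_df)].values
```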
Remove duplicate records
If training records are duplicated in error, then any statistical analysis of the data will be impacted. A large number of duplicated records can also cause the model to treat the duplication as a pattern it needs to learn, potentially resulting in the duplication of private information in the generated synthetic data.
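Dropping duplicates is a one-liner in pandas (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

n_before = len(df)
df = df.drop_duplicates()
print(f"Removed {n_before - len(df)} duplicate records")
```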
Deal with anomalies
Assess whether there are anomalies in the data, and if so, whether they are errors or true data points. In general, the synthetic model will usually be robust to anomalies, but when it is not, the replay of an anomaly in the synthetic data can potentially constitute a serious breach in privacy.
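One simple way to surface candidate anomalies for manual review is an interquartile-range check, sketched below with pandas (the column and file names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name
col = df["amount"]  # "amount" is a hypothetical numeric column

# Flag values outside 1.5 * IQR as candidate anomalies for review.
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print(f"{len(outliers)} candidate anomalies to review")
```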
Simplify your data where possible
If a long categorical field can be replaced with a simpler set of integer labels, do so. If a long numeric field can be grouped into a smaller number of discrete bins, do so. If a floating point value has excessive precision, trim it. This step is rarely needed, but if you find the model is struggling it may help improve performance.
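Each of these simplifications is a one-liner in pandas; a sketch with hypothetical column and file names:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Replace a long categorical field with integer labels.
df["status_code"] = df["status"].astype("category").cat.codes

# Group a wide numeric range into a small number of discrete bins.
df["income_bin"] = pd.cut(df["income"], bins=10, labels=False)

# Trim excessive floating-point precision.
df["latitude"] = df["latitude"].round(4)
```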
Check CSV formats and column data types
Test reading your training CSV file into a Pandas DataFrame. Sanity check that the columns are as expected and that there are no warnings about columns with multiple data types.
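For example (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Confirm the expected columns and inferred dtypes; a column that should be
# numeric but shows up as "object" often signals mixed types or stray strings.
print(df.shape)
print(df.dtypes)
print(df.head())
```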
Handling Tough Fields
Exceptionally long fields
Fields with an average length of more than 30 characters can cause the model to struggle. Consider whether the field is really necessary. If it is, consider anonymizing that field separately with our Blueprint, which automatically finds PII in text and uses atomic transformations to anonymize it. The remaining fields can be generated with a synthetic model and the troublesome field added back in afterwards.
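A sketch of finding long fields and setting one aside with pandas (the "notes" column and file name are hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Find text fields whose average length exceeds ~30 characters.
text_cols = df.select_dtypes("object").columns
avg_len = df[text_cols].apply(lambda s: s.astype(str).str.len().mean())
print(avg_len[avg_len > 30])

# Set a long field aside to anonymize separately, and train on the rest.
long_field = df.pop("notes")  # "notes" is a hypothetical column
```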
Connected fields
When one field is derived from other fields, the integrity of that relationship may not be maintained in the synthetic data. One example is a field that is the sum of two or more other fields. Another is a date that is always a certain number of days away from another date. If maintaining the relationship is important, remove the derived field from the training dataset, and recompute it after synthetic data generation.
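A minimal sketch with pandas, assuming a hypothetical "total" field derived from "subtotal" and "tax":

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# "total" is derived from other fields; drop it before training so the
# model doesn't have to learn the arithmetic identity.
df = df.drop(columns=["total"])

# ... train the model and generate synthetic_df ...
# Then recompute the derived field exactly:
# synthetic_df["total"] = synthetic_df["subtotal"] + synthetic_df["tax"]
```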
Fields with highly critical distributions
When the distribution of one field is vital to the dataset, such as a date field in a time series, and the exactness of that distribution is not maintained in the synthetic data, consider using that field as a "seed" for generating the remaining fields. Refer to our blog on smart seeding for more information.
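The seeding workflow itself is covered in the blog post; at the data level, preparing seeds amounts to extracting the critical column so its exact values can condition the generation of the remaining fields. A minimal, hypothetical sketch:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Pull out the critical column (a hypothetical time-series date field) so
# its exact values can be supplied as seeds during generation.
seeds = df[["event_date"]]
seeds.to_csv("seeds.csv", index=False)
```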
Highly unbalanced fields
When a field is highly unbalanced and you wish to mitigate that, refer to our blog on smart seeding for how to use seeds to balance your data. If the synthetic data will be used to build an AI system and the field in question is unbalanced due to demographic bias, refer to our blog on automatically reducing AI bias. If the field in question is also the target of a machine learning task, refer to our blog on boosting a massively imbalanced dataset, which uses a SMOTE-like technique to steer the synthetic model to offset the imbalance.
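Before choosing an approach, it can help to quantify the imbalance; a sketch with a hypothetical "label" column:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Measure the class balance of a hypothetical target field before deciding
# whether to use seeding or a SMOTE-like oversampling approach.
counts = df["label"].value_counts(normalize=True)
print(counts)
print(f"Majority/minority ratio: {counts.iloc[0] / counts.iloc[-1]:.1f}x")
```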
Highly structured fields
The synthetic model actually thrives on structure; however, if you have a field with an excessively long, complicated structure, consider dividing that field into multiple simpler fields.
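A sketch of splitting one structured field into simpler parts (the field format and all names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Split a hypothetical "US-CA-94110-STORE12" style field into simple parts.
parts = df["location_code"].str.split("-", expand=True)
parts.columns = ["country", "state", "zip", "store"]
df = pd.concat([df.drop(columns=["location_code"]), parts], axis=1)
```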
Tips for natural language data
The following tips are more targeted at improving synthetic data quality for text or natural language data:
Consider truncating each row of data
Making the records in your training data more consistent in length can improve model training and generation results.
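For example, truncating a hypothetical free-text column with pandas:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Truncate a hypothetical free-text column to a consistent maximum length.
df["text"] = df["text"].str.slice(0, 500)
```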
Remove non-standard characters
Check that your training data is text. Most large language models are trained primarily on natural language, so if your training data contains non-text content such as markup, control characters, or encoding artifacts, consider cleaning it first.
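A sketch of stripping non-standard characters from a hypothetical text column:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Remove control characters and other non-standard debris from a
# hypothetical free-text column, keeping ordinary letters, digits,
# whitespace, and punctuation.
df["text"] = df["text"].str.replace(r"[^\w\s.,;:!?'\"()-]", "", regex=True)
```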
Try a larger or newer model
You can substitute a different pretrained model in Gretel's GPTx model configuration. A larger or newer model may generate better results because it was trained on more data and/or incorporates other model improvements such as a newer architecture.
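As an illustrative sketch, you might swap the pretrained model by editing the configuration programmatically. The "gpt_x" and "pretrained_model" keys below are assumptions based on Gretel's published GPTx config format, so check the current docs for the exact names:

```python
import yaml

# Load a Gretel GPTx configuration and swap in a different pretrained model.
# Key names are assumptions; verify against the current Gretel docs.
with open("gpt_config.yml") as f:  # hypothetical file name
    config = yaml.safe_load(f)

config["models"][0]["gpt_x"]["pretrained_model"] = "EleutherAI/gpt-neo-2.7B"

with open("gpt_config_custom.yml", "w") as f:
    yaml.safe_dump(config, f)
```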
Try a different model trained on your use case
If your training data is in a non-English language, consider substituting a model in Gretel's GPTx configuration that has been trained on that language (e.g. a Korean version of GPT). Similarly, if your training data has a particular use case, a pretrained model more specific to that use case (e.g. a biomedical text model or an instruction-tuned model) may generate better results.