Synthetic Data Utility
Analyze the performance of your synthetic data on ML models
The Data Utility Report gives both a summarized and a detailed view of how synthetic and real data perform when used to train and evaluate ML classification/regression models. To get a Data Utility Report, first set up an Evaluate classification task or an Evaluate regression task.
When one of these tasks is selected, synthetic and real data are used separately to train the downstream models, and the results from both are shown in the Data Utility Report for easy comparison.
The Machine Learning Quality Score (MQS) provides an at-a-glance comparison of synthetic and real data. Synthetic and real data are used to separately train downstream models (classifiers or regression models), using the open-source AutoML PyCaret library.
The results from the training are aggregated, and then the top models' results are used to calculate the MQS. The MQS is the ratio of the average score from the top-performing models trained on synthetic data to the average score from the top-performing models trained on real data.
Sometimes, MQS may be >100%, indicating that synthetic data outperformed real data - an exciting result! The MQS gives you a quick evaluation of whether your synthetic data is ready to be used for your ML training, experimentation and deployment pipeline.
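For illustration, here is a minimal sketch of the ratio behind the MQS, assuming `synthetic_scores` and `real_scores` hold the metric values of the top-performing models from each training run (the function and variable names are illustrative, not part of the product API):

```python
# Minimal sketch of the MQS ratio. The function and variable names are
# illustrative; they are not part of the Data Utility Report's API.
def machine_learning_quality_score(synthetic_scores, real_scores):
    """Ratio of the average top-model score on synthetic data
    to the average top-model score on real data, as a percentage."""
    synthetic_avg = sum(synthetic_scores) / len(synthetic_scores)
    real_avg = sum(real_scores) / len(real_scores)
    return 100 * synthetic_avg / real_avg

# Example: accuracies of the three best models from each training run
print(machine_learning_quality_score([0.91, 0.90, 0.89], [0.93, 0.92, 0.91]))  # ~97.8
```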
It's important to know the size and shape of the two datasets, which is what the summary of data stats shows. The number of rows in the real and synthetic datasets should always be equivalent, since this contributes to a reliable comparison of ML performance. If your input data had more than 5000 rows, both datasets may be downsampled to 5000 rows each so that ML training and evaluation run quickly when generating the report.
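A rough sketch of that downsampling step, assuming pandas DataFrames (the variable names and random seed are illustrative):

```python
# Rough sketch of the row-limit downsampling, assuming pandas DataFrames.
# The 5,000-row limit comes from the report; names and seed are illustrative.
import numpy as np
import pandas as pd

MAX_ROWS = 5000

def downsample(df: pd.DataFrame, n: int = MAX_ROWS) -> pd.DataFrame:
    """Randomly sample n rows when a dataset exceeds the row limit."""
    return df.sample(n=n, random_state=42) if len(df) > n else df

# Example: a 12,000-row dataset is reduced to 5,000 rows
demo = pd.DataFrame({"x": np.arange(12_000)})
print(len(downsample(demo)))  # 5000
```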
Next is a visual representation of the three best models trained on synthetic data vs. the three best models trained on real-world data. The averages of these scores are used to calculate the MQS.
In a classification report, the default metric is "accuracy", which is also reflected in the report, but you can change this in the metric parameter of the Evaluate classification task.
In a regression report, the default metric is "R2", but you can indicate a different metric to optimize the regression models in the metric parameter of the Evaluate regression task.
The metric that was used is shown in the header of this section of the Data Utility Report, in the form: Top models by [metric] - synthetic vs. real data
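Since the report's model comparison is built on PyCaret, the effect of the metric parameter can be sketched directly with PyCaret's classification API. The demo dataset and settings below are placeholders, not the Evaluate task's configuration:

```python
# Sketch of ranking models by a non-default metric with PyCaret, the AutoML
# library the report uses. The demo dataset and settings are placeholders,
# not the Evaluate task's configuration.
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

data = get_data("juice")  # small built-in demo dataset
setup(data=data, target="Purchase", session_id=123, verbose=False)

# Rank models by F1 instead of the default accuracy and keep the top three
top_models = compare_models(sort="F1", n_select=3)
```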
All metrics, not just the one selected for model optimization, are available in this table of results. If you selected specific models to train in the configuration of the Evaluate task, you'll see those models here. If you didn't indicate a subset, you'll see all the downstream models' results here.
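In PyCaret terms, that full table corresponds to the model comparison grid, and restricting training to a subset of models maps to an include list. A hedged sketch, with an illustrative dataset and model shortcodes:

```python
# Sketch of retrieving the full scoring grid and restricting the model subset
# with PyCaret. The dataset, target, and model shortcodes are illustrative.
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, pull

setup(data=get_data("juice"), target="Purchase", session_id=123, verbose=False)

# Train only logistic regression, random forest, and LightGBM
compare_models(include=["lr", "rf", "lightgbm"], sort="Accuracy")

# pull() returns the comparison grid with every metric, not just the sorted one
results_table = pull()
print(results_table)
```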