Classification
Evaluate synthetic data vs. real-world data on classification models
The classification Evaluate task generates a Gretel Synthetic Data Utility Report.
Learn more about the sections of the Data Utility Report
Customers frequently ask whether synthetic data is of high enough quality to train downstream machine learning models. Classifiers, for example, require highly accurate data before they can be usefully deployed.
The Gretel Evaluate Classification task uses the open-source AutoML library PyCaret under the hood to evaluate how well your generated synthetic data performs on commonly used downstream machine learning classifiers, and presents the results in an easy-to-understand HTML report.
Low-code using Gretel Console
You can kick off this evaluation directly in the Gretel Console. Start by using this example: Generate synthetic data + evaluate ML performance
This example includes a sample dataset (the publicly available bank marketing dataset) and the default blueprint:
Gretel LSTM model to generate synthetic data
classification Evaluate task with default parameters
You can leave the config as is and simply click "Begin training", or edit the configuration to use the synthetic model and optional classification parameters best suited to your use case.
Supported models and metrics
By default, all models will be used in the classifier model training. You can select specific models to use by passing in a list of strings from the following set:
To change the metric that the classifiers optimize for, select one metric from classification_metrics below. The default metric is "acc" (accuracy).
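As a rough illustration of how these optional parameters fit together, the sketch below builds an evaluation section as a plain Python dictionary. The helper `build_classification_evaluate`, the key names `task`, `target`, `models`, and `metric`, and the target column `y` are assumptions made for illustration, and `model_a` and `model_b` are placeholders rather than real model identifiers. Only the default metric "acc" comes from this page.

```python
from typing import Dict, List, Optional


def build_classification_evaluate(
    target: str,
    models: Optional[List[str]] = None,
    metric: str = "acc",
) -> Dict[str, object]:
    """Build an illustrative evaluation section for a classification task.

    The key names used here are assumptions, not a confirmed schema.
    """
    if not target:
        raise ValueError("target must name the column to predict")

    section: Dict[str, object] = {
        "task": "classification",
        "target": target,
        "metric": metric,  # "acc" (accuracy) is the documented default
    }

    # Leaving `models` out keeps the default of training every classifier.
    if models is not None:
        if len(models) == 0:
            raise ValueError("models must not be empty when provided")
        section["models"] = list(models)

    return section


# Default behaviour: all models, accuracy as the optimization metric.
default_section = build_classification_evaluate(target="y")

# Restricting the run to two placeholder model identifiers.
custom_section = build_classification_evaluate(
    target="y", models=["model_a", "model_b"]
)
```

Omitting `models` entirely, rather than passing an empty list, mirrors the documented default of training every supported classifier.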
SDK
You can use the classification Evaluate task in two ways:
1. As a parameter of a Gretel synthetics model, or
2. As a direct comparison of two datasets: a synthetic dataset and a real-world dataset
Option 1: Train and generate synthetic data, then evaluate on classification models
Here's a basic example that generates synthetic data using Gretel ACTGAN and the real-world bank marketing dataset, then adds classification evaluation to create the Data Utility Report:
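A minimal sketch of the configuration step, assuming the model config is a nested dictionary with a `models` list whose first entry is keyed by model type (`actgan` here) and accepts an `evaluate` section. That layout, the helper `with_classification_evaluate`, the stand-in `base_config`, and the target column `y` are all assumptions for illustration; a real ACTGAN config would carry more settings.

```python
import copy
from typing import Any, Dict


def with_classification_evaluate(
    config: Dict[str, Any], model_key: str, target: str
) -> Dict[str, Any]:
    """Return a copy of `config` with a classification evaluate section added."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    model_section = updated["models"][0][model_key]
    model_section["evaluate"] = {
        "task": "classification",  # run classification evaluation
        "target": target,          # column the classifiers should predict
    }
    return updated


# Stand-in for an ACTGAN configuration (assumed structure, not a real template).
base_config = {"models": [{"actgan": {"params": {}}}]}

config = with_classification_evaluate(base_config, model_key="actgan", target="y")
```

Working on a deep copy keeps the base config reusable, for example to train the same model with and without evaluation.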
You can then run the model and save the report using:
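The run step might look roughly like the sketch below. It assumes the Gretel Python client exposes `project.create_model_obj(...)`, `model.submit_cloud()`, and a `poll` helper in `gretel_client.helpers`; treat those names as assumptions and check them against the SDK version you have installed. The wrapper `train_and_evaluate` and its `wait` parameter are hypothetical, and saving the report is not shown because the exact call is not given on this page.

```python
from typing import Any, Callable, Optional


def train_and_evaluate(
    project: Any,
    config: dict,
    data_source: str,
    wait: Optional[Callable[[Any], None]] = None,
) -> Any:
    """Submit a model whose config includes evaluation, then wait for it."""
    if wait is None:
        # Imported lazily so this sketch loads without the SDK installed.
        # The import path is an assumption about the Gretel client.
        from gretel_client.helpers import poll

        wait = poll

    model = project.create_model_obj(model_config=config, data_source=data_source)
    model.submit_cloud()  # start the job in Gretel Cloud
    wait(model)  # block until training and evaluation complete
    return model
```

Passing `wait` explicitly makes it possible to swap in a different waiting strategy, or a stub when trying the function out without network access.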
Even when using the Evaluate SDK, you can find model details and report download options in the Gretel Console: simply navigate to the bank-marketing-classification-example project.
Option 2: BYO synthetic and real data to compare
If you have already generated synthetic data in the form of a CSV, JSON(L), or pandas DataFrame, you can also use this Evaluate task to compare it against the real-world dataset.
The Gretel SDK provides Python classes specifically for running reports. The DownstreamClassificationReport() class uses evaluate with the classification task to generate a Data Utility Report. Basic usage is shown below:
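A hedged sketch of that usage follows. The import path `gretel_client.evaluation.downstream_classification_report`, the constructor arguments `target`, `data_source`, `ref_data`, and `project`, and the `run()` method are assumptions about the SDK and should be verified against your installed version. The wrapper `run_classification_report` and its `report_cls` parameter are hypothetical and exist only to keep the example self-contained.

```python
from typing import Any, Optional


def run_classification_report(
    target: str,
    synthetic_data: Any,
    real_data: Any,
    project: Any = None,
    report_cls: Optional[type] = None,
) -> Any:
    """Compare a synthetic dataset with a real one on classification models."""
    if report_cls is None:
        # Imported lazily so this sketch loads without the SDK installed.
        # The module path is an assumption about the Gretel client.
        from gretel_client.evaluation.downstream_classification_report import (
            DownstreamClassificationReport,
        )

        report_cls = DownstreamClassificationReport

    kwargs = {
        "target": target,               # column the classifiers should predict
        "data_source": synthetic_data,  # synthetic dataset under evaluation
        "ref_data": real_data,          # real-world dataset to compare against
    }
    if project is not None:
        kwargs["project"] = project

    report = report_cls(**kwargs)
    report.run()  # assumed to block until the evaluation job finishes
    return report
```

With the SDK installed and configured, a call such as `run_classification_report("y", "synthetic.csv", "real.csv")` would be the intended usage; the file names here are placeholders.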
For more examples, see the Jupyter notebook or open it in Google Colab.
Gretel Synthetic Data Utility Report
The Evaluate task creates a Data Utility Report with the results of the analysis. You'll see a high-level ML Quality Score (MQS), which gives you an at-a-glance understanding of how your synthetic dataset performed. For more information about the report, check out the page describing each section.
Logs and Results
You can view logs in the SDK environment, or go to the project in the Gretel Console to follow the model training progress and download the results of the evaluation.