Regression
Evaluate synthetic data vs. real-world data on regression models
The regression Evaluate task generates a Gretel Synthetic Data Utility Report. It uses the open-source AutoML PyCaret library to evaluate the quality of your generated synthetic data on commonly used ML regression models, and gives you the results in an easy-to-understand HTML report.
You can kick off this evaluation directly in the Gretel Console. Start with this workflow:
Generate synthetic data + evaluate ML performance
To use this blueprint, click "Edit" in the configuration editor and change the parameters to fit your dataset. You can also change the synthetic model from the Gretel LSTM default to any other synthetic model.
Example configuration with Gretel LSTM:
schema_version: "1.0"
name: "synthetic-evaluate"
models:
  - synthetics:
      data_source: __tmp__
      params:
        epochs: auto
        vocab_size: auto
        learning_rate: auto
        batch_size: auto
        rnn_units: auto
      generate:
        num_records: 5000
      privacy_filters:
        outliers: auto
        similarity: auto
      evaluate:
        task: regression
        target: "age"  # Change this to reflect the target/label header of your dataset
You can copy the configuration above and edit it to fit your use case. Then, click "Begin training" to kick off model training.
By default, all of the models below are trained to create the evaluation results. You can select specific models to use by passing in a list of strings from the following set (see the sketch after the list for one way to do this):
# All regression models
regression_models = [
    "lr",        # Linear Regression
    "lasso",     # Lasso Regression
    "ridge",     # Ridge Regression
    "en",        # Elastic Net
    "lar",       # Least Angle Regression
    "llar",      # Lasso Least Angle Regression
    "omp",       # Orthogonal Matching Pursuit
    "br",        # Bayesian Ridge
    "ard",       # Automatic Relevance Determination
    "par",       # Passive Aggressive Regressor
    "ransac",    # Random Sample Consensus
    "tr",        # TheilSen Regressor
    "huber",     # Huber Regressor
    "kr",        # Kernel Ridge
    "svm",       # Support Vector Regression
    "knn",       # K Neighbors Regressor
    "dt",        # Decision Tree Regressor
    "rf",        # Random Forest Regressor
    "et",        # Extra Trees Regressor
    "ada",       # AdaBoost Regressor
    "gbr",       # Gradient Boosting Regressor
    "mlp",       # MLP Regressor
    "xgboost",   # Extreme Gradient Boosting
    "lightgbm",  # Light Gradient Boosting Machine
    "dummy"      # Dummy Regressor
]
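For example, to restrict the evaluation to a handful of models, you could set a subset when building the configuration in the SDK. This is a minimal sketch only: the models key under evaluate is an assumption here for illustration, while the SDK report class covered later on this page accepts a models parameter explicitly.
from gretel_client.projects.models import read_model_config

# Start from the standard Gretel LSTM blueprint
config = read_model_config("synthetics/tabular-lstm")

# Sketch: evaluate only a few downstream regressors.
# NOTE: the "models" key under "evaluate" is an assumption for illustration;
# the DownstreamRegressionReport class shown later takes a models list explicitly.
config["models"][0]["synthetics"]["evaluate"] = {
    "task": "regression",
    "target": "age",  # change this to your dataset's target column
    "models": ["lr", "ridge", "rf", "xgboost"],
}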
If you want to change the metric that the downstream regression models are optimized and ranked by, you can select one metric from regression_metrics below. The default metric is "r2" (R-squared).
# Select a metric
regression_metrics = [
    "mae",    # Mean Absolute Error
    "mse",    # Mean Squared Error
    "rmse",   # Root Mean Squared Error
    "r2",     # R-squared
    "rmsle",  # Root Mean Squared Log Error
    "mape"    # Mean Absolute Percentage Error
]
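Similarly, a different metric can be selected when building the configuration in the SDK. As above, treat the metric key in the evaluate config as an assumption for illustration; the SDK report class shown later on this page accepts an equivalent metric parameter.
from gretel_client.projects.models import read_model_config

config = read_model_config("synthetics/tabular-lstm")

# Sketch: rank downstream regression results by MAE instead of R-squared.
# NOTE: the "metric" key under "evaluate" is an assumption for illustration only.
config["models"][0]["synthetics"]["evaluate"] = {
    "task": "regression",
    "target": "age",
    "metric": "mae",
}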
You can use the regression Evaluate task in two ways:
1. As a parameter of a Gretel synthetics model, or
2. By comparing two datasets directly: a synthetic dataset and a real-world dataset.
Here's a basic example that generates synthetic data using Gretel LSTM and the publicly available heart disease dataset, then adds a regression evaluation to create the Data Utility Report:
from gretel_client.helpers import poll
from gretel_client.projects.models import read_model_config
from gretel_client.projects import create_or_get_unique_project
# Create a project with a name that describes this use case
project = create_or_get_unique_project(name="heart-disease-regression-example")
# You can modify this to select a dataset of your choice
dataset_path = "https://gretel-datasets.s3.amazonaws.com/processed_cleveland_heart_disease_uci/data.csv"
# Modify Gretel LSTM config to add Evaluate task
config = read_model_config("synthetics/tabular-lstm")
config["models"][0]["synthetics"]["evaluate"] = {
"task": 'regression',
"target": 'thalach', # maximum heart rate column -- change this for your dataset!
}
You can then run the model and save the report using:
# Train and run the model
model = project.create_model_obj(
    model_config=config,
    data_source=dataset_path,
)
model.submit_cloud()
poll(model)
# Save all artifacts
model.download_artifacts("/tmp")
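Once the job completes, the Data Utility Report is included in the downloaded artifacts. The exact artifact filenames are not listed here, so the sketch below simply scans the download directory for report files:
import glob

# Sketch: look for report artifacts in the download directory.
# Filenames are not guaranteed; adjust the pattern for your run.
for path in sorted(glob.glob("/tmp/*report*")):
    print(path)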
Even when using the Evaluate SDK, you can find model details and report download options in the Gretel Console -- simply navigate to the heart-disease-regression-example project.
If you already have generated synthetic data in the form of a CSV, JSON(L) file, or pandas DataFrame, you can also use the Evaluate task to compare your synthetic dataset directly against the real-world dataset.
The Gretel SDK provides Python classes specifically for running reports. The DownstreamRegressionReport() class uses the Evaluate regression task to generate a Data Utility Report. Basic usage is shown below:
# Use Evaluate SDK
from gretel_client.evaluation.downstream_regression_report import DownstreamRegressionReport
# Create a project
from gretel_client.projects import create_or_get_unique_project
project = create_or_get_unique_project(name="heart-disease-regression-example-2")
# Params
# NOTE: These data sources may also be Pandas DataFrames!
data_source = "synthetic.csv"
ref_data = "real.csv"
# Target to predict, REQUIRED for evaluate model
target = "target"
# Default holdout value
# test_holdout = 0.2
# Supply a subset if you do not want all of these, default is to use all of them
# models = regression_models
# Metric to use for ordering results, default is "r2" (R-squared) for regression.
# metric = "r2"
# Create a downstream regression report
evaluate = DownstreamRegressionReport(
    project=project,
    target=target,
    data_source=data_source,
    ref_data=ref_data,
    # holdout=test_holdout,
    # models=models,
    # metric=metric,
    # runner_mode="cloud",
)
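The report object still needs to be executed. The sketch below assumes the run()/as_html/as_dict interface that Gretel's other SDK report classes (such as QualityReport) expose; verify the exact attribute names against your client version.
# Run the downstream regression evaluation.
evaluate.run()

# Assumed accessors, following the pattern of other Gretel SDK report classes:
# save the HTML Data Utility Report and print the summary dictionary.
with open("downstream_regression_report.html", "w") as f:
    f.write(evaluate.as_html)
print(evaluate.as_dict)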
The Evaluate task creates a Data Utility Report with the results of the analysis. You'll see a high-level ML Quality Score (MQS), which gives you an at-a-glance understanding of how your synthetic dataset performed. For more info about each section of the report, check out this page.
You can view logs in the SDK environment, or go to the project in the Gretel Console to follow along with model training progress and download the results of the evaluation.