
Synthesize Tabular Data

Use Gretel's LSTM model to generate tabular synthetic data.
In this example, we will use the Gretel LSTM model to create tabular synthetic data, training the model on a United States Census dataset of adult income records. You can synthesize tabular data using any of Gretel's interfaces.
CLI

Create Project

First, we will create a project to host your model and artifacts.
gretel projects create --display-name synth-tabular-data --set-default

Get Training Data

Download and preview the dataset we will train the synthetic model on.
wget https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv
head -n 10 USAdultIncome5k.csv
The head command previews the first 10 lines of the dataset we will synthesize (the header row plus nine records).
age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
42,Private,255847,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,4386,0,48,United-States,>50K
34,Private,111567,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,40,United-States,<=50K
34,Private,263307,Bachelors,13,Never-married,Sales,Unmarried,Black,Male,0,0,45,?,<=50K
69,Private,174474,10th,6,Separated,Machine-op-inspct,Not-in-family,White,Female,0,0,28,Peru,<=50K
26,Private,260614,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
29,Private,201155,9th,5,Never-married,Sales,Not-in-family,White,Female,0,0,48,United-States,<=50K
20,?,124242,Some-college,10,Never-married,?,Own-child,White,Female,0,0,40,United-States,<=50K
26,Private,60722,Bachelors,13,Never-married,Prof-specialty,Own-child,Asian-Pac-Islander,Female,0,0,40,United-States,<=50K
28,Private,331381,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,40,United-States,<=50K
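The same preview can be done in Python with pandas. A minimal sketch using two of the sample rows shown above (inlined here so the snippet runs without downloading the file):

```python
import io

import pandas as pd

# Inline sample of the census dataset (header plus two of the rows shown above)
sample_csv = """age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
42,Private,255847,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,4386,0,48,United-States,>50K
34,Private,111567,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,40,United-States,<=50K
"""

# nrows limits how many records are read, mirroring `head -n 10`
df = pd.read_csv(io.StringIO(sample_csv), nrows=10)
print(df.shape)  # (2, 15): two records, fifteen columns
```

To preview the real file, pass the dataset URL or the downloaded CSV path to pd.read_csv instead of the StringIO buffer.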

Train the synthetic model

gretel models create --runner cloud --config synthetics/tabular-lstm --in-data USAdultIncome5k.csv --output . --name synth-income-model

Outputs

The --output parameter specifies where the model artifacts will be saved. In this example, --output . creates several files in your local directory. For models trained in the Gretel Cloud, model artifacts can be downloaded at any time with the following command: gretel models get --model-id [model id] --output . The following model artifacts are created:
data_preview.gz: A preview of your synthetic dataset in CSV format.
report.html.gz: An HTML report that offers deep insight into the quality of the synthetic model.
report-json.json.gz: A JSON version of the synthetic quality report, useful for validating synthetic model quality programmatically.
logs.json.gz: Log output from the synthetic worker, helpful for debugging.
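The artifacts are ordinary gzip files, so they can be inspected with standard tooling. A sketch of reading a data preview with pandas, using a small stand-in file since the real data_preview.gz comes from your own training run:

```python
import gzip

import pandas as pd

# Stand-in for data_preview.gz -- in practice this file is produced by the training job
with gzip.open("data_preview.gz", "wt") as f:
    f.write("age,income_bracket\n42,>50K\n34,<=50K\n")

# pandas infers gzip compression from the .gz extension and decompresses transparently
preview = pd.read_csv("data_preview.gz")
print(preview)
```

The same pattern works for any of the gzipped CSV artifacts listed above.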

Generate synthetic data

Now we will use our trained synthetic model to generate more synthetic data. Copy the model ID returned by the gretel models create command.
gretel records generate --model-id [model id] --runner cloud --num-records 5000 --max-invalid 5000 --output .
The following model artifacts are created during a generation job:
data.gz: Your synthetic dataset in CSV format.
logs.json.gz: Log output from the synthetic worker, helpful for debugging.
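Once downloaded, the generated dataset can be previewed from the shell without permanently decompressing it. A sketch, using a stand-in data.gz since the real file comes from your own generation job:

```shell
# Create a stand-in data.gz (in practice this file is produced by the generate job)
printf 'age,income_bracket\n42,>50K\n' | gzip > data.gz

# gunzip -c streams the decompressed CSV to stdout, leaving the archive in place
gunzip -c data.gz | head -n 2
```
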

Create tabular synthetic data with the Python SDK

This notebook will walk you through the process of creating your own synthetic data using Gretel's Python SDK from a CSV or a DataFrame of your choosing.
To run this notebook, you will need an API key from the Gretel Console.

Getting Started

%%capture
!pip install -U gretel-client
# Specify your Gretel API key
import pandas as pd
from gretel_client import configure_session
pd.set_option("display.max_colwidth", None)
configure_session(api_key="prompt", cache="yes", validate=True)
# Create a project
from gretel_client.projects import create_or_get_unique_project
project = create_or_get_unique_project(name="synthetic-data")

Create the synthetic data configuration

Load the default configuration template. This template will work well for most datasets. View other synthetic configuration templates here.
import json
from gretel_client.projects.models import read_model_config
config = read_model_config("synthetics/default")
# Set the model epochs to 50
config["models"][0]["synthetics"]["params"]["epochs"] = 50
print(json.dumps(config, indent=2))
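The loaded configuration is a plain Python dict, so any template parameter can be adjusted with ordinary dict access before the model is created. A sketch using a minimal stand-in config that mirrors the template's nesting (the real template contains many more fields):

```python
# Minimal stand-in mirroring the "models -> synthetics -> params" nesting of the
# real template; the field values here are illustrative only
config = {"models": [{"synthetics": {"params": {"epochs": 100, "vocab_size": 20000}}}]}

# Same access pattern as used above to set the epoch count
config["models"][0]["synthetics"]["params"]["epochs"] = 50
print(config["models"][0]["synthetics"]["params"])
```

Any other key under "params" can be overridden the same way before passing the config to create_model_obj.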

Load and preview the source dataset

Specify a data source to train the model on. This can be a local file, web location, or HDFS file.
# Load and preview the DataFrame to train the synthetic model on.
import pandas as pd
dataset_path = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"
df = pd.read_csv(dataset_path)
df.to_csv("training_data.csv", index=False)
df

Train the synthetic model

In this step, we will task the worker running in the Gretel cloud, or locally, to train a synthetic model on the source dataset.
from gretel_client.helpers import poll
model = project.create_model_obj(model_config=config, data_source="training_data.csv")
model.submit_cloud()
poll(model)
# View the synthetic data
synthetic_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
synthetic_df

View the synthetic data quality report

# Display the report comparing statistical properties of the training and synthetic data
import IPython
from smart_open import open
IPython.display.HTML(data=open(model.get_artifact_link("report")).read(), metadata=dict(isolated=True))
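The JSON version of the report (report-json.json.gz) is better suited to programmatic checks, such as gating a pipeline on a minimum quality score. A sketch of parsing it, using a stand-in file and an assumed field name for illustration; the real file comes from your model's artifacts and its exact schema may differ:

```python
import gzip
import json

# Stand-in report file -- in practice, download report-json.json.gz from the
# model's artifacts. The "synthetic_data_quality_score" field name below is an
# assumption for illustration; check the actual report for the exact schema.
with gzip.open("report-json.json.gz", "wt") as f:
    json.dump({"synthetic_data_quality_score": {"score": 92}}, f)

with gzip.open("report-json.json.gz", "rt") as f:
    report = json.load(f)

score = report["synthetic_data_quality_score"]["score"]
print(f"Synthetic data quality score: {score}")
```

A pipeline could then fail the run if the score falls below a chosen threshold.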

Generate unlimited synthetic data

You can now use the trained synthetic model to generate as much synthetic data as you like.
# Generate more records from the model
record_handler = model.create_record_handler_obj(
params={"num_records": 100, "max_invalid": 500}
)
record_handler.submit_cloud()
poll(record_handler)
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
synthetic_df