Synthesize Tabular Data
Use Gretel's LSTM model to generate tabular synthetic data.
In this example, we will use the Gretel LSTM model to create tabular synthetic data, training on a United States Census dataset of adult income records. You can synthesize tabular data using any of Gretel's interfaces.
CLI
First, create a project to host your model and artifacts.
gretel projects create --display-name synth-tabular-data --set-default
Download and preview the dataset we will train the synthetic model on.
wget https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv
head -n 10 USAdultIncome5k.csv
The `head` command previews the first 10 rows of the dataset we will synthesize.
age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
42,Private,255847,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,4386,0,48,United-States,>50K
34,Private,111567,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,40,United-States,<=50K
34,Private,263307,Bachelors,13,Never-married,Sales,Unmarried,Black,Male,0,0,45,?,<=50K
69,Private,174474,10th,6,Separated,Machine-op-inspct,Not-in-family,White,Female,0,0,28,Peru,<=50K
26,Private,260614,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
29,Private,201155,9th,5,Never-married,Sales,Not-in-family,White,Female,0,0,48,United-States,<=50K
20,?,124242,Some-college,10,Never-married,?,Own-child,White,Female,0,0,40,United-States,<=50K
26,Private,60722,Bachelors,13,Never-married,Prof-specialty,Own-child,Asian-Pac-Islander,Female,0,0,40,United-States,<=50K
28,Private,331381,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,40,United-States,<=50K
Next, train a synthetic model on the dataset using Gretel's tabular-lstm configuration template:
gretel models create --runner cloud --config synthetics/tabular-lstm --in-data USAdultIncome5k.csv --output . --name synth-income-model
The `--output` parameter specifies where the model artifacts will be saved. In this example, `--output .` creates several files in your local directory. For models trained in the Gretel Cloud, model artifacts can be downloaded at any time with the following command:
gretel models get --model-id [model id] --output .
The following model artifacts are created:
| Filename | Description |
|---|---|
| data_preview.gz | A preview of your synthetic dataset in CSV format. |
| report.html.gz | An HTML report that offers deep insight into the quality of the synthetic model. |
| report-json.json.gz | A JSON version of the synthetic quality report, useful for validating synthetic data model quality programmatically. |
| logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
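Once these artifacts are downloaded, you can inspect them with standard tools. As a minimal sketch (assuming data_preview.gz is in your current working directory), the preview is simply a gzipped CSV that pandas can read directly:
# Sketch: inspect the downloaded preview artifact
# Assumes data_preview.gz is in the current directory (written by --output .)
import pandas as pd
preview_df = pd.read_csv("data_preview.gz", compression="gzip")
print(preview_df.shape)
print(preview_df.head())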
Now we will use our trained synthetic model to generate more synthetic data. Copy the model ID returned by the `gretel models create` command.
gretel records generate --model-id [model id] model-data.json --runner cloud --num-records 5000 --max-invalid 5000 --output .
The following model artifacts are created during a generation job:
| Filename | Description |
|---|---|
| data.gz | Your synthetic dataset in CSV format. |
| logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
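As with the training artifacts, the generated dataset is a gzipped CSV. A minimal sketch, assuming data.gz was written to the current directory, loads it and checks how many records were produced:
# Sketch: load the generated records and check the row count
# Assumes data.gz is in the current directory (written by --output .)
import pandas as pd
generated_df = pd.read_csv("data.gz", compression="gzip")
print(len(generated_df), "records generated")  # we requested 5000 with --num-records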
SDK
This notebook walks you through creating your own synthetic data from a CSV or a DataFrame of your choosing using Gretel's Python SDK.
%%capture
!pip install -U gretel-client
# Specify your Gretel API key
import pandas as pd
from gretel_client import configure_session
pd.set_option("max_colwidth", None)
configure_session(api_key="prompt", cache="yes", validate=True)
# Create a project
from gretel_client.projects import create_or_get_unique_project
project = create_or_get_unique_project(name="synthetic-data")
Load the default configuration template. This template will work well for most datasets. View other synthetic configuration templates here.
import json
from gretel_client.projects.models import read_model_config
config = read_model_config("synthetics/default")
# Set the model epochs to 50
config["models"][0]["synthetics"]["params"]["epochs"] = 50
print(json.dumps(config, indent=2))
Specify a data source to train the model on. This can be a local file, web location, or HDFS file.
# Load and preview the DataFrame to train the synthetic model on.
import pandas as pd
dataset_path = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"
df = pd.read_csv(dataset_path)
df.to_csv("training_data.csv", index=False)
df
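Before submitting the training job, it can help to run a quick sanity check on the training data. This is an optional sketch using plain pandas, not a required step:
# Optional sanity checks on the training data loaded above
print(df.shape)  # rows, columns
print(df.isnull().sum().sum(), "missing values")
print(df.dtypes.value_counts())  # breakdown of column types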
In this step, we will task the worker running in the Gretel cloud, or locally, to train a synthetic model on the source dataset.
from gretel_client.helpers import poll
model = project.create_model_obj(model_config=config, data_source="training_data.csv")
model.submit_cloud()
poll(model)
# View the synthetic data
synthetic_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
synthetic_df
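For a quick, informal comparison (the report in the next step is the authoritative quality measure), you can line up the summary statistics of the numeric columns in the training and synthetic data:
# Informal comparison of numeric column statistics;
# see the quality report below for the full analysis
comparison = pd.concat(
    {"training": df.describe().T, "synthetic": synthetic_df.describe().T},
    axis=1,
)
print(comparison)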
# Display the synthetic quality report, which compares the statistical properties of the training and synthetic data
import IPython
from smart_open import open
IPython.display.HTML(data=open(model.get_artifact_link("report")).read(), metadata=dict(isolated=True))
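If you prefer to check report results programmatically rather than visually, the JSON version of the report can be fetched the same way. This is a sketch that assumes the SDK exposes it under the artifact key "report_json"; inspect the top-level fields before relying on any particular one:
# Sketch: fetch the JSON quality report for programmatic checks
# Assumes the artifact key "report_json"; `open` is smart_open's open, imported above,
# which transparently handles the gzipped link
import json
report = json.loads(open(model.get_artifact_link("report_json")).read())
print(list(report.keys()))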
You can now use the trained synthetic model to generate as much synthetic data as you like.
# Generate more records from the model
record_handler = model.create_record_handler_obj(
    params={"num_records": 100, "max_invalid": 500}
)
record_handler.submit_cloud()
poll(record_handler)
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
synthetic_df
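To keep the generated records for downstream use, write them out like any other DataFrame (the filename here is arbitrary):
# Persist the generated records locally; the filename is arbitrary
synthetic_df.to_csv("synthetic_data.csv", index=False)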