Synthesize Tabular Data

Use Gretel's ACTGAN model to generate tabular synthetic data.

In this example, we will generate synthetic tabular data using Gretel's ACTGAN model. The model will be trained from scratch on the United States Census Adult Income dataset.

To accomplish the above, we will submit training and generation jobs to the Gretel Cloud. Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.

Create Project

First, we will create a project to host your model and artifacts.

gretel projects create --display-name synth-tabular-data --set-default

Get Training Data

Download and preview the dataset that we will use to train the synthetic model.

wget https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv
head -n 10 USAdultIncome5k.csv

The head command prints the first 10 lines of the dataset we will synthesize: the header row followed by nine records.

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
42,Private,255847,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,4386,0,48,United-States,>50K
34,Private,111567,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,40,United-States,<=50K
34,Private,263307,Bachelors,13,Never-married,Sales,Unmarried,Black,Male,0,0,45,?,<=50K
69,Private,174474,10th,6,Separated,Machine-op-inspct,Not-in-family,White,Female,0,0,28,Peru,<=50K
26,Private,260614,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
29,Private,201155,9th,5,Never-married,Sales,Not-in-family,White,Female,0,0,48,United-States,<=50K
20,?,124242,Some-college,10,Never-married,?,Own-child,White,Female,0,0,40,United-States,<=50K
26,Private,60722,Bachelors,13,Never-married,Prof-specialty,Own-child,Asian-Pac-Islander,Female,0,0,40,United-States,<=50K
28,Private,331381,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,40,United-States,<=50K
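Before training, it can help to check the shape of the file and how many records contain the "?" placeholder visible in the preview above. The following is a minimal sketch using only standard shell tools. The csv_summary function name is our own, and it assumes one header line and no quoted commas inside fields, which holds for the rows shown here but has not been verified for the whole file.

```shell
# Summarize a CSV file: number of data rows, number of columns,
# and number of rows with at least one field equal to "?".
# Assumption: the first line is a header and no field contains a quoted comma.
csv_summary() {
  local file="$1"
  awk -F',' '
    NR == 1 { cols = NF; next }            # header line: remember the column count
    {
      rows++
      for (i = 1; i <= NF; i++) {
        if ($i == "?") { missing++; break } # count each row at most once
      }
    }
    END { printf "rows=%d columns=%d rows_with_missing=%d\n", rows, cols, missing }
  ' "$file"
}

# Usage:
#   csv_summary USAdultIncome5k.csv
```

The function only reads the file, so it is safe to run before or after submitting the training job.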

Train the synthetic model

gretel models create --config synthetics/tabular-actgan --in-data USAdultIncome5k.csv --output . --name synth-income-model

Outputs

The --output parameter specifies where the model artifacts will be saved. In this example, --output . creates several files in your current directory. For models trained in the Gretel Cloud, model artifacts can be downloaded at any time with the following command:

gretel models get --model-id [model id] --output .

The following model artifacts are created:

Filename | Description
data_preview.gz | A preview of your synthetic dataset in CSV format.
report.html.gz | HTML report that offers deep insight into the quality of the synthetic model.
report-json.json.gz | A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically.
logs.json.gz | Log output from the synthetic worker that is helpful for debugging.
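The artifacts are gzip-compressed, so they need to be decompressed before they can be read. As a sketch, assuming the files were saved to the current directory under the names listed above, a small helper can print the first few lines of an artifact without unpacking it to disk. The preview_gz name is our own and not part of the Gretel CLI.

```shell
# Print the first N lines (default 5) of a gzip-compressed text file.
preview_gz() {
  local file="$1"
  local lines="${2:-5}"
  gzip -dc "$file" | head -n "$lines"
}

# Usage, assuming the training artifacts are in the current directory:
#   preview_gz data_preview.gz 5
#   gzip -dc report.html.gz > report.html   # then open report.html in a browser
```

Using gzip -dc leaves the original .gz files in place, so the downloaded artifacts stay unchanged.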

Generate synthetic data

Now we will use our trained synthetic model to generate more synthetic data. Copy the model ID returned by the gretel models create command.

gretel records generate --model-id [model id] --num-records 5000 --max-invalid 5000 --output .

The following model artifacts are created during a generation job:

Filename | Description
data.gz | The generated synthetic dataset in CSV format.
logs.json.gz | Log output from the synthetic worker that is helpful for debugging.
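One quick sanity check after generation is to count the records in data.gz and compare the result with the --num-records value passed to the generate command. The sketch below assumes data.gz is a CSV file with a single header line and no line breaks inside fields. The count_records_gz name is our own.

```shell
# Count the data rows (every line after the header) in a gzip-compressed CSV.
count_records_gz() {
  gzip -dc "$1" | awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Usage:
#   count_records_gz data.gz
```

If the count is lower than requested, logs.json.gz is the place to look for details about the job.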
