If you have not already gone through our environment setup tutorial, please do so, as it will enable you to run a Gretel Worker with GPU support on your own machine. If you do not have access to a GPU, the training data size and model complexity in this tutorial are small enough to train on a CPU; it will just take a little longer.
For this deep dive, we will use a reduced version of the US Census Income Data Set that is often used to predict if income is above $50k/year for adults in the United States.
In this tutorial, we will show you how to use the Gretel CLI and Smart Seeding to create your own synthetic records. Smart Seeding enables you to provide partial record values to the record generation process and our Gretel model will do the heavy lifting of creating the remainder of the record for you.
Let’s dive in!
First, we will create a synthetic model using a Gretel Worker that is local to the machine running the CLI. In local mode the training data will not be sent to Gretel Cloud and will reside only on the machine you are running the CLI from.
We’ll use 5,000 records from the original dataset as the training data. The file is hosted at the URL passed to --in-data in the model creation command below.
To begin, we will download and modify one of Gretel’s configuration templates. We’ll use our default synthetic template and modify it to support Smart Seeding. When using Smart Seeding, you must provide the names of the fields that you wish to use as seeds for generating records. We will use the following seed fields: race, gender, and income_bracket.
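Because every seed field has to exist as a column in the training data, it can be worth checking the header before training. The following is a minimal Python sketch and not part of the Gretel tooling: the helper name `missing_seed_fields` and the sample header are made up for illustration, and only the three seed field names come from this tutorial.

```python
import csv
import io

# Seed fields used in this tutorial.
SEED_FIELDS = ["race", "gender", "income_bracket"]


def missing_seed_fields(csv_text, seed_fields):
    """Return the seed fields that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    columns = {name.strip() for name in header}
    return [field for field in seed_fields if field not in columns]


# Illustrative header only; the real training file has more columns.
sample = "age,race,gender,income_bracket\n39,White,Male,<=50K\n"
print(missing_seed_fields(sample, SEED_FIELDS))
```

An empty list means every seed field was found in the header. Any names that are returned would need to be corrected in the configuration or in the data before training.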
Download and modify the default config template:
Now edit the configuration to enable Smart Seeding:
# Default configuration for Synthetic model creation.
# The parameter settings below match the default settings
# in Gretel's open source synthetic package
schema_version: 1.0
models:
  - synthetics:
      data_source: "__tmp__"
      params:
        epochs: 100
      # NOTE: A synthetic task of type "seed" needs to be added to enable
      # smart seeding during record generation
      task:
        type: seed
        attrs:
          fields:
            - race
            - gender
            - income_bracket
Save this configuration locally. We’ll save it as seed-config.yml for this tutorial.
Next, we’ll request a model creation job to be run in local mode. This will submit the configuration to the Gretel Cloud API and automatically trigger a download of the Gretel Synthetics container, load the config and training data, and start creating the synthetic model.
The synthetic model, sample data, and synthetic report will be saved to the local machine in the directory specified by the --output parameter:
gretel models create --config seed-config.yml --runner local --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv --output model-data
When this command is run, a model creation request is sent to Gretel Cloud and a local Gretel Worker is launched. The Gretel Worker downloads the configuration from Gretel Cloud, loads the training data, and begins training the synthetic model.
When the model completes, several artifacts will be available in the output directory:
data_preview.gz contains sample synthetic data. This data was created by the synthetic model and is used to create the Synthetic Quality Score (SQS) report.
model.tar.gz is the actual machine learning model. You should not need to use it directly.
report.html.gz is a human-readable HTML page of the Synthetic Quality Score report.
report_json.json.gz contains the same SQS report data in JSON format.
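To look inside these artifacts without unpacking them by hand, a short Python sketch such as the following may help. It assumes that data_preview.gz is a gzip-compressed CSV with a header row and that report_json.json.gz is gzip-compressed JSON; the helper names `read_gzip_csv` and `read_gzip_json` are made up for this example.

```python
import csv
import gzip
import json
import os


def read_gzip_csv(path, limit=5):
    """Return up to `limit` rows from a gzip-compressed CSV as dictionaries."""
    rows = []
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if len(rows) >= limit:
                break
            rows.append(row)
    return rows


def read_gzip_json(path):
    """Parse a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


# Paths assume the --output model-data directory used in this tutorial.
if os.path.exists("model-data/data_preview.gz"):
    for row in read_gzip_csv("model-data/data_preview.gz", limit=3):
        print(row)

if os.path.exists("model-data/report_json.json.gz"):
    report = read_gzip_json("model-data/report_json.json.gz")
    # The report layout is not described here, so only list top-level keys.
    if isinstance(report, dict):
        print(sorted(report))
```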
Next, we can serve our model to generate new synthetic records!
We'll now use our newly created synthetic model to generate some new records. As discussed before, Gretel has already created a more balanced version of this dataset. Here, we will walk through generating complete records from seed records that are only partially filled in.
For demonstration, we'll assume we want to generate 100 new synthetic records where the values for race, gender, and income_bracket are "Black", "Female" and ">50K".
In order to provide this seed data to our served model, we can create a CSV with just this data. We have already created a 3-column CSV for you.
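If you want to build a similar seed file yourself, the sketch below shows one way to do it in Python. It assumes one seed row per record you want generated (so 100 identical rows for 100 records), which is an assumption about how the provided file is laid out; the helper name `build_seed_csv` is made up for this example.

```python
import csv
import io


def build_seed_csv(seed_values, count):
    """Return CSV text with a header row and `count` identical seed rows."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(seed_values), lineterminator="\n"
    )
    writer.writeheader()
    for _ in range(count):
        writer.writerow(seed_values)
    return buffer.getvalue()


# Seed values from this tutorial; column order follows the dictionary order.
seeds = {"race": "Black", "gender": "Female", "income_bracket": ">50K"}
seed_text = build_seed_csv(seeds, count=100)

# Show the header and the first seed row.
print("\n".join(seed_text.splitlines()[:2]))
```

Writing `seed_text` to a local file would give you a seed CSV of the same shape. Whether --in-data accepts a local path as well as a URL is not shown in this tutorial, so treat that as something to confirm in the CLI help.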
Now, with our seed data, we can serve the model and generate new records with the following command. Be sure to replace the model ID with your own!
gretel records generate --runner local --model-id 60ba8dfb0ba87c111336cd9e --model-path model-data/model.tar.gz --output syn-records --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome-SeedFields.csv
After running this command, your previously created synthetic model will be loaded and the seed data CSV will be sent into the model server handler. These seed values will be used to generate new records.
Now let's examine our newly synthesized records:
gunzip -c syn-records/data.gz | head
You will see that every record generated has the 3 distinct values that were requested in our seed data!
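Rather than checking by eye, you could confirm this programmatically. The following Python sketch assumes syn-records/data.gz is a gzip-compressed CSV with a header row, as the gunzip command above suggests; the helper names `load_gzip_csv` and `rows_missing_seeds` are made up for this example.

```python
import csv
import gzip
import os

# Seed values requested in this tutorial.
EXPECTED = {"race": "Black", "gender": "Female", "income_bracket": ">50K"}


def load_gzip_csv(path):
    """Read every row of a gzip-compressed CSV into a list of dictionaries."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def rows_missing_seeds(rows, expected):
    """Return the rows whose seed columns differ from the expected values."""
    return [
        row
        for row in rows
        if any(row.get(field) != value for field, value in expected.items())
    ]


# Path assumes the --output syn-records directory used in this tutorial.
if os.path.exists("syn-records/data.gz"):
    records = load_gzip_csv("syn-records/data.gz")
    mismatches = rows_missing_seeds(records, EXPECTED)
    print(f"{len(records)} records, {len(mismatches)} without the seed values")
```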