Balance a Dataset
In this deep dive, we will walk through using the Gretel CLI to create a synthetic model and generate synthetic records using a Gretel Worker in your own environment.
If you have not already gone through our environment setup tutorial, please do so first; it will enable you to run a Gretel Worker with GPU support on your own machine. If you do not have access to a GPU, the training data and model in this tutorial are small enough to train on a CPU; it will just take a little longer.
In this tutorial, we will show you how to use the Gretel CLI and Smart Seeding to create your own synthetic records. Smart Seeding enables you to provide partial record values to the record generation process and our Gretel model will do the heavy lifting of creating the remainder of the record for you.
Let’s dive in!
First, we will create a synthetic model using a Gretel Worker that is local to the machine running the CLI. In local mode the training data will not be sent to Gretel Cloud and will reside only on the machine you are running the CLI from.
To begin, we will download and modify one of Gretel’s configuration templates. We’ll use our default synthetic template and modify it to support Smart Seeding. When using Smart Seeding, you must provide the names of the fields that you wish to use as seeds for generating records. We will use the following fields as seed fields: race, gender, and income_bracket.
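Before editing the configuration, it can help to confirm that the chosen seed fields really are column names in the training data. The following is a minimal sketch of such a check, not part of the Gretel tooling; the helper name `missing_seed_fields` and the two-line sample are hypothetical, and the sample assumes the training CSV has a header row.

```python
import csv
import io

# Seed fields named in this tutorial.
SEED_FIELDS = ["race", "gender", "income_bracket"]


def missing_seed_fields(csv_text, seed_fields):
    """Return the seed fields that are not column names in the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    columns = {name.strip() for name in header}
    return [field for field in seed_fields if field not in columns]


# Hypothetical sample for illustration; the real training file has more columns.
sample = "age,race,gender,income_bracket\n39,White,Male,<=50K\n"
print(missing_seed_fields(sample, SEED_FIELDS))  # prints []
```

An empty list means every seed field was found in the header.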
Download and modify the default config template:
Now edit the configuration to enable Smart Seeding:
# Default configuration for Synthetic model creation.
# The parameter settings below match the default settings
# in Gretel's open source synthetic package
# NOTE: A synthetic task of type "seed" needs to be added to enable
# smart seeding during record generation
Save this configuration locally. We’ll save it as seed-config.yml for this tutorial.
Next, we’ll request a model creation job to be run in local mode. This will submit the configuration to the Gretel Cloud API and automatically trigger a download of the Gretel Synthetics container, load the config and training data, and start creating the synthetic model.
The synthetic model, sample data, and synthetic report will be saved to the local machine in the directory specified by the --output option:
gretel models create --config seed-config.yml --runner local --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv --output model-data
When this command is run, a model creation request will be sent to Gretel Cloud and a local Gretel Worker will be launched. The Gretel Worker will download the configuration from Gretel Cloud, load the training data, and begin training the synthetic model.
When the model completes, several artifacts will be available in the output directory:
data_preview.gz contains sample synthetic data. This data was created by the synthetic model and is used to create the Synthetic Quality Score (SQS) report.
model.tar.gz is the actual machine learning model. It should not have to be used directly.
report.html.gz is a human-readable HTML page of the Synthetic Quality Score report.
report_json.json.gz contains the same data as the SQS report, but in JSON format.
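To look at these artifacts programmatically, a short script can list the output directory and decompress the JSON report. This is only a sketch using the Python standard library. It assumes the model-data directory from the command above, and because the report's schema is not described in this tutorial, it prints only the top-level keys. The helper names are hypothetical.

```python
import gzip
import json
from pathlib import Path


def list_artifacts(output_dir):
    """Return the sorted file names found in the output directory."""
    return sorted(p.name for p in Path(output_dir).iterdir() if p.is_file())


def load_report(output_dir):
    """Decompress and parse report_json.json.gz from the output directory."""
    report_path = Path(output_dir) / "report_json.json.gz"
    with gzip.open(report_path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    out = Path("model-data")  # directory passed to --output above
    if out.is_dir():
        print(list_artifacts(out))
        report = load_report(out)
        # Only show the top-level shape; the schema is not assumed here.
        print(sorted(report) if isinstance(report, dict) else type(report).__name__)
```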
Next, we can serve our model to generate new synthetic records!
We'll now use our newly created synthetic model to generate some new records. As discussed before, Gretel has already created a more balanced version of this dataset. Here we will walk through generating records from partially complete seed values.
For demonstration, we'll assume we want to generate 100 new synthetic records where the values for race, gender, and income_bracket are "Black", "Female", and ">50K", respectively.
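A seed file with this shape could be built locally along the following lines. This is a sketch under assumptions: it supposes the seed data is a CSV whose header holds the seed field names and that one seed row is supplied per record to generate. The file name seed-fields.csv and the helper `write_seed_file` are hypothetical.

```python
import csv

# Seed values from this tutorial, keyed by seed field name.
SEED_VALUES = {"race": "Black", "gender": "Female", "income_bracket": ">50K"}


def write_seed_file(path, seed_values, count):
    """Write a header plus `count` identical seed rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(seed_values))
        writer.writeheader()
        for _ in range(count):
            writer.writerow(seed_values)


# Assumption: one seed row per requested record, so 100 rows here.
write_seed_file("seed-fields.csv", SEED_VALUES, 100)
```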
Now, with our seed data, we can serve the model and generate new records with the following command. Be sure to replace the model ID with your own!
gretel records generate --runner local --model-id 60ba8dfb0ba87c111336cd9e --model-path model-data/model.tar.gz --output syn-records --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome-SeedFields.csv
After running this command, your previously created synthetic model will be loaded and the seed data CSV will be sent into the model server handler. These seed values will be used to generate new records.
Now let's examine our newly synthesized records:
gunzip -c syn-records/data.gz | head
You will see that every generated record contains the three values requested in our seed data!
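The same check can be done over the whole file rather than the first few lines. The sketch below assumes syn-records/data.gz is a gzip-compressed CSV with a header row, which is what the head output above suggests. The helper `count_mismatches` is hypothetical.

```python
import csv
import gzip
import os

# Seed values requested in this tutorial.
SEED_VALUES = {"race": "Black", "gender": "Female", "income_bracket": ">50K"}


def count_mismatches(gz_path, seed_values):
    """Return (total_rows, rows_not_matching_seed) for a gzipped CSV."""
    total = mismatched = 0
    with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            total += 1
            # A row mismatches if any seed field is absent or differs.
            if any(row.get(key) != value for key, value in seed_values.items()):
                mismatched += 1
    return total, mismatched


if __name__ == "__main__":
    path = os.path.join("syn-records", "data.gz")
    if os.path.exists(path):
        total, mismatched = count_mismatches(path, SEED_VALUES)
        print(f"{total} records, {mismatched} not matching the seed values")
```

A mismatch count of zero confirms that every record carries the requested seed values.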