Creating Models

A model in Gretel is an algorithm that can be used to generate, transform, or label data.

Powered by data, models can be thought of as the building blocks of machine learning. This page walks through the basics of initializing and training models for synthetic data, data transformations, and data classification.

The fundamentals of creating a Gretel model are almost always the same; there are three key steps:

1) Choose a default Gretel Configuration, or create your own

2) Provide training or input data

3) Submit as a job to Gretel Cloud

When creating a model, Gretel Cloud performs the following steps:

  1. Load the Gretel Configuration

  2. Upload the training data to Gretel Cloud

  3. Gretel Cloud provisions a worker and begins model training

  4. When the job is complete, several Model Artifacts, including output data and reports, can be downloaded client-side

We'll show how to use both the CLI and SDK to create Gretel models in their own sections below.

Gretel Configuration

Gretel Configurations generally start as declarative YAML files, which can then be provided to the SDK, CLI, or Gretel Console to start a model creation job. Between the CLI and SDK, however, there are some differences (and similarities) in how you can define and provide a Gretel Configuration.
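
For illustration, here is an abbreviated sketch of what such a YAML file might look like, based on the synthetics/default template that is printed as a Python dictionary later on this page:

schema_version: "1.0"
models:
  - synthetics:
      data_source: __tmp__
      params:
        epochs: 100
        batch_size: 64
        learning_rate: 0.01
      generate:
        num_records: 5000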

CLI and SDK

  • Both the CLI and SDK can work with Gretel Configurations as YAML files, accessed either on disk or through remote URIs (HTTPS, S3, etc.).

  • Both the CLI and SDK can reference configurations through "template shortcuts." Gretel maintains configuration templates for various models and use cases. A template can be referenced using a directory/filename pattern (no file extension required), so the string synthetics/default will automatically fetch and load the corresponding configuration template.

SDK Only

The SDK can also load Gretel Configurations as Python dictionaries as an alternative to YAML. This way, you can load a configuration from disk or from a template and then manipulate it as necessary. Here's an example:

from gretel_client.projects.models import read_model_config

config_dict = read_model_config("synthetics/default")

config_dict

# {'schema_version': '1.0',
# 'models': [{'synthetics': {'data_source': '__tmp__',
#    'params': {'epochs': 100,
#     'batch_size': 64,
#     'vocab_size': 20000,
#     'reset_states': False,
#     'learning_rate': 0.01,
#     'rnn_units': 256,
#     'dropout_rate': 0.2,
#     'overwrite': True,
#     'early_stopping': True,
#     'gen_temp': 1.0,
#     'predict_batch_size': 64,
#     'validation_split': False,
#     'dp': False,
#     'dp_noise_multiplier': 0.001,
#     'dp_l2_norm_clip': 5.0,
#     'data_upsample_limit': 10000},
#    'validators': {'in_set_count': 10, 'pattern_count': 10},
#    'generate': {'num_records': 5000, 'max_invalid': None},
#    'privacy_filters': {'outliers': 'medium', 'similarity': 'medium'}}}]}

# NOTE: You may now edit this dict as necessary
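
For example, continuing the snippet above, you could lower the number of training epochs or the number of generated records before creating the model (the key paths follow the dictionary printed above):

# Reduce training epochs and generated record count
config_dict["models"][0]["synthetics"]["params"]["epochs"] = 50
config_dict["models"][0]["synthetics"]["generate"]["num_records"] = 1000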

Input Data Sources

The various types of data source formats can be reviewed here: Inputs and Outputs. This section will cover how these data sources can be provided to the CLI and SDK.

CLI and SDK

Data sources may be either files on disk or files that can be accessed via a remote URI (such as HTTPS or S3). In both cases, you provide a string value containing the path to the file on disk or the remote URI.
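
As a minimal sketch, assuming a Project instance proj (creating one is covered in the SDK section below) and hypothetical file locations, both of these calls work the same way:

from gretel_client import create_or_get_unique_project

proj = create_or_get_unique_project(name="my-next-project")

# A file on disk (hypothetical path)
model = proj.create_model_obj(
    model_config="synthetics/default",
    data_source="data/us-adult-income.csv",
)

# A remote URI (hypothetical S3 bucket)
model = proj.create_model_obj(
    model_config="synthetics/default",
    data_source="s3://my-bucket/us-adult-income.csv",
)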

SDK Only

The SDK will accept Pandas DataFrames as input data. When a DataFrame is provided, the SDK will temporarily write the DataFrame to disk and upload it to Gretel Cloud. When the operation is complete, the temporary file on disk will be deleted. When showing SDK usage below, we will use the DataFrame input data method.

Creating Models with the CLI

The steps below assume you have a default Gretel Project configured. If at any time you wish to create a model in a different project, you can use the --project <NAME> flag.

For this example, we will download the sample data to disk so you may observe the full artifact creation process:

wget https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv

Regardless of the model type, creating a Gretel model through the CLI is done with the gretel models create command.

At any time you can get the help menu by running:

gretel models create --help

Given our dataset and a synthetics configuration shortcut (synthetics/default), let's create a model:

gretel models create \
    --config synthetics/default \
    --in-data us-adult-income.csv \
    --output my-synthetic-data

By default, the CLI will attach to the job as it runs in Gretel Cloud, and you will see verbose logging output while it progresses.

If you terminate this command (e.g. by sending a keyboard interrupt), the job will be canceled. If you wish to run the job in a "detached" mode, use the --wait flag with a low number of seconds to stay attached, such as --wait 5. After 5 seconds, the CLI will detach and the job will continue to run in Gretel Cloud.
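
For example, to submit the same job as above and detach after 5 seconds:

gretel models create \
    --config synthetics/default \
    --in-data us-adult-income.csv \
    --output my-synthetic-data \
    --wait 5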

Once the model is complete, the CLI will download the artifacts that were created as part of the model creation process. You should see these in the directory you specified with the --output parameter; in this example, artifacts are saved to the my-synthetic-data directory.

Additionally, you should see the Model ID in the CLI output:

INFO: Model done training. The model id is

	62c743f56af5cc87b82b2f03

You will need this ID when re-using this model to generate synthetic data. Next, let's look at the downloaded artifacts.

ls -al my-synthetic-data

# data_preview.gz
# logs.json.gz
# report.html.gz
# report_json.json.gz

  • data_preview.gz contains the synthetic data that was created as part of the model creation process

  • report.html.gz contains the Synthetic Quality Score (SQS) report as a human-readable HTML file

  • report_json.json.gz contains the same data as the SQS report, in a machine-readable JSON format

  • logs.json.gz contains the model creation logs, which may be useful if you ever contact Gretel support

Downloading Model Artifacts

When the CLI stays attached to the Gretel Cloud job, artifacts will automatically be downloaded to the provided --output directory. If you have disconnected the CLI from Gretel Cloud, for example using the --wait option, then you may download the artifacts manually. This can be done with the following command:

gretel models get --model-id <MODEL_ID> --output my-synthetic-data

Creating Models with the SDK

Next, we'll walk through creating models with the SDK. While the SDK can use local files or remote URIs as data sources, for this example we will show how to use a Pandas DataFrame as your data source.

First, you'll need to create a Project instance to work with. Creating a Project instance can be reviewed here: Accessing Projects.

Once we have our Project instance, we will want to do a few things:

  • We use the Project instance to create a Model instance via the create_model_obj() factory method. This factory method takes both our Gretel Configuration and our data source (a DataFrame) as parameters.

  • With the Model instance created, we submit it to Gretel Cloud

  • Next, we poll the Model instance for completion

  • Finally, we download all of the Model Artifacts

Let's see it all in action...

import pandas as pd

from gretel_client import create_or_get_unique_project
from gretel_client import poll

train_df = pd.read_csv("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv")

proj = create_or_get_unique_project(name="my-next-project")

model = proj.create_model_obj(model_config="synthetics/default", data_source=train_df)

model.submit_cloud()

# Once the model is submitted, it will be hydrated with a Model ID
print(model.model_id)
# `62c8a20fc4b9311c959dc03d`

print(model.status)
# <Status.CREATED: 'created'>

poll(model)
# You should start to see logging output similar to CLI-based model creation
# Now issue a KeyboardInterrupt to stop polling (i.e. Ctrl+C)

model.refresh() # get the latest state updates from Gretel Cloud
print(model.status)
# <Status.ACTIVE: 'active'>

poll(model) # let's wait for it to finish

# Once the job finishes, we can download our artifacts
model.download_artifacts("my-artifacts")
! ls -al my-artifacts/

# data_preview.gz
# logs.json.gz
# report.html.gz
# report_json.json.gz

In the above example, our Model instance was in memory the entire time. If you ever lose that instance or restart your Python interpreter, you can create and hydrate a new Model instance right from your Project instance:

from gretel_client import create_or_get_unique_project

proj = create_or_get_unique_project(name="my-next-project")

model = proj.get_model("62c8a20fc4b9311c959dc03d")

model.status
# <Status.COMPLETED: 'completed'>
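
The hydrated instance behaves just like the original, so, for example, you can re-download the Model Artifacts from it:

# Download artifacts from the re-hydrated Model instance
model.download_artifacts("my-artifacts")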

In the next section, we'll discuss how to utilize existing models to generate synthetic data.
