Creating Models
A model in Gretel is an algorithm that can be used to generate, transform, or label data.
Powered by data, models can be thought of as the building blocks of machine learning. This page walks through the basics of initializing and training models for synthetic data, data transformations, and data classification.
The fundamentals of creating a Gretel model are almost always the same; there are three key steps:
1) Choose a default Gretel Configuration, or create your own
2) Provide training or input data
3) Submit as a job to Gretel Cloud
When creating a model, Gretel Cloud performs the following steps:
1) Load the Gretel Configuration
2) Upload the training data to Gretel Cloud
3) Gretel Cloud provisions a worker and begins model training
4) When the job is completed, several Model Artifacts, including output data and reports, can then be downloaded client-side
We'll show how to use both the CLI and SDK to create Gretel models in their own sections below.
Gretel Configuration
Gretel Configurations generally start as declarative YAML files, which can then be provided to the SDK, CLI, or Gretel Console for starting a model creation job. Between the CLI and SDK, however, there are some differences (and similarities) in how you can define and provide a Gretel Configuration.
CLI and SDK
Both the CLI and SDK can work with Gretel Configurations stored as YAML files, accessed either on disk or through remote URIs (HTTPS, S3, etc.).
Both the CLI and SDK can reference configurations through "template shortcuts." Gretel maintains configuration templates for various models and use cases. A template can be referenced using a `directory/filename` pattern (no file extension required), so the string `synthetics/default` will automatically fetch and load this configuration file.
SDK Only
The SDK can also load Gretel Configurations as Python dictionaries as an alternative to YAML. This way, you may either load a configuration from disk or a template, and then manipulate it as necessary. Here's an example of this:
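For instance, here is a minimal sketch using the SDK's `read_model_config` helper; the `epochs` tweak is illustrative, and the exact parameter path depends on the model type in your template:

```python
from gretel_client.projects.models import read_model_config

# Load the "synthetics/default" template as a Python dict
config = read_model_config("synthetics/default")

# Manipulate the dict as necessary before training. This parameter
# path is illustrative and assumes the default synthetics template.
config["models"][0]["synthetics"]["params"]["epochs"] = 50
```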
Input Data Sources
The various types of data source formats can be reviewed here: Inputs and Outputs. This section will cover how these data sources can be provided to the CLI and SDK.
CLI and SDK
Data sources may be either files on disk or files accessible via a remote URI (such as HTTPS or S3). In both cases, you provide a string value containing the path to the file on disk or the remote URI.
SDK Only
The SDK will accept Pandas DataFrames as input data. When a DataFrame is provided, the SDK will temporarily write the DataFrame to disk and upload it to Gretel Cloud. When the operation is complete, the temporary file on disk will be deleted. When showing SDK usage below, we will use the DataFrame input data method.
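For instance (the URL below is a placeholder; point `read_csv` at your own data):

```python
import pandas as pd

# Placeholder URL: any CSV on disk or at a remote URI works here
df = pd.read_csv("https://example.com/sample-data.csv")
```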
Creating Models with the CLI
The steps below assume you have a default Gretel Project configured. If you wish to create a model in a different project, you can use the `--project <NAME>` flag at any time.
For this example, we will download the sample data to disk so you may observe the full artifact creation process:
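A sketch using curl; the URL is a placeholder, so substitute the sample dataset you want to model:

```bash
# Placeholder URL: substitute the sample dataset you want to model
curl -L -o data.csv https://example.com/sample-data.csv
```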
Regardless of the model type, creating a Gretel model through the CLI is done with the `gretel models create ...` command set.
At any time you can get the help menu by running:
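```bash
gretel models create --help
```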
Given our dataset and the synthetics configuration shortcut (`synthetics/default`), let's create a model:
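A likely invocation looks like the following; the flag names are worth verifying against the help menu above:

```bash
gretel models create \
  --config synthetics/default \
  --in-data data.csv \
  --output my-synthetic-data \
  --runner cloud
```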
By default, the CLI will attach to the job as it runs in Gretel Cloud and you will start to see verbose logging output as the job runs.
If you terminate this command, e.g. by sending a keyboard interrupt, the job will be cancelled. If you wish to run the job in a "detached" mode, you may use the `--wait` flag and give a low number of seconds to stay attached to the job, such as `--wait 5`. After 5 seconds the CLI will detach and the job will continue to run in Gretel Cloud.
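For example:

```bash
gretel models create \
  --config synthetics/default \
  --in-data data.csv \
  --output my-synthetic-data \
  --wait 5
```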
Once the model is completed, the CLI will download the artifacts that were created as part of the model. You should be able to see these in the directory you specified with the `--output` parameter; in this example, artifacts are saved to the `my-synthetic-data` directory.
Additionally, you should see the Model ID output from the CLI.
You will need this ID when re-using this model to generate synthetic data. Next, let's look at the downloaded artifacts.
- `data_preview.gz` contains the synthetic data that was created as part of the model creation process
- `report.html.gz` contains the Synthetic Quality Score (SQS) report as a human-readable HTML file
- `report_json.json.gz` contains the data from the SQS report in a JSON-consumable format
- `logs.json.gz` contains the model creation logs, which may be useful if you ever contact Gretel support
Downloading Model Artifacts
When the CLI stays attached to the Gretel Cloud job, artifacts will automatically be downloaded to the provided `--output` directory. If you have disconnected the CLI from Gretel Cloud, for example by using the `--wait` option, then you may download the artifacts manually. This can be done with the following command:
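One likely form, using the Model ID printed when the model was created (the exact flags are an assumption, so verify them with `gretel models get --help`):

```bash
# Flags are an assumption -- confirm with: gretel models get --help
gretel models get --model-id <your-model-id> --output my-synthetic-data
```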
Creating Models with the SDK
Next, we'll walk through creating models with the SDK. While the SDK can utilize local file data sources and remote URI data sources, for this example we will show how you can use a Pandas DataFrame as your data source.
First, you'll need to create a `Project` instance to work with. Creating a `Project` instance can be reviewed here: Accessing Projects.
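As a sketch, one way to do this is with the `create_or_get_unique_project` helper from gretel-client (the project name here is arbitrary):

```python
from gretel_client.projects import create_or_get_unique_project

# Create a new project (or fetch it if the name already exists)
project = create_or_get_unique_project(name="my-synthetic-data")
```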
Once we have our `Project` instance, we will want to do a few things:
1) We use the `Project` instance to create a `Model` instance by using a specific `create_model_obj()` factory method. This factory method takes both our Gretel Configuration and data source (a DataFrame) as params.
2) With the `Model` instance created, we have to actually submit it to Gretel Cloud.
3) Next we can `poll` the `Model` instance for completion.
4) Finally we can download all of the Model Artifacts.
Let's see it all in action...
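Here is a minimal sketch of those four steps, assuming the `project` and `df` objects from above; method names follow the gretel-client SDK, but double-check them against your installed client version:

```python
from gretel_client.helpers import poll
from gretel_client.projects.models import read_model_config

# 1) Create a Model instance from a Gretel Configuration and a DataFrame
model = project.create_model_obj(
    model_config=read_model_config("synthetics/default"),
    data_source=df,
)

# 2) Submit the model to Gretel Cloud for training
model.submit_cloud()

# 3) Block until the job completes, streaming status as it runs
poll(model)

# 4) Download all of the Model Artifacts to a local directory
model.download_artifacts("my-synthetic-data")

# Keep the Model ID around if you want to reuse this model later
print(model.model_id)
```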
In the above example, our `Model` instance was in memory the entire time. If you ever lose that instance or restart your Python interpreter, you can create and hydrate a new `Model` instance right from your `Project` instance:
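A sketch, assuming the `get_model` accessor on the `Project` instance:

```python
# Re-create and hydrate a Model instance from a stored Model ID
model = project.get_model(model_id="<your-model-id>")
```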
In the next section, we'll discuss how to utilize existing models to generate synthetic data.