Running Models

Once a Gretel Model is created, you can use it to generate synthetic data as many times as needed. Because a model can also be used to classify and transform data, we refer to any run of a model generically as a Record Handler.

Unlike model creation, running a model does not require a standalone Gretel Configuration. There are three types of input to be aware of when running models:

  • A Model ID (or other reference to a Model, like a Model instance in the SDK)

  • A number of parameters, which are essentially key-value pairs. These will vary depending on the specific type of model you are running.

  • Optionally, one or more input data files. Depending on the model, the input data may serve various purposes. For example, an input data file can provide a set of pre-conditioned inputs (smart seeds) for the model to use during generation.

In the example below, we will create a record handler for a Gretel LSTM (synthetics) model that we previously created in Creating Models. The Gretel LSTM model utilizes two parameters:

  • num_records: How many synthetic records to generate

  • max_invalid: How many records can fail validation before the job stops

Please reference specific model examples to understand what data inputs and parameters should be used when running specific models.

Running Models from the CLI

Models can be run using the gretel models run [OPTIONS] set of commands. At any time, you can get help on these commands by running:

gretel models run --help

To run a model, you will need to know or look up its Model ID. When passing run parameters to the CLI, use a separate --param option for each parameter, following the --param KEY VALUE pattern.
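When the run parameters already live in a Python dictionary, the --param KEY VALUE pattern described above can be assembled programmatically. The following is a minimal sketch using only the standard library. The helper name build_run_command and the max_invalid value of 50 are illustrative assumptions, not part of the Gretel CLI or its defaults. Only the --model-id, --param, and --output options shown in this page are used.

```python
import shlex


def build_run_command(model_id, params, output_dir):
    """Assemble an argv list for `gretel models run` (hypothetical helper)."""
    argv = ["gretel", "models", "run", "--model-id", model_id]
    for key, value in params.items():
        # Each run parameter becomes its own `--param KEY VALUE` triple.
        argv += ["--param", key, str(value)]
    argv += ["--output", output_dir]
    return argv


argv = build_run_command(
    "62c743f56af5cc87b82b2f03",
    {"num_records": 100, "max_invalid": 50},  # max_invalid=50 is an example value
    "more-syn-data/",
)

# Print a shell-safe, copy-pasteable form of the command.
print(shlex.join(argv))
# gretel models run --model-id 62c743f56af5cc87b82b2f03 --param num_records 100 --param max_invalid 50 --output more-syn-data/
```

The resulting list can be handed to subprocess.run(argv, check=True) if you prefer to launch the CLI from a script rather than a shell.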

Given a previously created model, let's generate 100 additional records:

gretel models run \
    --model-id 62c743f56af5cc87b82b2f03 \
    --param num_records 100 \
    --output more-syn-data/

When this job completes, the artifacts will be downloaded to the more-syn-data directory. For this particular job, you should see logs.json.gz, which contains the job logs, and data.gz, which contains your new synthetic data.
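To inspect the downloaded records from Python, something like the sketch below may be enough. It assumes that data.gz is a gzip-compressed CSV file with a header row, which this page does not state explicitly. If your model emits a different format, adjust the parsing accordingly. The function name read_gzipped_csv is hypothetical.

```python
import csv
import gzip


def read_gzipped_csv(path):
    """Return the rows of a gzip-compressed CSV file as a list of dicts.

    Assumption: the artifact is CSV with a header row.
    """
    # Text mode ("rt") lets the csv module read the decompressed stream directly.
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

For example, rows = read_gzipped_csv("more-syn-data/data.gz") would load the artifact, and len(rows) would give the number of synthetic records it contains.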

If the model type supports conditioning (i.e. smart seeding), then you may provide this set of partial records or smart seeds using the --in-data flag.

When providing a data source for running a model, the job will often use the number of records in the data set to determine how many synthetic records to create. In this case, parameters like num_records will be ignored.
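As a rough illustration of the seeding workflow, the sketch below writes a small file of partial records and assembles a matching command that uses the --in-data flag. The helper names, the seed column name "state", and the file name seeds.csv are all hypothetical. Which columns can be used as seeds depends on how the model was created. No num_records parameter is passed, since the seed file's row count is expected to drive the number of generated records.

```python
import csv


def write_seed_file(path, seed_rows):
    """Write partial records (smart seeds) to a CSV file and return the row count."""
    fieldnames = list(seed_rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(seed_rows)
    return len(seed_rows)


def build_seeded_run_command(model_id, seed_path, output_dir):
    """Assemble an argv list that supplies the seed file via --in-data."""
    return [
        "gretel", "models", "run",
        "--model-id", model_id,
        "--in-data", seed_path,
        "--output", output_dir,
    ]
```

Calling write_seed_file("seeds.csv", [{"state": "CA"}, {"state": "NY"}]) returns 2, which is the number of records you would expect the job to generate for that seed file.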

Running Models with the SDK

In order to run a model from the SDK, you will need a Model instance. Once you have that instance, you can create and submit a record handler object in a very similar way to model creation. When submitting a record handler to Gretel Cloud, you may track the state of the job the same way as a model.

from gretel_client import create_or_get_unique_project
from gretel_client import poll

proj = create_or_get_unique_project(name="my-next-project")

model = proj.get_model("62c8a20fc4b9311c959dc03d")

# NOTE: Model running params are provided as a dictionary
handler = model.create_record_handler_obj(params={"num_records": 100})
handler.submit_cloud()

# Wait for completion
poll(handler)

# handler.refresh() also works, just as it does for model creation

# Download artifacts, including newly created data
handler.download_artifacts("more_syn_data/")
