Train and Generate Jobs
Methods for submitting jobs to Gretel workers
With the Gretel instance ready to go, you can use its submit_* methods to submit model training and data generation jobs. Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.
The submit_train method submits a model training job based on the given base config. The data source for the training job is passed in using the data_source argument and may be a file path or pandas DataFrame:
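As a minimal sketch, assuming the high-level Gretel interface from the gretel_client package (the import path, api_key="prompt" prompt-for-key behavior, project name, and file name are assumptions):

```python
import pandas as pd

def train_tabular(data_source, project_name="my-project"):
    """Submit a model training job; data_source may be a file path or DataFrame."""
    from gretel_client import Gretel  # requires: pip install gretel-client

    gretel = Gretel(project_name=project_name, api_key="prompt")
    trained = gretel.submit_train(base_config="tabular-actgan", data_source=data_source)
    return trained

# Either form is a valid data source:
csv_source = "training_data.csv"
df_source = pd.DataFrame({"age": [34, 51], "income": [52000, 61000]})
```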
We trained an ACTGAN model by setting base_config="tabular-actgan". You can replace this base config with the path to a custom config file, or you can select any of the available config names (excluding the .yml extension). The returned trained object is a dataclass that contains the training job results, such as the Gretel model object, the synthetic data quality report, training logs, and the final model configuration.
The base configuration can be modified using keyword arguments with the following rules:
Nested model settings can be passed as keyword arguments in the submit_train method, where the keyword is the name of the config subsection and the value is a dictionary with the desired subsection's parameter settings. For example, this is how you update settings in ACTGAN's params and privacy_filters subsections, where epochs, discriminator_dim, similarity, and outliers are nested settings:
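A sketch of what those nested overrides might look like; the parameter values are illustrative only, and gretel is assumed to be an existing client instance:

```python
# Each keyword names a config subsection; each value is a dict of that
# subsection's settings (values below are illustrative only):
overrides = {
    "params": {"epochs": 800, "discriminator_dim": [1024, 1024, 1024]},
    "privacy_filters": {"similarity": "high", "outliers": "medium"},
}

# trained = gretel.submit_train(
#     base_config="tabular-actgan", data_source="data.csv", **overrides
# )
```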
Once you have models in your Gretel Project, you can use any of them to generate synthetic data using the submit_generate method:
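A minimal sketch, assuming gretel is an existing client instance and trained is a completed training job returned by submit_train:

```python
def generate_records(gretel, model_id, num_records=100):
    """Submit a generation job against any model in the current project."""
    generated = gretel.submit_generate(model_id, num_records=num_records)
    return generated

# Typical usage with a completed training job:
# generated = generate_records(gretel, trained.model_id)
```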
Above we use the model_id attribute of a completed training job, but you are free to use the model_id of any model within the current project. If the model has additional generate settings (e.g., temperature when generating text), you can pass them as keyword arguments to the submit_generate method. The returned generated object is a dataclass that contains results from the generation job, including the generated synthetic data.
With seed data, for example, you can conditionally generate 50 examples where the given field's class is "seed".
If you do not want to wait for a job to complete, you can set wait=False when calling submit_train or submit_generate. In this case, the method will return immediately after the job starts:
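For example, a sketch mirroring the training call above (gretel is assumed to be an existing client instance):

```python
def submit_without_waiting(gretel, data_source):
    """Start a training job and return immediately instead of blocking."""
    trained = gretel.submit_train(
        base_config="tabular-actgan",
        data_source=data_source,
        wait=False,  # return as soon as the job starts
    )
    return trained
```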
Some things to know if you use this option:
- You can check the job status using the job_status attribute of the returned object: print(trained.job_status).
- You can continue waiting for the job to complete by calling the wait_for_completion method of the returned object: trained.wait_for_completion().
- If you are not waiting when the job completes, you must call the refresh method of the returned object to fetch the job results: trained.refresh().
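The follow-up steps above can be sketched together, where trained is the object returned by a wait=False submission:

```python
def follow_up(trained):
    """Monitor and finalize a job that was submitted with wait=False."""
    print(trained.job_status)      # check the current status
    trained.wait_for_completion()  # optionally block until the job finishes
    trained.refresh()              # fetch results if you were not waiting
    return trained
```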
Our Transforms product allows you to remove PII from data, and you can submit these transform jobs from the high-level SDK. The default behavior is to use a model to classify the data and replace the detected entities with fake values.
You can fetch results from previous training and generation jobs using the fetch_*_job_results methods:
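A sketch, assuming fetch_train_job_results and fetch_generate_job_results are the concrete fetch_*_job_results methods, and that generation results are keyed by a record id (the record_id argument is an assumption):

```python
def fetch_previous_results(gretel, model_id, record_id):
    """Re-hydrate results from jobs submitted earlier (argument names assumed)."""
    trained = gretel.fetch_train_job_results(model_id)
    generated = gretel.fetch_generate_job_results(model_id, record_id)
    return trained, generated
```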
For fetching transform results, you can do the following, and also access the transformed object as a DataFrame:
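A sketch, assuming a fetch_transform_results method and a transformed_df attribute on the returned object (both names are assumptions):

```python
def fetch_transformed(gretel, model_id):
    """Fetch transform job results and expose the output as a DataFrame."""
    transformed = gretel.fetch_transform_results(model_id)
    return transformed.transformed_df  # assumed attribute holding a DataFrame
```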
Non-nested model settings can be passed directly as keyword arguments in the submit_train method. For example, this is how you update a model's pretrained_model and column_name settings, which are not nested within a subsection:
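For example, a sketch for a text model; the "natural-language" base config name is an assumption, and gretel is assumed to be an existing client instance:

```python
def train_text_model(gretel, data_source, pretrained_model, column_name):
    """Non-nested settings pass straight through as keyword arguments."""
    return gretel.submit_train(
        base_config="natural-language",     # assumed base config name
        data_source=data_source,
        pretrained_model=pretrained_model,  # not nested in a subsection
        column_name=column_name,            # not nested in a subsection
    )
```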
In the previous example, we unconditionally generated num_records records. To conditionally generate synthetic data, use the seed_data argument:
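A sketch of seeded generation, where n records are seeded with the value "seed" (the "class" column name is illustrative, and gretel is assumed to be an existing client instance):

```python
import pandas as pd

def generate_seeded(gretel, model_id, n=50):
    """Conditionally generate n records whose "class" field equals "seed"."""
    seed = pd.DataFrame({"class": ["seed"] * n})
    return gretel.submit_generate(model_id, seed_data=seed)
```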
You can still monitor the job progress in the Gretel Console.
An evaluate job analyzes the quality of synthetic data and generates the synthetic data quality report.
The submit_evaluate method submits an evaluate job based on the given evaluate config. The data source for the job is passed in using the data_source argument, the original data source is passed with ref_data, and these data sources may be a file path or pandas DataFrame:
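A sketch; the "evaluate/default" base config name is an assumption, and gretel is assumed to be an existing client instance:

```python
def evaluate_synthetic(gretel, synthetic, original):
    """Submit an evaluate job comparing synthetic data against the original."""
    return gretel.submit_evaluate(
        base_config="evaluate/default",  # assumed config name
        data_source=synthetic,           # synthetic data: path or DataFrame
        ref_data=original,               # original data: path or DataFrame
    )
```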
The test (holdout) data source is passed with the optional test_data argument; it may be a file path or pandas DataFrame:
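A sketch of an evaluate submission with a holdout set (the "evaluate/default" base config name is an assumption):

```python
def evaluate_with_holdout(gretel, synthetic, original, holdout):
    """Pass a test (holdout) data source via the optional test_data argument."""
    return gretel.submit_evaluate(
        base_config="evaluate/default",  # assumed config name
        data_source=synthetic,
        ref_data=original,
        test_data=holdout,               # file path or pandas DataFrame
    )
```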