At a high level, a Gretel Configuration lets you configure and deploy the following types of workloads:
Synthetic data model training
Data classification, including:
  Named Entity Recognition
  Sensitive Data Detection (API keys, secrets, etc.)
Transforming detected entities and specified fields:
  Fake entity replacement
  Field value dropping / removal
Detecting custom info types:
  Specifying your own regular expressions
  Custom keyword / phrase list detection
A Gretel Configuration can be authored with YAML or JSON. The sections below will outline the various configuration options depending on your desired use case.
A Gretel Configuration is submitted to the Gretel Cloud REST API to schedule a Job that runs tasks to train models and classify data.
Running training or classification jobs creates the following artifacts:
Synthetic data models
Synthetic data reports
Sample transformed or synthesized data
Data classification results
Each configuration file will have a standard set of key-value pairs:
schema_version: 1.0
name: "my-awesome-model"
Currently, the only schema_version supported is 1.0. The name is not required, but if provided, it is what will be displayed in your model / job listing in the Gretel Console.
If providing a name, the requirements are:
Maximum of 32 characters
Must start with a letter
May contain letters, numbers, and hyphens (-), but no consecutive hyphens
Must end with a letter or a number
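The naming rules above can be checked locally before submitting a configuration. The following is a minimal sketch only: `is_valid_name` is a hypothetical helper, not part of any Gretel library, and it assumes that "letters" means ASCII letters and that a single-letter name is acceptable.

```python
import re

# Hypothetical helper (not a Gretel API): checks a candidate `name`
# against the four rules listed above.
# Assumption: "letters" means ASCII letters A-Z and a-z.
_NAME_PATTERN = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules described above."""
    if len(name) > 32:   # maximum of 32 characters
        return False
    if "--" in name:     # no consecutive hyphens
        return False
    # Must start with a letter, contain only letters, numbers, and hyphens,
    # and end with a letter or a number.
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_name("my-awesome-model")` passes all four rules, while `"my--model"`, `"1model"`, and `"model-"` each violate one of them.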
We recommend putting both key-value pairs at the top of each configuration.
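Because a configuration can also be authored in JSON, the same two key-value pairs could be generated from Python as sketched below. This assumes the YAML value 1.0 is treated as a number rather than a string; the variable names are illustrative only.

```python
import json

# Illustrative values only: "my-awesome-model" is the example name used above.
# Assumption: schema_version is serialized as the number 1.0, mirroring the YAML.
config_header = {
    "schema_version": 1.0,
    "name": "my-awesome-model",
}

# Python dicts preserve insertion order, so both pairs stay at the top
# of the serialized document, as recommended above.
config_json = json.dumps(config_header, indent=2)
print(config_json)
```

Any further sections of the configuration would be added to the dictionary after these two keys.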
In the sections below, there will be a key called data_source. This key should specify the data you wish to train a model on or classify.
The following data sources are supported:
Project Artifacts are datasets that you upload to your project, where they are staged for use in a Model Configuration. You may upload a dataset using the REST API or through the Gretel Console within a Project scope. When creating a Project Artifact, the Gretel API will return a URL to which you should PUT your file contents. Additionally, you will receive a special Gretel Artifact Key, such as gretel_a63772a737c4412f9314fb998fa480e2_foo.csv. You may now use this key as your data_source if desired.
This upload flow can be achieved with the Gretel CLI as well:
gretel artifacts upload [--project NAME] --in-data my-training-data.csv
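For the REST-based upload, the step of PUTting the file contents to the returned URL might look like the sketch below, using only the Python standard library. The function names are hypothetical, the upload URL is whatever the Gretel API returns when the Project Artifact is created, and any headers or authentication the URL may require are not covered here.

```python
import urllib.request


def build_upload_request(upload_url: str, file_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) an HTTP PUT request carrying the file contents.

    Hypothetical helper: `upload_url` is assumed to be the URL returned by the
    API when the Project Artifact was created.
    """
    return urllib.request.Request(upload_url, data=file_bytes, method="PUT")


def upload_file(upload_url: str, path: str) -> int:
    """Read a local file, PUT it to `upload_url`, and return the HTTP status."""
    with open(path, "rb") as f:
        request = build_upload_request(upload_url, f.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Note that urllib adds a default Content-Type header to requests that carry a body; whether the upload URL expects a specific content type is not stated here, so treat that as something to verify.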
When running your own Gretel Worker you may reference local files on your system. These do not have to be added to the Model Configuration data_source but instead can be provided directly to the --in-data param of the Gretel CLI.
The following sections will assume a Project Artifact has been uploaded to Gretel Cloud and a Gretel Cloud Worker will be used. These configurations will also work with your own Gretel Workers.