Model Configurations

The heart of Gretel workflows is the Gretel Configuration. The configuration is a declarative way to describe what a Gretel Worker will do with your data.

Overview

At a high level, a Gretel Configuration lets you configure and deploy the following types of workloads:

  • Synthetic data model training

  • Data classification, including:

    • Named Entity Recognition

    • PII Detection

    • Sensitive Data Detection (API keys, secrets, etc.)

  • Data transformations

    • Transform detected entities and specified fields

      • Fake entity replacement

      • Secure hashing

      • Field value dropping / removal

  • Detect custom info types

    • Specify your own regular expressions

    • Custom keyword / phrase list detection

A Gretel Configuration can be authored with YAML or JSON. The sections below will outline the various configuration options depending on your desired use case.

A Gretel Configuration is submitted to the Gretel Cloud REST API to schedule a Job, which runs tasks to train models and classify data.

Running training or classification jobs creates the following artifacts:

  • Synthetic data models

  • Synthetic data reports

  • Transform models

  • Sample transformed or synthesized data

  • Data classification results

Each configuration file will have a standard set of key-value pairs:

schema_version: 1.0
name: "my-awesome-model"

Currently, the only supported schema_version is 1.0. The name is optional; if provided, it is displayed in your model / job listing in the Gretel Console.
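As a rough illustration of these standard keys, the sketch below parses a JSON-authored configuration and checks the two fields. The helper name check_standard_keys is hypothetical and not part of any Gretel library. Treating a missing schema_version as a problem is an assumption; the text only states that 1.0 is the supported value.

```python
import json
from typing import List

# Per the text above, 1.0 is the only supported schema version.
SUPPORTED_SCHEMA_VERSIONS = {"1.0"}


def check_standard_keys(config: dict) -> List[str]:
    """Return a list of problems found in the standard keys (empty if none)."""
    problems = []

    version = config.get("schema_version")
    if version is None:
        # Assumption: the key is expected in every configuration.
        problems.append("schema_version is missing")
    elif str(version) not in SUPPORTED_SCHEMA_VERSIONS:
        # str() lets both the number 1.0 and the string "1.0" pass.
        problems.append(f"unsupported schema_version: {version!r}")

    # The name is optional, so only check its type when it is present.
    name = config.get("name")
    if name is not None and not isinstance(name, str):
        problems.append("name must be a string when provided")

    return problems


config = json.loads('{"schema_version": 1.0, "name": "my-awesome-model"}')
print(check_standard_keys(config))
```

A YAML-authored configuration could be checked the same way once it has been parsed into a dictionary.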

If providing a name, the requirements are:

  • Maximum of 32 characters

  • Must start with a letter

  • May contain letters, numbers, and hyphens (-), but may not contain consecutive hyphens

  • Must end with a letter or a number
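The naming rules above can be checked locally before submitting a configuration. The following is a minimal sketch, not Gretel's own validation: the function name is_valid_name is hypothetical, and it assumes that "letters" means ASCII letters, which the rules do not state explicitly.

```python
import re

MAX_NAME_LENGTH = 32

# First character: a letter. Every later character is either a letter or
# digit, or a hyphen that is immediately followed by a letter or digit.
# The lookahead rules out both consecutive hyphens and a trailing hyphen.
# Assumption: only ASCII letters count as "letters".
_NAME_PATTERN = re.compile(r"[A-Za-z](?:[A-Za-z0-9]|-(?=[A-Za-z0-9]))*")


def is_valid_name(name: str) -> bool:
    """Return True if the name satisfies the four rules listed above."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["my-awesome-model", "my--model", "1model", "model-"]:
    print(candidate, is_valid_name(candidate))
```

Using a lookahead keeps all four rules in a single pattern, so the only separate check is the length limit.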

We recommend putting both key-value pairs at the top of each configuration.

Data Sources

The sections below use a key called data_source. This key specifies the data you wish to train a model on or classify.

Currently, CSV is the supported file format for a data source. The file should include a header row.
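If your records live in memory rather than in a file, a CSV with a header row can be produced with Python's standard csv module. This is only a sketch: the helper name rows_to_csv and the column names id and city are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize dictionaries to CSV text, with the header as the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()    # emits the column names as the first line
    writer.writerows(rows)  # emits one line per record
    return buffer.getvalue()


# Hypothetical records, used only to show the shape of the output.
text = rows_to_csv(
    [{"id": 1, "city": "Lima"}, {"id": 2, "city": "Oslo"}],
    ["id", "city"],
)
print(text)
```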

The following kinds of data sources are supported:

Gretel Project Artifacts

These are datasets that you upload to your project, where they are staged for use in a Model Configuration. You may upload a dataset using the REST API or through the Gretel Console within a Project scope. When you create a Project Artifact, the Gretel API returns a URL to which you should PUT your file contents. You also receive a Gretel Artifact Key, such as: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv. You can then use this key as your data_source.
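The PUT step of this flow might look like the sketch below, which uses only the Python standard library. It assumes you have already obtained the upload URL from the Gretel API; that call is not shown because its endpoint is not described here. The function names are hypothetical, and any extra headers the upload URL may require are not covered.

```python
import urllib.request


def build_upload_request(upload_url: str, payload: bytes) -> urllib.request.Request:
    """Build an HTTP PUT request that carries the file contents as its body."""
    return urllib.request.Request(upload_url, data=payload, method="PUT")


def upload_file(upload_url: str, path: str) -> int:
    """PUT a local file to the given URL and return the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_upload_request(upload_url, handle.read())
    with urllib.request.urlopen(request) as response:
        return response.status


# Example call (not executed here; the URL comes from the Gretel API response):
# status = upload_file(upload_url, "my-training-data.csv")
```

After a successful upload, the Artifact Key returned alongside the URL is the value to place in data_source.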

This upload flow can be achieved with the Gretel CLI as well:

gretel artifacts upload [--project NAME] --in-data my-training-data.csv
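When the upload is part of a script, the same command can be assembled as an argument list. This sketch uses only the flags shown above; the helper name build_upload_command is hypothetical, and actually running the command assumes the Gretel CLI is installed and configured.

```python
import shlex
from typing import List, Optional


def build_upload_command(in_data: str, project: Optional[str] = None) -> List[str]:
    """Assemble the argument list for the artifact upload command."""
    command = ["gretel", "artifacts", "upload"]
    if project is not None:
        # --project is optional, matching the [--project NAME] form above.
        command += ["--project", project]
    command += ["--in-data", in_data]
    return command


command = build_upload_command("my-training-data.csv")
print(shlex.join(command))

# To run it for real: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a file name contains spaces.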

If you are running local Gretel Workers, you do not need to create Project Artifacts. Local training data files are sent directly to the worker.

Local Files

When running your own Gretel Worker, you may reference local files on your system. These do not have to be set as the data_source in the Model Configuration; instead, they can be passed directly to the --in-data parameter of the Gretel CLI.

The following sections assume that a Project Artifact has been uploaded to Gretel Cloud and that a Gretel Cloud Worker will be used. The same configurations also work with your own Gretel Workers.