Model Configurations

Overview

The heart of Gretel workflows is the Gretel Configuration. The configuration is a declarative way to describe what a Gretel Worker will do with your data. At a high level, a Gretel Configuration lets you configure and deploy the following types of workloads:

  • Synthetic data model training

  • Data classification, including

    • Named Entity Recognition

    • PII Detection

    • Sensitive Data Detection (API keys, secrets, etc)

  • Transform model training

    • Transform detected entities and specified fields

      • Fake entity replacement

      • Secure hashing

      • Field value dropping / removal

  • Custom entity detectors

    • Specify your own regular expressions

    • Custom keyword / phrase list detection

A Gretel Configuration can be authored with YAML or JSON. The sections below will outline the various configuration options depending on your desired use case.

A Gretel Configuration is submitted to the Gretel Cloud REST API to schedule a Job that will run tasks to train models and classify data. The general flow is to author a configuration, upload your training data, submit the configuration to create a model, and then retrieve the resulting artifacts.

Artifacts that are created from running training or classification jobs are:

  • Synthetic data models

  • Synthetic data reports

  • Transform models

  • Sample transformed or synthesized data

  • Data classification results

Each configuration file will have a standard set of key-value pairs:

schema_version: 1.0
name: "my-awesome-model"

Currently, the only schema_version supported is 1.0. The name is not required, but if provided, it will be displayed in your model / job listing in the Gretel Console.

If providing a name, the requirements are:

  • Maximum of 32 chars

  • Must start with a letter

  • May contain letters, numbers, and -. May not have contiguous - characters.

  • Must end with a letter or a number
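
As a rough illustration only (not Gretel's actual validator), these requirements can be approximated with a regular expression:

import re

# Rough approximation of the naming rules above; not Gretel's actual validator.
NAME_RE = re.compile(r"^(?=.{1,32}$)[A-Za-z](?:-?[A-Za-z0-9])*$")

print(bool(NAME_RE.match("my-awesome-model")))  # True
print(bool(NAME_RE.match("my--model")))         # False: contiguous "-" characters
print(bool(NAME_RE.match("2fast2furious")))     # False: must start with a letter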

We recommend putting both key-value pairs at the top of each configuration.

Data Sources

In the sections below, there will be a key called data_source. This key should specify the data you wish to train a model on or classify.

Currently, we support CSVs as your data source. Headers should be included.

The following data sources are supported:

Gretel Project Artifacts

These are datasets you may upload to your project which are staged for use in a Model Configuration. You may upload a dataset using the REST API or through the Gretel Console within a Project scope. When creating a Project Artifact, the Gretel API will return a URL that you should PUT your file contents to. Additionally, you will receive a special Gretel Artifact Key, such as: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv. You may now use this key as your data_source if desired.
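
For example, uploading the file contents to the returned URL might look like the following sketch using the Python requests library. The upload_url value is a placeholder for the signed URL returned by the API; the exact response shape is not shown here.

import requests

# Placeholder for the signed upload URL returned when creating a Project Artifact.
upload_url = "https://example.com/signed-upload-url"

with open("my-training-data.csv", "rb") as f:
    resp = requests.put(upload_url, data=f)
resp.raise_for_status()  # the artifact key can now be used as your data_source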

This upload flow can be achieved with the Gretel CLI as well:

gretel artifacts upload [--project NAME] --in-data my-training-data.csv

If you are running local Gretel Workers, you will not need to create Project Artifacts. Local training data files will be sent directly to the worker.

Local Files

When running your own Gretel Worker you may reference local files on your system. These do not have to be added to the Model Configuration data_source but instead can be provided directly to the --in-data param of the Gretel CLI.

The following sections will assume a Project Artifact has been uploaded to Gretel Cloud and a Gretel Cloud Worker will be used. These configurations will also work with your own Gretel Workers.

Synthetic Data Models

To train a synthetic data model, we can add the models object to our configuration. The models object takes a list of keyed objects, named by the type of model we wish to train. For a synthetic data model, we use synthetics. The minimal configuration required is below:

schema_version: 1.0
name: "my-awesome-model"
models:
  - synthetics:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv

It is assumed that a project artifact was already uploaded for this particular configuration. By default, no extra objects or parameters are required. Gretel uses default settings that will work well for a variety of datasets.

Next, let’s explore the additional parameters that can be used. The configuration below contains additional options for training a synthetic model, with the default options displayed.

schema_version: 1.0
name: "my-awesome-model"
models:
  - synthetics:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      params:
        epochs: 100
        field_delimiter: null
        batch_size: null
        vocab_size: null
        reset_states: null
        learning_rate: null
        rnn_units: null
        dropout_rate: null
      validators:
        in_set_count: 10
        pattern_count: 10
      generate:
        num_records: 5000
        max_invalid: 1000

There are three primary sections here to be aware of.

Synthetic Model Parameters

The params object contains key-value pairs that represent the available parameters that will be used to train a machine learning model on the data_source. By default, Gretel will start with 100 epochs and automatically stop training when attributes like model loss and accuracy stop improving.

The field_delimiter parameter is a single character that serves as the delimiter between fields in your training data. If this value is null (the default), then Gretel will automatically detect and use a delimiter.

The remaining params default to null. These are the same parameters available in Gretel’s Open Source synthetics package; when one of these parameters is null, the default value from the Open Source package will be used. This helps ensure a similar experience when switching between open source and Gretel Cloud.

Please see our open source documentation for a description of the other parameters.

The default model parameters are well suited for a variety of data. However, if your synthetic quality score or data quality is not optimal, please see our deep dives on parameter tuning and data pre-processing to increase performance.

Also, Gretel has configuration templates that may be helpful as starting points for creating your model.

Data Validators

Gretel provides automatic data semantic validation when generating synthetic data. When a model is trained, the following validator models are built, automatically, on a per-field basis:

  • Character Set: For categorical and other string values, the underlying character sets are learned. For example, if the values in a field are all hexadecimal, then the generated data will only contain [0-9a-f] values. This validator is case sensitive.

  • String Length: For categorical and other string values, the maximum and minimum string lengths of a field’s values are learned. When generating data, the generated values will be between the minimum and maximum lengths.

  • Numerical Ranges: For numerical fields, the minimum and maximum values are learned. During generation, numerical values will fall within these learned ranges. The following numerical types are learned and enforced: Float, Base2, Base8, Base10, Base16.

  • Field Data Types: For fields that are entirely integers, strings, and floats, the generated data will be of the data type assigned to the field. If a field has mixed data types, then the field may be one of any of the data types, but the above value-based validators will still be enforced.

The validators above are automatic. They ensure that individual values in the synthetic data stay within the basic semantic constraints of the training data.

In addition to the built-in validators, Gretel offers advanced validators that can be managed in the configuration:

  • in_set_count: This validator accumulates all of the unique values in a field. If the cardinality of the field’s values is less than or equal to this setting, then the validator will enforce that generated values are in the set of training values. If the cardinality of the field’s values is greater than the setting, the validator will have no effect.

    • For example, if there is a field called US-State with a cardinality of 50 and in_set_count is set to 50, then during generation each value for this field must be one of the original values. If in_set_count were set to 40, the generated values for this field would not be restricted.

  • pattern_count: This validator builds a pattern mask for each value in a field. Alphanumeric characters are masked, while other special characters are retained. For example, 867-5309 is masked to ddd-dddd, and f32-sk-39d is masked to add-aa-dda, where a represents any A-Za-z character (see the sketch below). Much like the previous validator, if the cardinality of learned patterns is less than or equal to the setting, patterns will be enforced during data generation. If the unique pattern count is above the setting, no enforcement takes place.

By default, both of the above settings are set to 10.

Very high settings for these validators will require higher memory usage.
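
To make the pattern masking concrete, here is a rough sketch of the masking rule described above (a hypothetical helper, not Gretel's implementation):

# Digits become "d", letters become "a", other characters are kept as-is.
def pattern_mask(value: str) -> str:
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)

print(pattern_mask("867-5309"))    # ddd-dddd
print(pattern_mask("f32-sk-39d"))  # add-aa-dda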

Data Generation

After a Synthetic Data model is trained, a synthetic dataset will be created. This dataset will be used to generate the Synthetic Data Report and create a sample synthetic dataset that you can explore on your own.

The number of records to generate is controlled by the generate.num_records key in the synthetic config. The default value is 5000 records.

Additionally, you may specify a generate.max_invalid setting. This is the number of records that can fail the Data Validation process before generation is stopped, which helps govern a long-running generation process where the model is not producing optimal data. The default value will be five times the number of records being generated.

If your training data is less than 5000 records, we recommend setting the generate.num_records value to null. If this value is null, then the number of records generated will be the lesser of 5000 or the total number of records in the training data.

Once your model is trained, you can retrieve the model artifacts (synthetic report, model archive, and sample data) and schedule the generation of larger datasets.

Transform Models

Gretel’s transform workflow combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Gretel’s data classification can detect a variety of entities such as PII, which can be used for defining transforms.

Before diving in, let’s define some terms that you will see often. For a given data_source, the following primitives exist:

  • Record: We define a record as a single unit of information. This could be a database row, an Elasticsearch document, a MongoDB document, a JSON object, etc. (See the small example after this list.)

  • Field Name: A field name is a string-based key found within a record. For a JSON object, this would be a property name; for a database table, a column name; and so on. Examples might be first-name or birth-date.

  • Value: A value is the actual information found within a record and is described by a Field Name. This would be like a cell in a table.

  • Label: A label is a tag that describes the existence of a certain type of information. Gretel has several built-in labels that are generated through our classification process. Additionally, you can define custom label detectors (see below).

    • Field Label: A field label is the application of a label uniformly to an entire field name in a dataset. Field labels can be applied using sampling and are useful for classifying, for example, a database column as a specific entity. Consider a database column that contains email addresses: if you specify a field label in your transform, then after a certain number of email addresses are observed in that field, the entire field name is classified as an email address.

    • Value Label: The application of a label directly to a value. When processing records, you can configure which labels each record should be inspected for.
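
For example, a single record might look like the following (illustrative only):

# A single record (for example, one database row or one JSON object).
record = {
    "first-name": "Jane",             # "first-name" is a field name, "Jane" is a value
    "email": "jane.doe@example.com",  # this value could be tagged with the email_address label
}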

Getting Started with Transforms

Let’s get started with a fully qualified configuration for a very simple transform use case:

I want to search records for email addresses and replace them with fake ones.

schema_version: 1.0
name: "fake-all-emails"
models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: email_faker
          rules:
            - name: email-faker
              conditions:
                value_label:
                  - email_address
              transforms:
                - type: fake

The transform policy structure has three notable sections. First, the models array will have one item that is keyed by transforms.

Within the transforms object:

  • A data_source is required

  • There must be a policies array; each item in this array has two keys:

    • name: The name of the policy

    • rules: A list of specific data matching conditions and the actual transforms that will be done. Each rule contains the following keys:

      • name: The name of the rule

      • conditions: The various ways to match data (more on this below)

      • transforms: The actual mutations that will take place (more on this below).

Policies and Rules are executed sequentially and should be considered flexible containers that allow transform operations to be structured in more consumable ways.

For this specific configuration, let’s take a look at the conditions and transforms objects. In this particular example, we have created the following rule contents:

conditions:
  value_label:
    - email_address
transforms:
  - type: fake

Conditions is an object keyed by the specific matching conditions available for data matching. Each condition name (like value_label) takes a different value depending on the condition’s matching behavior. The details for each condition can be found below.

A rule must have exactly one conditions object. If you find yourself needing more conditions, you should create additional rules within a given policy.

This particular config uses value_label, which will inspect every record for a particular entity; in this case, we are searching every record for an email address.

Next, the transforms object defines what actions will happen to the data. The transforms value is an array of objects that are keyed like:

  • type: The name of the transform (required)

  • attrs: Depending on the type, there may be specific attributes that are required or optional. These attributes are covered in the Transforms section below.

For this example, we use the consistent fake transform. This transform will try to replace each detected entity with a fake version of the same entity type. Here, every detected email address is replaced with a fake one, and a given email address is always replaced by the same fake value throughout the dataset.
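
The "consistent" behavior can be illustrated with a small sketch (hypothetical code, not Gretel's implementation) using the open source Faker library:

from faker import Faker

Faker.seed(8675309)   # a fixed seed keeps replacements reproducible across runs
faker = Faker()
fake_emails = {}      # original email -> fake email

def fake_email(original: str) -> str:
    # The same original value always maps to the same fake replacement.
    if original not in fake_emails:
        fake_emails[original] = faker.email()
    return fake_emails[original]

assert fake_email("jane.doe@example.com") == fake_email("jane.doe@example.com")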

Conditions

Conditions are the specific ways that data gets matched before being sent through a transform. Each rule can have one set of conditions, keyed by the functionality of the condition. The configuration below shows all possible conditions with their respective matching values.

schema_version: 1.0
name: "hashy-mc-hashface"
models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: hash-policy
          rules:
            - name: hash-all-the-things
              conditions:
                value_label:
                  - email_address
                  - phone_number
                field_label:
                  - email_address
                field_name:
                  - password
                field_name_regex:
                  - "^email"
                  - "^phone"
                field_attributes:
                  is_id: true
              transforms:
                - type: hash

Let’s take a closer look at each possible condition. Remember, each rule must have exactly one conditions object, and there needs to be at least one condition type declared.

  • value_label: This condition will scan every record for a matching entity in the list. The value for this condition is a list of entity labels. These labels may be Gretel built-in labels or custom defined labels. When labels are found with this condition, Gretel tracks the start and end indices for where the entity exists in the value. Transforms are applied specifically to these offsets.

  • field_label: This condition will match on every value for a field if the entire field has been labeled as containing the particular entity. The value for this condition is an array of labels. During training, a field_label may be applied to an entire field in a couple of ways:

    • If the number of records in the training data is less than 10,000 and a specific percentage of value_label entities exist for a field, that entity label will be applied to the field. By default, this cutoff is 90%; it can be changed via the label predictors object in the configuration. For example, if the training data has 8,000 records and at least 7,200 values in the firstname field are detected as person_name, then the entire firstname field will be classified with the person_name label and every value in that field will go through transforms.

    • If the number of records is greater than 10,000 and a field value has been labeled with an entity at least 10,000 times, then the field will be labeled with that specific entity.

A field value label will only count toward the cutoff if the entire value is detected as a specific entity. Additionally, once a field_label is applied, transforms will be applied to the entire value for that field. This type of condition is particularly helpful when source data is homogeneous, highly structured, and potentially very high in volume.

  • field_name: This will match on the exact value for a field name. It is case insensitive.

  • field_name_regex: This allows a regular expression to be used as a matcher for a field name. For example, ^email will match fields like email-address, emailAddr, etc. Regex matches are case insensitive (see the sketch after this list).

  • field_attributes: This condition can be set to an object of boolean flags. Currently supported flags are:

    • is_id: If the field values are unique and the field name contains an ID-like phrase like “id”, “key”, etc.
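
As a quick illustration of how field_name_regex matches field names case-insensitively, here is a rough Python sketch (not Gretel's implementation):

import re

name_matcher = re.compile(r"^email", re.IGNORECASE)

print(bool(name_matcher.match("email-address")))  # True
print(bool(name_matcher.match("emailAddr")))      # True
print(bool(name_matcher.match("work_email")))     # False: pattern is anchored to the start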

Transforms

Now that we can match data on various conditions, transforms can be applied to that matched data. The transforms object takes an array of objects that are structured with a type and attrs key. The type will always define the specific transform to apply. If that transform is configurable, the attrs key will be an object of those particular settings for the transform.

Transforms are applied in order on a best-effort basis. For example, if the first transform cannot be applied to the matched data (such as when the entity label is not supported by fake), then the next transform in the list will be attempted, and so on.

In this simple example:

transforms:
  - type: fake
  - type: hash

If the matched data cannot be faked, it will then be hashed (hashing works on all types of matched data).

Let’s explore the various transforms. Each description below lists the full set of attrs available for the transform and notes which ones are optional.

Fake Entity

This transform will create a fake but consistent version of a detected entity. This transform only works with value_label and field_label conditions, since a specific entity needs to be detected. If this transform is applied to other conditions, no transform will take place.

transforms:
  - type: fake
    attrs:
      seed: 8675309 # optional

Attributes:

  • seed: An optional integer used to seed the underlying entity faker. The seed provides determinism when faking, helping ensure that a given value is faked to the same fake value consistently across all transforms. If omitted, a random seed will be created and stored with the model.

Currently, the following entity labels are supported:

  • person_name

  • email_address

  • ip_address

If a matched entity is not supported by fake, then the value will pass on to the next transform in the list, or pass through unmodified if there are no more applicable transforms to apply.

Secure Hash

This transform will compute an irreversible SHA256 hash of the matched data using an HMAC. Because of the way HMACs work, a secret key is required. You may provide this value; if omitted, one will be created for you and stored with the model.

transforms:
  - type: hash
    attrs:
      secret: 83e310729ba111eb9e74a683e7e30c8d # optional

Attributes:

  • secret: An optional string that will be used as input to the HMAC.
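
For intuition, the general idea of an HMAC-SHA256 hash can be sketched with Python's standard library (this is not Gretel's exact implementation):

import hashlib
import hmac

secret = b"83e310729ba111eb9e74a683e7e30c8d"   # the optional "secret" attribute
value = b"jane.doe@example.com"                # the matched data

digest = hmac.new(secret, value, hashlib.sha256).hexdigest()
print(digest)  # irreversible, but stable for the same secret and value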

Drop

Drop the field from a record.

transforms:
  - type: drop

Character Redaction

Redact data by replacing characters with an arbitrary character. For example, the value mysecret would be redacted to XXXXXXXX.

transforms:
  - type: redact_with_char
    attrs:
      char: X # optional

Attributes:

  • char: The character to use when redacting.

Number Shift

Shift a numerical value “left” or “right” by a random amount. Numbers are shifted by an integer amount (i.e., the fractional part of floating point values is preserved). The shift amount is chosen randomly per record.

transforms:
  - type: numbershift
    attrs:
      min: 10 # optional
      max: 10 # optional
      field_name: start_date # optional

Attributes:

  • max: The maximum amount to increase a number. Default is 10.

  • min: The maximum amount to decrease a number. Default is 10.

  • field_name: If provided, values will be shifted consistently for every value of the named field (start_date in this example). The first random shift amount for the first instance of start_date is cached and used for every subsequent transform. This field may also be an array of field names.

When using field_name, a cache of field values → shift values is created, so memory usage grows linearly with the number of unique field values.
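
The consistent-shift caching described above might look roughly like this (a hypothetical sketch, not Gretel's implementation; shift_cache and shift_number are illustrative names):

import random

shift_cache = {}  # field value -> shift amount; memory grows with unique keys

def shift_number(value, key, min_shift=10, max_shift=10):
    # Reuse the same random shift for every record that shares this key.
    if key not in shift_cache:
        shift_cache[key] = random.randint(-min_shift, max_shift)
    return value + shift_cache[key]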

Date Shift

This transform has the same configuration attrs as numbershift, with the addition of a formats key.

transforms:
  - type: dateshift
    attrs:
      min: 10 # optional, number of days
      max: 10 # optional, number of days
      field_name: start_date # optional
      formats: infer # optional

Attributes:

Same attributes as numbershift; here the min and max values refer to the number of days to shift.

  • formats: A date time format that should be supported. An example would be %Y-%m-%d. The default value is infer which will attempt to discover the correct format.

When using the default infer value for formats, this transform can perform much more slowly than when you provide your own formats.

Number Bucketing

This transform will adjust numerical values to a multiple of the nearest setting. For example, with nearest set to 5, 53 → 50 and 108 → 105, etc. (using the default min method).

transforms:
  - type: numberbucket
    attrs:
      min: 0 # required
      max: 500 # required
      nearest: 5 # required
      method: min # optional

Attributes:

  • min: The lowest number to consider when bucketing

  • max: The highest number to consider when bucketing

  • nearest: The nearest multiple to adjust the original value to

  • method: One of min, max, or avg. The default is min. This controls how the bucketed value is set. Consider an original value of 103 with a nearest value of 5 (see the sketch after this list):

    • min Would bucket to 100

    • max Would bucket to 105

    • avg Would bucket to 102.5 (since that is the average between the min and max values).
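
To make the arithmetic concrete, here is a rough sketch of the bucketing methods above (hypothetical code, not Gretel's implementation; it ignores the min/max clamping for brevity):

def bucket(value, nearest, method="min"):
    low = (value // nearest) * nearest   # nearest multiple below the value
    high = low + nearest                 # nearest multiple above the value
    if method == "min":
        return low
    if method == "max":
        return high
    return (low + high) / 2              # "avg"

print(bucket(103, 5))          # 100
print(bucket(103, 5, "max"))   # 105
print(bucket(103, 5, "avg"))   # 102.5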

Passthrough

This transform skips any transformation; the matched data passes through unmodified and appears as its original value in the transformed output.

transforms:
  - type: passthrough

If you have fields that absolutely must pass through un-transformed, we recommend that the first rule in the pipeline contain only the passthrough transform. This ensures that a field isn’t matched and transformed by subsequent rules and policies.

Custom Predictors and Data Labeling

Within the config, you may optionally specify a label_predictors object where you can define custom predictors that will create custom entity labels.

This example creates a custom regular expression for a custom User ID:

schema_version: 1.0
name: "classify-my-data"
# ... transform model defined here ...
label_predictors:
  namespace: acme
  field_label_threshold: 0.90
  regex:
    user_id:
      # entity can be used in transforms as: acme/user_id
      patterns:
        - score: high
          regex: "user_[\\d]{8}_[A-Z]{3}"

If you wish to create custom predictors, you must provide a namespace, which will be used when constructing the resulting labels.

  • regex: Create your own regular expressions to match and yield custom labels. The value for this property should be an object keyed by the labels you wish to create. For each label, provide an array of patterns. Patterns are objects consisting of:

    • score: One of high, med, low. These map to floating point values of 0.8, 0.5, and 0.2, respectively. If omitted, the default is high.

    • regex: The actual regex that will be used to match. When crafting your regex and testing it, ensure that it is compatible with Python 3.

In the example above, the namespace and the keys of the regex object are combined to create your custom labels. For above, the label acme/user_id will be created when a match occurs.
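
Since patterns should be compatible with Python 3, you can sanity-check a pattern like the one above before adding it to your configuration:

import re

user_id = re.compile(r"user_[\d]{8}_[A-Z]{3}")

print(bool(user_id.search("user_12345678_ABC")))  # True: would be labeled acme/user_id
print(bool(user_id.search("user_1234_ABC")))      # False: no match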

You may use these custom labels when defining transforms:

schema_version: 1.0
name: "fake-and-hash"
models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: email_faker
          rules:
            - name: email-faker
              conditions:
                value_label:
                  - email_address
              transforms:
                - type: fake
            - name: user-id-hasher
              conditions:
                value_label:
                  # YOUR CUSTOM PREDICTOR
                  - acme/user_id
              transforms:
                - type: hash