Model Configuration
Learn how to define a policy to label and transform a dataset, with support for advanced options including custom regular expression search, date shifting, and fake entity replacements.

Definitions

Gretel’s transform workflow combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Gretel’s data classification can detect a variety of entities such as PII, which can be used for defining transforms.
Before diving in, let’s define some terms that you will see often. For a given data_source, the following primitives exist:
  • Record: We define a record as a single unit of information. This could be a database row, an Elasticsearch document, a MongoDB document, a JSON object, etc.
  • Field Name: A field name is a string-based key found within a record. For a JSON object, this would be a property name; for a database table, this would be a column name. Examples might be first-name or birth-date.
  • Value: A value is the actual information found within a record and is described by a Field Name. This would be like a cell in a table.
  • Label: A label is a tag that describes the existence of a certain type of information. Gretel has 40+ built-in labels that are generated through our classification process. Additionally, you can define custom label detectors (see below).
    • Field Label: A field label is the application of a label uniformly to an entire field name in a dataset. Field labels can be applied using sampling, which can be useful for classifying, for example, a database column as a specific entity. Consider a database column that contains email addresses: if you specify a field label in your transform, then after a certain number of email addresses are observed in that field, the entire field will be classified as containing email addresses.
    • Value Label: The application of a label directly to a value. When processing records, you can configure each record to be inspected for a variety of labels.
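To make these terms concrete, here is a hypothetical record shown as a YAML mapping; the field names and values are purely illustrative:

# One record (for example, a single database row or JSON object)
first-name: "Jane"              # field name: first-name, value: "Jane"
email: "jane.doe@example.com"   # this value would carry the email_address value label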

Getting Started with Transforms

Let’s get started with a fully qualified configuration for a very simple transform use case:
I want to search records for email addresses and replace them with fake ones.
schema_version: "1.0"
name: "fake-all-emails"

models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: email_faker
          rules:
            - name: email-faker
              conditions:
                value_label:
                  - email_address
              transforms:
                - type: fake
The transform policy structure has three notable sections. First, the models array will have one item that is keyed by transforms.
Within the transforms object:
  • A data_source is required
  • There must be a policies array, each item in this array has 2 keys:
    • name: The name of the policy
    • rules: A list of specific data matching conditions and the actual transforms that will be done. Each rule contains the following keys:
      • name: The name of the rule
      • conditions: The various ways to match data (more on this below)
      • transforms: The actual mutations that will take place (more on this below).
Policies and Rules are executed sequentially and should be considered flexible containers that allow transform operations to be structured in more consumable ways.
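Schematically, the nesting described above looks like this (values elided):

models:
  - transforms:
      data_source: ...
      policies:
        - name: ...
          rules:
            - name: ...
              conditions: ...
              transforms: ...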
For this specific configuration, let’s take a look at the conditions and transforms objects. In this particular example, we have created the following rule contents:
conditions:
  value_label:
    - email_address
transforms:
  - type: fake
Conditions is an object keyed by the specific matching conditions available for data matching. Each condition name (like value_label) takes a different value depending on that condition’s matching behavior. The details of each condition can be found below.
A rule must have exactly one conditions object. If you find yourself needing more conditions, create additional rules within the policy instead.
This particular config uses value_label, which will inspect every record for a particular entity; in this case, we are searching every record for an email address.
Next, the transforms object defines what actions will happen to the matched data. The transforms value is an array of objects, each with the following keys:
  • type: The name of the transform (required)
  • attrs: Depending on the type, there may be specific attributes that are required or optional. These attributes are covered in the Transforms section below.
For this example, we use the consistent fake transform. The consistent fake transform will try to replace each detected entity with a fake version of the same entity type. Here, every detected email address is replaced with a fake one, and each unique email address maps to the same fake value throughout the dataset.
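Because each rule carries exactly one conditions object, matching on something else means adding another rule to the same policy. A minimal sketch, where the second rule (its name and the password field it matches) is illustrative:

policies:
  - name: email_faker
    rules:
      - name: email-faker
        conditions:
          value_label:
            - email_address
        transforms:
          - type: fake
      - name: password-dropper    # illustrative second rule
        conditions:
          field_name:
            - password
        transforms:
          - type: drop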

Conditions

Conditions are the specific ways that data gets matched before being sent through a transform. Each rule can have one set of conditions, keyed by the functionality of each condition. The configuration below shows all possible conditions with their respective matching values.
schema_version: "1.0"
name: "hashy-mc-hashface"

models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: hash-policy
          rules:
            - name: hash-all-the-things
              conditions:
                value_label:
                  - email_address
                  - phone_number
                field_label:
                  - email_address
                field_name:
                  - password
                field_name_regex:
                  - "^email"
                  - "^phone"
                field_attributes:
                  is_id: true
              transforms:
                - type: hash
Let’s take a closer look at each possible condition. Remember, each rule must have exactly one conditions object, and there needs to be at least one condition type declared.
  • value_label: This condition will scan every record for a matching entity in the list. The value for this condition is a list of entity labels. These labels may be Gretel built-in labels or custom defined labels. When labels are found with this condition, Gretel tracks the start and end indices for where the entity exists in the value. Transforms are applied specifically to these offsets.
  • field_label: This condition will match on every value for a field if the entire field has been labeled as containing the particular entity. The value for this condition is an array of labels. During training, a field_label may be applied to an entire field in a couple of ways:
    • If the number of records in the training data is less than 10,000 and a specific percentage of value_label entities exist for a field, that entity label will be applied to the field. By default, this cutoff is 90%. The cutoff can be configured in the label_predictors object of the configuration. For example, if the training data has 8,000 records and at least 7,200 values in the firstname field are detected as person_name, then the entire firstname field will be classified with the person_name label and every value in that field will go through transforms.
    • If the number of records is greater than 10,000 and a field’s values have been labeled with an entity at least 10,000 times, then the field will be labeled with that specific entity.
A field value will only be counted towards the cutoff if the entire value contents is detected as a specific entity. Additionally, once a field_label is applied, transforms will be applied to the entire value for that field. This type of condition is particularly helpful when source data is homogeneous, highly structured, and potentially very high in volume.
  • field_name: This will match on the exact value for a field name. It is case insensitive.
  • field_name_regex: This allows a regular expression to be used as a matcher against field names. For example, ^email will match fields like email-address, emailAddr, etc.
Matching with field_name_regex is case sensitive. If you wish to make it case insensitive, add the (?i) flag to the expression, such as (?i)emailaddr (see the sketch after this list).
  • field_attributes: This condition can be set to an object of boolean flags. Currently supported flags are:
    • is_id: If the field values are unique and the field name contains an ID-like phrase like “id”, “key”, etc.
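For example, a case-insensitive variant of the field_name_regex condition shown above might look like this (the patterns are illustrative):

conditions:
  field_name_regex:
    - "(?i)^email"
    - "(?i)^phone"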

Transforms

Now that we can match data on various conditions, transforms can be applied to that matched data. The transforms object takes an array of objects that are structured with a type and attrs key. The type will always define the specific transform to apply. If that transform is configurable, the attrs key will be an object of those particular settings for the transform.
Transforms are applied in order on a best-effort basis. For example, if the first transform cannot be applied to the matched data (such as when the entity label is not supported by fake), then the next transform in the list will be attempted, and so on.
In this simple example:
transforms:
  - type: fake
  - type: hash
If the matched data cannot be faked, it will then attempt to be hashed (which works on all types of matched data).
Let’s explore the various transforms. Each description below lists the full set of attrs available for that transform and notes which attributes are optional.

Fake Entity

This transform will create a fake, but consistent version of a detected entity. This transform only works with value_label and field_label conditions since a specific entity needs to be detected. If this transform is applied to other conditions, no transform will take place.
transforms:
  - type: fake
    attrs:
      seed: 8675309 # optional
Attributes:
  • seed: An optional integer used to seed the underlying entity faker. The seed provides determinism when faking, helping ensure that a given value is consistently faked to the same fake value across all transforms. If it is omitted, a random seed will be created and stored with the model.
Currently, the following entity labels are supported:
  • person_name
  • email_address
  • ip_address
If a matched entity is not supported by fake, then the value will pass on to the next transform in the list, or pass through unmodified if there are no more applicable transforms to apply.

Secure Hash

This transform will compute an irreversible SHA256 hash of the matched data using an HMAC. Because of the way HMACs work, a secret is required for the hash to work. You may provide this value; if omitted, one will be created for you and stored with the model.
transforms:
  - type: hash
    attrs:
      secret: 83e310729ba111eb9e74a683e7e30c8d # optional
Attributes:
  • secret: An optional string that will be used as input to the HMAC.

Drop

Drop the field from a record.
transforms:
  - type: drop

Character Redaction

Redact data by replacing characters with an arbitrary character. For example, the value mysecret would get redacted to XXXXXXXX.
transforms:
  - type: redact_with_char
    attrs:
      char: X # optional
Attributes:
  • char: The character to be used when redacting.

Number Shift

Shift a numerical value “left” or “right” by a random amount. Numbers are shifted by an integer amount (i.e. the fractional part of floating point values is preserved). The shift amount is chosen randomly per record.
transforms:
  - type: numbershift
    attrs:
      min: 10 # optional
      max: 10 # optional
      field_name: start_date # optional
Attributes:
  • max: The maximum amount to increase a number. Default is 10.
  • min: The maximum amount to decrease a number. Default is 10.
  • field_name: If provided, will shift consistently for every value of the start_date field in this example. The first random shift amount for the first instance of start_date will be cached, and used for every subsequent transform. This field may also be an array of field names.
When using field_name, a cache of field values → shift values is created, so memory use grows linearly with the number of unique field values.
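As noted above, field_name may also be an array of field names; a sketch with an illustrative second field:

transforms:
  - type: numbershift
    attrs:
      min: 10
      max: 10
      field_name:
        - start_date
        - end_date # illustrative additional field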

Date Shift

This transform has the same configuration attrs as numbershift, with the addition of a formats key.
transforms:
  - type: dateshift
    attrs:
      min: 10 # optional, number of days
      max: 10 # optional, number of days
      field_name: start_date # optional
      formats: infer # optional
Attributes:
Same attributes as numbershift; the min and max values refer to the number of days to shift.
  • formats: A date time format that should be supported, for example %Y-%m-%d. The default value is infer, which will attempt to discover the correct format.
When using the default infer value for formats, this transform can perform much more slowly than when you provide your own format.
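If your dates all share one known format, providing it explicitly avoids the slower infer path. For example, for values like 2021-04-01:

transforms:
  - type: dateshift
    attrs:
      min: 10
      max: 10
      formats: "%Y-%m-%d"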

Number Bucketing

This transform will adjust numerical values to the nearest multiple of a given bucket size. For example, bucketing to the nearest 5 would change 53 → 50, 108 → 105, etc. (using the default min method).
transforms:
  - type: numberbucket
    attrs:
      min: 0 # required
      max: 500 # required
      nearest: 5 # required
      method: min # optional
Attributes:
  • min: The lowest number to consider when bucketing
  • max: The highest number to consider when bucketing
  • nearest: The nearest multiple to adjust the original value to
  • method: One of min, max or avg. The default is min. This controls how to set the bucketed value. Consider the original value of 103 with a nearest value of 5:
    • min Would bucket to 100
    • max Would bucket to 105
    • avg Would bucket to 102.5 (since that is the average between the min and max values).
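For instance, switching the example configuration above to the avg method would bucket 103 to 102.5 instead of 100:

transforms:
  - type: numberbucket
    attrs:
      min: 0
      max: 500
      nearest: 5
      method: avg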

Passthrough

This transform performs no mutation; the matched data passes through unchanged to the transformed output.
transforms:
  - type: passthrough
If you have fields that absolutely must pass through un-transformed, we recommend that the first rule in the pipeline contain only the passthrough transform. This ensures those fields aren’t matched and transformed by subsequent rules and policies, as in the sketch below.
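A minimal sketch of this pattern, where the policy name, rule names, and the record_id field are illustrative:

policies:
  - name: protect-data
    rules:
      - name: keep-record-id        # runs first, so record_id is never transformed
        conditions:
          field_name:
            - record_id             # illustrative field to leave untouched
        transforms:
          - type: passthrough
      - name: hash-emails           # subsequent rules transform the remaining data
        conditions:
          value_label:
            - email_address
        transforms:
          - type: hash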

Custom Predictors and Data Labeling

Within the config, you may optionally specify a label_predictors object where you can define custom predictors that will create custom entity labels.
This example creates a custom regular expression for a custom User ID:
schema_version: "1.0"
name: "classify-my-data"

# ... transform model defined here ...

label_predictors:
  namespace: acme
  field_label_threshold: 0.90

  regex:
    user_id:
      # entity can be used in transforms as: acme/user_id
      patterns:
        - score: high
          regex: "user_[\\d]{8}_[A-Z]{3}"
If you wish to create custom predictors, you must provide a namespace, which will be used when constructing the resulting labels.
  • regex: Create your own regular expressions to match and yield custom labels. The value for this property should be an object that is keyed by the labels you wish to create. For each label you wish to create, you should provide an array of patterns. Patterns are objects consisting of:
    • score: One of high, med, low. These map to floating point values of 0.8, 0.5, and 0.2 respectively. If omitted, the default is high.
    • regex: The actual regex that will be used to match. When crafting your regex and testing it, ensure that it is compatible with Python 3.
In the example above, the namespace and the keys of the regex object are combined to create your custom labels. Here, the label acme/user_id will be created when a match occurs.
You may use these custom labels when defining transforms:
schema_version: "1.0"
name: "fake-and-hash"

models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: email_faker
          rules:
            - name: email-faker
              conditions:
                value_label:
                  - email_address
              transforms:
                - type: fake
            - name: user-id-hasher
              conditions:
                value_label:
                  # YOUR CUSTOM PREDICTOR
                  - acme/user_id
              transforms:
                - type: hash

Enabling NLP

Transform pipelines may be configured to use NLP for data labeling by setting the use_nlp flag to true. For example:
schema_version: "1.0"
models:
  - transforms:
      data_source: "_"
      use_nlp: true
      policies:
        - name: remove_pii
          rules:
            - name: redact_pii
              conditions:
                value_label:
                  - person_name
                  - location
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
Enabling NLP predictions may decrease model prediction throughput by up to 70%.
For more information about using NLP models with Gretel please refer to our classification docs.