Transform v1
Learn how to define a policy to label and transform a dataset, with support for advanced options including custom regular expression search, date shifting, and fake entity replacements.
New Version
We have a new and improved Transform model with superior performance and configurability. Please see Transform v2.
Definitions
Gretel’s transform model combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Gretel’s data classification can detect a variety of Supported Entities such as PII, which can be used for defining transforms.
Before diving in, let’s define some terms that you will see often. For a given data_source, the following primitives exist:
Record: We define a record as a single unit of information, generally. This could be a database row, an ElasticSearch document, a MongoDB document, a JSON object, etc.
Field Name: A field name is a string-based key that is found within a record. For a JSON object, this would be a property name. For a database table, this would be the column name, etc. Examples might be first-name or birth-date.
Value: A value is the actual information found within a record and is described by a Field Name. This would be like a cell in a table.
Label: A label is a tag that describes the existence of a certain type of information. Gretel has 40+ built-in labels that are generated through our classification process. Additionally, you can define custom label detectors (see below).
Field Label: A field label is the application of a label uniformly to an entire field name in a dataset. Field labels can be applied using sampling, and can be useful for classifying, for example, a database column as a specific entity. Consider a database column that contains email addresses: if you specify a field label in your transform, then after a certain number of email addresses are observed in that field, the entire field is classified as an email address.
Value Label: The application of a label directly to a value. When processing records, you can choose to inspect each record for a variety of labels.
Getting Started with Transforms
Let’s get started with a fully qualified configuration for a very simple transform use case:
I want to search records for email addresses and replace them with fake ones.
The transform policy structure has three notable sections. First, the models array will have one item that is keyed by transforms.
Within the transforms object:
A data_source is required.
There must be a policies array. Each item in this array has 2 required keys and 1 optional key:
name: The name of the policy.
rules: A list of specific data matching conditions and the actual transforms that will be done. Each rule contains the following keys:
    name: The name of the rule.
    conditions: The various ways to match data (more on this below).
    transforms: The actual mutations that will take place (more on this below).
transform_attrs (optional): Contains optional attributes to be used when configuring transforms in this policy. Currently, there are 2 attributes that can be used here, both only applicable to the fake transform:
    locales: List of locales to use when generating fake data.
    faker_seed: An integer value that is used to seed fake transforms in this policy. This seed will provide the determinism when faking. This helps ensure that a given value is faked to the same fake value consistently across all transforms.
Policies and Rules are executed sequentially and should be considered flexible containers that allow transform operations to be structured in more consumable ways.
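The structure described above can be sketched as a nested Python dict. This is only an illustration: the real configuration is a config file, the data source value and the policy and rule names are placeholders, and the transform type name fake is assumed from the way this page refers to that transform.

```python
# Sketch of the policy structure as a Python dict.
# "example.csv", "email_faker" and "fake_emails" are placeholder values.
config = {
    "models": [
        {
            "transforms": {
                "data_source": "example.csv",
                "policies": [
                    {
                        "name": "email_faker",
                        "rules": [
                            {
                                "name": "fake_emails",
                                # Search every record for email addresses
                                "conditions": {"value_label": ["email_address"]},
                                # Replace each match with a fake value
                                "transforms": [{"type": "fake"}],
                            }
                        ],
                    }
                ],
            }
        }
    ]
}


def check_policy_structure(cfg):
    """Check only the required keys that the text above describes."""
    models = cfg["models"]
    assert len(models) == 1 and "transforms" in models[0]
    transforms = models[0]["transforms"]
    assert "data_source" in transforms
    for policy in transforms["policies"]:
        assert "name" in policy and "rules" in policy
        for rule in policy["rules"]:
            assert {"name", "conditions", "transforms"} <= set(rule)
    return True
```

The optional transform_attrs key is omitted here; it would sit next to name and rules inside a policy.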
For this specific configuration, let’s take a look at the conditions and transforms objects. In this particular example, we have created the following rule contents:
Conditions is an object that is keyed by the specific matching conditions available for data matching. Each condition name (like value_label) will have a different value depending on the condition’s matching behavior. The details on each condition can be found below.
A rule must have exactly one conditions object. If you find yourself needing more conditions, then you should create additional rules for a given policy.
This particular config uses value_label, which will inspect every record for a particular entity. In this case, we are searching every record for an email address.
Next, the transforms object defines what actions will happen to the data. The transforms value is an array of objects that are keyed like:
type: The name of the transform (required).
attrs: Depending on the type, there may be specific attributes that are required or optional. These attributes are covered in the Transforms section below.
For this example, we use the consistent fake transform. The consistent fake transform will try to replace each detected entity with a fake version of the same entity type. Here we will take every detected email address and replace it with a fake one, such that a given email address is always replaced with the same fake value throughout the dataset.
Conditions
Conditions are the specific ways that data gets matched before being sent through a transform. Each rule can have one set of conditions that are keyed by the functionality of the condition. The configuration below shows all possible conditions with their respective matching values.
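As a rough illustration, a conditions object that uses every condition type named on this page might look like the Python dict below. The value shapes (lists for the name-based conditions) and the field name password are assumptions, and validate_conditions is a hypothetical helper, not part of any Gretel API.

```python
# Condition types named in this document.
KNOWN_CONDITIONS = {
    "value_label",
    "field_label",
    "field_name",
    "field_name_regex",
    "field_attributes",
}

# Illustrative conditions object; the list shapes are assumed.
conditions = {
    "value_label": ["email_address", "phone_number"],
    "field_label": ["person_name"],
    "field_name": ["password"],
    "field_name_regex": ["^email"],
    "field_attributes": {"is_id": True},
}


def validate_conditions(conds):
    """Require at least one condition type and reject unknown ones."""
    if not conds:
        raise ValueError("at least one condition type must be declared")
    unknown = set(conds) - KNOWN_CONDITIONS
    if unknown:
        raise ValueError(f"unknown condition types: {sorted(unknown)}")
    return True
```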
Let’s take a closer look at each possible condition. Remember, each rule must have exactly one conditions object, and there needs to be at least one condition type declared.
value_label: This condition will scan every record for a matching entity in the list. The value for this condition is a list of entity labels. These labels may be Gretel built-in labels or custom defined labels. When labels are found with this condition, Gretel tracks the start and end indices for where the entity exists in the value. Transforms are applied specifically to these offsets.
field_label: This condition will match on every value for a field if the entire field has been labeled as containing the particular entity. The value for this condition is an array of labels. During training, a field_label may be applied to an entire field in a couple of ways:
    If the number of records in the training data is less than 10,000 and a specific percentage of value_label entities exist for a field, that entity label will be applied to the field. By default, this cutoff is 90%. This cutoff can be defined in the label predictors object in the configuration. For example, if the training data has 8,000 records, and at least 7,200 values in the firstname field are detected as person_name, then the entire firstname field will be classified with the person_name label and every value in that field will go through transforms.
    If the number of records is greater than 10,000 and a field value has been labeled with an entity at least 10,000 times, then the field will be labeled with that specific entity.
    A field value label will only be counted towards the cutoff if the entire value contents is detected as a specific entity. Additionally, once a field_label is applied, transforms will be applied to the entire value for that field. This type of condition is particularly helpful when source data is homogenous, highly structured, and potentially very high in volume.
field_name: This will match on the exact value for a field name. It is case insensitive.
field_name_regex: This allows a regular expression to be used as a matcher for a field name. For example, using ^email will match on fields like email-address, emailAddr, etc. Note that the regex will be matched anywhere in the field name. If you want to match the whole field name, use the ^ and $ anchors (e.g. ^email$ will match email, but not email-address).
    Using field_name_regex is case sensitive. If you wish to make your expression case insensitive, you may add the (?i) flag to the expression, such as (?i)emailaddr.
field_attributes: This condition can be set to an object of boolean flags. Currently supported flags are:
    is_id: If the field values are unique and the field name contains an ID-like phrase like “id”, “key”, etc.
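Two of the conditions above can be sketched in a few lines of Python. Both functions are hypothetical helpers written for illustration, not Gretel internals. The text does not say how a dataset of exactly 10,000 records is handled, so the sketch assumes it counts as large; the regex helper assumes the matching behaves like Python's re.search.

```python
import re


def should_label_field(total_records, entity_hits, cutoff=0.9, large_threshold=10_000):
    """Decide whether a field gets a field_label under the two rules above.

    entity_hits counts values whose entire contents were detected as the entity.
    """
    if total_records < large_threshold:
        # Small dataset: compare the share of labeled values with the cutoff.
        return total_records > 0 and entity_hits / total_records >= cutoff
    # Assumption: exactly 10,000 records is treated like the "greater than" case.
    return entity_hits >= large_threshold


def field_name_regex_matches(pattern, field_name):
    """Case-sensitive match anywhere in the field name."""
    return re.search(pattern, field_name) is not None
```

With the numbers from the text, should_label_field(8000, 7200) is true because 7,200 of 8,000 is exactly the 90% cutoff. The pattern ^email matches email-address and emailAddr, while ^email$ matches only email.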
Transforms
Now that we can match data on various conditions, transforms can be applied to that matched data. The transforms object takes an array of objects that are structured with a type and attrs key. The type will always define the specific transform to apply. If that transform is configurable, the attrs key will be an object of those particular settings for the transform.
Transforms are applied in order on a best-effort basis. For example, if the first transform cannot be applied to the matched data (like if the entity label is not supported by something like fake), then the next transform in the list will be attempted, and so on.
In this simple example:
If the matched data cannot be faked, it will then attempt to be hashed (which works on all types of matched data).
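The fallback behavior can be pictured with a small Python sketch. Everything here is a stand-in: the fake function returns a placeholder string, the hash function uses a plain SHA-256 digest rather than the HMAC described later, and the convention of returning None for "cannot be applied" is an assumption made for the example.

```python
import hashlib

# Illustrative subset of labels the stand-in fake function supports.
FAKEABLE_LABELS = {"email_address", "person_name"}


def fake(value, label):
    """Return a placeholder fake, or None when the label is unsupported."""
    if label not in FAKEABLE_LABELS:
        return None
    return f"<fake {label}>"


def hash_value(value, label):
    """Hashing works on any matched data, so it always returns a result."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_first_applicable(value, label, transforms):
    """Try each transform in order and keep the first result that applies."""
    for transform in transforms:
        result = transform(value, label)
        if result is not None:
            return result
    # No transform applied: the value passes through unmodified.
    return value
```

Calling apply_first_applicable with [fake, hash_value] fakes an email address but falls through to the hash for a label the fake step does not support.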
Let’s explore the various transforms. Each description below lists the full set of available attrs for each transform and notes which ones are optional.
Fake Entity
This transform will create a fake, but consistent, version of a detected entity. There are 2 modes in which this transform works:
Auto mode (default). In this mode, the fake method is determined based on entities detected in the training dataset. This mode only works with value_label and field_label conditions since a specific entity needs to be detected. For all other conditions, no transform will take place in this mode.
Manual mode. In this mode, the fake method is specified as the attrs.method config value. See below for more details on that config parameter.
Attributes:
seed: An optional integer that is used to seed the underlying entity faker. This seed will provide the determinism when faking. This helps ensure that a given value is faked to the same fake value consistently across all transforms. If this is omitted, a random seed will be created and stored with the model. Note: specifying a seed value here will override the value specified in the policy's transform_attrs.faker_seed.
field_ref: If provided, the fake value will be the same for every value of the user_id field in this example. The first fake value for the first instance of user_id will be cached, and used for every subsequent transform. This field may also be an array of field names.
method: Name of the faker method to be used to generate the new value. See the Faker docs to see the list of available methods. If this value is omitted, or set to auto, then auto mode is used.
params: Key-value pairs representing parameters to be passed into the faker method. This attribute can only be used when method is provided. The list of available parameters can be found in the Faker docs (e.g. parameters accepted by the email method).
When using field_ref, this creates a cache of field values → fake values. So this will use memory linearly based on the number of unique field values of the referenced field.
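The caching behavior can be illustrated with a toy Python class. This is a sketch under stated assumptions, not the actual implementation: ConsistentFaker is a hypothetical name, and the fake email is built from a seeded random number instead of the Faker library.

```python
import random


class ConsistentFaker:
    """Toy stand-in for the fake transform with a field_ref cache."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._cache = {}  # referenced field value -> fake value

    def fake_email(self, ref_value):
        # The first fake generated for a referenced value is cached and reused.
        if ref_value not in self._cache:
            token = self._rng.randrange(10**8)
            self._cache[ref_value] = f"user{token:08d}@example.com"
        return self._cache[ref_value]

    def cache_size(self):
        # Grows by one entry per unique referenced value seen.
        return len(self._cache)
```

Because the cache holds one entry per unique referenced value, a field with millions of distinct values would need millions of cache entries.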
The fake transform will also use the following policy-level attributes for its configuration: locales and faker_seed, if they are provided. See above for a description of these attributes.
Currently, the following entity labels are supported:
person_name
email_address
ip_address
credit_card_number
phone_number
us_social_security_number
iban_code
domain_name
url
If a matched entity is not supported by fake, then the value will pass on to the next transform in the list, or pass through unmodified if there are no more applicable transforms to apply.
Secure Hash
This transform will compute an irreversible SHA256 hash of the matched data using an HMAC. Because of the way HMACs work, a secret is required for the hash to work. You may provide this value; if omitted, one will be created for you and stored with the model.
Attributes:
secret: An optional string that will be used as input to the HMAC.
length: Optionally trim the hash to the last X characters.
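A minimal sketch of an HMAC-SHA256 hash with optional trimming, using Python's standard hmac and hashlib modules. The function name secure_hash, the UTF-8 encoding, and the hex output format are assumptions for the example; the text above does not specify how the transform encodes its result.

```python
import hashlib
import hmac


def secure_hash(value, secret, length=None):
    """HMAC-SHA256 of value keyed by secret, optionally trimmed."""
    digest = hmac.new(
        secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    if length:
        # Keep only the last `length` characters of the hex digest.
        return digest[-length:]
    return digest
```

The same value and secret always produce the same digest, while a different secret produces a different one. That is why the secret must be stored with the model to keep hashes consistent.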
Drop
Drop the field from a record.
Character Redaction
Redact data by replacing characters with an arbitrary character. For example, the value mysecret would get redacted to XXXXXXXX.
Attributes:
char: The character to be used when redacting.
Number Shift
Shift a numerical value by adding a random amount from a range [min, max], where min and max are specified in the config.
The shift amount is an integer by default, but that can be controlled with the precision config. The number after the shift is not rounded (i.e. floating point values are preserved).
The shift amount is chosen randomly per record, unless the field_ref config is provided.
By default, min is set to -1000 and max to 1000, which means that the shifted value will be in the range [original-1000, original+1000].
Attributes:
min: The minimum amount to add to a number. Default is -1000.
max: The maximum amount to add to a number. Default is 1000.
precision: Number of digits after the decimal point to be used when generating the shift amount. Default is 0 (generated amount will be an int). For example, with precision set to 2, the shift amount could be 1.33, 2.59, etc.
field_ref: If provided, will shift consistently for every value of the start_date field in this example. The first random shift amount for the first instance of start_date will be cached, and used for every subsequent transform. This field may also be an array of field names.
field_name (deprecated): has been renamed to field_ref and will be removed in a future version of the config.
When using field_ref, this creates a cache of field values → shift values. So this will use memory linearly based on the number of unique field values.
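The shifting rules above can be sketched as follows. NumberShifter is a hypothetical class written for illustration; the parameter names min_shift and max_shift stand in for the min and max attributes, and the seed argument exists only to make the example repeatable.

```python
import random


class NumberShifter:
    """Toy number shift with an optional per-reference cache."""

    def __init__(self, min_shift=-1000, max_shift=1000, precision=0, seed=None):
        self.min_shift = min_shift
        self.max_shift = max_shift
        self.precision = precision
        self._rng = random.Random(seed)
        self._cache = {}  # referenced field value -> shift amount

    def _draw(self):
        if self.precision == 0:
            # Integer shift amount by default.
            return self._rng.randint(self.min_shift, self.max_shift)
        return round(self._rng.uniform(self.min_shift, self.max_shift), self.precision)

    def shift(self, value, ref_value=None):
        if ref_value is None:
            # No field_ref: a fresh random amount for each record.
            return value + self._draw()
        if ref_value not in self._cache:
            self._cache[ref_value] = self._draw()
        # The result is not rounded, so float inputs stay floats.
        return value + self._cache[ref_value]
```

Passing the same ref_value twice reuses the cached amount, so two records that share the referenced value are shifted by the same delta.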
Date Shift
This transform has the same configuration attrs as numbershift, with the addition of a formats key.
Attributes:
Same attributes as numbershift. The min and max values refer to the number of days to shift.
formats: A date time format that should be supported. An example would be %Y-%m-%d. The default value is infer, which will attempt to discover the correct format.
When using the default infer value for formats, this transform can be much slower than when you provide your own formats.
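With an explicit format, shifting a date by a number of days is a small operation in Python. The sketch below assumes the shift amount has already been chosen; shift_date is a hypothetical helper, and format inference is not modeled.

```python
from datetime import datetime, timedelta


def shift_date(value, days, fmt="%Y-%m-%d"):
    """Parse value with fmt, add the given number of days, and re-format."""
    parsed = datetime.strptime(value, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)
```

For example, shift_date("2021-01-31", 1) returns "2021-02-01", and the output keeps the same format as the input.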
Number Bucketing
This transform will adjust numerical values to the nearest multiple of a chosen number. For example, selecting the nearest 5 would change 53 → 50, 108 → 105, etc. (using the default min method).
Attributes:
min: The lowest number to consider when bucketing.
max: The highest number to consider when bucketing.
nearest: The nearest multiple to adjust the original value to.
method: One of min, max or avg. The default is min. This controls how to set the bucketed value. Consider the original value of 103 with a nearest value of 5:
    min: Would bucket to 100.
    max: Would bucket to 105.
    avg: Would bucket to 102.5 (since that is the average between the min and max values).
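The three methods can be reproduced with a short Python function. This is a sketch only: bucket is a hypothetical helper, the min and max range attributes are not modeled, and the handling of values that are already exact multiples (here max still moves up to the next multiple) is an assumption.

```python
import math


def bucket(value, nearest, method="min"):
    """Bucket value to a multiple of nearest using min, max or avg."""
    low = math.floor(value / nearest) * nearest  # lower edge of the bucket
    high = low + nearest                         # upper edge of the bucket
    if method == "min":
        return low
    if method == "max":
        return high
    if method == "avg":
        return (low + high) / 2
    raise ValueError(f"unknown method: {method}")
```

With a nearest value of 5, the value 103 buckets to 100, 105 or 102.5 for min, max and avg respectively, matching the description above.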
Passthrough
This transform skips any transformation, so the original value appears unchanged in the transformed data.
If you have fields that absolutely should pass through un-transformed, then we recommend that the first rule in the pipeline contain the passthrough transform exclusively. This will ensure that a field isn’t matched and transformed by subsequent rules and policies.
Custom Predictors and Data Labeling
Within the config, you may optionally specify a label_predictors object where you can define custom predictors that will create custom entity labels.
This example creates a custom regular expression for a custom User ID:
If you wish to create custom predictors, you must provide a namespace which will be used when constructing the labels used.
regex: Create your own regular expressions to match and yield custom labels. The value for this property should be an object that is keyed by the labels you wish to create. For each label you wish to create, you should provide an array of patterns. Patterns are objects consisting of:
    score: One of high, med, low. These map to floating point values of .8, .5 and .2 respectively. If omitted, the default is high.
    regex: The actual regex that will be used to match. When crafting your regex and testing it, ensure that it is compatible with Python 3.
In the example above, the namespace and the keys of the regex object are combined to create your custom labels. For the example above, the label acme/user_id will be created when a match occurs.
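How a namespace and a regex key combine into a label can be sketched in Python. The dict below follows the description on this page (labels map to an array of patterns), but the exact nesting in a real config may differ. The pattern user_\d{5} is a made-up User ID format, and find_custom_labels is a hypothetical helper.

```python
import re

# Score names and the floating point values given in the text.
SCORES = {"high": 0.8, "med": 0.5, "low": 0.2}

# Illustrative predictor definition; the User ID pattern is invented.
label_predictors = {
    "namespace": "acme",
    "regex": {
        "user_id": [
            {"score": "high", "regex": r"user_\d{5}"},
        ]
    },
}


def find_custom_labels(value, predictors):
    """Return (label, score, start, end) for every regex match in value."""
    found = []
    for key, patterns in predictors["regex"].items():
        label = f"{predictors['namespace']}/{key}"
        for pattern in patterns:
            # A missing score falls back to the default, high.
            score = SCORES[pattern.get("score", "high")]
            for match in re.finditer(pattern["regex"], value):
                found.append((label, score, match.start(), match.end()))
    return found
```

Each match carries its start and end offsets, which mirrors how value_label matches are tracked so that transforms can be applied to just that part of the value.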
You may use these custom labels when defining transforms:
Enabling NLP
Transform pipelines may be configured to use NLP for data labeling by setting the use_nlp flag to true. For example:
Enabling NLP predictions may decrease model prediction throughput by up to 70%.
For more information about using NLP models with Gretel please refer to our classification docs.