Transform v1

Learn how to define a policy to label and transform a dataset, with support for advanced options including custom regular expression search, date shifting, and fake entity replacements.

Definitions

Gretel’s transform model combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Gretel’s data classification can detect a variety of Supported Entities such as PII, which can be used for defining transforms.

Before diving in, let’s define some terms that you will see often. For a given data_source, the following primitives exist:

  • Record: Generally, we define a record as a single unit of information. This could be a database row, an Elasticsearch document, a MongoDB document, a JSON object, etc.

  • Field Name: A field name is a string-based key found within a record. For a JSON object, these would be the property names; for a database table, the column names. Examples might be first-name or birth-date.

  • Value: A value is the actual information found within a record and is described by a Field Name. This would be like a cell in a table (see the example record after this list).

  • Label: A label is a tag that describes the existence of a certain type of information. Gretel has 40+ built-in labels that are generated through our classification process. Additionally, you can define custom label detectors (see below).

    • Field Label: A field label is the application of a label uniformly to an entire field name in a dataset. Field Labels can be applied using sampling, and can be useful for classifying, for example, a database column as a specific entity. Consider a database column that contains email addresses: if you specify a field label in your transform, then after a certain number of email addresses are observed in that field, the entire field name would be classified as an email address.

    • Value Label: The application of a label directly to a value. When processing records, you can configure each record to be inspected for a variety of labels.
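
To make these terms concrete, consider a small illustrative record (the field names and values here are hypothetical). first-name and email are field names, the strings to the right of them are values, and the comments show labels that could be detected:

first-name: Ada               # a person_name value label could be detected in this value
email: ada@example.com        # an email_address value label could be detected in this value

If most values observed in the email field are detected as email addresses, the whole field could additionally receive an email_address field label.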

Getting Started with Transforms

Let’s get started with a fully qualified configuration for a very simple transform use case:

I want to search records for email addresses and replace them with fake ones.

schema_version: "1.0"
name: "fake-all-emails"


models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: email_faker
          transform_attrs:
            locales: 
              - en_US
            faker_seed: 1234
          rules:
            - name: email-faker
              conditions:
                value_label:
                  - email_address
              transforms:
                - type: fake

The transform policy structure has three notable sections. First, the models array will have one item that is keyed by transforms.

Within the transforms object:

  • A data_source is required

  • There must be a policies array; each item in this array has two required keys and one optional key:

    • name: The name of the policy

    • rules: A list of specific data matching conditions and the actual transforms that will be done. Each rule contains the following keys:

      • name: The name of the rule

      • conditions: The various ways to match data (more on this below)

      • transforms: The actual mutations that will take place (more on this below).

    • transform_attrs (optional): Contains optional attributes to be used when configuring transforms in this policy. Currently, there are two attributes that can be used here, both applicable only to the fake transform:

      • locales: List of locales to use when generating fake data.

      • faker_seed: An integer value used to seed fake transforms in this policy. This seed provides determinism when faking, helping ensure that a given value is consistently faked to the same fake value across all transforms.

Policies and Rules are executed sequentially and should be considered flexible containers that allow transform operations to be structured in more consumable ways.

For this specific configuration, let’s take a look at the conditions and transforms objects. In this particular example, we have created the following rule contents:

conditions:
  value_label:
    - email_address
transforms:
  - type: fake

Conditions is an object that is keyed by the specific matching conditions available for data matching. Each condition name (like value_label) will have a different value depending on the condition’s matching behavior. The details on each condition can be found below.

A rule must have exactly one conditions object. If you find yourself needing more conditions, then you should create additional rules for a given policy.

This particular config uses value_label, which will inspect every record for a particular entity; in this case, we are searching every record for an email address.

Next, the transforms object defines what actions will happen to the data. The transforms value is an array of objects with the following keys:

  • type: The name of the transform (required)

  • attrs: Depending on the type, there may be specific attributes that are required or optional. These attributes are covered in the Transforms section below.

For this example, we use the consistent fake transform. The consistent fake transform will try to replace each detected entity with a fake version of the same entity type. Here we will take every detected email address and replace it with a fake one, such that a given email address is replaced with the same fake value everywhere it appears in the dataset.
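
To illustrate that consistency (the values below are purely hypothetical; actual output depends on the model's seed), the same input email always maps to the same fake email:

# Illustrative only; the fake values shown are hypothetical
# original value          ->  transformed value
# jane@acme.io            ->  smithlori@example.net
# bob@acme.io             ->  jhoward@example.org
# jane@acme.io (again)    ->  smithlori@example.net   # same input, same fake output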

Conditions

Conditions are the specific ways that data gets matched before being sent through a transform. Each rule can have one set of conditions that are keyed by the functionality of the condition. The configuration below shows all possible conditions with their respective matching values.

schema_version: "1.0"
name: "hashy-mc-hashface"


models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: hash-policy
          rules:
            - name: hash-all-the-things
              conditions:
                value_label:
                  - email_address
                  - phone_number
                field_label:
                  - email_address
                field_name:
                  - password
                field_name_regex:
                  - "^email"
                  - "^phone"
                field_attributes:
                  is_id: true
              transforms:
                - type: hash

Let’s take a closer look at each possible condition. Remember, each rule must have exactly one conditions object, and there needs to be at least one condition type declared.

  • value_label: This condition will scan every record for a matching entity in the list. The value for this condition is a list of entity labels. These labels may be Gretel built-in labels or custom defined labels. When labels are found with this condition, Gretel tracks the start and end indices for where the entity exists in the value. Transforms are applied specifically to these offsets.

  • field_label: This condition will match on every value for a field if the entire field has been labeled as containing the particular entity. The value for this condition is an array of labels. During training, a field_label may be applied to an entire field in a couple of ways:

    • If the number of records in the training data is less than 10,000 and a specific percentage of value_label entities exist for a field, that entity label will be applied to the field. By default, this cutoff is 90%. This cutoff can be defined in the label predictors object in the configuration. For example, if the training data has 8,000 records, and at least 7,200 values in the firstname field are detected as person_name, then the entire firstname field will be classified with the person_name label and every value in that field will go through transforms.

    • If the number of records is greater than 10,000 and a field value has been labeled with an entity at least 10,000 times, then the field will be labeled with that specific entity.

A field value label will only be counted towards the cutoff if the entire value is detected as a specific entity. Additionally, once a field_label is applied, transforms will be applied to the entire value for that field. This type of condition is particularly helpful when source data is homogeneous, highly structured, and potentially very high in volume.

  • field_name: This will match on the exact value for a field name. It is case insensitive.

  • field_name_regex: This allows a regular expression to be used as a matcher for a field name. For example, ^email will match on field names like email-address, emailAddr, etc.

    • Note that regex will be matched anywhere in the field name. If you want to match the whole field name, use ^ and $ anchors (e.g. ^email$ will match email, but not email-address).

field_name_regex matching is case sensitive. If you wish to make the match case insensitive, add the (?i) flag to the expression, such as (?i)emailaddr. See the sketch after this list of conditions.

  • field_attributes: This condition can be set to an object of boolean flags. Currently supported flags are:

    • is_id: If the field values are unique and the field name contains an ID-like phrase like “id”, “key”, etc.
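
For example, here is a minimal rule sketch (the field names it targets are hypothetical) showing anchored, case-insensitive field_name_regex conditions combined with a hash transform:

rules:
  - name: hash-email-like-fields
    conditions:
      field_name_regex:
        - "(?i)^email$"   # matches email, EMAIL, Email, but not email-address
        - "(?i)^phone"    # matches phone, phoneNumber, Phone_Home, etc.
    transforms:
      - type: hash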

Transforms

Now that we can match data on various conditions, transforms can be applied to that matched data. The transforms object takes an array of objects that are structured with a type and attrs key. The type will always define the specific transform to apply. If that transform is configurable, the attrs key will be an object of those particular settings for the transform.

Transforms are applied in order on a best-effort basis. For example, if the first transform cannot be applied to the matched data (such as when the entity label is not supported by a transform like fake), then the next transform in the list will be attempted, and so on.

In this simple example:

transforms:
  - type: fake
  - type: hash

If the matched data cannot be faked, it will then attempt to be hashed (which works on all types of matched data).

Let’s explore the various transforms. Each description below lists the full set of available attrs options for that transform and notes which ones are optional.

Fake Entity

This transform will create a fake, but consistent, version of a detected entity. There are two modes in which this transform works:

  • Auto mode (default). In this mode, the fake method is determined based on the entities detected in the training dataset.

    • This mode only works with value_label and field_label conditions since a specific entity needs to be detected. For all other conditions, no transform will take place in this mode.

  • Manual mode. In this mode, the fake method is specified via the attrs.method config value. See below for more details on that config parameter.

transforms:
  - type: fake
    attrs:
      seed: 8675309  # optional
      field_ref: user_id  # optional
      method: email # optional
      params: # optional
        safe: False
        domain: foo.com

Attributes:

  • seed: An optional integer that is used to seed the underlying entity faker. This seed will provide the determinism when faking. This helps ensure that a given value is faked to the same fake value consistently across all transforms. If this is omitted, a random seed will be created and stored with the model.

    • Note: specifying a seed value here will override the value specified in the policy's transform_attrs.faker_seed (see the sketch below).

  • field_ref: If provided, the fake value will be the same for every value of the user_id field in this example. The first fake value for the first instance of user_id will be cached, and used for every subsequent transform. This field may also be an array of field names.

  • method: Name of the faker method to be used to generate the new value. See the Faker docs for the list of available methods. If this value is omitted, or set to auto, then auto mode is used.

  • params: Key-value pairs representing parameters to be passed into the faker method. This attribute can only be used when method is provided. The list of available parameters can be found in the Faker docs (e.g. parameters accepted by the email method).

When using field_ref, this creates a cache of field values → fake values. So this will use memory linearly based on the number of unique field values of the referenced field.

The fake transform will also use the following policy-level attributes for its configuration, if they are provided: locales and faker_seed. See above for a description of these attributes.
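
As a minimal sketch of how the policy-level faker_seed interacts with a per-transform seed (the policy, rule, and seed values here are hypothetical), the second rule below overrides the policy-level seed:

policies:
  - name: faker-policy
    transform_attrs:
      locales:
        - en_US
      faker_seed: 1234            # default seed for fake transforms in this policy
    rules:
      - name: fake-emails
        conditions:
          value_label:
            - email_address
        transforms:
          - type: fake            # uses the policy-level faker_seed (1234)
      - name: fake-names
        conditions:
          value_label:
            - person_name
        transforms:
          - type: fake
            attrs:
              seed: 8675309       # overrides transform_attrs.faker_seed for this transform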

Currently, the following entity labels are supported:

  • person_name

  • email_address

  • ip_address

  • credit_card_number

  • phone_number

  • us_social_security_number

  • iban_code

  • domain_name

  • url

If a matched entity is not supported by fake, then the value will pass on to the next transform in the list, or pass through unmodified if there are no more applicable transforms to apply.

Secure Hash

This transform will compute an irreversible SHA256 hash of the matched data using an HMAC. Because of the way HMACs work, a secret is required for the hash to work. You may provide this value; if omitted, one will be created for you and stored with the model.

transforms:
  - type: hash
    attrs:
      secret: 83e310729ba111eb9e74a683e7e30c8d  # optional
      length: 8 # optional

Attributes:

  • secret: An optional string that will be used as input to the HMAC.

  • length: Optionally trim the hash to the last X characters

Drop

Drop the field from a record.

transforms:
  - type: drop

Character Redaction

Redact data by replacing characters with an arbitrary character. For example, the value mysecret would get redacted to XXXXXXXX.

transforms:
  - type: redact_with_char
    attrs:
      char: X  # optional

Attributes:

  • char: The character to be used when redacting.

Number Shift

Shift a numerical value by adding a random amount from a range [min, max], where min and max are specified in the config. The shift amount is an integer by default, but this can be controlled with the precision config. The number after the shift is not rounded (i.e. floating point values are preserved).

The shift amount is chosen randomly per record, unless the field_ref config is provided.

By default, min is set to -1000, and max to 1000, which means that the shifted value will be in a range [original-1000, original+1000].

transforms:
  - type: numbershift
    attrs:
      min: -1000 # optional
      max: 1000 # optional
      precision: 0 # optional
      field_ref: start_date # optional

Attributes:

  • min: The minimum amount to add to a number. Default is -1000.

  • max: The maximum amount to add to a number. Default is 1000.

  • precision: Number of digits after decimal point to be used when generating the shift amount. Default is 0 (generated amount will be an int).

    • For example: with precision set to 2, shift amount could be 1.33, 2.59, etc.

  • field_ref: If provided, will shift consistently for every value of the start_date field in this example. The first random shift amount for the first instance of start_date will be cached, and used for every subsequent transform. This field may also be an array of field names.

  • field_name (deprecated): has been renamed to field_ref and will be removed in a future version of the config.

When using field_ref, this creates a cache of field values → shift values. So this will use memory linearly based on the number of unique field values.

Date Shift

This transform has the same configuration attrs as numbershift, with the addition of a formats key.

transforms:
  - type: dateshift
    attrs:
      min: -30 # optional, number of days
      max: 30 # optional, number of days
      field_ref: start_date # optional
      formats: infer # optional

Attributes:

Same attributes as numbershift. The min and max values refer to the number of days to shift.

  • formats: A datetime format that should be supported, for example %Y-%m-%d. The default value is infer, which will attempt to discover the correct format.

When using the default infer value for formats, this transform can perform much more slowly than when you provide your own format, as shown in the sketch below.
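
For example, if every date in a field already follows a single known pattern such as 2021-04-15, a sketch like the following avoids inference (assuming formats accepts an explicit format string, as described above):

transforms:
  - type: dateshift
    attrs:
      min: -30
      max: 30
      field_ref: start_date
      formats: "%Y-%m-%d"   # explicit format avoids the slower infer path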

Number Bucketing

This transform will adjust numerical values to the nearest multiple of a given number. For example, selecting the nearest 5 would change 53 → 50, 108 → 105, etc. (using the default min method).

transforms:
  - type: numberbucket
    attrs:
      min: 0 # required
      max: 500 # required
      nearest: 5 # required
      method: min # optional

Attributes:

  • min: The lowest number to consider when bucketing

  • max: The highest number to consider when bucketing

  • nearest: The nearest multiple to adjust the original value to

  • method: One of min, max or avg. The default is min. This controls how to set the bucketed value. Consider the original value of 103 with a nearest value of 5:

    • min Would bucket to 100

    • max Would bucket to 105

    • avg Would bucket to 102.5 (since that is the average between the min and max values).

Passthrough

This transform will skip any type of transform; the matched data will appear unmodified in the transformed output.

transforms:
  - type: passthrough

If you have fields that absolutely should pass through un-transformed, then we recommend that the first rule in the pipeline contain the passthrough transform exclusively. This will ensure that a field isn’t matched and transformed by subsequent rules and policies.
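
A minimal sketch of this recommendation (the field names are hypothetical): the first rule passes the id and created_at fields through untouched, and a later rule transforms whatever else matches:

policies:
  - name: transform-policy
    rules:
      - name: keep-these-fields        # first rule: passthrough only
        conditions:
          field_name:
            - id
            - created_at
        transforms:
          - type: passthrough
      - name: fake-emails              # later rules transform remaining matches
        conditions:
          value_label:
            - email_address
        transforms:
          - type: fake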

Custom Predictors and Data Labeling

Within the config, you may optionally specify a label_predictors object where you can define custom predictors that will create custom entity labels.

This example creates a custom regular expression for a custom User ID:

schema_version: "1.0"
name: "classify-my-data"

# ... transform model defined here ...

label_predictors:
  namespace: acme
  field_label_threshold: 0.90
  
  regex:
    user_id:
      # entity can be used in transforms as: acme/user_id
      patterns:
        - score: high
          regex: "user_[\\d]{8}_[A-Z]{3}"

If you wish to create custom predictors, you must provide a namespace which will be used when constructing the labels used.

  • regex: Create your own regular expressions to match and yield custom labels. The value for this property should be an object keyed by the labels you wish to create. For each label you wish to create, you should provide an array of patterns. Patterns are objects consisting of:

    • score: One of high, med, low. These map to floating point values of 0.8, 0.5, and 0.2 respectively. If omitted, the default is high.

    • regex: The actual regex that will be used to match. When crafting your regex and testing it, ensure that it is compatible with Python 3.

In the example above, the namespace and the keys of the regex object are combined to create your custom labels. For above, the label acme/user_id will be created when a match occurs.

You may use these custom labels when defining transforms:

schema_version: "1.0"
name: "fake-and-hash"

models:
  - transforms:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
      policies:
        - name: email_faker
          rules:
            - name: email-faker
              conditions:
                value_label:
                  - email_address
              transforms:
                - type: fake
            - name: user-id-hasher
              conditions:
                value_label:
                  # YOUR CUSTOM PREDICTOR
                  - acme/user_id
              transforms:
                - type: hash

Enabling NLP

Transform pipelines may be configured to use NLP for data labeling by setting the use_nlp flag to true. For example:

schema_version: "1.0"
models:
  - transforms:
      data_source: "_"
      use_nlp: true
      policies:
        - name: remove_pii
          rules:
            - name: redact_pii
              conditions: 
                value_label:
                  - person_name
                  - location
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X

Enabling NLP predictions may decrease model prediction throughput by up to 70%.

For more information about using NLP models with Gretel, please refer to our classification docs.
