Reference

Transform v2 configs consist of (optional) global parameters followed by a sequence of transformation steps. Rows, columns, and values not touched by any transformation step are maintained as-is in the output. In other words, Transform v2 configs are implicitly "passthrough".

Below is a "kitchen sink" config showing most of Transform v2's capabilities. Don't worry if it looks overwhelming. We will dissect each step in the reference below.

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities: [first_name, last_name, phone_number]
          num_samples: 3
        locales: [en_US]
        seed: 42
      steps:
        - vars:
            categories: data.Category.unique().to_list()
          columns:
            add:
              - name: PrimaryKey
                position: 0
            drop:
              - name: Email
            rename:
              - name: MiddleName
                value: MiddleInitial
          rows:
            drop:
              - condition: index % 2 == 1
            update:
              - name: PrimaryKey
                value: index
              - name: UserIdentifier
                value: this | hash | truncate(10, end="") # The default end is "..."
              - entity: [first_name, last_name]
                value: column.entity | fake
              - name: MiddleInitial
                value: fake.first_name() | first + "."
              - condition: column.name == "Category" and this | isna
                value: vars.categories | random
              - name: Phone
                value: fake(row.Country | lookup_locales).phone_number()
                fallback_value: fake.phone_number()
              - name: work_email
                value: (row.FirstName | lower | first + row.LastName | lower) | normalize + "@" + fake.domain_name()

Globals

The entire globals section is optional. You can use it to re-configure the following default entity detection and transformation settings:

  • classify: Dictionary of classification configuration parameters. Classification is performed only once for each model, and currently only maps entire columns to entities (it does not search for entities within free text fields; for that, see Named Entity Recognition). Subsequent model runs assume the schema remains unchanged and continue to use the column-to-entity mapping detected during the first run. NOTE: To perform the classification, column headers and a sample of data are sent to Gretel Navigator or a hybrid-deployed Gretel Inference LLM.

    • enable: Boolean specifying whether to perform classification. Defaults to true when running within Gretel Cloud; defaults to false otherwise. When false, sets column.entity to none for all columns. When true, classification accuracy currently necessitates sending column names and a few (equal to num_samples) randomly selected values from each column to the Gretel Cloud.

    • entities: List of PII entities that Transform's classification model will attempt to detect. Defaults to the following commonly used entities: [name, first_name, last_name, company, email, phone_number, address, street_address, city, administrative_unit, country, postcode]. For best practices around customizing this list, see Classification.

    • num_samples: Number of randomly selected values from each column to use for classification. Defaults to 3, but you can set it to a higher number for more accurate classification, or a lower number if you have privacy or security concerns with sending randomly sampled values from your dataset. Setting num_samples: 0 will use only column names as the input to classification.

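  • ner: Dictionary of Named Entity Recognition (NER) configuration parameters, such as ner_threshold and ner_optimized. See Named Entity Recognition.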

  • locales: List of default Faker locales to use for fake value generation. Defaults to ["en_US"]. fake will randomly choose a locale from this list each time it generates a fake value, except when initialized with explicit locales, e.g. fake(["fr_FR"]).first_name(). For a list of valid locales, see Faker's localized providers.

  • seed: Integer seed value used to generate fake values consistently. Defaults to null. When the seed is set to null, a random integer is generated at the beginning of each Transform v2 run and used as the seed to transform values consistently within the current run (subsequent runs will generate their own random seed). This means rerunning with a null seed can cause inconsistent transforms (e.g. Alice -> Bob for the first run, Alice -> Jane for the second). If you set the seed to a specific number, transforms will be consistent across runs (e.g. Alice -> Bob always). The seed also doubles as a salt for the hash filter. While there are privacy benefits to inconsistent transformations, we recommend setting a fixed seed for consistent transformation for use cases involving downstream synthetic data generation, or analysis on the transformed dataset.

You can also access global constants in transformation steps. For example, a transformation step with value: globals.locales | first will set that field's value to the first locale in the list of locales.
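
As a minimal sketch of the settings above, the fragment below pins the seed, lists two locales, and reads globals.locales from a step. The default_locale column name is hypothetical and only serves the illustration.

```yaml
globals:
  locales: [en_US, fr_FR]
  seed: 42                       # fixed seed: transforms stay consistent across runs
steps:
  - columns:
      add:
        - name: default_locale   # hypothetical column name
    rows:
      update:
        - name: default_locale
          value: globals.locales | first   # first locale in the list above
```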

Steps

steps contain an ordered list of data transformation actions to be executed in the same order as they are defined in the Transform v2 config.

Vars

Each step can optionally contain a vars section, which defines custom variables to be used in any Jinja expression within the step. Unlike globals, vars are scoped to an individual step, and are initialized using Jinja expressions that are evaluated at the beginning of each step.

These expressions can leverage data (a Pandas DataFrame containing the entire dataset) to implement custom aggregations. For example, the config section below creates a new percent_of_total column by storing the total in vars then dividing the value of each individual row by vars.total:

steps:
  - vars:
      total: data.subtotal.sum()
    columns:
      add:
        - name: percent_of_total
    rows:
      update:
        - name: percent_of_total
          value: row.subtotal * 100 / vars.total
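
The same pattern works for other aggregations. As a hedged sketch, assuming the subtotal column is numeric and contains some missing values, the fragment below stores the column mean in vars and uses it to fill the gaps. The variable name mean_subtotal is made up for this example.

```yaml
steps:
  - vars:
      mean_subtotal: data.subtotal.mean()   # Pandas aggregation, evaluated once per step
    rows:
      update:
        # Only touch missing values in the subtotal column
        - condition: column.name == "subtotal" and this | isna
          value: vars.mean_subtotal
```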

Columns

The columns section of each step contains transformations applying to an entire column at once. Namely: adding a new column, dropping (removing) a column, and renaming a column.

Add

You can add a new blank column by specifying its name and optional position. If position is left unspecified, the new column is added as the last column. Initially all values in the new column are null, but you can populate them using a rows.update rule. For example, the config section below adds a primary_key column, positions it as the first column in the dataset, and then populates it with the index of the row:

steps:
  - columns:
      add:
        - name: primary_key
          position: 0
    rows:
      update:
        - name: primary_key
          value: index

Drop

To drop a column, specify its name in a columns drop action. For example, the config section below drops the FirstName and LastName columns:

columns:
  drop:
    - name: FirstName
    - name: LastName

You can also drop columns based on a condition expressed as a Jinja template. condition has access to the entire Transform v2 Jinja environment, as well as a few additional objects:

  • vars: Dictionary of variables defined under the vars section of the current step. For example, vars.total refers to the value of the total variable defined above.

  • column: Dictionary containing the following column properties. For example, condition: column.entity in vars.entities_to_drop drops all columns matching the list of PII entities defined in the entities_to_drop variable.

    • name: the name or header of the column in the dataset.

    • entity: the detected PII entity type of the column, or none if the column does not match any PII entity type from the list under globals.classify.entities.

    • dtype: Pandas dtype of the column.

    • type: the detected data type of the column, one of "empty", "numeric", "categorical", "binary", "text", or "other".

    • position: zero-indexed position of the column in the dataset. For a dataset with 10 columns, column.position is equal to 0 for the first column and 9 for the last column.
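
The objects above can be combined in a single step. The sketch below drops every column classified as one of the entities listed in a variable, plus every empty column. It assumes that a list literal in vars must be quoted so that it reaches Jinja as an expression rather than being parsed as a YAML list; the entity names are only examples.

```yaml
steps:
  - vars:
      # Quoted so the list is evaluated as a Jinja expression (an assumption)
      entities_to_drop: '["email", "phone_number"]'
    columns:
      drop:
        # Drop columns whose detected entity is in the list
        - condition: column.entity in vars.entities_to_drop
        # Drop columns detected as containing no data
        - condition: column.type == "empty"
```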

Rename

You can rename a column by specifying its current name (name) and new name (value). For example, the config section below renames the MiddleName column to MiddleInitial:

columns:
  rename:
    - name: MiddleName
      value: MiddleInitial

Rows

Each step can also contain a rows section, listing transformation rules that process the dataset row by row. The two currently supported operations are drop and update, respectively allowing for selective removal of rows or modification of row data based on specified rules.

Drop

The drop operation within the rows section is used to remove rows from the dataset that meet certain conditions. The specified condition must be a valid Jinja template expression. Rows that satisfy the condition are excluded from the resulting dataset.

For instance, to exclude rows where the user_id column is empty, the configuration can be specified as follows:

rows:
  drop:
    - condition: row.user_id is none

You can use more complex Jinja expressions for conditions that involve multiple columns, logical operators, or functions. condition has access to the entire Transform v2 Jinja environment, as well as a few additional objects:

  • vars: Dictionary of variables defined under the vars section of the current step. For example, vars.total refers to the value of the total variable defined above.

  • row: Dictionary of the row's contents. For example, row.user_id refers to the value of the user_id column within that row.

  • index: Zero-based index of the row in the dataset. Note that the index of a row may change during processing if previous steps delete or add rows. For example, the rule below drops every other record from the dataset:

rows:
  drop:
    - condition: index % 2 == 1
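
Conditions can also combine several columns with logical operators. The fragment below is an illustrative sketch only: the user_id, email, country, and age columns are hypothetical, and each rule is evaluated independently for every row.

```yaml
rows:
  drop:
    # Drop rows missing either identifier
    - condition: row.user_id is none or row.email is none
    # Drop rows matching two criteria at once
    - condition: row.country == "US" and row.age < 18
```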

Update

The update operation allows you to modify the values of specific rows. It can be used to set new values for columns, generate fake data, anonymize sensitive information, or apply any transformation that can be expressed as a Jinja template.

Each update operation must contain one of name, entity, type, or condition, which are different ways to specify what to update, as well as value, which contains the updated value. name and entity must be strings or lists of strings, while condition and value are Jinja templates.

You can also optionally specify a fallback_value to be used if evaluating value throws an error. We recommend doing this when passing dynamic inputs to functions in value (for example, setting the Faker locale based on the contents of another column), preferably with a simple template (e.g. using static parameter values) for fallback_value to avoid further errors. In the event where both value and fallback_value fail to parse, the value will be set to the error message to aid with debugging.
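
The Phone rule from the kitchen sink config at the top of this page illustrates the recommendation. The value expression depends on the contents of the Country column, so it may fail for unexpected inputs, while the fallback uses no dynamic input:

```yaml
rows:
  update:
    - name: Phone
      # Dynamic input: the Faker locale is derived from another column
      value: fake(row.Country | lookup_locales).phone_number()
      # Static fallback, used only if evaluating value throws an error
      fallback_value: fake.phone_number()
```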

condition, value, and fallback_value in row update rules have access to the row drop Jinja environment including vars, row, and index, as well as a few additional objects:

  • column: Dictionary referring to the current column whose value is being changed. The properties of the column that can be accessed are:

    • name: The name of the column

    • entity: The name of an entity that is in the column

    • type: A Gretel extracted generic type for the column, one of:

      • empty

      • numeric

      • categorical

      • text

      • binary

      • other

    • dtype: The Pandas dtype of the column (object, int32, etc)

    • position: The numerical (index) position of the column in the table

  • this: Literal referring to the current value that is being changed. For example, value: this is a no-op which leaves the current value unchanged, while value: this | hash replaces the current value with its SHA-256 hash.

Here's how the update operation works with examples:

Setting a static value

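The rule below sets the value of the column named status_column to the string processed for all rows. Because value is a Jinja template, the inner double quotes are needed to make processed a string literal.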

rows:
  update:
    - name: status_column
      value: '"processed"'

Incrementing an index

In the example below, we use the index special variable to set the value of the column row_index to the index of the record in the dataset. For example, for a dataset containing 100 rows, the value of row_index for the last row will be 99.

rows:
  update:
    - name: row_index
      value: index

Generating fake PII

You can use the built-in Faker implementation to generate fake entities. See Faker's documentation for a list of supported entities and parameters.

The example below replaces values in all columns detected to contain email addresses with fake email addresses. Notice that unlike previous examples where the update rule was conditioned on name (the name of a column), the rule below is conditioned on entity (the type of entity contained within a column), which may match multiple columns. For example, if the dataset contains personal_email and work_email columns, the rule below will replace the contents of both with fake email addresses.

rows:
  update:
    - entity: email
      value: fake.email()

Modifying based on a condition

You can also conditionally update rows using flexible Jinja conditions. These conditions may match any number of columns and any number of rows (unlike name and entity conditions which apply to all rows).

For example, you can set the value of the flag_for_review column to true for all rows where the value of the amount column is greater than 1,000:

rows:
  update:
    - condition: column.name == "flag_for_review" and row.amount > 1000
      value: true
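
The examples above select columns by name, entity, or condition. As a sketch of the remaining selector, type, the rule below replaces the values of every column detected as categorical with a shortened hash, reusing the hash and truncate filters from the kitchen sink config. Whether hashing categorical values is appropriate depends on your dataset.

```yaml
rows:
  update:
    # Applies to all columns whose detected type is "categorical"
    - type: categorical
      value: this | hash | truncate(8, end="")   # end="" removes the default "..." suffix
```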

Classification

Transform v2 incorporates a classification feature to detect personally identifiable information (PII) within data. This feature simplifies selecting and transforming specific types of PII by tagging each column with its appropriate entity, if any.

Note: Column classification requires access to an LLM endpoint. When running within Gretel Cloud, this will use Gretel Navigator. For Gretel Hybrid, classification needs to use a separately deployed LLM within your cluster. For full documentation on how to set up an LLM, see Deploying an LLM.

PII detection

The classification model is capable of recognizing a variety of pre-defined and custom entities. While you can use arbitrary strings as entity names, it is beneficial to align with Faker entities if you plan to pass entity names to the fake filter in order to generate fake values of the same entity.

For example, to detect and replace phone numbers, email addresses, and International Bank Account Numbers (IBAN), include phone_number, email, and iban in the list of entities under globals.classify.entities. These exactly match Faker's phone_number(), email(), and iban() methods.

Here is an example configuration that uses classification for detecting these 3 entities and applying transformations:

globals:
  classify:
    enable: true
    entities:
      - phone_number
      - email
      - iban
steps:
  - rows:
      update:
        - entity: phone_number
          value: fake.phone_number()
        - entity: email
          value: fake.email()
        - entity: iban
          value: fake.iban()

Since these align with Faker built-in entities, we could also write a single rule that applies to all detected entities:

globals:
  classify:
    enable: true
    entities:
      - phone_number
      - email
      - iban
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake

With this setting, Transform v2 will first classify entities in the dataset, then replace detected entities with Faker-generated ones for each row in the dataset.

If your list of entities contains custom entities not supported by Faker, you can leverage fallback_value to apply other transformations. For example, the policy below attempts to fake all entities, and falls back to hashing unsupported entities. Since iban is supported by Faker while employee_id is not, the output of this policy will be fake IBAN values in the IBAN column, and hashes of the actual employee IDs in the employee ID column.

globals:
  classify:
    enable: true
    entities:
      - employee_id
      - iban
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
          fallback_value: this | hash

If instead you wish to replace unsupported entities with the entity name between angle brackets, you could set fallback_value: "<" + column.entity + ">". You could also generate custom fake values. For example, if you wanted to replace all entities not supported by Faker with the letter "E" followed by a random 6-digit number, you could set fallback_value: "E" + fake.pyint(100000, 999999) | string, or use Jinja's concatenation operator ~ which automatically converts integers to strings: fallback_value: "E" ~ fake.pyint(100000, 999999).
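
Written out as a config fragment, the last variant could look like the sketch below. Note that a YAML value beginning with a double quote is parsed as a quoted scalar, so the whole Jinja expression is wrapped in single quotes here, as in the static value example earlier on this page.

```yaml
globals:
  classify:
    enable: true
    entities:
      - employee_id
      - iban
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
          # Single quotes keep YAML from treating "E" as the entire value
          fallback_value: '"E" ~ fake.pyint(100000, 999999)'
```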

Named Entity Recognition

Similarly to column classification, Transform v2 supports flexible Named Entity Recognition (NER) functionality including the ability to detect and transform custom entity types.

To get started, list the entities to detect under the globals.classify.entities section and use one of the four built-in NER transformation filters:

  • redact_entities replaces detected entities with the entity type. For example, "I met Sally" becomes "I met <first_name>".

  • fake_entities replaces detected entities with randomly generated fake values using the Faker function corresponding to the entity type. For example, "I met Sally" could become "I met Joe". When using fake_entities, ensure the name of the entity in the globals.classify.entities section exactly matches the name of a Faker function. Entities without a matching Faker function are redacted by default, and you can customize the fallback behavior using the on_error parameter, e.g. fake_entities(on_error="hash") hashes the non-Faker-matching entities instead of redacting them.

  • hash_entities replaces detected entities with salted hashes of their value. For example, "I met Sally" may become "I met 515acf74f".

  • label_entities is similar to redact_entities, but also includes the entity value. For example, "I met Sally" becomes "I met <entity type="first_name" value="Sally">". This can be useful for downstream post-processing (such as highlighting detected entities within the original text, applying more complex replacement logic for specific entity types, etc.), both within Transform v2 and externally.

You can tweak the ner_threshold parameter if you notice too many or too few detections. You can think of the NER threshold as the level of confidence required in the model's detection before labeling an entity. Increasing the NER threshold decreases the number of detected entities, while decreasing the NER threshold increases the number of detected entities. Values between 0.5 and 0.8 are good starting points.
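
As a small sketch of these settings, the fragment below applies hash_entities to a single free text column with a threshold taken from the suggested starting range. The notes column name is hypothetical.

```yaml
globals:
  classify:
    entities:
      - first_name
      - last_name
  ner:
    ner_threshold: 0.7     # raise for fewer detections, lower for more
steps:
  - rows:
      update:
        - name: notes      # hypothetical free text column
          value: this | hash_entities
```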

The sample config below shows how to apply fake_entities (falling back to redact_entities) for a list of custom entity types across all free text fields:

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          entities:
            - name
            - ssn
            - medical_record_number
            - blood_type
        ner:
          ner_threshold: 0.2
      steps:
        - rows:
            update:
              - type: text
                value: this | fake_entities(on_error="redact")

Additionally, if you would like to speed up Named Entity Recognition by having it run on hardware with a GPU, you can set the globals.ner.ner_optimized flag to true:

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          entities:
            - name
            - ssn
            - medical_record_number
            - blood_type
        ner:
          ner_threshold: 0.2
          ner_optimized: true
      steps:
        - rows:
            update:
              - type: text
                value: this | fake_entities(on_error="redact")

Classification in Hybrid

If you are running Transform v2 in Gretel Hybrid and want to use classification, you'll need to first ensure you've installed the Gretel Inference LLM chart in your cluster. For full instructions on that installation, see Deploying an LLM.

Once you've done that, you can specify the Gretel Inference LLM model via Transform v2's globals.classify.deployed_llm_name configuration field. This name should match the gretelLLMConfig.modelName defined in the Gretel Inference LLM's values.yml.

Here's how to perform the above PII detection using mistral-7b deployed in your Gretel Hybrid Cluster:

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          deployed_llm_name: mistral-7b
          entities:
            - name
            - ssn
            - medical_record_number
            - blood_type
        ner:
          ner_threshold: 0.2
      steps:
        - rows:
            update:
              - type: text
                value: this | fake_entities(on_error="redact")

Jinja environment

Objects

Every Jinja environment in Transform v2 can access the objects below:

  • fake: Instantiation of Faker which defaults to the locale and seed specified in the globals section. You can override these defaults by passing parameters, such as fake(locale="it_IT", seed=42), which will generate data using the Italian locale and 42 as the consistency seed.

  • random: Python's random library. For example, you could call random.randint(1, 10) to generate an integer between 1 and 10 (inclusive).
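
Both objects can be used directly in value expressions. In the sketch below, the column names first_name and rating are hypothetical; the calls themselves are the ones described above.

```yaml
rows:
  update:
    - name: first_name
      # Overrides the global locale and seed for this rule only
      value: fake(locale="it_IT", seed=42).first_name()
    - name: rating
      value: random.randint(1, 10)
```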

Filters

Variables can be modified by filters. Filters are separated from the variable by a pipe symbol (|) and may have optional arguments in parentheses. Multiple filters can be chained, with the output of one filter passed to the next. Transform v2 can use any of Jinja's built-in filters, and also extends them with the following Gretel-specific filters:

  • hash: Computes the SHA-256 hash of a value. For example, this | hash returns a hash of the value in the matched column in a row update rule. It can also take in its own salt, i.e. this | hash(salt="my-salt"), but by default it uses the seed value of the run as the salt. If the seed is unset, the hash will be different for the same values across runs.

  • isna: Returns true if a value is null or missing.

  • fake: Invokes the Faker library to generate fake data of the entity that's passed to the filter. This is useful if the entity name is dynamic, e.g. column.entity | fake is equivalent to fake.first_name() if column.entity is equal to "first_name".

  • lookup_country: Attempts to map a country name to its corresponding pycountry Country.

  • lookup_locales: Maps a pycountry Country to a list of Faker locales for that country. For example "Canada" | lookup_country | lookup_locales returns ["en_CA", "fr_CA"].

  • normalize: Removes special characters and converts Unicode strings to an ASCII representation.

  • tld: Maps a pycountry Country object to its corresponding top-level domain. For example, "France" | lookup_country | tld evaluates to .fr.

  • partial_mask(prefix: int, padding: str, suffix: int): This filter is similar to the MSSQL dynamic masking partial() functionality. Given a value, this filter will retain the first N characters as the prefix, the last N characters as the suffix, and apply the padding between the prefix and suffix. If the original value is too short and would be leaked in the prefix, suffix, or a combination of the two, then the prefix and suffix are automatically adjusted to prevent this. For very short values, for example a single character value, only the padding may be returned. Example usage: value: this | partial_mask(2, "XXXXXX", 2)

  • date_parse: Takes a string value and parses it into a Python datetime object. Date formats are those supported by Python's dateutil.parser.parse method.

  • date_shift: Takes a date, either as a string or a date object, and randomly shifts it within an interval around the date. For example, "2023-01-01" | date_shift('-5y', '+5y') will result in a date object between 2018-01-01 and 2028-01-01. Supports the same interval formats as Python's faker.providers.date_time.date_between.

  • date_time_shift: Takes a date, either as a string, a date, or a datetime object, and randomly shifts it within an interval around the date. For example, "2023-01-01 00:00" | date_time_shift('-5y', '+5y') will result in a datetime between 2018-01-01 00:00 and 2028-01-01 00:00. Supports the same interval formats as Python's faker.providers.date_time.date_between.

  • date_format: Takes a date and formats it per the passed in format. The default format is "%Y-%m-%d". Supports all formats for strftime.

  • date_time_format: Takes a datetime and formats it per the passed in format. The default format is "%Y-%m-%d %H:%M:%S". Supports all formats for strftime.
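
The fragment below sketches how several of these filters might be combined in row update rules. All column names are hypothetical, and the sketch assumes that account_id holds strings and that country holds country names that lookup_country can resolve.

```yaml
rows:
  update:
    - name: account_id
      # Keep the first 2 and last 2 characters, mask the rest
      value: this | partial_mask(2, "XXXXXX", 2)
    - name: birth_date
      # Shift by up to one year either way, then format with the default "%Y-%m-%d"
      value: this | date_shift("-1y", "+1y") | date_format
    - name: customer_id
      # Explicit salt instead of the run seed
      value: this | hash(salt="my-salt")
    - name: website
      # Filters bind tighter than +, so tld is resolved before concatenation
      value: fake.domain_word() + row.country | lookup_country | tld
      fallback_value: fake.domain_name()
```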
