Reference

Transform v2 configs consist of (optional) global parameters followed by a sequence of transformation steps. Rows, columns, and values not touched by any transformation step are maintained as-is in the output. In other words, Transform v2 configs are implicitly "passthrough".

Below is a "kitchen sink" config showing most of Transform v2 capabilities. Don't worry if it looks overwhelming. We will dissect each step in the reference below.

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities: [first_name, last_name, phone_number]
          num_samples: 3
        locales: [en_US]
        seed: 42
      steps:
        - vars:
            categories: data.Category.unique().to_list()
          columns:
            add:
              - name: PrimaryKey
                position: 0
            drop:
              - name: Email
            rename:
              - name: MiddleName
                value: MiddleInitial
          rows:
            drop:
              - condition: index % 2 == 1
            update:
              - name: PrimaryKey
                value: index
              - name: UserIdentifier
                value: this | hash | truncate(10)
              - entity: [first_name, last_name]
                value: column.entity | fake
              - name: MiddleInitial
                value: fake.first_name() | first + "."
              - condition: column.name == "Category" and this | isna
                value: vars.categories | random
              - name: Phone
                value: fake(row.Country | lookup_locales).phone_number()
                fallback_value: fake.phone_number()
              - name: work_email
                value: (row.FirstName | lower | first + row.LastName | lower) | normalize + "@" + fake.domain_name()

Globals

The entire globals section is optional. You can use it to re-configure the following default entity detection and transformation settings:

  • classify: Dictionary of classification configuration parameters. Note that classification is performed only once per model, and currently maps only entire columns to entities (searching for entities within free-text fields, similar to Transform's use_nlp option, is not currently supported in Transform v2). Subsequent model runs assume the schema is unchanged, and continue to use the column-to-entity mapping detected during the first run.

    • enable: Boolean specifying whether to perform classification. Defaults to false, which sets column.entity to none for all columns. When true, accurate classification currently requires sending column names and a few (equal to num_samples) randomly selected values from each column to the Gretel Cloud.

    • entities: List of PII entities that Transform's classification model will attempt to detect. Defaults to the following commonly used entities: [name, first_name, last_name, company, email, phone_number, address, street_address, city, administrative_unit, country, postcode]. For best practices around customizing this list, see Classification.

    • num_samples: Number of randomly selected values from each column to use for classification. Defaults to 3, but you can set it to a higher number for more accurate classification, or a lower number if you have privacy or security concerns with sending randomly sampled values from your dataset. Setting num_samples: 0 will use only column names as the input to classification.

  • locales: List of default Faker locales to use for fake value generation. Defaults to ["en_US"]. fake will randomly choose a locale from this list each time it generates a fake value, except when initialized with explicit locales, e.g. fake(["fr_FR"]).first_name(). For a list of valid locales, see Faker's localized providers.

  • seed: Integer seed value used to generate fake values consistently. Defaults to session. Given the same seed value and the same input, fake generates the same output throughout the dataset and across multiple sessions (for example, "Alice" may be transformed to "Bob" in all records). When the seed is set to null, fake value transformations are not consistent, even within a single session (for example, "Alice" may be transformed to "Bob" in one record, and "Eve" in another). When the seed is set to session, a random integer is generated at the beginning of each Transform v2 run and used as the seed, so values are transformed consistently within the current run (subsequent runs generate their own random seed). While there are privacy benefits to inconsistent transformations, we recommend setting a fixed seed for use cases involving downstream synthetic data generation, or analysis on the transformed dataset.

You can also define global constants in the globals section, which you can access in any step. For example, if you define company: "Acme Inc." under globals, a transformation step with value: globals.company will set that field's value to "Acme Inc.".
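Putting that together, a sketch of a globals section defining a custom constant and using it in a step (the employer column name is hypothetical) might look like:

```yaml
globals:
  locales: [en_US]
  seed: 42
  # Custom global constant, accessible from any step
  company: "Acme Inc."
steps:
  - rows:
      update:
        # Hypothetical column; set every row's value to the global constant
        - name: employer
          value: globals.company
```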

Steps

steps contains an ordered list of data transformation actions, executed in the order they are defined in the Transform v2 config.

Vars

Each step can optionally contain a vars section, which defines custom variables to be used in any Jinja expression within the step. Unlike globals, vars are scoped to an individual step, and are initialized using Jinja expressions that are evaluated at the beginning of each step.

These expressions can leverage data (a Pandas DataFrame containing the entire dataset) to implement custom aggregations. For example, the config section below creates a new percent_of_total column by storing the total in vars then dividing the value of each individual row by vars.total:

steps:
  - vars:
      total: data.subtotal.sum()
    columns:
      add:
        - name: percent_of_total
    rows:
      update:
        - name: percent_of_total
          value: row.subtotal * 100 / vars.total

Columns

The columns section of each step contains transformations applying to an entire column at once. Namely: adding a new column, dropping (removing) a column, and renaming a column.

Add

You can add a new blank column by specifying its name and an optional position. If position is left unspecified, the new column is added as the last column. Initially all values in the new column are null, but you can populate them using a rows update rule. For example, the config section below adds a primary_key column, positions it as the first column in the dataset, and then populates it with the index of each row:

steps:
  - columns:
      add:
        - name: primary_key
          position: 0
    rows:
      update:
        - name: primary_key
          value: index

Drop

To drop a column, specify its name in a columns drop action. For example, the config section below drops the FirstName and LastName columns:

columns:
  drop:
    - name: FirstName
    - name: LastName

Rename

You can rename a column by specifying its current name (name) and new name (value). For example, the config section below renames the MiddleName column to MiddleInitial:

columns:
  rename:
    - name: MiddleName
      value: MiddleInitial

Rows

Each step can also contain a rows section, listing transformation rules that process the dataset row by row. The two currently supported operations are drop and update, respectively allowing for selective removal of rows or modification of row data based on specified rules.

Drop

The drop operation within the rows section is used to remove rows from the dataset that meet certain conditions. The specified condition must be a valid Jinja template expression. Rows that satisfy the condition are excluded from the resulting dataset.

For instance, to exclude rows where the user_id column is empty, the configuration can be specified as follows:

rows:
  drop:
    - condition: row.user_id is none

You can use more complex Jinja expressions for conditions that involve multiple columns, logical operators, or functions. condition has access to the entire Transform v2 Jinja environment, as well as a few additional objects:

  • vars: Dictionary of variables defined under the vars section of the current step. For example, vars.total refers to the value of the total variable defined above.

  • row: Dictionary of the row's contents. For example, row.user_id refers to the value of the user_id column within that row.

  • index: Zero-based index of the row in the dataset. Note that the index of a row may change during processing if previous steps delete or add rows. For example, the rule below drops every other record from the dataset:

rows:
  drop:
    - condition: index % 2 == 1

Update

The update operation allows you to modify the values of specific rows. It can be used to set new values for columns, generate fake data, anonymize sensitive information, or apply any transformation that can be expressed as a Jinja template.

Each update operation must contain one of name, entity, or condition, which are different ways to specify what to update, as well as value, which contains the updated value. name and entity must be strings or lists of strings, while condition and value are Jinja templates.

You can also optionally specify a fallback_value to be used if evaluating value throws an error. We recommend doing this when passing dynamic inputs to functions in value (for example, setting the Faker locale based on the contents of another column), preferably with a simple template (e.g. using static parameter values) for fallback_value to avoid further errors. If both value and fallback_value fail to evaluate, the value is set to the error message to aid with debugging.
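The "kitchen sink" config above illustrates this pattern: the locale passed to fake is derived dynamically from another column, with a simple static template as the fallback (the Country and Phone column names come from that example):

```yaml
rows:
  update:
    - name: Phone
      # Dynamic input: the locale depends on the Country column and may fail to resolve
      value: fake(row.Country | lookup_locales).phone_number()
      # Simple static fallback used if the dynamic locale lookup throws an error
      fallback_value: fake.phone_number()
```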

condition, value, and fallback_value in row update rules have access to the row drop Jinja environment including vars, row, and index, as well as a few additional objects:

  • column: Dictionary referring to the current column whose value is being changed. Properties include name and entity.

  • this: Literal referring to the current value being changed. For example, value: this is a no-op which leaves the current value unchanged, while value: this | sha256 replaces the current value with its SHA-256 hash.

Here's how the update operation works with examples:

Setting a static value

The rule below sets the value of the column named status_column to the string processed for all rows.

rows:
  update:
    - name: status_column
      value: '"processed"'

Incrementing an index

In the example below, we use the index special variable to set the value of the row_index column to the index of the record in the dataset. For example, for a dataset containing 100 rows, the value of row_index for the last row will be 99.

rows:
  update:
    - name: row_index
      value: index

Generating fake PII

You can use the built-in Faker implementation to generate fake entities. See Faker's documentation for a list of supported entities and parameters.

The example below replaces values in all columns detected to contain email addresses with fake email addresses. Notice that unlike previous examples where the update rule was conditioned on name (the name of a column), the rule below is conditioned on entity (the type of entity contained within a column), which may match multiple columns. For example, if the dataset contains personal_email and work_email columns, the rule below will replace the contents of both with fake email addresses.

rows:
  update:
    - entity: email
      value: fake.email()

Modifying based on a condition

You can also conditionally update rows using flexible Jinja conditions. These conditions may match any number of columns and any number of rows (unlike name and entity conditions, which apply to all rows).

For example, you can set the value of the flag_for_review column to true for all rows where the value of the amount column is greater than 1,000:

rows:
  update:
    - condition: column.name == "flag_for_review" and row.amount > 1000
      value: true

Classification

Transform v2 incorporates a classification feature to detect personally identifiable information (PII) within data. This feature simplifies selecting and transforming specific types of PII by tagging each column with its appropriate entity, if any.

PII Detection

The classification model is capable of recognizing a variety of pre-defined and custom entities. While you can use arbitrary strings as entity names, it is beneficial to align with Faker entities if you plan to pass entity names to the fake filter in order to generate fake values of the same entity.

For example, to detect and replace phone numbers, email addresses, and International Bank Account Numbers (IBAN), include phone_number, email, and iban in the list of entities under globals.classify.entities. These map directly to Faker's phone_number(), email(), and iban() methods.

Here is an example configuration that uses classification to detect these three entities and apply transformations:

globals:
  classify:
    enable: true
    entities:
      - phone_number
      - email
      - iban
steps:
  - rows:
      update:
        - entity: phone_number
          value: fake.phone_number()
        - entity: email
          value: fake.email()
        - entity: iban
          value: fake.iban()

Since these align with Faker built-in entities, we could also write a single rule that applies to all detected entities:

globals:
  classify:
    enable: true
    entities:
      - phone_number
      - email
      - iban
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake

With this setting, Transform v2 will first classify entities in the dataset, then replace detected entities with Faker-generated values for each row in the dataset.

If your list of entities contains custom entities not supported by Faker, you can leverage fallback_value to apply other transformations. For example, the policy below attempts to fake all entities, and falls back to hashing unsupported entities. Since iban is supported by Faker while employee_id is not, the output of this policy will be fake IBAN values in the IBAN column, and hashes of the actual employee IDs in the employee ID column.

globals:
  classify:
    enable: true
    entities:
      - employee_id
      - iban
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
          fallback_value: this | hash

If instead you wish to replace unsupported entities with the entity name in angle brackets, you could set fallback_value: "<" + column.entity + ">". You could also generate custom fake values; for example, to replace all entities not supported by Faker with the letter "E" followed by a random 6-digit number, you could set fallback_value: "E" + fake.pyint(100000, 999999) | string, or use Jinja's concatenation operator ~, which automatically converts integers to strings: fallback_value: "E" ~ fake.pyint(100000, 999999).
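Putting that together, a sketch combining classification with a custom fallback (quoting the fallback template so the YAML parses cleanly) might look like:

```yaml
globals:
  classify:
    enable: true
    entities:
      - employee_id
      - iban
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
          # Entities Faker cannot generate become "E" plus a random 6-digit number
          fallback_value: '"E" ~ fake.pyint(100000, 999999)'
```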

Jinja environment

Objects

Every Jinja environment in Transform v2 can access the objects below:

  • fake: Instantiation of Faker which defaults to the locale and seed specified in the globals section. You can override these defaults by passing parameters, such as fake(locale="it_IT", seed=42), which will generate data using the Italian locale and 42 as the consistency seed.

  • random: Python's random library. For example, you can call random.randint(1, 10) to generate a random integer between 1 and 10.
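As a sketch, both objects can be used directly in row update rules (the column names below are hypothetical):

```yaml
rows:
  update:
    - name: city
      # Override the default locale and seed for this value only
      value: fake(locale="it_IT", seed=42).city()
    - name: lucky_number
      # Python's random library is available in every Jinja environment
      value: random.randint(1, 10)
```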

Filters

Variables can be modified by filters. Filters are separated from the variable by a pipe symbol (|) and may have optional arguments in parentheses. Multiple filters can be chained: the output of one filter is applied to the next. Transform v2 can use any of Jinja's built-in filters, and extends them with the following Gretel-specific filters:

  • hash: Computes the SHA-256 hash of a value. For example, this | hash returns a hash of the value in the matched column in a row update rule.

  • isna: Returns true if a value is null or missing.

  • fake: Invokes the Faker library to generate fake data of the entity that's passed to the filter. This is useful when the entity name is dynamic, e.g. column.type | fake is equivalent to fake.first_name() if column.type is equal to "first_name".

  • lookup_country: Attempts to map a country name to its corresponding pycountry Country.

  • lookup_locales: Maps a pycountry Country to a list of Faker locales for that country. For example "Canada" | lookup_country | lookup_locales returns ["en_CA", "fr_CA"].

  • normalize: Removes special characters and converts Unicode strings to an ASCII representation.

  • tld: Maps a pycountry Country object to its corresponding top-level domain. For example, "France" | lookup_country | tld evaluates to .fr.

  • date_parse: Takes a string value and parses it into a Python date object. Date formats are those supported by Python's dateutil.parser.parse method.

  • date_shift: Takes a date, either as a string or a date object, and randomly shifts it within an interval around the date. For example 2023-01-01 | date_shift('-5y', '+5y') will result in a date object between 2018-01-01 and 2028-01-01. Supports the same interval formats as Python's faker.providers.date_time.date_between.
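For instance, the two date filters can be chained to anonymize a date column; a sketch (the signup_date column name is hypothetical) might look like:

```yaml
rows:
  update:
    - name: signup_date
      # Parse the string into a date object, then shift it randomly within ±5 years
      value: this | date_parse | date_shift('-5y', '+5y')
```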
