Examples

Below are a few complete sample configs to help you quickly get started with some of the most common Transform v2 use cases.

PII redaction

Replace detected entities with fake entities of the same type

Fall back on hashing for entities not supported by Faker. If you don't require NER, remove the last rule (type: text -> fake_entities). Assuming your dataset contains free text columns, the config then runs more than 10x faster.

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities:
            - first_name
            - last_name
            - email
            - phone_number
            - address
            - ssn
            - ip_address
          num_samples: 3
      steps:
        - rows:
            update:
              - condition: column.entity is not none
                value: column.entity | fake
                fallback_value: this | hash | truncate(9,true,"")
              - type: text
                value: this | fake_entities(on_error="hash")
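
To make the fallback rule above concrete, here is a minimal Python sketch of what `this | hash | truncate(9,true,"")` is meant to produce: a short, deterministic token derived from the original value. It rests on two assumptions that the config itself does not confirm: that SHA-256 is a reasonable stand-in for the `hash` filter, and that `truncate` behaves like Jinja's built-in filter of the same name (keep the first 9 characters, cut mid-word, append no suffix). The function name `hash_and_truncate` is hypothetical.

```python
import hashlib


def hash_and_truncate(value, length=9):
    """Return the first `length` hex characters of a hash of `value`."""
    # Assumption: SHA-256 stands in for whatever the `hash` filter uses.
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    # Mirrors truncate(9, true, ""): hard cut at 9 characters, no suffix.
    return digest[:length]


token = hash_and_truncate("123-45-6789")
```

The same input always maps to the same token, so repeated values stay linkable after redaction, while the original value no longer appears in the output.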

Replace names with fake names and hash all other detected entities

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
      steps:
        - vars:
            entities_to_fake: [first_name, last_name]
          rows:
            update:
              - condition: column.entity is in vars.entities_to_fake
                value: column.entity | fake
              - condition: column.entity is not none and column.entity not in vars.entities_to_fake
                value: this | hash

Exclude the primary key

If you need to preserve certain ID columns for auditability or to maintain relationships between tables, you can explicitly exclude these columns from any transformation rules.

schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
      steps:
        - rows:
            update:
              - condition: column.entity is not none and column.name != "id"
                value: column.entity | fake

Replace regular expressions with fake values

You can use Python's built-in re library for regex operations. The config below goes a step further: the regex_to_faker variable maps each regular expression to the Faker function that should replace it, and the rule iterates through that mapping to replace every occurrence in all free text columns.

schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - vars:
            regex_to_faker:
              '[\+(\d][\+()\d\s-]{5,}[)\d]': phone_number
              '[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}': email
          rows:
            update:
              - type: text
                foreach: vars.regex_to_faker
                value: re.sub(item, vars.regex_to_faker[item] | fake, this)
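
The loop in the config above can be sketched in plain Python to show what the two patterns match. This is an illustration only, not the Transform v2 implementation: `placeholder` is a hypothetical stand-in for a Faker call and returns a fixed marker, so the result is predictable.

```python
import re

# Same patterns as the regex_to_faker variable in the config above.
REGEX_TO_LABEL = {
    r'[\+(\d][\+()\d\s-]{5,}[)\d]': "phone_number",
    r'[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}': "email",
}


def placeholder(label):
    # Hypothetical stand-in for a Faker call; returns a fixed marker.
    return f"<{label}>"


def replace_patterns(text):
    """Apply every pattern in turn, as the foreach rule does."""
    for pattern, label in REGEX_TO_LABEL.items():
        replacement = placeholder(label)
        # A function replacement keeps any backslashes in the value literal.
        text = re.sub(pattern, lambda _match: replacement, text)
    return text


sample = "Call +1 (555) 123-4567 or email jane.doe@example.com today"
redacted = replace_patterns(sample)
print(redacted)
```

Running each pattern against a few representative strings like this is a quick way to check for over-matching before applying the config to a full dataset.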

Post-processing

Transform v2 can be used to post-process synthetic data to increase accuracy, for example by dropping invalid rows according to custom business logic, or by ensuring calculated field values are accurate.

Calculated columns

schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - columns:
            add:
              - name: subtotal
          rows:
            update:
              - name: subtotal
                value: row.unit_price * row.quantity

Drop records not meeting business logic

schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - rows:
            drop:
              - condition: row.quantity < 0
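
For readers who want to check the expected result, the two post-processing configs above can be approximated in plain Python. This sketch combines them under one assumption: that each row is a dictionary with `unit_price` and `quantity` keys. The sample data and the `post_process` name are made up for illustration.

```python
# Made-up sample rows for illustration.
orders = [
    {"unit_price": 2.5, "quantity": 4},
    {"unit_price": 10.0, "quantity": -1},
    {"unit_price": 3.0, "quantity": 0},
]


def post_process(rows):
    """Drop rows with a negative quantity and add a subtotal column."""
    kept = []
    for row in rows:
        # Same condition as the drop rule: row.quantity < 0
        if row["quantity"] < 0:
            continue
        # Same expression as the update rule: unit_price * quantity
        kept.append({**row, "subtotal": row["unit_price"] * row["quantity"]})
    return kept


cleaned = post_process(orders)
```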

Data cleaning

We published a guide containing best practices for cleaning and pre-processing real-world data, which can help train better synthetic data models. The config below automates several steps from this guide and can be chained in a Workflow to run ahead of synthetic model training.

schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - vars:
            duplicated: data.duplicated()
          rows:
            drop:
              # Remove duplicate records
              - condition: vars.duplicated[index]
            update:
              # Standardize empty values
              - condition: this | lower in ["?", "missing", "n/a", "not applicable"]
                value: none
              # Cap high float precision
              - condition: column.type == "float"
                value: this | round(2)
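
The three cleaning steps in the config above can be sketched in plain Python as follows. This is a simplified approximation, not the Transform v2 implementation: duplicates are detected by comparing whole rows and keeping the first occurrence (the default behavior of pandas `duplicated()`), and the float check looks at each value rather than at a column-level type. The `clean` function and the sample rows are hypothetical.

```python
EMPTY_MARKERS = {"?", "missing", "n/a", "not applicable"}


def clean(rows):
    """Drop duplicate rows, standardize empty markers, round floats."""
    seen = set()
    cleaned = []
    for row in rows:
        # Remove duplicate records, keeping the first occurrence.
        key = tuple(row.items())
        if key in seen:
            continue
        seen.add(key)
        new_row = {}
        for name, value in row.items():
            if str(value).lower() in EMPTY_MARKERS:
                # Standardize empty values.
                new_row[name] = None
            elif isinstance(value, float):
                # Cap high float precision at two decimal places.
                new_row[name] = round(value, 2)
            else:
                new_row[name] = value
        cleaned.append(new_row)
    return cleaned


rows = [
    {"city": "Paris", "temp": 21.456},
    {"city": "Paris", "temp": 21.456},
    {"city": "N/A", "temp": 2.71828},
    {"city": "Lima", "temp": 18.0},
]
result = clean(rows)
```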
