Examples

You can find the most popular templates in this folder, including:

  1. Default - Gretel's default configuration, which includes the identifiers that span common privacy policies such as HIPAA and GDPR

  2. HIPAA - Redacts and replaces true identifiers using the HIPAA Safe Harbor Method

  3. GDPR - Redacts and replaces true identifiers based on the GDPR

  4. NER Only - Only applies redaction and replacement to free text columns; the recommended option when chaining with a Synthetics model such as Gretel Text Fine-Tuning

Below are a few additional configs to help you quickly get started with other common Transform use cases.

PII redaction

Replace detected entities with fake entities of the same type

This config falls back to hashing for entities that Faker does not support. If you don't require NER, remove the last rule (type: text -> fake_entities); assuming your dataset contains free text columns, this can make the config run more than 10x faster. A variant without that rule is sketched after the config below.

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    globals:
      classify:
        enable: true
        entities:
          - first_name
          - last_name
          - email
          - phone_number
          - address
          - ssn
          - ip_address
        num_samples: 3
    steps:
      - rows:
          update:
            - condition: column.entity is not none
              value: column.entity | fake
              fallback_value: this | hash | truncate(9,true,"")
            - type: text
              value: this | fake_entities(on_error="hash")
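
If you don't require the NER pass over free text, the same config with the last rule removed (as the note above suggests) might look like this:

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    globals:
      classify:
        enable: true
        entities:
          - first_name
          - last_name
          - email
          - phone_number
          - address
          - ssn
          - ip_address
        num_samples: 3
    steps:
      - rows:
          update:
            # Only classified columns are transformed; free text is left untouched
            - condition: column.entity is not none
              value: column.entity | fake
              fallback_value: this | hash | truncate(9,true,"")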

Replace names with fake names and hash all other detected entities

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    globals:
      classify:
        enable: true
    steps:
      - vars:
          entities_to_fake: [first_name, last_name]
        rows:
          update:
            - condition: column.entity is in vars.entities_to_fake
              value: column.entity | fake
            - condition: column.entity is not none and column.entity not in vars.entities_to_fake
              value: this | hash

Exclude the primary key

If you need to preserve certain ID columns for auditability or to maintain relationships between tables, you can explicitly exclude these columns from any transformation rules.

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    globals:
      classify:
        enable: true
    steps:
      - rows:
          update:
            - condition: column.entity is not none and column.name != "id"
              value: column.entity | fake
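
If you need to preserve several key columns, the same pattern extends with a not in check; a minimal sketch, where "id" and "order_id" are hypothetical column names:

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    globals:
      classify:
        enable: true
    steps:
      - rows:
          update:
            # "id" and "order_id" stand in for the key columns you want to keep intact
            - condition: column.entity is not none and column.name not in ["id", "order_id"]
              value: column.entity | fake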

Replace regular expressions with fake values

You can use Python's built-in re library for regex operations. Below, we go a step further: we list each regular expression we want to replace, along with its Faker function mapping, in the regex_to_faker variable, then iterate through them to replace all occurrences in all free text columns.

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - vars:
          regex_to_faker:
            '[\+(\d][\+()\d\s-]{5,}[)\d]': phone_number
            '[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}': email
        rows:
          update:
            - type: text
              foreach: vars.regex_to_faker
              value: re.sub(item, vars.regex_to_faker[item] | fake, this)
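
If you only need to scrub one known column rather than every free text column, you can scope the rule by name; a minimal sketch, where "notes" is a hypothetical column name:

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - rows:
          update:
            # "notes" is a hypothetical free text column; this replaces email addresses only
            - name: notes
              value: re.sub('[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}', fake.email(), this)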

Post-processing

Transform can be used to post-process synthetic data to increase accuracy, for example by dropping invalid rows according to custom business logic, or by ensuring calculated field values are accurate.

Calculated columns

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - columns:
          add:
            - name: subtotal
        rows:
          update:
            - name: subtotal
              value: row.unit_price * row.quantity
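
The same pattern supports multiple derived columns in one step; a sketch that also adds a hypothetical total column, assuming an 8% tax multiplier:

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - columns:
          add:
            - name: subtotal
            - name: total
        rows:
          update:
            - name: subtotal
              value: row.unit_price * row.quantity
            # Recomputed from the source columns; 1.08 is an assumed tax multiplier
            - name: total
              value: row.unit_price * row.quantity * 1.08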

Drop records not meeting business logic

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - rows:
          drop:
            - condition: row.quantity < 0
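
Drop conditions can combine multiple checks with boolean operators; a sketch that also drops rows missing a price, using the isna filter shown in the default configuration:

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - rows:
          drop:
            # Drop negative quantities as well as rows with no unit_price value
            - condition: row.quantity < 0 or (row.unit_price | isna)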

Data cleaning

We published a guide containing best practices for cleaning and pre-processing real-world data, which can help you train better synthetic data models. The config below automates several steps from this guide, and can be chained in a Workflow to run ahead of synthetic model training.

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - vars:
          duplicated: data.duplicated()
        rows:
          drop:
            # Remove duplicate records
            - condition: vars.duplicated[index]
          update:
            # Standardize empty values
            - condition: this | lower in ["?", "missing", "n/a", "not applicable"]
              value: none
            # Cap high float precision
            - condition: column.type == "float"
              value: this | round(2)
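
Another common cleaning step is dropping rows where a required field is missing entirely; a minimal sketch, where customer_id is a hypothetical required column:

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    steps:
      - rows:
          drop:
            # Drop any row missing the required identifier
            - condition: row.customer_id | isna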

Writing your own Transform configuration

Below is a template to help you get started writing your own Transform config. It includes common examples, the complete list of Supported Entities, and helper text to guide you as you write your own configuration.

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    globals:
      classify:
        enable: true
        entities:
          # The model has been fine-tuned on the entities
          # listed below, but you can include any arbitrary
          # value and the model will attempt to find it.
          # See here for definitions of each entity:
          # https://docs.gretel.ai/create-synthetic-data/models/transform/v2/supported-entities

          # If you want to fake an entity,
          # it must be included in Faker:
          # https://faker.readthedocs.io/en/master/providers.html

          # You generally want to keep the entity list
          # to a minimum, only including entities that you
          # need to transform, in order to avoid the model getting
          # confused about which entity type a column may be.
          # Comment entities in or out based on what exists
          # in your dataset.

          # If the names are combined into a single column
          # for full name in your dataset, use the name entity
          # instead of first_name and last_name.
          - first_name
          - last_name
          # - name

          # If the address is in a single column rather than
          # separated out into street address, city, state, etc.,
          # use only address as the entity instead,
          # and comment the others out.
          - street_address
          - city
          - administrative_unit  # Faker's term for state or province
          - country
          - postcode
          # - address

          # Other common entities
          - gender
          - email
          - phone_number
          - credit_card_number
          - ssn

          # Entities that the model has been fine-tuned on,
          # but are less common. Hence they have been commented
          # out by default.
          # - account_number
          # - api_key
          # - bank_routing_number
          # - biometric_identifier
          # - certificate_license_number
          # - company_name
          # - coordinate
          # - customer_id
          # - cvv
          # - date
          # - date_of_birth
          # - date_time
          # - device_identifier
          # - employee_id
          # - health_plan_beneficiary_number
          # - ipv4
          # - ipv6
          # - license_plate
          # - medical_record_number
          # - national_id
          # - password
          # - pin
          # - state
          # - swift_bic
          # - unique_identifier
          # - tax_id
          # - time
          # - url
          # - user_name
          # - vehicle_identifier

      ner:
        # You can think of the NER threshold as the level of
        # confidence required in the model's detection before
        # labeling an entity. Increasing the NER threshold
        # decreases the number of detected entities, while
        # decreasing the NER threshold increases the number
        # of detected entities.
        ner_threshold: 0.3

      # You can add additional locales to the list by separating
      # via commas, such as locales: [en_US, en_CA]
      locales: [en_US]
    steps:
      - rows:
          update:
            # For each column in the dataset you want to fake,
            # follow this format:
            # - name: <column_name>
            #   value: fake.<entity_type>()
            - name: address
              value: fake.street_address()
            - name: city
              value: fake.city()
            - name: state
              value: fake.administrative_unit()
            - name: postcode
              value: fake.postcode()

            # Names can be faked the same way:
            - name: fname
              value: fake.first_name()
            - name: lname
              value: fake.last_name()
            # - name: fullname
            #   value: fake.name()

            # You may want names to be based on a gender column instead.
            # Update the name of the gender column (e.g., "gender").
            # Update the values in the gender column (e.g., "male", "female").
            # - name: fname
            #   value: fake.first_name_male() if row["gender"] == 'male' else fake.first_name_female() if row["gender"] == 'female' else fake.first_name()
            # - name: lname
            #   value: fake.last_name_male() if row["gender"] == 'male' else fake.last_name_female() if row["gender"] == 'female' else fake.last_name()
            # Or, for full name:
            # - name: name
            #   value: fake.name_male() if row["gender"] == 'male' else fake.name_female() if row["gender"] == 'female' else fake.name()

            # You may have values based on others values in the
            # dataset, such as email.
            # Ensure steps for dependent values (e.g. email)
            # are performed after steps that fake dependent values
            # (e.g. first_name and last_name).
            # For example, if I want email to be based on first
            # and last name, I need to have faked those already.

            # The below syntax generates an email of the form
            # <lowercase_first_letter_of_first_name><lowercase_last_name><number between 0 and 9>@<freedomain>
            # As an example, it could be "kjohnson7@gmail.com" for someone with a faked name of Kara Johnson
            # Be sure to update the column names with your column names,
            # rather than "fname" and "lname"
            - name: email
              value: row["fname"][0].lower() + row["lname"].lower() + (random.randint(0, 9) | string) + "@" + fake.free_email_domain()

            # This section of the Faker documentation has a list
            # of various options for domains or full emails:
            # https://faker.readthedocs.io/en/master/providers/faker.providers.internet.html
            # Here are some examples:
            # value: fake.email() # Note that this will not be based on first or last name columns, it is random.
            # value: fake.company_email() # Note that this will not be based on first or last name columns, it is random.
            # value: row["fname"] + "." + row["lname"] + "@" + fake.domainname()
            # value: row["fname"] + "." + row["lname"] + "@" + fake.domainword() + ".com"
            # The next example generates a fake company name, removes punctuation,
            # and converts to lowercase for the names and domain.
            # value: row["fname"].lower() + "." + row["lname"].lower() + "@" + fake.company().replace(" ", "").replace(",","").replace("-","").lower() + ".org"

            # By default, Faker does not standardize telephone formats.
            # This example generates a format like "123-456-7890".
            - condition: column.entity == "phone_number"
              value: (random.randint(100, 999) | string) + "-" + (random.randint(100, 999) | string) + "-" + (random.randint(1000, 9999) | string)
            # The next example generates a format like "(123)456-7890"
            # - condition: column.entity == "phone_number"
            #   value: "(" + (random.randint(100, 999) | string) + ")" + (random.randint(100, 999) | string) + "-" + (random.randint(1000, 9999) | string)

            # The next section targets text columns not classified as a single entity and runs NER.
            # It fakes any entities from the list on globals.classify.entities.
            # Comment this out if you don't want to fake entities in free-text columns.
            - condition: column.entity is none and column.type == "text"
              value: this | fake_entities

Default configuration

Gretel's default configuration covers the identifiers included in common privacy policies, such as HIPAA and GDPR.

schema_version: "1.0"
name: default
task:
  name: transform
  config:
      globals:
        classify:
          enable: true
          entities:
            # True identifiers
            - first_name
            - last_name
            - name
            - street_address
            - city
            - state
            - postcode
            - country
            - address
            - latitude
            - longitude
            - coordinate
            - age
            - phone_number
            - fax_number
            - email
            - ssn
            - unique_identifier
            - medical_record_number
            - health_plan_beneficiary_number
            - account_number
            - certificate_license_number
            - vehicle_identifier
            - license_plate
            - device_identifier
            - biometric_identifier
            - url
            - ipv4
            - ipv6
            - national_id
            - tax_id
            - bank_routing_number
            - swift_bic
            - credit_debit_card
            - cvv
            - pin
            - employee_id
            - api_key
            - customer_id
            - user_name
            - password
            - mac_address
            - http_cookie

            # Quasi identifiers
            - date
            - date_time
            - blood_type
            - gender
            - sexuality
            - political_view
            - race
            - ethnicity
            - religious_belief
            - language
            - education
            - job_title
            - employment_status
            - company_name
        ner:
          ner_threshold: 0.3
        locales: [en_US]
      steps:
        - vars:
            row_seed: random.random()
          rows:
            update:
              - condition: column.entity == "first_name" and not (this | isna)
                value: fake.persona(row_index=vars.row_seed + index).first_name
              - condition: column.entity == "last_name" and not (this | isna)
                value: fake.persona(row_index=vars.row_seed + index).last_name
              - condition: column.entity == "name" and not (this | isna)
                value: column.entity | fake
              - condition: (column.entity == "street_address" or column.entity == "city" or column.entity == "state" or column.entity == "postcode" or column.entity == "address") and not (this | isna)
                value: column.entity | fake
              - condition: column.entity == "latitude" and not (this | isna)
                value: fake.location_on_land()[0]
              - condition: column.entity == "longitude" and not (this | isna)
                value: fake.location_on_land()[1]
              - condition: column.entity == "coordinate" and not (this | isna)
                value: fake.location_on_land()
              - condition: column.entity == "email" and not (this | isna)
                value: fake.persona(row_index=vars.row_seed + index).email
              - condition: column.entity == "ssn" and not (this | isna)
                value: column.entity | fake
              - condition: column.entity == "phone_number" and not (this | isna)
                value: (fake.random_number(digits=3) | string) + "-" + (fake.random_number(digits=3) | string) + "-" + (fake.random_number(digits=4) | string)
              - condition: column.entity == "fax_number" and not (this | isna)
                value: (fake.random_number(digits=3) | string) + "-" + (fake.random_number(digits=3) | string) + "-" + (fake.random_number(digits=4) | string)
              - condition: column.entity == "vehicle_identifier" and not (this | isna)
                value: fake.vin()
              - condition: column.entity == "license_plate" and not (this | isna)
                value: column.entity | fake
              - condition: (column.entity == "unique_identifier" or column.entity == "medical_record_number" or column.entity == "health_plan_beneficiary_number" or column.entity == "account_number" or column.entity == "certificate_license_number" or column.entity == "device_identifier" or column.entity == "biometric_identifier" or column.entity == "bank_routing_number" or column.entity == "swift_bic" or column.entity == "employee_id" or column.entity == "api_key" or column.entity == "customer_id" or column.entity == "user_name" or column.entity == "password" or column.entity == "http_cookie") and not (this | isna)
                value: fake.bothify(re.sub("\\d", "#", re.sub("[A-Z]", "?", (this | string))))
              - condition: (column.entity == "url" or column.entity == "ipv4" or column.entity == "ipv6") and not (this | isna)
                value: column.entity | fake
              - condition: (column.entity == "national_id" or column.entity == "tax_id") and not (this | isna)
                value: fake.itin()
              - condition: column.entity == "credit_debit_card" and not (this | isna)
                value: fake.credit_card_number()
              - condition: column.entity == "cvv" and not (this | isna)
                value: fake.credit_card_security_code()
              - condition: column.entity == "pin" and not (this | isna)
                value: fake.random_number(digits=4) | string
              - condition: column.entity == "coordinate" and not (this | isna)
                value: column.entity | fake
              - condition: column.entity == "mac_address" and not (this | isna)
                value: column.entity | fake

              - condition: column.entity is none and column.type == "text"
                value: this | fake_entities
