Transform

Gretel Transform combines data classification with data transformation to easily detect and anonymize or mutate sensitive data.

Gretel Transform offers custom transformation logic, an expanded library of detectable and fakeable entities, and PII and custom entity detections.

What can I do with Transform?

Gretel Transform is a general-purpose programmatic dataset editing tool. Most commonly, Gretel customers use it to:

  • De-identify datasets, for example by detecting Personally Identifiable Information (PII) and replacing it with fake PII of the same type.

  • Pre-process datasets before using them to train a synthetic data model, for example by removing low-quality records (such as records containing too many blank values) or dropping columns containing UUIDs or hashes, which are not useful to synthetic data models since they contain no discernible correlations or distributions to learn.

  • Post-process synthetic data generated from a synthetic data model, for example to validate that the generated records respect business-specific rules, and drop or fix any records that don't.

If your data contains any sensitive PII, we recommend running Transform prior to Synthetics when using Gretel Safe Synthetics.
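As an illustration of the pre-processing use case, here is a minimal pure-Python sketch of the kind of row filtering Transform automates. The records, field names, and the 50% blank threshold are all hypothetical, not part of Gretel's API:

```python
# Hypothetical raw records of the kind you might clean up
# before training a synthetic data model.
records = [
    {"user_id": "u-1001", "email": "a@example.com", "age": 34},
    {"user_id": None,     "email": "b@example.com", "age": 29},   # missing key field
    {"user_id": "u-1003", "email": None,            "age": None}, # mostly blank
]

def too_blank(row, max_blank_ratio=0.5):
    """True when more than max_blank_ratio of the row's values are blank."""
    blanks = sum(1 for v in row.values() if v is None)
    return blanks / len(row) > max_blank_ratio

# Keep only rows that have a user_id and are not mostly blank.
cleaned = [r for r in records if r["user_id"] is not None and not too_blank(r)]
```

With Transform, the equivalent logic is expressed declaratively in the config rather than written by hand.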

Anatomy of a Transform step configuration

You can configure Transform using YAML. Transform configurations consist of two sections:

  • globals, which contains default parameter values (such as the locale and seed used to generate fake values) and user-defined variables applicable throughout the config.

  • steps, which lists transformation steps applied sequentially. Transformation steps can define variables (vars) and manipulate columns (add, drop, and rename) and rows (drop and update). In practice, most Transform configs contain a single step, but multiple steps are useful when, for example, the value of column B depends on the original (non-transformed) value of column A, yet column A must itself eventually be transformed. In that case, the first step sets the new value of column B while leaving column A unchanged, and the second step then sets the new value of column A.
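The two-step ordering described above can be sketched in plain Python. The ssn column and the hashing logic are hypothetical stand-ins, not part of Gretel's API; the point is only the order of operations:

```python
import hashlib

# Hypothetical dataset with a sensitive column.
rows = [{"ssn": "123-45-6789"}, {"ssn": "987-65-4321"}]

# Step 1: derive a new column from the ORIGINAL value, leaving it untouched.
for row in rows:
    row["ssn_hash"] = hashlib.sha256(row["ssn"].encode()).hexdigest()[:8]

# Step 2: only now overwrite the original column.
for row in rows:
    row["ssn"] = "<redacted>"
```

If both operations lived in a single step, the derived column could not be guaranteed to see the pre-transformation value.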

Below is our default config which shows this config structure in action:

schema_version: "1.0"
name: example
task:
  name: transform
  config:
    globals:
      locales: [en_CA, fr_CA]
    steps:
      - columns:
          add:
            - name: row_index   
        rows:
          drop:
            - condition: row.user_id is none
          update:
            - name: row_index
              value: index
            - type: phone_number
              value: fake.phone_number()
      - columns:
          drop:
            - name: user_id
          rename:
            - name: phone_number_1
              value: cell_phone
            - name: phone_number_2
              value: home_phone    

The config above:

  1. Sets the default locale for fake values to Canada (English) and Canada (French). When multiple locales are provided, a random one is chosen from the list for each fake value.

  2. Adds a new column named row_index, initially containing only blank values.

  3. Drops invalid rows, defined here as rows containing blank user_id values. condition is a Jinja template expression, which allows for custom validation logic.

  4. Sets the value of the new row_index column to the index of the record in the original dataset (this can be helpful when you need to "reverse" transformations or maintain a mapping between the original and transformed values).

  5. Replaces all values in columns detected as containing phone numbers (including phone_number_1 and phone_number_2) with fake phone numbers with Canadian area codes, since the default locales are set to en_CA and fr_CA in the globals section. fake is a Faker object supporting all standard Faker providers.

  6. Drops the sensitive user_id column. Note that this happens in the second step, since the column is needed in the first step to drop invalid rows.

  7. Renames the phone_number_1 and phone_number_2 columns to cell_phone and home_phone, respectively.
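Conditions like the one in the config's rows.drop step are Jinja template expressions. A standalone sketch using the jinja2 library directly (not Gretel's API) shows how such an expression evaluates against a row:

```python
from jinja2 import Environment

# Compile the same condition used in the config's rows.drop step.
expr = Environment().compile_expression("row.user_id is none")

# Jinja attribute access falls back to dict lookup, so a plain dict works here.
blank_row = {"user_id": None}
valid_row = {"user_id": "u-1001"}

print(expr(row=blank_row))  # a row with a blank user_id matches the condition
print(expr(row=valid_row))
```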

Getting started with Transform

To get started with building your own Transform config for de-identification or pre/post processing datasets, see the Examples page for starter configs for several use cases, and the Reference page for the full list of supported transformation steps, template expression syntax, and detectable entities.

