Below are a few complete sample configs to help you quickly get started with some of the most common Transform v2 use cases.
PII redaction
Replace detected entities with fake entities of the same type
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          # Classification is currently performed in the Gretel Cloud. If you are
          # running in hybrid mode, you have the option to turn off classification
          # by setting "enable" to false, or you can do classification based on
          # column names only (at the cost of some accuracy loss) by setting
          # "num_samples" to 0.
          enable: true
          num_samples: 3
      steps:
        - rows:
            update:
              - condition: column.entity is not none
                value: column.entity | fake
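For hybrid deployments, the note in the config above mentions classifying on column names only. A minimal sketch of that variant follows; it assumes the rest of the config stays unchanged and only `num_samples` differs.

```yaml
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          # Per the note above, 0 restricts classification to column names only,
          # at the cost of some accuracy.
          num_samples: 0
      steps:
        - rows:
            update:
              # Same rule as before: fake every column with a detected entity.
              - condition: column.entity is not none
                value: column.entity | fake
```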
Replace names with fake names and hash all other detected entities
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
      steps:
        - vars:
            entities_to_fake: [first_name, last_name]
          rows:
            update:
              # Names are replaced with fake names
              - condition: column.entity is in vars.entities_to_fake
                value: column.entity | fake
              # Every other detected entity is hashed
              - condition: column.entity is not none and column.entity not in vars.entities_to_fake
                value: this | hash
Exclude the primary key
If you need to preserve certain ID columns for auditability or to maintain relationships between tables, you can explicitly exclude these columns from any transformation rules.
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
      steps:
        - rows:
            update:
              - condition: column.entity is not none and column.name != "id"
                value: column.entity | fake
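When more than one key column must be preserved, the same idea can be sketched with a list held in `vars`. The variable name `columns_to_skip` and the column names `id` and `customer_id` are hypothetical placeholders, not part of the original example; substitute the key columns from your own tables.

```yaml
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
      steps:
        - vars:
            # Hypothetical list of key columns to leave untouched
            columns_to_skip: [id, customer_id]
          rows:
            update:
              # Fake detected entities, except in the columns listed above
              - condition: column.entity is not none and column.name not in vars.columns_to_skip
                value: column.entity | fake
```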
Post-processing
Transform v2 can be used to post-process synthetic data to increase accuracy, for example by dropping invalid rows according to custom business logic, or by ensuring calculated field values are accurate.
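As a minimal sketch of such a post-processing step, the config below drops rows that violate a business rule and recomputes a calculated field. The column names `quantity`, `subtotal`, `tax`, and `total` are hypothetical placeholders, and the `row.<column>` accessor and the `name`-targeted update rule are assumptions about the Transform v2 syntax that should be checked against the reference before use.

```yaml
schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - rows:
            drop:
              # Hypothetical business rule: a negative quantity is invalid
              - condition: row.quantity < 0
            update:
              # Hypothetical calculated field: recompute the total from its parts
              - name: total
                value: row.subtotal + row.tax
```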
We published a guide containing best practices for cleaning and pre-processing real-world data, which can help train better synthetic data models. The config below automates several steps from this guide and can be chained in a Workflow to run ahead of synthetic model training.
schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - vars:
            duplicated: data.duplicated()
          rows:
            drop:
              # Remove duplicate records
              - condition: vars.duplicated[index]
            update:
              # Standardize empty values
              - condition: this | lower in ["?", "missing", "n/a", "not applicable"]
                value: none
              # Cap high float precision
              - condition: column.type == "float"
                value: this | round(2)