Examples
Below are a few complete sample configs to help you quickly get started with some of the most common Transform v2 use cases.
PII redaction
Replace detected entities with fake entities of the same type
Fall back to hashing for entities that Faker does not support. If you don't require NER, remove the last rule (`type: text -> fake_entities`) to run this config more than 10x faster on datasets that contain free-text columns.
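To make this concrete, a config of this shape might look like the sketch below. The rule structure (`condition`, `value`, `fallback_value`) and the `fake`, `hash`, and `fake_entities` filters follow the general pattern of Transform v2 configs, but treat the exact field names as illustrative and check them against the Transform v2 reference:

```yaml
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true  # enable entity detection on columns
      steps:
        - rows:
            update:
              # Replace each detected entity with a fake value of the same
              # type, falling back to a hash when Faker has no matching
              # provider for that entity type.
              - condition: column.entity is not none
                value: column.entity | fake
                fallback_value: this | hash
              # Run NER over free-text columns and replace detected
              # entities in place. Remove this rule if you don't need NER;
              # it is by far the slowest part of the config.
              - type: text
                value: this | fake_entities
```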
Replace names with fake names and hash all other detected entities
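A sketch of this variant, assuming the classifier emits entity labels such as `first_name` and `last_name` (illustrative syntax, not an authoritative listing):

```yaml
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
      steps:
        - rows:
            update:
              # Replace detected name entities with fake names of the
              # same type...
              - entity: [first_name, last_name]
                value: column.entity | fake
              # ...and hash every other detected entity.
              - condition: column.entity is not none
                value: this | hash
```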
Exclude the primary key
If you need to preserve certain ID columns for auditability or to maintain relationships between tables, you can explicitly exclude these columns from any transformation rules.
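For example, a rule of this shape could hash every detected entity while leaving the primary key untouched. The column name `customer_id` is a hypothetical placeholder; substitute your own key column:

```yaml
steps:
  - rows:
      update:
        # Hash all detected entities EXCEPT the primary key column,
        # which is preserved for auditability and cross-table joins.
        # "customer_id" is an illustrative column name.
        - condition: column.entity is not none and column.name != "customer_id"
          value: this | hash
```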
Replace regular expressions with fake values
You can use Python's built-in `re` library for regex operations. Below we go a step further: we list every regular expression we want to replace, along with its Faker function mapping, in the `regex_to_faker` variable, then iterate through the mappings to replace all occurrences in all free-text columns.
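The sketch below shows the general shape of such a config. The regex patterns and Faker function names are examples, and the dynamic `fake[faker_fn]()` lookup is an assumption about how the `fake` object can be indexed; the Jinja `namespace` object is used so the accumulated substitutions survive across loop iterations:

```yaml
steps:
  - vars:
      # Map each regex we want to scrub to the Faker function that
      # generates a plausible replacement. Patterns are illustrative.
      regex_to_faker:
        '\b\d{3}-\d{2}-\d{4}\b': ssn            # US Social Security numbers
        '\b\d{3}-\d{3}-\d{4}\b': phone_number   # US phone numbers
    rows:
      update:
        # Apply every regex-to-Faker substitution to each free-text column.
        - type: text
          value: >
            {% set ns = namespace(text=this) %}
            {% for pattern, faker_fn in vars.regex_to_faker.items() %}
              {% set ns.text = re.sub(pattern, fake[faker_fn](), ns.text) %}
            {% endfor %}
            {{ ns.text }}
```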
Post-processing
Transform v2 can be used to post-process synthetic data to increase accuracy, for example by dropping invalid rows according to custom business logic, or by ensuring calculated field values are accurate.
Calculated columns
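As an illustration, a post-processing step might recompute a derived field so it always agrees with its inputs. The column names below (`total_price`, `unit_price`, `quantity`) are hypothetical, and the `name`/`value` rule shape is a sketch of the Transform v2 style rather than a verbatim reference example:

```yaml
steps:
  - rows:
      update:
        # Recompute the calculated field so it is always consistent
        # with its inputs. Column names are illustrative.
        - name: total_price
          value: row.unit_price * row.quantity
```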
Drop records not meeting business logic
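A minimal sketch of a row-drop rule, assuming a `rows.drop` section with a `condition` expression (the column names and business rule are illustrative):

```yaml
steps:
  - rows:
      drop:
        # Drop rows that violate business logic, e.g. an end date
        # earlier than the start date. Column names are illustrative.
        - condition: row.end_date < row.start_date
```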
Data cleaning
We published a guide with best practices for cleaning and pre-processing real-world data, which can help train better synthetic data models. The config below automates several steps from that guide, and can be chained in a Workflow to run ahead of synthetic model training.
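One such cleaning step is normalizing the various ad-hoc markers people use for missing data into true nulls before training. The sketch below shows the idea; the marker list is illustrative, and the rule syntax is an assumption in the Transform v2 style rather than a verbatim excerpt:

```yaml
steps:
  - rows:
      update:
        # Normalize common empty-value markers to true missing values
        # so the synthetic model learns real nulls, not placeholder
        # strings. The marker list is illustrative.
        - condition: this in ["N/A", "n/a", "?", "none", ""]
          value: none
```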