Transform v2
Transform v2 features custom transformation logic, an expanded library of detectable and fake-able entities, PII and custom entity detection, and enhanced performance.
Transform v2 is a complete rewrite of Transform v1, building on Pandas DataFrames, Jinja templates, semantic classification, and native Faker support, to offer fully flexible data transformation conditions and templates.
What can I do with Transform v2?
Transform v2 is a general-purpose programmatic dataset editing tool. Most commonly, Gretel customers use it to:
De-identify datasets, for example by detecting Personally Identifiable Information (PII) and replacing it with fake PII of the same type.
Pre-process datasets before using them to train a synthetic data model, for example by removing low-quality records (such as records containing too many blank values) or columns containing UUIDs or hashes. Such columns are not relevant for synthetic data models, since they contain no discernible correlations or distributions for the model to learn.
Post-process synthetic data generated from a synthetic data model, for example to validate that the generated records respect business-specific rules, and drop or fix any records that don't.
Anatomy of a Transform v2 configuration
As with other Gretel models, you can configure Transform v2 using YAML. Transform v2 config files consist of two sections:
globals, which contains default parameter values (such as the locale and seed used to generate fake values) and user-defined variables applicable throughout the config.
steps, which lists transformation steps applied sequentially. Transformation steps can define variables (vars), and manipulate columns (add, drop, and rename) and rows (drop and update). In practice most Transform v2 configs contain a single step, but more steps can be useful if, for example, the value of column B depends on the original (non-transformed) value of column A, but column A must also be eventually transformed. In that case the first step could set the new value of column B, leaving column A unchanged, before ultimately setting the new value of column A in the second step.
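The two-step pattern described above could be sketched as follows. The column names column_a and column_b are hypothetical, and the name and value keys, the row variable, and the use of Jinja's upper filter are assumptions made for illustration rather than confirmed syntax.

```yaml
steps:
  # Step 1: set column_b from the original value of column_a; column_a is untouched.
  - rows:
      update:
        - name: column_b
          value: row.column_a | upper   # assumed: row exposes the current record
  # Step 2: column_b is already set, so column_a can now be replaced.
  - rows:
      update:
        - name: column_a
          value: fake.first_name()      # standard Faker provider
```

Because steps run sequentially, swapping the two steps would make column_b depend on the already-faked value of column_a, which is the situation the split is meant to avoid.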
Below is an example config which shows this config structure in action:
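A minimal sketch of a config with this structure, written to match the walkthrough that follows. Only globals, steps, vars, columns (add, drop, rename), rows (drop, update), condition, and fake are named on this page. The locales, name, value, and entity keys, the isna filter, and the index variable are assumptions, and any outer wrapper the config file may need is omitted, so check the exact spelling against the Reference page.

```yaml
# Sketch only: keys and expressions marked "assumed" are not confirmed on this page.
globals:
  locales: [en_CA, fr_CA]                 # assumed key name for the default locales
steps:
  # Step 1: add row_index, drop invalid rows, fill row_index, fake phone numbers.
  - columns:
      add:
        - name: row_index                 # new column, initially blank
    rows:
      drop:
        - condition: row.user_id | isna   # Jinja expression; isna filter assumed
      update:
        - name: row_index
          value: index                    # assumed variable: original record index
        - entity: phone_number            # assumed label for detected phone columns
          value: fake.phone_number()      # standard Faker provider
  # Step 2: user_id is no longer needed, so drop it and rename the phone columns.
  - columns:
      drop:
        - name: user_id
      rename:
        - name: phone_number_1
          value: cell_phone
        - name: phone_number_2
          value: home_phone
```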
The config above:
Sets the default locale for fake values to Canada (English) and Canada (French). When multiple locales are provided, a random one is chosen from the list for each fake value.
Adds a new column named row_index initially containing only blank values.
Drops invalid rows, which we define here as rows containing blank user_id values. condition is a Jinja template expression, which allows for custom validation logic.
Sets the value of the new row_index column to the index of the record in the original dataset (this can be helpful for use cases where the ability to "reverse" transformations or maintain a mapping between the original and transformed values is important).
Replaces all values within columns detected as containing phone numbers (including phone_number_1 and phone_number_2) with fake phone numbers having area codes in Canada, since the default locale is set to en_CA and fr_CA in the globals section. fake is a Faker object supporting all standard Faker providers.
Drops the sensitive user_id column. Note that this is done in the second step, since that column is needed in the first step to drop invalid rows.
Renames the phone_number_1 and phone_number_2 columns respectively to cell_phone and home_phone.
Getting started with Transform v2
To get started with building your own Transform v2 config for de-identification or pre/post processing datasets, see the Examples page for starter configs for several use cases, and the Reference page for the full list of supported transformation steps, template expression syntax, and detectable entities.