Reference
Transform v2 configs consist of (optional) global parameters followed by a sequence of transformation steps. Rows, columns, and values not touched by any transformation step are maintained as-is in the output. In other words, Transform v2 configs are implicitly "passthrough".
Below is a "kitchen sink" config showing most of Transform v2's capabilities. Don't worry if it looks overwhelming: each part is dissected in the reference below.
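As a rough sketch of what such a config can look like, the fragment below combines the sections described in this reference. The exact nesting, the column names (amount, notes, user_id, MiddleName), and the chosen seed are illustrative assumptions, and any wrapper keys a full Gretel config file needs around globals and steps are omitted.

```yaml
# Illustrative sketch only: column names and nesting are assumptions.
globals:
  classify:
    enable: true
    entities: [first_name, last_name, email, phone_number]
    num_samples: 3
  locales: [en_US, en_CA]
  seed: 42
steps:
  - vars:
      # data is the whole dataset as a Pandas DataFrame
      total: data["amount"].sum()
    columns:
      add:
        - name: percent_of_total
      drop:
        - name: notes
      rename:
        - name: MiddleName
          value: MiddleInitial
    rows:
      drop:
        - condition: row.user_id | isna
      update:
        - name: percent_of_total
          value: row.amount / vars.total * 100
        - entity: email
          value: fake.email()
        - entity: [first_name, last_name, phone_number]
          value: column.entity | fake
          fallback_value: this | hash
```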
Globals
The entire globals section is optional. You can use it to reconfigure the following default entity detection and transformation settings:
classify: Dictionary of classification configuration parameters. Classification is performed only once for each model, and currently only maps entire columns to entities (searching for entities within free-text fields, as Transform's use_nlp option does, is not currently supported in Transform v2). Subsequent model runs assume the schema remains unchanged and continue to use the column-to-entity mapping detected during the first run.
  enable: Boolean specifying whether to perform classification. Defaults to false, which sets column.entity to none for all columns. When true, accurate classification currently requires sending column names and a few (equal to num_samples) randomly selected values from each column to the Gretel Cloud.
  entities: List of PII entities that Transform's classification model will attempt to detect. Defaults to the following commonly used entities: [name, first_name, last_name, company, email, phone_number, address, street_address, city, administrative_unit, country, postcode]. For best practices around customizing this list, see Classification.
  num_samples: Number of randomly selected values from each column to use for classification. Defaults to 3. Set it higher for more accurate classification, or lower if you have privacy or security concerns about sending randomly sampled values from your dataset. Setting num_samples: 0 uses only column names as the input to classification.
locales: List of default Faker locales to use for fake value generation. Defaults to ["en_US"]. fake randomly chooses a locale from this list each time it generates a fake value, except when initialized with explicit locales, e.g. fake(["fr_FR"]).first_name(). For a list of valid locales, see Faker's localized providers.
seed: Integer seed value used to generate fake values consistently. Defaults to session. Given the same seed value and the same input, fake generates the same output throughout the dataset and across multiple sessions (for example, "Alice" may be transformed to "Bob" in all records). When the seed is set to null, fake value transformations are not consistent, even within a single session (for example, "Alice" may be transformed to "Bob" in one record and "Eve" in another). When the seed is set to session, a random integer is generated at the beginning of each Transform v2 run and used as the seed, so values are transformed consistently within the current run (subsequent runs generate their own random seed). While inconsistent transformations have privacy benefits, we recommend setting a fixed seed when the transformed dataset will be used for downstream synthetic data generation or analysis.
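A minimal sketch of a globals section that overrides each of the settings above. The key names come from this reference; the nesting of enable, entities, and num_samples under classify, and the specific values chosen, are assumptions for illustration.

```yaml
globals:
  classify:
    enable: true
    # Only look for these entities instead of the default list
    entities: [first_name, last_name, email, iban]
    # 0 sends column names only, no sampled values
    num_samples: 0
  # fake picks one of these locales for each generated value
  locales: [en_US, fr_FR]
  # Fixed integer seed: the same input maps to the same fake output across runs
  seed: 42
```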
You can also define global constants in the globals section, which you can access in any step. For example, if you define company: "Acme Inc." under globals, a transformation step with value: globals.company will set that field's value to "Acme Inc.".
Steps
steps contains an ordered list of data transformation actions, executed in the order in which they are defined in the Transform v2 config.
Vars
Each step can optionally contain a vars section, which defines custom variables to be used in any Jinja expression within the step. Unlike globals, vars are scoped to an individual step, and are initialized using Jinja expressions that are evaluated at the beginning of each step.
These expressions can leverage data (a Pandas DataFrame containing the entire dataset) to implement custom aggregations. For example, the config section below creates a new percent_of_total column by storing the total in vars, then dividing the value of each individual row by vars.total:
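A sketch of such a step, assuming the values to total live in a numeric column named amount (a hypothetical name) and that the result should be expressed as a percentage:

```yaml
steps:
  - vars:
      # Evaluated once at the start of the step; data is a Pandas DataFrame
      total: data["amount"].sum()
    columns:
      add:
        - name: percent_of_total
    rows:
      update:
        - name: percent_of_total
          value: row.amount / vars.total * 100
```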
Columns
The columns section of each step contains transformations that apply to an entire column at once: adding a new column, dropping (removing) a column, and renaming a column.
Add
You can add a new blank column by specifying its name and, optionally, its position. If position is left unspecified, the new column is added as the last column. All values in the new column are initially null, but you can populate them using a rows.update rule. For example, the config section below adds a primary_key column, positions it as the first column in the dataset, and then populates it with the index of the row:
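A sketch of that step, assuming position is zero-based so that 0 means the first column:

```yaml
steps:
  - columns:
      add:
        - name: primary_key
          position: 0   # assumed zero-based: first column
    rows:
      update:
        - name: primary_key
          value: index
```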
Drop
To drop a column, specify its name in a columns drop action. For example, the config section below drops the FirstName and LastName columns:
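A sketch of the drop action, assuming each column to drop is listed as its own name entry:

```yaml
steps:
  - columns:
      drop:
        - name: FirstName
        - name: LastName
```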
Rename
You can rename a column by specifying its current name (name) and new name (value). For example, the config section below renames the MiddleName column to MiddleInitial:
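A sketch of the rename action using the name and value keys described above:

```yaml
steps:
  - columns:
      rename:
        - name: MiddleName      # current column name
          value: MiddleInitial  # new column name
```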
Rows
Each step can also contain a rows section, listing transformation rules that process the dataset row by row. The two currently supported operations are drop and update, which respectively allow selective removal of rows and modification of row data based on specified rules.
Drop
The drop operation within the rows section removes rows that meet certain conditions from the dataset. The specified condition must be a valid Jinja template expression. Rows that satisfy the condition are excluded from the resulting dataset.
For instance, to exclude rows where the user_id column is empty, the configuration can be specified as follows:
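A sketch of that rule, assuming "empty" means null or missing so that the isna filter described under Filters applies:

```yaml
steps:
  - rows:
      drop:
        # Drop the row when user_id is null or missing
        - condition: row.user_id | isna
```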
You can use more complex Jinja expressions for conditions that involve multiple columns, logical operators, or functions. The condition expression has access to the entire Transform v2 Jinja environment, as well as a few additional objects:
vars: Dictionary of variables defined under the vars section of the current step. For example, vars.total refers to the value of the total variable defined above.
row: Dictionary of the row's contents. For example, row.user_id refers to the value of the user_id column within that row.
index: Zero-based index of the row in the dataset. Note that the index of a row may change during processing if previous steps delete or add rows.
For example, the rule below uses index to drop every other record from the dataset:
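One way to sketch this is with a modulo test on index. Dropping the odd-indexed rows (rather than the even-indexed ones) is an arbitrary choice made for illustration.

```yaml
steps:
  - rows:
      drop:
        # Keeps rows 0, 2, 4, ... and drops rows 1, 3, 5, ...
        - condition: index % 2 == 1
```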
Update
The update operation allows you to modify the values of specific rows. It can be used to set new values for columns, generate fake data, anonymize sensitive information, or apply any transformation that can be expressed as a Jinja template.
Each update operation must contain one of name, entity, or condition, which are different ways to specify what to update, as well as value, which contains the updated value. name and entity must be strings or lists of strings, while condition and value are Jinja templates.
You can also optionally specify a fallback_value to be used if evaluating value throws an error. We recommend doing this when passing dynamic inputs to functions in value (for example, setting the Faker locale based on the contents of another column), preferably with a simple template (e.g. one using static parameter values) for fallback_value to avoid further errors. If both value and fallback_value fail to parse, the value is set to the error message to aid with debugging.
condition, value, and fallback_value in row update rules have access to the row drop Jinja environment, including vars, row, and index, as well as a few additional objects:
column: Dictionary referring to the current column whose value is being changed. Properties include name and entity.
this: Literal referring to the current value that is being changed. For example, value: this is a no-op which leaves the current value unchanged, while value: this | sha256 replaces the current value with its SHA-256 hash.
Here's how the update operation works, with examples:
Setting a static value
The rule below sets the value of the column named status_column to the string processed for all rows.
Incrementing an index
In the example below, we use the index special variable to set the value of the column row_index to the index of the record in the dataset. For example, in a dataset containing 100 rows, the value of row_index for the last row will be 99.
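A sketch of that rule, assuming the row_index column already exists in the dataset (otherwise it would first need to be added under columns):

```yaml
steps:
  - rows:
      update:
        - name: row_index
          value: index   # zero-based, so the last of 100 rows gets 99
```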
Generating fake PII
You can use the built-in Faker implementation to generate fake entities. See Faker's documentation for a list of supported entities and parameters.
The example below replaces values in all columns detected to contain email addresses with fake email addresses. Notice that unlike previous examples where the update rule was conditioned on name (the name of a column), the rule below is conditioned on entity (the type of entity contained within a column), which may match multiple columns. For example, if the dataset contains personal_email and work_email columns, the rule below will replace the contents of both with fake email addresses.
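A sketch of an entity-based rule. It assumes classification is enabled with email among the entities, since column.entity is otherwise none for every column.

```yaml
globals:
  classify:
    enable: true
    entities: [email]
steps:
  - rows:
      update:
        # Applies to every column classified as email
        - entity: email
          value: fake.email()
```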
Modifying based on a condition
You can also conditionally update rows using flexible Jinja conditions. These conditions may match any number of columns and any number of rows (unlike name and entity conditions, which apply to all rows).
For example, you can set the value of the flag_for_review column to true for all rows where the value of the amount column is greater than 1,000:
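A sketch of such a rule. It assumes that name and condition can be combined in a single rule, so that name selects the column to write and condition selects the rows. That combination is inferred from the description above rather than stated explicitly.

```yaml
steps:
  - rows:
      update:
        - name: flag_for_review          # column to write (assumed combinable with condition)
          condition: row.amount > 1000   # only rows matching this are updated
          value: true
```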
Classification
Transform v2 incorporates a classification feature to detect personally identifiable information (PII) within data. This feature simplifies selecting and transforming specific types of PII by tagging each column with its appropriate entity, if any.
PII Detection
The classification model is capable of recognizing a variety of pre-defined and custom entities. While you can use arbitrary strings as entity names, it is beneficial to align with Faker entities if you plan to pass entity names to the fake filter in order to generate fake values of the same entity.
For example, to detect and replace phone numbers, email addresses, and International Bank Account Numbers (IBAN), include phone_number, email, and iban in the list of entities under globals.classify.entities. These exactly match Faker's phone_number(), email(), and iban() methods.
Here is an example configuration that uses classification for detecting these 3 entities and applying transformations:
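A sketch of such a configuration, with one update rule per entity. The nesting follows the sections described earlier in this reference and is otherwise an assumption.

```yaml
globals:
  classify:
    enable: true
    entities: [phone_number, email, iban]
steps:
  - rows:
      update:
        - entity: phone_number
          value: fake.phone_number()
        - entity: email
          value: fake.email()
        - entity: iban
          value: fake.iban()
```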
Since these align with Faker built-in entities, we could also write a single rule that applies to all detected entities:
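A sketch of the single-rule variant. It relies on entity accepting a list of strings and on the fake filter generating a value for whatever entity name it receives.

```yaml
globals:
  classify:
    enable: true
    entities: [phone_number, email, iban]
steps:
  - rows:
      update:
        - entity: [phone_number, email, iban]
          # column.entity is the entity detected for the matched column
          value: column.entity | fake
```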
With this setting, Transform v2 will first classify entities in the dataset, then replace detected entities with Faker-generated ones for each row in the dataset.
If your list of entities contains custom entities not supported by Faker, you can leverage fallback_value to apply other transformations. For example, the policy below attempts to fake all entities, and falls back to hashing unsupported entities. Since iban is supported by Faker while employee_id is not, the output of this policy will be fake IBAN values in the IBAN column, and hashes of the actual employee IDs in the employee ID column.
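A sketch of that policy, using the hash filter described under Filters as the fallback:

```yaml
globals:
  classify:
    enable: true
    entities: [employee_id, iban]
steps:
  - rows:
      update:
        - entity: [employee_id, iban]
          # Works for iban; fails for employee_id, which Faker does not support
          value: column.entity | fake
          # Used whenever value throws an error
          fallback_value: this | hash
```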
If you instead wish to replace unsupported entities with the entity name between angle brackets, you could set fallback_value: "<" + column.entity + ">". You could also generate custom fake values. For example, to replace all entities not supported by Faker with the letter "E" followed by a random six-digit number, you could set fallback_value: "E" + fake.pyint(100000, 999999) | string, or use Jinja's concatenation operator ~, which automatically converts integers to strings: fallback_value: "E" ~ fake.pyint(100000, 999999).
Jinja environment
Objects
Every Jinja environment in Transform v2 can access the objects below:
fake: Instantiation of Faker which defaults to the locale and seed specified in the globals section. You can override these defaults by passing parameters, such as fake(locale="it_IT", seed=42), which will generate data using the Italian locale and 42 as the consistency seed.
random: Python's random library. For example, you could call random.randint(1, 10) to generate an integer between 1 and 10.
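A sketch showing both objects inside update rules. The first_name and rating column names are hypothetical.

```yaml
steps:
  - rows:
      update:
        # Overrides the global locale and seed for this rule only
        - name: first_name
          value: fake(locale="it_IT", seed=42).first_name()
        # Python's random.randint includes both endpoints
        - name: rating
          value: random.randint(1, 10)
```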
Filters
Variables can be modified by filters. Filters are separated from the variable by a pipe symbol (|) and may have optional arguments in parentheses. Multiple filters can be chained, in which case the output of one filter is applied to the next. Transform v2 can use any of Jinja's built-in filters, and also extends them with a few Gretel-specific filters:
hash: Computes the SHA-256 hash of a value. For example, this | hash returns a hash of the value in the matched column in a row update rule.
isna: Returns true if a value is null or missing.
fake: Invokes the Faker library to generate fake data of the entity that's passed to the filter. This is useful if the entity name is dynamic, e.g. column.type | fake is equivalent to fake.first_name() if column.type is equal to "first_name".
lookup_country: Attempts to map a country name to its corresponding pycountry Country.
lookup_locales: Maps a pycountry Country to a list of Faker locales for that country. For example, "Canada" | lookup_country | lookup_locales returns ["en_CA", "fr_CA"].
normalize: Removes special characters and converts Unicode strings to an ASCII representation.
tld: Maps a pycountry Country object to its corresponding top-level domain. For example, "France" | lookup_country | tld evaluates to .fr.
date_parse: Takes a string value and parses it into a Python date object. Date formats are those supported by Python's dateutil.parser.parse method.
date_shift: Takes a date, either as a string or a date object, and randomly shifts it within an interval around the date. For example, "2023-01-01" | date_shift('-5y', '+5y') will result in a date object between 2018-01-01 and 2028-01-01. Supports the same interval formats as Python's faker.providers.date_time.date_between.
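As a sketch of chaining these filters, the rule below picks Faker locales from each row's country and falls back to the global locales when the lookup fails. The country and first_name column names are hypothetical, and passing the list returned by lookup_locales directly to fake is based on the fake(["fr_FR"]) form shown under Globals.

```yaml
steps:
  - rows:
      update:
        - name: first_name
          # e.g. "Canada" -> Country -> ["en_CA", "fr_CA"]
          value: fake(row.country | lookup_country | lookup_locales).first_name()
          # Static fallback if the country cannot be mapped
          fallback_value: fake.first_name()
```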