Reference
Transform v2 configs consist of (optional) global parameters followed by a sequence of transformation steps. Rows, columns, and values not touched by any transformation step are maintained as-is in the output. In other words, Transform v2 configs are implicitly "passthrough".
Below is a "kitchen sink" config showing most of Transform v2's capabilities. Don't worry if it looks overwhelming; we will dissect each step in the reference below.
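A condensed sketch of such a config, built from the sections documented below (column names, entity lists, and values here are illustrative, not prescriptive):

```yaml
globals:
  seed: 1234
  locales: [en_US]
  classify:
    enable: true
    entities: [first_name, last_name, email, phone_number]
    num_samples: 3
steps:
  - columns:
      add:
        - name: row_id
          position: 0
      drop:
        - name: ssn            # illustrative column name
      rename:
        - name: fname          # illustrative column name
          value: first_name
    rows:
      drop:
        - condition: row.user_id | isna
      update:
        - name: row_id
          value: index
        - entity: email
          value: fake.email()
        - condition: column.entity is not none
          value: column.entity | fake
          fallback_value: this | hash
```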
Globals
The entire `globals` section is optional. You can use it to re-configure the following default entity detection and transformation settings:
- `classify`: Dictionary of classification configuration parameters. Note that classification is only performed once for each model, and currently only maps entire columns to entities (searching for entities within free text fields, similarly to Transform's `use_nlp` option, is not currently supported in Transform v2). Subsequent model runs will assume the schema remains unchanged, and continue to use the column-to-entity mapping detected during the first run. NOTE: This will send column headers and a sample of data to Gretel Navigator or a hybrid-deployed Gretel Inference LLM to perform the classification.
  - `enable`: Boolean specifying whether to perform classification. Defaults to `true` when running within Gretel Cloud; defaults to `false` otherwise. When `false`, sets `column.entity` to `none` for all columns. When `true`, classification accuracy currently necessitates sending column names and a few (equal to `num_samples`) randomly selected values from each column to the Gretel Cloud.
  - `entities`: List of PII entities that Transform's classification model will attempt to detect. Defaults to the following commonly used entities: `[name, first_name, last_name, company, email, phone_number, address, street_address, city, administrative_unit, country, postcode]`. For best practices around customizing this list, see Classification.
  - `num_samples`: Number of randomly selected values from each column to use for classification. Defaults to 3, but you can set it to a higher number for more accurate classification, or a lower number if you have privacy or security concerns with sending randomly sampled values from your dataset. Setting `num_samples: 0` will use only column names as the input to classification.
- `ner`: Dictionary of named entity recognition (NER) configuration parameters; see Named Entity Recognition below.
- `locales`: List of default Faker locales to use for fake value generation. Defaults to `["en_US"]`. `fake` will randomly choose a locale from this list each time it generates a fake value, except when initialized with explicit locales, e.g. `fake(["fr_FR"]).first_name()`. For a list of valid locales, see Faker's localized providers.
- `seed`: Integer seed value used to generate fake values consistently. Defaults to `null`. When the seed is set to `null`, a random integer is generated at the beginning of each Transform v2 run and used as the seed to transform values consistently within the current run (subsequent runs will generate their own random seed). This means rerunning with a null seed can cause inconsistent transforms (i.e. Alice -> Bob for the first run, Alice -> Jane for the second). If you set the seed to a specific number, transforms will be consistent across runs (i.e. Alice -> Bob always). The seed also doubles as a salt for the `hash` function. While there are privacy benefits to inconsistent transformations, we recommend setting a fixed seed for consistent transformation for use cases involving downstream synthetic data generation, or analysis on the transformed dataset.
You can also access global constants in transformation steps. For example, a transformation step with `value: globals.locales | first` will set that field's value to the first locale in the `locales` list.
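As a minimal sketch of this (the `locale_used` column is hypothetical):

```yaml
globals:
  locales: [en_US, fr_FR]
steps:
  - rows:
      update:
        - name: locale_used
          value: globals.locales | first   # evaluates to "en_US"
```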
Steps
`steps` contains an ordered list of data transformation actions, executed in the order in which they are defined in the Transform v2 config.
Vars
Each step can optionally contain a `vars` section, which defines custom variables to be used in any Jinja expression within the step. Unlike `globals`, `vars` are scoped to an individual step, and are initialized using Jinja expressions that are evaluated at the beginning of each step.

These expressions can leverage `data` (a Pandas DataFrame containing the entire dataset) to implement custom aggregations. For example, the config section below creates a new `percent_of_total` column by storing the `total` in `vars`, then dividing the value of each individual row by `vars.total`:
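A sketch of that step, assuming the dataset has an `amount` column (the column names are illustrative):

```yaml
steps:
  - vars:
      total: data.amount.sum()   # aggregate computed once over the whole dataset
    columns:
      add:
        - name: percent_of_total
    rows:
      update:
        - name: percent_of_total
          value: row.amount / vars.total * 100
```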
Columns
The `columns` section of each step contains transformations that apply to an entire column at once: adding a new column, dropping (removing) a column, and renaming a column.
Add
You can add a new blank column (which you can later fill in using a `rows` `update` action) by specifying its `name` and optional `position`. If `position` is left unspecified, the new column is added as the last column. Initially all values in the new column will be null, but you can populate them using a `rows.update` rule. For example, the config section below adds a `primary_key` column, positions it as the first column in the dataset, and then populates it with the index of the row:
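A sketch of such a step might look like:

```yaml
steps:
  - columns:
      add:
        - name: primary_key
          position: 0            # make it the first column
    rows:
      update:
        - name: primary_key
          value: index           # zero-based row index
```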
Drop
To drop a column, specify its name in a `columns` `drop` action. For example, the config section below drops the `FirstName` and `LastName` columns:
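A sketch of that drop action:

```yaml
steps:
  - columns:
      drop:
        - name: FirstName
        - name: LastName
```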
You can also drop columns based on a condition. `condition` has access to the entire Transform v2 Jinja environment, as well as a few additional objects:

- `vars`: Dictionary of variables defined under the `vars` section of the current `step`. For example, `vars.total` refers to the value of the `total` variable defined above.
- `column`: Dictionary containing the following column properties. For example, `condition: column.entity in vars.entities_to_drop` drops all columns matching the list of PII entities defined in the `entities_to_drop` variable.
  - `name`: the name or header of the column in the dataset.
  - `entity`: the detected PII entity type of the column, or `none` if the column does not match any PII entity type from the list under `globals.classify.entities`.
  - `dtype`: Pandas dtype of the column.
  - `type`: the detected data type of the column, one of "empty", "numeric", "categorical", "binary", "text", or "other".
  - `position`: zero-indexed position of the column in the dataset. For a dataset with 10 columns, `column.position` is equal to 0 for the first column and 9 for the last column.
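As a sketch of the conditional drop described above (the entity list is illustrative; note that `vars` values are Jinja expressions, hence the quoted list literal):

```yaml
steps:
  - vars:
      entities_to_drop: "['email', 'phone_number']"
    columns:
      drop:
        - condition: column.entity in vars.entities_to_drop
```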
Rename
You can rename a column by specifying its current name (`name`) and new name (`value`). For example, the config section below renames the `MiddleName` column to `MiddleInitial`:
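A sketch of that rename action:

```yaml
steps:
  - columns:
      rename:
        - name: MiddleName
          value: MiddleInitial
```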
Rows
Each step can also contain a `rows` section, listing transformation rules that process the dataset row by row. The two currently supported operations are `drop` and `update`, respectively allowing for selective removal of rows or modification of row data based on specified rules.
Drop
The `drop` operation within the `rows` section is used to remove rows from the dataset that meet certain conditions. The specified condition must be a valid Jinja template expression. Rows that satisfy the condition are excluded from the resulting dataset.

For instance, to exclude rows where the `user_id` column is empty, the configuration can be specified as follows:
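A sketch using the `isna` filter documented below:

```yaml
steps:
  - rows:
      drop:
        - condition: row.user_id | isna
```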
You can use more complex Jinja expressions for conditions that involve multiple columns, logical operators, or functions. `condition` has access to the entire Transform v2 Jinja environment, as well as a few additional objects:

- `vars`: Dictionary of variables defined under the `vars` section of the current `step`. For example, `vars.total` refers to the value of the `total` variable defined above.
- `row`: Dictionary of the row's contents. For example, `row.user_id` refers to the value of the `user_id` column within that row.
- `index`: Zero-based index of the row in the dataset. Note that the index of a row may change during processing if previous steps delete or add rows.

For example, the rule below drops every other record from the dataset:
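A sketch of that rule:

```yaml
steps:
  - rows:
      drop:
        - condition: index % 2 == 1   # drop every odd-indexed row
```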
Update
The `update` operation allows you to modify the values of specific rows. It can be used to set new values for columns, generate fake data, anonymize sensitive information, or apply any transformation that can be expressed as a Jinja template.
Each update operation must contain one of `name`, `entity`, `type`, or `condition`, which are different ways to specify what to update, as well as `value`, which contains the updated value. `name` and `entity` must be strings or lists of strings, while `condition` and `value` are Jinja templates.
You can also optionally specify a `fallback_value` to be used if evaluating `value` throws an error. We recommend doing this when passing dynamic inputs to functions in `value` (for example, setting the Faker locale based on the contents of another column), preferably with a simple template (e.g. using static parameter values) for `fallback_value` to avoid further errors. In the event where both `value` and `fallback_value` fail to parse, the value will be set to the error message to aid with debugging.
`condition`, `value`, and `fallback_value` in row update rules have access to the row drop Jinja environment, including `vars`, `row`, and `index`, as well as a few additional objects:

- `column`: Dictionary referring to the current column whose value is being changed. The properties of the column that can be accessed are:
  - `name`: The name of the column.
  - `entity`: The name of the entity detected in the column.
  - `type`: A Gretel-extracted generic type for the column, one of: `empty`, `numeric`, `categorical`, `text`, `binary`, `other`.
  - `dtype`: The Pandas dtype of the column (`object`, `int32`, etc.).
  - `position`: The numerical (index) position of the column in the table.
- `this`: Literal referring to the current value that is being changed. For example, `value: this` is a no-op which leaves the current value unchanged, while `value: this | hash` replaces the current value with its SHA-256 hash.
Here's how the `update` operation works, with examples:
Setting a static value
The rule below sets the value of the column named `status_column` to the string `processed` for all rows.
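A sketch of that rule; note the nested quoting, since `value` is evaluated as a Jinja expression and a bare word would be treated as a variable name:

```yaml
steps:
  - rows:
      update:
        - name: status_column
          value: '"processed"'
```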
Incrementing an index
In the example below, we use the `index` special variable to set the value of the `row_index` column to the index of the record in the dataset, e.g. for a dataset containing 100 rows, the value of `row_index` for the last row will be 99.
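A sketch of that rule:

```yaml
steps:
  - rows:
      update:
        - name: row_index
          value: index
```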
Generating fake PII
You can use the built-in Faker implementation to generate fake entities. See Faker's documentation for a list of supported entities and parameters.
The example below replaces values in all columns detected to contain email addresses with fake email addresses. Notice that unlike previous examples where the update rule was conditioned on `name` (the name of a column), the rule below is conditioned on `entity` (the type of entity contained within a column), which may match multiple columns. For example, if the dataset contains `personal_email` and `work_email` columns, the rule below will replace the contents of both with fake email addresses.
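A sketch of that rule:

```yaml
steps:
  - rows:
      update:
        - entity: email
          value: fake.email()
```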
Modifying based on a condition
You can also conditionally update rows using flexible Jinja conditions. These conditions may match any number of columns and any number of rows (unlike `name` and `entity` conditions, which apply to all rows).

For example, you can set the value of the `flag_for_review` column to `true` for all rows where the value of the `amount` column is greater than 1,000:
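As a sketch, using a condition that matches on both the column name and the row's `amount` value:

```yaml
steps:
  - rows:
      update:
        - condition: column.name == "flag_for_review" and row.amount > 1000
          value: true
```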
Classification
Transform v2 incorporates a classification feature to detect personally identifiable information (PII) within data. This feature simplifies selecting and transforming specific types of PII by tagging each column with its appropriate entity, if any.
Note: Column classification requires access to an LLM endpoint. When running within Gretel Cloud, this will use Gretel Navigator. For Gretel Hybrid, classification needs to use a separately deployed LLM within your cluster. For full documentation on how to set up an LLM, see Deploying an LLM.
PII detection
The classification model is capable of recognizing a variety of pre-defined and custom entities. While you can use arbitrary strings as entity names, it is beneficial to align with Faker entities if you plan to pass entity names to the `fake` filter in order to generate fake values of the same entity.
For example, to detect and replace phone numbers, email addresses, and International Bank Account Numbers (IBAN), include `phone_number`, `email`, and `iban` in the list of entities under `globals.classify.entities`. These match Faker's `phone_number()`, `email()`, and `iban()` methods exactly.
Here is an example configuration that uses classification for detecting these 3 entities and applying transformations:
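A sketch of such a configuration:

```yaml
globals:
  classify:
    enable: true
    entities: [phone_number, email, iban]
steps:
  - rows:
      update:
        - entity: phone_number
          value: fake.phone_number()
        - entity: email
          value: fake.email()
        - entity: iban
          value: fake.iban()
```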
Since these align with Faker built-in entities, we could also write a single rule that applies to all detected entities:
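As a sketch, a single rule covering every column with a detected entity, using the `fake` filter with the entity name as its input:

```yaml
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
```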
With this setting, Transform v2 will first classify entities in the dataset, then replace detected entities with faker-generated ones for each row in the dataset.
If your list of entities contains custom entities not supported by Faker, you can leverage `fallback_value` to apply other transformations. For example, the policy below attempts to fake all entities, and falls back to hashing unsupported entities. Since `iban` is supported by Faker while `employee_id` is not, the output of this policy will be fake IBAN values in the IBAN column, and hashes of the actual employee IDs in the employee ID column.
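A sketch of that policy:

```yaml
globals:
  classify:
    entities: [iban, employee_id]
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake     # fails for employee_id
          fallback_value: this | hash     # hash the original value instead
```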
If instead you wish to replace unsupported entities with the entity name between brackets, you could set `fallback_value: "<" + column.entity + ">"`. You could also generate custom fake values. For example, if you wanted to replace all entities not supported by Faker with the letter "E" followed by a random 6-digit number, you could set `fallback_value: "E" + fake.pyint(100000, 999999) | string`, or use Jinja's concatenation operator `~`, which automatically converts integers to strings: `fallback_value: "E" ~ fake.pyint(100000, 999999)`.
Named Entity Recognition
Similarly to column classification, Transform v2 supports flexible Named Entity Recognition (NER) functionality including the ability to detect and transform custom entity types.
To get started, list the entities to detect under the `globals.ner.entities` section and use one of the four built-in NER transformation filters:
- `redact_entities` replaces detected entities with the entity type. For example, "I met Sally" becomes "I met <first_name>".
- `fake_entities` replaces detected entities with randomly generated fake values using the Faker function corresponding to the entity type. For example, "I met Sally" could become "I met Joe". When using `fake_entities`, ensure the name of the entity in the `globals.classify.entities` section exactly matches the name of a Faker function. Entities without a matching Faker function are redacted by default, and you can customize the fallback behavior using the `on_error` parameter, e.g. `fake_entities(on_error="hash")` hashes the non-Faker-matching entities instead of redacting them.
- `hash_entities` replaces detected entities with salted hashes of their value. For example, "I met Sally" may become "I met 515acf74f".
- `label_entities` is similar to `redact_entities`, but also includes the entity value. For example, "I met Sally" becomes "I met <entity type="first_name" value="Sally">". This can be useful for downstream post-processing (such as highlighting detected entities within the original text, applying more complex replacement logic for specific entity types, etc.), both within Transform v2 and externally.
You can tweak the `ner_threshold` parameter if you notice too many or too few detections. You can think of the NER threshold as the level of confidence required in the model's detection before labeling an entity. Increasing the NER threshold decreases the number of detected entities, while decreasing it increases the number of detected entities. Values between 0.5 and 0.8 are good starting points.
The sample config below shows how to apply `fake_entities` (falling back to `redact_entities`) for a list of custom entity types across all free text fields:
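A sketch of such a config (the custom entity types are illustrative; redaction is the default fallback for entities without a matching Faker function):

```yaml
globals:
  ner:
    entities: [medication, medical_condition]
steps:
  - rows:
      update:
        - type: text
          value: this | fake_entities   # unsupported entities are redacted
```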
Additionally, if you would like to speed up Named Entity Recognition by having it run on hardware with a GPU, you can set the `globals.ner.ner_optimized` flag to `true`:
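A sketch of that setting:

```yaml
globals:
  ner:
    ner_optimized: true
```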
Classification in Hybrid
If you are running Transform v2 in Gretel Hybrid and want to use classification, you'll need to first ensure you've installed the Gretel Inference LLM chart in your cluster. For full instructions on that installation, see Deploying an LLM.
Once you've done that, you can specify the Gretel Inference LLM model via Transform v2's `globals.classify.deployed_llm_name` configuration field. This name should match the `gretelLLMConfig.modelName` defined in the Gretel Inference LLM's `values.yml`.
Here's how to perform the above PII detection using `mistral-7b` deployed in your Gretel Hybrid cluster:
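A sketch of that configuration:

```yaml
globals:
  classify:
    enable: true
    deployed_llm_name: mistral-7b   # must match gretelLLMConfig.modelName
    entities: [phone_number, email, iban]
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
```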
Jinja environment
Objects
Every Jinja environment in Transform v2 can access the objects below:
- `fake`: Instantiation of Faker which defaults to the locale and seed specified in the `globals` section. You can override these defaults by passing parameters, such as `fake(locale="it_IT", seed=42)`, which will generate data using the Italian locale and 42 as the consistency seed.
- `random`: Python's random library. For example, you could call `random.randint(1, 10)` to generate an integer between 1 and 10.
Filters
Variables can be modified by filters. Filters are separated from the variable by a pipe symbol (`|`) and may have optional arguments in parentheses. Multiple filters can be chained: the output of one filter is applied to the next. Transform v2 can use any of Jinja's built-in filters, and also extends them with the following Gretel-specific filters:
- `hash`: Computes the SHA-256 hash of a value. For example, `this | hash` returns a hash of the value in the matched column in a row update rule. It can also take its own salt, i.e. `this | hash(salt="my-salt")`, but by default it uses the `seed` value of the run as the salt. If the seed is unset, the hash will be different for the same values across runs.
- `isna`: Returns `true` if a value is null or missing.
- `fake`: Invokes the Faker library to generate fake data of the entity that's passed to the filter. This is useful if the entity name is dynamic, e.g. `column.entity | fake` is equivalent to `fake.first_name()` if `column.entity` is equal to `"first_name"`.
- `lookup_country`: Attempts to map a country name to its corresponding pycountry Country.
- `lookup_locales`: Maps a pycountry Country to a list of Faker locales for that country. For example, `"Canada" | lookup_country | lookup_locales` returns `["en_CA", "fr_CA"]`.
- `normalize`: Removes special characters and converts Unicode strings to an ASCII representation.
- `tld`: Maps a pycountry Country object to its corresponding top-level domain. For example, `"France" | lookup_country | tld` evaluates to `.fr`.
- `partial_mask(prefix: int, padding: str, suffix: int)`: Similar to the MSSQL dynamic masking `partial()` functionality. Given a value, this filter retains the first `prefix` characters and the last `suffix` characters, and applies the padding between them. If the original value is too short and would be leaked in the prefix, suffix, or a combination of the two, the prefix and suffix are automatically adjusted to prevent this. For very short values, for example a single-character value, only the padding may be returned. Example usage: `value: this | partial_mask(2, "XXXXXX", 2)`.
- `date_parse`: Takes a string value and parses it into a Python datetime object. Date formats are those supported by Python's `dateutil.parser.parse` method.
- `date_shift`: Takes a date, either as a string or a date object, and randomly shifts it on an interval about the date. For example, `2023-01-01 | date_shift('-5y', '+5y')` will result in a date object between `2018-01-01` and `2028-01-01`. Supports the same interval formats as Python's `faker.providers.date_time.date_between`.
- `date_time_shift`: Takes a date, either as a string, a date, or a datetime object, and randomly shifts it on an interval about the date. For example, `2023-01-01 00:00 | date_time_shift('-5y', '+5y')` will result in a datetime object between `2018-01-01 00:00` and `2028-01-01 00:00`. Supports the same interval formats as Python's `faker.providers.date_time.date_between`.
- `date_format`: Takes a date and formats it per the passed-in format. The default format is `"%Y-%m-%d"`. Supports all formats for `strftime`.
- `date_time_format`: Takes a datetime and formats it per the passed-in format. The default format is `"%Y-%m-%d %H:%M:%S"`. Supports all formats for `strftime`.