Use Gretel Transform to remove sensitive personally identifiable information (PII).
The Gretel Transform model redacts personally identifiable information (PII) from tabular data. In this example, we will remove PII from a sample dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can redact PII with Transform via the Gretel Console, CLI, or Python SDK. If you'd like to follow along via video, see the video tutorial below.
In this notebook, we will create a transform policy to identify and redact or replace PII with fake values. We will then use the SDK to transform a dataset and examine the results.
To run this notebook, you will need an API key from the Gretel Console.
import pandas as pd
df = pd.read_csv('https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/gretel_generated_table_simpsons_pii.csv')
df.head(5)
Redact PII via transform model
# De-identification configuration
config = """
schema_version: "1.0"
name: "Replace PII"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities:
            - first_name
            - last_name
            - email
            - phone_number
            - street_address
          num_samples: 100
      steps:
        - rows:
            update:
              # Detect and replace values in PII columns, hash if no Faker available
              - condition: column.entity is in globals.classify.entities
                value: column.entity | fake
                fallback_value: this | hash | truncate(9,true,"")
              # Detect and replace entities within free text columns
              - type: text
                value: this | fake_entities(on_error="hash")
              # Replace email addresses with first + last name to retain correlations
              - name: email_address
                value: 'row.first_name + "." + row.last_name + "@" + fake.free_email_domain()'
"""
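from gretel_client import Gretel

# Create the SDK client; api_key="prompt" asks for the API key from the Gretel Console
gretel = Gretel(api_key="prompt", validate=True)

# Submit the transform job and wait for the transformed data
transform_result = gretel.submit_transform(
    config=config,
    data_source=df,
    job_label="Transform PII data"
)
transformed_df = transform_result.transformed_df
transformed_df.head()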
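The fallback rule in the config above, this | hash | truncate(9,true,""), replaces a value with a short, stable token when no Faker generator is available for the detected entity. The sketch below illustrates the idea in plain Python. It assumes SHA-256 and a hex digest, which may not match what Transform uses internally, and hash_and_truncate is a hypothetical helper name, not part of the Gretel SDK.

```python
import hashlib


def hash_and_truncate(value: str, length: int = 9) -> str:
    """Return the first `length` hex characters of a SHA-256 digest.

    Assumption: SHA-256 is used only for illustration; the hashing
    scheme inside Gretel Transform may differ.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[:length]


# The same input always yields the same token, so repeated values
# stay linked after de-identification.
token_a = hash_and_truncate("555-867-5309")
token_b = hash_and_truncate("555-867-5309")
token_c = hash_and_truncate("555-867-5310")
```

Because the token is deterministic, joins and group-bys on the hashed column still behave consistently, although a 9-character token is short enough that collisions become possible on very large datasets.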
View results of redacting data
import pandas as pd
def highlight_detected_entities(report_dict):
    """
    Process the report dictionary, extract columns with detected entities,
    and highlight cells with non-empty entity labels.

    Args:
        report_dict (dict): The report dictionary from transform_result.report.as_dict.

    Returns:
        pd.io.formats.style.Styler: Highlighted DataFrame.
    """
    # Parse the columns and extract 'Detected Entities'
    columns_data = report_dict['columns']
    df = pd.DataFrame([
        {
            'Column Name': col['name'],
            'Detected Entities': ', '.join(
                entity['label'] for entity in col['entities'] if entity['label']
            )
        }
        for col in columns_data
    ])

    # Highlighting logic
    def highlight_entities(s):
        return ['background-color: lightgreen' if len(val) > 0 else '' for val in s]

    # Apply highlighting
    return df.style.apply(highlight_entities, subset=['Detected Entities'], axis=1)

highlight_detected_entities(pd.DataFrame(transform_result.report.as_dict))
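Beyond the entity report, one way to sanity-check a transform is to measure how many values actually changed in each column. The following is a minimal sketch that assumes the original and transformed frames have the same number of rows in the same order. The helper changed_fraction is hypothetical, and the toy frames stand in for df and transformed_df.

```python
import pandas as pd


def changed_fraction(original: pd.DataFrame, transformed: pd.DataFrame) -> pd.Series:
    """Fraction of rows, per shared column, whose value differs after the transform.

    Assumes both frames have the same number of rows in the same order.
    """
    shared = [c for c in original.columns if c in transformed.columns]
    # Compare as strings so dtype changes do not count as differences
    before = original[shared].astype(str).reset_index(drop=True)
    after = transformed[shared].astype(str).reset_index(drop=True)
    return (before != after).mean()


# Toy frames for illustration only
original = pd.DataFrame({"id": [1, 2], "email": ["a@x.com", "b@y.com"]})
transformed = pd.DataFrame({"id": [1, 2], "email": ["c@z.com", "d@w.com"]})
summary = changed_fraction(original, transformed)
```

In the notebook, calling changed_fraction(df, transformed_df) should show values near 1.0 for PII columns and 0.0 for columns the policy leaves alone. A PII column with a low fraction is worth a closer look.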
When running the transform from the CLI, use transform/default as the --config and pii.csv as --in-data.
After the run, you should see output like this:
INFO: Model done training. The model id is
679423a823248531cfb6b6bb
Save this model ID for the next step.
Examine the results
First, download the results:
MODEL_ID=679423a823248531cfb6b6bb # Replace with your model_id
gretel models get --model-id "$MODEL_ID"
Transform results are downloaded to the local directory as a gzip-compressed CSV file, data_preview.gz. Classify should have detected the PII-related columns and transformed them.
Let's examine the transformed results from the command line.
zcat < data.gz | column -s, -t
id  name               email                            phone                   visa              ssn          user_id
1   Samantha Sandoval  projas@hotmail.com               986.089.1149            344661707423210   102-40-4854  XXXX_XXXXX
2   Shannon Holmes     robertprice@mckinney-thomas.com  (686)646-3171           3519277724227055  554-61-8106  XXXX_XXXXX
3   David Chapman      katherinegillespie@hensley.com   001-946-130-7514x76773  213182470523001   008-06-5773  XXXX_XXXXX
4   Crystal Russo      mfischer@yahoo.com               027-327-7306x07952      6011379376191328  628-27-4071  XXXX_XXXXX
5   John Allen         evanbrown@yahoo.com              (365)502-6954           4047982390743587  740-42-9239  XXXX_XXXXX
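If you prefer Python to zcat, the downloaded results can also be read with pandas. This is a minimal sketch: load_transformed is a hypothetical helper, and the demo writes its own small gzip file (using one row from the output above) so that it runs without the real download. Point it at whichever .gz file the CLI saved.

```python
import gzip
import os
import tempfile

import pandas as pd


def load_transformed(path: str) -> pd.DataFrame:
    """Read a gzip-compressed CSV, keeping every column as text.

    dtype=str prevents pandas from reinterpreting values such as
    card numbers or IDs with leading zeros as numbers.
    """
    return pd.read_csv(path, compression="gzip", dtype=str)


# Self-contained demo file; replace demo_path with your downloaded file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wt") as fh:
    fh.write("id,name,ssn\n1,Samantha Sandoval,102-40-4854\n")

demo_df = load_transformed(demo_path)
```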
Next steps
For use cases such as training machine learning models on customer support logs, it is often desirable to replace PII with fake values to maintain semantics in the original data. However, this is not always desirable. Try updating the transformation policy to simply redact all sensitive values with an "*" character.
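To make the target of that exercise concrete, the sketch below shows what fully masked values might look like. It is plain pandas, not Transform policy syntax, and mask_values is a hypothetical helper. It assumes each character is replaced by "*" so the original length is preserved, though a policy could equally emit a single "*" per value.

```python
import pandas as pd


def mask_values(series: pd.Series, mask_char: str = "*") -> pd.Series:
    """Replace every character of each value with `mask_char`, keeping its length."""
    return series.astype(str).map(lambda v: mask_char * len(v))


# Sample values taken from the transformed output above
sample = pd.Series(["102-40-4854", "projas@hotmail.com"], name="sensitive")
masked = mask_values(sample)
```

Length-preserving masks leak the length of the original value. If that matters for your data, a fixed-width mask is the safer choice.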