
Redact PII

Use Gretel Transforms to remove sensitive personally identifiable information (PII).
The Gretel Transform model redacts PII from tabular data. In this example, we will remove PII from a Sample Dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can redact PII using Transform via the Gretel Console, CLI, or Python SDK.

Tutorial

SDK
CLI

Redact PII

In this notebook, we will create a transform policy to identify and redact or replace PII with fake values. We will then use the SDK to transform a dataset and examine the results.
To run this notebook, you will need an API key from the Gretel Console.

Getting started

%%capture
!pip install pyyaml Faker pandas
!pip install -U gretel-client
import pandas as pd
from gretel_client import configure_session

pd.set_option("max_colwidth", None)

# Specify your Gretel API key. With api_key="prompt", you are asked to enter it interactively.
configure_session(api_key="prompt", cache="yes", validate=True)

Create configuration with transform policy

# Create our configuration with our Transforms Policies and Rules.
config = """schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""
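To see what the custom `user_id` predictor in the policy is meant to catch, the pattern can be tried on its own with Python's `re` module. This is only a local sketch: it assumes the pattern is applied as an ordinary regular expression search, and the helper name `looks_like_user_id` is hypothetical, not part of the Gretel SDK.

```python
import re

# Same pattern as the `user_id` entry under `label_predictors` in the config.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def looks_like_user_id(value: str) -> bool:
    """Return True if the value contains 'user_' followed by five digits."""
    return USER_ID_PATTERN.search(value) is not None


# Try the pattern on a matching ID, a too-short ID, and an unrelated value.
for sample in ["user_93952", "user_123", "Kimberli Goodman"]:
    print(sample, looks_like_user_id(sample))
```

Only the first sample matches, which is why the policy lists `custom/*` alongside the built-in labels: values caught by this pattern are labeled under the `custom` namespace.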

Use Faker to make training and test datasets

from faker import Faker

# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")

fake_pii_csv("train.csv")
fake_pii_csv("test.csv")

Create model

import yaml
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll
# Create a project and model configuration.
project = create_or_get_unique_project(name="redact-pii-transform")
model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)
# Upload the training data. Train the model.
model.submit_cloud()
poll(model)

Generate redacted data and view results

# Use the model to transform (redact) the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")
record_handler.submit_cloud()
poll(record_handler)
# Compare results. Here is our "before."
test_df = pd.read_csv("test.csv")
print("test.csv head, before redaction")
print(test_df.head())
# And here is our "after."
transformed = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print("test.csv head, after redaction")
transformed.head()
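Beyond eyeballing the two heads, a quick check can confirm that no sensitive value survived unchanged. The sketch below is a minimal illustration with a hypothetical helper, `unchanged_values`; it assumes the transformed rows come back in the same order as the input, and it uses one row from the sample dataset and sample output shown elsewhere on this page (email column omitted).

```python
def unchanged_values(before_rows, after_rows, columns):
    """Return (row_index, column, value) for each value identical before and after."""
    leaks = []
    for index, (before, after) in enumerate(zip(before_rows, after_rows)):
        for column in columns:
            if before[column] == after[column]:
                leaks.append((index, column, before[column]))
    return leaks


# One original row and its transformed counterpart.
before = [{"name": "Kimberli Goodman", "phone": "228-229-2479",
           "ssn": "108-08-9132", "user_id": "user_93952"}]
after = [{"name": "Samantha Sandoval", "phone": "986.089.1149",
          "ssn": "102-40-4854", "user_id": "XXXX_XXXXX"}]

print(unchanged_values(before, after, ["name", "phone", "ssn", "user_id"]))  # prints []
```

With the notebook's DataFrames, the same function can be fed row dictionaries produced by `DataFrame.to_dict("records")`.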

Overview

This tutorial will walk through the process of redacting PII using the Gretel CLI. If you'd like to follow along via video, see the Video Tutorial below.

Save sample dataset and configuration

Save your configuration to a local file named redact_pii.yaml. The policy below searches for sensitive PII values as defined by Experian (including a custom regex for user IDs), replacing them with fake values when possible, or redacting with a user-defined character.
schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
Save the sample dataset below to pii.csv.
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne[email protected],228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackso[email protected],611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,[email protected],799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,[email protected],985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers[email protected],545-861-4923,5108752255128478,180-65-6855,user_92359

Create a transformation model

First, create a project to host your transformation models and artifacts.
gretel projects create --display-name redact-pii --set-default
Next, train your transformation model on your dataset or one with an identical schema.
Currently, only plain text and CSV formats are supported by the Transform API. JSON support is coming soon.
gretel models create --config redact_pii.yaml --in-data pii.csv --runner cloud > model-data.json
This command uses redact_pii.yaml as the --config and pii.csv as --in-data, and redirects the model details to model-data.json for use in the next step.

Redact sensitive data

Your model can now be used to redact sensitive data from any dataset with a similar structure or schema.
gretel records transform --model-id model-data.json --in-data pii.csv --runner cloud --output .

Examine the results

Transform results are downloaded to the current directory as a gzip-compressed CSV file named data.gz. Our policy replaces names, email addresses, phone numbers, credit card numbers, and SSNs with fake values, and redacts values matching the custom user ID regular expression with the character X.
Let's examine the transformed results from the command line.
zcat data.gz | column -s, -t
id  name               email              phone                   visa              ssn          user_id
1   Samantha Sandoval  [email protected]  986.089.1149            344661707423210   102-40-4854  XXXX_XXXXX
2   Shannon Holmes     [email protected]  (686)646-3171           3519277724227055  554-61-8106  XXXX_XXXXX
3   David Chapman      [email protected]  001-946-130-7514x76773  213182470523001   008-06-5773  XXXX_XXXXX
4   Crystal Russo      [email protected]  027-327-7306x07952      6011379376191328  628-27-4071  XXXX_XXXXX
5   John Allen         [email protected]  (365)502-6954           4047982390743587  740-42-9239  XXXX_XXXXX
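The user_id column shows the effect of redact_with_char: letters and digits become X while the underscore survives. A rough local imitation of that observed behavior is sketched below. It assumes only ASCII letters and digits are replaced, and the function `redact_with_char` here is a hypothetical stand-in, not Gretel's implementation.

```python
import re


def redact_with_char(value: str, char: str = "X") -> str:
    """Replace every ASCII letter and digit with `char`, keeping other characters."""
    # A function replacement avoids any special handling of `char` by re.sub.
    return re.sub(r"[A-Za-z0-9]", lambda _match: char, value)


print(redact_with_char("user_93952"))  # XXXX_XXXXX
```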

Next steps

For use cases such as training machine learning models on customer support logs, replacing PII with fake values is often preferable because it preserves the semantics of the original data. When that is not needed, plain redaction may be enough. Try updating the transformation policy to simply redact all sensitive values with an "*" character.
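One way to try this is sketched below in Python, under the assumption that the policy has already been loaded into a dictionary (for example with `yaml.safe_load`) and keeps the structure shown in this tutorial. The helper `redact_only` is hypothetical; it replaces each rule's transform list with a single redact_with_char step.

```python
import copy


def redact_only(config: dict, char: str = "*") -> dict:
    """Return a copy of the config whose rules only redact with `char`."""
    updated = copy.deepcopy(config)
    for model in updated["models"]:
        for policy in model["transforms"]["policies"]:
            for rule in policy["rules"]:
                rule["transforms"] = [
                    {"type": "redact_with_char", "attrs": {"char": char}}
                ]
    return updated


# A trimmed-down dictionary with the same shape as the tutorial's policy.
config = {
    "schema_version": "1.0",
    "name": "Redact PII",
    "models": [{
        "transforms": {
            "data_source": "_",
            "policies": [{
                "name": "remove_pii",
                "rules": [{
                    "name": "fake_or_redact_pii",
                    "conditions": {"value_label": ["person_name", "email_address"]},
                    "transforms": [
                        {"type": "fake"},
                        {"type": "redact_with_char", "attrs": {"char": "X"}},
                    ],
                }],
            }],
        }
    }],
}

new_config = redact_only(config)
print(new_config["models"][0]["transforms"]["policies"][0]["rules"][0]["transforms"])
```

If you edit the YAML file by hand instead, quote the character (char: "*"), since a bare * starts a YAML alias.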


Sample Dataset

id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,[email protected],228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,[email protected],611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,[email protected],799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,[email protected],985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,[email protected],545-861-4923,5108752255128478,180-65-6855,user_92359

Video Tutorial