Redact PII
Use Gretel Transforms to remove sensitive personally identifiable information (PII).
The Gretel Transform model redacts personally identifiable information (PII) from tabular data. In this example, we will remove PII from a sample dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can redact PII with Transform via the Gretel Console, CLI, or Python SDK.
SDK
CLI
In this notebook, we will create a transform policy that identifies PII and either replaces it with fake values or redacts it. We will then use the SDK to transform a dataset and examine the results.
%%capture
!pip install pyyaml Faker pandas
!pip install -U gretel-client
import pandas as pd
from gretel_client import configure_session
pd.set_option("max_colwidth", None)
# Specify your Gretel API key (you will be prompted for it).
configure_session(api_key="prompt", cache="yes", validate=True)
# Create our configuration with our Transforms Policies and Rules.
config = """schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""
from faker import Faker
# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")
fake_pii_csv("train.csv")
fake_pii_csv("test.csv")
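The generator above builds each row with plain string formatting, which works as long as no generated value contains a comma. As a more defensive variant, the sketch below writes the same columns with Python's `csv` module, which quotes values when needed. The `write_rows` helper and the sample row are illustrative assumptions and are not part of the original notebook.

```python
import csv
import io

HEADER = ["id", "name", "email", "phone", "visa", "ssn", "user_id"]


def write_rows(handle, rows):
    """Write the header and rows as CSV, quoting fields that contain commas."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Made-up row whose name contains a comma (hypothetical data).
sample_row = [0, "Doe, Jane", "jane@example.com", "555-0100",
              "4111111111111111", "000-00-0000", "user_00001"]

buffer = io.StringIO()
write_rows(buffer, [sample_row])
print(buffer.getvalue())

# Reading the text back still yields seven fields for the data row.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(parsed[1]))
```

To use this in the notebook, you could pass an open file handle instead of the in-memory buffer and build each row from the Faker calls shown above.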
import yaml
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll
# Create a project and model configuration.
project = create_or_get_unique_project(name="redact-pii-transform")
model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)
# Upload the training data. Train the model.
model.submit_cloud()
poll(model)
# Use the trained model to transform the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")
record_handler.submit_cloud()
poll(record_handler)
# Compare results. Here is our "before."
test_df = pd.read_csv("test.csv")
print("test.csv head, before redaction")
print(test_df.head())
# And here is our "after."
transformed = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print("test.csv head, after redaction")
transformed.head()
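Beyond comparing the two heads by eye, you may want an automated check that no original value survives in the transformed output. The following is a minimal sketch using only the standard library and small inline samples; with the notebook's DataFrames the same per-column set comparison could be applied. The function name `leaked_values` is an illustrative assumption, not part of the Gretel SDK.

```python
import csv
import io


def leaked_values(before_csv, after_csv, columns):
    """Return {column: values that appear in both the original and transformed data}."""
    before = list(csv.DictReader(io.StringIO(before_csv)))
    after = list(csv.DictReader(io.StringIO(after_csv)))
    leaks = {}
    for col in columns:
        shared = {row[col] for row in before} & {row[col] for row in after}
        if shared:
            leaks[col] = shared
    return leaks


# Small inline samples standing in for the original and transformed files.
before = "id,name,ssn,user_id\n1,Kimberli Goodman,108-08-9132,user_93952\n"
after = "id,name,ssn,user_id\n1,Samantha Sandoval,102-40-4854,XXXX_XXXXX\n"

print(leaked_values(before, after, ["name", "ssn", "user_id"]))
```

An overlap is not automatically a failure: a fake value can occasionally coincide with an original one, and columns such as `id` are expected to pass through unchanged.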
This tutorial will walk through the process of redacting PII using the Gretel CLI. If you'd like to follow along via video, see the Video Tutorial below.
Save your configuration to a local file named redact_pii.yaml. The policy below searches for sensitive PII values as defined by Experian (including a custom regex for user IDs), replacing them with fake values when possible, or redacting with a user-defined character.
schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
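The custom `user_id` predictor depends on a regular expression, so it can be worth sanity-checking the pattern locally before training. Once YAML unescapes the double-quoted string, the pattern is `user_[\d]{5}`. The sketch below tests it with Python's `re` module; the helper name `matches_user_id` and the sample strings are illustrative, and Gretel's own matching (for example, whether it requires the whole field to match or searches within it) may differ.

```python
import re

# Same pattern as the `user_id` label predictor in the policy above.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def matches_user_id(value):
    """Return True if the whole value looks like a user ID (hypothetical helper)."""
    return USER_ID_PATTERN.fullmatch(value) is not None


for sample in ["user_93952", "user_123", "admin_93952"]:
    print(sample, matches_user_id(sample))
```

Using `fullmatch` here is a deliberate, stricter choice for the local check: it rejects values with extra characters around the ID.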
Save the sample dataset below to pii.csv.
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,[email protected],228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,[email protected],611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,[email protected],799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,[email protected],985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,[email protected],545-861-4923,5108752255128478,180-65-6855,user_92359
First, create a project to host your transformation models and artifacts.
gretel projects create --display-name redact-pii --set-default
Next, train your transformation model on your dataset or one with an identical schema.
Currently, only plain text and CSV formats are supported by the Transform API. JSON support is coming soon.
gretel models create --config redact_pii.yaml --in-data pii.csv --runner cloud > model-data.json
You will use redact_pii.yaml as your --config and pii.csv as --in-data.
Your model can now be used to redact sensitive data from any dataset with a similar structure or schema.
gretel records transform --model-id model-data.json --in-data pii.csv --runner cloud --output .
Transform results are downloaded to the current directory as a gzip-compressed CSV file named data.gz. Our policy is set to replace names, email addresses, phone numbers, credit card numbers, and SSNs with fake values, and to redact values matching the user ID regular expression with the character X. Let's examine the transformed results from the command line.
zcat data.gz | column -s, -t
id  name               email              phone                   visa              ssn          user_id
1   Samantha Sandoval  [email protected]  986.089.1149            344661707423210   102-40-4854  XXXX_XXXXX
2   Shannon Holmes     [email protected]  (686)646-3171           3519277724227055  554-61-8106  XXXX_XXXXX
3   David Chapman      [email protected]  001-946-130-7514x76773  213182470523001   008-06-5773  XXXX_XXXXX
4   Crystal Russo      [email protected]  027-327-7306x07952      6011379376191328  628-27-4071  XXXX_XXXXX
5   John Allen         [email protected]  (365)502-6954           4047982390743587  740-42-9239  XXXX_XXXXX
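If `zcat` or `column` is not available on your system, the same inspection can be done in Python. This sketch assumes the output is a gzip-compressed CSV named `data.gz` in the current directory, as described above; `read_gzipped_csv` is an illustrative helper name rather than part of the Gretel tooling.

```python
import csv
import gzip
import os


def read_gzipped_csv(path):
    """Read a gzip-compressed CSV file into a list of dictionaries."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


# Only print rows if the transform output is actually present.
if os.path.exists("data.gz"):
    for row in read_gzipped_csv("data.gz"):
        print(row)
```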
For use cases such as training machine learning models on customer support logs, it is often desirable to replace PII with fake values to maintain semantics in the original data. However, this is not always desirable. Try updating the transformation policy to simply redact all sensitive values with an "*" character.
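One way to approach this exercise, sketched here as a Python string in the style of the SDK notebook's `config`, is to drop the `fake` transform so that `redact_with_char` is the only transform, and to set its `char` to `*`. This is an untested variation on the policy above, not a configuration taken from Gretel's documentation. The `*` is quoted because an unquoted `*` introduces an alias in YAML.

```python
# Assumed variation of the policy above: redact every matched value with "*"
# instead of attempting to fake it first. Not verified against the Gretel API.
redact_only_config = r"""schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: redact_with_char
                  attrs:
                    char: "*"
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""

print(redact_only_config)
```

The same YAML body can be saved to redact_pii.yaml for the CLI workflow.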