Redact PII
Use Gretel Transforms to remove sensitive personally identifiable information (PII).
The Gretel Transform model redacts personally identifiable information (PII) from tabular data. In this example, we will remove PII from a Sample Dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can redact PII using Transform via the Gretel Console, CLI, or Python SDK. If you'd like to follow along via video, see the Video Tutorial below.
Tutorial
Redact PII
In this notebook, we will create a transform policy to identify and redact or replace PII with fake values. We will then use the SDK to transform a dataset and examine the results.
To run this notebook, you will need an API key from the Gretel Console.
Getting started
%%capture
!pip install pyyaml Faker pandas
!pip install -U gretel-client
# Specify your Gretel API key
import pandas as pd
from gretel_client import configure_session
pd.set_option("max_colwidth", None)
configure_session(api_key="prompt", cache="yes", validate=True)
Create configuration with transform policy
# Create our configuration with our Transforms Policies and Rules.
config = """schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""
Use Faker to make training and test datasets
from faker import Faker
# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")
fake_pii_csv("train.csv")
fake_pii_csv("test.csv")
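The user_id column written above has the shape user_ followed by five digits, which is what the custom user_id regex in the policy is meant to catch. The snippet below is a minimal sketch that tries the same pattern locally with Python's re module. The helper name find_user_ids is hypothetical, and how Gretel applies the pattern internally (substring search versus full match) is an assumption here.

```python
import re

# Same pattern as the `user_id` entry under `label_predictors` in the config.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def find_user_ids(text: str) -> list[str]:
    """Return every substring of `text` that the custom pattern matches."""
    # Assumption: substring search; Gretel's own matching may differ.
    return USER_ID_PATTERN.findall(text)


for sample in ["user_93952", "user_1234", "id=user_23539;"]:
    print(sample, "->", find_user_ids(sample))
```

A value with only four digits, such as user_1234, produces no match, so it would not receive the custom label under this pattern.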
Create model
import yaml
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll
# Create a project and model configuration.
project = create_or_get_unique_project(name="redact-pii-transform")
model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)
# Upload the training data. Train the model.
model.submit_cloud()
poll(model)
Generate redacted data and view results
# Use the model to transform (redact) the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")
record_handler.submit_cloud()
poll(record_handler)
# Compare results. Here is our "before."
test_df = pd.read_csv("test.csv")
print("test.csv head, before redaction")
print(test_df.head())
# And here is our "after."
transformed = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print("test.csv head, after redaction")
transformed.head()
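Beyond eyeballing the two heads, one way to sanity-check the result is to confirm that no original value survives in the same column after the transform. The helper below is a minimal sketch using only the standard library. The name leaked_values and the two small inline CSV strings are hypothetical stand-ins for test.csv and the transformed output, with one name deliberately left unchanged to show a hit.

```python
import csv
import io


def leaked_values(before_csv: str, after_csv: str, columns: list[str]) -> list[tuple[str, str]]:
    """Return (column, value) pairs from the input that still appear
    in the same column of the transformed output."""
    before = list(csv.DictReader(io.StringIO(before_csv)))
    after = list(csv.DictReader(io.StringIO(after_csv)))
    leaks = []
    for col in columns:
        remaining = {row[col] for row in after}
        leaks.extend((col, row[col]) for row in before if row[col] in remaining)
    return leaks


# Hypothetical data: the second name was (deliberately) not replaced.
before = "id,name,user_id\n1,Anna Jackson,user_23539\n2,Matt Parnell,user_23529\n"
after = "id,name,user_id\n1,Shannon Holmes,XXXX_XXXXX\n2,Matt Parnell,XXXX_XXXXX\n"

print(leaked_values(before, after, ["name", "user_id"]))
```

A fake value can occasionally coincide with a real one, so a non-empty result is a prompt to look closer rather than proof of a missed redaction.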
Overview
This tutorial will walk through the process of redacting PII using the Gretel CLI.
Save sample dataset and configuration
Save your configuration to a local file named redact_pii.yaml. The policy below searches for sensitive PII values as defined by Experian (including a custom regex for user IDs), replacing them with fake values when possible, or redacting with a user-defined character.
schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
Save the sample dataset below to pii.csv
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne0@house.gov,228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackson@wired.com,611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,sbartkiewicz2@ycombinator.com,799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,mparnell3@vinaora.com,985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers4@naver.com,545-861-4923,5108752255128478,180-65-6855,user_92359
Create a transformation model
First, create a project to host your transformation models and artifacts.
gretel projects create --display-name redact-pii --set-default
Next, train your transformation model on your dataset or one with an identical schema.
Currently, only plain text and CSV formats are supported by the Transform API. JSON support is coming soon.
gretel models create --config redact_pii.yaml --in-data pii.csv --runner cloud > model-data.json
You will use redact_pii.yaml as your --config and pii.csv as --in-data.
Redact sensitive data
Your model can now be used to redact sensitive data from any dataset with a similar structure or schema.
gretel records transform --model-id model-data.json --in-data pii.csv --runner cloud --output .
Examine the results
Transform results are downloaded to the local directory in CSV format, in the file data.gz. Our policy is set to replace names, email addresses, phone numbers, credit card numbers, and SSNs with fake values, and to redact values matching the user ID regular expression with a character replacement.
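To make the character replacement concrete, here is a minimal local sketch of the behavior. It is not Gretel's implementation: the rule that letters and digits are replaced while other characters (such as the underscore) are kept is an assumption inferred from the redacted user IDs in this tutorial's output, and the function below only borrows the transform's name for readability.

```python
def redact_with_char(value: str, char: str = "X") -> str:
    """Replace every letter or digit with `char`, keeping other characters.

    Assumption: punctuation such as "_" is preserved, as the XXXX_XXXXX
    values in the tutorial output suggest.
    """
    return "".join(char if c.isalnum() else c for c in value)


print(redact_with_char("user_93952"))
```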
Let's examine the transformed results from the command line.
zcat data.gz | column -s, -t
id  name               email                            phone                   visa              ssn          user_id
1   Samantha Sandoval  projas@hotmail.com               986.089.1149            344661707423210   102-40-4854  XXXX_XXXXX
2   Shannon Holmes     robertprice@mckinney-thomas.com  (686)646-3171           3519277724227055  554-61-8106  XXXX_XXXXX
3   David Chapman      katherinegillespie@hensley.com   001-946-130-7514x76773  213182470523001   008-06-5773  XXXX_XXXXX
4   Crystal Russo      mfischer@yahoo.com               027-327-7306x07952      6011379376191328  628-27-4071  XXXX_XXXXX
5   John Allen         evanbrown@yahoo.com              (365)502-6954           4047982390743587  740-42-9239  XXXX_XXXXX
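If you prefer to inspect the results from Python rather than the shell, the gzip and csv standard-library modules can read the same file. The sketch below is self-contained: the helper name read_gzipped_csv is hypothetical, and the snippet writes a one-row stand-in file to a temporary directory so that it runs without a real data.gz.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_csv(path: str) -> list[dict]:
    """Load a gzip-compressed CSV file into a list of row dictionaries."""
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.DictReader(f))


# Stand-in for the downloaded data.gz, using one row from the output above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.gz")
    with gzip.open(path, "wt", newline="") as f:
        f.write("id,name,user_id\n1,Samantha Sandoval,XXXX_XXXXX\n")
    rows = read_gzipped_csv(path)

print(rows[0]["name"], rows[0]["user_id"])
```

With a real run, point read_gzipped_csv at the data.gz file in the output directory instead of the temporary file.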
Next steps
For use cases such as training machine learning models on customer support logs, it is often desirable to replace PII with fake values to maintain semantics in the original data. However, this is not always desirable. Try updating the transformation policy to simply redact all sensitive values with an "*" character.
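As a starting point for that exercise, the sketch below edits the policy in Python instead of by hand. It assumes the configuration has already been loaded into a dictionary (for example with yaml.safe_load, as in the notebook) and uses a trimmed-down copy of this tutorial's policy. The helper name redact_only is hypothetical; the redact_with_char transform and its char attribute come from the configuration shown earlier.

```python
import copy


def redact_only(config: dict, char: str = "*") -> dict:
    """Return a copy of `config` in which every rule only redacts with `char`."""
    updated = copy.deepcopy(config)
    for model in updated["models"]:
        for policy in model["transforms"]["policies"]:
            for rule in policy["rules"]:
                # Drop the `fake` transform so matched values are always redacted.
                rule["transforms"] = [
                    {"type": "redact_with_char", "attrs": {"char": char}}
                ]
    return updated


# Trimmed-down version of the tutorial policy, written as a dict.
config = {
    "schema_version": "1.0",
    "name": "Redact PII",
    "models": [
        {
            "transforms": {
                "data_source": "_",
                "policies": [
                    {
                        "name": "remove_pii",
                        "rules": [
                            {
                                "name": "fake_or_redact_pii",
                                "conditions": {
                                    "value_label": ["person_name", "email_address"]
                                },
                                "transforms": [
                                    {"type": "fake"},
                                    {"type": "redact_with_char", "attrs": {"char": "X"}},
                                ],
                            }
                        ],
                    }
                ],
            }
        }
    ],
}

star_config = redact_only(config)
print(star_config["models"][0]["transforms"]["policies"][0]["rules"][0]["transforms"])
```

The original dictionary is left untouched, so both variants can be compared side by side before training a new model.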
Sample Dataset
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne0@house.gov,228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackson@wired.com,611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,sbartkiewicz2@ycombinator.com,799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,mparnell3@vinaora.com,985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers4@naver.com,545-861-4923,5108752255128478,180-65-6855,user_92359
Video Tutorial