Redact PII

Use Gretel Transforms to remove sensitive personally identifiable information (PII).

The Gretel Transform model redacts personally identifiable information (PII) from tabular data. In this example, we will remove PII from a Sample Dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can redact PII using Transform via the Gretel Console, CLI, or Python SDK. If you'd like to follow along via video, see the Video Tutorial below.

Tutorial

Redact PII

In this notebook, we will create a transform policy that identifies PII and either replaces it with fake values or redacts it. We will then use the SDK to transform a dataset and examine the results.

To run this notebook, you will need an API key from the Gretel Console.

Getting started

%%capture

!pip install pyyaml Faker pandas
!pip install -U gretel-client

# Specify your Gretel API key

import pandas as pd
from gretel_client import configure_session

pd.set_option("max_colwidth", None)

configure_session(api_key="prompt", cache="yes", validate=True)

Create configuration with transform policy

# Create our configuration with our Transforms Policies and Rules.
config = r"""schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""

Use Faker to make training and test datasets

from faker import Faker

# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")


fake_pii_csv("train.csv")
fake_pii_csv("test.csv")
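
A quick check of the generated files can catch schema problems before they are uploaded. The following is a minimal sketch that uses only the standard library; `summarize_csv` is a hypothetical helper, not part of Faker or the Gretel SDK.

```python
import csv
import os


def summarize_csv(path):
    """Return (header, data_row_count) for a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count


# Only inspect the files if they have already been generated.
for path in ("train.csv", "test.csv"):
    if os.path.exists(path):
        header, rows = summarize_csv(path)
        print(f"{path}: {rows} rows, columns={header}")
```

With the generator above, each file should report 100 rows and the seven columns written in the header line.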

Create model

import yaml

from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

# Create a project and model configuration.
project = create_or_get_unique_project(name="redact-pii-transform")

model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)

# Upload the training data.  Train the model.
model.submit_cloud()

poll(model)

Generate redacted data and view results

# Use the model to transform (redact) the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")

record_handler.submit_cloud()

poll(record_handler)

# Compare results.  Here is our "before."
test_df = pd.read_csv("test.csv")
print("test.csv head, before redaction")
print(test_df.head())

# And here is our "after."
transformed = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print("test.csv head, after redaction")
transformed.head()

Overview

This tutorial will walk through the process of redacting PII using the Gretel CLI.

Save sample dataset and configuration

Save your configuration to a local file named redact_pii.yaml. The policy below searches for sensitive PII values as defined by Experian (including a custom regex for user IDs), replacing them with fake values when possible, or redacting with a user-defined character.

schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
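
Before training, it can help to confirm that the custom pattern matches the identifiers you expect. The following is a minimal sketch using Python's standard `re` module. It only approximates how the pattern behaves, since the matching semantics of the label predictor (for example, whole-field versus substring matching) are not described here. The helper name `matches_user_id` is hypothetical.

```python
import re

# Same pattern as the `user_id` label predictor in the configuration above.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def matches_user_id(value: str) -> bool:
    """Return True if the whole value is `user_` followed by five digits.

    Assumption: whole-string matching is used for illustration only;
    the service may apply the pattern differently.
    """
    return USER_ID_PATTERN.fullmatch(value) is not None


for sample in ["user_93952", "user_1234", "admin_93952"]:
    print(sample, matches_user_id(sample))
```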

Save the sample dataset below to a local file named pii.csv.

id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne0@house.gov,228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackson@wired.com,611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,sbartkiewicz2@ycombinator.com,799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,mparnell3@vinaora.com,985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers4@naver.com,545-861-4923,5108752255128478,180-65-6855,user_92359

Create a transformation model

First, create a project to host your transformation models and artifacts.

gretel projects create --display-name redact-pii --set-default

Next, train your transformation model on your dataset or one with an identical schema.

Currently, only plain text and CSV formats are supported by the Transform API. JSON support is coming soon.

gretel models create --config redact_pii.yaml --in-data pii.csv --runner cloud > model-data.json

This command uses redact_pii.yaml as the --config and pii.csv as the --in-data, and writes the model details to model-data.json.

Redact sensitive data

Your model can now be used to redact sensitive data from any dataset with a similar structure or schema.

gretel records transform --model-id model-data.json --in-data pii.csv --runner cloud --output .

Examine the results

Transform results are downloaded to the local directory as a gzip-compressed CSV file named data.gz. Our policy replaces names, email addresses, phone numbers, credit card numbers, and SSNs with fake values, and redacts values matching the custom user ID regular expression with the character X.

Let's examine the transformed results from the command line.

zcat data.gz | column -s, -t

id  name               email                            phone                   visa              ssn          user_id
1   Samantha Sandoval  projas@hotmail.com               986.089.1149            344661707423210   102-40-4854  XXXX_XXXXX
2   Shannon Holmes     robertprice@mckinney-thomas.com  (686)646-3171           3519277724227055  554-61-8106  XXXX_XXXXX
3   David Chapman      katherinegillespie@hensley.com   001-946-130-7514x76773  213182470523001   008-06-5773  XXXX_XXXXX
4   Crystal Russo      mfischer@yahoo.com               027-327-7306x07952      6011379376191328  628-27-4071  XXXX_XXXXX
5   John Allen         evanbrown@yahoo.com              (365)502-6954           4047982390743587  740-42-9239  XXXX_XXXXX
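
To confirm that no original value survived the transform, the input and output can be compared cell by cell. The sketch below assumes that data.gz is a gzip-compressed CSV with the same columns and row order as the input, as the example output above suggests. `read_rows` and `unchanged_cells` are hypothetical helper names.

```python
import csv
import gzip
import os


def read_rows(path):
    """Read a CSV file (gzip-compressed if the name ends in .gz) as dicts."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as f:
        return list(csv.DictReader(f))


def unchanged_cells(before, after, columns):
    """Return (row_index, column) pairs whose value is identical in both."""
    return [
        (i, col)
        for i, (b, a) in enumerate(zip(before, after))
        for col in columns
        if b[col] == a[col]
    ]


# Columns the policy is expected to change; `id` is left out on purpose.
PII_COLUMNS = ["name", "email", "phone", "visa", "ssn", "user_id"]

# Only run the comparison once both files exist locally.
if os.path.exists("pii.csv") and os.path.exists("data.gz"):
    leaks = unchanged_cells(read_rows("pii.csv"), read_rows("data.gz"), PII_COLUMNS)
    print("unchanged PII cells:", leaks)
```

An empty list means every PII cell changed. A non-empty list is worth a manual look, although a fake value can occasionally coincide with the original.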

Next steps

For use cases such as training machine learning models on customer support logs, it is often desirable to replace PII with fake values so that the data keeps its original semantics. In other cases, removing the values outright is preferable. Try updating the transformation policy to simply redact all sensitive values with an "*" character.
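
One way to approach the redact-only exercise is to edit the parsed configuration before creating the model. The sketch below works on a plain Python dictionary shaped like the policy in this tutorial (that is, the result of parsing the YAML). `redact_only` is a hypothetical helper, and it assumes `redact_with_char` accepts "*" in `attrs.char` the same way it accepts X in the example policy.

```python
import copy


def redact_only(config, char="*"):
    """Return a copy of the config whose rules only redact with `char`."""
    updated = copy.deepcopy(config)
    for model in updated["models"]:
        for policy in model["transforms"]["policies"]:
            for rule in policy["rules"]:
                # Drop the `fake` transform; keep a single character redaction.
                rule["transforms"] = [
                    {"type": "redact_with_char", "attrs": {"char": char}}
                ]
    return updated


# A trimmed-down version of the tutorial policy, written as a dict.
example = {
    "models": [{"transforms": {"data_source": "_", "policies": [{
        "name": "remove_pii",
        "rules": [{
            "name": "fake_or_redact_pii",
            "conditions": {"value_label": ["person_name", "email_address"]},
            "transforms": [
                {"type": "fake"},
                {"type": "redact_with_char", "attrs": {"char": "X"}},
            ],
        }],
    }]}}],
}

rules = redact_only(example)["models"][0]["transforms"]["policies"][0]["rules"]
print(rules[0]["transforms"])
```

In the notebook, the dictionary returned by yaml.safe_load(config) could be passed through such a helper before it is handed to project.create_model_obj. In the CLI flow, the equivalent change is to delete the `- type: fake` entry from redact_pii.yaml and set `char` to "*".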

Sample Dataset

id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne0@house.gov,228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackson@wired.com,611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,sbartkiewicz2@ycombinator.com,799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,mparnell3@vinaora.com,985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers4@naver.com,545-861-4923,5108752255128478,180-65-6855,user_92359

Video Tutorial

