Redact PII

Use Gretel Transforms to remove sensitive personally identifiable information (PII).

The Gretel Transform model redacts personally identifiable information (PII) from tabular data. In this example, we remove PII from a Sample Dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can redact PII with Transform through the Gretel Console, the CLI, or the Python SDK. If you'd like to follow along via video, see the Video Tutorial below.

Tutorial

Redact PII

In this notebook, we will create a transform policy to identify and redact or replace PII with fake values. We will then use the SDK to transform a dataset and examine the results.

To run this notebook, you will need an API key from the Gretel Console.

Getting started

%%capture

!pip install pyyaml Faker pandas
!pip install -U gretel-client

import pandas as pd
from gretel_client import configure_session

# Show full column contents when printing DataFrames.
pd.set_option("display.max_colwidth", None)

# Specify your Gretel API key when prompted.
configure_session(api_key="prompt", cache="yes", validate=True)

Create configuration with transform policy

# Create our configuration with our transform policies and rules.
# A raw string keeps the backslash in the regex below intact.
config = r"""schema_version: "1.0"
name: "Redact PII"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: remove_pii
          rules:
            - name: fake_or_redact_pii
              conditions:
                value_label:
                  - person_name
                  - credit_card_number
                  - phone_number
                  - us_social_security_number
                  - email_address
                  - custom/*
              transforms:
                - type: fake
                - type: redact_with_char
                  attrs:
                    char: X
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""
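Before submitting the configuration, it can help to check the custom `user_id` pattern locally. The following is a minimal sketch using only Python's `re` module. The sample values are made up for illustration, and whether Gretel applies the pattern as a full match or as a substring search is not stated here, so the sketch assumes a substring search.

```python
import re

# Same pattern as the `user_id` label predictor in the config above.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")

# Hypothetical sample values, for illustration only.
samples = ["user_93952", "user_1234", "customer_93952", "user_abcde"]

for value in samples:
    # Assumption: a substring search, not a full match.
    matched = USER_ID_PATTERN.search(value) is not None
    print(f"{value!r}: {'match' if matched else 'no match'}")
```

Only values containing `user_` followed by five digits should be reported as a match, which is the format the Faker helper in the next section generates.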

Use Faker to make training and test datasets

from faker import Faker

# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")


fake_pii_csv("train.csv")
fake_pii_csv("test.csv")
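The helper above builds each CSV line with an f-string, which works as long as no generated value contains a comma or a quote character. A more defensive variant, sketched below, uses the standard library's `csv.writer` so that such values are quoted automatically. The function name `write_pii_rows` and the example row are hypothetical and not part of the original notebook.

```python
import csv
import io

HEADER = ["id", "name", "email", "phone", "visa", "ssn", "user_id"]


def write_pii_rows(f, rows):
    """Write the header and the given rows to an open text file,
    quoting any value that contains a comma or quote."""
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Hypothetical row whose name contains a comma, for illustration only.
buffer = io.StringIO()
write_pii_rows(
    buffer,
    [(0, "Doe, Jane", "jane@example.com", "555-0100",
      "4111111111111111", "000-00-0000", "user_12345")],
)
print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO` buffer, opening it with `open(filename, "w", newline="")` follows the `csv` module's recommendation.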

Create model

import yaml

from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

# Create a project and model configuration.
project = create_or_get_unique_project(name="redact-pii-transform")

model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)

# Upload the training data.  Train the model.
model.submit_cloud()

poll(model)

Generate redacted data and view results

# Use the trained model to redact PII in the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")

record_handler.submit_cloud()

poll(record_handler)

# Compare results.  Here is our "before."
test_df = pd.read_csv("test.csv")
print("test.csv head, before redaction")
print(test_df.head())

# And here is our "after."
transformed = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print("test.csv head, after redaction")
transformed.head()
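Beyond eyeballing the first few rows, one way to check the result is to look for sensitive values that came through unchanged. The sketch below is an assumption-laden helper, not part of the Gretel SDK: the name `find_unchanged_values` is hypothetical, it works on plain lists of dictionaries, and it assumes the transformed output keeps the same row order as the input.

```python
def find_unchanged_values(before_rows, after_rows, columns):
    """Return (row_index, column, value) for every value that is identical
    before and after the transform. Assumes both lists share the same row order."""
    unchanged = []
    for i, (before, after) in enumerate(zip(before_rows, after_rows)):
        for col in columns:
            # Compare as strings so that numeric and text cells are treated alike.
            if str(before[col]) == str(after[col]):
                unchanged.append((i, col, before[col]))
    return unchanged


# Hypothetical before/after records, for illustration only.
before = [{"name": "Anna Jackson", "ssn": "256-28-0041"}]
after = [{"name": "Maria Lopez", "ssn": "256-28-0041"}]
print(find_unchanged_values(before, after, ["name", "ssn"]))
```

With the DataFrames from this notebook, the two record lists could be obtained with `pd.read_csv("test.csv").to_dict("records")` and `transformed.to_dict("records")`. An empty result suggests every checked value was altered; a fake value can occasionally coincide with the original, so any hits are worth a manual look rather than being treated as definite leaks.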

Sample Dataset

id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne0@house.gov,228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackson@wired.com,611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,sbartkiewicz2@ycombinator.com,799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,mparnell3@vinaora.com,985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers4@naver.com,545-861-4923,5108752255128478,180-65-6855,user_92359

Video Tutorial
