Discover PII

Use Gretel Classify to identify personally identifiable information (PII) in a dataset.

The Gretel Classify model identifies personally identifiable information (PII) in tabular data. Classify can detect 40+ Supported Entities, including names, addresses, and credentials. In this example, we identify PII in a Sample Dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can run Classify from the Gretel Console, the CLI, or the Python SDK. If you'd like to follow along via video, see the Video Tutorial below.

Tutorial

Discover PII

In this notebook, we will create a classification policy that identifies PII along with values matching a custom regular expression. We will then use the SDK to classify data and examine the results.

To run this notebook, you will need an API key from the Gretel Console.

Getting Started

%%capture
!pip install pyyaml Faker pandas
!pip install -U gretel-client

# Specify your Gretel API key.
import pandas as pd
from gretel_client import configure_session

pd.set_option("max_colwidth", None)

configure_session(api_key="prompt", cache="yes", validate=True)

Create configuration with classify policy

# Create configuration with our Classify Policies and Rules.
config = """# Policy to search for "sensitive PII" as defined by
# https://www.experian.com/blogs/ask-experian/what-is-personally-identifiable-information/

schema_version: "1.0"
name: "discover-pii-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
        - custom/*

label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""
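
Before training, it can help to check the custom pattern locally. The following is a minimal sketch using only Python's standard `re` module; the helper name `find_user_ids` and the sample strings are illustrative assumptions. How Classify applies the pattern internally (substring search versus whole-value match) is not stated here, so the sketch assumes a substring search.

```python
import re

# Same pattern as the `user_id` rule in the config above.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def find_user_ids(text):
    """Return every substring of `text` that matches the custom pattern.

    Hypothetical helper for local testing; not part of the Gretel SDK.
    """
    return USER_ID_PATTERN.findall(text)


# Illustrative inputs: an exact ID, one with too few digits,
# an ID embedded in other text, and one with extra trailing digits.
for sample in ["user_93952", "user_1234", "id=user_23539;", "user_1234567"]:
    print(f"{sample!r:>16} -> {find_user_ids(sample)}")
```

Because the pattern has no anchors, a substring search also matches the first five digits of a longer run such as `user_1234567`. If only exact five-digit IDs should match, a word boundary after the digits (for example `user_[\d]{5}\b`) is one option to consider.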

Use Faker to make training and test datasets

from faker import Faker

# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")


fake_pii_csv("train.csv")
fake_pii_csv("test.csv")
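
The function above builds each row with an f-string, which works as long as no generated value contains a comma or a quote character. As a hedged alternative, the sketch below writes rows with the standard library's `csv.writer`, which quotes such values automatically. The helper name `write_rows` and the sample row are hypothetical and use an in-memory buffer so that no Faker installation is needed.

```python
import csv
import io

HEADER = ["id", "name", "email", "phone", "visa", "ssn", "user_id"]


def write_rows(handle, rows):
    """Write the header plus rows; csv.writer quotes values containing commas."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Hypothetical row whose name contains a comma.
rows = [
    [0, "Goodman, Kimberli", "kgoodman@example.com", "228-229-2479",
     "5108758325678962", "108-08-9132", "user_93952"],
]

buffer = io.StringIO()
write_rows(buffer, rows)

# Read the data back to confirm the row still has seven columns.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1][1], len(parsed[1]))
```

When writing to a real file instead of a buffer, open it with `newline=""` (for example `open(filename, "w", newline="")`), as the `csv` module documentation recommends.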

Create model

Now, we will train our Classify model on the training data.

import yaml

from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

# Create a project and model configuration.
project = create_or_get_unique_project(name="label-pii-classify")

model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)

# Upload the training data and train the model.
model.submit_cloud()

poll(model)

Classify test data using trained model

Finally, we will use the trained model to classify the test dataset.

# Now we can use our model to classify the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")

record_handler.submit_cloud()

poll(record_handler)

# Let's inspect the results.
classified = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
classified.head()

Sample Dataset

id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne0@house.gov,228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackson@wired.com,611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,sbartkiewicz2@ycombinator.com,799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,mparnell3@vinaora.com,985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers4@naver.com,545-861-4923,5108752255128478,180-65-6855,user_92359

Video Tutorial
