
Discover PII

Use Gretel Classify to identify personally identifiable information (PII) in a dataset.
The Gretel Classify model identifies personally identifiable information (PII) in tabular data. Classify can detect 40+ Supported Entities, including names, addresses, credentials, and more. In this example, we will identify PII in a sample dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can discover PII in your data using Classify via the Gretel Console, CLI, or Python SDK.

Tutorial

SDK
CLI

Discover PII

In this notebook, we will create a classification policy that identifies PII using built-in labels and a custom regular expression. We will then use the SDK to classify data and examine the results.
To run this notebook, you will need an API key from the Gretel Console.

Getting Started

%%capture
!pip install pyyaml Faker pandas
!pip install -U gretel-client
import pandas as pd
from gretel_client import configure_session
pd.set_option("max_colwidth", None)
# Specify your Gretel API key (api_key="prompt" asks for it interactively).
configure_session(api_key="prompt", cache="yes", validate=True)

Create configuration with classify policy

# Create configuration with our Classify Policies and Rules.
config = """# Policy to search for "sensitive PII" as defined by
# https://www.experian.com/blogs/ask-experian/what-is-personally-identifiable-information/
schema_version: "1.0"
name: "discover-pii-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
        - custom/*
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""

Use Faker to make training and test datasets

from faker import Faker
# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")
fake_pii_csv("train.csv")
fake_pii_csv("test.csv")

Create model

Now, we will train our Classify model on the training data.
import yaml
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll
# Create a project and model configuration.
project = create_or_get_unique_project(name="label-pii-classify")
model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)
# Upload the training data. Train the model.
model.submit_cloud()
poll(model)

Classify test data using trained model

Finally, we will classify our test dataset using the model we trained on the training data.
# Now we can use our model to classify the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")
record_handler.submit_cloud()
poll(record_handler)
# Let's inspect the results.
classified = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
classified.head()

Overview

This tutorial will walk through the process of identifying PII using the Gretel CLI. If you'd like to follow along via video, see the Video Tutorial below.

Save sample dataset and configuration

To start, create and save your sample dataset and Classify configuration by copying the code below and saving it to local files.

Classify configuration

Save your configuration to a local file named classify_config.yaml. The policy below searches for sensitive PII values as defined by Experian (including a custom regex for user IDs), and identifies them using Supported Entities labels (person_name, credit_card_number, etc.).
# Policy to search for "sensitive PII" as defined by
# https://www.experian.com/blogs/ask-experian/what-is-personally-identifiable-information/
schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
        - custom/*
label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
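Before creating a model, it can help to confirm that the custom user_id pattern matches the values you expect. The sketch below uses Python's standard re module as a stand-in. It assumes that Classify's regex dialect behaves like Python's for this simple pattern, and the sample values are made up for illustration. Note that the double-quoted YAML string "user_[\\d]{5}" resolves to the pattern user_[\d]{5}.

```python
import re

# Same pattern as the custom user_id predictor in the configuration above.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")

# Hypothetical sample values, not taken from any real dataset.
samples = ["user_93952", "user_123", "id: user_00042", "admin_12345"]

for value in samples:
    match = USER_ID_PATTERN.search(value)
    # Print the matched substring, or None when the pattern does not match.
    print(f"{value!r}: {match.group(0) if match else None}")
```

In Python, this unanchored pattern also matches inside longer strings (for example, the first five digits of user_123456). If that matters for your data, consider adding anchors or word boundaries and re-checking.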

Sample dataset

Save the sample dataset below to a local file named pii.csv.
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,[email protected],228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,[email protected],611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,[email protected],799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,[email protected],985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,[email protected],545-861-4923,5108752255128478,180-65-6855,user_92359

Create a classification model

First, create a project to host your classification models and artifacts.
gretel projects create --display-name classify-example --set-default
Next, create the model, providing sample data so that the model can learn from the structure of the dataset's headers.
Currently, only plain text and CSV formats are supported by the Classify API. JSON support is coming soon.
gretel models create --config classify_config.yaml --in-data pii.csv --runner cloud > model-data.json
You will use the classify_config.yaml you created as your --config input and pii.csv as --in-data.
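Because only plain text and CSV are currently supported, data held as JSON needs to be converted before it can be passed as --in-data. The following is a minimal sketch using only the Python standard library. The helper name json_records_to_csv, the output file name, and the example records are all hypothetical, and the sketch assumes the JSON is an array of flat (non-nested) objects.

```python
import csv
import json


def json_records_to_csv(json_text, csv_path):
    """Write a JSON array of flat objects to a CSV file (hypothetical helper)."""
    records = json.loads(json_text)
    # Collect column names in first-seen order across all records.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        # Keys missing from a record are written as empty cells.
        writer.writerows(records)
    return fieldnames


# Made-up records for illustration only.
example = '[{"id": 1, "user_id": "user_00001"}, {"id": 2, "user_id": "user_00002"}]'
columns = json_records_to_csv(example, "from_json.csv")
print(columns)
```

The resulting file could then be supplied in place of pii.csv in the command above.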

Classify your data

Your model can now be used to classify any datasets with similar structure or schema.
gretel records classify --model-id model-data.json --in-data pii.csv --runner cloud --output .

Examine the results

Classify results are downloaded to the local directory as line-delimited JSON (JSONL) in the file data.gz. Each discovered entity is reported with its start and end character offsets within the field where it was found.
field is the name of the column in the input data, and label is the entity type that the value was classified as.
zcat data.gz | head -n 1 | python -m json.tool
{
    "index": 0,
    "entities": [
        {
            "start": 0,
            "end": 21,
            "label": "email_address",
            "score": 0.8,
            "field": "email"
        },
        {
            "start": 0,
            "end": 12,
            "label": "phone_number",
            "score": 0.8,
            "field": "phone"
        },
        {
            "start": 0,
            "end": 11,
            "label": "us_social_security_number",
            "score": 0.8,
            "field": "ssn"
        },
        {
            "start": 0,
            "end": 8,
            "label": "person_name",
            "score": 0.8,
            "field": "first_name"
        },
        {
            "start": 0,
            "end": 7,
            "label": "person_name",
            "score": 0.8,
            "field": "last_name"
        },
        {
            "start": 0,
            "end": 10,
            "label": "custom/user_id",
            "score": 0.8,
            "field": "user_id"
        }
    ]
}
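To see how the start and end offsets relate to the input, the sketch below slices the matched text out of each field. It assumes the offsets are end-exclusive character positions, which is consistent with the value lengths in the sample dataset (for example, an 11-character SSN with "end": 11). The record is abbreviated from the output above, and pairing it with the first row of the sample dataset is an assumption made for illustration. The helper name matched_values is hypothetical.

```python
import json

# One classified record, abbreviated from the output above.
record = json.loads("""
{"index": 0, "entities": [
  {"start": 0, "end": 12, "label": "phone_number", "score": 0.8, "field": "phone"},
  {"start": 0, "end": 11, "label": "us_social_security_number", "score": 0.8, "field": "ssn"},
  {"start": 0, "end": 10, "label": "custom/user_id", "score": 0.8, "field": "user_id"}
]}
""")

# Input values assumed to correspond to this record (first row of the sample dataset).
row = {"phone": "228-229-2479", "ssn": "108-08-9132", "user_id": "user_93952"}


def matched_values(record, row):
    """Map each field to (label, matched text), assuming end-exclusive offsets."""
    found = {}
    for entity in record["entities"]:
        text = str(row[entity["field"]])
        found[entity["field"]] = (entity["label"], text[entity["start"]:entity["end"]])
    return found


result = matched_values(record, row)
for field, (label, text) in result.items():
    print(f"{field}: {label} -> {text}")
```

This sketch keeps one entity per field. If a field can contain several entities, collect them in a list per field instead.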

Video Tutorial
