
Discover PII

In this tutorial, we will create a classification policy that identifies common PII types and adds a custom regular expression label. We will then use the CLI to classify data and examine the results.

Sample configuration

Save your configuration to a local file named classify_config.yaml. See the Gretel documentation for the full list of supported info types.
# Policy to search for "sensitive PII" as defined by
# https://www.experian.com/blogs/ask-experian/what-is-personally-identifiable-information/
schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
        - acme/*

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
Save the sample dataset below to pii.csv
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,[email protected],228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,[email protected],611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,[email protected],799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,[email protected],985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,[email protected],545-861-4923,5108752255128478,180-65-6855,user_92359
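Optionally, you can confirm pii.csv is well formed by previewing it with Python's standard csv module before uploading it:
import csv

# Quick preview of the dataset the classifier will be trained against.
with open("pii.csv", newline="") as f:
    reader = csv.DictReader(f)
    print("Columns:", reader.fieldnames)
    for row in reader:
        print(row["id"], row["name"], row["user_id"])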

Create a classification model

First, create a project to host your classification models and artifacts.
gretel projects create --display-name classify-example --set-default
Create the model, providing sample data so it can learn to take advantage of the column headers.
Currently, only plain text and CSV formats are supported by the Classify API. JSON support is coming soon.
gretel models create --config classify_config.yaml --in-data pii.csv --runner cloud > model-data.json
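The command above writes the model details to model-data.json, which the next step passes to --model-id. If you want to peek inside that file, here is a minimal sketch using only the Python standard library; the exact keys it contains (including the name of the model identifier field) depend on the CLI version, so the key names checked below are assumptions.
import json

with open("model-data.json") as f:
    model = json.load(f)

# Show what the CLI recorded for this model.
print("Top-level keys:", sorted(model.keys()))

# The model identifier field name may vary by CLI version (assumption).
for key in ("uid", "model_id", "id"):
    if key in model:
        print("Model ID:", model[key])
        break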

Classify your data

Your model can now be used to classify any dataset with a similar structure or schema.
gretel records classify --model-id model-data.json --in-data pii.csv --runner cloud --output .

Examine the results

Classify results are downloaded to the current directory as gzipped, line-delimited JSON in the file data.gz. Each discovered entity is labeled with its character offsets within the corresponding field of the input CSV.
zcat data.gz | head -n 1 | python -m json.tool
{
    "index": 0,
    "entities": [
        {
            "start": 0,
            "end": 21,
            "label": "email_address",
            "score": 0.8,
            "field": "email"
        },
        {
            "start": 0,
            "end": 12,
            "label": "phone_number",
            "score": 0.8,
            "field": "phone"
        },
        {
            "start": 0,
            "end": 11,
            "label": "us_social_security_number",
            "score": 0.8,
            "field": "ssn"
        },
        {
            "start": 0,
            "end": 8,
            "label": "person_name",
            "score": 0.8,
            "field": "first_name"
        },
        {
            "start": 0,
            "end": 7,
            "label": "person_name",
            "score": 0.8,
            "field": "last_name"
        },
        {
            "start": 0,
            "end": 10,
            "label": "acme/user_id",
            "score": 0.8,
            "field": "user_id"
        }
    ]
}
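Beyond inspecting a single record, you can aggregate the results across the whole file. The sketch below reads the gzipped, line-delimited JSON and counts discovered entities by label and by field, relying only on the keys shown in the sample output above.
import gzip
import json
from collections import Counter

label_counts = Counter()
field_label_counts = Counter()

with gzip.open("data.gz", "rt") as f:
    for line in f:
        record = json.loads(line)
        for entity in record.get("entities", []):
            label_counts[entity["label"]] += 1
            field_label_counts[(entity["field"], entity["label"])] += 1

print("Entities by label:")
for label, count in label_counts.most_common():
    print(f"  {label}: {count}")

print("Entities by field and label:")
for (field, label), count in field_label_counts.most_common():
    print(f"  {field} -> {label}: {count}")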

Video walkthrough