Discover PII
In this tutorial, we will create a classification policy that identifies PII using built-in info types together with a custom regular expression. We will then use the CLI to classify data and examine the results.

Sample configuration

Save your configuration to a local file named classify_config.yaml. Click the link to see all supported info types.
# Policy to search for "sensitive PII" as defined by
# https://www.experian.com/blogs/ask-experian/what-is-personally-identifiable-information/

schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
        - acme/*

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
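Before training, it can help to check the custom pattern locally. The sketch below uses Python's `re` module as a stand-in. It assumes the classifier applies the pattern with search-style (unanchored) matching, which this tutorial does not state, and the helper name `find_user_ids` is hypothetical.

```python
import re
from typing import List

# Pattern copied from classify_config.yaml. The YAML double-quoted string
# "user_[\\d]{5}" unescapes to the regular expression user_[\d]{5}.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def find_user_ids(text: str) -> List[str]:
    """Return every substring of text that matches the user_id pattern.

    Assumption: unanchored matching, as done by re.findall. The matching
    semantics of the classification service may differ.
    """
    return USER_ID_PATTERN.findall(text)


if __name__ == "__main__":
    # Values taken from, or modeled on, the user_id column of pii.csv.
    for value in ["user_93952", "user_1234", "id=user_23539;", "USER_35232"]:
        print(value, find_user_ids(value))
```

Because the pattern has no anchors or word boundaries, a longer value such as `user_123456` would still produce a match on its first five digits under these semantics. If that is not wanted, tightening the pattern in the configuration is one option.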
Save the sample dataset below to pii.csv
id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,[email protected],228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,[email protected],611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,[email protected],799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,[email protected],985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,[email protected],545-861-4923,5108752255128478,180-65-6855,user_92359

Create a classification model

First, create a project to host your classification models and artifacts.
gretel projects create --display-name classify-example --set-default
Next, create the model. Providing sample data helps the model learn to make use of the header structure in your data.
Currently, only plain text and CSV formats are supported by the Classify API. JSON support is coming soon.
gretel models create --config classify_config.yaml --in-data pii.csv --runner cloud > model-data.json
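Since only plain text and CSV inputs are supported, records held as JSON would need to be converted before they can be used as `--in-data`. The following is a minimal sketch using only the Python standard library. It assumes flat records (no nested objects), and the function name `json_records_to_csv` is hypothetical rather than part of the Gretel tooling.

```python
import csv
from typing import Iterable, List, Mapping


def json_records_to_csv(records: Iterable[Mapping[str, object]], csv_path: str) -> List[str]:
    """Write flat JSON-style records to a CSV file and return the header row.

    Columns appear in first-seen order. Keys missing from a record are
    written as empty cells.
    """
    rows = list(records)

    # Collect column names in the order they are first encountered.
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

A call such as `json_records_to_csv(json.load(open("records.json")), "records.csv")` would then produce a file suitable for the commands in this tutorial, assuming `records.json` (a hypothetical file) holds a list of flat objects.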

Classify your data

Your model can now be used to classify any dataset with a similar structure or schema.
gretel records classify --model-id model-data.json --in-data pii.csv --runner cloud --output .

Examine the results

Classify results are downloaded to the local directory as line-delimited JSON (JSONL) in the file data.gz. Each discovered entity is reported with the field it was found in and its start and end offsets within that field's value.
zcat data.gz | head -n 1 | python -m json.tool
{
    "index": 0,
    "entities": [
        {
            "start": 0,
            "end": 21,
            "label": "email_address",
            "score": 0.8,
            "field": "email"
        },
        {
            "start": 0,
            "end": 12,
            "label": "phone_number",
            "score": 0.8,
            "field": "phone"
        },
        {
            "start": 0,
            "end": 11,
            "label": "us_social_security_number",
            "score": 0.8,
            "field": "ssn"
        },
        {
            "start": 0,
            "end": 8,
            "label": "person_name",
            "score": 0.8,
            "field": "first_name"
        },
        {
            "start": 0,
            "end": 7,
            "label": "person_name",
            "score": 0.8,
            "field": "last_name"
        },
        {
            "start": 0,
            "end": 10,
            "label": "acme/user_id",
            "score": 0.8,
            "field": "user_id"
        }
    ]
}
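For anything beyond a quick look, the results can also be read programmatically. The sketch below parses data.gz with the standard library and uses the offsets to recover the matched text from a row of the input. It assumes "start" is inclusive and "end" is exclusive, which is inferred from the sample output (for example, the 12-character phone number has "end": 12) rather than stated here. The function names `read_records` and `matched_text` are hypothetical.

```python
import gzip
import json
import os
from typing import Dict, Iterator, List, Tuple


def read_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def matched_text(row: Dict[str, str], record: dict) -> List[Tuple[str, str, str]]:
    """Return (field, label, text) for each entity whose field is present in row.

    Assumption: "start" is inclusive and "end" is exclusive, so the matched
    text is value[start:end]. Entities that name a field missing from row
    are skipped.
    """
    spans = []
    for entity in record.get("entities", []):
        value = row.get(entity["field"])
        if value is None:
            continue
        spans.append((entity["field"], entity["label"], value[entity["start"]:entity["end"]]))
    return spans


if __name__ == "__main__" and os.path.exists("data.gz"):
    # List the labels found in each record of the downloaded results.
    for record in read_records("data.gz"):
        print(record["index"], [entity["label"] for entity in record["entities"]])
```

Pairing each record with its source row (for instance by treating "index" as the zero-based row position in pii.csv, which is an assumption) would let `matched_text` show exactly which characters triggered each label.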

