The Gretel Classify model identifies personally identifiable information (PII) in tabular data. Classify can detect 40+ Supported Entities, including names, addresses, credentials, and more. In this example, we will identify PII in a Sample Dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can discover PII in your data using Classify via the Gretel Console, CLI, or Python SDK. If you'd like to follow along via video, see the Video Tutorial below.
Tutorial
Discover PII
In this notebook, we will create a classification policy that identifies PII, including a custom regular expression for user IDs. We will then use the SDK to classify data and examine the results.
To run this notebook, you will need an API key from the Gretel Console.
# Specify your Gretel API key
import pandas as pd
from gretel_client import configure_session

pd.set_option("max_colwidth", None)

configure_session(api_key="prompt", cache="yes", validate=True)
Create configuration with classify policy
# Create configuration with our Classify Policies and Rules.
config = """
# Policy to search for "sensitive PII" as defined by
# https://www.experian.com/blogs/ask-experian/what-is-personally-identifiable-information/

schema_version: "1.0"
name: "discover-pii-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
        - custom/*

label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: 'user_[\d]{5}'
"""
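Before training, it can help to check the custom pattern locally. The sketch below applies the same expression, user_[\d]{5}, with Python's re module. The helper name find_user_id is hypothetical, and the use of an unanchored search is an assumption: how Classify itself applies the pattern is not shown in this tutorial.

```python
import re

# Same expression as the custom "user_id" predictor in the configuration.
USER_ID_PATTERN = re.compile(r"user_[\d]{5}")


def find_user_id(text):
    """Return the first substring matching the user ID pattern, or None.

    Hypothetical helper for local testing only; it uses an unanchored
    search, which may differ from how Classify evaluates the pattern.
    """
    match = USER_ID_PATTERN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in ["user_12345", "user_1234", "USER_12345", "id=user_987654"]:
        print(f"{sample!r}: {find_user_id(sample)}")
```

Note that the pattern is case-sensitive and requires exactly five consecutive digits after the prefix, so a value with only four digits does not match, while a longer digit run matches on its first five digits.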
Use Faker to make training and test datasets
from faker import Faker

# Use Faker to make training and test data.
def fake_pii_csv(filename, lines=100):
    fake = Faker()
    with open(filename, "w") as f:
        f.write("id,name,email,phone,visa,ssn,user_id\n")
        for i in range(lines):
            _name = fake.name()
            _email = fake.email()
            _phone = fake.phone_number()
            _cc = fake.credit_card_number()
            _ssn = fake.ssn()
            _id = f'user_{fake.numerify(text="#####")}'
            f.write(f"{i},{_name},{_email},{_phone},{_cc},{_ssn},{_id}\n")


fake_pii_csv("train.csv")
fake_pii_csv("test.csv")
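The function above builds each row with string formatting, which works as long as no generated value contains a comma. As an optional variant, the sketch below writes the same columns with the standard library's csv module, which quotes such values automatically. The helper names write_rows and read_rows are hypothetical, and the rows are passed in by the caller so the example does not depend on Faker.

```python
import csv

# Column names taken from the header written by fake_pii_csv.
HEADER = ["id", "name", "email", "phone", "visa", "ssn", "user_id"]


def write_rows(filename, rows):
    """Write the header and the given rows, quoting fields where needed."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_rows(filename):
    """Read the file back as a list of dicts keyed by column name."""
    with open(filename, newline="") as f:
        return list(csv.DictReader(f))
```

To combine this with Faker, one could collect the generated values into lists inside the loop and pass them to write_rows instead of formatting each line by hand.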
Create model
Now, we will train our Classify model on the training data.
import yaml
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

# Create a project and model configuration.
project = create_or_get_unique_project(name="label-pii-classify")

model = project.create_model_obj(
    model_config=yaml.safe_load(config), data_source="train.csv"
)

# Upload the training data. Train the model.
model.submit_cloud()

poll(model)
Classify test data using trained model
Finally, we will classify our test dataset using the model we trained on the training data.
# Now we can use our model to classify the test data.
record_handler = model.create_record_handler_obj(data_source="test.csv")

record_handler.submit_cloud()

poll(record_handler)

# Let's inspect the results.
classified = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
classified.head()
Overview
This tutorial will walk through the process of identifying PII using the Gretel CLI.
Save sample dataset and configuration
To start, create and save your sample dataset and Classify configuration by copying the code below and saving it to local files.
Classify configuration
Save your configuration to a local file named classify_config.yaml. The policy below searches for sensitive PII values as defined by Experian (including a custom regex for user IDs), and identifies them using Supported Entities labels (person_name, credit_card_number, etc.).
# Policy to search for "sensitive PII" as defined by
# https://www.experian.com/blogs/ask-experian/what-is-personally-identifiable-information/

schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
        - custom/*

label_predictors:
  namespace: custom
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
Classify results are downloaded to the local directory as line-delimited JSON (JSONL) in the file data.gz. Discovered entities are labeled by their offset within each field in the input CSV.
field denotes the name of the column in the input data. label refers to the entity type the value has been classified as.
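To get a quick summary of the downloaded results, the gzipped JSONL file can be read with the standard library. The sketch below counts how often each label was found in each field. The record layout is an assumption: it supposes each line is a JSON object with an "entities" list whose items carry the field and label keys described above. Adjust the key names if your output differs. The function name count_labels is hypothetical.

```python
import gzip
import json
from collections import Counter


def count_labels(path):
    """Count (field, label) pairs in a gzipped JSONL results file.

    Assumes each line is a JSON object containing an "entities" list,
    and that each entity has "field" and "label" keys.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            for entity in record.get("entities", []):
                counts[(entity.get("field"), entity.get("label"))] += 1
    return counts


if __name__ == "__main__":
    import os

    if os.path.exists("data.gz"):
        for (field, label), n in count_labels("data.gz").most_common():
            print(field, label, n)
```

A column whose values are consistently assigned the same label, for example an email column labeled email_address on every row, is a good sign that the policy is picking up the intended entities.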