Redact PII

Use Gretel Transform to remove sensitive personally identifiable information (PII).

The Gretel Transform model redacts personally identifiable information (PII) from tabular data. In this example, we remove PII from a Sample Dataset containing names, email addresses, phone numbers, credit card numbers, and SSNs. You can redact PII with Transform through the Gretel Console, CLI, or Python SDK. If you'd like to follow along via video, see the video tutorial below.

Tutorial

https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/transform/redact_pii.ipynb

Redact PII Notebook

In this notebook, we will create a transform policy to identify and redact or replace PII with fake values. We will then use the SDK to transform a dataset and examine the results.

To run this notebook, you will need an API key from the Gretel Console.

Getting started

!pip install -Uqq gretel_client

Log in to Gretel and create or load a project. Get a free API key at https://console.gretel.ai/users/me/key

from gretel_client import Gretel

gretel = Gretel(
    project_name="redact-pii",
    api_key="prompt",
    validate=True,
)
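The `api_key="prompt"` setting asks for the key interactively, which does not suit scripts or scheduled jobs. A minimal sketch of one alternative, assuming the key is stored in an environment variable named `GRETEL_API_KEY` (the variable name and the `resolve_api_key` helper are illustrative assumptions, not part of the notebook):

```python
import os


def resolve_api_key(env=None):
    """Return the API key from the environment, or "prompt" to ask interactively."""
    env = os.environ if env is None else env
    # GRETEL_API_KEY is an assumed variable name for this sketch.
    key = env.get("GRETEL_API_KEY", "").strip()
    return key if key else "prompt"


# Intended use (not executed here):
# gretel = Gretel(project_name="redact-pii", api_key=resolve_api_key(), validate=True)
```

Falling back to `"prompt"` keeps the notebook behavior unchanged when the variable is not set.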

Load the Dataset

import pandas as pd

df = pd.read_csv('https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/gretel_generated_table_simpsons_pii.csv')
df.head(5)
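Before submitting a dataset, it can help to get a rough idea of which columns hold PII. The sketch below is a simple local heuristic based on regular expressions, not the classifier that Transform uses. The `PATTERNS` table, the `guess_pii_columns` helper, and the 0.8 threshold are assumptions made for illustration; the example values come from the Sample Dataset on this page.

```python
import re

# Simple illustrative patterns; real PII detection needs far broader coverage.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "card": re.compile(r"^\d{16}$"),
}


def guess_pii_columns(columns, threshold=0.8):
    """Map column name -> pattern label when most values match that pattern."""
    guesses = {}
    for name, values in columns.items():
        # Ignore missing and empty values when computing the match ratio.
        values = [str(v) for v in values if v is not None and str(v) != ""]
        if not values:
            continue
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                guesses[name] = label
                break
    return guesses


sample = {
    "name": ["Kimberli Goodman", "Anna Jackson"],
    "email": ["kgoodbanne0@house.gov", "ajackson@wired.com"],
    "phone": ["228-229-2479", "611-570-4635"],
    "ssn": ["108-08-9132", "256-28-0041"],
}
print(guess_pii_columns(sample))
```

With a pandas DataFrame, something like `guess_pii_columns(df.head(100).astype(str).to_dict("list"))` should produce the same kind of mapping. Names and free text are not covered by these patterns, which is one reason to rely on the model's own classification.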

Redact PII via transform model


# De-identification configuration
config = """
schema_version: "1.0"
name: "Replace PII"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities:
            - first_name
            - last_name
            - email
            - phone_number
            - street_address
          num_samples: 100
      steps:
        - rows:
            update:
              # Detect and replace values in PII columns, hash if no Faker available
              - condition: column.entity is in globals.classify.entities
                value: column.entity | fake
                fallback_value: this | hash | truncate(9,true,"")

              # Detect and replace entities within free text columns
              - type: text
                value: this | fake_entities(on_error="hash")

              # Replace email addresses with first + last name to retain correlations
              - name: email_address
                value: 'row.first_name + "." + row.last_name + "@" + fake.free_email_domain()'
"""

transform_result = gretel.submit_transform(
    config=config,
    data_source=df,
    job_label="Transform PII data"
)

transformed_df = transform_result.transformed_df
transformed_df.head()
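Two rules in the configuration above are easier to follow when written out in plain Python. The sketch below mimics the `this | hash | truncate(9,true,"")` fallback and the email rule that combines `row.first_name` and `row.last_name`. It makes several assumptions: SHA-256 stands in for whatever hash Transform applies, `truncate(9,true,"")` is read as Jinja-style truncation to the first 9 characters with no suffix, and `FREE_DOMAINS` is a hypothetical stand-in for Faker's `free_email_domain()`.

```python
import hashlib
import random


def hash_fallback(value, length=9):
    """Hash a value and keep the first `length` hex characters."""
    # SHA-256 is an assumption for this sketch; the hash Transform uses may differ.
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical stand-in for the domains Faker can generate.
FREE_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com"]


def correlated_email(first_name, last_name, rng=random):
    """Build an email address from the name fields of the same row."""
    return f"{first_name}.{last_name}@{rng.choice(FREE_DOMAINS)}"


print(hash_fallback("user_93952"))
print(correlated_email("Anna", "Jackson", random.Random(0)))
```

Because the hash is deterministic, the same input always maps to the same short token, so repeated values stay linked after redaction. Deriving the email from the name fields keeps the two consistent within a row, which independently generated fake values would not.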

View results of redacting data

import pandas as pd

def highlight_detected_entities(report_dict):
    """
    Process the report dictionary, extract columns with detected entities,
    and highlight cells with non-empty entity labels.

    Args:
        report_dict (dict): The report dictionary from transform_result.report.as_dict.

    Returns:
        pd.io.formats.style.Styler: Highlighted DataFrame.
    """
    # Parse the columns and extract 'Detected Entities'
    columns_data = report_dict['columns']
    df = pd.DataFrame([
        {
            'Column Name': col['name'],
            'Detected Entities': ', '.join(
                entity['label'] for entity in col['entities'] if entity['label']
            )
        }
        for col in columns_data
    ])

    # Highlighting logic
    def highlight_entities(s):
        return ['background-color: lightgreen' if len(val) > 0 else '' for val in s]

    # Apply highlighting
    return df.style.apply(highlight_entities, subset=['Detected Entities'], axis=1)


highlight_detected_entities(pd.DataFrame(transform_result.report.as_dict))

# Show full cell contents when comparing the original and transformed rows
pd.set_option('display.max_colwidth', None)

first_row_df1 = df.iloc[0].to_frame('Original')
first_row_df2 = transformed_df.iloc[0].to_frame('Transformed')

# Join the transposed rows
comparison_df = first_row_df1.join(first_row_df2)

def highlight_differences(row):
    is_different = row['Original'] != row['Transformed']
    color = 'background-color: lightgreen' if is_different else ''
    return ['', f'{color}; min-width: 500px']

styled_df = comparison_df.style.apply(highlight_differences, axis=1).format(escape="html")
styled_df
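The styled comparison above covers a single row. To check the whole output, one option is to count changed cells per column and look for original values that survived the transform. This is a minimal sketch that assumes rows keep their order between input and output; the helper names are illustrative, and the transformed rows in the example are made up for demonstration rather than taken from real Transform output.

```python
def changed_cells(original_rows, transformed_rows):
    """Count, per column, how many cells differ between two aligned row lists."""
    counts = {}
    for before, after in zip(original_rows, transformed_rows):
        for column, value in before.items():
            counts.setdefault(column, 0)
            if after.get(column) != value:
                counts[column] += 1
    return counts


def leaked_values(original_rows, transformed_rows, columns):
    """Return original values from `columns` that still appear in the output."""
    seen = {str(v) for row in transformed_rows for v in row.values()}
    return sorted(
        {str(row[c]) for row in original_rows for c in columns if str(row[c]) in seen}
    )


original = [
    {"id": 1, "name": "Kimberli Goodman", "email": "kgoodbanne0@house.gov"},
    {"id": 2, "name": "Anna Jackson", "email": "ajackson@wired.com"},
]
# Made-up output; the second email is left unchanged on purpose to show a leak.
transformed = [
    {"id": 1, "name": "Jane Doe", "email": "jane.doe@example.com"},
    {"id": 2, "name": "John Roe", "email": "ajackson@wired.com"},
]

print(changed_cells(original, transformed))
print(leaked_values(original, transformed, ["name", "email"]))
```

With the notebook's DataFrames, `df.to_dict("records")` and `transformed_df.to_dict("records")` should give row lists in the expected shape. A non-empty result from `leaked_values` points to columns whose rules may need adjusting.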

Sample Dataset

id,name,email,phone,visa,ssn,user_id
1,Kimberli Goodman,kgoodbanne0@house.gov,228-229-2479,5108758325678962,108-08-9132,user_93952
2,Anna Jackson,ajackson@wired.com,611-570-4635,5048377302905174,256-28-0041,user_23539
3,Sammy Bartkiewicz,sbartkiewicz2@ycombinator.com,799-160-2165,5108758273775281,849-46-5175,user_35232
4,Matt Parnell,mparnell3@vinaora.com,985-733-6433,5048376551569087,774-83-5725,user_23529
5,Meredith Myers,mmyers4@naver.com,545-861-4923,5108752255128478,180-65-6855,user_92359
