Anonymize PII

In this deep dive, we will walk through anonymizing data with a Gretel Transform model and Gretel Workers that run in your own environment. If you have not already done so, please follow our environment setup guide to ensure the Gretel CLI is installed and configured.
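
If you need a quick refresher, the typical setup looks something like the sketch below. This is only an illustration and assumes a Python environment is available; the exact install and configuration steps in the setup guide take precedence.

pip install -U gretel-client   # install the Gretel CLI
gretel configure               # set your API key and default project interactively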

For this tutorial, we'll use some sample customer-like data that contains a variety of personally identifiable and sensitive fields that may need to be transformed depending on the downstream use case.
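
If you'd like to see which columns are present before writing any rules, one quick option is to print just the CSV header. This is an optional sketch that assumes curl and standard shell tools are available:

# Print the header row of the sample dataset, one column name per line
# (assumes the header contains no quoted commas)
curl -s https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv | head -n 1 | tr ',' '\n'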

Creating a Model

Before creating our model, we need to create a configuration that specifies our Transform Policies and Rules.

Transforms are highly declarative. Please take a look through our Model Configuration documentation to see all of the options for creating Policies and Rules. We've created the following configuration for this example data:

# This example transform configuration supports the following dataset:
# https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv
schema_version: 1.0
name: "example-transforms"
models:
  - transforms:
      data_source: "__tmp__"
      policies:
        - name: fake_identifiers
          rules:
            - name: fake_identifiers
              conditions:
                value_label:
                  - email_address
                  - phone_number
                  - ip_address
              transforms:
                - type: fake
                - type: hash  # if a fake cannot be created
            - name: redact_names_locations
              conditions:
                field_label:
                  - person_name
                  - location
              transforms:
                - type: redact_with_char
            - name: dateshifter
              conditions:
                field_label:
                  - date
                  - datetime
              transforms:
                - type: dateshift
                  attrs:
                    min: 20
                    max: 20
                    formats: "%Y-%m-%d"
            - name: bucketize-income
              conditions:
                field_name:
                  - YearlyIncome
              transforms:
                - type: numberbucket
                  attrs:
                    min: 0
                    max: 1000000
                    nearest: 5000

Save this to a file called transform-config.yml.
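
Because YAML is indentation-sensitive, it can be worth a quick parse check before submitting a job. This is an optional sketch that assumes Python 3 with the PyYAML package installed:

# Fails with a parse error if the indentation or syntax is off
python3 -c "import yaml; yaml.safe_load(open('transform-config.yml')); print('config parses OK')"

Next, we will create our transform model with the Gretel CLI: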

gretel models create \
  --config transform-config.yml \
  --output transform-model \
  --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv \
  --runner local

Running this command will trigger the following actions automatically:

  • The configuration will be sent to Gretel Cloud and a model creation job will be requested

  • The CLI will start a local Gretel Worker that will download the configuration from Gretel Cloud

  • The Gretel Worker will create the model and write the model artifacts to the transform-model directory

When the model is created, you should see logging output that provides the Model ID. You will need this Model ID when serving the model to transform records. Since you are running in your own environment, you will also need the path to the model.tar.gz artifact that gets created in the output directory.
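
For example, a quick listing of the output directory shows where the archive ended up (the exact set of artifacts may vary by release):

# The model archive passed to the transform step lives here
ls -lh transform-model/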

As part of creating a model, a data preview is generated so you can take a quick look at transformed records. For this example, we can take a peek at the transformed preview with:

gunzip -c transform-model/data_preview.gz | head

Compare the sample transformed data with the original data:

curl https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv | head
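
To see exactly which values changed, you can also diff the first few rows of each. This is a rough sketch that assumes the preview is CSV with rows in the same order as the source, and that your shell supports process substitution (e.g. bash):

# Lines prefixed with < are original rows, lines prefixed with > are transformed rows
diff <(curl -s https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv | head -n 5) \
     <(gunzip -c transform-model/data_preview.gz | head -n 5)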

Now that a Transform Model has been created, we will look at how to use this model to do full dataset transformations.

Transforming Data at Scale

Now that you have created a transform model, you can serve it as many times as you like to transform records at scale. Next, we'll use the model we just created to transform all of the records from the same sample file.

You should have the Model ID and access to the model.tar.gz model archive from the previous model creation step.

To serve the model, we run the following command (replace the Model ID with the one from your model creation output):

gretel records transform \
  --runner local \
  --model-path transform-model/model.tar.gz \
  --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv \
  --output transformed-data \
  --model-id 60ba8a401fae93eff9d35dc1

You should see the worker start up, create a handler for serving the model, and begin transforming the records. Once the job is complete, your transformed data will be in the transformed-data directory (or whichever directory you specified with --output).

Let's look at our fully transformed dataset:

gunzip -c transformed-data/data.gz
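
As a final sanity check, you can confirm that the transformed output has the same number of lines as the source file (a header row plus one line per record). This assumes both files are plain CSV:

# Both counts should match
gunzip -c transformed-data/data.gz | wc -l
curl -s https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv | wc -l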