Data De-identification

In this deep dive, we will walk through some of the more advanced features to de-identify data with the Transform API, including bucketing, date shifts, masking, and entity replacements.
For this tutorial, we’ll use some sample customer-like data that contains a variety of interesting information that may need to be transformed depending on a downstream use case.
If you have not already done so, please follow our environment setup guide to ensure the Gretel CLI is installed and configured.

Creating a Model

Before creating our model, we need to create a configuration that specifies our Transform Policies and Rules. We’ve created the following configuration for this example data:
Transforms are highly declarative. Please take a look through our Transforms documentation to see all of the options for creating Policies and Rules.
# This example transform configuration supports the following dataset:
schema_version: "1.0"
name: "example-transforms"
models:
  - transforms:
      data_source: "__tmp__"
      policies:
        - name: fake_identifiers
          rules:
            - name: fake_identifiers
              conditions:
                field_label:
                  - email_address
                  - phone_number
                  - ip_address
              transforms:
                - type: fake
                - type: hash # if a fake cannot be created
            - name: redact_names_locations
              conditions:
                field_label:
                  - person_name
                  - location
              transforms:
                - type: redact_with_char
            - name: dateshifter
              conditions:
                field_label:
                  - date
                  - datetime
              transforms:
                - type: dateshift
                  attrs:
                    min: 20
                    max: 20
                    formats: "%Y-%m-%d"
            - name: bucketize-income
              conditions:
                field_name:
                  - YearlyIncome
              transforms:
                - type: numberbucket
                  attrs:
                    min: 0
                    max: 1000000
                    nearest: 5000
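To build intuition for what two of the rules above do, here is a plain-Python sketch of the date shifting and number bucketing behavior. This is not Gretel's implementation; the exact bucketing semantics (we floor values to bucket boundaries here) and the interpretation of the dateshift `min`/`max` window as "up to 20 days in either direction" are assumptions for illustration only.

```python
# Illustrative sketch only -- NOT Gretel's implementation of these transforms.
import random
from datetime import datetime, timedelta

def numberbucket(value, min_val=0, max_val=1_000_000, nearest=5_000):
    """Clamp a number into [min_val, max_val], then floor it to the
    nearest bucket boundary (one plausible reading of `nearest: 5000`)."""
    clamped = max(min_val, min(max_val, value))
    return (clamped // nearest) * nearest

def dateshift(date_str, max_days=20, fmt="%Y-%m-%d"):
    """Shift a date by a random number of days within +/- max_days,
    mirroring the dateshifter rule's 20-day window."""
    d = datetime.strptime(date_str, fmt)
    shift = random.randint(-max_days, max_days)
    return (d + timedelta(days=shift)).strftime(fmt)

print(numberbucket(83_250))  # 80000 -- nearby incomes land in the same bucket
```

Bucketing like this preserves the rough shape of a numeric distribution while hiding exact values; date shifting keeps dates plausible while breaking linkage to real events.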
Save this to a file called transform-config.yml. Next, we will create our transform model with the Gretel CLI:
gretel models create --config transform-config.yml --output transform-model --in-data --runner local
Running this command will trigger the following actions automatically:
  • The configuration will be sent to Gretel Cloud and a model creation job will be requested
  • The CLI will start a local Gretel Worker that will download the configuration from Gretel Cloud
  • The Gretel Worker will create the model and generate model artifacts in the transform-model directory.
When the model is created, you should see logging output that provides the Model ID. You will need this Model ID when serving models to transform records. Since you are running in your own environment, you will also need the path to the model.tar.gz artifact that gets created in the output directory.
As part of creating a model, a data preview is generated for a quick look at transformed records. For this example, we can take a peek at the transformed records with:
gunzip -c transform-model/data_preview.gz | head
Compare the sample transformed data with the original data:
curl | head
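If you'd rather compare the two files programmatically, a small helper like the following can line up the first few rows side by side. The file names here are placeholders, not paths produced by the tutorial; substitute the original CSV and the decompressed preview.

```python
# Hedged helper sketch -- file names below are placeholders, not outputs
# of the commands in this tutorial.
import csv

def head_diff(original_path, transformed_path, n=3):
    """Return the first n (original_row, transformed_row) pairs."""
    with open(original_path, newline="") as a, open(transformed_path, newline="") as b:
        rows_a, rows_b = csv.reader(a), csv.reader(b)
        return [(ra, rb) for _, ra, rb in zip(range(n), rows_a, rows_b)]

# Example usage (placeholder file names):
# for orig, xform in head_diff("original.csv", "transformed.csv"):
#     print(orig, "->", xform)
```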
Now that a Transform Model has been created, we will look at how to use this model to do full dataset transformations.

Transforming data at scale

Now that you have created a transform model, you can serve it as many times as you like to transform records at scale. Next, we'll use the model we just created to transform all of the records from the same sample file.
You should have the Model ID and access to the model.tar.gz model archive from the previous model creation step.
To serve the model, we run the following command (replace the Model ID!):
gretel records transform --runner local --model-path transform-model/model.tar.gz --in-data --output transformed-data --model-id 60ba8a401fae93eff9d35dc1
You should see the worker start up, create a handler for serving the model, and begin transforming the records. Once the job is complete, your transformed data should be sitting in the transformed-data directory (or whichever output directory you specified).
Let's look at our fully transformed dataset:
gunzip -c transformed-data/data.gz
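As a final sanity check, you can scan the decompressed output for values that should no longer appear. The sketch below (file name assumed to match the output above) uses a simple regex to confirm that no raw email addresses survived the fake_identifiers rule; a similar scan works for phone numbers or IP addresses.

```python
# Hedged spot-check sketch -- the file path is assumed, and the regex is a
# simple heuristic, not an exhaustive email matcher.
import gzip
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def leaked_emails(path):
    """Return any strings in the gzipped file that still look like emails."""
    with gzip.open(path, "rt") as f:
        return EMAIL_RE.findall(f.read())

# leaked_emails("transformed-data/data.gz") should return [] if the
# fake_identifiers rule replaced every email address.
```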