Data De-identification
In this deep dive, we will walk through some of the more advanced features to de-identify data with the Transform API, including bucketing, date shifts, masking, and entity replacements.
For this tutorial, we’ll use some sample customer-like data that contains a variety of interesting information that may need to be transformed depending on a downstream use case.
If you have not already done so, please follow our environment setup guide to ensure the Gretel CLI is installed and configured.

Creating a Model

Before creating our model, we need to create a configuration that specifies our Transform Policies and Rules. We’ve created the following configuration for this example data:
Transforms are highly declarative. Please take a look through our Transforms documentation to see all of the options for creating Policies and Rules.
```yaml
# This example transform configuration supports the following dataset:
# https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv

schema_version: "1.0"
name: "example-transforms"

models:
  - transforms:
      data_source: "__tmp__"
      policies:
        - name: fake_identifiers
          rules:
            - name: fake_identifiers
              conditions:
                value_label:
                  - email_address
                  - phone_number
                  - ip_address
              transforms:
                - type: fake
                - type: hash # if a fake cannot be created
            - name: redact_names_locations
              conditions:
                field_label:
                  - person_name
                  - location
              transforms:
                - type: redact_with_char
            - name: dateshifter
              conditions:
                field_label:
                  - date
                  - datetime
              transforms:
                - type: dateshift
                  attrs:
                    min: 20
                    max: 20
                    formats: "%Y-%m-%d"
            - name: bucketize-income
              conditions:
                field_name:
                  - YearlyIncome
              transforms:
                - type: numberbucket
                  attrs:
                    min: 0
                    max: 1000000
                    nearest: 5000
```
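To build intuition for what the dateshift and numberbucket rules do, here is a rough Python sketch of their semantics. This is illustrative only, not Gretel's implementation: the function names and the exact shifting and bucketing behavior (for example, whether values snap to the nearest bucket boundary or the bucket floor) are assumptions for this sketch; the Transforms documentation defines the authoritative behavior.

```python
import random
from datetime import datetime, timedelta

# Illustrative only -- not Gretel's implementation. Function names and exact
# semantics (e.g., nearest-boundary rounding) are assumptions for this sketch.

def dateshift(value: str, min_days: int, max_days: int, fmt: str = "%Y-%m-%d") -> str:
    """Shift a date string by a random number of days in [min_days, max_days]."""
    shift = timedelta(days=random.randint(min_days, max_days))
    return (datetime.strptime(value, fmt) + shift).strftime(fmt)

def numberbucket(value: float, lo: float, hi: float, nearest: float) -> float:
    """Clamp a value to [lo, hi] and snap it to the nearest bucket boundary."""
    clamped = max(lo, min(hi, value))
    return nearest * round(clamped / nearest)

print(dateshift("2021-06-01", 20, 20))          # min == max, so always +20 days
print(numberbucket(87314, 0, 1_000_000, 5000))  # 87314 -> 85000
```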
Save this to a file called transform-config.yml. Next, we will create our transform model with the Gretel CLI:
```bash
gretel models create \
  --config transform-config.yml \
  --output transform-model \
  --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv \
  --runner local
```
Running this command will trigger the following actions automatically:
- The configuration will be sent to Gretel Cloud and a model creation job will be requested.
- The CLI will start a local Gretel Worker that will download the configuration from Gretel Cloud.
- The Gretel Worker will create the model and generate model artifacts in the transform-model directory.
When the model is created, you should see logging output that provides the Model ID. You will need this Model ID when serving the model to transform records. Since you are running in your own environment, you will also need the path to the model.tar.gz artifact that gets created in the output directory.
As part of creating a model, a data preview is generated for a quick look at transformed records. For this example, we can take a peek at the transformed records with:
```bash
gunzip -c transform-model/data_preview.gz | head
```
Compare the sample transformed data with the original data:
```bash
curl https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv | head
```
Now that a Transform Model has been created, we will look at how to use this model to do full dataset transformations.

Transforming Data at Scale

Now that you have created a transform model, you can serve it as many times as you like to transform records at scale. Next, we'll use the model we just created to transform all of the records from the same sample file.
You should have the Model ID and access to the model.tar.gz model archive from the previous model creation step.
To serve the model, we run the following command (replace the Model ID!):
```bash
gretel records transform \
  --runner local \
  --model-path transform-model/model.tar.gz \
  --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv \
  --output transformed-data \
  --model-id 60ba8a401fae93eff9d35dc1
```
You should see the worker start up, create a handler for serving the model, and begin transforming the records. Once the job is complete, your transformed data will be in the transformed-data directory (or whatever output directory you specified).
Let's look at our fully transformed dataset:
```bash
gunzip -c transformed-data/data.gz | cat
```
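If you prefer inspecting the results in Python, a small pandas sketch like the one below works as well. It assumes the transformed archive is a gzipped CSV (pandas infers the compression from the .gz extension); YearlyIncome is the one column name we know from the transform config above.

```python
import pandas as pd

# Load the original dataset and the transformed output for a side-by-side look.
# Assumes transformed-data/data.gz is a gzipped CSV; pandas infers gzip
# compression from the .gz file extension.
original = pd.read_csv(
    "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv"
)
transformed = pd.read_csv("transformed-data/data.gz")

# Spot-check the bucketized income column named in the transform config.
print(original["YearlyIncome"].head())
print(transformed["YearlyIncome"].head())
```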