In this deep dive, we will walk through anonymizing data with a Gretel Transform model and Gretel Workers that run in our own environment. If you have not already done so, please follow our environment setup guide to ensure the Gretel CLI is installed and configured.
For this tutorial, we’ll use some sample customer-like data that contains a variety of interesting information that may need to be transformed depending on a downstream use case.
Before creating our model, we need to create a configuration that specifies our Transform Policies and Rules. We’ve created the following configuration for this example data:
# This example transform configuration supports the following dataset:
# https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv
schema_version: 1.0
name: "example-transforms"
models:
  - transforms:
      data_source: "__tmp__"
      policies:
        - name: fake_identifiers
          rules:
            - name: fake_identifiers
              conditions:
                value_label:
                  - email_address
                  - phone_number
                  - ip_address
              transforms:
                - type: fake
                - type: hash # if a fake cannot be created
            - name: redact_names_locations
              conditions:
                field_label:
                  - person_name
                  - location
              transforms:
                - type: redact_with_char
            - name: dateshifter
              conditions:
                field_label:
                  - date
                  - datetime
              transforms:
                - type: dateshift
                  attrs:
                    min: 20
                    max: 20
                    formats: "%Y-%m-%d"
            - name: bucketize-income
              conditions:
                field_name:
                  - YearlyIncome
              transforms:
                - type: numberbucket
                  attrs:
                    min: 0
                    max: 1000000
                    nearest: 5000
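To build intuition for what two of these rules do, here is a rough Python sketch of the behavior they describe. This is a conceptual illustration, not Gretel's implementation; the function names are ours, and the hash fallback and bucketing logic are assumptions based on the rule comments (`hash # if a fake cannot be created`, `numberbucket` with `nearest: 5000`).

```python
import hashlib

def hash_fallback(value, fake=None):
    # fake_identifiers rule (sketch): use a faked replacement when one is
    # available, otherwise fall back to a hex digest of the original value
    if fake is not None:
        return fake
    return hashlib.sha256(value.encode()).hexdigest()

def number_bucket(value, min_val=0, max_val=1_000_000, nearest=5000):
    # bucketize-income rule (sketch): clamp into [min, max], then snap
    # the value to the nearest multiple of `nearest`
    clamped = max(min_val, min(max_val, value))
    return round(clamped / nearest) * nearest
```

For example, a `YearlyIncome` of 61234 would land in the 60000 bucket, which keeps incomes useful for analysis without exposing exact figures.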
Save this to a file called transform-config.yml. Next, we will create our transform model with the Gretel CLI:
gretel models create --config transform-config.yml --output transform-model --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv --runner local
Running this command will trigger the following actions automatically:
The configuration will be sent to Gretel Cloud and a model creation job will be requested
The CLI will start a local Gretel Worker that will download the configuration from Gretel Cloud
The Gretel Worker will create the model and write model artifacts to the directory specified by --output (here, transform-model)
As part of creating a model, a data preview is generated for a quick look at transformed records. For this example, we can take a peek at our transformed records with:
gunzip -c transform-model/data_preview.gz | head
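The preview is just gzip-compressed text, so if you prefer Python to shell pipelines, it can also be inspected with the standard library. This is a small sketch; the path is the one produced by the command above:

```python
import gzip

def head_gz(path, n=10):
    # read the first n lines of a gzip-compressed text file,
    # equivalent to `gunzip -c path | head`
    with gzip.open(path, "rt") as f:
        return [line.rstrip("\n") for _, line in zip(range(n), f)]

# e.g. head_gz("transform-model/data_preview.gz")
```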
Compare the sample transformed data with the original data:
curl -s https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv | head
Now that a Transform Model has been created, we will look at how to use this model to do full dataset transformations.
Now that you have created a transform model, you can serve it as many times as you like to transform records at scale. Next, we'll use the model we just created to transform all of the records from the same sample file.
To serve the model, we run the following command (replace the Model ID!):
gretel records transform --runner local --model-path transform-model/model.tar.gz --in-data https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer-orders.csv --output transformed-data --model-id 60ba8a401fae93eff9d35dc1
You should see the worker start up, create a handler for serving the model, and begin transforming the records. Once the job is complete, your transformed data will be sitting in transformed-data (or whatever directory you specified with --output).
Let's look at our fully transformed dataset:
gunzip -c transformed-data/data.gz