Relational Transform

Transform multi-table databases to redact PII while maintaining referential integrity.


Gretel Relational Transforms leverages Gretel's Transform capabilities to detect and transform sensitive entities throughout your database. You can easily identify sensitive columns and apply a range of transformations at scale, such as masking, hashing, tokenization, or replacement. Because key values are transformed consistently across tables, Gretel Relational Transforms keeps your database private and secure while maintaining referential integrity and statistical accuracy.
In addition to transforming your database, Gretel Relational also makes it easy to transform and then synthesize a database for maximum privacy assurance (think GDPR compliance). We'll discuss how to Transform and Synthesize a Database below.
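To build intuition for how referential integrity can survive transformation, here is a conceptual sketch (not Gretel's actual implementation): if a key column is tokenized deterministically, and the same tokenization is applied in every table that references it, foreign-key joins still hold after the transform. The table and column names below are hypothetical.

```python
import hashlib

def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Deterministically map a key value to a short hash token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

# Two related tables sharing a customer_id foreign key (illustrative data)
customers = [{"customer_id": "C-1001", "name": "Alice"}]
invoices = [{"invoice_id": "I-1", "customer_id": "C-1001"}]

# Apply the SAME tokenization to the key column in both tables
for row in customers + invoices:
    row["customer_id"] = tokenize(row["customer_id"])
```

After the loop, the original identifier is gone from both tables, yet each invoice still points at the correct customer because the token is derived deterministically from the original key.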

Recap: Getting Started

On the Relational page, we covered the process for installing Gretel Relational, defining your source database, and creating a relational model. A brief recap of the code can be found below, again using our telecommunications Demo Database as an example. This example defines the source data using a SQLite connector. For more information on using other connectors or defining data manually, refer to Define Source Data.
from gretel_trainer.relational import *

connector = sqlite_conn("telecom.db")
relational_data = connector.extract()

multitable = MultiTable(
    relational_data,
    # project_display_name="multi-table",
    # gretel_model="amplify",
    # refresh_interval=60,
)

Transform a Database

Define a Transform Config

The first step in relational transforms is choosing or defining a transform model config. The snippet below demonstrates a few different ways you can provide a config, including a local path, a URL, or a Gretel blueprint config.
local_config = "/path/to/transforms_config.yaml"
remote_config = ""  # e.g. a URL pointing to a hosted config file
blueprint_config = "transform/default"
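For reference, a transform config is a YAML document describing which entities to detect and how to transform them. The fragment below is a sketch adapted from the pattern used in Gretel's redact-PII blueprints; the policy, rule, and entity-label names are illustrative and should be checked against the current Gretel Transforms config reference.

```yaml
schema_version: "1.0"
name: "example-transforms"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: redact_pii
          rules:
            - name: fake_names_and_emails
              conditions:
                value_label:
                  - person_name
                  - email_address
              transforms:
                - type: fake
```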

Train Transform Models

Pass the transform config to train_transforms to begin training. By default, transforms will run on all tables in the RelationalData instance, but this can be scoped to a subset of tables using one of the optional only or ignore parameters.
multitable.train_transforms(
    local_config,
    # only={"table_a", "table_b"},
    # ignore={"table_x", "table_y"},
)
Once train_transforms has started, logs showing the status of each table's model are updated periodically according to the refresh_interval set in the MultiTable instance. When training begins, a model for each table will appear in your project under the name {table}-transforms.

Run Transforms

Once training is complete, call run_transforms to generate transformed data. Relational Transforms can be used alone or in combination with Relational Synthetics. If you intend to train synthetic models on the transformed output instead of the source data, add the argument in_place=True.
You can also run other data through the trained transform model. For example:
multitable.run_transforms(data={"events": some_other_events_dataframe})
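The data argument is a dict mapping table names in the RelationalData instance to replacement DataFrames. The table name and columns below are hypothetical; the columns must match the schema the transform model was trained on.

```python
import pandas as pd

# Hypothetical replacement rows for an "events" table
some_other_events_dataframe = pd.DataFrame(
    {"event_id": [101, 102], "account_id": [7, 7], "event_type": ["call", "sms"]}
)

# Keys are table names from the RelationalData instance
data = {"events": some_other_events_dataframe}
```

Passing this dict as `multitable.run_transforms(data=data)` transforms only the listed tables, using the models trained earlier.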

Transform and Synthesize a Database

To transform data that you then plan to synthesize, add the argument in_place=True to run_transforms. Note: this will modify the data in the RelationalData instance. Below is a code snippet for transforming and synthesizing the telecom database.
from gretel_trainer.relational import *
from gretel_client.projects.models import read_model_config

# Input data from database
db_path = "telecom.db"
sqlite = sqlite_conn(path=db_path)
relational_data = sqlite.extract()

# Create relational model
multitable = MultiTable(relational_data)

# Transform
transform_config = read_model_config("transform/default")
multitable.train_transforms(transform_config)
multitable.run_transforms(in_place=True)

# Synthesize
multitable.train_synthetics()
multitable.generate()

# Write output back to database
out_db_path = "output.db"
out_conn = sqlite_conn(path=out_db_path)
out_conn.save(multitable.synthetic_output_tables)


The transformed data is automatically written to the working directory as transformed_{table}.csv. These files are also uploaded to the Gretel Cloud in an archive file called transform_outputs.tar.gz. You can find and download this file under the "Data Sources" tab in your project. You can optionally write the transformed data to a database using a Connector. The process for using output Connectors is detailed here.