
Relational

Synthesize and Transform multi-table databases with Gretel Relational.

Introduction

With Gretel Relational, you can synthesize and transform multiple tables or entire SQL databases, while ensuring referential integrity, accuracy, and privacy.
Depending on your use case, you can leverage Gretel's generative AI models to synthesize your database, transform it, or transform and then synthesize it.
Gretel Relational is currently delivered through Gretel's Trainer SDK, which provides client-side multi-model management for complex use cases. These features are released in this SDK as public previews before being incorporated into Gretel's platform APIs.
To install the Trainer SDK, you need Python 3.9 or higher; install from PyPI:
pip install -U gretel-trainer
If you wish to utilize database connectors, we recommend you install with connector support:
pip install -U gretel-trainer[connectors]
You can find more details on connector support here.
MySQL support requires extra configuration, please see the Connectors details for more information.
If you prefer to deploy the Gretel data plane in your own cloud tenant, please go through the Gretel Hybrid instructions first.

Notebook Resources

High Level Flow

  1. Define Source Data automatically using one of our Database Connectors or manually via individual table files.
  2. Create a Relational Model with MultiTable.
  3. Train models and generate data (Synthetics, Transforms, or both).
  4. View results.
  5. Optionally, write output to a database via a Connector.

Demo Database

For examples throughout the Relational documentation, we'll refer to our demo telecommunications database, shown below:
ERD for mock telecommunications database

Define Source Data

The source relational data for a Gretel Relational model is captured in a RelationalData object. The RelationalData instance stores information including the table data, primary and foreign keys, and the relationships between tables.
There are two ways to create the RelationalData instance:
  1. Automatically, using a Connector to a supported database.
  2. Manually, by providing table data in individual files and defining keys and relationships.

Option 1: Database Connector

When you use a database Connector, the keys and relationships between tables are automatically extracted from the database to a RelationalData instance.
Gretel Relational supports 30+ connectors and provides first-class helper functions for popular databases and warehouses such as PostgreSQL, SQLite, MySQL, MariaDB, and Snowflake.

Sample - Connect to SQLite database and extract relational data

from gretel_trainer.relational import sqlite_conn
# Download demo SQLite database:
# !curl -o "telecom.db" "https://gretel-blueprints-pub.s3.us-west-2.amazonaws.com/rdb/ecom_xf.db"
db_path = "telecom.db"
sqlite = sqlite_conn(path=db_path)
relational_data = sqlite.extract()
To learn more about how to read from your own connector, check out Database Connectors.
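Conceptually, what a connector's extraction step does is read each table's primary-key and foreign-key metadata straight from the database schema. As a minimal, self-contained illustration (using Python's built-in sqlite3 module with a hypothetical two-table schema, not the connector's actual implementation):

```python
import sqlite3

# Build a tiny in-memory database with one foreign-key relationship.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE client (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        client_id INTEGER REFERENCES client (client_id)
    );
    """
)

def extract_keys(conn, table):
    """Read primary-key columns and foreign keys for one table via SQLite PRAGMAs."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    pks = [c[1] for c in cols if c[5] > 0]
    # PRAGMA foreign_key_list rows: (id, seq, referred_table, from, to, ...)
    fks = [
        {"constrained": row[3], "referred_table": row[2], "referred": row[4]}
        for row in conn.execute(f"PRAGMA foreign_key_list({table})")
    ]
    return pks, fks

print(extract_keys(conn, "account"))
```

A connector automates this discovery across every table, so the RelationalData instance comes back fully populated with keys and relationships.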

Option 2: Manually define RelationalData

Alternatively, you can create the RelationalData instance from individual files - one per table. When using this approach, you will need to manually define the keys and relationships between tables, as shown below.
Here, an empty RelationalData instance is created. Then, the tables with their primary keys and data are added using add_table, and relationships between those tables are added using add_foreign_key_constraint.
The files are loaded into Pandas DataFrames, one per table. The example below defines RelationalData from CSVs; however, any tabular file type supported by Pandas can be used by swapping pd.read_csv for pd.read_parquet, pd.read_json, etc.
from gretel_trainer.relational import RelationalData
import pandas as pd

file_dir = "/path/to/table_files"

tables = [
    # ("table_name", "primary_key")
    ("account", "account_id"),
    ("client", "client_id"),
    ("invoice", "invoice_id"),
    ("location", "location_id"),
    ("payment", "payment_id"),
]

foreign_keys = [
    {
        "table": "account",
        "constrained_columns": ["client_id"],
        "referred_table": "client",
        "referred_columns": ["client_id"],
    },
    {
        "table": "location",
        "constrained_columns": ["client_id"],
        "referred_table": "client",
        "referred_columns": ["client_id"],
    },
    {
        "table": "invoice",
        "constrained_columns": ["account_id"],
        "referred_table": "account",
        "referred_columns": ["account_id"],
    },
    {
        "table": "payment",
        "constrained_columns": ["invoice_id"],
        "referred_table": "invoice",
        "referred_columns": ["invoice_id"],
    },
]

relational_data = RelationalData()

for table, pk in tables:
    relational_data.add_table(name=table, primary_key=pk, data=pd.read_csv(f"{file_dir}/{table}.csv"))

for foreign_key in foreign_keys:
    relational_data.add_foreign_key_constraint(**foreign_key)
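When defining keys manually, it is easy to introduce a mismatch between a child table's foreign-key column and the referred table. A small pandas sanity check (a hypothetical helper, not part of the Trainer SDK) can catch orphaned key values before you wire up the constraints:

```python
import pandas as pd

def find_orphans(child, child_col, parent, parent_col):
    """Return foreign-key values in the child table with no matching parent row."""
    missing = ~child[child_col].isin(parent[parent_col])
    return child.loc[missing, child_col].tolist()

# Toy data: client_id 9 does not exist in the client table.
client = pd.DataFrame({"client_id": [1, 2, 3]})
account = pd.DataFrame({"account_id": [10, 11, 12], "client_id": [1, 2, 9]})

print(find_orphans(account, "client_id", client, "client_id"))  # [9]
```

Running a check like this per foreign key before training helps ensure the model sees a referentially consistent source database.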

Create Relational Model

After the source data has been defined, the relational model and associated project are created using MultiTable. The MultiTable class is the interface to working with relational data. It is used to control model training and data generation, and to view and evaluate results. A RelationalData instance is required to initialize the MultiTable object. Several other options can be configured; the defaults are shown below as comments.
from gretel_trainer.relational import MultiTable

multitable = MultiTable(
    relational_data,
    # project_display_name="multi-table",
    # gretel_model="actgan",
    # refresh_interval=60,
)
  • project_display_name - the display name for the created project
    • A new project is created for each MultiTable instance, essentially one per Relational model
  • gretel_model - the synthetic model to use: "actgan" (default), "amplify", or "lstm"
  • refresh_interval - the frequency in seconds with which to check models' statuses and update logs
When a MultiTable instance is created, a log appears with the message:
Created project `{project_display_name}` with unique name `{project_name}`.
A local working directory is created and named after the unique project name. This directory is where temporary files and permanent outputs from the MultiTable model will be stored. Additionally, you can view the project in Console using the link https://console.gretel.ai/{project_name} and see outputs under the Data Sources tab at https://console.gretel.ai/{project_name}/data_sources.

Train and Generate Database

Now that your training database and relational model are set up, you can synthesize, transform, or transform then synthesize your database.
  • Synthesize database
  • Transform database
  • Transform and Synthesize database

View Results

Once data generation for Relational Synthetics, Relational Transform, or both has completed, you can view your output database and evaluate the results.
In the Gretel Console, Relational results can be found under the Data Sources tab in your project.

Output Files

The output data is automatically written to the working directory. For Relational Transform, there is a transformed_{table}.csv file per table. For Relational Synthetics, there is a synth_{table}.csv file per table, one relational_report.html file, and cross-table and individual synthetic reports (in JSON and HTML formats) per table.
Output files are also uploaded to the Gretel Cloud in archive files called transform_outputs.tar.gz for transforms and synthetic_outputs.tar.gz for synthetics.
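For the demo database, the per-table output filenames described above can be enumerated ahead of time; a small sketch (the naming pattern is taken from this page, not from the SDK itself):

```python
# Table names from the demo telecommunications database.
tables = ["account", "client", "invoice", "location", "payment"]

# One transformed CSV per table for Relational Transform.
transform_files = [f"transformed_{t}.csv" for t in tables]

# One synthetic CSV per table plus the relational report for Relational Synthetics.
synthetic_files = [f"synth_{t}.csv" for t in tables] + ["relational_report.html"]

print(transform_files[0])   # transformed_account.csv
print(synthetic_files[-1])  # relational_report.html
```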

Relational Report

Created with every Relational Synthetics generation, the Gretel Relational Report provides accuracy and privacy scores to help you verify the quality of your synthetic database.
To view the Relational Report in the notebook:
import IPython
from smart_open import open
report_path = str(multitable._working_dir / multitable._synthetics_run.identifier / "relational_report.html")
IPython.display.HTML(data=open(report_path).read())

Write Output to Database via Connector

Lastly, you have the option of writing the generated data to a database using a connector. The examples below write the synthesized and the transformed telecommunications databases to new SQLite databases. For information about connecting to your own database and more code snippets, check out Write Output to Database via Connector.
# Write the synthesized database to a new SQLite file
output_db_path = "synthetic_telecom.db"
output_conn = sqlite_conn(output_db_path)
output_conn.save(multitable.synthetic_output_tables)

# Write the transformed database to a new SQLite file
output_db_path = "transformed_telecom.db"
output_conn = sqlite_conn(output_db_path)
output_conn.save(multitable.transform_output_tables)