Relational Synthetics

Synthesize multi-table databases while maintaining referential integrity.

Introduction

Gretel Relational Synthetics leverages our library of generative AI models to synthesize large multi-table databases while maintaining referential integrity and statistical accuracy.
Beyond synthesizing your database, Gretel Relational also makes it easy to transform and then synthesize a database for stronger privacy protection (for example, to support GDPR compliance). We cover how to transform and synthesize a database in Relational Transform.

Recap: Getting Started

On the previous page, we covered installing Gretel Relational, defining your source database, and creating a Relational model. A brief recap of the code is below, again using our telecommunications Demo Database as an example. The example defines the source data using a SQLite connector. For more information on using other connectors or defining data manually, refer to Define Source Data.
from gretel_trainer.relational import *
connector = sqlite_conn("telecom.db")
relational_data = connector.extract()
multitable = MultiTable(
    relational_data,
    # project_display_name="multi-table",
    # gretel_model="actgan",
    # refresh_interval=60,
)
The MultiTable object is the interface for Relational Synthetics, primarily via the train and generate methods. You already specified which gretel_model to use when initializing the MultiTable instance. For Relational Synthetics, the gretel_model options are "actgan", "amplify", and "lstm".

Synthesize a Database

From here, you really only need two lines of code to generate relational synthetic data.
multitable.train_synthetics()
synthetic_tables = multitable.generate()
There are a few parameter options for these methods, and some other useful functions that we’ll cover below.

Training

The train_synthetics method trains the synthetic sub-models. By default, one model is trained per table, using the model type set by gretel_model. If there are tables you do not want synthesized (such as tables containing static reference data), pass either the optional only argument or the optional ignore argument to control which tables are trained.
multitable.train_synthetics(
    # only={"table_a", "table_b"},
    # ignore={"table_x", "table_y"},
)
Logs showing the status of each table's model training are updated periodically according to the refresh_interval set in the MultiTable instance.
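To make the effect of only and ignore concrete, here is a minimal sketch in plain Python of how the two arguments narrow the set of tables. It is an illustration, not Gretel's implementation: the helper name tables_to_train, the example table names, and the rule that passing both arguments raises an error are all assumptions.

```python
from typing import Iterable, Optional, Set


def tables_to_train(
    all_tables: Iterable[str],
    only: Optional[Set[str]] = None,
    ignore: Optional[Set[str]] = None,
) -> Set[str]:
    """Hypothetical helper: return the table names a training run would cover."""
    all_tables = set(all_tables)
    if only is not None and ignore is not None:
        # Assumption: the two arguments are alternatives, so reject both at once.
        raise ValueError("Pass either `only` or `ignore`, not both.")
    # Flag names that do not exist in the database, whichever argument was used.
    unknown = ((only or set()) | (ignore or set())) - all_tables
    if unknown:
        raise ValueError(f"Unknown table(s): {sorted(unknown)}")
    if only is not None:
        return set(only)
    return all_tables - (ignore or set())


# Hypothetical table names, used only for illustration.
demo_tables = {"client", "account", "plan"}
print(sorted(tables_to_train(demo_tables, ignore={"plan"})))
```

Tables left out this way are not trained, so they are also not synthesized later.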

Retrain failed tables

What happens if one or more tables fail to train? This is rare, but one failed table should not derail training for the entire database. That is where retrain_tables comes in. When a model fails to train, the cause is most likely an issue with the source data. Rather than editing the database and starting from scratch, you can clean or fix the source data for the failed table and pass it to retrain_tables. Models that trained successfully are left as is; for each table you supply, the new data replaces the old data in the RelationalData source and a new model is trained on it.
retrain_tables takes a dictionary argument, tables, where each key is a table name and each value is a DataFrame containing that table's data.
import pandas as pd
new_table = pd.read_csv("/path/to/new/csv/new_table.csv")  # Updated DataFrame
multitable.retrain_tables(tables={"table": new_table})

Generate

Once training is complete, you can generate relational synthetic data using the generate method. This method has two optional parameters.
synthetic_tables = multitable.generate(
    # record_size_ratio=1,
    # preserve_tables=[],
)
  • record_size_ratio - optional float used to control how much data to generate. Default (1) generates the same number of synthetic records as training records.
  • preserve_tables - optional list of table names (strings) to exclude from synthesis, so that only a subset of the tables in the database is synthesized. This list effectively adds to any tables that were omitted from training earlier.
When you generate synthetic data, you choose the amount of data to generate via record_size_ratio. You can choose to replicate the size of your database, generate less data (subset), or create more data than your original database.
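As a rough illustration of how record_size_ratio scales a database, the sketch below estimates the number of synthetic records per table from the source row counts. The helper name expected_record_counts, the example table names and counts, and the rounding to the nearest whole record are assumptions for illustration; the actual counts produced by generate may differ slightly.

```python
from typing import Dict


def expected_record_counts(
    source_counts: Dict[str, int], record_size_ratio: float = 1.0
) -> Dict[str, int]:
    """Hypothetical helper: estimate synthetic row counts per table."""
    if record_size_ratio <= 0:
        raise ValueError("record_size_ratio must be positive")
    # Assumption: each table is scaled independently and rounded to a whole record.
    return {
        table: round(count * record_size_ratio)
        for table, count in source_counts.items()
    }


# Hypothetical source sizes, used only for illustration.
source = {"client": 1000, "invoice": 4500}
print(expected_record_counts(source, 0.5))  # subset: half the records per table
print(expected_record_counts(source, 2))    # larger database: twice the records
```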

Subset

Synthetic subsetting lets you shrink your database proportionally, producing anonymized data that looks and feels like production data without exposing the original records. Rather than relying on random sampling, as many database subsetting tools do, Gretel Relational uses generative AI models to produce the subset.
To generate a synthetic subset of your database, set the record_size_ratio parameter to a value less than 1. For example, multitable.generate(record_size_ratio=0.5) will generate a database half the size of the input data: each table in the synthesized database will have half as many records as the corresponding source table. Gretel maintains both referential integrity and the statistical accuracy of your original database at the smaller size.
Some examples of use cases for generating a synthetic subset of a database include:
  • Software Development and Testing - Speed up development and testing by working with a smaller, statistically accurate database;
  • Resource Constraints - Reduce costs and improve performance by generating smaller databases instead of storing and processing large databases that can be resource-intensive and expensive;
  • Minimize Risk - Subset the data accessible in lower environments to reduce risk in the event of a breach.

Generate a larger database

To generate a synthetic database larger than your real-world data, set the record_size_ratio parameter to a value greater than 1. For example, multitable.generate(record_size_ratio=2) will generate a database twice the size of the input data: each table in the synthesized database will have twice as many records as the corresponding source table. Gretel maintains both referential integrity and the statistical accuracy of your original database at the larger size.
Some examples of use cases that may require a larger database include:
  • Load Testing - Data is expensive, and critical to testing the robustness of an application. Creating large amounts of synthetic data can make load testing an application easier and cheaper;
  • Simulate Real-World Scenarios - For pre-production environments, generating additional data that mimics real-world scenarios and edge cases allows for more comprehensive, robust testing;
  • Improve ML Models (for example, fraud detection) - Generating a larger synthetic database that simulates fraudulent transactions and patterns can improve the accuracy of fraud detection systems by providing more data to learn from.

Outputs

The synthetic data is automatically written to the working directory as synth_{table}.csv. These files are also uploaded to the Gretel Cloud in an archive file called synthetic_outputs.tar.gz. You can find and download this file under the "Data Sources" tab in your project. You can optionally write the synthetic data to a database using a Connector. The process for using output Connectors is detailed here.
In the Gretel Console, Relational results can be found under the Data Sources tab in your project.
In addition to synthetic tables, a Gretel Relational Report and two Synthetic Data Quality Reports per table are created to help you assess the accuracy and privacy of your synthetic database. Read more about evaluating the results of your data here.
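Alongside the reports, you may want a quick local spot check of referential integrity on the synth_{table}.csv files. The sketch below uses only the Python standard library and is not part of Gretel Relational; the helper names load_rows and orphaned_keys, and the table and column names in the demo, are hypothetical.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts, one per record."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def orphaned_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return foreign key values in the child rows that match no parent row."""
    parent_keys = {row[pk_column] for row in parent_rows}
    # Empty foreign keys are treated as NULL and skipped.
    child_keys = {row[fk_column] for row in child_rows if row[fk_column]}
    return sorted(child_keys - parent_keys)


# In-memory demo with hypothetical tables; CSV values are read as strings.
clients = [{"id": "1"}, {"id": "2"}]
invoices = [
    {"invoice_id": "10", "client_id": "1"},
    {"invoice_id": "11", "client_id": "3"},  # no client with id 3
    {"invoice_id": "12", "client_id": ""},   # NULL foreign key, ignored
]
print(orphaned_keys(invoices, "client_id", clients, "id"))
```

With real output you would pass load_rows("synth_invoice.csv") and load_rows("synth_client.csv") (file names assumed here) instead of the in-memory lists. An empty result means every foreign key value in the child table has a matching parent record.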
To view the Relational Report in the notebook:
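from IPython.display import HTML
from smart_open import open  # also reads remote paths, unlike the built-in open
# Note: _working_dir and _synthetics_run are private attributes and may change between releases.
report_path = str(multitable._working_dir / multitable._synthetics_run.identifier / "relational_report.html")
HTML(data=open(report_path).read())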