SDK
Detailed information on how to use the Gretel Safe Synthetics SDK.
Overview
Gretel's Safe Synthetics SDK allows you to easily create privacy-safe, synthetic versions of your data. It provides high-level functionality to configure, transform, synthesize, evaluate, and preview synthetic datasets based on real-world data. This documentation will guide you through setting up and using the SDK.
Installation
Read about installing the SDK here. You can begin your notebook with:
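A minimal setup sketch follows. The package name, import path, and client entry point are assumptions based on recent gretel-client releases; check the installation page for the current form.

```python
# Install first (shell): pip install -U gretel-client
from gretel_client.navigator_client import Gretel

# "prompt" asks for your API key interactively; you can also pass the key directly.
gretel = Gretel(api_key="prompt")
```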
You can find your API key at https://console.gretel.ai/users/me/key after signing up for a Gretel account.
Running Safe Synthetics
When you run a Safe Synthetics job, you are running a workflow. A workflow is a set of tasks chained together to achieve a final goal. For example, workflows often include reading in source data, redacting personally identifiable information (PII), training a model, generating synthetic data records, and evaluating the results.
The hierarchy is Project -> Workflow -> Workflow Run. A workflow run is one execution of a workflow. The run is immutable, but you can re-run a workflow (with or without changes to the configuration). Doing so will kick off a new Workflow Run inside the same Workflow. You can also create multiple workflows to run different jobs.
Base template
The standard workflow template for running Safe Synthetics is:
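A sketch of the base template. The method names follow the ones used throughout this guide, but the entry point (gretel.safe_synthetic_dataset) is an assumption; treat the exact chain as illustrative and confirm against the SDK reference.

```python
synthetic_dataset = (
    gretel.safe_synthetic_dataset          # assumed entry point on the client
    .from_data_source(my_data_source)      # read data and carve out the holdout
    .transform()                           # redact/replace true identifiers
    .synthetics()                          # train and generate synthetic records
    .create()                              # kick off the workflow run
)
```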
The above code does the following:
1. Reads the data source. my_data_source can be a Pandas DataFrame, a path to a local file, or a public URL, such as https://gretel-datasets.s3.us-west-2.amazonaws.com/ecommerce_customers.csv.
2. Creates a holdout dataset. This happens automatically as part of the .from_data_source() step. The default holdout is 5% of your data, and the holdout data is later used by Evaluate to generate some of the metrics in the Quality & Privacy Report.
3. Replaces true identifiers with Transform. Transform redacts and replaces true identifiers found in your dataset, based on definitions from common regulations such as GDPR and HIPAA. The default configuration used can be found here.
4. Generates a synthetic version of your data with Synthesize. This creates records that mimic the characteristics and properties of the original data, without mapping rows 1:1. By default, we use Gretel's flagship model, Tabular Fine-Tuning. This is our most flexible model, supporting a variety of data types including numeric, categorical, text, JSON, and event-driven data.
5. Produces a Quality & Privacy Report. This happens automatically.
6. Kicks off the job via the .create() call.
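To make the holdout step concrete, here is what a 5% split looks like in plain pandas. This is purely illustrative; the SDK performs this split for you inside .from_data_source().

```python
import pandas as pd

# Toy dataset standing in for your real source data
df = pd.DataFrame({"value": range(100)})

# Hold out 5% of rows (the SDK default) for later evaluation
holdout = df.sample(frac=0.05, random_state=42)
train = df.drop(holdout.index)

print(len(train), len(holdout))  # 95 5
```

The held-out rows are never seen during training, which is what lets Evaluate compare synthetic output against genuinely unseen data.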
Viewing results
After kicking off your Safe Synthetics workflow, you will begin to see logs streaming with information about your job as it runs.
Once the job completes, there are several useful methods to help you view your results.
Previewing output data
You can use the following code snippet to preview your synthetic dataset.
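For example (the dataset/df attribute names on the result object are assumptions; adjust to your SDK version):

```python
# Load the synthetic output as a DataFrame and preview the first rows
synthetic_df = synthetic_dataset.dataset.df
synthetic_df.head()
```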
Viewing the Quality & Privacy Report
You can see a quick table of top-level metrics for the report by calling:
To get the raw python dictionary version of the table, you can use:
If you want to view the detailed HTML report in the notebook, you can call:
To open the report in a new tab, use:
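A sketch of the four report views described above, in order. The report attribute and method names are assumptions; consult the SDK reference for the exact names.

```python
synthetic_dataset.report.table                  # quick table of top-level metrics
synthetic_dataset.report.dict                   # raw Python dictionary version
synthetic_dataset.report.display_in_notebook()  # detailed HTML report, inline
synthetic_dataset.report.display_in_browser()   # open the report in a new tab
```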
Accessing workflow details
You can print the YAML configuration of your workflow with:
You can print out all the steps in your workflow using:
You can get the output from an individual step by calling it by name. The output for a step may be a dataset or a report. For example:
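A sketch of the accessors above. config_yaml and get_step_output are the names used later in this guide; the steps attribute and the step names "transform" and "evaluate" are assumptions.

```python
print(synthetic_dataset.config_yaml)            # YAML configuration of the workflow
print(synthetic_dataset.steps)                  # assumed: lists the workflow's steps
synthetic_dataset.get_step_output("transform")  # dataset output of the Transform step
synthetic_dataset.get_step_output("evaluate")   # report output of the Evaluate step
```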
Naming your workflow and run
We attempt to provide reasonable names for your workflow and run by default, but you may want to customize them so they are easier to find later and to differentiate from other workflows and runs.
You can use the name parameter to specify a workflow name, and the run_name parameter to specify the name for a specific execution of that workflow. If a run_name is not provided, the default is workflow_name_run_1, with a counter based on how many runs already exist in the workflow. These parameters can be provided in the create() step.
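For example (the workflow and run names here are placeholders; name and run_name are the parameters described above):

```python
synthetic_dataset = (
    gretel.safe_synthetic_dataset
    .from_data_source(my_data_source)
    .transform()
    .synthetics()
    .create(name="customer-data-ss", run_name="baseline-run")
)
```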
Modifying the configuration
You may find that the default settings for the Safe Synthetics workflow need to be modified to meet your needs.
All Safe Synthetics jobs, whether advanced or simple, build off of the Base template shared above:
Holdout
You can adjust the holdout settings via parameters inside the .from_data_source() call.
To turn off the holdout, call:
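A sketch, assuming holdout=None is how the holdout parameter (introduced below) is disabled:

```python
gretel.safe_synthetic_dataset.from_data_source(my_data_source, holdout=None)
```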
To adjust the holdout size, you can specify the desired amount as a percentage of the original dataset (e.g. 10%) or an integer number of rows (e.g. 250 rows; minimum 10):
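Sketches of both forms, assuming the holdout parameter accepts a fraction or a row count:

```python
# 10% of the dataset held out
gretel.safe_synthetic_dataset.from_data_source(my_data_source, holdout=0.1)

# or a fixed number of rows (minimum 10)
gretel.safe_synthetic_dataset.from_data_source(my_data_source, holdout=250)
```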
Alternatively, if you would like to pass in your own holdout dataset instead, you can do so by setting holdout
to be a Pandas DataFrame, path to a local file, or public URL.
In addition, you can set the maximum number of holdout rows (for example, to 2000) by calling:
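A sketch; the parameter name max_holdout is a hypothetical stand-in for whatever the SDK actually calls this cap:

```python
gretel.safe_synthetic_dataset.from_data_source(my_data_source, max_holdout=2000)
```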
Finally, if your data is event-driven in nature, you can specify the column with which items should be grouped. This ensures that all items with a matching value in that column are either entirely placed in the holdout or entirely placed in the training dataset to be used throughout the rest of the workflow. The example below groups items by the column named "state."
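The SDK parameter for group-based splitting isn't named here, but the guarantee itself is easy to illustrate in plain pandas: every row sharing a "state" value lands entirely on one side of the split.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX", "TX"],
    "value": [1, 2, 3, 4, 5, 6],
})

# Choose whole groups for the holdout rather than individual rows
holdout_states = {"TX"}
holdout = df[df["state"].isin(holdout_states)]
train = df[~df["state"].isin(holdout_states)]

print(sorted(set(train["state"])))  # ['CA', 'NY']
```

Because the split happens at the group level, no "state" ever appears in both training and holdout data, which matters for event-driven data where rows within a group are correlated.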
Transform
We recommend running Transform before Synthetics so that any personally identifiable information is replaced first, removing any chance that the synthetics model could learn the sensitive information.
However, if your data does not contain any sensitive information, you can choose not to run Transform simply by excluding it.
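For example, the base template without the Transform step (a sketch; the chain otherwise matches the base template described above):

```python
synthetic_dataset = (
    gretel.safe_synthetic_dataset
    .from_data_source(my_data_source)
    .synthetics()
    .create()
)
```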
In the event that you only want to run Transform, we recommend disabling the holdout to ensure all of your dataset rows are included.
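A Transform-only sketch, with the holdout disabled as recommended (holdout=None is an assumption about how the holdout parameter is turned off):

```python
transformed_dataset = (
    gretel.safe_synthetic_dataset
    .from_data_source(my_data_source, holdout=None)
    .transform()
    .create()
)
```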
If you want to use a different configuration for Transform, there are two options.
First, you can choose from our most popular Transform templates, listed with the "transform" prefix here.
These include:
Default - Gretel's default configuration covers the identifiers that span common privacy regulations, such as HIPAA and GDPR. If no configuration is specified, this is the one used automatically.
HIPAA - Redacts and replaces true identifiers using the HIPAA Safe Harbor Method
GDPR - Redacts and replaces true identifiers based on the GDPR
NER Only - Only applies redaction and replacement for free text columns; recommended option when chaining with the Text Fine-Tuning Synthetics model
You can then reference the template you want to use via:
where the portion after the / is the portion after the double underscore __ in the directory.
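For example, referencing the HIPAA template might look like the following ("transform/hipaa" is inferred from the naming rule above; verify the exact template name in the templates directory):

```python
.transform("transform/hipaa")  # HIPAA Safe Harbor template
```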
Second, you can specify your own YAML configuration. For example:
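An illustrative fragment only, loosely shaped like a Transform v2 configuration; consult the Transform documentation for the authoritative schema and for how the YAML (or equivalent dictionary) is passed into .transform().

```yaml
globals:
  classify:
    enable: true
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
```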
Synthetics
By default, .synthetics() uses Gretel's flagship model, Tabular Fine-Tuning, without differential privacy applied. However, a different Synthetics model, or applying differential privacy, may be better suited to your use case. You can read about the various Synthetics models here.
If you would like to use the default configuration of a different synthetics model, you can do so by specifying the model name. The options are:
"tabular_ft"
"text_ft"
"tabular_gan"
Alternatively, you can use one of our template configurations to switch to a different synthetics model, for example, if you want the template to apply differential privacy for Tabular Fine-Tuning. You can choose any of the templates from this folder and reference them as model_name/template_name, where template_name is the portion after the double underscore __ in the directory. For example:
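A sketch; the template name below is hypothetical and should be checked against the templates folder:

```python
.synthetics("tabular_ft/differential_privacy")
```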
You can also use a Python dictionary to tweak individual parameters; any that aren't specified will pick up the backend defaults. The order of parameters is:
Model name (required, but only if specifying either of the following parameters)
Python dictionary (optional)
num_records (optional)
In the example below, we update the num_input_records_to_sample parameter to 5000 and set num_records to generate to 1000. Aside from these changes, the default configuration (labeled default in this folder) is used.
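A sketch of that call. The parameter order (model name, dictionary, num_records) comes from the list above; the nesting of the dictionary is an assumption, so check a template for the real structure.

```python
.synthetics(
    "tabular_ft",
    {"train": {"params": {"num_input_records_to_sample": 5000}}},
    num_records=1000,
)
```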
Finally, you can specify your own, complete YAML configuration. For example:
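An illustrative shape only, echoing the dictionary example above; the real schema is defined by the templates in the folder linked above.

```yaml
schema_version: "1.0"
models:
  - tabular_ft:
      train:
        params:
          num_input_records_to_sample: 5000
      generate:
        num_records: 1000
```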
Evaluate
If you do not want to generate the Quality & Privacy Report, you can turn off Evaluate by explicitly disabling it:
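A sketch; the .evaluate() method and its disable parameter are assumptions about how explicit disabling is expressed:

```python
synthetic_dataset = (
    gretel.safe_synthetic_dataset
    .from_data_source(my_data_source)
    .transform()
    .synthetics()
    .evaluate(disable=True)
    .create()
)
```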
Create
By default, the .create() call creates the workflow run but does not wait for it to finish before moving on to other cells in a notebook. This means that if your next cell asks for the report, it will likely return an error, since the workflow run has not completed. If you want to wait until the workflow run completes before continuing, you can use .wait_until_done() after creating the dataset. We recommend making that call in a separate cell.
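A sketch of the two-cell pattern (the chain is the base template; .wait_until_done() is the method named above):

```python
# Cell 1: kick off the run (returns immediately)
synthetic_dataset = (
    gretel.safe_synthetic_dataset
    .from_data_source(my_data_source)
    .transform()
    .synthetics()
    .create()
)

# Cell 2: block until the run finishes
synthetic_dataset.wait_until_done()
```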
The benefit of using the above approach is that if your workflow does hit an error and raise an exception, you will still be able to work with the synthetic_dataset object. For example, you could call get_step_output to get the output from an earlier step that succeeded, console_url for the link to the run in the Console, and config or config_yaml for the workflow configuration.
Alternatively, you can specify wait_until_done=True inside the create() call. This does not have the benefit described above, but it ensures that the notebook waits to run future cells until the workflow run has finished.
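For example:

```python
.create(wait_until_done=True)
```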
Advanced use cases
By default, new Workflow Runs are created under the same Workflow for a given session.
If you want to create a new Workflow per Run, you can pass new_workflow=True
when creating the Safe Synthetic dataset:
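For example:

```python
.create(new_workflow=True)
```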
You can load an existing workflow run by referencing the Workflow Run ID, which can be found in the Run Details page in the Console or in the logs of the workflow run.
Workflow Run IDs begin with "wr_".
Once loaded, you can then reference the output, as described earlier, such as:
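A sketch of loading a run and reading its output. The accessor gretel.workflows.get_workflow_run is a hypothetical name for whatever lookup the SDK provides; the "wr_..." ID is a placeholder for your own run ID.

```python
workflow_run = gretel.workflows.get_workflow_run("wr_...")

# Then reference outputs as described earlier, e.g. the report table
workflow_run.report.table
```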