SDK

Detailed information on how to use the Gretel Safe Synthetics SDK.


Overview

Gretel's Safe Synthetics SDK allows you to easily create privacy-safe, synthetic versions of your data. It provides high-level functionality to configure, transform, synthesize, evaluate, and preview synthetic datasets based on real-world data. This documentation will guide you through setting up and using the SDK.

Installation

Read about installing the SDK here. You can begin your notebook with:

%%capture
%pip install -U gretel-client
from gretel_client.navigator_client import Gretel

gretel = Gretel(api_key="prompt")

You can find your API key at https://console.gretel.ai/users/me/key after signing up for a Gretel account.

Running Safe Synthetics

When you run a Safe Synthetics job, you are running a workflow. A workflow is a set of tasks chained together to achieve a final goal. For example, workflows often include reading in source data, redacting personally identifiable information (PII), training a model, generating synthetic data records, and evaluating the results.

The hierarchy is Project -> Workflow -> Workflow Run. A workflow run is one execution of a workflow. The run is immutable, but you can re-run a workflow (with or without changes to the configuration). Doing so will kick off a new Workflow Run inside the same Workflow. You can also create multiple workflows to run different jobs.
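
As a minimal sketch of this hierarchy (assuming ds is a data source you have already defined), calling .create() twice in one session kicks off two runs under the same workflow:

run_1 = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .synthesize() \
    .create()

run_2 = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .synthesize() \
    .create()

# Each run is a separate, immutable execution; both share one workflow.
print(run_1.workflow_id, run_2.workflow_id)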

Base template

The standard workflow template for running Safe Synthetics is:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(my_data_source) \
    .transform() \
    .synthesize() \
    .create()

The above code does the following:

  1. Reads the datasource.

    1. my_data_source can be a Pandas DataFrame, a path to a local file, or a public URL, such as https://gretel-datasets.s3.us-west-2.amazonaws.com/ecommerce_customers.csv (see the sketch after this list).

  2. Creates a holdout dataset.

    1. This is automatic, as part of the .from_data_source() step.

    2. The default holdout is 5% of your data.

    3. The holdout data is later used by Evaluate to generate some of the metrics in the Quality & Privacy Report.

  3. Replaces true identifiers with Transform.

    1. Redacts & replaces true identifiers found in your dataset, based on definitions from common regulations such as GDPR and HIPAA.

    2. The default configuration used can be found here.

  4. Generates a synthetic version of your data with Synthesize.

    1. Creates records that mimic the characteristics and properties of the original data, without mapping rows 1:1.

    2. By default, we use Gretel's flagship model, Tabular Fine-Tuning. This is our most flexible model, supporting a variety of data types including numeric, categorical, text, JSON, and event-driven data.

  5. Produces a Quality & Privacy Report.

    1. This happens automatically.

  6. Kicks off the job via the .create() call.
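
To illustrate the data source types accepted in step 1, here is a short sketch (the DataFrame contents and local file name are hypothetical):

import pandas as pd

# Any of the following works as my_data_source:
my_data_source = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 29]})  # in-memory DataFrame
my_data_source = "./customers.csv"  # path to a local file
my_data_source = "https://gretel-datasets.s3.us-west-2.amazonaws.com/ecommerce_customers.csv"  # public URL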

Viewing results

After kicking off your Safe Synthetics workflow, you will begin to see logs streaming with information about your job as it runs.

Once the job completes, there are several useful methods to help you view your results.

Previewing output data

You can use the following code snippet to preview your synthetic dataset.

synthetic_dataset.dataset.df.head()

Viewing the Quality & Privacy Report

You can see a quick table of top-level metrics for the report by calling:

synthetic_dataset.report.table

To get the raw python dictionary version of the table, you can use:

synthetic_dataset.report.dict
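
For example, a small sketch that pretty-prints that dictionary with the standard library (default=str is a guard in case any values are not natively JSON-serializable):

import json

print(json.dumps(synthetic_dataset.report.dict, indent=2, default=str))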

If you want to view the detailed HTML report in the notebook, you can call:

synthetic_dataset.report.display_in_notebook()

To open the report in a new tab, use:

synthetic_dataset.report.display_in_browser()

Accessing workflow details

You can print the YAML configuration of your workflow with:

print(synthetic_dataset.config_yaml)

You can print out all the steps in your workflow using:

for step in synthetic_dataset.steps:
  print(step.name)

You can get the output from an individual step by calling it by name. The output for a step may be a dataset or a report. For example:

synthetic_dataset.get_step_output("transform").df

Naming your workflow and run

We attempt to provide reasonable names for your workflow and run by default, but you may want to customize those so they are easier for you to find in the future and differentiate from other workflows or runs.

You can use the name parameter to specify a workflow name. You can use the run_name parameter to specify the name for a specific execution of that workflow. If a run_name is not provided, the default is workflow_name_run_1, where the trailing counter is based on how many runs exist in the workflow so far.

These parameters can be provided in the create() step.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create(name="my-overall-workflow-name", run_name="my-workflow-run-name")

Modifying the configuration

You may find that the default settings for the Safe Synthetics workflow need to be modified to meet your needs.

All Safe Synthetics jobs, whether advanced or simple, build off of the Base template shared above:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create()

Holdout

You can adjust the holdout settings by adjusting parameters inside the .from_data_source() call.

To turn off the holdout, call:

.from_data_source(ds, holdout=None)

To adjust the holdout size, you can specify the desired amount as a fraction of the original dataset (e.g. 0.1 for 10%) or as an integer number of rows (e.g. 250 rows; minimum 10):

.from_data_source(ds, holdout=0.1)
.from_data_source(ds, holdout=250)

Alternatively, if you would like to pass in your own holdout dataset instead, you can do so by setting holdout to be a Pandas DataFrame, path to a local file, or public URL.

.from_data_source(ds, holdout=holdout_ds)

In addition, you can set the maximum number of holdout rows (for example, to 2000) by calling:

.from_data_source(ds, max_holdout=2000)

Finally, if your data is event-driven in nature, you can specify the column with which items should be grouped. This ensures that all items with a matching value in that column are either entirely placed in the holdout or entirely placed in the training dataset to be used throughout the rest of the workflow. The example below groups items by the column named "state."

.from_data_source(ds, group_by="state")

Transform

We recommend calling Transform prior to Synthetics to ensure that any personally identifiable information is replaced, so there is no chance the synthetics model could learn the sensitive information.

However, if your data does not contain any sensitive information, you can choose not to run Transform simply by excluding it.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .synthesize() \
    .create()

In the event that you only want to run Transform, we recommend disabling the holdout to ensure all of your dataset rows are included.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform() \
    .create()

If you want to use a different configuration for Transform, there are two options.

First, you can choose from our most popular Transform templates - those with "transform" in the name. These include:

  • Default - Gretel's default configuration includes the identifiers that span across common privacy regulations, such as HIPAA and GDPR. If no configuration is specified, this is the configuration that is automatically used.

  • HIPAA - Redacts and replaces true identifiers using the HIPAA Safe Harbor Method.

  • GDPR - Redacts and replaces true identifiers based on the GDPR.

  • NER Only - Only applies redaction and replacement for free-text columns; this is the recommended option when chaining with the Synthetics model.

You can then reference the template you want to use via:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform("transform/hipaa") \
    .create()

where the portion after the / is the portion after the double underscore __ in the directory.

Second, you can specify your own YAML configuration. For example:

transform_yaml_config = """
globals:
  classify:
    enable: true
    entities:
      - first_name
      - last_name
  ner:
    ner_threshold: 0.3
  locales: [en_US]
steps:
  - vars:
      row_seed: random.random()
    rows:
      update:
        - name: fname
          value: fake.first_name()
        - name: lname
          value: fake.last_name()
        - name: notes
          value: this | fake_entities
"""
synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform(transform_yaml_config) \
    .create()

Synthetics

By default, .synthesize() uses Gretel's flagship model, Tabular Fine-Tuning, without differential privacy applied. However, you may find that a different Synthetics model, or applying differential privacy, is better suited to your use case. You can read about the various Synthetics models here.

If you would like to use the default configuration of a different synthetics model, you can do so by specifying the model name. The options are:

  • "tabular_ft"

  • "text_ft"

  • "tabular_gan"

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize("text_ft") \
    .create()

Alternatively, you can use one of our template configurations to switch to a different synthetics model - for example, if you want the template to apply differential privacy for Tabular Fine-Tuning. You can choose any of the templates from this folder and reference them as model_name/template_name, where template_name is the portion after the double underscore __ in the directory. For example:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize("tabular_ft/differential_privacy") \
    .create()

You can also use a Python dictionary to tweak individual parameters. Any parameters that aren't specified will pick up the backend defaults. The order of arguments is:

  1. Model name (required, but only if specifying either of the following parameters)

  2. Python dictionary (optional)

  3. num_records (optional)

In the example below, we update the num_input_records_to_sample parameter to be 5000, and the num_records to generate to be 1000. Aside from these changes, the default configuration, labeled default in this folder, is used.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize("tabular_ft", {"train": {"params": {"num_input_records_to_sample": 5000}}}, num_records=1000) \
    .create()

Finally, you can specify your own, complete YAML configuration. For example:

synthetics_yaml_config = """
train:
  privacy_params:
    dp: true
    epsilon: 8
  params:
    num_input_records_to_sample: auto
    batch_size: 4
generate:
  num_records: 5000
  use_structured_generation: true
"""
synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform() \
    .synthetics("tabular_ft", synthetics_yaml_config) \
    .create()

Evaluate

If you do not want to generate the Quality & Privacy Report, you can turn off Evaluate by explicitly disabling it:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .evaluate(disable=True) \
    .create()

In some cases, you may already have the synthetic dataset and want to run only Evaluate. To do so, first convert your Pandas DataFrames to Dataset files, and then add the appropriate steps to the workflow.

# Convert any Pandas DataFrames to Datasets
training_file = gretel.files.upload(train_df, "dataset")
holdout_file = gretel.files.upload(holdout_df, "dataset")
synthetic_file = gretel.files.upload(synthetic_df, "dataset")

# Instantiate the Workflow Builder
workflow = gretel.workflows.builder()

# Add Holdout & Evaluate steps
workflow.add_step(gretel.tasks.Holdout(), [training_file.id, holdout_file.id], step_name="holdout")
workflow.add_step(gretel.tasks.EvaluateSafeSyntheticsDataset(), [synthetic_file.id, "holdout"])

# Run workflow
workflow.run(wait_until_done=True)

Create

By default, the .create() call creates the workflow run but does not wait for it to finish before moving on to other cells in a notebook. This means that if your next cell asks for the report, it will likely return an error, since the workflow run has not completed. If you want to wait until the workflow run completes before continuing, you can call .wait_until_done() after creating the dataset. We recommend making that call in a separate cell.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .evaluate() \
    .create()
synthetic_dataset.wait_until_done()

The benefit of using the above approach is that if your workflow does fail and raise an exception, you will still be able to work with the synthetic_dataset object. For example, you could call get_step_output to get the output from an earlier step that succeeded, console_url for a link to the run in the Console, and config or config_yaml for the configuration that was used.
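
For example, a sketch of that debugging pattern (assuming the failed run surfaces as an exception from .wait_until_done()):

try:
    synthetic_dataset.wait_until_done()
except Exception:
    # The run failed, but the object is still usable for debugging.
    print(synthetic_dataset.console_url)  # link to the run in the Console
    print(synthetic_dataset.config_yaml)  # the configuration that was used
    # Retrieve the output of an earlier step that succeeded, e.g. Transform:
    print(synthetic_dataset.get_step_output("transform").df.head())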

Alternatively, you can specify wait_until_done=True inside the .create() call. This does not have the benefit described above, but it ensures that the notebook waits for the workflow run to finish before running subsequent cells.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create(wait_until_done=True)

Advanced use cases

Create a new workflow for each run

By default, new Workflow Runs are created under the same Workflow for a given session.

If you want to create a new Workflow per Run, you can pass new_workflow=True when creating the Safe Synthetic dataset:

synthetic_dataset = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create(new_workflow=True)

print(synthetic_dataset.workflow_id)

synthetic_dataset = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .transform() \
    .synthesize("tabular_gan") \
    .create(new_workflow=True)

print(synthetic_dataset.workflow_id)

Load an existing workflow run

You can load an existing workflow run by referencing the Workflow Run ID, which can be found in the Run Details page in the Console or in the logs of the workflow run.

Workflow Run IDs begin with "wr_".

synthetic_dataset = gretel.workflows.get_workflow_run("wr_2u6g4ZBljmLHm8WWH6RAHuFDJVg")

Once loaded, you can reference the outputs as described earlier, for example:

synthetic_dataset.dataset.df.head()
synthetic_dataset.report.table
synthetic_dataset.report.display_in_notebook()
