SDK

Detailed information on how to use the Gretel Safe Synthetics SDK.

Overview

Gretel's Safe Synthetics SDK allows you to easily create privacy-safe, synthetic versions of your data. It provides high-level functionality to configure, transform, synthesize, evaluate, and preview synthetic datasets based on real-world data. This documentation will guide you through setting up and using the SDK.

Installation

Read about installing the SDK here. You can begin your notebook with:

%%capture
%pip install -U gretel-client
from gretel_client.navigator_client import Gretel

gretel = Gretel(api_key="prompt")

You can find your API key at https://console.gretel.ai/users/me/key after signing up for a Gretel account.
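If you prefer not to be prompted for the key, a minimal sketch that reads it from an environment variable instead (using GRETEL_API_KEY as the variable name here; any variable you export works):

import os

from gretel_client.navigator_client import Gretel

# Pass the key explicitly, read from the environment rather than an interactive prompt
gretel = Gretel(api_key=os.environ["GRETEL_API_KEY"])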

Running Safe Synthetics

When you run a Safe Synthetics job, you are running a workflow. A workflow is a set of tasks chained together to execute a final goal. For example, workflows often include reading in source data, redacting personally identifiable information (PII), training a model, generating synthetic data records, and evaluating the results.

The hierarchy is Project -> Workflow -> Workflow Run. A workflow run is one execution of a workflow. The run is immutable, but you can re-run a workflow (with or without changes to the configuration). Doing so will kick off a new Workflow Run inside the same Workflow. You can also create multiple workflows to run different jobs.

Base template

The standard workflow template for running Safe Synthetics is:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(my_data_source) \
    .transform() \
    .synthesize() \
    .create()

The above code does the following:

  1. Reads the datasource.

    1. my_data_source can be a Pandas DataFrame, a path to a local file, or a public URL, such as https://gretel-datasets.s3.us-west-2.amazonaws.com/ecommerce_customers.csv (see the sketch after this list).

  2. Creates a holdout dataset.

    1. This is automatic, as part of the .from_data_source() step.

    2. The default holdout is 5% of your data.

    3. The holdout data is later used by Evaluate to generate some of the metrics in the Quality & Privacy Report.

  3. Replaces true identifiers with Transform.

    1. Redacts & replaces true identifiers found in your dataset, based on definitions from common regulations such as GDPR and HIPAA.

    2. The default configuration used can be found here.

  4. Generates a synthetic version of your data with Synthesize.

    1. Generates a synthetic version of your data, creating records that mimic the characteristics and properties of the original data, without mapping rows 1:1.

    2. By default, we use Gretel's flagship model, Tabular Fine-Tuning. This is our most flexible model, supporting a variety of data types including numeric, categorical, text, JSON, and event-driven data.

  5. Produces a Quality & Privacy Report.

    1. This happens automatically.

  6. Kicks off the job via the .create() call.
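As a concrete sketch of the accepted data source forms (the URL is the example dataset from above; the local path is illustrative):

import pandas as pd

# Option 1: a Pandas DataFrame
my_data_source = pd.read_csv("https://gretel-datasets.s3.us-west-2.amazonaws.com/ecommerce_customers.csv")

# Option 2: a path to a local file (illustrative path)
# my_data_source = "./ecommerce_customers.csv"

# Option 3: a public URL
# my_data_source = "https://gretel-datasets.s3.us-west-2.amazonaws.com/ecommerce_customers.csv"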

Viewing results

After kicking off your Safe Synthetics workflow, you will begin to see logs streaming with information about your job as it runs.

Once the job completes, there are several useful methods to help you view your results.

Previewing output data

You can use the following code snippet to preview your synthetic dataset.

synthetic_dataset.dataset.df.head()

Viewing the Quality & Privacy Report

You can see a quick table of top-level metrics for the report by calling:

synthetic_dataset.report.table

To get the raw Python dictionary version of the table, you can use:

synthetic_dataset.report.dict
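For example, to pretty-print the raw dictionary and inspect which metrics it contains:

import json

# default=str guards against any non-JSON-serializable values;
# key names can vary by report version, so inspect before relying on them
print(json.dumps(synthetic_dataset.report.dict, indent=2, default=str))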

If you want to view the detailed HTML report in the notebook, you can call:

synthetic_dataset.report.display_in_notebook()

To open the report in a new tab, use:

synthetic_dataset.report.display_in_browser()

Accessing workflow details

You can print the YAML configuration of your workflow with:

print(synthetic_dataset.config_yaml)

You can print out all the steps in your workflow using:

for step in synthetic_dataset.steps:
  print(step.name)

You can get the output from an individual step by calling it by name. The output for a step may be a dataset or a report. For example:

synthetic_dataset.get_step_output("transform").df

Naming your workflow and run

We attempt to provide reasonable names for your workflow and run by default, but you may want to customize those so they are easier for you to find in the future and differentiate from other workflows or runs.

You can use the name parameter to specify a workflow name. You can use the run_name parameter to specify the name for a specific execution of that workflow. If a run_name is not provided, the default is the workflow name followed by a run counter (e.g. workflow_name_run_1), incremented based on how many runs already exist in the workflow.

These parameters can be provided in the create() step.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create(name="my-overall-workflow-name", run_name="my-workflow-run-name")

Modifying the configuration

You may find that the default settings for the Safe Synthetics workflow need to be modified to meet your needs.

All Safe Synthetics jobs, whether advanced or simple, build off of the Base template shared above:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create()

Holdout

You can adjust the holdout settings via parameters inside the .from_data_source() call.

To turn off the holdout, call:

.from_data_source(ds, holdout=None)

To adjust the holdout size, you can specify the desired amount as a percentage of the original dataset (e.g. 10%) or an integer number of rows (e.g. 250 rows; minimum 10):

.from_data_source(ds, holdout=0.1)
.from_data_source(ds, holdout=250)

Alternatively, if you would like to pass in your own holdout dataset instead, you can do so by setting holdout to be a Pandas DataFrame, path to a local file, or public URL.

.from_data_source(ds, holdout=holdout_ds)

In addition, you can set the maximum number of holdout rows (for example, to 2000) by calling:

.from_data_source(ds, max_holdout=2000)

Finally, if your data is event-driven in nature, you can specify the column by which records should be grouped. This ensures that all records with a matching value in that column are placed entirely in the holdout or entirely in the training dataset used throughout the rest of the workflow. The example below groups records by the column named "state".

.from_data_source(ds, group_by="state")
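These holdout options compose with the rest of the base template. For example, a sketch combining a 10% holdout with grouping by the "state" column:

synthetic_dataset = gretel.safe_synthetic_dataset \
    .from_data_source(ds, holdout=0.1, group_by="state") \
    .transform() \
    .synthesize() \
    .create()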

Transform

We recommend running Transform prior to Synthetics to replace any personally identifiable information, ensuring there is no chance the synthetics model could learn sensitive values.

However, if your data does not contain any sensitive information, you can choose not to run Transform simply by excluding it.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .synthesize() \
    .create()

In the event that you only want to run Transform, we recommend disabling the holdout to ensure all of your dataset rows are included.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform() \
    .create()

If you want to use a different configuration for Transform, there are two options.

First, you can choose from our most popular Transform templates - those whose names begin with "transform" here.

These include:

  1. Default - Gretel's default configuration, which covers the identifiers that appear across common privacy regulations, such as HIPAA and GDPR

    1. If no configuration is specified, this is the configuration that is automatically used.

  2. HIPAA - Redacts and replaces true identifiers using the HIPAA Safe Harbor Method

  3. GDPR - Redacts and replaces true identifiers based on the GDPR

  4. NER Only - Only applies redaction and replacement to free-text columns; the recommended option when chaining with the Text Fine-Tuning Synthetics model (see the sketch below)

You can then reference the template you want to use via:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform("transform/hipaa") \
    .create()

where the portion after the / is the portion after the double underscore __ in the directory.
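For instance, to chain the NER Only template with the Text Fine-Tuning model, a sketch (the template name "transform/ner_only" is assumed from the naming rule above; verify it against the templates folder):

# "transform/ner_only" below is an assumed template name
synthetic_dataset = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .transform("transform/ner_only") \
    .synthesize("text_ft") \
    .create()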

Second, you can specify your own YAML configuration. For example:

transform_yaml_config = """
globals:
  classify:
    enable: true
    entities:
      - first_name
      - last_name
  ner:
    ner_threshold: 0.3
  locales: [en_US]
steps:
  - vars:
      row_seed: random.random()
    rows:
      update:
        - name: fname
          value: fake.first_name()
        - name: lname
          value: fake.last_name()
        - name: notes
          value: this | fake_entities
"""
synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform(transform_yaml_config) \
    .create()

Synthetics

By default, .synthesize() uses Gretel's flagship model, Tabular Fine-Tuning, without differential privacy applied. However, you may find that a different Synthetics model, or applying differential privacy, is better suited for your use case. You can read about the various Synthetics models here.

If you would like to use the default configuration of a different synthetics model, you can do so by specifying the model name. The options are:

  • "tabular_ft"

  • "text_ft"

  • "tabular_gan"

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize("text_ft") \
    .create()

Alternatively, you can use one of our template configurations to switch to a different synthetics model or setup - for example, if you want the template to apply differential privacy for Tabular Fine-Tuning. You can choose any of the templates from this folder and reference them as model_name/template_name, where template_name is the portion after the double underscore __ in the directory. For example:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize("tabular_ft/differential_privacy") \
    .create()

You can also use a Python dictionary to tweak individual parameters. Any parameters that aren't specified will pick up the backend defaults. The order of parameters is:

  1. Model name (required, but only if specifying either of the following parameters)

  2. Python dictionary (optional)

  3. num_records (optional)

In the example below, we set the num_input_records_to_sample parameter to 5000 and the number of records to generate (num_records) to 1000. Aside from these changes, the default configuration, labeled default in this folder, is used.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize("tabular_ft", {"train": {"params": {"num_input_records_to_sample": 5000}}}, num_records=1000) \
    .create()
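If you only want to change how many records are generated, you can pass num_records alongside the model name and omit the dictionary, since it is optional:

synthetic_dataset = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .transform() \
    .synthesize("tabular_ft", num_records=1000) \
    .create()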

Finally, you can specify your own, complete YAML configuration. For example:

synthetics_yaml_config = """
train:
  privacy_params:
    dp: true
    epsilon: 8
  params:
    num_input_records_to_sample: auto
    batch_size: 4
generate:
  num_records: 5000
  use_structured_generation: true
"""
synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds, holdout=None) \
    .transform() \
    .synthesize("tabular_ft", synthetics_yaml_config) \
    .create()

Evaluate

If you do not want to generate the Quality & Privacy Report, you can turn off Evaluate by explicitly disabling it:

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .evaluate(disable=True) \
    .create()

Create

By default, the .create() call creates the workflow run but does not wait for it to finish before moving on to other cells in a notebook. This means that if your next cell asks for the report, it will likely return an error, since the workflow run has not completed. If you want to wait until the workflow run completes before continuing, you can use .wait_until_done() after creating the dataset. We recommend making that call in a separate cell.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .evaluate() \
    .create()
synthetic_dataset.wait_until_done()

The benefit of using the above approach is that if your workflow does hit an error and raise an exception, you will still be able to work with the synthetic_dataset object. For example, you could call get_step_output to get the output from an earlier step that succeeded, console_url for a link to the run in the Console, and config or config_yaml.
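For example, whether the run succeeded or failed, you can inspect it with the accessors described above:

# Link to the run in the Gretel Console
print(synthetic_dataset.console_url)

# Workflow configuration as YAML
print(synthetic_dataset.config_yaml)

# Output of an earlier step that succeeded
synthetic_dataset.get_step_output("transform").df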

Alternatively, you can specify wait_until_done=True inside the .create() call. It does not have the benefit described above, but it will ensure that the notebook waits to run future cells until the workflow run has finished.

synthetic_dataset = gretel.safe_synthetic_dataset\
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create(wait_until_done=True)

Advanced use cases

By default, new Workflow Runs are created under the same Workflow for a given session.

If you want to create a new Workflow per Run, you can pass new_workflow=True when creating the Safe Synthetic dataset:

synthetic_dataset = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .transform() \
    .synthesize() \
    .create(new_workflow=True)

print(synthetic_dataset.workflow_id)

synthetic_dataset = gretel.safe_synthetic_dataset \
    .from_data_source(ds) \
    .transform() \
    .synthesize("tabular_gan") \
    .create(new_workflow=True)

print(synthetic_dataset.workflow_id)

You can load an existing workflow run by referencing the Workflow Run ID, which can be found in the Run Details page in the Console or in the logs of the workflow run.

Workflow Run IDs begin with "wr_".

synthetic_dataset = gretel.workflows.get_workflow_run("wr_2u6g4ZBljmLHm8WWH6RAHuFDJVg")

Once loaded, you can then reference the output, as described earlier, such as:

synthetic_dataset.dataset.df.head()
synthetic_dataset.report.table
synthetic_dataset.report.display_in_notebook()
