Notebooks for common Gretel use cases.
Follow along with these use cases to familiarize yourself with core Gretel features. These examples provide a starting point for common use cases which you can modify to suit your specific needs.
To help decide which approach may be best for you, you can use this flow chart.
Note: The Data Designer functionality demonstrated in this notebook is currently in Preview. To access these features, please join the waitlist.
Use the structured outputs feature to generate synthetic data with complex, nested data structures, with support for both Pydantic and JSON schema definitions.
Create multi-turn user-assistant dialogues tailored for fine-tuning language models.
Use Gretel's Navigator SDK to generate or edit tabular data from a user-provided prompt.
Generate synthetic daily oil price data using the DoppelGANger GAN for time-series data.
Generate secure, high-quality synthetic numeric, categorical, time-sequence, and text data.
Create synthetic data with Gretel, ensuring compliance, secure sharing, and actionable insights for AI and machine learning in healthcare.
Safely leverage sensitive or proprietary text data for downstream use cases.
Ensure data quality and privacy by applying flexible evaluations to real and synthetic datasets.
Create pipelines that connect to your data sources and automate synthetic data generation.
Synthetically generate a high-quality and diverse evaluation dataset for measuring the quality of your agent.
Create diverse, large-scale synthetic datasets tailored to your needs from nothing but a few samples.
Create a synthetic dataset of Python code examples.
Create a synthetic dataset of natural language prompts and SQL code examples.
Synthetically generate data from text and PDFs, and evaluate the quality and diversity of outputs.
Create safe, scalable synthetic data for training AI to understand and execute tool commands.
How to use Model-as-a-Service.
How to safely fine-tune LLMs on sensitive medical text for healthcare AI applications.
Enhance finance chatbots with privacy-first synthetic data to boost performance while ensuring compliance with privacy regulations.
Create an end-to-end RAG chatbot and synthetic evals with Gretel.
A practical guide to synthetic data generation with Gretel.
Take the first step on your journey with synthetic data.
Begin your journey with Gretel and start creating privacy guaranteed synthetic data today.
Start by following our Quickstart guide to install Gretel and train a basic model using the console or a notebook.
Follow along with Gretel Blueprints which cover some common foundational use cases.
Review our specific use case Use Case Examples which you can test out and modify for your own needs.
After following along with the recommended journey above you can dive into the Gretel Fundamentals section to understand the core Gretel concepts you'll be working with regularly.
Start generating synthetic data in minutes.
Create and share data with best-in-class accuracy and privacy guarantees with Gretel.
Sign up for a free account at https://console.gretel.ai.
Retrieve your API key.
For more detailed instructions, see Environment Setup.
Gretel's Console provides an easy way to create synthetic data from a prompt or your existing datasets without writing any code. Check out our Console setup guide to start using Gretel via our Console.
Follow along with Gretel Blueprints which cover some common foundational use cases.
Review our specific use case Use Case Examples which you can test out and modify for your own needs.
Dive into the Gretel Fundamentals section to understand the core Gretel concepts you'll be working with regularly.
Gretel's core concepts.
These fundamentals will cover the core functionality that you should understand when working with Gretel. Before going further, you should have followed our getting started guide and installed and configured the Gretel Client.
Here are the core fundamentals you will be familiar with after going through the next few sections:
Architecture. Review a summary of Gretel's core system components.
Deployment options. Gretel Cloud empowers you to train models and generate synthetic data without needing to manage complex operating systems or GPU configurations. Gretel Hybrid enables you to deploy the Gretel Data Plane into your own cloud tenant, providing all of Gretel's incredible features and benefits without the need for data to leave the boundaries of your own enterprise network.
Projects. Gretel Projects can be thought of as repositories that hold models. Projects are created by single users and can be shared with various permissions.
Inputs and Outputs. Gretel Models support a number of input and output data formats. For concepts related to input and output data sources like relational databases or object stores, see the Workflows and connectors section.
Creating models. Create models and train them against your source data sets.
Running models. Running models will let you generate unlimited amounts of synthetic data.
Model types. This overview page will give you a glimpse into the different possibilities when creating and training models with Gretel.
Workflows and connectors. Workflows and connectors provide an easy way to connect to sources and sinks for working with synthetic data generation at scale.
Where do Gretel Models run?
Gretel jobs run within the Gretel Data Plane. Gretel provides two deploy options for the Gretel Data Plane that you may utilize depending on your requirements.
Gretel Cloud is a comprehensive, fully managed service for synthetic data generation and it operates within Gretel's cloud compute infrastructure, allowing Gretel to handle all concerns related to compute, automation, and scalability. Gretel Cloud provides a seamless solution that simplifies the technical demands of setting up your own machine learning cloud infrastructure.
When you create your Gretel account you're given instant access to Gretel Cloud and Gretel Cloud Workers, so you can start your first model training job instantly.
Gretel Hybrid operates within your own cloud tenant and is deployed on Kubernetes. Gretel Hybrid is supported on GCP, Azure, and AWS through the use of the managed Kubernetes services offered by these cloud providers. Gretel Hybrid interfaces with the Gretel Control Plane API for job scheduling and job related metadata but customer owned data will never egress from your cloud environment. Gretel Hybrid is particularly well suited for handling sensitive or regulated data that cannot leave your cloud tenant's boundaries. Gretel Hybrid combines the benefits of using your infrastructure for training synthetic data models with Gretel’s advanced tools, offering a balance of control and convenience.
To learn more about Gretel Hybrid, check out the Gretel Hybrid section in our documentation.
The developer platform for synthetic data.
With Gretel, developers can get started in minutes with open source reference examples and simple APIs for generating unlimited amounts of synthetic data, labeling personally identifiable information, or anonymizing and removing biases from data. Gretel services are controlled by a simple web-based interface and run in Gretel’s managed cloud service or within your own private cloud environment.
After reviewing our Getting Started guide, check out the Gretel Fundamentals section to learn about the core concepts you'll encounter frequently when using Gretel.
New to the Gretel SDK? Start here!
A foundational series of notebooks for fine-tuning and generating synthetic data with the Gretel SDK.
Create a Gretel Account and generate an API Key to get started!
Sign up for Gretel using your work email or existing Google or GitHub accounts in the Gretel Console. All new accounts automatically get added to the free Developer plan. Learn more about our free and paid plans on the pricing page.
To use the Gretel CLI (Command Line Interface), you'll need your API Key. Get it by clicking the API Key menu in the sidebar, and then copy it to your clipboard. You'll also need this key for running notebooks that use the Gretel Cloud APIs. You can regenerate your key at any time using the secondary actions menu (the three dots).
You can interact with Gretel through the interface of your choice: the Gretel Console, CLI, or Python SDK. For more information on setting up each interface, check out the following pages:
Getting to know the Gretel Console
The Gretel Console provides a fast and easy way to generate synthetic data, classify and redact PII and use our AI models without having to download or install any tools. Sign up for a free Developer account and choose a use case in the dashboard to get started.
Use cases allow you to train and run any of our models in four steps. Just launch one of the cards and upload your training dataset, or use the sample dataset we provide. A configuration file is already selected for you so you don't have to tweak any parameters. Our auto-parameters and auto-privacy settings will tune the configuration to your training dataset, ensuring the highest chance of success.
While the model is running, you can track progress in the log window, train a new model, or try another use case.
When your model has completed training, you'll see your SQS (Synthetic Quality Score) and be able to download the full report along with your synthetic data from the Downloads page. We automatically generate some records for you as part of model training, and you can easily generate more using the Generate button in the Model Header.
Projects can be created, filtered and sorted from the Projects page. Select a project to create a new model in that project.
You can also manage your Account in the Console, view Documentation and Announcements, and create a support ticket if you need additional help.
The Members section inside each project allows you to quickly share that project with collaborators. The following access permissions are supported: Read-only, Read/Write, Administrator, Co-Owner.
Here's a quick walkthrough of creating synthetic data in the Gretel Console.
Once the model has been trained, you can use the SQS and Privacy Levels to determine whether the data meets your quality standards. If so, quickly generate more data whenever you want, or fine tune the configuration settings to improve your scores. See our tips on improving synthetic data quality.
Learn to use and manage projects that allow you to store and collaborate on data.
Gretel Projects can be thought of as repositories that hold models. Projects are created by single users and can be shared with various permissions:
Read: Users may access data artifacts (such as synthetic data and reports)
Write: Users may create and run models.
Administrator: Users may add other users to a project.
Owner: Full control.
The most important thing to note about Projects is that the name
attribute of a project is globally unique. If you are familiar with services like Simple Storage Service (S3), then Project naming will feel very similar since S3 bucket names are also globally unique within a specific service provider (such as AWS).
Projects have the following attributes you should be familiar with:
name
: A globally unique name for the project. When you create a project without specifying a specific name, Gretel will generate one for you. This will be a randomized name based on your username
and a unique hash slug. If you specify a name
that is already used, Project creation will fail.
display_name
: This can be any descriptive name for the Project that will control how the Project is listed and displayed in the Gretel Console. It is non-unique.
description
: This optional field can be used to provide a user-friendly description of the Project.
Next, let's look at creating and using Projects from the Gretel CLI.
At any point, you can get help on project management in the CLI by running:
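For example:

    gretel projects --help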
You can create a project with auto-naming by running:
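A sketch (confirm the subcommand with gretel projects --help):

    gretel projects create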
This will return a confirmation message containing the full Project object, including the generated project name.
Now, you may use this Project name as a reference in future operations.
You may also specify other Project attributes at creation time. For example, let's try selecting a unique project name and setting a display name for the console:
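A sketch; the flag names are assumptions you can confirm with gretel projects create --help:

    gretel projects create --name my-awesome-project --display-name "My Awesome Project"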
The CLI will return the newly created Project object. If you follow the Console link, you will see your new project listed by its display name.
If the Project name
you choose is not available, the CLI will return an error.
To delete a project, either the name or project-id is required.
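For example, to delete the project created above (the flag name is an assumption; check gretel projects delete --help):

    gretel projects delete --project my-awesome-project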
The Gretel Python SDK gives more flexibility and control around Project management. Within the SDK, the Projects module and class should be the primary orientation point for doing most of your work with Gretel.
The SDK differs in that when creating or accessing Projects, you will be given an instance of a Project class that you can interact with. Let's take a look.
Similar to our CLI interface, you can create a project with no input attributes:
Similarly, you can provide Project attributes to the create_project()
method:
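A sketch using the SDK's project helpers; the keyword arguments are assumed to mirror the Project attributes described above:

    from gretel_client.projects import create_project

    # Auto-named project
    project = create_project()

    # Project with explicit attributes
    project = create_project(
        name="my-awesome-project",
        display_name="My Awesome Project",
        desc="A project for experimenting with Gretel",
    )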
As mentioned earlier, Project names are globally unique. However, we have created a utility in the SDK that allows users to "share" identical project names such that any user could have their own version of a project called "test" or "foo".
This helper will either create a new project or fetch an existing one, giving you back a Project instance. Additionally, the display name of the project will automatically be set for you based on the name you provide. Let's take a look:
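For example:

    from gretel_client.projects import create_or_get_unique_project

    # Creates the project on first run, fetches the same project on later runs
    project = create_or_get_unique_project(name="my-new-awesome-project")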
In this mode, every user could use the exact my-new-awesome-project
string and a unique slug for that user will be appended to the Project name. This may be especially useful if you are re-running Notebooks or routines and do not want to use a combination of create_project()
and get_project()
to determine if a project already exists or not.
In certain occasions, you may want to create a Project only for the purposes of creating a model and extracting the specific outputs (Synthetic Data, Synthetic Quality Report, etc). Once you have extracted the data you need, you can delete the Project, which will then delete all of the models and artifacts related to those models.
For this use case, there is a temporary project context manager you can use. Once the context handler exits, the Project will be deleted:
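A sketch, assuming the tmp_project helper in gretel_client.projects:

    from gretel_client.projects import tmp_project

    with tmp_project() as project:
        # create models and download the artifacts you need here
        print(project.name)
    # the temporary project (and its models) is deleted on exit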
If you already have a Gretel Project, in order to run model operations, you will need to load an instance of the Project class in the SDK. We'll use our example Project from above: my-awesome-project
to show how to do this.
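For example:

    from gretel_client.projects import get_project

    project = get_project(name="my-awesome-project")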
To delete a project from the SDK, you utilize the delete()
method on a Project instance:
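For example:

    project.delete()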
Once you delete a project, the class instance is not usable anymore. If you try and do any meaningful operation with it, you'll receive a GretelProjectError
such as:
GretelProjectError: Cannot call method. The project has been marked for deletion.
Use our flagship synthetics model, Tabular Fine-Tuning, to generate synthetic tabular data (text, numeric, categorical, and time-sequence) with optional differential privacy guarantees.
Fine-tune LLMs to generate synthetic text.
Generate synthetic numeric and categorical data for high-dimensional datasets.
Data Designer: Define desired attributes, generate synthetic data, and refine through fast previews and detailed evaluations.
While use case flows make it easy to train new models, you can also create one from scratch. Start by creating a new project. Click the Projects button in the sidebar, or use the new project button in the top navigation bar.
Get up and running with Gretel's CLI and SDK.
The Gretel CLI and Python SDK are made available through both PyPi (most common) and GitHub.
We require using Python 3.9+ when using the CLI and SDK. You can download Python 3.9 (or newer) here and install manually, or you may wish to install Python 3.9+ from your terminal. If you are working with a new Python installation or environment you should also verify that pip is installed.
To get started, you will need to setup your environment and install the appropriate packages.
The most straightforward way to install the gretel-client
CLI and SDK is with pip:
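This installs (or upgrades) the gretel-client package from PyPI, assuming pip points at your Python 3.9+ environment:

    pip install -U gretel-client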
The -U
flag ensures the most recent version is installed. Occasionally we will ship a Release Candidate (RC) version of the package. These are generally safe to install; you can opt in to them by adding the --pre
flag.
If you wish to have the most recent development features, you may also choose to install directly from GitHub with the following command. This may be suggested from our Customer Success team if you are testing new features that have not been fully released yet.
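As a sketch, assuming the client is published from the gretelai/gretel-python-client repository (confirm the repository URL before running this):

    pip install -U git+https://github.com/gretelai/gretel-python-client.git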
If you are using Gretel Hybrid to run Gretel jobs on your own cloud infrastructure, the Gretel CLI and SDK will require your cloud provider's respective Python libraries. To install these dependencies run the relevant command below.
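As a sketch, the provider-specific dependencies can typically be installed as pip extras; the extra names below are assumptions, so confirm them against the gretel-client package documentation:

    # Install the extra that matches your cloud provider (extra names assumed)
    pip install -U "gretel-client[aws]"
    pip install -U "gretel-client[azure]"
    pip install -U "gretel-client[gcp]"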
After installing the package, you should configure authentication with Gretel Cloud. This will be required in order to create and utilize any models.
If you are installing Gretel on a system that you own or wholly control, we highly recommend configuring the CLI and SDK once with our configuration assistant. After doing this, you will be able to use the CLI and SDK without authenticating before each command.
To begin the CLI configuration process, use the command:
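This launches the interactive configuration assistant described below:

    gretel configure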
This will walk you through some prompts. You may press <ENTER>
to accept the default, which is shown in square brackets for each prompt. The individual prompts are described below.
Press <ENTER> to accept the default value for the Endpoint (https://api.gretel.cloud).
The Artifact Endpoint is only required for Gretel Hybrid users. If you are using Gretel Cloud, press <ENTER>
to accept the default value of cloud
. If you are a Gretel Hybrid user, the configured value should be the URI for the Sink Bucket created during the Gretel Hybrid deployment. This would be the resource identifier for an Amazon S3 Bucket, Azure Storage Container, or Google Cloud Storage Bucket.
Amazon S3 Example: s3://your-sink-bucket
Azure Storage Example: azure://your-sink-bucket
Google Cloud Storage Example: gcs://your-sink-bucket
The Default Runner is set to cloud
. Press <ENTER>
to accept the default value unless you are a Gretel Hybrid user or are running Gretel locally on your own machine(s). We recommend keeping cloud
as the default runner, which will utilize Gretel Cloud's auto-scaling GPU and CPU fleet to create and utilize models.
If you are a Gretel Hybrid user set this value to hybrid
to utilize hybrid runners.
If you need to run compute on your own machine(s) set this value to local
.
When prompted for your Gretel API Key, paste the key you created in the Gretel Console.
When prompted for your Default Project, you may optionally enter a Project Name or press <ENTER>
to accept the default.
Finally, you can test your configuration using the command:
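The command below prints the account associated with your API key; the subcommand name is an assumption, so check gretel --help if it differs:

    gretel whoami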
If the configuration is correct, the command will return your account details.
At this point, you are authenticated with Gretel, and can use the CLI without needing to re-authenticate. If you run into trouble, feel free to contact us for help!
There are a few different options to configure your Gretel Cloud connection through the SDK.
If you are using an ephemeral environment (such as Google Colab) and only wish to configure your connection for the duration of your Python session, you can configure it like this:
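A minimal sketch that reads the key from an environment variable (the variable name is just an example):

    import os

    from gretel_client import configure_session

    # Configure this Python session only; the key is read from the environment
    # rather than hard-coded into the notebook or script.
    configure_session(api_key=os.environ["GRETEL_API_KEY"], validate=True)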
Never commit code with your Gretel API key exposed! Generally you should load your Gretel API key in from some secure secrets manager or an environment variable.
See below for additional options, such as API key prompting, which is useful when writing notebooks that others will run.
Prompting
If you wish to maintain code that others may use, you can also use the following modification for configuring your session with Gretel Cloud. By using the prompt
value, you'll be prompted to enter your API key.
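A sketch of this prompting pattern:

    from gretel_client import configure_session

    # "prompt" asks for the API key interactively; cache="yes" stores it for
    # later sessions on the same machine.
    configure_session(api_key="prompt", cache="yes", validate=True)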
Hybrid Support
If you want to configure your session to run in Hybrid mode, run the following as part of configure_session
:
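A sketch; the parameter names are assumed to mirror the CLI configuration fields above, and the bucket URI is a placeholder:

    from gretel_client import configure_session

    configure_session(
        api_key="prompt",
        default_runner="hybrid",                    # run jobs in your own data plane
        artifact_endpoint="s3://your-sink-bucket",  # hybrid sink bucket (placeholder)
    )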
The hybrid environment configuration will apply to everything run with the Gretel client, including libraries like Gretel Trainer and Gretel Relational.
See additional storage setup instructions per cloud provider here.
Gretel Python Client docs can be found here.
The Gretel Client uses cloud provider specific libraries to interact with the underlying object storage via the smart_open
library. If you're a Gretel Hybrid user you may need to configure your environment with proper credentials for your specific cloud provider.
Gretel release notes are organized by release vehicle.
Platform Release Notes cover releases for the Gretel Data Plane and Control Plane (Gretel Cloud and Hybrid).
Python SDKs
The gretel-synthetics
SDK is a source-available Python package that allows permissive use of Gretel maintained generative models.
The gretel-client
SDK is a Python interface to Gretel APIs.
Console Release Notes summarizes releases for our Console web app, updated on a weekly basis.
Get familiar with Gretel's architectural components.
Gretel has three architectural components that you will want to be familiar with:
Gretel Control Plane: The control plane for scheduling work such as creating models and generating, classifying, or transforming data. This includes the Gretel REST API, Console and CLI tool. The REST API is hosted as a service and is used to manage accounts, projects, and metadata for projects, workflows, and models.
Regardless of where Gretel Workers run, they will connect to Gretel's REST API to communicate timing information, errors, and additional metadata. If you use workers in your own environment, no training data or sensitive information will be sent back to Gretel's API.
Gretel Data Plane: Containers that consume Gretel Configurations and handle requests to process records. When a worker consumes a Gretel Configuration, it creates a re-usable model. Additionally, workers can utilize existing models to generate, transform, and classify records. The data plane also includes several controller microservices that are responsible for detecting queued jobs and scheduling the required worker containers. Gretel Cloud's managed data plane will execute all of your workloads by default. Gretel Hybrid allows customers to deploy their own Gretel Data Plane into their preferred cloud environment which will enable customers to utilize all of Gretel's incredible features without the need for data to leave the boundaries of your cloud tenant. See Deployment Options for more details.
Gretel Configurations: Declarative objects that are used to create models. Gretel offers several configuration templates to help you get started with popular use cases such as creating synthetic datasets or anonymizing PII. These configurations are sent to the Gretel REST API to create models. These models can then be used to generate, transform, and classify data. Further information can be found in the Model Configurations page.
These components work together to enable developers to build robust and flexible privacy engineering systems.
The Gretel Control Plane is responsible for creating and managing projects, models, workflows, and job scheduling. The Control Plane is accessible via our REST API. We also consider other core Gretel components part of the control plane, such as the Gretel Console and Gretel CLI which are both responsible for interacting with the Control Plane API.
The primary object within Gretel that you will be working with is a Project. Projects are like repositories that contain models, workflows, and other associated data. You can invite other users to a project and control their permissions.
The following primitives exist within a Gretel Project:
Project Artifacts: These are datasets that can be uploaded and stored with your project. These artifacts are typically datasets that can be used to create models. Project artifacts can be uploaded by anyone with “write” access to a project. Additionally, project artifacts will be kept with the project until they are explicitly deleted. When using the Gretel Console or CLI you use Gretel Cloud Workers by default, and project artifacts will automatically be created for you from your training data. Project artifacts will have a specific structure. If your training data is called my-data.csv
then an example artifact key might be: gretel_89bdba626464477aaeeef96fc8b2b613_my-data.csv
. This key can be used as a data source for training or running models.
Models: Models are created on source datasets. You configure a model to be created using a Gretel Configuration which allows you to specify a source dataset, model type, and various parameters. You can train a model to generate synthetic data, transform records, or classify records. For each model that is created, the following artifacts are created:
A model archive, which can be referenced to generate, transform, and classify data at scale.
A model report. For synthetic models, this will be the Gretel Synthetic Report. For transforms, this will be a Gretel Transform Report.
Sample data. A small sample of synthesized or transformed data will be created as part of the model creation process.
Model Servers: After a model has been created, you may run that model as many times as you like to generate, transform, and classify new data. The result of the model server will be an output dataset that can be shared or used for your downstream use case.
Uploading project artifacts, model creation, and model server creation can only be done by Project members that have “write” access or higher.
Whether you are utilizing Gretel's managed data plane (Gretel Cloud) or deploying your own data plane (Gretel Hybrid), the Data Plane is responsible for running jobs created via the Gretel Control Plane. The Data Plane consists of two primary components: Gretel Workers that create and run models, and the controller microservices responsible for creating and scheduling Gretel Worker containers. Gretel Workers are containerized applications that are designed to communicate directly with Gretel Cloud. All communications will occur over HTTPS (Port 443) to api.gretel.cloud
. If you are running your own Gretel Data Plane (using Gretel Hybrid), your environment will need open outbound communication with the Control Plane API.
Workers are stateful and will transition through different statuses during their run time. Additionally, during their run time, the workers will periodically check in with Gretel Cloud to transmit usage information (for billing), status updates, generalized run logs, and error / troubleshooting diagnostic information.
A Gretel Worker can exist in one of the following states:
created
- A request for a worker has been made. This is the default state for a worker and will stay in this state until a worker is launched. By default, a user may have up to 10 created workers. This essentially serves as your “queue” for creating or running models.
pending
- This state indicates that the scheduling service has obtained the request and is provisioning a worker for your model or model server.
active
- A worker is creating a model, generating, or processing records. Once a worker is in this state it will begin periodically sending control plane and logging information back to the Gretel Control Plane.
completed
- A worker successfully completed its job. If it was a Gretel Cloud Worker, all model or server artifacts have been uploaded and stored in Gretel Cloud. If using a Gretel Hybrid worker, then all artifacts should have been written to the private location specified when starting the job.
error
- A worker encountered an error. Basic error and troubleshooting information should have been sent to the Gretel Control Plane.
cancelled
- A user has cancelled the worker. When a worker is cancelled, the worker will promptly shut down operation and cease all processing.
lost
- A worker will be marked as lost if the Gretel Control Plane has been unable to communicate with the worker after some period of time.
In the event of an error
, cancelled
, or lost
status, a worker cannot recover from this state. A new model or server will have to be created once the underlying issue is fixed.
To create a model, a Gretel worker is launched and will download a configuration from the Gretel Control Plane. Once the configuration is loaded, the worker will obtain the training data and begin creating a synthetic, transform, or classification model.
To run a model, a Gretel worker is launched which we consider a "model server". Depending on the model type, a model server can be used to generate, transform, or classify data.
Workers can be automatically launched for you in Gretel Cloud. This is the default mode when uploading a configuration from the Console or the CLI. In cloud mode, once a request for a model is received, Gretel will provision a worker for you and the model and associated artifacts (such as quality reports, sample data, etc) will also be stored in Gretel Cloud. You may download these artifacts at any time. With a model created and stored in Gretel Cloud, model servers can be created to utilize the model and generate, transform, or classify data.
Gretel configurations are declarative objects that specify how a model should be created. Configurations can be authored in YAML or JSON. To help you get started, we have several Configuration Templates. You may download and edit these templates as necessary or directly reference them when using the CLI (see our tutorials on using the templates directly for model creation). You can also edit configurations directly in the Gretel Console, using the Config Editor.
The configuration file is the primary way to specify how a model can be created. When a model is requested to be created, a copy of this configuration will be sent to the Gretel Control Plane. Regardless of where a Gretel Worker is run, this configuration will be stored in Gretel's Control Plane and associated with the model.
When a Gretel Worker is scheduled (in our cloud or your own environment), it will contact Gretel Cloud and download a copy of the configuration and then start the model creation process.
All Gretel models follow a similar configuration file format structure.
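As a sketch, a minimal synthetics configuration might look like the following; the exact sections and parameters vary by model type, so start from one of the published templates rather than writing a config from scratch:

    schema_version: "1.0"
    name: my-synthetics-config
    models:
      - synthetics:
          data_source: __tmp__
          params:
            epochs: 100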
To learn more about the configurations, please see the Model Configurations documentation.
Please see our pricing page for details on our various plans. You can get started completely free with 15 credits on our Developer Plan. The following limits apply:
Maximum Queued Jobs (10). This is the maximum number of jobs that can be in a created
state. If you are using Gretel Cloud workers, these jobs are automatically queued to start. While a worker is in this state, you may delete it or cancel it at any time. When this number is exceeded, API calls will return a 4xx
error when attempting to create new models or model servers.
Maximum Running Workers (4). This is the maximum number of jobs that can be in an active
state. When using Gretel Cloud workers, if this limit is exceeded, Gretel will wait for work to complete and then automatically start a new job from the queue of created
jobs. When running local workers, if the worker starts and the limit is exceeded, the job will be put into an error
state.
Maximum Worker Duration (1 hour). This is the maximum amount of time a worker can be in an active
state either creating or serving a model. If the job exceeds this limit, the job will be put into an error
state.
Once a Gretel Model is created, you may utilize that model to generate synthetic data as many times as needed. Because you may use a model to classify and transform data as well, we generically refer to the running of a model as a Record Handler.
Compared to model creation, running a model does not require a standalone Gretel Configuration. There are three input types that you should be aware of when it comes to running models:
A Model ID (or other reference to a Model, like a Model
instance in the SDK)
A number of parameters, which are essentially key-value pairs. These will vary depending on the specific type of model you are running.
Optionally, one or more input data files. Depending on the model, the input data may serve various purposes. One example of using an input data file with a record handler is providing a set of pre-conditioned inputs (smart seeds) for the model to use during generation. For example, a synthetics record handler accepts parameters such as:
num_records
: How many synthetic records to generate
max_invalid
: How many records can fail validation before the job stops
Models can be run using the gretel models run [OPTIONS] command. At any time, you can get help on these commands by running:
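This prints the available options for running models:

    gretel models run --help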
In order to run a model, you will need to know or access its Model ID. When passing model run parameters to the CLI, you should use the --param
option for each param such that it matches a --param KEY VALUE
pattern.
Given a previously created model, let's generate 100 additional records:
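A sketch using a placeholder Model ID; the num_records parameter matches the example above:

    gretel models run \
      --model-id your-model-id \
      --param num_records 100 \
      --output more-syn-data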
When this job completes, the artifacts will be downloaded to the more-syn-data
directory. For this particular job you should see logs.json.gz
which are the job logs and your new synthetic data in the data.gz
artifact.
If the model type supports conditioning (i.e. smart seeding), then you may provide this set of partial records or smart seeds using the --in-data
flag.
When providing a data source for running a model, the job will often use the number of records in the data set to determine how many synthetic records to create. In this case, parameters like num_records
will be ignored.
In order to run a model from the SDK, you will need a Model
instance. Once you have that instance, you can create and submit a record handler object in a very similar way to model creation. When submitting a record handler to Gretel Cloud, you may track the state of the job the same way as a model.
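A sketch, assuming model is a completed Model instance fetched as shown earlier:

    import pandas as pd
    from gretel_client.helpers import poll

    # Create and submit a record handler to generate 100 more records
    record_handler = model.create_record_handler_obj(params={"num_records": 100})
    record_handler.submit_cloud()
    poll(record_handler)

    # Read the generated synthetic data back into a DataFrame
    synthetic_df = pd.read_csv(
        record_handler.get_artifact_link("data"), compression="gzip"
    )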
Gretel's models can help you transform and synthesize your sensitive data to generate provably-private versions.
Gretel offers the following synthetics models:
Data types: Numeric, categorical, text, JSON, event-based
Differential privacy: Optional
Formerly known as: Navigator Fine Tuning
Data types: Text
Differential privacy: Optional
Formerly known as: Gretel GPT
Data types: Numeric, categorical
Differential privacy: NOT supported
Formerly known as: ACTGAN
Data types: Numeric, categorical
Differential privacy: Required; you cannot run without differential privacy
Scaling synthetic data generation.
Gretel Workflows provide an easy-to-use, config-driven API for automating and operationalizing Gretel. Using Connectors, you can connect Gretel Workflows to various data sources such as S3 or MySQL and schedule recurring jobs to make it easy to securely share data across your organization.
A Gretel Workflow is constructed of actions that connect to various services including object stores and databases. These actions are then composed to create a pipeline for processing data with Gretel. In a typical workflow:
A source action is configured to extract data from a source, such as S3 or MySQL.
The extracted source data is passed as inputs to Gretel Models. Using Workflows you can chain together different types of models based on specific use cases or privacy needs.
A destination action writes output data from the models to a sink.
A Workflow is typically created for a specific use case or data source and can be compared with a data pipeline or DAG.
A model in Gretel is an algorithm that can be used to generate, transform, or label data.
Powered by data, models can be thought of as the building blocks of machine learning. This page walks through the basics of initializing and training models for synthetic data, data transformations, and data classification.
When creating a model, Gretel Cloud performs the following steps:
Load the Gretel Configuration
Upload the training data to Gretel Cloud
Gretel Cloud provisions a worker and begins model training
When the job is completed, several Model Artifacts, including output data and reports, can then be downloaded client-side.
We'll show how to use both the CLI and SDK to create Gretel models in their own sections below.
Gretel Configurations generally start as declarative YAML files, which can then be provided to the SDK, CLI, or Gretel Console for starting a model creation job. Between the CLI and SDK, however, there are some differences (and similarities) on how you can define and provide a Gretel Configuration.
The CLI and SDK can work with Gretel Configurations that are YAML files. The CLI and SDK can access files on-disk or through remote URIs (HTTPS, S3, etc).
The SDK can also load Gretel Configurations as Python dictionaries as an alternative to YAML. This way, you may either load a configuration from disk or a template, and then manipulate it as necessary. Here's an example of this:
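The sketch below loads a template with the SDK's read_model_config helper; the dictionary key path shown is an assumption based on the default synthetics template layout:

    from gretel_client.projects.models import read_model_config

    # Load the "synthetics/default" template as a Python dictionary
    config = read_model_config("synthetics/default")

    # Tweak a parameter before creating the model (key path assumed)
    config["models"][0]["synthetics"]["params"]["epochs"] = 50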
Data sources may be either files on disk or files that can be accessed via a remote URI (such as HTTPS or S3). In both cases, you should provide a string value to the file on disk or the remote path.
The SDK will accept Pandas DataFrames as input data. When a DataFrame is provided, the SDK will temporarily write the DataFrame to disk and upload it to Gretel Cloud. When the operation is complete, the temporary file on disk will be deleted. When showing SDK usage below, we will use the DataFrame input data method.
For this example, we will download the sample data to disk so you may observe the full artifact creation process:
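A sketch using pandas with a placeholder URL (substitute the sample dataset of your choice):

    import pandas as pd

    # Placeholder URL - point this at the sample dataset you want to use
    SAMPLE_DATA_URL = "https://example.com/sample-data.csv"

    df = pd.read_csv(SAMPLE_DATA_URL)
    df.to_csv("sample-data.csv", index=False)  # saved locally for the CLI example below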
Regardless of the model type, creating a Gretel model through the CLI will be done through the gretel models create ...
command set.
At any time you can get the help menu by running:
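This prints the available options for model creation:

    gretel models create --help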
Given our data set, and a synthetics configuration shortcut (synthetics/default
) let's create a model:
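A sketch assuming the sample-data.csv file saved earlier and an existing project named my-awesome-project; confirm flag names with gretel models create --help:

    gretel models create \
      --config synthetics/default \
      --in-data sample-data.csv \
      --output my-synthetic-data \
      --project my-awesome-project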
By default, the CLI will attach to the job as it runs in Gretel Cloud and you will start to see verbose logging output as the job runs.
If you terminate this command, i.e. by sending a keyboard interrupt, this will cancel the job. If you wish to run the job in a "detached" mode, you may use the --wait
flag and give some low number of seconds to attach to the job such as --wait 5
. After 5 seconds the CLI will detach and the job will continue to run in Gretel Cloud.
Once the model is completed, the CLI will download the artifacts that were created as part of the model. You should be able to see these in the directory you specified in the --output
parameter, so in this example, artifacts should be saved to the my-synthetic-data
directory.
Additionally, the CLI will print the Model ID.
You will need this ID when re-using this model to generate synthetic data. Next, let's look at the downloaded artifacts.
data_preview.gz
contains the synthetic data that was created as part of the model creation process
report.html.gz
contains the Synthetic Quality Score report as a human readable HTML file
report_json.json.gz
contains the data from the SQS report but in a JSON consumable format
logs.json.gz
contains the model creation logs; these may be useful if you ever contact Gretel support
When the CLI stays attached to the Gretel Cloud job, artifacts will automatically be downloaded to the provided --output
directory. If you have disconnected the CLI from Gretel Cloud, for example using the --wait
option, then you may download the artifacts manually with the CLI at a later time.
Next, we'll walk through creating models with the SDK. While the SDK can utilize local files data sources and remote URI data sources, for this example, we will show how you can use a Pandas DataFrame as your data source.
Once we have our Project
instance, we will want to do a few things:
We use the Project
instance to create a Model
instance by using a specific create_model_obj()
factory method. This factory method takes both our Gretel Configuration and data source (a DataFrame) as params.
With the Model
instance created, we have to actually submit it to Gretel Cloud
Next we can poll
the Model
instance for completion
Finally we can download all of the Model Artifacts
Let's see it all in action...
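A sketch of the end-to-end flow, carrying over the DataFrame and project from the earlier examples:

    import pandas as pd
    from gretel_client.helpers import poll
    from gretel_client.projects import create_or_get_unique_project
    from gretel_client.projects.models import read_model_config

    df = pd.read_csv("sample-data.csv")
    project = create_or_get_unique_project(name="my-awesome-project")

    # 1. Build the Model instance from a configuration and a DataFrame data source
    model = project.create_model_obj(
        model_config=read_model_config("synthetics/default"),
        data_source=df,
    )

    # 2. Submit the job to Gretel Cloud
    model.submit_cloud()

    # 3. Poll until the job completes (logs are streamed as it runs)
    poll(model)

    # 4. Download the Model Artifacts (data preview, SQS report, logs)
    model.download_artifacts("my-synthetic-data")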
In the above example, our Model
instance was in memory the entire time. If you ever lose that instance or restart your Python interpreter, you can create and hydrate a new Model
instance right from your Project
instance:
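A sketch using a placeholder Model ID (the real ID is printed when the model is created):

    model = project.get_model(model_id="your-model-id")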
In the next section, we'll discuss how to utilize existing models to generate synthetic data.
In the example below, we will create a record handler for a Gretel LSTM (synthetics
) model that we previously created. The Gretel LSTM model utilizes two parameters: num_records and max_invalid, described above.
- Gretel’s flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.
- Gretel’s model for generating privacy-preserving synthetic text using your choice of top performing open-source models.
- Gretel’s model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.
- Gretel’s model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.
You can learn more about Gretel Synthetics models .
Gretel’s model combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Use this data classification to detect a variety of such as PII, in both structured and unstructured text.
We generally recommend combining Gretel Transform with Gretel Synthetics using to redact or replace sensitive data before training a synthetics model.
You can learn more about Gretel Transform .
For more information, please refer to the full .
Both the CLI and SDK can reference configurations through "template shortcuts." For various models and use cases, Gretel maintains a set of configuration templates. A template can be referenced by using a directory/filename
pattern (no file extension required). So the string synthetics/default
will automatically fetch and use the default synthetics configuration template.
The various supported data source formats are covered on the Inputs and Outputs page. This section will cover how these data sources can be provided to the CLI and SDK.
First, you'll need to create a Project
instance to work with. Creating a Project
instance is covered in the Projects section.
Supported input and output formats
Gretel Models support a number of input and output data formats which are outlined on this page. Gretel also provides a way for you to connect directly to your source and destination data sources using Gretel Connectors.
Gretel Models support input datasets in the following formats:
CSV (Comma Separated Values)
CSV data input is supported for Synthetics, Transform and Classify jobs.
The first row of the CSV file will be treated as column names, and these are required for processing.
JSON (JavaScript Object Notation)
The files may be formatted as a single JSON doc, or as JSONLines (where each line is a separate JSON doc).
Processing JSONL files is much more efficient for larger datasets; therefore, we recommend it over regular JSON.
The JSON documents may be flat or contain nested objects. While there is no limit to the number of levels in a nested document, the more complex the structure, the longer the data will take to process.
JSON datasets are currently supported for Classify and Transform only. Support for Synthetics is coming soon.
Parquet is also supported as an input format. The following compression algorithms for column data are supported: snappy, gzip, brotli, zstd.
Parquet datasets are currently supported for Classify and Transform only. Support for Synthetics is coming soon.
Uploading Parquet datasets as project artifacts is currently only supported in the Gretel CLI and SDK. The ability to upload these in the Gretel Console is coming soon.
Results are automatically output in the same format as the input dataset.
For JSON datasets in Classify, there will be an additional field for each detected entity: json_path
. This field contains the JSONPath location of that detected entity within the JSON document. See below for a sample classify result on a JSON dataset.
For Transform, the output will be written in the same format as the input; however, whitespace and the order of fields from the input will not be preserved.
In CSV files, field names correspond to the column name. JSON data doesn't have columns, but we still want to be able to reference fields for reporting purposes. Therefore, field names are created by referencing the dot-delimited path from the root of the document to each scalar value (the leaf). In the example below, the field that contains the value test@example.com
will be referenced as: user.emails.address
.
Note that in the example above, the array index is omitted. Thus the values inside the array will be aggregated together since typically all elements inside an array have the same schema. This method of field naming works well for JSON datasets that have a uniform schema across all the records. The naming convention could vary in the case of optional fields, etc.
For Classify, the result structure for Parquet datasets will be the same as that of JSON datasets. Since Parquet data can be nested in a similar way as JSON data, each detected entity will contain a json_path
field.
For Transform, the output will use the same schema and Parquet version as the input file.
Field names that appear in Classify and Transform reports when processing Parquet files correspond to column names in the Parquet schema. For columns that contain nested data, field names are constructed in the same way as for JSON data (see above).
If you would like us to import a different format, let us know.
Gretel's Platform is comprised of control plane and data plane components.
The Gretel Data Plane is responsible for processing user-provided prompts and/or datasets and generating synthetic data.
The Gretel Control Plane includes Gretel's APIs, job scheduling, and workflow management.
Gretel generally releases platform updates every Tuesday. We do sometimes release out-of-band to address critical bug fixes, security updates, or pre-releases for future features and capabilities.
Gretel follows a CalVer versioning schema. The schema is YYYY.MM.N:
YYYY: Calendar year.
MM: Month of year.
N: Monotonically increasing release number for the given month, so 2024.6.1
is the first release in June of 2024.
Gretel automatically upgrades Gretel Cloud to support enhancements and upgrades to the platform. All users get the same updates at the same time. Gretel uses CalVer internally to track changes, and release notes are organized by these CalVer numbers to more easily communicate the changes that are delivered.
Gretel Hybrid splits the control and data planes such that:
Gretel maintains and runs the control plane in Gretel Cloud. Gretel control plane updates are automatically shipped by Gretel for both Gretel Hybrid and Gretel Cloud.
The data plane is customer managed within customer cloud accounts. Depending on your Hybrid setup, you will need to update varying container images. More on this below.
The container images used on Gretel Hybrid can be split into three categories:
Management containers. Images are prefixed with gcc-
. There are three core management containers that run on the Hybrid cluster. These containers are responsible for managing model jobs and workflows.
Workflow container. This container image is named workflow
. These containers are used when running Gretel Workflows and handle things such as source and sink actions.
Model container. This container image is named model
. These are containers that run the actual Gretel models for generating synthetic data.
If your Hybrid deployment directly uses Gretel's container registry or a pull through cache the workflow
and model
container images are automatically updated and pulled for you upon release. These containers are spawned by the management containers during model jobs and workflow runs.
Gretel's container images have several shared internal libraries. We have consolidated the number of total images to make upgrades easier. We highly recommend upgrading all container images at the same time based on release version numbers. This mirrors how we update Gretel Cloud.
If you need to explicitly pull images by tag and cannot use the latest
tag, then you should use the appropriate CalVer version number for the image tag.
Feature: Enable sampling of Person
objects for seeding datasets in Data Designer, based on publicly available statistics datasets including the US Census.
Feature: Adds a Workflows Task to support splitting off a holdout set from a training Dataset.
Task: Update SQS from [0, 100] to [0, 10].
Fix: Corrects an issue with the CLI polling for a model run.
Task: Use Multi-Modal Report to evaluate the model by default. In order to use an older version of the report for the Evaluate model, please set task.type=sqs
.
Fix: Security fixes for our Java and Python images
Task: Removes the use of a local docker agent for running models
Task: The combined models image is always used now via an API call
Task: Go applications are now built using Go version 1.23.6
Fix: More informative error messages for asserting generation_prompt
template expectations when adding columns in DataDesigner
.
Task: Move the gretel agent python code out of the gretel-client
Feature: Allows disabling cleanup of artifacts in hybrid if explicitly "disabled"
Feature: Allows a gender to be specified for a transform persona
Fix: Fixes a bug that prevented workflow level evaluate running for gretel_model outputs.
Task: Switch our Azure Navigator-Tabular models from using gpt3.5 --> gpt-4o-mini
Task: Moves the validation logic to right before we create, so project_guids can still be used
Fix: Removed unnecessary prompt templates for DataDesigner
that led to inconsistent data quality.
Feature: Add support for doing PII Replay for specified columns. This can be used in conjunction with specified entity types or in place of them.
Task: Deprecates Classification and Regression MQS reports.
Fix: Ensure the bounds when sampling are taken into account for our dataset when doing an inference attack
Task: Add ability to independently toggle LLM-based and regexp-based NER.
Feature: CLI and SDK Enterprise Tenant Selection.
Task: Add direnv venv support.
Task: Allow specifying repository for hybrid supervisor image.
Fix: Add 10s timeout when validating connections/workflow actions.
Task: Default ner_optimized
to True
.
Fix: Fix client integration tests with missing custom deps.
Task: Update GenerateColumnFromTemplate
task.
Task: Push Qwen coder and instruct images.
Fix: Update test harness default case.
Fix: Fixes an issue with DataDesignerWorkflow.from_yaml
.
Fix: Don't validate workflow connections that reference hybrid connections.
Task: Update SDK's Project.search_model
and CLI's models search
to include more parameters.
Fix: Change blank AWS access keys/ secret keys to be an error state when creating a connection.
Task: Update default NER threshold to .7.
Fix: Categorize an error during classify to make it clearer what went wrong.
Fix: Give an earlier and clearer warning when using an invalid project name with the high level SDK.
Fix: Lower the packaging version to a version that works with ctgan.
Fix: Check deprecated access key parameter of creds when doing validation.
Fix: Update tmp_project to allow passing the hybrid cluster guid, consistent with create_project.
Fix: Add better error codes to some SQL exceptions.
Fix: Adds some additional headers that Azure serverless needs for talking to Navigator Tabular.
Feature: Improve edit mode stability in Navigator Tabular. Added ability to pass sample_data
in the generate()
method in Gretel SDK.
Task: Update getModels to include more query parameters.
Fix: Fix occasional crash in navft when training with columns of non-native python types like Datetime
, Timestamp
and Decimal
.
Feature: Support for Amazon Nova suite of models in Data Designer.
Task: Support addition of categorical seed columns for seed generation in DataDesignerFromSampleRecords
.
Fix: Issue where we sent a deprecated max_tokens
field to our TGI LLMs, causing us to ignore the field.
Feature: Improved data designer seed generation.
Task: Scheduled Workflows are only available as a paid feature.
Fix: Catch and add a more accurate error code to Workflow OOM errors.
Feature: Adds an Evaluate action to Gretel Workflows. This allows you to generate a single SQS report using inputs from multiple Workflow steps. For example, you could generate a report comparing raw training data from your S3 bucket against data that has been transformed and synthesized. You can also use the Holdout action to feed in an additional holdout set which is then used by our Privacy Metrics.
Feature: Adds an AWS Bedrock adapter for Navigator Tabular to the Gretel SDK.
Fix: Fixes JSON and JSONL support for files encoded with UTF-16/32.
Fix: Fixes automatic prompt naming for saved prompts.
Fix: Addresses some usability issues with the bedrock Navigator model
Task: Part of the gretel-client release.
Feature: Add the new Sample-to-Dataset tasks and workflow into the DataDesigner module.
Fix: Drop windows test support in gretel-client; file paths are brittle there.
Task: Update the pypi action for package releases.
Task: Remove some user details from the Projects API.
Fix: Increase privacy protection level/privacy configuration in report when using differential privacy with Navigator Fine Tuning.
Feature: Change how we interface with data designer to define evaluation tasks (yaml and sdk).
Fix: Fixes some minor logging in the trainer SDK
Feature: The AIDD interface now has a with_person_samplers
method for creating latent person samplers.
Fix: Handle NotFound
return code for /projects/:id
Fix: Parameters for sampling-based data sources are now autogenerated to the client.
Fix: Fixes default type of AIDD evaluation report attribute.
Feature: Implements the magic
interface for the new v2 data_designer
SDK. Try it out by calling data_designer.magic.add_column("my_column", "my_column_description")
.
Feature: AIDD add_column
now can take a concrete column type as input.
Feature: Columns (except for sampler and seed columns, which can't depend on other columns) are now represented with a DAG, ensuring that steps are run in the necessary order.
Feature: New ExpressionColumn
added, which provides a new implementation of expressions. Expressions are now provided as straight jinja2 templates.
Internal config updates.
Task: Removes deprecated /auth/email endpoint.
Fix: Block further local IP addresses for data sources
Fix: Fix Multi-Modal Report for Evaluate Models Rendering bugs
Fix: Fix an issue with building gretel-synthetics
Feature: The Gretel SDK now supports AWS Bedrock/Sagemaker and Azure Models-as-a-Service for Navigator Tabular. Users can bring their own client configurations and create a Navigator adapter. Once the adapter is created users can generate and edit tabular data.
Fix: Give clarity in tv2 hybrid when an LLM is not deployed
Feature: This PR adds a new, public facing workflow action. This action splits the source dataset into a main training set that continues to be used in the workflow and a holdout test set that is saved until the very end when we calculate Privacy Metrics.
Task: Remove unnecessary backports.cached_property
dependency.
Fix: Fix bug when iterating through workflow messages.
Fix: Provide transform report via the SDK
Task: Relational synthetics is being deprecated. A warning message has been added to inform users of this change in workflow task logs.
Feature: Adds Azure Fine Tuning support to the Gretel SDK. Synthetic data can be formatted into OpenAI fine-tuning and inference formats and end-to-end fine-tuning can be managed directly from the Gretel SDK.
Feature: Introduce auto for config parameters delta and max_sequences_per_example in Navigator Fine Tuning.
Feature: Add DP-FT capabilities to NavFT, mostly leveraging utilities that already existed for GPT-x.
Task: Add PII Replay to SQS Report for an Evaluate model
Feature: The release adds a new navigator module to gretel_client, which provides interfaces for Gretel's new Navigator Task Execution framework. The framework is in beta and will be available to select customers.
Feature: Attempt to find hybrid LLMs if none specified for TV2 classify
Fix: Support reading transform configs in high level sdk
Fix: Parse and validate JudgeWithLLM Task
Fix: Solves an issue with inference configs in production
Fix: Sets the proper domain name for serverless endpoints
Fix: Fixes privacy filtering for certain datasets.
Fix: Allow the client to optionally disable SSL verification for testing purposes.
Fix: Fixes updating connections when done in a hybrid context
Feature: Added a BigQuery integration module that provides Gretel <> BigFrames native support in the Gretel Python SDK
Feature: Add model license information to the /v1/inference/models endpoint (if available)
Fix: Update error handling for generation failures in the ACTGAN model.
Task: More specific email login rejection messages.
Fix: Updating error message for validation in Relational Workflows.
Fix: Fix an issue where the incorrect total is returned from the /v1/workflows/runs/tasks/search endpoint.
Fix: Make downloading a model from the HF hub more resilient, by increasing retries.
Fix: Ensure non-standard encoded characters can be extracted and loaded from Workflow database connectors.
Fix: Increased HTTP client timeout defaults for workflows to 30 seconds.
Fix: Fix race condition when performing model status updates.
Feature: Add additional error codes for workflows
Fix: Fix variable assignment error in LSTM model training when evaluation is skipped.
Feature: Support for JSON columns in MySQL and Postgres connectors has been removed
Feature: gretel_tabular workflow action no longer attempts JSON column normalization
Feature: gretel_tabular workflow action limits tables with JSON columns to NavFT and Transform (v1 and v2) models
Feature: Model training times in gretel_tabular workflow actions are now faster via reducing data preview size and deferring evaluation.
Fix: When using the gretel-inference-llm Helm chart, users can pass either apiKey or apiKeySecretRef in their values.yaml. Previously, when apiKey was provided, we attempted to create a k8s Secret but failed due to a YAML templating error; this is now fixed.
Fix: Fix bug in Navigator FT where generation would sometimes fail when group_training_examples_by is set.
Fix: Adds security context to initContainer used in the inference-llm chart
Task: Allow pulling the base image in the warm pool
Fix: Fix trust_remote_code for GPT-x
Task: Add datadog tracing http enabled
Task: Gate m1 features via configcat
Task: Add default llama suite config
Task: Update go-license logic to use pkg.go.dev
Task: Improve Jarvis API observability
Fix: Jarvis SQL templating issue
Task: Add httproutes for each LLM
Task: Add provenance and new GretelMetadata field, separate out types
Fix: Fix evaluation errors
Task: Add call_task method to Task interface
Fix: Use internal name for the gateway
Task: Add more restrictive limiter for get_model logs lambda
Fix: Fix an issue with improperly logging out a console session
Task: Change up query logic for record handlers to use one complete status call
Task: Remove notifications for github workflow runs
Task: Update transform V2 report style
Task: Add an optional configuration option passthroughImageFormat that allows for preserving the image name provided when calling image registries
Fix: Reintroduce blocking username changes
Fix: Fix an indexing bug for hybrid workflow image resolution
Fix: Increase the max number of tokens allowed for intent planning
Fix: Fix image name handling for the supervisor container by the gcc-controller
Fix: Fix the handling of group_training_examples_by in Navigator FT to work for multiple fields
Fix: Fix bug in GPT-X model loading
Feature: Enable fine-tuned GPT-x models to be run using vllm in generation, by use of the use_vllm generation parameter
Task: Remove dependency on registry authentication for gcc-controller
Feature: Add HTML report to Transform V2. In Hybrid mode, this is written to the output bucket along with the json report
Fix: Fix an issue where fake(seed=...) no longer worked for transform_v2 configs
Feature: Add date_time_shift, date_format, and date_time_format functions to transform_v2
Fix: Fix issue in gretel-hybrid's Azure Terraform module which did not respect the skip_kubernetes_resources flag. The Kubernetes namespace will no longer be managed by Terraform if the flag value evaluates to true
Feature: Update Navigator FT generation logging to include more detail on format errors in the invalid records
Fix: Fix occasional NavigatorFT crash due to mishandling of carriage returns resulting in possible malformed input file errors
Fix: Fix Privacy Metrics AIA graph height for a small number of columns in the SQS Report
Fix: Add error logging in NavFT for group_by/order_by
Feature: Add flag in gretel-data-plane Helm chart for conditionally disabling Argo Workflows controller resources. This allows for using an existing Argo Workflows controller deployment that has permissions to run Workflows. Default behavior keeps the deployment of the Argo Workflows controller resources
Task: Add the ability for navigator to suggest a prompt name
Fix: Increase log retry and add a sleep for retrieving workflow logs
Feature: Print the model ID when doing a trainer run to help with debugging
Feature: Set a few more prefilled endpoints that can be used for LLM templating
Fix: Fix an issue with Project invites not handling email case sensitivity. Project invites sent to e.g. guest@greteluser.com and Guest@greteluser.com should no longer create duplicate invites, and users receiving Project invites should no longer be missing invites.
Feature: Remove an unused CRD from our public chart
Note: With this release, we've switched to YYYY.MM.DD versioning
Fix: Allow classify within TV2 hybrid-only if deployed_llm_name is set
Feature: Add Privacy Metrics to Evaluate
Fix: Fix edit-in-place prompt and create mode for Navigator
Fix: Update the combined models image API response
Fix: Support auto param for NavFT num_input_records_to_sample to automatically choose a reasonable value for this training time param.
Fix: Update error messages for NavFT max token related errors.
Fix: Adds back Navigator validation for properly coercing non-str values into string values for the tabular data we return.
Feature: Add ner_optimize setting to Tv2 for configuring GPUs. If ner_optimize is set to true, a GPU will be configured; if false, a GPU won't be configured.
Fix: Bug where a timezone offset included in an input to Tv2 date_shift caused an error.
Feature: Workflow tasks that were active at the time of workflow run cancellation are now assigned cancelled status instead of errored status.
Feature: Allow parquet files to be uploaded for NavigatorFT jobs.
Fix: An issue that could lead to "Token count exceeds the limit" error in the Navigator batch jobs.
Add Gretel uploaded_data_source Action
More flexible data generation in the Navigator
Support nav-ft in gretel_tabular
Bugfix: Create project with adding _user_id for NotFoundException
Bugfix: Fix test for gretel-python-client
Bugfix: Fix increased Navigator FT runtime after privacy metrics release
Add Privacy Metrics to Report
Restrict project invites to external users based on the domain policy
Gretel's Console web application provides a flexible low-code interface for getting started with Gretel, and serves as the interface for managing your models, workflows, team, and billing.
Console is generally deployed daily, Monday - Thursday, certain holidays excepted.
Release Notes for Console are published every Monday for the previous week.
Bugfix: jobs controller issues related to models/models
Support Tv2 column classification in hybrid deployments
Bugfix: fix handling of none-like values in Navigator-Fine-Tuning
Bugfix: Improve handling of not-nullable zero-values (empty strings, 0-integers) in workflows
Bugfix: fix global locales in Tv2 configs
Bugfix: prevent recursion error in TabularDP
Add membership inference attack score to reports
Grant project access to domain owners for workflows and connections
Bugfix: Fix agent resolution of models image
Bugfix: Add transform (v1) and classify to new models image
Improved handling of column types in MySQL and BigQuery connectors
Set home to /run directory when running hybrid not as root
Update GPT-x DP fine tuning to use Poisson Sampler
Add text entity report to Tv2
Support resolving data from multiple sources in workflow actions
Support filtering by model_id and/or model_type on /v1/inference/models endpoint
Add quasi_identifier_count privacy_metrics to synthetic model configs
Add inference attack score to synthetics reports
Bugfix: Properly set default globals and classify configuration values in Tv2
Bugfix: Resolve permissions issues running hybrid GPT-x and Tv2 jobs as non-root
Bugfix: Validation of BigQuery connector with unspecified dataset
Bugfix: Properly recognize valid MSSQL identity column types
Bugfix: Fix model-model image resolution for tagged images
Improved searching of projects via new owned_by query parameter added to /projects endpoint
Improved error messaging related to JSONL errors in Navigator
Bugfix: Hybrid consolidated Model container did not have proper CUDA paths set, causing GPT-x jobs to fail
Improvements to Tv2 logging. When Tv2 is processing long NER text blobs, ensure that progress is being reported on regular intervals
Bugfix: Navigator inference requests would sometimes fail when using the Google Gemini Pro model
Text SQS update to the semantic similarity score. This update ensures that the score is penalized for an increasing number of synthetic records that are not semantically similar to any training records
Bugfix: Properly configure java.nio for Databricks connector
GPT-x now uses flash attention to speed up training and inference
Bugfix: Invalid model configs would sometimes result in 500 errors. 
Updates to NavFT to ensure rope_scaling_factor is consistent between model training and inference
Improvements to Navigator prompt templating
Increase NavFT rope_scaling_factor upper bound to 6 (from 2)
Hybrid deployments default to use new consolidated Model and Workflow docker images
Improvements to Workflow error logging
Bugfix: Workflows using Azure Blob Storage would sometimes commit 0 byte blocks causing write failures
Improvement: Updates to button styling throughout Console
Improvement: Consolidated styling of Lists
Fix: Prevent duplicate Transform_V2 run and ensure results are displayed.
Fix: No longer attempt to show ephemeral training inputs after training is complete.
Fix: Adjusted alignment of status chip in Workflow Runs list.
Fix: In editor, updated styling so non-clickable items aren't styled as a link.
Fix: Correct styling for Playground batch generate alert
Fix: Hide toast message for local file upload which says validation is not applicable
Feature: Added the ability to preview uploaded datasets within workflow builder.
No customer-facing updates in this release period
Improvement: Set a new default row count on Navigator, getting results to the user faster
Bugfix: Fix a double click on the Tabular/Natural toggle on Navigator causing the page to be in a broken state
Improvement: Add a button to upload one's own model config within workflow builder
Bugfix: Fix workflow builder submit validation not being able to revalidate, causing the user to not be able to resubmit
Feature: Persist main sidebar state across sessions and sync across app instances
Improvement: Add a "Choose a different file" button to Workflow Builder uploads
Bugfix: formatting of uploaded files in some cases when creating a workflow
Bugfix: Fix an issue with building the workflow properly when you select your model type before defining an input
Improvement: Moved the Save button to the bottom of the Creation Tiles in the Workflow Builder page
Feature: Whether the main sidebar navigation is collapsed or not will persist across sessions and sync across app instances
Improvement: Update the Gretel wordmark
Improvement: Remove deprecated route that was redirecting to the "From Scratch" blueprint flow
Fix: Address an issue with properly building a workflow after selecting a model type before defining an input
Fix: Improve file name detection when uploading files using the Workflow Builder
Improvement: The “Clear Prompt” button has been removed from the Navigator Playground prompt window to prevent unintentional clearing. A better alternative is coming soon.
Fix: Fixed a bug in workflow creation causing invalid configs for Hybrid projects
Fix: Updated the "Download" tooltip for Model Records, which incorrectly said all non-records were "data-previews".
Feature: Added a "License" button to Navigator which links to the legal license governing the current model
Improvement: Navigator Playground's saved prompts now have a default name, and the user is prevented from saving a prompt with an empty name.
Feature: New workflow creation experience that simplifies the process into a single page
Feature: Users can now save prompts they submit to the Navigator playground
Fix: Changes to the model config template didn't update the underlying config
Feat: Enable filtering by status from within the Model Project list view
No customer-facing updates in this release period
Feat: Updated Blueprint cards to use our new Categorical Label component instead of the Chip component for showing which cards are Notebook cards or Newly added cards.
Feat: Updated Project list to use our new Categorical Label component instead of the Chip component for showing whether projects are Cloud or Hybrid
Feat: Updated connection list item to use Categorical label instead of Chip for Source/Destination label
Fix: Download CSV button in Edit tabular dataset mode was not working; it works now!
Feature: Suggest users to invite to a project based on team membership
Fix: Improved state handling for the Navigator "Model" selector
Fix: Fixed design related issue where Error page was not using Gretel styling
Improvement: Change the way users' names are shown in Console to make it easier to identify users
Fix: Ensure users can continue through use cases when the default cloud output is selected
Feature: Add updated clear prompt button to Navigator
Fix: Improve error handling when user has an invite from a user that can't be found (e.g., the user was deleted)
Hybrid
Fix: Correctly set workflow output type to connection when in a hybrid project, fixes workflow creation issue
Fix: Allow hybrid-only users to create projects, if they don't have project creation restricted
Fix: Workflow builder yaml validation in Advanced tab now works for Hybrid Projects
Feature: Add Data privacy metrics for GPTx models
Feature: Add functionality for prefilling new saved prompt name with an AI generated suggestion.
Improvement: Connection creation UI updated for Hybrid projects
show New Connection button on project page
remove cloud project alert from connection creation wizard
Fix: Fixed issue where Error page was not using Gretel styling.
Fix: Allow hybrid only users to create projects
Fix: Ensure users can continue through use cases when the default cloud output is selected
Feature Launch: Released model improvements, an updated config template, and updated blueprint for Navigator Fine Tuning. It is now the default model recommended in the console as part of the General Availability launch.
Bugfix: project admins should not see the option to update other project admin permissions
Bugfix: invalid date in the org member table wasn't rendered properly
Remove Text/SQL toggle from playground
Bugfix: When using the Navigator FT model in workflows, we previously weren't setting the default num_records field (for gretel_model actions) or the default num_records_multiplier field (for gretel_tabular actions).
Update how Console decides whether to use gretel_model or gretel_tabular when helping the user build a Workflow Config via the Blueprints flow. This change fixes some potential for bugs in the final config, and better aligns with current backend capabilities.
Bugfix: Improve handling of models that could have data privacy metrics, but the options were disabled by the user.
Improve flexibility around what connection types are allowed to be used when creating a workflow. We previously constrained the allowed connection types (e.g., S3, Azure, BigQuery) when creating a workflow based on the type of model selected. This is no longer necessary, and so we've removed these constraints.
Bugfix: Don't attempt to render new data privacy metrics for models that don't support this metric.
Minor UX improvement for score displays in Models List.
Release notes for the Gretel Platform, June 2024
Add support for setting crawl limits when configuring Gretel Workflow object storage connectors. To set a limit, configure limit on your object storage source connector.
Improvements to Workflow config validation. Workflow action names are now validated to ensure uniqueness within a Workflow config.
Gretel BigQuery connections can now be created without specifying a dataset. You can instead configure the BigQuery dataset by passing bq_dataset when configuring a bigquery_source action.
Bugfix to database subsetting. When collecting batches of data, those batches previously needed to contain the same set of columns. This constraint would sometimes break subsetting if columns were sparsely populated.
Hybrid Model docker images have now been consolidated into a single Model image.
Hybrid Workflow docker images have now been consolidated into a single Workflow image.
Intermediate Workflow artifacts are now immediately cleaned up when a Workflow completes. When a Workflow is configured with a sink, any intermediate model artifacts produced by the Workflow are cleaned up and removed when the Workflow completes.
GPT-x, update config validation to limit epsilon
to be between 0.1 and 100.
GPT-x, ensure sampling probability is never larger than 1.0.
Bugfix: When writing objects to Azure Blob Storage, block sizes were written in chunks that were too small, leading to errors when writing larger objects. Objects are now written in larger 25 MB blocks.
Standardize Tv2 column properties. The column object can be used to access specific properties of a column that is being evaluated in Tv2. See the Tv2 reference for more details.
Update Tv2 to maintain referential integrity. By default, the gretel_tabular action when using Tv2 will ensure that PK/FK columns are not transformed. By setting run.encode_keys: true within the action, keys will be transformed to integers or UUIDs.
Bugfix in gretel_tabular where null foreign keys could be included when using subsetting.
Bugfix for Synthetic Quality Score for field correlation stability when missing values are in the data.
Bugfix for enforcing Teams runtime limits (max objects crawled, max bytes processed) on Workflows. These limits were previously being loaded from specific users; this is now fixed so the limits are loaded by Team if the user is a member of one.
Check out the blog for even more details!
This model is available via the models-navigator_ft container for Hybrid customers.
Improve error messages within Gretel Navigator
Added new partial_mask() filter to Tv2.
Update model names within Gretel Navigator
Bug fix for Gretel Navigator edit mode when adding numerical columns.
For GPT-x, the delta hyperparameter will only be automatically updated if dp: true. Previously it was updated regardless of whether DP was enabled, which was unnecessary.
Improvements to the SQS Text Statistical Score for measuring quality of synthetic natural language data.
Improved prompt validation for Gretel Navigator
When using Tv2 with gretel_tabular, columns will no longer be re-ordered to match their original order. Attempting to do so caused issues when Tv2 configs add or remove columns.
Tv2 NER will utilize GPUs when available.
Databricks destination connector optimizations.
Better handling for foreign key columns with null values in gretel_tabular.
Gretel configurations are declarative objects that specify how a model should be created. Configurations can be authored in YAML or JSON.
All Gretel models follow the same high-level configuration file format structure. All configurations include schema_version and name keys, as well as a models array that is keyed by a [model_id]. Within the [model_id] object, all model configurations have a data_source key.
[model_id] is replaced with the type of model you wish to train (e.g. navigator_ft, gpt_x, actgan, tabular_dp, or transform_v2).
The mapping between Gretel models and configuration model_id values is:
Tabular Fine-Tuning: navigator_ft
Text Fine-Tuning: gpt_x
Tabular GAN: actgan
Tabular DP: tabular_dp
Transform: transform_v2
data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
Supported storage formats include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.
Note: Some models have specific data source format requirements
data_source: __tmp__ can be used when the source file is specified elsewhere using:
the --in_data parameter via the CLI,
the corresponding parameter via the SDK, or
the dataset button via the Console.
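To make the shared structure concrete, here is a minimal illustrative sketch of a config file; the name, comments, and chosen model type are placeholders, and each model adds its own keys beneath [model_id]:

```yaml
# Illustrative sketch only -- real configs add model-specific parameters.
schema_version: "1.0"
name: my-model-config            # placeholder name
models:
  - navigator_ft:                # replace with gpt_x, actgan, tabular_dp, or transform_v2
      data_source: __tmp__       # or a path/URL to a CSV, JSON, or JSONL file
      # model-specific keys (e.g. params) go here
```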
Each Gretel model has different additional keys within the model_id object and unique configuration parameters specific to that model. For details on the configuration parameters for each model, see the specific model page:
LLM-based AI system supporting multi-modal data.
Gretel Tabular Fine-Tuning (navigator_ft) is an AI system combining a Large-Language Model pre-trained specifically on tabular datasets with learned schema based rules. It can train on datasets of various sizes (we recommend 10,000 or more records) and generate synthetic datasets with unlimited records.
navigator_ft excels at matching the correlations (both within a single record and across multiple records) and distributions in its training data across multiple tabular modalities, such as numeric, categorical, free text, JSON, and time series values.
navigator_ft is particularly useful when:
Your dataset contains both numerical / categorical data AND free text data
You want to reduce the chance of replaying values from the original dataset, particularly rare values
Your dataset is event-driven, oriented around some column that groups rows into closely related events in a sequence
The config below shows all the available training and generation parameters for Tabular Fine-Tuning. Leaving all parameters unspecified (we will use defaults) is a good starting point for training on datasets with independent records, while the group_training_examples_by parameter is required to capture correlations across records within a group. The order_training_examples_by parameter is strongly recommended if records within a group follow a logical order, as is the case for time series or sequential events.
data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV, JSONL, or Parquet format.
group_training_examples_by (str or list of str, optional) - Column(s) to group training examples by. This is useful when you want the model to learn inter-record correlations for a given grouping of records.
order_training_examples_by (str, optional) - Column to order training examples by. This is useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide group_training_examples_by.
params - Parameters that control the model training process:
num_input_records_to_sample (int or auto, required, defaults to auto) - This parameter is a proxy for training time. It sets the number of records from the input dataset that the model will see during training. It can be smaller (we downsample), larger (we resample), or the same size as your input dataset. Setting this to the same size as your input dataset is effectively equivalent to training for a single epoch. A starting value to experiment with is 25,000. If set to auto, we will automatically choose an appropriate value.
batch_size (int, required, defaults to 1) - The batch size per device for training. Recommended to increase this when differential privacy is enabled. However, if the value is too high, you could get an out-of-memory error. A good size to start with is 8.
gradient_accumulation_steps (int, required, defaults to 8) - Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory.
learning_rate (float, required, defaults to 0.0005) - The initial learning rate for the AdamW optimizer.
warmup_ratio (float, required, defaults to 0.05) - Ratio of total training steps used for a linear warmup from 0 to the learning rate.
weight_decay (float, required, defaults to 0.01) - The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer.
lora_alpha_over_r (float, required, defaults to 1.0) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2.
lora_r (int, required, defaults to 32) - The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters.
lora_target_modules (list of str, required, defaults to ["q_proj", "k_proj", "v_proj", "o_proj"]) - The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
rope_scaling_factor (int, required, defaults to 1) - Scale the base LLM's context length by this factor using RoPE scaling to handle datasets with more columns, or datasets containing groups with more than a few records. If you hit the error for maximum tokens, you can try increasing the rope_scaling_factor. Maximum is 6, and you may first want to try increasing to 2.
max_sequences_per_example (int, optional, defaults to auto) - This controls how examples are assembled for training and is automatically set to a suitable value with auto (default).
use_structured_generation (bool, optional, default false) - With DP, the model might have trouble learning the tabular format, so structured generation helps produce more valid records.
privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.
dp (bool, optional, default false) - Flag to turn on differentially private fine-tuning when a data source is provided.
epsilon (float, optional, default 8) - Privacy loss parameter for differential privacy. Lower values indicate higher privacy.
per_sample_max_grad_norm (float, optional, default 0.1) - Clipping norm for gradients per sample to ensure privacy. For each data sample, the gradient norm (magnitude of the gradient vector) is calculated. If it exceeds per_sample_max_grad_norm, it is scaled down to this threshold. This ensures that no single sample's gradient contributes more than a set maximum amount to the overall update.
generate - Parameters that control model inference:
num_records (int, required, defaults to 5000) - Number of records to generate. If you want to generate more than 50,000 records, we recommend breaking the generation job into smaller batches, which you can run in parallel.
temperature (float, required, defaults to 0.75) - The value used to control the randomness of the generated data. Higher values make the data more random.
repetition_penalty (float, required, defaults to 1.2) - The value used to control the likelihood of the model repeating the same token.
top_p (float, required, defaults to 1.0) - The cumulative probability cutoff for sampling tokens.
stop_params (optional) - Optional mechanism to stop generation if too many invalid records are being created. This helps guard against extremely long generation jobs that likely do not have the potential to generate high-quality data. To turn this parameter on, you must set two parameters:
invalid_record_fraction (float, required) - The fraction of invalid records generated by the model that will stop generation after the patience limit is reached.
patience (int, required) - Number of consecutive generations where the invalid_record_fraction is reached before stopping generation.
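As a hedged illustration, a Tabular Fine-Tuning config using the defaults documented above might look like the sketch below; all values simply restate the listed defaults, and the commented grouping/ordering columns are placeholders:

```yaml
schema_version: "1.0"
name: tabular-ft-example
models:
  - navigator_ft:
      data_source: __tmp__
      # group_training_examples_by: account_id   # placeholder grouping column
      # order_training_examples_by: event_time   # placeholder ordering column
      params:
        num_input_records_to_sample: auto
        batch_size: 1
        gradient_accumulation_steps: 8
        learning_rate: 0.0005
        warmup_ratio: 0.05
        weight_decay: 0.01
        lora_alpha_over_r: 1.0
        lora_r: 32
        lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
        rope_scaling_factor: 1
      privacy_params:
        dp: false
        epsilon: 8
        per_sample_max_grad_norm: 0.1
      generate:
        num_records: 5000
        temperature: 0.75
        repetition_penalty: 1.2
        top_p: 1.0
```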
If running this system in hybrid mode, the following instance specifications are recommended:
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required): Minimum Nvidia A10G, L4, RTX4090 or better CUDA compliant GPU with 24GB+ RAM and Ada or newer architecture. For faster training and generation speeds and/or rope_scaling_factor values above 2, we recommend GPUs with 40+GB RAM such as NVIDIA A100 or H100.
The default context length for the underlying model in Tabular Fine-Tuning can handle datasets with roughly 50 columns (fewer if modeling inter-row correlations using group_training_examples_by). Similarly, the default context length can handle event-driven data with sequences up to roughly 20 rows. To go beyond that, increase rope_scaling_factor. Note that the exact threshold (where the job will crash) depends on the number of tokens needed to encode each row, so decreasing the length of column names, abbreviating values, or reducing the number of columns can also help.
navigator_ft is a great first option to try for most datasets. However, for unique datasets or needs, other models may be a better fit. For heavily numerical tables or use cases requiring 1 million records or more to be generated (navigator_ft can generate batches of up to 130,000 records at a time), we recommend using actgan. It will typically be much faster at generating results in these scenarios. For text-only datasets where you are willing to trade off generation time for an additional quality boost, we recommend using gpt_x.
Given the model is an LLM, mappings from the training data often persist in the synthetic output, but there is no guarantee. If you require mappings across columns to persist, we recommend doing pre-processing to concatenate the columns or post-processing to filter out rows where the mappings did not persist.
Pre-trained models such as the underlying model in Tabular Fine-Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
This section covers the model training and generation APIs shared across all Gretel models.
Gretel offers the following synthetics models:
Data types: Numeric, categorical, text, JSON, event-based
Differential privacy: Optional
Formerly known as: Navigator Fine Tuning
Data types: Text
Differential privacy: Optional
Formerly known as: GPT
Data types: Numeric, categorical
Differential privacy: NOT supported
Formerly known as: ACTGAN
Data types: Numeric, categorical
Differential privacy: Required; you cannot run without differential privacy
This section compares features of different generative data models supported by Gretel APIs.
✅ = Supported
✖️ = Not yet supported
All Gretel Synthetics models follow a similar configuration file format structure. Here is an example model-config.yaml
[model_id] is replaced with the type of model you wish to train (e.g. navigator_ft, gpt_x, actgan, tabular_dp).
data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
Supported storage formats include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.
data_source: __tmp__ can be used when the source file is specified elsewhere using:
the --in_data parameter via the CLI,
the corresponding parameter via the SDK, or
the dataset button via the Console.
The params object contains key-value pairs that represent the available parameters that will be used to train a synthetic data model on the data_source.
Use the following CLI command to create and train a synthetic model.
--in_data is optional if data_source is specified in the config, and can be used to override the value in the config.
--in_data is required if data_source: __tmp__ is used in the config.
--name is optional, and can be used to override the name specified in the config.
Designate project
Create model object and submit for training
During training, the following model artifacts are created:
Use the gretel models run command to generate data from a synthetic model.
--model-id supports both a model uid and the JSON that models create outputs.
There are many different --param options, depending on the model.
The num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
--in_data is optional and used for conditional data generation when supported by the model.
Create and submit record handler
There are many different params options, depending on the model.
The num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
View results
Model type: Generative pre-trained transformer for text generation
Gretel Text Fine-Tuning simplifies the process of training popular Large Language Models (LLMs) to generate synthetic text. It offers support for differentially private training, ensuring data privacy, and includes automated quality reporting with Gretel's Text Synthetic Quality Score (SQS). This allows you to create labeled examples to train or test other machine learning models, fine-tune the model on your data, or prompt it with examples for inference.
To prompt the base model directly without fine-tuning, set data_source to null at initialization.
When fine-tuning Gretel Text Fine-Tuning models, these constraints apply:
Use 100+ examples if possible. With fewer than 100, just prompt the base model directly.
Providing only 1-5 records will cause an error.
If your training dataset is in a multi-column format, you MUST set the column_name.
data_source (required) - Use __tmp__ or a valid CSV, JSON, or JSONL file. Leave blank to skip fine-tuning and use the base LLM weights, for few-shot or zero-shot generation.
column_name (optional) - Column with text for training if multi-column input. This parameter is required if multi-column input is used.
params - Controls the model training process.
batch_size (optional, default 4) - Batch size per GPU/TPU/CPU. Lower if out of memory.
epochs (optional, default 3) - Number of training epochs.
weight_decay (optional, default 0.01) - Weight decay for the AdamW optimizer, between 0 and 1.
warmup_steps (optional, default 100) - Warmup steps for linear learning rate increase.
lr_scheduler (optional, default linear) - Learning rate scheduler type.
learning_rate (optional, default 0.0002) - Initial AdamW learning rate.
max_tokens (optional, default 512) - Max input length in tokens.
validation (optional) - Validation set size. An integer is an absolute number of samples.
gradient_accumulation_steps (optional, default 8) - Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory.
lora_r (optional, default 8) - Rank of the matrices that are updated. A lower value means fewer trainable model parameters.
lora_alpha_over_r (optional, default 1) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, values of 0.5, 1 or 2 work well.
target_modules (optional, default null) - List of module names or regex expression of the module names to replace with LoRA. When unspecified, modules will be chosen according to the model architecture (e.g. Mistral, Llama).
privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.
dp (optional, default false) - Flag to turn on differentially private fine-tuning when a data source is provided.
epsilon (optional, default 8) - Privacy loss parameter for differential privacy. Specify the maximum value available for model fine-tuning.
entity_column_name (optional, default null) - Column representing the unit of privacy, e.g. name or id. When null, record-level differential privacy will be maintained, i.e. the final model does not change by much when the input dataset changes by one record. When specified as e.g. user_id, user-level differential privacy is maintained.
generate (optional) - Controls generated outputs during training.
num_records (optional, default 10) - Number of outputs.
maximum_text_length (optional, default 100) - Max tokens per output.
General Configuration
schema_version (optional): Defines the version of the configuration schema.
name (optional): Name of the model configuration.
Models
models (required): List of model configurations.
gpt_x: Configuration for a specific model instance.
data_source (required): URLs or paths to the data files (CSV, JSON, JSONL). For temporary data, use __tmp__.
pretrained_model (optional): Pretrained LLM model to use. Defaults to "gretelai/gpt-auto".
prompt_template (optional): Template for prompting the model.
column_name (optional): Name of the column with text data if using multi-column input. Required parameter if using multi-column input.
validation (optional): Size of the validation set, specified as an integer (absolute number of samples).
Training Parameters
params (optional): Configuration for training parameters.
batch_size (default 4): Number of samples per batch per GPU/TPU/CPU.
epochs (optional): Number of complete passes through the training dataset.
steps (default 750): Total number of training steps to perform.
weight_decay (default 0.01): Weight decay coefficient for the AdamW optimizer, a regularization parameter.
warmup_steps (default 100): Number of steps for learning rate warmup.
lr_scheduler (default linear): Type of learning rate scheduler.
learning_rate (default 0.0002): Initial learning rate for the AdamW optimizer.
max_tokens (default 512): Maximum number of tokens for each input sequence.
gradient_accumulation_steps (default 8): Number of steps to accumulate gradients before updating model parameters.
Parameter-Efficient Fine-Tuning (PEFT) Parameters
peft_params (optional): Parameters for fine-tuning using PEFT.
lora_r (default 8): Rank of the low-rank adaptation matrix in LoRA.
lora_alpha_over_r (default 1.0): Scaling factor for the LoRA adaptation.
target_modules (optional): Specific modules to apply LoRA adaptation.
Privacy Parameters
privacy_params (optional): Configuration for differential privacy (DP).
dp (default false): Enable differentially private training using DP-SGD.
epsilon (default 8.0): Privacy budget parameter for DP.
delta (default "auto"): Privacy parameter for DP, usually a very small number.
per_sample_max_grad_norm (default 1.0): Clipping norm for gradients per sample to ensure privacy.
entity_column_name (optional): Column name for entity-level differential privacy.
Generation Parameters
generate (optional): Parameters controlling the generation of synthetic text.
num_records (default 10): Number of records to generate.
seed_records_multiplier (default 1): Multiplier for the number of rows emitted per prompt in prompt-based generation.
maximum_text_length (default 100): Maximum number of tokens per generated text.
top_p (default 0.89876): Probability threshold for nucleus sampling (top-p).
top_k (default 43): Number of highest probability tokens to keep for top-k sampling.
num_beams (default 1): Number of beams for beam search. Use 1 to disable beam search.
do_sample (default true): Enable sampling if true, otherwise use greedy search.
do_early_stopping (default true): Enable early stopping in beam search if true.
typical_p (default 0.8): Typical probability mass to consider in sampling.
temperature (default 1.0): Sampling temperature. Higher values increase randomness.
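As a hedged sketch, the parameters above could be assembled into a config like the following; values mirror the listed defaults, and the data source is a placeholder:

```yaml
schema_version: "1.0"
name: text-ft-example
models:
  - gpt_x:
      data_source: __tmp__          # placeholder; leave unset/null to prompt the base model
      pretrained_model: gretelai/gpt-auto
      column_name: null             # required if the input has multiple columns
      params:
        batch_size: 4
        epochs: 3
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: linear
        learning_rate: 0.0002
        max_tokens: 512
        gradient_accumulation_steps: 8
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1
      privacy_params:
        dp: false
        epsilon: 8
        delta: auto
      generate:
        num_records: 10
        maximum_text_length: 100
```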
Training Configuration: Define your data source and configure model parameters. Optionally, enable privacy settings.
Data Generation: Supports unconditional and prompt-based text generation. Configure generation parameters to control output features.
Make sure to set data_source and pretrained_model as per your requirements. Use column_name for specifying the text column in multi-column data inputs.
The Gretel Text Fine-Tuning model supports fine-tuning and inference of commercially viable large language models. Specific model information can be found on each model card linked below.
Supported Models
gretelai/gpt-auto: Automatically selects the best available LLM for model training
mistralai/Mistral-7B-Instruct-v0.2
meta-llama/Meta-Llama-3-8B-Instruct
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required). Minimum Nvidia A10G, RTX3090 or better CUDA compliant GPU with 24GB+ RAM is required to run basic language models. For fine-tuning on datasets with more than 1,000 examples, a NVIDIA A100 or H100 with 40+GB RAM is recommended.
Large-scale language models such as Gretel Text Fine-Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information on each model, please read the model cards linked under "model information".
Hello Navigator Fine-Tuning! Our newest multi-modal model is live!
Gretel has example configurations that may be helpful as starting points for creating your model.
For example, to generate realistic stock prices in a daily stock price dataset, we would set group_training_examples_by to "stock" and order_training_examples_by to "date". This ensures that correlations within each stock ticker are maintained across multiple days, and the daily price and volume fluctuations are reasonable.
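A minimal sketch of just those keys (column names taken from the example above; the rest of the config is omitted):

```yaml
models:
  - navigator_ft:
      data_source: __tmp__
      group_training_examples_by: stock   # keep each ticker's records together
      order_training_examples_by: date    # preserve day-to-day ordering within each ticker
```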
lr_scheduler (str, required, defaults to cosine) - The scheduler type to use. See the documentation of SchedulerType for all possible values.
Tabular Fine-Tuning - Gretel's flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.
Text Fine-Tuning - Gretel's model for generating privacy-preserving synthetic text using your choice of top performing open-source models.
Tabular GAN - Gretel's model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.
Tabular DP - Gretel's model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.
Need help choosing the right synthetic model? Check out our detailed model comparison based on real-world datasets.
Some models have specific data source format requirements.
Parameters are specific to each model type. See a full list of supported parameters on each model's page.
Gretel has example configurations that may be helpful as starting points for creating your model.
Initialize a model to begin using Gretel Text Fine-Tuning. Use the gpt_x tag to select this model. Here is a sample config to create and fine-tune a Gretel Text Fine-Tuning model. All Gretel models use a common interface for training synthetic data models from their config. See the reference for how to create and train a model.
pretrained_model (optional, defaults to meta-llama/Meta-Llama-3-8B-Instruct) - Gretel supports PEFT and LoRA for fast adaptation of pre-trained LLMs. Use a causal language model from the Hugging Face Hub.
peft_params - Gretel Text Fine-Tuning uses Low-Rank Adaptation (LoRA) of LLMs, which makes fine-tuning more efficient by drastically reducing the number of trainable parameters, updating weights of smaller matrices through low-rank decomposition.
delta (optional, default auto) - Probability of accidentally leaking information. It is typically set to be much less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.2. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
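For example, the 500-record case above could be expressed as the following sketch (epsilon shown at its default):

```yaml
privacy_params:
  dp: true
  epsilon: 8
  delta: 0.000004   # roughly 1/n^2 for n = 500 training records
```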
| Tag | navigator_ft | gpt_x | actgan | tabular_dp | timeseries_dgan |
| --- | --- | --- | --- | --- | --- |
| Type | Language Model | Language Model | Generative Adversarial Network | Statistical | Generative Adversarial Network |
| Model | Pre-trained Transformer | Pre-trained Transformer | GAN | Probabilistic Graphical Model | GAN |
| Privacy filters | ✖️ | ✖️ | ✅ | ✖️ | ✖️ |
| Privacy metrics | ✅ | ✖️ | ✅ | ✅ | ✖️ |
| Differential privacy | ✖️ | ✅ | ✖️ | ✅ | ✖️ |
|  | ✅ | ✅ | ✅ | ✅ | ✖️ |
| Tabular | ✅ | ✖️ | ✅ | ✅ | ✅ |
| Time-series | ✅ | ✖️ | ✖️ | ✖️ | ✅ |
| Natural language | ✅ | ✅ | ✖️ | ✖️ | ✖️ |
| Conditional generation | ✖️ | ✅ | ✅ | ✖️ | ✖️ |
| Pre-trained | ✅ | ✅ | ✖️ | ✖️ | ✖️ |
| Gretel cloud | ✅ | ✅ | ✅ | ✅ | ✅ |
| Hybrid cloud | ✅ | ✅ | ✅ | ✅ | ✅ |
| Requires GPU | ✅ | ✅ | ✅ | ✖️ | ✅ |
data_preview.gz - A preview of your synthetic dataset in CSV format.
logs.json.gz - Log output from the synthetic worker that is helpful for debugging.
report.html.gz - HTML report that offers deep insight into the quality of the synthetic model.
report-json.json.gz - A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically.
The gretel-synthetics
Python package release notes can be found on GitHub.
The gretel-client
Python package release notes can be found on GitHub.
Adversarial model that supports tabular data, structured numerical data, and high column count data.
The Gretel Tabular GAN model API provides access to a generative data model for tabular data. Gretel Tabular GAN supports advanced features such as conditional data generation. Tabular GAN works well with datasets featuring primarily numeric data, high column counts, and highly unique categorical fields.
This model can be selected using the actgan model tag. Below is an example configuration that may be used to create a Gretel Tabular GAN model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.
The configuration below contains additional options for training a Gretel Tabular GAN model, with the default options displayed.
data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV, JSON, or JSONL format.
embedding_dim (int, required, defaults to 128) - Size of the random sample passed to the Generator (z vector).
generator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the Residuals. Adding more numbers to this list will create more Residuals, one for each number. This is equivalent to increasing the depth of the Generator.
discriminator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the discriminator linear layers. A new Linear layer will be created for each number added to this list.
generator_lr (float, required, defaults to 2e-4) - Learning rate for the Generator.
generator_decay (float, required, defaults to 1e-6) - Weight decay for the Generator's Adam optimizer.
discriminator_lr (float, required, defaults to 2e-4) - Learning rate for the discriminator.
discriminator_decay (float, required, defaults to 1e-6) - Weight decay for the discriminator's Adam optimizer.
batch_size (int, required, defaults to 500) - Determines the number of examples the model sees each step. Importantly, this must be a multiple of 10 as specified by the Tabular GAN training scheme.
epochs (int, required, defaults to 300) - Number of training iterations the model will undergo during training. A larger number will result in longer training times, but potentially higher quality synthetic data.
binary_encoder_cutoff (int, required, defaults to 150) - Number of unique categorical values in a column before encoding switches from One Hot to Binary Encoding for the specific column. Decrease this number if you have Out of Memory issues. Will result in faster training times with a potential loss in performance in a few select cases.
binary_encoder_nan_handler (str, optional, defaults to mode) - Method for handling invalid generated binary encodings. When generating data, it is possible the model outputs binary encodings that do not map to a real category. This parameter specifies what value to use in this case. Possible choices are: "mode". Note that this will not replace all NaNs, and the generated data can have NaNs if the training data has NaNs.
cbn_sample_size (int, optional, defaults to 250,000) - If set, clustering for continuous-valued columns is performed on a sample of the data records. This option can significantly reduce training time on large datasets with only negligible impact on performance. When setting this option to 0 or to a value larger than the data size, no subsetting will be performed.
discriminator_steps (int, required, defaults to 1) - The discriminator and Generator take a different number of steps per batch. The original WGAN paper took 5 discriminator steps for each Generator step. In this case we default to 1, which follows the original Tabular GAN implementation.
log_frequency (bool, required, defaults to True) - Determines the use of log frequency of categorical counts during conditional sampling. In some cases, switching to False improves performance.
verbose (bool, required, defaults to False) - Whether to print training progress during training.
pac (int, required, defaults to 10) - Number of samples to group together when applying the discriminator. Must evenly divide batch_size.
data_upsample_limit (int, optional, defaults to 100) - If the training data has fewer than this many records, the data will be automatically upsampled to the specified limit. Setting this to 0 will disable upsampling.
auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column will be analyzed to determine if it is made up of DateTime objects. For each column that is detected, Tabular GAN will automatically convert DateTimes to Unix Timestamps (epoch seconds) for model training and then, after sampling, convert them back into a DateTime string.
conditional_vector_type (str, required, defaults to single_discrete) - Controls conditional vector usage in the model architecture, which influences the effectiveness and flexibility of the native conditional generation. Possible choices are: "single_discrete", "anyway". single_discrete is the original CTGAN architecture. anyway will improve efficiency of conditional generation by guiding the model towards the requested seed values.
conditional_select_mean_columns (float, optional) - Target number of columns to select for conditioning during training. Only used when conditional_vector_type=anyway. Use if the typical number of seed columns required for conditional generation is known. The model will be better at conditional generation when using approximately this many seed columns. If set, conditional_select_column_prob must be empty.
conditional_select_column_prob (float, optional) - Probability of selecting a column for conditioning during training. Only used when conditional_vector_type=anyway. If set, conditional_select_mean_columns must be empty.
reconstruction_loss_coef (float, required, defaults to 1.0) - Multiplier on reconstruction loss. Higher values should provide more efficient conditional generation. Only used when conditional_vector_type=anyway.
force_conditioning (bool or auto, required, defaults to auto) - When True, skips rejection sampling and directly sets the requested seed values in generated data. Conditional generation will be faster when enabled, but may reduce quality of generated data. If True with single_discrete, all correlation between seed columns and generated columns is lost! auto chooses a preferred value for force_conditioning based on the other configured parameters; the logs will show what value was chosen.
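As a hedged illustration, an actgan config restating the defaults documented above might look like the sketch below; the generate block's num_records is an illustrative generation count (num_records is the generation parameter supported by all synthetic models):

```yaml
schema_version: "1.0"
name: tabular-gan-example
models:
  - actgan:
      data_source: __tmp__
      params:
        embedding_dim: 128
        generator_dim: [256, 256]
        discriminator_dim: [256, 256]
        generator_lr: 0.0002
        generator_decay: 0.000001
        discriminator_lr: 0.0002
        discriminator_decay: 0.000001
        batch_size: 500
        epochs: 300
        binary_encoder_cutoff: 150
        discriminator_steps: 1
        log_frequency: true
        pac: 10
        conditional_vector_type: single_discrete
        force_conditioning: auto
      generate:
        num_records: 5000   # illustrative generation count
```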
Differential privacy is currently not supported for the Gretel Tabular GAN model.
To use conditional data generation (smart seeding), you can provide an input CSV containing the columns and values you want to seed with during data generation. (No changes are needed at model creation time.) Column names in the input file should be a subset of the column names in the training data used for model creation. All seed column data types (string, int, float) are supported when conditional_vector_type=anyway, and conditional generation is more efficient, so that setting is preferred when conditional generation is a priority. Conditional generation with string data type seed columns only is also available when conditional_vector_type=single_discrete.
Example CLI command to seed the data generation from a trained Tabular GAN model:
Example CLI to generate 1000 additional records from a trained Tabular GAN model:
The underlying model used is an Anyway Conditional Tabular Generative Adversarial Network (ACTGAN). There are Generator and Discriminator models that are trained adversarially. The model is initialized from random weights and trained on the customer provided dataset. This model is an extension of the popular CTGAN model. These algorithmic extensions improve speed, accuracy, memory usage, and conditional generation.
More details about the original underlying model can be found in the authors' excellent paper: https://arxiv.org/abs/1907.00503
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required). Minimum Nvidia T4 or similar CUDA compliant GPU with 16GB+ RAM is required to run basic language models.
In general, this model trains faster in wall-clock time than comparable LSTMs, but often performs worse on text or high cardinality categorical variables.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
Tabular GAN technical limitations:
When force_conditioning=False (the default with conditional_vector_type=single_discrete), conditional generation may not produce a record for every seeded row. So you might only get 90 records back after using a seed file with 100 records with smart seeding. Use conditional_vector_type=anyway to increase the likelihood of generating all requested seed rows. The parameter force_conditioning=True is also available to guarantee a row is generated for all seed rows, but with the possibility of lower data quality.
Statistical model for synthetic data generation with strong differential privacy guarantees.
The Gretel Tabular DP model API provides access to a probabilistic graphical model for generating synthetic tabular data with strong differential privacy guarantees. Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables.
This model can be selected using the tabular_dp model tag. Below is an example configuration to create a Gretel Tabular DP model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.
data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV format.
epsilon (float, required, defaults to 1) - Privacy loss parameter for differential privacy.
delta (float or auto, required, defaults to auto) - Probability of accidentally leaking information. It is typically set to be less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.5. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
infer_domain (bool, required, defaults to True) - Whether to determine the data domain (i.e. min/max for continuous attributes, number of categories for categorical attributes) exactly from the training data. If False, the domain must be provided in the config via the domain parameter.
domain - Domain of each attribute in the dataset. For numeric variables, only the min and max should be specified (int or float). For categorical variables, only the number of categories should be specified (int). See below for an example of a configuration with domain specified for a dataset containing three variables: state, age, and capital gains.
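A sketch of such a configuration is shown below. The exact schema for expressing the domain is an assumption; it is meant only to illustrate providing min/max bounds for numeric columns and category counts for categorical columns:

```yaml
models:
  - tabular_dp:
      data_source: __tmp__
      params:
        epsilon: 1
        infer_domain: false            # domain is supplied explicitly below
        domain:
          state: 50                    # categorical: number of categories
          age: [18, 90]                # numeric: [min, max]
          capital_gains: [0.0, 100000.0]
```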
To reference the default tabular-dp configuration in a workflow, use the following, e.g.
Example CLI script to generate 1000 additional records from a trained Tabular DP model:
The underlying model is a probabilistic graphical model (PGM), which is estimated using low dimensional distributions measured with differential privacy. This model follows three steps:
Automatically select a subset of correlated pairs of variables using a differentially private algorithm.
Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.
Estimate a PGM that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
More details about the model can be found in the paper Winning the NIST Contest: A scalable and general approach to differentially private synthetic data.
If running this system in local mode (on-premises), the following instance type is recommended. Note that a GPU is not required.
CPU: Minimum 4 cores, 16GB RAM.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
Conditional generation is not supported.
Privacy Filters are not supported. This is because privacy filters directly utilize training records to provide privacy protections. The process does not involve any addition of calibrated noise. Hence, enabling privacy filters would invalidate the differential privacy guarantee.
Gretel Tabular DP is not appropriate for time series data where maintaining correlations across sequential records is important, as the underlying model has an assumption of independence between records.
Gretel Tabular DP is not appropriate for text data if novel text is desired in the synthetic data. Use Gretel GPT to generate differentially private synthetic text.
Gretel Transform combines data classification with data transformation to easily detect and anonymize or mutate sensitive data.
Gretel Transform offers custom transformation logic, an expanded library of detectable and fakeable entities, and PII and custom entity detections.
Gretel Transform is a general-purpose programmatic dataset editing tool. Most commonly, Gretel customers use it to:
De-identify datasets, for example by detecting Personally Identifiable Information (PII) and replacing it with fake PII of the same type.
Pre-process datasets before using them to train a synthetic data model, for example to remove low-quality records (such as records containing too many blank values) or columns containing UUIDs or hashes, which are not relevant for synthetic data models since they contain no discernible correlations or distributions for the model to learn.
Post-process synthetic data generated from a synthetic data model, for example to validate that the generated records respect business-specific rules, and drop or fix any records that don't.
As with other Gretel models, you can configure Transform using YAML. Transform config files consist of two sections:
globals, which contains default parameter values (such as the locale and seed used to generate fake values) and user-defined variables applicable throughout the config.
steps, which lists transformation steps applied sequentially. Transformation steps can define variables (vars), and manipulate columns (add, drop, and rename) and rows (drop and update). In practice, most Transform configs contain a single step, but more steps can be useful if, for example, the value of column B depends on the original (non-transformed) value of column A, but column A must also eventually be transformed. In that case, the first step could set the new value of column B, leaving column A unchanged, before ultimately setting the new value of column A in the second step.
Below is an example config which shows this config structure in action:
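A config along these lines would look roughly like the following sketch. The key names follow the Transform reference later on this page; treat it as illustrative rather than exact:

```yaml
globals:
  locales:
    - en_CA
    - fr_CA
steps:
  # Step 1: add row_index, drop invalid rows, populate row_index, fake phone numbers
  - columns:
      add:
        - name: row_index
    rows:
      drop:
        - condition: row.user_id | isna
      update:
        - name: row_index
          value: index
        - entity: phone_number
          value: fake.phone_number()
  # Step 2: drop the sensitive user_id column and rename the phone number columns
  - columns:
      drop:
        - name: user_id
      rename:
        - name: phone_number_1
          value: cell_phone
        - name: phone_number_2
          value: home_phone
```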
The config above:
Sets the default locale for fake values to Canada (English) and Canada (French). When multiple locales are provided, a random one is chosen from the list for each fake value.
Adds a new column named row_index, initially containing only blank values.
Drops invalid rows, which we define here as rows containing blank user_id values. condition is a Jinja template expression, which allows for custom validation logic.
Sets the value of the new row_index column to the index of the record in the original dataset (this can be helpful for use cases where the ability to "reverse" transformations or maintain a mapping between the original and transformed values is important).
Replaces all values within columns detected as containing phone numbers (including phone_number_1 and phone_number_2) with fake phone numbers having area codes in Canada, since the default locale is set to en_CA and fr_CA in the globals section. fake is a Faker object supporting all standard Faker providers.
Drops the sensitive user_id column. Note that this is done in the second step, since that column is needed in the first step to drop invalid rows.
Renames the phone_number_1 and phone_number_2 columns to cell_phone and home_phone, respectively.
To get started with building your own Transform config for de-identification or pre/post processing datasets, see the Examples page for starter configs for several use cases, and the Reference page for the full list of supported transformation steps, template expression syntax, and detectable entities.
Below are a few complete sample configs to help you quickly get started with some of the most common Transform use cases.
Fall back to hashing entities not supported by Faker. If you don't require NER, remove the last rule (type: text -> fake_entities); doing so makes this config run more than 10x faster if your dataset contains free text columns.
If you need to preserve certain ID columns for auditability or to maintain relationships between tables, you can explicitly exclude these columns from any transformation rules.
You can use the built-in Python re library for regex operations. Below, we go a step further by listing all regular expressions we are looking to replace, along with their Faker function mappings, in the regex_to_faker variable, then iterating through them to replace all of their occurrences in all free text columns.
Transform can be used to post-process synthetic data to increase accuracy, for example by dropping invalid rows according to custom business logic, or by ensuring calculated field values are accurate.
We published a guide containing best practices for cleaning and pre-processing real-world data that can help train better synthetic data models. The config below automates several steps from this guide, and can be chained in a Workflow to run ahead of synthetic model training.
Below is a template to help you get started writing your own Transform config. It includes common examples, the complete list of Supported Entities, and helper text to guide you as you write your own Transform configuration.
Adversarial model for time series data.
The Gretel DGAN model API provides access to a generative data model for time-series data. This model supports time varying features, fixed attributes, categorical variables, and works well with many time sequence examples to train on.
This model can be selected using the timeseries_dgan model tag. An example configuration is provided below, but note that you will often need to update some of the options to match your input data. The DGAN model supports two input formats, wide and long, which we explain in detail in the Data format section. These formats and related parameters tell the DGAN model how to parse your data source as time-series. The training data (data source) is a table, for example a CSV file, using the common interface to train or fine-tune all Gretel models. See the reference example on how to Create and Train a Model.
The DGAN model will generate synthetic time-series of a particular length, determined by the max_sequence_len parameter. The training examples must also be that same length. As with all machine learning models, the more examples of these sequences are available to train the model, the better the model's performance. We provide several config parameters to tell the DGAN model how to convert your input CSV into these training example sequences.
We support 2 data styles to provide time-series data to the DGAN model: long and wide.
This is the most versatile data format to use. We assume the input table has 1 time point per row and use the config options to specify attributes, features, etc. For example, stock price data in this format might look like the following table:
| Date | Sector | Symbol | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|---|---|
| 2022-06-01 | 0 | AAPL | 125 | 135 | 115 | 126 | 100000 |
| 2022-06-02 | 0 | AAPL | 126 | 140 | 121 | 137 | 500000 |
| ... | 0 | ... | ... | ... | ... | ... | ... |
| 2022-06-30 | 0 | AAPL | 185 | 193 | 170 | 177 | 250000 |
| 2022-06-01 | 1 | V | 222 | 233 | 213 | 214 | 50000 |
| 2022-06-02 | 1 | V | 214 | 217 | 200 | 203 | 75000 |
| ... | 1 | ... | ... | ... | ... | ... | ... |
| 2022-06-30 | 1 | V | 234 | 261 | 212 | 236 | 150000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
Here, we use each stock (symbol) to split the data into examples. Each example time-series corresponds to max_sequence_len rows in the input. Each generated example in the synthetic data is then like a new stock, with a sequence of prices that exhibits similar types of behavior as observed in the training data.
In addition to the price information that changes each day, we also have an attribute, Sector, that is fixed for each example. The model can utilize this if certain sectors' stocks tend to be more volatile than others. In this case, Sector is also a discrete variable, and it must already be ordinal encoded in the input data passed to Gretel's APIs. So 0 might correspond to the technology sector, and 1 to the financial sector. Consider using sklearn's OrdinalEncoder to convert a string column.
Use the following config snippet for this type of setup, updating the column names as needed for your data:
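A sketch for the stock example above is shown below. The column names match the sample table; the max_sequence_len and sample_len values are illustrative, and the exact placement of the data-format options relative to params is an assumption:

```yaml
schema_version: "1.0"
models:
  - timeseries_dgan:
      data_source: __tmp__
      params:
        df_style: long
        example_id_column: Symbol    # one example per stock symbol
        time_column: Date
        attribute_columns: [Sector]  # fixed per example, already ordinal encoded
        discrete_columns: [Sector]
        max_sequence_len: 30         # illustrative; must match rows per example
        sample_len: 3                # must evenly divide max_sequence_len
```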
If there's not a good column to split the data into examples, we support automatic splitting when no example_id_column is provided (though attributes are not supported in this mode). We'll split the input data (after sorting on time_column, if provided) into chunks of the required length.
When using the auto splitting feature, note that the generated data will have an additional column, called example_id, with integer values. These values show how you should group the generated data for analyses. Temporal correlations within the same example_id value will match the training data, but any comparisons across different example_id values are not meaningful. So it's not recommended to concatenate all the generated examples into one very long sequence; there will be discontinuities every max_sequence_len rows, because each example is generated independently.
When using the long data style, variable sequence lengths are supported. So, when the number of rows in the input for each stock symbol is variable, data must be supplied in long format. The wide data style (described below) is not compatible with modeling variable sequence lengths.
Wide is an alternative data style for when there is exactly one feature (time varying variable). We assume each example is one row in the input table. Let's use just the closing price, but otherwise the same underlying data as in the long data style example above:
| Sector | 2022-06-01 | 2022-06-02 | ... | 2022-06-30 |
|---|---|---|---|---|
| 0 | 126 | 137 | ... | 177 |
| 1 | 213 | 203 | ... | 236 |
| ... | ... | ... | ... | ... |
With the sequence being represented as columns, each row is now one training example. Again we have the Sector attribute that is already ordinal encoded. The model doesn't need the Symbol column because no splitting into examples is required, so it should be dropped before sending the data to Gretel. The following config snippet will work with the above input:
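A sketch for the wide-format example above is shown below; as with the long-format sketch, the exact option placement and the max_sequence_len value are assumptions:

```yaml
schema_version: "1.0"
models:
  - timeseries_dgan:
      data_source: __tmp__
      params:
        df_style: wide
        attribute_columns: [Sector]
        discrete_columns: [Sector]
        max_sequence_len: 30    # illustrative; equals the number of time columns
        sample_len: 3
```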
Full list of configuration options for the DGAN model.
Data parameters:
df_style (string, required, defaults to 'long') - Either 'wide' or 'long', indicating the format style of the input data.
example_id_column (string, optional, defaults to null) - Column name used to split long style data into examples. Effectively performs a group-by operation on this column, and each group becomes an example. If null, the rows are automatically split into training examples based on max_sequence_len. Note the generated synthetic data will contain an example_id column when this automatic splitting is used.
attribute_columns (list of strings, optional, defaults to null) - Column names of fixed attributes that do not vary over time for each example sequence. Used by both 'wide' and 'long' formats. If null, the model will not use any attributes. Note that in 'long' format, these columns must be constant within each example, so there must be a 1-to-1 mapping from values in the example_id_column to each attribute column. Because of this, auto splitting (when example_id_column is null) does not currently support attribute columns.
feature_columns (list of strings, optional, defaults to null) - Column names of features, the variables that vary over time. Used by both 'wide' and 'long' formats. If specified, only these columns will be used as features. If null, then all columns in the input data that are not used in other column parameters will be the features.
time_column (string, optional, defaults to null) - Column name of date or time values to sort by before creating example sequences in 'long' format. If time_column='auto', a column that looks like a date or time will be selected automatically. If null, the order from the input data is used. Generated synthetic data will contain this column using an arbitrary set of values from one training example, so if different examples have different time ranges (e.g., because auto splitting was used), one sequence of time values will be used for all synthetic data.
discrete_columns (list of strings, optional, defaults to null) - Column names (either attributes or features) to model as categorical variables. DGAN will automatically model any string type columns as categorical variables, in addition to columns explicitly listed here.
max_sequence_len (int, required) - Maximum length of generated synthetic sequences and training example sequences. Sequences may be of variable length (i.e. some sequences may be shorter than max_sequence_len), and synthetic sequences will follow a similar pattern of lengths as the training data. To have DGAN automatically choose a good max_sequence_len and sample_len based on the training data (when example_id_column is provided), set both max_sequence_len and sample_len to auto.
sample_len (int, required) - Number of time points to produce from 1 RNN cell in the generator. Must evenly divide max_sequence_len. When max_sequence_len is smaller (<20), sample_len=1 is recommended. For longer sequences, the model often learns better when max_sequence_len/sample_len is between 10 and 20. Increasing sample_len is also an option if DGAN is running out of memory (receiving sigkill errors from the Gretel API), as it should lead to fewer parameters and a smaller memory footprint for the model. If using max_sequence_len: auto, then sample_len can also be set to auto.
data_source (str, required) - Input data, must point to a valid and accessible file URL. Often set automatically by the CLI (--in-data), or may use a local file with the SDK and upload_data_source=True.
Model structure parameters
apply_feature_scaling (bool, required, defaults to True) - Automatically scale continuous variables (in attributes or features) to the appropriate range as specified by normalization. If False, the input data must already be scaled to the appropriate range ([-1,1] or [0,1]) or the model will not work.
apply_example_scaling (bool, required, defaults to True) - Internally rescale continuous features in each example and model the range for each example. This helps the model learn when different examples have very different value ranges. E.g., for stock price data, there may be penny stocks (prices usually between $0.001 and $1), stocks in the $1-$100 range, others in the $100-$1000 range, and the Berkshire Hathaways in the $100,000 to $1,000,000 range.
normalization (string, required, defaults to 'MINUSONE_ONE') - Defines the internal range of continuous variables. Supported values are 'MINUSONE_ONE', where continuous variables are in [-1,1] and tanh activations are used, and 'ZERO_ONE', where continuous variables are in [0,1] and sigmoid activations are used. Also see apply_feature_scaling.
use_attribute_discriminator (bool, required, defaults to True) - Use a second discriminator that only operates on the attributes as part of the GAN. Helps ensure the attribute distributions are accurate. Also see attribute_loss_coef.
attribute_noise_dim (int, required, defaults to 10) - Width of the noise vector in the GAN generator used to create the attributes.
feature_noise_dim (int, required, defaults to 10) - Width of the noise vector in the GAN generator used to create the features.
attribute_num_layers (int, required, defaults to 3) - Number of hidden layers in the feed-forward MLP that creates attributes in the GAN generator.
attribute_num_units (int, required, defaults to 100) - Number of units in each layer of the feed-forward MLP that creates attributes in the GAN generator.
feature_num_layers (int, required, defaults to 1) - Number of LSTM layers in the RNN that creates features in the GAN generator.
feature_num_units (int, required, defaults to 100) - Number of units in each LSTM layer that creates features in the GAN generator.
Training parameters
batch_size (int, required, defaults to 1000) - Size of batches for training and generation. Larger values should run faster, so try increasing this if training is taking a long time. If batch_size is too large for the model setup, the memory footprint for training may exceed available RAM and cause crashes (sigkill errors from the Gretel API).
epochs (int, required, defaults to 400) - Number of epochs to train (iterations through the training data while optimizing parameters).
gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss.
attribute_gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss for the attribute discriminator (if enabled with use_attribute_discriminator).
attribute_loss_coef (float, required, defaults to 1.0) - When use_attribute_discriminator is True, the coefficient on the attribute discriminator loss when combined with the generic discriminator loss. Try increasing this parameter if the attribute discriminator is enabled but the attribute distributions of the generated data do not match the training data.
generator_learning_rate (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN generator.
discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN discriminator.
attribute_discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN attribute discriminator (if enabled with use_attribute_discriminator).
discriminator_rounds (int, required, defaults to 1) - Number of optimization steps of the discriminator(s) to perform for each batch. Some GAN literature uses 5 or 10 for this parameter to improve model performance.
generator_rounds (int, required, defaults to 1) - Number of optimization steps of the generator to perform for each batch.
Differential privacy is currently not supported for the Gretel DGAN model.
Conditional data generation (smart seeding) is currently not supported for the Gretel DGAN model.
Sample CLI to generate 1000 additional examples from a trained DGAN model:
Also see the example on how to Generate data from a model.
The underlying model is DoppelGANger, a generative adversarial network (GAN) specifically constructed for time series data. The model is initialized from random weights and trained on the provided dataset using the Wasserstein GAN loss. We use our PyTorch implementation of DoppelGANger in gretelai/gretel-synthetics based on the original paper by Lin et al. Additional details about the model can be found in that paper: http://arxiv.org/abs/1909.13403
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required). Minimum Nvidia T4 or similar CUDA compliant GPU with 16GB+ RAM is recommended to run the DGAN model.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
As an open beta model, there are several technical limitations:
Model training is sometimes unstable; if you see poor performance, retraining a few times with the same data and config can sometimes lead to notably better results.
All training and generated sequences must be exactly the same length (max_sequence_len).
Synthetic quality report is not supported.
DGAN does not model missing data (NaNs) for continuous variables. DGAN will handle some NaNs in the input data by replacing missing values via interpolation. However, if there are too many missing values, the model may not have enough data or examples to train and will throw an error. NaN or missing values will never be generated for continuous variables. (This does not apply to categorical variables, where missing values are fully supported and modeled as just another category.)
We have fine-tuned GLiNER on the entity types shown in the table below, although Gretel Transform will attempt to classify any arbitrary entity type specified.
| Entity | Label | Description | Regulations |
|---|---|---|---|
| account_number | Account Number | A unique identifier for a financial account, such as a bank account or credit card. | GDPR, HIPAA, CPRA |
| address | Address | A physical address, including street, city, state, and/or country. | GDPR, HIPAA |
| api_key | API Key | A unique identifier that authenticates a user, developer, or program to an application programming interface (API). | GDPR, CPRA |
| bank_routing_number | Bank Routing Number | An American bank association routing number. | GDPR, HIPAA, CPRA |
| biometric_identifier | Biometric Identifier | A unique physical characteristic of an individual used to identify them. | GDPR, HIPAA, CPRA |
| certificate_license_number | Certificate License Number | A unique, traceable number assigned to a certificate or license. | GDPR, HIPAA, CPRA |
| city | City | A city in the world. | GDPR, HIPAA, CPRA |
| company_name | Company Name | A company name. | |
| coordinate | GPS Coordinate | A combination of latitude and longitude into a single tuple. | GDPR, HIPAA |
| country | Country | A country in the world. | |
| credit_card_number | Credit Card Number | A credit card number, 12 to 19 digits long, used for payment transactions globally. | GDPR, HIPAA, CPRA |
| customer_id | Customer ID | A unique code or number that identifies a customer or entity. | |
| cvv | Credit Card Verification Value | A unique three or four digit number on a payment card that helps prevent fraud. | GDPR, HIPAA, CPRA |
| date | Date | A date. This includes most date formats, as well as the names of common world holidays. | HIPAA |
| date_of_birth | Date of Birth | A date of birth. | GDPR, HIPAA, CPRA |
| date_time | Date Time | A date and timestamp. This includes most date/time formats. | HIPAA |
| device_identifier | Device Identifier | A unique string of numbers and letters that identifies a device, such as a mobile phone or computer. | GDPR, HIPAA, CPRA |
| email_address | Email Address | An email address identifies the mailbox that emails are sent to or from. The maximum length of the domain name is 255 characters, and the maximum length of the local-part is 64 characters. | GDPR, HIPAA, CPRA |
| employee_id | Employee ID | An ID number associated with an employee to identify them within their system. | |
| first_name | First Name | A first name for a person. | GDPR, HIPAA, CPRA |
| health_plan_beneficiary_number | Health Plan Beneficiary Number | A unique number assigned to an individual by their health insurance provider to identify them within their system. | HIPAA |
| ipv4 | IP Address (version 4) | An Internet Protocol (IP) address for IPv4. | GDPR, HIPAA, CPRA |
| ipv6 | IP Address (version 6) | An Internet Protocol (IP) address for IPv6. | GDPR, HIPAA, CPRA |
| last_name | Last Name | A last name for a person. | GDPR, HIPAA, CPRA |
| license_plate | License Plate Number | A license plate number used to identify a vehicle. | GDPR |
| medical_record_number | Medical Record Number | A unique identifier for a patient's medical records in a healthcare system. | HIPAA |
| name | Name | A full person name, which can include first names, middle names or initials, and last names. | GDPR, HIPAA, CPRA |
| national_id | National ID | A unique identifier issued by a government to track its citizens and residents. | GDPR, HIPAA, CPRA |
| password | Password | A password used to log in to a computer network. | CPRA |
| phone_number | Phone Number | A telephone number. | GDPR, HIPAA, CPRA |
| pin | Personal Identification Number | A numerical code issued with a payment card that is required to be entered to complete various financial transactions. | GDPR |
| postcode | Postal Code | Postal code used by the United States Postal Service. | GDPR, HIPAA |
| ssn | US Social Security Number | A 9-digit number issued to US citizens, permanent residents, and temporary residents. The Social Security number has effectively become the United States national identification number. | GDPR, HIPAA, CPRA |
| state | USA State | A state in the United States of America. | GDPR |
| street_address | Street Address | A physical street address. | GDPR, HIPAA, CPRA |
| swift_bic | Business Identifier Code | A SWIFT code is the same as a Bank Identifier Code (BIC): a unique identification code for a particular bank, used when transferring money between banks (particularly for international wire transfers) and for exchanging other messages. | GDPR, HIPAA, CPRA |
| tax_id | Tax ID | A Taxpayer Identification Number (TIN) is an identification number used by the Internal Revenue Service (IRS) in the administration of tax laws. | GDPR, HIPAA, CPRA |
| time | Time | A timestamp of a specific time of day. | |
| unique_identifier | Unique ID | A Universally Unique Identifier (UUID). | |
| url | URL | A Uniform Resource Locator (URL). | GDPR, HIPAA, CPRA |
| user_name | User Name | A username used to uniquely identify a user on a computer network. | CPRA |
| vehicle_identifier | Vehicle Identification Number | A VIN is composed of 17 characters (digits and capital letters) and acts as a unique identifier for a vehicle. | GDPR, CPRA |
Transform supports the following transformation types for entities:
Fake: Replaces the value with synthetic data
Note that only entities supported by Faker can be faked.
Hash: Anonymizes by converting data to a unique alphanumeric value
Normalize: Ensures data consistency by removing spaces and punctuation marks
Expression: Allows custom transformations
Null: No transformation applied
Terminology and core concepts that make up Gretel Workflows.
A Workflow is the top-level organizational unit for a Workflow config. Workflows are part of projects and share the same project permissions; see Permissions for more details. Projects can have multiple workflows.
A Workflow is typically created for a specific use case or data source. You can think of a Workflow like a data pipeline or DAG.
The core configuration interface is a YAML config. You can edit and create Workflow YAML configs from the Console, SDK or CLI. These configs define what the workflow does, and when.
For a more detailed reference please see the Config Syntax docs.
Workflows are composed of many Workflow Actions. Actions are configured with inputs and produce outputs that determine the execution flow of the Workflow.
Each Workflow Action is responsible for integrating with some service and performing some processing on its set of inputs. These services could be external data stores (e.g. for reading source data or writing synthetic data), or Gretel (e.g. for training and running models).
Connections are used to authenticate a Gretel Action to an external service such as GCS or Snowflake. Each action is tied to at most one external service, and needs to be configured with a connection for the appropriate service.
For more detail on connections, including a full list of available connector types, see Connectors.
Triggers are managed as a property on the workflow config and can be used to schedule Workflows.
See Scheduled Workflows for more information.
A Workflow Run represents the concrete execution of a Workflow. When a Workflow is either manually triggered or triggered from a schedule, a Workflow Run is created.
To use data extracted by a connector as training input to a Gretel model, we need to understand how data is passed between Workflow Actions. Each Workflow Action produces a set of outputs that can be referenced by downstream actions as inputs.
These inputs are configured in each action's config block as template expressions. The properties of these inputs may take a number of different forms depending on the type of data being worked with.
The file data structure holds information about a data file, such as a CSV in object storage. Its properties are:
data (string) - the data handle
filename (string) - the stem of the file (e.g. events.csv)
source_filename (string) - the name of the file with any path prefix (e.g. sources/events.csv)
The table data structure holds information about a table extracted from a relational database or data warehouse. Its properties are:
data (string) - the data handle
name (string) - the name of the table
A dataset is an umbrella data structure containing collections of files and tables, as well as metadata like table relationships used internally by various actions. All actions output exactly one dataset. Its properties are:
files - list of file objects
tables - list of table objects
Some actions natively work with files, such as actions interfacing with object stores. Others natively work with tables, such as those connecting to relational databases. A dataset will contain both a file and a table representation of every data source. This allows you to create workflows that extract data from one kind of data source but write to a different type of destination.
file and table names are formatted with downstream compatibility in mind. An object store source action will preserve file names as-is and create database-friendly names for the corresponding table representation. Similarly, a database source action will preserve table names as-is and create file storage-friendly names for the corresponding file representation.
All Gretel Workflow actions output a dataset object that can then be referenced from a template expression in subsequent actions. Some actions require an entire dataset as input, while others require finer-grained inputs like file names and data handles. Each action documents its required inputs.
For more detail on template expression syntax, see the Config Syntax docs.
Automate creating, training and running Gretel Models.
Gretel Workflows offers two action types for working with Gretel models: gretel_model and gretel_tabular. Both take a collection of data output from a source action and create and run jobs for each file or table in the dataset. The main difference between these two actions is that gretel_tabular understands relationships between tables in a dataset (if any exist) and can guarantee referential integrity between tables is maintained in the output.
Reference docs for Gretel Models.
Gretel provides a number of different model types which may be utilized directly or combined via workflows. This page will outline the different categories of models that Gretel offers.
Gretel configurations are declarative objects that specify how a model should be created. Configurations can be authored in YAML or JSON. Each of the below models will be declared and configured via a model configuration.
For more information, please refer to the Model Configurations documentation. For more information about each of the specific model types, refer to their individual sections.
Gretel offers the following synthetics models:
Tabular Fine-Tuning - Gretel’s flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.
Data types: Numeric, categorical, text, JSON, event-based
Differential privacy: Optional
Formerly known as: Navigator Fine Tuning
Text Fine-Tuning - Gretel’s model for generating privacy-preserving synthetic text using your choice of top performing open-source models.
Data types: Text
Differential privacy: Optional
Formerly known as: GPT
Tabular GAN - Gretel’s model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.
Data types: Numeric, categorical
Differential privacy: NOT supported
Formerly known as: ACTGAN
Tabular DP - Gretel’s model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.
Data types: Numeric, categorical
Differential privacy: Required; you cannot run without differential privacy
You can learn more about Gretel Synthetics models here.
Gretel’s Transform model combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Gretel’s data classification can detect a variety of Supported Entities such as PII, which can be used for defining transforms.
We generally recommend combining Gretel Transform with Gretel Synthetics using workflows to redact or replace sensitive data before training a synthetics model.
You can learn more about Gretel Transform here.
You can use the flow chart below to help determine whether Transform, Synthetics (with or without Differential Privacy), or the combination is best for your use case.
If you decided that you should use Synthetics as part of your use case, you can use the next flow chart to help determine which Synthetics model may be best.
Use Gretel's privacy protection mechanisms to prevent adversarial attacks and better meet your data sharing needs.
In addition to the privacy inherent in the use of synthetic data, we can add supplemental protection by means of Gretel's privacy mechanisms. These file configuration settings help to ensure that the generated data is safe from adversarial attacks.
There are three privacy protection mechanisms:
Differential Privacy: Differential privacy is supported with Tabular Fine-Tuning (numeric, categorical, and free text data), Text Fine-Tuning (free text data only), and Tabular DP (numeric and categorical data only, when a very small ε < 5 is required). To enable differential privacy for Tabular Fine-Tuning and Text Fine-Tuning, set dp: true. Tabular DP always runs with differential privacy.
Similarity Filters: Similarity filters ensure that no synthetic record is overly similar to a training record. Overly similar training records can be a severe privacy risk, as adversarial attacks commonly exploit such records to gain insights into the original data. Similarity filtering is enabled by the privacy_filters.similarity configuration setting. Similarity filters are available for Gretel Tabular GAN.
Outlier Filters: Outlier filters ensure that no synthetic record is an outlier with respect to the training dataset. Outliers revealed in the synthetic dataset can be exploited by membership inference attacks, attribute inference attacks, and a wide variety of other adversarial attacks, making them a serious privacy risk. Outlier filtering is enabled by the privacy_filters.outliers configuration setting. Outlier filters are available for Gretel Tabular GAN.
Synthetic model training and generation are driven by a configuration file. Here is an example configuration with differential privacy enabled for Tabular Fine-Tuning.
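As a rough sketch, such a configuration might look like the following. The navigator_ft tag and the privacy_params nesting are assumptions; dp: true is the documented switch, and the epsilon value is illustrative:

```yaml
schema_version: "1.0"
models:
  - navigator_ft:              # Tabular Fine-Tuning (assumed model tag)
      data_source: __tmp__
      privacy_params:          # assumed nesting
        dp: true               # enable differential privacy (documented)
        epsilon: 8             # illustrative privacy budget
```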
Here is an example configuration with privacy filters set for Gretel Tabular GAN.
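A sketch of such a configuration is shown below. The actgan tag and the filter levels are assumptions; privacy_filters.similarity and privacy_filters.outliers are the documented settings:

```yaml
schema_version: "1.0"
models:
  - actgan:                    # Tabular GAN (assumed model tag)
      data_source: __tmp__
      privacy_filters:
        similarity: medium     # illustrative level
        outliers: medium       # illustrative level
```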
Your Data Privacy Score is calculated by measuring the protection of your data against simulated adversarial attacks.
Values can range from Excellent to Poor, and we provide a list detailing whether your Data Privacy Score is sufficient for a given data-sharing use case.
We provide a summary of the protection level against Membership Inference Attacks and Attribute Inference Attacks.
For each metric, we provide a breakdown of the attack results that contributed to the score.
Membership Inference Protection is a measure of how well-protected your data is from membership inference attacks. A membership inference attack is a type of privacy attack on machine learning models where an adversary aims to determine whether a particular data sample was part of the model's training dataset. By exploiting the differences in the model's responses to data points from its training set versus those it has never seen before, an attacker can attempt to infer membership. This type of attack can have critical privacy implications, as it can reveal whether specific individuals' data was used to train the model. To simulate this attack, we take a 5% holdout from the training data prior to training the model. Based on directly analyzing the synthetic output, a high score indicates that your training data is well-protected from this type of attack. The score is based on 360 simulated attacks, and the percentages indicate how many fell into each protection level.
Attribute Inference Protection is a measure of how well-protected your data is from attribute inference attacks. An attribute inference attack is a type of privacy attack on machine learning models where an adversary seeks to infer missing attributes or sensitive information about individuals from their data that was used to train the model. By leveraging the model's output, the attacker can attempt to predict unknown attributes of a data sample. This type of attack poses significant privacy risks, as it can uncover sensitive details about individuals that were not intended to be revealed by the data owners. Based on directly analyzing the synthetic output, an overall high score indicates that your training data is well-protected from this type of attack. For a specific attribute, a high score indicates that even when other attributes are known, that specific attribute is difficult to predict.
The Gretel Transform model can be applied to multiple related tables in a database at once, providing structured transformations without losing referential integrity across tables.
This functionality is executed through Gretel Workflows.
Use a native connector to extract data from your source.
Train and run models via the gretel_tabular action.
Optionally, write output data to a destination sink.
Optionally, write output reports to an object store of your choice.
The gretel_tabular action can be used to train and generate records from Gretel Models. It helps maintain referential integrity between related tables, and also allows specifying different model configs for different tables. This functionality is currently available only via the SDK. Read about Gretel Tabular.
The example notebooks above use a special connection, sample_mysql_telecom, which connects to a demo telecommunications database:
Automate and operationalize synthetic data using Gretel Workflows
Gretel Workflows provide an easy to use, config driven API for automating and operationalizing Gretel. Using Connectors, you can connect Gretel Workflows to various data sources such as S3 or MySQL and schedule recurring jobs to make it easy to securely share data across your organization.
A Gretel Workflow is constructed of actions that connect to various services including object stores and databases. These actions are then composed to create a pipeline for processing data with Gretel. In the example above:
A source action is configured to extract data from a source, such as S3 or MySQL.
The extracted source data is passed as inputs to Gretel Models. Using Workflows you can chain together different types of models based on specific use cases or privacy needs.
A destination action writes output data from the models to a sink.
Log into the Gretel Console.
Navigate to the Workflows page using the menu item in the left side bar and follow the instructions to create a new workflow.
The wizard-based flow will guide you through model selection, data source and destination creation, and workflow configuration.
Once completed, all workflow runs can be viewed for a particular workflow via the Workflow page, or for all workflows and models on the Activity page.
For more detailed step-by-step instructions, see Managing Workflows.
Workflows are configured using YAML. Below is an example workflow config that crawls an Amazon S3 bucket and creates an anonymized synthetic copy of the bucket contents in a destination bucket.
This second example workflow config connects to a MySQL database, creates a synthetic version of the database, and writes it to an output MySQL database.
Next, we'll dive deeper into the components that make up Workflows. You may also want to check out a list of supported sources and sinks here: Connectors.
Specifying primary and foreign keys on data sourced from object stores (where such metadata does not exist as it does in a relational database)
Removing a foreign key to break a cyclic table relationship
Renaming tables
The dataset_editor action provides a way to apply alterations like these and more to datasets. It accepts a dataset from some other action as input, and outputs a modified version of that dataset for downstream actions to consume.
A table relationship is used to relate two tables. The most common example is a foreign key constraint in a relational database.
A table_relationship contains the following properties:
For example, a relational database storing users and their sessions might have a foreign key user_id on the sessions table pointing to the users.id column. That key can be represented as a table_relationship:
To add or remove relationships via the dataset_editor action, use the add_table_relationships and remove_table_relationships attributes, both of which accept a table_relationship list.
There are two ways to rename tables. First, tables can be renamed individually:
Alternatively, common prefixes and suffixes can be added or removed in bulk. This is particularly useful for renaming tables sourced from object storage.
Note that both these renaming mechanisms only apply to tables in the dataset; the corresponding file representations in the dataset are unaffected.
Tables in a dataset can be removed entirely:
Actions downstream of drop-tables will have no awareness of the extraneous_data table.
Dropping a table from a dataset also drops the corresponding file representation.
Primary keys can be specified on tables.
The dataset editor can combine multiple datasets into one, allowing a single downstream action to operate on data extracted from disparate sources.
Table names must be unique across all datasets. The rename_all_tables option (see "Renaming tables" above) can be used to resolve name conflicts.
Note that actions only accept a single input action (input: s3-read in the example above). To use outputs from multiple actions in a single config, the other actions must be transitive dependencies via the defined input action. In this particular example, the s3-read action would need to include input: mysql-extract; the S3 action does not use any outputs from MySQL, but defining it as a dependency ensures outputs from both actions are accessible to the dataset editor action.
All dataset modifications above can be performed in a single action. The order of operations is:
datasets.rename_all_tables (and merge, if there are multiple datasets)
drop_tables
rename_tables
set_primary_keys
remove_table_relationships
add_table_relationships
The example below lists these in order for convenience; the actual order of these keys in your yaml config does not matter.
Transform configurations consist of (optional) global parameters followed by a sequence of transformation steps. Rows, columns, and values not touched by any transformation step are maintained as-is in the output. In other words, Transform configs are implicitly "passthrough".
Below is a "kitchen sink" config showing most of Transform capabilities. Don't worry if it looks overwhelming. We will dissect each step in the reference below.
The entire globals section is optional. You can use it to re-configure the following default entity detection and transformation settings:
classify: Dictionary of classification configuration parameters. Note that classification is only performed once for each model, and currently only maps entire columns to entities (searching for entities within free text fields, similarly to Transform's use_nlp option, is not currently supported in Transform). Subsequent model runs will assume the schema remains unchanged, and continue to use the column-to-entity mapping detected during the first run. NOTE: this sends column headers and a sample of data to Gretel Navigator or a hybrid-deployed Gretel Inference LLM to perform the classification.
enable: Boolean specifying whether to perform classification. Defaults to true when running within Gretel Cloud; defaults to false otherwise. When false, sets column.entity to none for all columns. When true, classification accuracy currently necessitates sending column names and a few (equal to num_samples) randomly selected values from each column to the Gretel Cloud.
num_samples: Number of randomly selected values from each column to use for classification. Defaults to 3, but you can set it to a higher number for more accurate classification, or a lower number if you have privacy or security concerns with sending randomly sampled values from your dataset. Setting num_samples: 0 will use only column names as the input to classification.
ner: Named entity recognition settings.
seed: Integer seed value used to generate fake values consistently. Defaults to null. When the seed is set to null, a random integer is generated at the beginning of each Transform run and used as the seed to transform values consistently within the current run (subsequent runs will generate their own random seed). This means rerunning with a null seed can cause inconsistent transforms (i.e. Alice -> Bob for the first run, Alice -> Jane for the second). If you set the seed to a specific number, transforms will be consistent across runs (i.e. Alice -> Bob always). The seed also doubles as a salt for the hash function. While there are privacy benefits to inconsistent transformations, we recommend setting a fixed seed for consistent transformation for use cases involving downstream synthetic data generation or analysis on the transformed dataset.
You can also access global constants in transformation steps. For example, a transformation step with value: globals.locales | first will set that field's value to the first locale in the list of locales.
steps contains an ordered list of data transformation actions to be executed in the same order as they are defined in the Transform config.
Each step can optionally contain a vars section, which defines custom variables to be used in any Jinja expression within the step. Unlike globals, vars are scoped to an individual step, and are initialized using Jinja expressions that are evaluated at the beginning of each step.
The columns section of each step contains transformations applying to an entire column at once: adding a new column, dropping (removing) a column, and renaming a column.
You can add a new blank column (which you can later fill in using a rows update action) by specifying its name and an optional position. If position is left unspecified, the new column is added as the last column. Initially all values in the new column will be null, but you can populate them using a rows.update rule. For example, the config section below adds a primary_key column, positions it as the first column in the dataset, and then populates it with the index of the row:
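A minimal sketch of that section, assuming the columns/rows syntax described on this page:

```yaml
steps:
  - columns:
      add:
        - name: primary_key
          position: 0          # make it the first column
    rows:
      update:
        - name: primary_key
          value: index         # zero-based row index
```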
To drop a column, specify its name in a columns drop action. For example, the config section below drops the FirstName and LastName columns:
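A minimal sketch of such a drop rule, using the same assumed syntax:

```yaml
steps:
  - columns:
      drop:
        - name: FirstName
        - name: LastName
```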
You can also drop columns based on a condition expressed as a Jinja template. condition has access to the entire Transform Jinja environment, as well as a few additional objects:
column: Dictionary containing the following column properties. For example, condition: column.entity in vars.entities_to_drop drops all columns matching the list of PII entities defined in the entities_to_drop variable.
name: the name or header of the column in the dataset.
entity: the detected PII entity type of the column, or none if the column does not match any PII entity type from the list under globals.classify.entities.
type: the detected data type of the column, one of "empty", "numeric", "categorical", "binary", "text", or "other".
position: zero-indexed position of the column in the dataset. For a dataset with 10 columns, column.position is equal to 0 for the first column and 9 for the last column.
You can rename a column by specifying its current name (name) and new name (value). For example, the config section below renames the MiddleName column to MiddleInitial:
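A minimal sketch of that rename rule:

```yaml
steps:
  - columns:
      rename:
        - name: MiddleName
          value: MiddleInitial
```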
Each step can also contain a rows section, listing transformation rules that process the dataset row by row. The two currently supported operations are drop and update, respectively allowing for selective removal of rows or modification of row data based on specified rules.
The drop operation within the rows section is used to remove rows from the dataset that meet certain conditions. The specified condition must be a valid Jinja template expression. Rows that satisfy the condition are excluded from the resulting dataset.
For instance, to exclude rows where the user_id column is empty, the configuration can be specified as follows:
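A minimal sketch, using the isna filter described later on this page:

```yaml
steps:
  - rows:
      drop:
        - condition: row.user_id | isna
```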
You can use more complex Jinja expressions for conditions that involve multiple columns, logical operators, or functions. condition has access to the entire Transform Jinja environment, as well as a few additional objects:
row: Dictionary of the row's contents. For example, row.user_id refers to the value of the user_id column within that row.
index: Zero-based index of the row in the dataset. Note that the index of a row may change during processing if previous steps delete or add rows. For example, the rule below drops every other record from the dataset:
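A minimal sketch of such a rule (assuming even-indexed rows are the ones dropped):

```yaml
steps:
  - rows:
      drop:
        - condition: index % 2 == 0   # drops rows 0, 2, 4, ...
```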
The update operation allows you to modify the values of specific rows. It can be used to set new values for columns, generate fake data, anonymize sensitive information, or apply any transformation that can be expressed as a Jinja template.
Each update operation must contain one of name, entity, type or condition, which are different ways to specify what to update, as well as value, which contains the updated value. name and entity must be strings or lists of strings, while condition and value are Jinja templates.
You can also optionally specify a fallback_value to be used if evaluating value throws an error. We recommend doing this when passing dynamic inputs to functions in value (for example, setting the Faker locale based on the contents of another column), preferably with a simple template (e.g. using static parameter values) for fallback_value to avoid further errors. In the event that both value and fallback_value fail to parse, the value will be set to the error message to aid with debugging.
condition, value, and fallback_value in row update rules have access to the row drop Jinja environment, including vars, row, and index, as well as a few additional objects:
column: Dictionary referring to the current column whose value is being changed. The properties of the column that can be accessed are:
name: the name of the column.
entity: the name of an entity that is in the column.
type: a Gretel extracted generic type for the column, one of: empty, numeric, categorical, text, binary, or other.
dtype: the Pandas dtype of the column (object, int32, etc.).
position: the numerical (index) position of the column in the table.
this: Literal referring to the current value that is being changed. For example, value: this is a no-op which leaves the current value unchanged, while value: this | sha256 replaces the current value with its SHA-256 hash.
Here's how the update operation works, with examples:
Setting a static value
The rule below sets the value of the column named status_column to the string processed for all rows.
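A minimal sketch; the string is quoted so the Jinja template evaluates to a literal rather than a variable name:

```yaml
steps:
  - rows:
      update:
        - name: status_column
          value: '"processed"'
```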
Incrementing an index
In the example below, we use the index special variable to set the value of the column row_index to the index of the record in the dataset, e.g. for a dataset containing 100 rows, the value of row_index for the last row will be 99.
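A minimal sketch of that rule:

```yaml
steps:
  - rows:
      update:
        - name: row_index
          value: index
```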
Generating fake PII
The example below replaces values in all columns detected to contain email addresses with fake email addresses. Notice that unlike previous examples, where the update rule was conditioned on name (the name of a column), the rule below is conditioned on entity (the type of entity contained within a column), which may match multiple columns. For example, if the dataset contains personal_email and work_email columns, the rule below will replace the contents of both with fake email addresses.
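A minimal sketch; the entity name matches the supported entities table above, and the Faker call is an assumption:

```yaml
steps:
  - rows:
      update:
        - entity: email_address
          value: fake.email()
```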
Modifying based on a condition
You can also conditionally update rows using flexible Jinja conditions. These conditions may match any number of columns and any number of rows (unlike name and entity conditions, which apply to all rows).
For example, you can set the value of the flag_for_review column to true for all rows where the value of the amount column is greater than 1,000:
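A sketch of such a rule; here the condition uses the column object (described below) to target the flag_for_review column, which is an assumption about how condition-based updates select columns:

```yaml
steps:
  - rows:
      update:
        - condition: column.name == "flag_for_review" and row.amount > 1000
          value: "true"
```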
Transform incorporates a classification feature to detect personal identifiable information (PII) within data. This feature simplifies selecting and transforming specific types of PII by tagging each column with its appropriate entity, if any.
Here is an example configuration that uses classification for detecting these 3 entities and applying transformations:
Since these align with Faker built-in entities, we could also write a single rule that applies to all detected entities:
With this setting, Transform will first classify entities in the dataset, then replace detected entities with faker-generated ones for each row in the dataset.
If your list of entities contains custom entities not supported by Faker, you can leverage fallback_value to apply other transformations. For example, the policy below attempts to fake all entities, and falls back to hashing unsupported entities. Since iban is supported by Faker while employee_id is not, the output of this policy will be fake IBAN values in the IBAN column, and hashes of the actual employee IDs in the employee ID column.
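A minimal sketch of such a policy, using the fake and hash filters described in the Jinja filters section below:

```yaml
steps:
  - rows:
      update:
        - entity: [iban, employee_id]
          value: column.entity | fake      # fake the detected entity type
          fallback_value: this | hash      # hash values whose entity Faker cannot generate
```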
If instead you wish to replace unsupported entities with the entity name between brackets, you could set fallback_value: "<" + column.entity + ">". You could also generate custom fake values; for example, to replace all entities not supported by Faker with the letter "E" followed by a random 6-digit number, you could set fallback_value: "E" + fake.pyint(100000, 999999) | string, or use Jinja's concatenation operator ~, which automatically converts integers to strings: fallback_value: "E" ~ fake.pyint(100000, 999999).
Similarly to column classification, Transform supports flexible Named Entity Recognition (NER) functionality including the ability to detect and transform custom entity types.
To get started, list the entities to detect under the globals.ner.entities section and use one of the four built-in NER transformation filters:
redact_entities replaces detected entities with the entity type. For example, "I met Sally" becomes "I met <first_name>".
fake_entities replaces detected entities with randomly generated fake values using the Faker function corresponding to the entity type. For example, "I met Sally" could become "I met Joe". When using fake_entities, ensure the name of the entity in the globals.classify.entities section exactly matches the name of a Faker function. Entities without a matching Faker function are redacted by default, and you can customize the fallback behavior using the on_error parameter, e.g. fake_entities(on_error="hash") hashes the non-Faker-matching entities instead of redacting them.
hash_entities replaces detected entities with salted hashes of their value. For example, "I met Sally" may become "I met 515acf74f".
label_entities is similar to redact_entities, but also includes the entity value. For example, "I met Sally" becomes "I met <entity type="first_name" value="Sally">". This can be useful for downstream post-processing (such as highlighting detected entities within the original text, or applying more complex replacement logic for specific entity types), both within Transform and externally.
You can tweak the ner_threshold
parameter if you notice too many or too few detections. You can think of the NER threshold as the level of confidence required in the model's detection before labeling an entity. Increasing the NER threshold decreases the number of detected entities, while decreasing the NER threshold increases the number of detected entities. Values between 0.5 and 0.8 are good starting points for avoiding false positives. Values below 0.5 are good if you don't want any leaked entities.
The sample config below shows how to apply fake_entities
(falling back to redact_entities
) for a list of custom entity types across all free text fields:
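A sketch of such a config; the custom entity names and the notes column are placeholders, and how you target free-text columns may differ in your setup:

```yaml
globals:
  ner:
    ner_threshold: 0.7
    entities: [first_name, last_name, employer]     # employer stands in for a custom entity type
steps:
  - rows:
      update:
        - name: notes                                     # hypothetical free-text column
          value: this | fake_entities(on_error="redact")  # fake detected entities, redact the rest
```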
Additionally, if you would like to speed up Named Entity Recognition by having it run on hardware with a GPU, you can set the globals.ner.ner_optimized
flag to true
:
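For example (shown here alongside an entity list):

```yaml
globals:
  ner:
    ner_optimized: true          # run NER on GPU-backed hardware
    entities: [first_name, last_name]
```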
Once you've done that, you can specify the Gretel Inference LLM model via Transform's globals.classify.deployed_llm_name
configuration field. This name should match the gretelLLMConfig.modelName
defined in the Gretel Inference LLM's values.yml
.
Here's how to perform the above PII detection using mistral-7b
deployed in your Gretel Hybrid Cluster:
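A sketch of such a config; the entity list is illustrative:

```yaml
globals:
  classify:
    deployed_llm_name: mistral-7b    # must match gretelLLMConfig.modelName in the LLM chart's values.yml
    entities: [phone_number, email, iban]
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
```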
Every Jinja environment in Transform can access the objects below:
Transform extends the capabilities of the standard Jinja filters with its own specific set. These include:
hash
: Computes the SHA-256 hash of a value. For example, this | hash
returns a hash of the value in the matched column in a row update rule. It can also take in its own salt, i.e. this | hash(salt="my-salt")
, but by default it uses the seed
value of the run as the salt. If the seed is unset, the hash will be different for the same values across runs.
isna
: Returns true
if a value is null or missing.
fake
: Invokes the Faker library to generate fake data of the entity that's passed to the filter. This is useful if the entity name is dynamic, e.g. column.type | fake
is equivalent to fake.first_name()
if column.type
is equal to "first_name"
.
lookup_locales
: Maps a pycountry Country to a list of Faker locales for that country. For example "Canada" | lookup_country | lookup_locales
returns ["en_CA", "fr_CA"]
.
normalize
: Removes special characters and converts Unicode strings to an ASCII representation.
tld
: Maps a pycountry Country object to its corresponding top-level domain. For example, "France" | lookup_country | tld
evaluates to .fr
.
Workflows are configured using YAML and can be managed from the Gretel Console, SDK, or CLI.
Workflows are configured using three top-level blocks: name
, trigger
, and actions
.
The name
field sets the name of the workflow. This name is used as the canonical reference to the workflow. Workflow names do not need to be unique to a project, but should be descriptive enough to uniquely describe the purpose of the workflow.
Triggers may be used to schedule recurring workflows using standard cron syntax. To schedule a workflow to run once daily, a workflow trigger might look like this:
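A sketch of a once-daily trigger (the exact key nesting under trigger is an assumption):

```yaml
trigger:
  cron:
    pattern: "0 0 * * *"   # once a day at midnight UTC
```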
The actions
block configures each step in the workflow.
Each action definition carries the same top-level configuration envelope with the following fields:
Template expressions are used to dynamically configure actions based on the result of a preceding action. Template expressions are denoted by curly braces, i.e. {<template-expression>}
.
Action outputs are accessed via the following form:
For example, a dataset output from a MySQL source action would be referenced like this:
You can append attribute components to the expression to dive into the output data structure. For example, to get the filename of each object from an Azure blob storage source action:
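A sketch covering the three forms described above; the action names mysql-read and azure-read are placeholders:

```yaml
# General form:
#   {outputs.<action name>.<output name>}

# A dataset output from a MySQL source action:
training_data: "{outputs.mysql-read.dataset}"

# Appending attribute components, e.g. the filename of each object
# crawled by an Azure Blob storage source action:
filename: "{outputs.azure-read.dataset.files.filename}"
```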
Consider the following workflow config:
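(A sketch of the kind of config the walkthrough below describes; connection IDs, the project ID, and the blueprint reference are placeholders.)

```yaml
name: sample-s3-workflow
actions:
  - name: s3-read
    type: s3_source
    connection: c_source_bucket              # placeholder connection ID
    config:
      bucket: my-source-bucket
      glob_filter: "*.csv"
  - name: model-train-run
    type: gretel_model
    input: s3-read
    config:
      project_id: proj_xxxxxxxx              # placeholder project ID
      model: synthetics/tabular-actgan       # placeholder blueprint reference
      run_params: {}
      training_data: "{outputs.s3-read.dataset.files.data}"
  - name: s3-write
    type: s3_destination
    connection: c_destination_bucket         # placeholder connection ID
    input: model-train-run
    config:
      bucket: my-destination-bucket
      filename: "{outputs.s3-read.dataset.files.filename}"
      input: "{outputs.model-train-run.dataset.files.data}"
```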
In this config the s3-read
action outputs a dataset
object. In the next action - model-train-run
- we use the template expression {outputs.s3-read.dataset.files.data}
to define the training_data used for that action. When executing the workflow, the expression is resolved to a concrete set of values based on the outputs of s3-read
.
If the s3-read
action finds two files, a.csv
and b.csv
, we will enumerate two concrete instances of the model-train-run
config with:
training_data: <data handle to a.csv>
training_data: <data handle to b.csv>
Each instance of the config will get passed into the model-train-run
action, resulting in two trained models, one model for a.csv
and another for b.csv
.
Additionally, an action config can include multiple template expressions referring to different lists. For example, the s3-write
action above is configured with two template expressions, one referencing the original source filename, the other referencing synthesized data. The workflow runtime will automatically resolve these expressions to align such that again, there are two concrete instances of the s3-write
config enumerated, with:
filename: "a.csv"
input: <data handle to the synthetic output from the model trained on a.csv>
filename: "b.csv"
input: <data handle to the synthetic output from the model trained on b.csv>
Workflows can be scheduled on a recurring basis
Using the trigger
field of a workflow config, you can configure your Workflow to run on a schedule with the cron
setting.
The following workflow config is configured to run every two hours:
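A sketch (the key nesting under trigger is an assumption; actions are omitted for brevity):

```yaml
name: run-every-two-hours
trigger:
  cron:
    pattern: "0 */2 * * *"   # at minute 0 of every second hour
actions: []
```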
You may use one of several pre-defined schedules in place of a cron expression.
Each workflow can only have a single active run at a time. If a workflow is still running while a subsequent scheduled workflow reaches the evaluation window, the next workflow run is prevented from launching until the current run completes.
Transform a dataset by applying a consistent model to all tables in the dataset. Note that the model config can be specified as a full object...
...or a reference to a blueprint template can be provided via from
:
You can apply different model configs to different tables by supplying table-specific configs:
To pass a subset of tables through unaltered by the model (e.g. for static reference data), specify tables to skip:
Instead of providing a specific model config, you can instruct the gretel_tabular
action to run trials to identify the best model config for each table. This is accomplished via the autotune
option inside model_config
fields (at either the root train
level to apply to all tables, or inside a table_specific_config
to apply to only a subset of tables).
Autotune objects accept the following fields:
Using all autotune defaults:
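A sketch of the train block in that case:

```yaml
train:
  model_config:
    autotune:
      enabled: true    # all other autotune fields fall back to their defaults
```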
...or a Tuner config can be spelled out explicitly:
Integrate Gretel with your existing data services using Workflows.
Connections are used to authenticate Workflow Actions. Each action is compatible with a specific type of connection. For example, the s3_source
action requires an s3
connection.
When creating a connection, you must select a project for the connection to reside in. The connection will inherit all the project permissions and user memberships.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example S3 connection
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection button in the top right corner.
Select the project where the connection will be stored.
Next, fill in your credentials and select Add Connection. The example below shows the flow for creating an Amazon S3 connection. All connections follow the same pattern.
The connection can now be used in a workflow.
Select the option to connect to an external data source, and choose the connection you created above, or create a new one.
Then run the following command from
Navigate to the Connections page using the menu item in the left sidebar.
Go to the Connection you'd like to update and click the three vertical dots (aka overflow actions).
Select Update Connection.
Modify the name and/or credentials and select Save. All workflows that use this connection will automatically use the new information.
Expanding on the example from Creating Connections
On the Connections list page, select the three dots (aka overflow actions) to the right of the connection you want to delete.
Select Delete
Expanding on the example from Creating Connections
Workflows can be managed from the Console, CLI or SDK.
To manage workflows from the Console, select the Workflows tab from the left side navigation bar. This will bring you to a list of Workflows where you can view more details for each Workflow.
Using the CLI you can view commands for working with workflows by running
Workflows can be created either from the Gretel Console or CLI.
Workflows are organized under projects and share the same permissions as the project they are owned by.
You can share a Workflow by sharing the project it is owned by. If a workflow references models or connections in a different project, be sure you have the appropriate level of access to that project.
The gretel_model
action can be used to train and generate records from Gretel Models. It is a good choice for non-relational data that can share the same model config. For relational data and providing table-specific models, you should use the gretel_tabular action instead.
Dataset outputs can only be referenced "as-is" in action configs, for example {outputs.extract.dataset}
. However, there are cases where a dataset output by one action needs to be edited before it can be used by a downstream action. Some examples include:
You can find additional Transform configuration templates .
entities
: List of PII entities that Transform's classification model will attempt to detect. Defaults to the following commonly used entities: [name, first_name, last_name, company, email, phone_number, address, street_address, city, administrative_unit, country, postcode]
. For best practices around customizing this list, see .
locales
: List of default Faker locales to use for fake value generation. Defaults to ["en_US"]
. fake
will randomly choose a locale from this list each time it generates a fake value, except when initialized with explicit locales, e.g. fake(["fr_FR"]).first_name()
. For a list of valid locales, see Faker's .
These expressions can leverage data
(a pandas DataFrame containing the entire dataset) to implement custom aggregations. For example, the config section below creates a new percent_of_total
column by storing the total
in vars
then dividing the value of each individual row by vars.total
:
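A rough sketch of what that could look like, assuming data behaves like a pandas DataFrame and the per-row value lives in a hypothetical amount column (the exact step layout may differ):

```yaml
steps:
  - vars:
      total: 'data["amount"].sum()'        # aggregate computed once over the whole dataset
    columns:
      add:
        - name: percent_of_total
    rows:
      update:
        - name: percent_of_total
          value: row.amount / vars.total   # divide each row's amount by the stored total
```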
vars
: Dictionary of variables defined under the vars
section of the current step
. For example, vars.total
refers to the value of the total
variable defined .
dtype
: Pandas data type of the column.
vars
: Dictionary of variables defined under the vars
section of the current step
. For example, vars.total
refers to the value of the total
variable defined .
You can use the built-in Faker implementation to generate fake entities. See for a list of supported entities and parameters.
Note: Column classification requires access to an LLM endpoint. When running within Gretel Cloud, this will use Gretel Navigator
. For Gretel Hybrid, classification needs to use a separately deployed LLM within your cluster. For full documentation on how to set up an LLM, see .
The classification model is capable of recognizing a variety of pre-defined and custom entities. While you can use arbitrary strings as entity names, it is beneficial to align with Faker entities if you plan to pass entity names to the fake
filter in order to generate fake values of the same entity.
For example, to detect and replace phone numbers, email addresses, employee IDs, and International Bank Account Numbers (IBAN), include phone_number
, email
, and iban
in the list of entities under globals.classify.entities
. These exactly match Faker's phone_number, email, and iban methods.
If you are running Transform in Gretel Hybrid and want to use classification, you'll need to first ensure you've installed the Gretel Inference LLM chart in your cluster. For full instructions on that installation, see .
fake
: Instantiation of the Faker generator, which defaults to the locale and seed specified in the globals
section. You can override these defaults by passing parameters, such as fake(locale="it_IT", seed=42)
, which will generate data using the Italian locale and 42 as the consistency seed.
random
is Python's built-in random module. For example, you could call random.randint(1, 10)
to generate an integer between 1 and 10.
Variables can be modified by filters. Filters are separated from the variable by a pipe symbol (|) and may have optional arguments in parentheses. Multiple filters can be chained. The output of one filter is applied to the next. Transform can use any of the standard Jinja filters, and also extends them with a few Gretel-specific filters:
lookup_country
: Attempts to map a country name to its corresponding pycountry Country object.
partial_mask(prefix: int, padding: str, suffix: int)
: This filter is similar to the MSSQL partial()
functionality. Given a value, this filter will retain the first N characters as the prefix, the last N characters as the suffix, and apply the padding between the prefix and suffix. If the original value is too short and would be leaked in the prefix, suffix, or a combination of the two, then the prefix and suffix are automatically adjusted to prevent this. For very short values, for example a single character value, only the padding may be returned. Example usage: value: this | partial_mask(2, "XXXXXX", 2)
date_parse
: Takes a string value and parses it into a Python datetime object. Date formats are those supported by Python's method.
date_shift
: Takes a date, either as a string or a date object, and randomly shifts it on an interval about the date. For example 2023-01-01 | date_shift('-5y', '+5y')
will result in a date object between 2018-01-01
and 2028-01-01
. Supports the same interval formats as Python's .
date_time_shift
: Takes a date, either as a string, a date or datetime object, and randomly shifts it on an interval about the date. For example 2023-01-01 00:00 | date_time_shift('-5y', '+5y')
will result in a datetime object between 2018-01-01 00:00
and 2028-01-01 00:00
. Supports the same interval formats as Python's .
date_format
: Takes a date and formats it per the passed in format. The default format is "%Y-%m-%d"
. Supports all formats for .
date_time_format
: Takes a datetime and formats it per the passed in format. The default format is "%Y-%m-%d %H:%M:%S"
. Supports all formats for .
For more detailed documentation please refer to the docs.
See the section for type
and config
details for actions that work with sources and sinks. See for type
and config
details for actions that interface with Gretel.
Workflows can be scheduled using cron expressions. Some examples include:
The gretel_tabular
action can be used to transform multiple tables while preserving referential integrity between those tables. gretel_tabular
also allows specifying different model configs for different tables, and even instructing Gretel to find optimal model configs for your data via .
By default, gretel_tabular
uses the default blueprint, but a different blueprint can be referenced...
With Gretel Workflows, you can train and run one or more Gretel models by connecting directly to your data sources and destinations. We support the following integrations for data inputs and outputs.
If you're creating Connections in a Hybrid environment, follow along here:
Select the .
Data sources also can be configured during a blueprint flow. Go to the or page and select a use case, for example "Generate synthetic data". This will start a guided flow to help you create a workflow.
To update the connection, follow the steps from Creating Connections to create a credentials file for the updated connection.
First, create a file on your computer containing a YAML workflow config. Then run the following command
Log into the Gretel Console, and navigate to the Workflows page. Select the New Workflow button.
Next, select the project in which you'd like to create the workflow. For first time users, a Default Project will automatically be created.
Now, select the model type. This depends on the use case. For example, if the goal is to generate synthetic data with differential privacy guarantees, choose Tabular DP.
The next step is selecting the remote data source. Since workflows are meant to be run automatically, you can't manually upload a data source. When creating and evaluating models, we recommend creating a model directly. That model can be referenced in the workflow config when it's time to operationalize your data generation.
Existing connections will show up automatically in the dropdown. If there are no connections, select New connection to define one. Add a descriptive connection name (separated by hyphens), and enter your credentials.
Provide data source and file name details. Gretel supports multiple files being processed at once. All files will create the same model type that was selected earlier in the flow.
Configure the destination. Generated data can be uploaded to the Gretel Cloud for easier access and sharing. It can also be output to a remote connection; either the same one that was configured as the data source or an entirely new one.
The final step is reviewing the workflow configuration. For an example of a workflow config, see the section below.
The workflow configuration can be edited from this page, and the model type updated. Once the workflow has been created, it will appear on the Workflows screen. Click the workflow list item to run the workflow. Workflow run activity details will be displayed, along with detailed logs for each step.
When a workflow has successfully completed, all generated artifacts will be available in the remote destination. This includes the generated data, quality and utility reports, and log files.
When creating a new workflow, select Run Now in the scheduling step.
Existing workflows can be run manually by navigating to the Workflow detail page, and selecting Run workflow now in the top right.
Building on the previous example from Creating Workflows
Workflows can be edited by navigating to that workflow, and clicking the configuration tab. Use the YAML code editor to modify workflow parameters, and select Done when completed. The new configuration will take effect for all subsequent runs of that workflow. To test the changes, select the Run workflow now button in the top right.
Building on the previous example from Creating Workflows
By default, processed files are output to the configured bucket path using the of the for the model run. If you want to customize the filename or path you can modify the destination action from YAML config after completing the wizard.
With the source and destination defined, select whether the workflow should run manually or on a schedule. We provide some pre-defined schedule types, but you can also create your own schedule using a cron expression. about cron expressions, or for help creating one.
project_id
The project to create the model in.
model
A reference to a blueprint or config location. If a config location is used, it must be addressable by the workflow action.
This field is mutually exclusive to model_config
.
model_config
Specify the model config as a dictionary. Accepts any valid model config.
This field is mutually exclusive to model
.
run_params
Parameters to run or generate records. If this field is omitted, the model will be trained, but no records will get generated for the model.
training_data
Data to use for training. This should be a reference to the output from a previous action.
dataset
A dataset object containing the outputs from the models created by this action.
table
string
The name of the table containing data (typically an id column) pointing to records on another table, e.g. "the table with the foreign key"
constrained_columns
string list
The columns populated with identifiers from the other table, e.g. "the foreign key column(s)"
referred_table
string
The name of the table containing data records being referenced by table
referred_columns
string list
The columns to which constrained_columns
point
name
An identifier for the action. Action names must be unique within the scope of a workflow.
type
The specific action type, e.g. s3_source
or gretel_model
. (See below)
connection
Pass a Connection ID to authenticate the action. This field is required for actions that connect to external services such as S3 or BigQuery.
input
Specify a preceding action as input to the current action.
config
The type-specific config.
| Cron expression | Description |
| --- | --- |
| 0 * * * * | Every hour at the beginning of the hour. |
| 0 2 * * 1-5 | 2:00 AM from Monday to Friday. |
| 0 0 * * 0 | Midnight (00:00) every Sunday. |
| 30 3 15 * * | 3:30 AM on the 15th day of every month. |

| Schedule | Description | Equivalent cron expression |
| --- | --- | --- |
| @yearly | Run once a year, midnight, Jan. 1st | 0 0 1 1 * |
| @monthly | Run once a month, midnight, first of month | 0 0 1 * * |
| @weekly | Run once a week, midnight between Sat/Sun | 0 0 * * 0 |
| @daily | Run once a day, midnight | 0 0 * * * |
| @hourly | Run once an hour, beginning of hour | 0 * * * * |
project_id
The project to create the model in.
train
(Training details, see following fields)
train.dataset
Data to use for training, including relationships between tables (if applicable). This should be a reference to a dataset output from a previous action.
train.model_config
A yaml object that accepts a few different shapes (detailed below): 1) a complete Gretel model config; 2) a reference to a blueprint or config location (from
); 3) an autotune
configuration.
train.skip_tables
(List of tables to pass through unaltered to outputs, see following fields)
train.skip_tables.table
The name of a table to skip, i.e. omit from model training and pass through unaltered.
train.table_specific_configs
(List of table-specific training details, see following fields)
train.table_specific_configs.tables
A list of table names to which the other fields in this object apply.
train.table_specific_configs.model_config
An alternative to the global default train.model_config
value defined above.
run
(Run details, see following fields)
run.encode_keys
(Transform models only.) Whether to transform primary and foreign key columns. Defaults to false
.
dataset
A dataset object containing the outputs from the models created by this action.
enabled
This boolean field must be explicitly set to true
to enable config tuning.
trials_per_table
Optionally specify the number of trials to run for each table. Defaults to 4.
metric
The metric to optimize for. Defaults to synthetic_data_quality_score
; also accepts field_correlation_stability
, field_distribution_stability
, principal_component_stability
.
tuner_config
The specific Gretel Tuner config to use. Like model_config
, this accepts either full configuration objects, or references to blueprints via from
.
Read
Users can list and view connection metadata.
Write
Users can access connections in a Workflow.
Administrator
Users can create, edit and delete connections.
Owner
Full control.
Read
Users can view workflows, runs, and logs.
Write
Users can create new workflows.
User can edit existing workflows.
Users can manually trigger existing workflows.
Users can delete existing workflows.
Administrator
Users can share workflows with other users.
Co-Owner
Full control.
Connect Gretel to object storage based services.
Gretel Workflows support connecting to the following object storage services
Object storage source actions will incrementally crawl buckets searching for files that have changed between runs. Crawled files can then be configured as inputs to Gretel Models.
A glob filter can be configured to ensure files matching a specific pattern are used as sources. Files not matching the pattern will be excluded from the crawl.
A glob filter is evaluated against the filename or key of the object.
The character *
is used to match any number of characters, excluding slashes.
Passing **
recursively matches any number of nested directories.
Checks are case-sensitive
Examples
| Glob filter | Object key | Match? |
| --- | --- | --- |
| *.txt | data.txt | Yes, any txt file in the current path will be matched. |
| *.png | data.json | No, json files do not have a png ending. |
| my/path/*.txt | my/path/data.txt | Yes, any txt files under my/path are matched. |
| **/*.csv | my/path/data.csv | Yes, any csv file is recursively matched. |
| ** | data.csv | Yes, all files are recursively matched. |
| */** | data.csv | No, any files in the root directory are excluded. |
In addition to a glob filter, a source action can be configured to crawl in a specific path. Configuring a path will narrow the set of objects that the bucket crawler will list or search.
Object storage destination actions can be configured to write the synthetic data outputs of a Gretel Model back to object storage.
Each object storage destination action can be configured to mirror the directory structure of the source bucket or can be configured to create new directory layouts.
For a list of supported file types, please refer to Inputs and Outputs.
Connect Gretel to database management systems.
Gretel workflows support connecting to the following databases:
MySQL
PostgreSQL
MS SQL Server
Oracle Database
Gretel database connectors can be used in Gretel Workflows with the gretel_tabular
action to operationalize synthetic data into your data pipeline.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
When reading from a database connector, the source action can extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the {database}_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with a {database}_destination
action.
Gretel database connectors can be used in Workflows with the gretel_tabular
action type. They are not compatible with gretel_model
.
Destination database must exist and contain placeholder tables/schema
While referential integrity can be maintained up to a certain extent, this functionality works best on single tables and we recommend processing either individual tables or views.
Connect Gretel to data warehouse platforms.
Gretel workflows support connecting to the following data warehouse platforms:
Gretel data warehouse connectors can be used in Gretel Workflows to operationalize synthetic data into your data pipeline.
When reading from a data warehouse connector, the source action can extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the {data_warehouse}_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with a {data_warehouse}_destination
action.
Do not use your input data warehouse connection as an output connector. This action can result in the unintended overwriting of existing data.
Connect to your Google Cloud Storage buckets.
Prerequisites to create a Google Cloud storage based workflow. You will need
A connection to Google Cloud Storage.
A source bucket.
(optional) A destination bucket. This can be the same as your source bucket, or omitted entirely.
Google Cloud Storage related actions require creating a gcs
connection. The connection must be configured with the correct permissions for each Gretel Action.
For specific permissions, please refer to the Minimum Permissions section under each corresponding action.
Gretel GCS connections require the following fields
private_key_json
This private key JSON blob is used to authenticate Gretel with GCS object storage resources.
In order to generate a private key you will first need to create a service account, and then download the key for that service account.
After the service account has been created, you can attach bucket specific permissions to the service account.
Please see each action's Minimum Permissions section for a list of permissions to attach to the service account.
Type
gcs_source
Connection
gcs
The gcs_source
action can be used to read an object from a GCS bucket into Gretel Models.
This action works as an incremental crawler. Each time a workflow is run the action will crawl new files that have landed in the bucket since the last crawl.
For details how the action more generally works, please see Reading Objects.
bucket
Bucket to crawl data from. Should only include the name, such as my-gretel-source-bucket
.
glob_filter
path
Prefix to crawl objects from. If no path
is provided, the root of the bucket is used.
recursive
Default false
. If set to true
the action will recursively crawl objects starting from path
.
dataset
The associated service account must have the following permissions for the configured bucket
storage.objects.list
storage.objects.get
Type
gcs_destination
Connection
gcs
The gcs_destination
action may be used to write gretel_model
outputs to Google Cloud Storage buckets.
For details how the action more generally works, please see Writing Objects.
bucket
The bucket to write objects back to. Only include the name of the bucket, e.g. my-gretel-bucket
.
path
Defines the path prefix to write the object into.
filename
Name of the file to write data back to. This file name will be appended to the path
if one is configured.
input
None
The associated service account must have the following permissions for the configured destination bucket
storage.objects.create
storage.objects.delete
(supports replacing an existing file in the bucket)
Create a synthetic copy of your Google Cloud Storage bucket. The following config will crawl a bucket, train and run a synthetic model, then write the outputs of the model back to a destination bucket while maintaining the same folder structure of the source bucket.
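A sketch of such a workflow; it mirrors the S3 example shown earlier, with the connection ID, project ID, and blueprint reference as placeholders:

```yaml
name: gcs-synthetic-copy
actions:
  - name: gcs-read
    type: gcs_source
    connection: c_gcs_connection             # placeholder connection ID
    config:
      bucket: my-gretel-source-bucket
      glob_filter: "*.csv"
      recursive: true
  - name: model-train-run
    type: gretel_model
    input: gcs-read
    config:
      project_id: proj_xxxxxxxx              # placeholder project ID
      model: synthetics/tabular-actgan       # placeholder blueprint reference
      run_params: {}
      training_data: "{outputs.gcs-read.dataset.files.data}"
  - name: gcs-write
    type: gcs_destination
    connection: c_gcs_connection
    input: model-train-run
    config:
      bucket: my-gretel-destination-bucket
      filename: "{outputs.gcs-read.dataset.files.filename}"
      input: "{outputs.model-train-run.dataset.files.data}"
```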
Check out this Benchmark report, running Gretel models on popular ML datasets, indexed by industry
You can use a Benchmark report like the one shown here to evaluate which Gretel model is best for your synthetic data goals.
For example, Gretel Tabular Fine-Tuning consistently generates synthetic data with high Synthetic Data Quality Score (SQS) on multiple types of tabular data, and Gretel ACTGAN is great for particularly long or wide datasets.
The publicly available datasets used in this results leaderboard were sourced from the following ML dataset repositories: UCI, Kaggle, and HuggingFace.
Tabular Fine-Tuning
84
83
1034.299
4.9 MB
tabular_mixed
21
41188
Tabular GAN
89
86
1080.329
4.9 MB
tabular_mixed
21
41188
Tabular Fine-Tuning
95
85
669.117
371 KB
tabular_mixed
17
4521
Tabular GAN
87
87
148.713
371 KB
tabular_mixed
17
4521
Tabular Fine-Tuning
93
53
2126.556
89 KB
time_series
16
750
Tabular GAN
60
97
62.725
89 KB
time_series
16
750
Tabular Fine-Tuning
87
94
1334.324
2.4 MB
tabular_numeric
24
16519
Tabular GAN
78
95
444.495
2.4 MB
tabular_numeric
24
16519
Tabular Fine-Tuning
95
75
368.91
52 KB
tabular_numeric
7
1728
Tabular GAN
86
74
67.322
52 KB
tabular_numeric
7
1728
Tabular Fine-Tuning
83
97
507.772
5.6 MB
tabular_numeric
5
103886
Tabular GAN
87
77
868.921
5.6 MB
tabular_numeric
5
103886
Tabular Fine-Tuning
93
91
874.715
1.9 MB
tabular_mixed
14
19158
Tabular GAN
91
91
417.622
1.9 MB
tabular_mixed
14
19158
Tabular Fine-Tuning
91
73
2008.613
274 KB
tabular_mixed
37
1470
Tabular GAN
75
89
98.267
274 KB
tabular_mixed
37
1470
Tabular Fine-Tuning
89
95
2277.711
11.4 MB
time_series
29
19735
Tabular GAN
75
84
653.85
11.4 MB
time_series
29
19735
Tabular Fine-Tuning
87
90
1805.024
1.7 MB
tabular_mixed
33
7043
Tabular GAN
80
91
265.678
1.7 MB
tabular_mixed
33
7043
Tabular Fine-Tuning
90
81
1144.394
822 KB
time_series
15
9357
Tabular GAN
69
86
214.324
822 KB
time_series
15
9357
Tabular Fine-Tuning
82
78
376.864
4 KB
tabular_numeric
5
150
Tabular GAN
78
58
56.388
4 KB
tabular_numeric
5
150
Tabular Fine-Tuning
92
90
775.856
90 KB
tabular_numeric
12
1599
Tabular GAN
66
92
60.729
90 KB
tabular_numeric
12
1599
Tabular Fine-Tuning
94
88
742.663
281 KB
tabular_numeric
12
4898
Tabular GAN
82
89
120.683
281 KB
tabular_numeric
12
4898
Tabular Fine-Tuning
89
65
2170.874
3 MB
tabular_numeric
28
21643
Tabular GAN
87
88
710.705
3 MB
tabular_numeric
28
21643
Tabular Fine-Tuning
90
79
815.833
3.6 MB
tabular_mixed
15
32561
Tabular GAN
92
80
683.149
3.6 MB
tabular_mixed
15
32561
Tabular Fine-Tuning
92
53
723.425
18 KB
tabular_numeric
14
303
Tabular GAN
73
73
47.015
18 KB
tabular_numeric
14
303
Tabular Fine-Tuning
83
76
649.074
19 KB
tabular_numeric
11
699
Tabular GAN
75
78
47.237
19 KB
tabular_numeric
11
699
Tabular Fine-Tuning
85
84
12805.043
75 MB
tabular_numeric
55
581012
Tabular GAN
92
81
3135.347
75 MB
tabular_numeric
55
581012
Tabular GAN
86
50
94297.403
311 MB
tabular_numeric
1349
27000
Tabular GAN
83
77
22133.246
743 MB
tabular_mixed
42
4898430
Tabular GAN
83
50
154719.489
421 MB
tabular_numeric
967
63360
Tabular Fine-Tuning
93
98
1353.606
154 MB
tabular_numeric
15
1446956
Tabular GAN
92
99
10106.598
154 MB
tabular_numeric
15
1446956
Tabular Fine-Tuning
99
87
511.645
24 MB
tabular_numeric
11
1000000
Tabular GAN
95
85
5350.631
24 MB
tabular_numeric
11
1000000
Tabular Fine-Tuning
99
89
445.614
614 KB
tabular_numeric
11
25010
Tabular GAN
91
90
419.063
614 KB
tabular_numeric
11
25010
Tabular Fine-Tuning
67
94
1608.793
262 MB
tabular_numeric
12
5749132
Tabular GAN
85
92
33233.547
262 MB
tabular_numeric
12
5749132
Tabular Fine-Tuning
92
92
570.957
38 MB
tabular_mixed
9
1017209
Tabular GAN
89
89
5040.424
38 MB
tabular_mixed
9
1017209
Connect to your Azure Blob containers.
Prerequisites to create an Azure Blob based workflow. You will need
A connection to Azure Blob.
A source container.
A destination container. This can be the same as your source container.
Azure Blob related actions require creating an azure
connection. The connection must be configured with the correct permissions for each Gretel Action.
For specific permissions, please refer to the Minimum Permissions section under each corresponding action.
There are three ways to authenticate a Gretel Azure Blob Connection, each method requires different fields for connection creation:
name
Display name of your choosing used to identify your connection within Gretel.
account_name
Name of the Storage Account.
access_key
default_container
Default container to crawl data from. Different containers can be chosen at the azure_source
and azure_destination
actions.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. connection_target_type
is optional; if omitted, the connection can be used for both source and destination action. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Azure Blob connection using access key credentials:
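A sketch of such a file; which fields live under config versus credentials is an assumption, so check it against your connection type reference:

```yaml
# azure_connection.yaml (illustrative)
type: azure
name: my-azure-connection
config:
  account_name: mystorageaccount
  default_container: my-default-container
credentials:
  access_key: "<storage account access key>"
```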
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection
button.
Step 1, choose the Type for the Connection - Azure Blob.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
name
Display name of your choosing used to identify your connection within Gretel.
account_name
Name of the Storage Account.
client_id
Application (client) ID.
tenant_id
Directory (tenant) ID.
username
Email of the Service Account.
entra_password
Password of the Service Account.
default_container
Default container to crawl data from. Different containers can be chosen at the azure_source
and azure_destination
actions.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. connection_target_type
is optional; if omitted, the connection can be used for both source and destination action. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Azure Blob connection using Entra ID credentials:
Now that you've created the credentials file, use the CLI to create the connection
Console support for creating Azure Blob connections using Entra ID is coming soon. For now, you can create connections using Entra ID via CLI or SDK and then use those connections in Console.
name
Display name of your choosing used to identify your connection within Gretel.
account_name
Name of the Storage Account.
sas_token
default_container
Default container to crawl data from. Different containers can be chosen at the azure_source
and azure_destination
actions.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. connection_target_type
is optional; if omitted, the connection can be used for both source and destination action. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Azure Blob connection file using a SAS token:
Now that you've created the credentials file, use the CLI to create the connection
Console support for creating Azure Blob connections using SAS Tokens is coming soon. For now, you can create connections using SAS Tokens via CLI or SDK and then use those connections in Console.
Type
azure_source
Connection
azure
The azure_source
action can be used to read an object from an Azure Blob container into Gretel Models.
This action works as an incremental crawler. Each time a workflow is run the action will crawl new files that have landed in the container since the last crawl.
For details how the action more generally works, please see Reading Objects.
container
Container to crawl data from. If empty, will default to default_container
.
glob_filter
path
Prefix to crawl objects from. If no path
is provided, the root of the container is used.
recursive
Default false
. If set to true
the action will recursively crawl objects starting from path
.
dataset
The associated service account must have the following permissions for the configured container
Storage Blob Data Reader role permissions, or higher
The SAS Token must have the following permissions for the configured container or storage account
List
Read
The SAS Token added for the storage account needs to have Container and Object allowed resource types.
Type
azure_destination
Connection
azure
The azure_destination
action may be used to write gretel_model
or gretel_tabular
outputs to Azure Blob containers.
For details how the action more generally works, please see Writing Objects.
container
Container to write data to. If empty, will default to default_container
.
path
Defines the path prefix to write the object into.
filename
Name of the file to write data back to. This file name will be appended to the path
if one is configured.
input
None
The associated service account must have the following permissions for the configured container
Storage Blob Data Contributor role permissions, or higher
The SAS Token must have the following permissions for the configured container or storage account
Create
List
Write
The SAS Token added for the storage account needs to have Container and Object allowed resource types.
Create a synthetic copy of your Azure Blob container. The following config will crawl a container, train and run a synthetic model, then write the outputs of the model back to a destination container while maintaining the same folder structure of the source container.
Connect Gretel to your Amazon S3 buckets.
This guide will walk you through connecting source and destination S3 buckets to Gretel. Source buckets will be crawled and used as training inputs to Gretel models. Model outputs get written to the configured S3 destination.
Prerequisites to create an Amazon S3 based workflow. You will need
A connection to Amazon S3.
A source bucket.
(optional) A destination bucket. This can be the same as your source bucket, or omitted entirely.
Amazon S3 related actions require creating an s3
connection. The connection must be configured with the correct IAM permissions for each Gretel Action.
You can configure the following properties for a connection
access_key_id
Unique identifier used to authenticate and identify the user.
secret_access_key
Secret value used to sign requests.
The following policy can be used to enable access for all S3 related actions
More granular permissions for each action can be found in the action's respective Minimum Permissions section.
The following documentation provides instruction for creating IAM users and access keys from your AWS account.
You can configure your Gretel S3 connector to use an IAM role for authorization. Using IAM roles you can grant Gretel systems access to your bucket without sharing any static access keys.
Before setting up your IAM role, you must first locate the Gretel Project ID for the project you wish to create the connection in. You will use the project id as the external id for the IAM role.
You may find your Gretel Project ID from the Console, SDK or CLI using the following instructions:
Using the CLI you can query for projects by name and use the project_guid
field to retrieve the external id for the IAM role.
Navigate to the Projects page, and select Copy UID from the project drop-down on the right.
This should automatically copy the project id to your clipboard.
Running the snippet above should yield an output such as
Now that you have the external id, you will need to create an AWS IAM role. To create the role, navigate to your AWS IAM Console, select the Roles page from the left menu, select Create Role and follow the instruction for either Gretel Cloud or Gretel Hybrid below:
From the Role Creation dialog
Select AWS account as the Trusted entity type.
Select Another AWS account and enter Gretel's AWS account ID, 074762682575
.
Check Require external ID and enter the Gretel Project ID from the previous step as the External ID.
Select Next and add the appropriate IAM policies for the bucket.
The final trust policy on your IAM role should look similar to
For more information about delegating permissions to an AWS IAM user, please reference the following AWS documentation:
From the Role Creation dialog, select Custom trust policy as the Trusted entity type, and enter the following config:
Be sure to replace the following values:
<your-aws-account>
with your AWS Account ID.
<hybrid-deployment-name>
with the name of your Gretel Hybrid deployment. By default this is set to gretel-hybrid-env
. You can find this value by checking the deployment_name
variable from your Gretel Hybrid Terraform module.
<your gretel project id>
with your Gretel Project ID from the previous step.
Now that you have the role configured, you can create a Gretel connection using the role ARN from the previous step.
Using the role ARN from the previous steps, create a file on your local computer with the following contents
Then use the Gretel CLI to create the connection from the credentials file
Once you've created the connection, you may delete the local credentials file.
From the Gretel Console, navigate to the Create Connection dialog, select S3, select the Role ARN authentication method, and enter the role ARN created in the previous steps.
Type
s3_source
Connection
s3
The s3_source
action can be used to read an object from a S3 data source into Gretel Models.
Each time the source action is run from a workflow, the action will crawl new files that have landed in the bucket since the last crawl.
For details how the action more generally works, please see Reading Objects.
bucket
Bucket to crawl data from. Should only include the name, such as my-gretel-source-bucket
.
glob_filter
path
Prefix to crawl objects from. If no path
is provided, the root of the bucket is used.
recursive
Default false
. If set to true
the action will recursively crawl objects beginning from the configured path
.
dataset
The following permissions must be attached to the AWS connection in order to read objects from a s3 bucket
Type
s3_destination
Connection
s3
An S3 bucket can be configured as a destination for model outputs. This bucket can be the same bucket as the source, or a different bucket may be specified. If no destination is specified, generated data can be accessed from the model itself.
The s3_destination
action may be used to write gretel_model
outputs to S3 destination buckets.
For details how the action more generally works, please see Writing Objects.
bucket
The bucket to write objects back to. Please only include the name of the bucket, e.g. my-gretel-bucket
.
path
Defines the path prefix to write the object into.
filename
This is the name of the file to write data back to. This file name will be appended to the path
if one is configured.
input
None
The following permissions must be attached to the AWS connection in order to write objects to a destination bucket.
path
The path
property from the source configuration may be used in conjunction with the destination path
to move file locations while preserving file names.
For example, if a source bucket is configured with path=data/
and the destination bucket configured with path=processed-data/
, a source file data/records.csv
will get written to the destination as processed-data/records.csv
.
Create a synthetic copy of your Amazon S3 bucket. The following config will crawl a S3 bucket, train and run a synthetic model, then write the outputs of the model back to a destination S3 bucket while maintaining the same name and folder structure of the source bucket.
Connect to your MySQL databases.
Prerequisites to create a MySQL based workflow. You will need
A source MySQL connection.
(optional) A list of tables OR SQL queries.
(optional) A destination MySQL connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
A mysql
connection is created using the following parameters:
name
Display name of your choosing used to identify your connection within Gretel.
my-mysql-connection
username
Unique identifier associated with specific account authorized to access database.
john
password
Security credential to authenticate username.
...
host
Fully qualified domain name (FQDN) used to establish connection to database server.
myserver.example.com
port
Port number; If left empty, the default value - 3306
- will be used.
3306
database
Name of database to connect to.
mydatabase
(optional) params
Optional JDBC URL parameters that can be used for advanced configuration.
TrustServerCertificate=True&useSSL=false
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example MySQL connection:
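A sketch of such a file, using the parameters listed above; which fields live under config versus credentials is an assumption:

```yaml
# mysql_connection.yaml (illustrative)
type: mysql
name: my-mysql-connection
config:
  host: myserver.example.com
  port: 3306
  database: mydatabase
  username: john
credentials:
  password: "<database password>"
```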
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection
button.
Step 1, choose the Type for the Connection - MySQL.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
Type
mysql_source
Connection
mysql
The mysql_source
action reads data from your MySQL database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the mysql_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the mysql_destination
action.
The mysql_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
Example Source Action YAML
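A sketch of a full-extraction source action (the connection ID is a placeholder):

```yaml
actions:
  - name: mysql-read
    type: mysql_source
    connection: c_mysql_source    # placeholder connection ID
    config:
      sync:
        mode: full                # extract all records from every table in the database
```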
Selected Tables
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
Example Source Action YAML
SQL Query/Queries
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the mysql_source
action always provides a single output, dataset
.
dataset
The output of a mysql_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
Type
mysql_destination
Connection
mysql
The mysql_destination
action can be used to write gretel_tabular
action outputs to MySQL destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the mysql_destination
action always takes the same input, dataset
.
dataset
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
DML command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from MySQL, the DDL is extracted using a SHOW CREATE TABLE
statement. If the source table is from a non-MySQL source, the destination table schema is inferred based on the column types of the database.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing adhoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Create a synthetic version of your MySQL database.
The following config will extract the entire database, train and run a synthetic model, then write the outputs of the model back to a destination MySQL database while maintaining referential integrity.
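A sketch of such a workflow; the connection IDs, project ID, blueprint reference, and the exact keys used to wire the dataset into the destination are assumptions:

```yaml
name: synthesize-mysql-database
actions:
  - name: mysql-read
    type: mysql_source
    connection: c_mysql_source               # placeholder source connection ID
    config:
      sync:
        mode: full
  - name: synthesize
    type: gretel_tabular
    input: mysql-read
    config:
      project_id: proj_xxxxxxxx              # placeholder project ID
      train:
        dataset: "{outputs.mysql-read.dataset}"
        model_config:
          from: synthetics/tabular-actgan    # placeholder blueprint reference
      run: {}
  - name: mysql-write
    type: mysql_destination
    connection: c_mysql_destination          # placeholder destination connection ID
    input: synthesize
    config:
      sync:
        mode: replace                        # drop and recreate destination tables
      dataset: "{outputs.synthesize.dataset}"
```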
Create a synthetic version of selected tables from your MySQL database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination MySQL database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your MySQL database
The following config will execute a SQL query against your MySQL database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Connect to your Oracle database.
Prerequisites to create an Oracle Database based workflow. You will need
A source Oracle Database connection.
(optional) A list of tables OR SQL queries.
(optional) A destination Oracle Database connection OR object storage connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
An oracle
connection is created using the following parameters:
name
Display name of your choosing used to identify your connection within Gretel.
my-oracle-connection
username
Unique identifier associated with specific account authorized to access database. The connection will be to this user's schema.
john
password
Security credential to authenticate username.
...
host
Fully qualified domain name (FQDN) used to establish connection to database server.
myserver.example.com
port
Optional Port number; If left empty, the default value - 1521
- will be used.
1521
service_name
Name of database service to connect to.
my_service_name
(optional) instance_name
Optional Name of specific database instance for this connection.
instance_id
(optional) params
Optional JDBC URL parameters that can be used for advanced configuration.
key1=value1;key2=value2
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Oracle Database connection:
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection
button.
Step 1, choose the Type for the Connection - Oracle Database.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
In Oracle, the CREATE SCHEMA
command does not create a new, standalone schema. Instead, one creates a user. When the user is created, a schema is also automatically created for that user. When the user logs in, that schema is used by default for the session. In order to prevent name clashes or data accidents, we encourage you to create separate Oracle users for the Source and Destination connections.
The Oracle source action requires enough access to read from tables and access schema metadata. The following SQL script will create an Oracle user suitable for a Gretel Oracle source.
The following SQL script will create an Oracle user suitable for a Gretel Oracle destination. It will write to its own schema.
For more details please check your installation's version and see Oracle documents on CREATE USER.
Type
oracle_source
Connection
oracle
The oracle_source
action reads data from your Oracle database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the oracle_source
action is used to train models and generate data with the gretel_tabular
action, and can be written to an output database with the oracle_destination
action. Your generated data can also be written to object storage connections, for more information see Writing to Object Storage.
The oracle_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
Example Source Action YAML
Selected Tables
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
Example Source Action YAML
SQL Query/Queries
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
Example Source Action YAML
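As a rough, hypothetical sketch only: a SQL-query extraction could be configured along the lines of the YAML below. The exact action schema (how sync.mode and the queries list are nested, and how the connection is referenced) is an assumption based on the parameter descriptions above, and the query itself is a placeholder.

```yaml
# Hypothetical fragment of a workflow config (layout is assumed, not the official example).
actions:
  - name: oracle-read
    type: oracle_source
    connection: c_1                       # ID of your source Oracle connection
    config:
      sync:
        mode: full
      queries:
        - name: active_users              # becomes the name of the resulting table
          query: SELECT id, name, email FROM users WHERE active = 1
```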
Whether you are extracting an entire database, selected tables, or querying against a database, the oracle_source
action always provides a single output, dataset
.
dataset
The output of an oracle_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
Type
oracle_destination
Connection
oracle
The oracle_destination
action can be used to write gretel_tabular
action outputs to Oracle destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the oracle_destination
action always takes the same input, dataset
.
dataset
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from Oracle, the DDL is extracted using the GET_DDL
interface from the DBMS_METADATA
package. If the source table is from a non-Oracle source, the destination table schema is inferred based on the column types of the source schema (if present) or the data.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Example Destination Action YAML
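For orientation, a destination step might be shaped roughly like the sketch below. The field layout and the dataset reference syntax are assumptions based on the inputs described above, not a verbatim copy of the official example.

```yaml
# Hypothetical fragment of a workflow config (layout and reference syntax are assumed).
actions:
  - name: oracle-write
    type: oracle_destination
    connection: c_2                       # ID of your destination Oracle connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      sync:
        mode: replace
      dataset: '{outputs.synthesize.dataset}'
```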
You can also write your output dataset to an object storage connection like Amazon S3 or Google Cloud Storage. Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the {object_storage}_destination
action always takes the same inputs: filename, input, and path. Additionally, S3 and GCS take bucket, and Azure Blob takes container.
filename
This is the name(s) of the file(s) to write data back to. File name(s) will be appended to the path
if one is configured.
This is typically a reference to the output from the previous action, e.g. {outputs.<action-name>.dataset.files.filename}
input
Data to write to the file. This should be a reference to the output from the previous action, e.g. {outputs.<action-name>.dataset.files.data}
path
Defines the path prefix to write the object(s) into.
[S3 and GCS only] bucket
The bucket to write object(s) to. Please only include the name of the bucket, e.g. my-gretel-bucket
.
[Azure Blob only] container
The container to write object(s) to. Please only include the name of the container, e.g. my-gretel-container
.
Example Destination Action YAML
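Since the original example is not reproduced here, the following is a hypothetical sketch of an Amazon S3 destination. The filename and input references follow the syntax given above; the action name, connection reference, and overall layout are assumptions.

```yaml
# Hypothetical s3_destination action fragment (layout assumed; values are placeholders).
actions:
  - name: s3-write
    type: s3_destination
    connection: c_s3                      # ID of your Amazon S3 connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      bucket: my-gretel-bucket
      path: synthetic/
      filename: '{outputs.synthesize.dataset.files.filename}'
      input: '{outputs.synthesize.dataset.files.data}'
```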
Create a synthetic version of your Oracle database.
The following config will extract the entire Oracle database, train and run a synthetic model, then write the outputs of the model back to a destination Oracle database while maintaining referential integrity.
Create a synthetic version of selected tables from your Oracle database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination Oracle database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your Oracle database and write to S3
The following config will execute a SQL query against your Oracle database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table. Finally, the generated data will be written to an Amazon S3 bucket.
Create a synthetic version of your Oracle database and write the results to GCS.
The following config will extract the entire Oracle database, train and run a synthetic model, then write the output tables to an output Google Cloud Storage bucket while maintaining referential integrity.
Connect to your MS SQL Server databases.
Prerequisites to create an MS SQL-based workflow. You will need:
A source MS SQL connection.
(optional) A list of tables OR SQL queries.
(optional) A destination MS SQL connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
An mssql connection is created using the following parameters:
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example MS SQL connection:
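As above, the snippet below is a hypothetical sketch of the credentials file using the mssql parameters; all values are placeholders and the config/credentials split is an assumption.

```yaml
# Hypothetical MS SQL connection credentials file (all values are placeholders).
type: mssql
name: my-mssql-connection
config:
  username: john
  host: myserver.example.com
  database: mydatabase
  schema: dbo
credentials:
  password: "..."
```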
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - MS SQL Server.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The mssql_source
action reads data from your MS SQL database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the mssql_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the mssql_destination
action.
The mssql_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the mssql_source
action always provides a single output, dataset
.
The output of a mssql_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The mssql_destination
action can be used to write gretel_tabular
action outputs to MS SQL destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the mssql_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using a schema inferred from the input dataset.
When the schema is inferred from the input dataset, certain column types or constraints may not be maintained from the source table. If you want to maintain the same schema from your source database, please use sync mode truncate
.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Create a synthetic version of your MS SQL database.
The following config will extract the entire MS SQL database, train and run a synthetic model, then write the outputs of the model back to a destination MS SQL database while maintaining referential integrity.
Create a synthetic version of selected tables from your MS SQL database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination MS SQL database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your MS SQL database
The following config will execute a SQL query against your MS SQL database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Writing a well-formatted, clear prompt can get you a long way toward high quality tabular results, and often resolve errors you may be experiencing. Follow these guidelines to get the best from Navigator.
For generating tabular data, make sure your prompt is at least 25 characters.
Do not submit spam or prompts irrelevant to a tabular dataset (like "hello") to Navigator's tabular data format. If you're looking for question-and-answer style data, try submitting a prompt to our text chat interface (select "Natural Language" in the Playground)
The more detail you include about what the output should and shouldn't look like, the better your results will be. This includes:
List the columns you want the data to have.
Describe each column, including the format you want the data to follow (e.g. YYYY-MM-DD for dates), the range of values if applicable, and the context.
If there is a mismatch between the text prompt and sample data you provide, this can confuse the model and cause errors. For best results, always make sure your text prompt and sample data match.
This helps the model parse the example table. The text prompt can be as simple as an instruction to generate more data following the example data. Example:
Generate 30 rows of data exactly like the following table
This includes more information than SELECT. You can also combine them both, for example
If you want to generate a table with multiple columns, use a bulleted list and a short, clear description of the data you want in each column. Example:
Create a U.S. flight passenger dataset with the following columns:
- Traveler ID: a 6-character alphanumeric ID
- Departing city: a city in the U.S.
- Arrival city: a city in the U.S.
- Duration: duration of the flight, in minutes
- Number of seats: seats on the flight
Navigator can generate roughly 20-30 columns' worth of data, depending on the length of the column names. If you need more columns, consider generating in pieces using edit mode, then joining the pieces afterwards.
Resources for Gretel Navigator (now in GA!)
Navigator is Gretel's first AI system designed to generate, edit, and augment tabular data using natural language or code. It's a tool for creating and enhancing datasets in a more intuitive and interactive way.
We’re rapidly adding new features and improvements to Navigator, so we appreciate your patience and feedback.
If you’ve already tried Navigator and haven’t been able to get the results you expected (or even if you are), we’re here to help. Our primary goal is to better understand what you are trying to achieve, and we’d love to work with you to create high quality data for your use case.
Access artifacts in your project.
Gretel Workflows can read from and write to your Gretel Project. The actions below can be particularly useful alternatives if you have local data you want to run through a workflow, or don't have a destination to write output data to.
The read_project_artifact
action can be used to read in existing Gretel Project Artifacts as inputs to other actions.
The write_project_artifact
action can be used to write an action output to a Gretel Project.
None.
Train a Gretel Model from an existing project artifact and write the output to your project.
Read from and write to Databricks.
Prerequisites to create a Databricks-based workflow. You will need:
A source Databricks connection.
(optional) A list of tables OR SQL queries.
(optional) A destination Databricks connection.
Do not use your input Databricks connection as an output connector. This action can result in the unintended overwriting of existing data.
Before creating the Databricks connection on Gretel, please ensure that the compute cluster has been started (i.e. Spark Cluster or SQL Warehouse) to ensure that validation doesn't timeout.
A databricks
connection is created using the following parameters:
To generate a personal access token, you will first need to create a service principal and then generate a personal access token for that service principal.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Databricks connection:
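The snippet below is a hypothetical sketch built from the Databricks parameters described in this section; values are placeholders, and where each field lives under config versus credentials is an assumption.

```yaml
# Hypothetical Databricks connection credentials file (all values are placeholders).
type: databricks
name: my-databricks-connection
config:
  server_hostname: account_identifier.cloud.databricks.com
  http_path: /sql/1.0/warehouses/foo
  catalog: MY_CATALOG
  schema: MY_SCHEMA
credentials:
  personal_access_token: dapi...          # service principal token; assumed to live under credentials
```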
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - Databricks.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The Databricks source action requires enough access to read from tables and access schema metadata.
Add the following permissions to the Service Principal that was created above in order to be able to read data.
Ensure that the user/service principal is part of the ownership group for the destination catalog or schema.
The Databricks destination action requires enough permissions to write to the destination schema.
Add the following permissions to the Service Principal that was created above in order to be able to write data.
The databricks_source
action reads data from your Databricks database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the databricks_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the databricks_destination
action.
The databricks_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the databricks_source
action always provides a single output, dataset
.
The output of a databricks_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The databricks_destination
action can be used to write gretel_tabular
action outputs to Databricks destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the databricks_destination
action always takes the same input, dataset
.
Example Destination Action YAML
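A hypothetical sketch of a databricks_destination step is shown below; the layout and reference syntax are assumed. It includes the Unity Catalog staging volume input described with the destination parameters in this guide.

```yaml
# Hypothetical fragment of a workflow config (layout and reference syntax are assumed).
actions:
  - name: databricks-write
    type: databricks_destination
    connection: c_2                       # ID of your destination Databricks connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      sync:
        mode: replace
      volume: my_staging_volume           # Unity Catalog volume used to stage data before writing
      dataset: '{outputs.synthesize.dataset}'
```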
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from Databricks, the DDL is extracted using the GET_DDL
metadata function. If the source table is from a non-Databricks source, the destination table schema is inferred based on the column types of the database.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity.
Create a synthetic version of your Databricks database.
The following config will extract the entire Databricks database, train and run a synthetic model, then write the outputs of the model back to a destination Databricks database while maintaining referential integrity.
Create a synthetic version of selected tables from your Databricks database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination Databricks database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your Databricks database
The following config will execute a SQL query against your Databricks database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Gretel Navigator FAQ
What types of data can I work with using Navigator? Gretel Navigator is designed to support tabular data containing any combination of numeric, categorical, and text modalities. This flexibility allows you to work seamlessly across various types of datasets, catering to a broad range of data generation and augmentation tasks.
What can I do with Navigator? You can generate tabular data from natural language or SQL prompts, edit existing datasets, augment data, fill in missing values, experiment interactively in the console, and generate and edit data at scale using our batch API and SDK.
Why is my feedback important? Your feedback helps us prioritize our development roadmap. By sharing your experience and suggestions, you directly contribute to shaping the future features and improvements of Navigator.
What about larger datasets and advanced features? We're committed to rapidly increasing the scale of datasets that Navigator can handle and are continuously working on enhancing the AI's capabilities. Expect regular updates and improvements based on user feedback.
Can I use Navigator to work with my existing datasets? Absolutely! Navigator is designed to assist in editing and augmenting existing datasets. You can fill in missing values, make corrections, or extend your datasets using natural language prompts.
Is Navigator a model or an application? It's actually both. Navigator is a compound AI system that leverages multiple transformer-based models, including Gretel's own fine-tuned LLM.
How does Gretel Navigator overcome the limitations of traditional LLMs in data generation tasks? Traditional LLMs are limited by their context windows and struggle with tasks that exceed these limits or require precise mathematical operations. Gretel Navigator overcomes these by using an agent-based approach that plans tasks, delegates operations beyond the scope of LLMs, and ensures high-quality output without the complexities for the user.
Can I run Gretel Navigator in my own cloud or VPC? Currently, Navigator runs inside Gretel's managed cloud. We are working to make it available in any public cloud, including AWS, Azure, and Google Cloud, through a serverless offering. Contact us if you have questions or need any additional details.
What else is coming for Navigator? Data quality and diversity, as well as advanced agent capabilities and some LLM model updates, are still under development.
Are there safety checks for prompts submitted to Navigator?
What data sources is Navigator trained on? Navigator is trained on high-quality, structured and semi-structured tabular datasets with permissible licenses that have been curated and organized across over 20 industry verticals, including Healthcare, Biotechnology, Finance, Telecommunications, Government, Pharma, Retail, and others. Goals of model training include familiarizing the model with industry-specific dataset formats, teaching data correlations found in analytics and machine learning datasets, and improving task performance for filling in missing values, cleaning data, and generating data at scale for analytics and machine learning use cases.
What large language model (LLM) does Gretel Navigator use for generating tabular data? Gretel Navigator uses a mixture of expert models including foundation models and Gretel's fine-tuned model specialized in generating tabular data. Data generation requests may utilize a combination of models to compare and optimize performance.
Can you share the details of each LLM that Gretel Navigator uses? Certainly! There are currently five options available for customers:
An easy way to try out Navigator (for free!) is to start in the Gretel Console.
Try out an example prompt in the playground, or use your own. Then click "Generate".
Click the 3 dots to download your dataset, or click "Batch Data" to generate more than 100 records
You can use Navigator operationally through the Gretel SDK.
If you're feeling stuck, you may find the following use cases helpful to get started.
Let’s say we want to generate data that represents consumer packaged goods inventory.
We can workshop prompt ideas first using the "Natural language" option of Navigator. You can find this tab in the playground.
Try asking:
What are common headers of a consumer packaged goods (CPG) inventory dataset?
or
Help me write a prompt for a large language model (LLM) to create a dataset that represents consumer packaged goods (CPG) inventory
We can use the responses to start with a prompt and then narrow it down to be more specific for the data we're looking for.
This feature is available in the playground as well as the SDK. In playground, select the option to add columns to an existing dataset.
Upload your dataset (csv or jsonl format), then use the prompt template to describe the new columns you want to add.
To do this via the SDK, make sure to write a clear prompt describing the new column you want to add to your data.
Select "Add columns to existing datasets"
Upload your CSV or JSON(L) file into the box
Ensure the uploaded file looks correct in the output section
Edit the prompt as appropriate to add the columns you'd like, with detailed description of rules for generating each column as appropriate
Click generate
Connect to your PostgreSQL databases.
Prerequisites to create a PostgreSQL-based workflow. You will need:
A source PostgreSQL connection.
(optional) A list of tables OR SQL queries.
(optional) A destination PostgreSQL connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
A postgres
connection is created using the following parameters:
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example PostgreSQL connection:
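The snippet below is a hypothetical sketch using the postgres parameters described in this guide; values are placeholders, and the config/credentials split is an assumption.

```yaml
# Hypothetical PostgreSQL connection credentials file (all values are placeholders).
type: postgres
name: my-postgres-connection
config:
  username: john
  host: myserver.example.com
  port: 5432
  database: mydatabase
  schema: public
credentials:
  password: "..."
```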
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - PostgreSQL.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The postgres_source
action reads data from your PostgreSQL database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the postgres_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the postgres_destination
action.
The postgres_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the postgres_source
action always provides a single output, dataset
.
The output of a postgres_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The postgres_destination
action can be used to write gretel_tabular
action outputs to PostgreSQL destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the postgres_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using a schema inferred from the input dataset.
When the schema is inferred from the input dataset, certain column types or constraints may not be maintained from the source table. If you want to maintain the same schema from your source database, please use sync mode truncate
.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Create a synthetic version of your PostgreSQL database.
The following config will extract the entire database, train and run a synthetic model, then write the outputs of the model back to a destination PostgreSQL database while maintaining referential integrity.
Create a synthetic version of selected tables from your PostgreSQL database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination PostgreSQL database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your PostgreSQL database
The following config will execute a SQL query against your PostgreSQL database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Read from and write to BigQuery.
Prerequisites to create a BigQuery-based workflow. You will need:
A source BigQuery connection.
(optional) A list of tables OR SQL queries.
(optional) A destination BigQuery connection.
Do not use your input data warehouse connection as an output connector. This action can result in the unintended overwriting of existing data.
Google BigQuery related actions require creating a bigquery
connection. The connection must be configured with the correct permissions for each Gretel Workflow Action.
For specific permissions, please refer to the Minimum Permissions section under each corresponding action.
Gretel bigquery
connections require the following fields:
In order to generate a private key you will first need to create a service account, and then download the key for that service account.
After the service account has been created, you can attach dataset specific permissions to the service account.
Please see each action's Minimum Permissions section for a list of permissions to attach to the service account.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example BigQuery connection credential file:
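The snippet below is a hypothetical sketch built from the bigquery connection fields described in this section; all values are placeholders, and the placement of private_json_key under credentials is an assumption.

```yaml
# Hypothetical BigQuery connection credentials file (all values are placeholders).
type: bigquery
name: my-bigquery-connection
config:
  connection_target_type: source
  project_id: my-project-id
  service_account_email: service-account-name@my-project-id.iam.gserviceaccount.com
  dataset: my-dataset-name
credentials:
  private_json_key: >
    { "type": "service_account", "project_id": "my-project-id", ... }
```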
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - BigQuery.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The bigquery_source
action reads data from your BigQuery dataset. It can be used to extract:
the entire dataset, OR
selected tables from the dataset, OR
the results of SQL query/queries against the dataset.
Each time the workflow is run the source action will extract the most recent data from the source database.
The bigquery_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Dataset
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire dataset, selected tables, or querying against a dataset, the bigquery_source
action always provides a single output, dataset
.
The output of a bigquery_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a dataset.
The associated service account must have the following permissions for the configured dataset:
bigquery.datasets.get
The bigquery_destination
action can be used to write gretel_tabular
action outputs to BigQuery destination datasets.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the bigquery_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
. Each sync mode will configure a write and create disposition that determines how rows are inserted, and how destination tables are created.
When sync.mode
is configured with truncate
Records are written with WRITE_TRUNCATE
The destination table must already exist in the destination dataset.
When sync mode is configured with replace
Records are written with WRITE_TRUNCATE
The destination table is created if necessary with CREATE_IF_NEEDED
When sync.mode
is configured with append
Records are appended with WRITE_APPEND
The destination table is created if necessary with CREATE_IF_NEEDED
The associated service account must have the following permissions for the configured dataset:
bigquery.datasets.create
bigquery.datasets.delete
(supports replacing an existing file in the dataset)
Example Destination Action YAML
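For orientation only, a BigQuery destination step might look roughly like the sketch below; the layout and reference syntax are assumptions based on the inputs and sync modes described above.

```yaml
# Hypothetical fragment of a workflow config (layout and reference syntax are assumed).
actions:
  - name: bigquery-write
    type: bigquery_destination
    connection: c_2                       # ID of your destination BigQuery connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      sync:
        mode: append                      # rows appended (WRITE_APPEND); table created if needed
      dataset: '{outputs.synthesize.dataset}'
```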
Create a synthetic version of your BigQuery dataset.
The following config will extract the entire BigQuery dataset, train and run a synthetic model, then write the outputs of the model back to a destination BigQuery dataset while maintaining referential integrity.
Create a synthetic version of selected tables from your BigQuery dataset
The following config will extract two tables from your dataset, train and run a synthetic model, then write the outputs of the model back to a destination BigQuery dataset while maintaining any key relationships between the tables.
Create a synthetic version of table(s) formed by querying your BigQuery dataset and write to Google Cloud Storage
The following config will execute a SQL query against your BigQuery dataset to create a table containing data from across the dataset. Then, it will train and run a synthetic model to generate a synthetic table. Finally, the generated data will be written to a Google Cloud Storage bucket.
Connect to your Snowflake Data Warehouse.
Prerequisites to create a Snowflake-based workflow. You will need:
A source Snowflake connection.
(optional) A list of tables OR SQL queries.
(optional) A destination Snowflake connection.
Do not use your input data warehouse connection as an output connector. This action can result in the unintended overwriting of existing data.
There are two ways to authenticate a Gretel Snowflake connection; each method requires different fields when creating the connection:
A snowflake
connection authenticated via username/password is created using the following parameters:
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Snowflake connection:
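The snippet below is a hypothetical sketch of a username/password Snowflake connection file built from the parameters in this guide; values are placeholders, and the config/credentials split is an assumption.

```yaml
# Hypothetical Snowflake connection credentials file (all values are placeholders).
type: snowflake
name: my-snowflake-connection
config:
  host: account_identifier.snowflakecomputing.com
  username: john
  database: MY_DATABASE
  warehouse: MY_WAREHOUSE
  schema: MY_SCHEMA
  params: role=MY_ROLE
credentials:
  password: "..."
```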
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - Snowflake.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
External OAuth is currently only supported via CLI/SDK.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Snowflake External OAuth connection:
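The snippet below is a hypothetical sketch of an External OAuth connection file using the OAuth fields listed in this guide; every value, including the OAuth endpoint and scope, is a placeholder, and the field placement is an assumption.

```yaml
# Hypothetical Snowflake External OAuth connection file (all values are placeholders).
type: snowflake
name: my-snowflake-connection
config:
  host: account_identifier.snowflakecomputing.com
  username: john
  database: MY_DATABASE
  warehouse: MY_WAREHOUSE
  oauth_client_id: my-oauth-client-id
  oauth_grant_type: password
  oauth_scope: my-oauth-scope
  oauth_url: https://idp.example.com/oauth2/token
credentials:
  password: "..."
```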
Now that you've created the credentials file, use the CLI to create the connection
The Snowflake source action requires enough access to read from tables and access schema metadata. The following SQL script will create a Snowflake user suitable for a Gretel Snowflake source.
The Snowflake destination action requires enough permissions to write to the destination schema.
If your destination database and schema do not already exist, create those first.
Next configure a user for the Snowflake destination. This user must have OWNERSHIP
permissions in order to write data to the destination schema.
The following SQL script will create a Snowflake user suitable for a Gretel Snowflake destination.
The snowflake_source
action reads data from your Snowflake database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the snowflake_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the snowflake_destination
action.
The snowflake_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the snowflake_source
action always provides a single output, dataset
.
The output of a snowflake_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The snowflake_destination
action can be used to write gretel_tabular
action outputs to Snowflake destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the snowflake_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from Snowflake, the DDL is extracted using the GET_DDL
metadata function. If the source table is from a non-Snowflake source, the destination table schema is inferred based on the column types of the database.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity.
Create a synthetic version of your Snowflake database.
The following config will extract the entire Snowflake database, train and run a synthetic model, then write the outputs of the model back to a destination Snowflake database while maintaining referential integrity.
Create a synthetic version of selected tables from your Snowflake database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination Snowflake database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your Snowflake database
The following config will execute a SQL query against your Snowflake database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
A glob filter may be used to match file names matching a specific pattern. Please see the for more details.
A containing file and table representations of the found objects.
Data to write to the file. This should be a to the output from a previous action.
A glob filter may be used to match file names matching a specific pattern. Please see the for more details.
A containing file and table representations of the found objects.
Data to write to the file. This should be a to the output from a previous action.
A glob filter may be used to match file names matching a specific pattern. Please see the for more details.
A containing file and table representations of the found objects.
Data to write to the file. This should be a to the output from a previous action.
A to the data extracted from the database, including tables and relationships/schema.
A to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
A to the data extracted from the database, including tables and relationships/schema.
A to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
Navigate to the Connections page using the menu item in the left sidebar.
If you encounter issues towards the end of batch generation, consider generating in smaller batches or , which can give you finer control over error handling
If you're having issues with a relatively small prompt, reach out to
and select "Navigator".
More advanced features such as data editing and augmentation are available via the Gretel SDK. Get started with a .
Use and ping us on the on Discord!
If you encounter issues when using Navigator, reach out to us at
Navigate to the Connections page using the menu item in the left sidebar.
Check out this which demonstrates an end to end flow of running a Workflow using the Databricks Connector. Optionally, to run this notebook on Databricks, you can directly into Databricks by providing the URL to the notebook.
How can I get started? Log in or create a free Gretel account, and access Navigator here: or using the SDK through your .
How can I provide feedback or report bugs for Navigator? Your input is crucial. Please use , file requests or bugs through the console or or join the to share feedback and communicate directly with our team.
How much does Navigator cost? Navigator is billed by character input and output. 1 Gretel credit = 100,000 characters. Every user receives 15 free credits monthly, which is the equivalent of 1.5 million characters free! Learn more about character counting , and about credits and pricing .
How can I learn to use Navigator effectively? Start with the , , and and . You can also reach out to us if you have more questions.
At Gretel, we are committed to promoting fair and equitable use of our AI systems. We firmly stand against any hateful, discriminatory, or otherwise harmful content. All prompts submitted to Navigator undergo safety and alignment checks to ensure they adhere to our guidelines, utilizing the safety checks built into the LLMs. Content flagged as potentially harmful will be reviewed by our security team, and violators may have their access revoked. We take these measures seriously to maintain a safe and respectful environment for all users. For more information on what constitutes acceptable use, please visit our guidelines at .
and select "Navigator".
Pro tip: review the page to get the best results from your input
Create a if you don't already have one in order to get your API key.
Start from an or create your own
Pro tip: After submitting a prompt to Navigator, you can further refine the results by adding sample data from the output. Select "Add an example to improve result" and choose "Import current output". You can make edits to the output to match what you're looking for.
For best results, describe the data you're looking to create in a clear manner: like using bullet points and clear descriptions for each column. Review the doc for best practices.
You can see an example of adding columns in the .
Navigate to the Connections page using the menu item in the left sidebar.
Navigate to the Connections page using the menu item in the left sidebar.
When combined in a workflow, the data extracted from the bigquery_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the bigquery_destination
action. Your generated data can also be written to , for more information see .
The BigQuery destination action uses a to write records into destination tables.
For more information on how job dispositions behave, please reference writeDisposition
and createDisposition
from .
You can also write your output dataset to an object storage connection like . Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the {object_storage}_destination
action always takes the same inputs - filename
and input
, and path
. Additionally, S3 and GCS take bucket
while Azure Blob takes container
.
Navigate to the Connections page using the menu item in the left sidebar.
A snowflake connection authenticated via External OAuth is created using the following parameters:
name: Display name of your choosing used to identify your connection within Gretel. Example: my-mssql-connection
username: Unique identifier associated with the specific account authorized to access the database. Example: john
password: Security credential to authenticate the username. Example: ...
host: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: myserver.example.com
port: Port number; if left empty, the default value of 1433 will be used. Example: 1433
database: Name of the database to connect to. Example: mydatabase
schema: (optional) Name of the specific schema. Example: dbo
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: TrustServerCertificate=True
Type
mssql_source
Connection
mssql
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and relationships/schema.
Type
mssql_destination
Connection
mssql
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
project_id
The project id the artifact is located in.
artifact_id
The id of the artifact to read.
dataset
A dataset with exactly one item (the project artifact) represented as both a file and table.
project_id
The project to create the artifact in.
artifact_name
The name of the artifact.
data
Reference to a data handle.
name: Display name of your choosing used to identify your connection within Gretel. Example: my-databricks-connection
server_hostname: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: account_identifier.cloud.databricks.com
http_path: The HTTP path of the cluster. Example: /sql/1.0/warehouses/foo
personal_access_token: Security credential to authenticate the Databricks account (36 characters). Example: dapi....
catalog: Name of the catalog to connect to. Example: MY_CATALOG
schema: Name of the schema. Example: MY_SCHEMA
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: role=MY_ROLE
Type
databricks_source
Connection
databricks
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and relationships/schema.
Type
databricks_destination
Connection
databricks
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
volume
Unity Catalog volume where the destination data will be staged temporarily before writing to tables. A volume name must be specified in the destination action YAML in order for the write to succeed.
auto
Auto-selected model
This setting automatically selects the best model from the list below to generate high-quality data at scale. Note: please read each description carefully to understand specific constraints of each model and, if applicable, to make a different model selection when using Navigator to best suit your use case.
Gretel Custom Model (Industry fine-tuned)
gretelai/Mistral-7B-Instruct-v0.2/industry
Gretel's proprietary LLM
Gretel's proprietary model is based on Mistral-7b and fine-tuned by Gretel on curated and synthetic industry-specific datasets from 10+ verticals. Data generated from this LLM is owned by the user and can be used for any downstream task without licensing concerns.
Gretel Llama-3.1-8B-Instruct
gretelai/Llama-3.1-8B-Instruct
Gretel's LLM + Llama 3.1 model
Built with Llama 3.1. Gretel's LLM and Llama 3.1 are both used in this option, which offers high quality and data available for commercial use. For more information, please see the Llama 3.1 official license and policy on GitHub.
Gretel Azure GPT-3.5 Turbo
gretelai-azure/gpt-3.5-turbo
Gretel's LLM + Azure OpenAI models
Gretel's LLM along with Azure OpenAI models are both leveraged. This option offers excellent free text capabilities and speed, but data generated from this model may have certain restrictions. Please see Azure's documentation for possible restrictions.
Gretel Google Gemini Pro
gretelai-google/gemini-pro
Gretel's LLM + Google Gemini Pro models
Both Gretel's LLM along with Google Gemini Pro models are leveraged in this option. This option offers excellent free text capabilities and speed, but data generated from this model may have certain restrictions. Please read Google's documentation to understand possible restrictions.
name: Display name of your choosing used to identify your connection within Gretel. Example: my-postgres-connection
username: Unique identifier associated with the specific account authorized to access the database. Example: john
password: Security credential to authenticate the username. Example: ...
host: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: myserver.example.com
port: Port number; if left empty, the default value of 5432 will be used. Example: 5432
database: Name of the database to connect to. Example: mydatabase
schema: (optional) Name of the specific schema. Example: public
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: TrustServerCertificate=True&useSSL=false
Type
postgres_source
Connection
postgres
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and relationships/schema.
Type
postgres_destination
Connection
postgres
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
name: Display name of your choosing used to identify your connection within Gretel. Example: my-bigquery-connection
connection_target_type: source or destination, depending on whether you want to read data from or write data to the connection. Example: source
project_id: ID of the Google project containing your dataset. Example: my-project-id
service_account_email: The service account email associated with your private key. Example: service-account-name@my-project-id.iam.gserviceaccount.com
dataset: Name of the dataset to connect to. Example: my-dataset-name
private_json_key: Private key JSON blob used to authenticate Gretel. Example: { "type": "service_account", "project_id": "my-project-id", "private_key_id": "Oabc1def2345678g90123h456789012h34561718", "private_key": "-----BEGIN PRIVATE KEY-----/ ... }
Type
bigquery_source
Connection
bigquery
sync.mode
full
- extracts all records from tables in dataset
(coming soon) subset
- extract percentage of records from tables in dataset
sync.mode
full
- extracts all records from selected tables in dataset
(coming soon) subset
- extract percentage of records from selected tables in dataset
Sequence of mappings that lists the table(s) in the dataset to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected dataset
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and (if defined) relationships/schema.
Type
bigquery_destination
Connection
bigquery
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
name: Display name of your choosing used to identify your connection within Gretel. Example: my-snowflake-connection
host: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: account_identifier.snowflakecomputing.com
username: Unique identifier associated with the specific account authorized to access the database. Example: john
password: Security credential to authenticate the username. Example: ...
database: Name of the database to connect to. Example: MY_DATABASE
warehouse: Name of the warehouse. Example: MY_WAREHOUSE
schema: (optional) Name of the schema. Example: MY_SCHEMA
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: role=MY_ROLE
name: Display name of your choosing, used to identify your connection within Gretel. Example: my-snowflake-connection
host: Fully qualified domain name (FQDN) used to establish a connection to the database server. Example: account_identifier.snowflakecomputing.com
username: Unique identifier associated with the account authorized to access the database. Example: john
password: Security credential used to authenticate the username. Example: ...
database: Name of the database to connect to. Example: MY_DATABASE
warehouse: Name of the warehouse. Example: MY_WAREHOUSE
oauth_client_id: Unique identifier associated with the authentication application.
oauth_grant_type: Method through which the OAuth token will be acquired. Example: password
oauth_scope: Scope given to the requested token.
oauth_url: Endpoint from which the access token is fetched.
(optional) schema: Name of the schema. Example: MY_SCHEMA
(optional) params: JDBC URL parameters that can be used for advanced configuration. Example: role=MY_ROLE
Type: snowflake_source
Connection: snowflake
sync.mode: full - extracts all records from tables in the database; subset (coming soon) - extracts a percentage of records from tables in the database
sync.mode: full - extracts all records from selected tables in the database; subset (coming soon) - extracts a percentage of records from selected tables in the database
Table mappings: a sequence of mappings that lists the table(s) in the database to extract, where name is the table name.
Query mappings: name is the name of the query and is treated as the name of the resulting table; query is the SQL statement used to query the connected database. Additional name and query mappings can be provided to include multiple SQL queries.
dataset: a reference to the data extracted from the database, including tables and relationships/schema.
Type: snowflake_destination
Connection: snowflake
dataset: a reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode: replace - overwrites any existing data in the table(s) at the destination; append - adds generated data to existing table(s) (only supported for query-created tables without primary keys)
Real-time data generation with Gretel Navigator
The previous sections on the Gretel SDK were focused on running batch jobs, which are project-based and do not support real-time interaction. In this section, we will introduce the Navigator inference API, which makes it easy to generate high-quality synthetic tabular and text data – in real time – with just a few lines of code, powered by Gretel Navigator.
Navigator currently supports two data generation modes: tabular
and natural_language
. In both modes, you can choose the backend model that powers the generation, which we'll describe in more detail below.
The Gretel object has a factories
attribute that provides helper methods for creating new objects that interact with Gretel's non-project-based APIs. Let's use the factories
attribute to fetch the available backend models that power Navigator's tabular
data generation:
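As a minimal sketch (the helper name get_navigator_model_list follows the SDK reference; treat it as an assumption if your SDK version differs):

```python
from gretel_client import Gretel

gretel = Gretel(api_key="prompt")

# List the backend models available for Navigator's tabular mode
print(gretel.factories.get_navigator_model_list("tabular"))
```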
This will print the list of available models. The first is gretelai/auto, which automatically selects the current default model; the default will change over time as models continue to evolve.
To initialize the Navigator Tabular inference API, we use the initialize_navigator_api
method. Then, we can generate synthetic data in real time using its generate
method:
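For example, using the gretel object created above (the prompt text, column names, and record count are placeholders):

```python
# Initialize the Navigator Tabular inference API
tabular = gretel.factories.initialize_navigator_api("tabular", backend_model="gretelai/auto")

# Generate synthetic tabular data in real time from a natural-language prompt
df = tabular.generate(
    prompt="Generate customer support tickets with columns: ticket_id, topic, description, priority.",
    num_records=25,
)
print(df.head())
```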
You can augment an existing dataset using the edit
method:
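A sketch of editing an existing dataset, assuming df is a pandas DataFrame and that edit accepts a seed_data argument (an assumption based on the SDK reference):

```python
# Add a new column to an existing dataset using a natural-language instruction
df_edited = tabular.edit(
    prompt="Add a column named 'sentiment' with values positive, neutral, or negative.",
    seed_data=df,
)
```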
Finally, Navigator's tabular
mode supports streaming data generation. To enable streaming, simply set the stream
parameter to True
:
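A streaming sketch; here we assume that with stream=True the generate method yields records as they are produced:

```python
# Stream records as they are generated instead of waiting for the full table
for record in tabular.generate(
    prompt="Generate e-commerce product listings with name, category, and price.",
    num_records=100,
    stream=True,
):
    print(record)
```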
Navigator's natural_language
mode gives you access to state-of-the-art LLMs for generating text data. Let's fetch the available backend models that power Navigator's natural_language
data generation:
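Along the same lines as the tabular example above (again assuming the get_navigator_model_list helper):

```python
# List the backend models available for Navigator's natural_language mode
print(gretel.factories.get_navigator_model_list("natural_language"))
```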
Similar to the tabular
mode, this will print the list of available models, the first of which will be gretelai/gpt-auto
, which automatically selects the current default model.
To initialize the Navigator Natural Language inference API, we again use the initialize_navigator_api
method. Then, we can generate synthetic text data in real time using its generate
method:
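A sketch of real-time text generation; the temperature and max_tokens parameters are assumptions about the supported generation settings:

```python
# Initialize the Navigator Natural Language inference API
llm = gretel.factories.initialize_navigator_api("natural_language", backend_model="gretelai/gpt-auto")

# Generate synthetic text in real time
text = llm.generate(
    "Write a short product review for a pair of noise-cancelling headphones.",
    temperature=0.7,   # assumed optional sampling parameter
    max_tokens=150,    # assumed optional length limit
)
print(text)
```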
Documentation for the batch job SDK, for using Navigator at scale.
Initialize Navigator Batch with a model config:
Use these helper functions to incorporate the batch SDK into your own workflows:
Example:
A one-stop shop for interacting with Gretel’s APIs, models, and artifacts
The Gretel
object provides a streamlined interface for Gretel's SDK:
Your Gretel session is configured upon instantiation of a Gretel
object. To customize your session (e.g., with custom endpoints for a Hybrid deployment), pass any keyword argument of the configure_session function to the Gretel
initialization method:
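For example (the endpoint shown is the default Gretel Cloud endpoint and is included only to illustrate where a custom or Hybrid endpoint would go):

```python
from gretel_client import Gretel

gretel = Gretel(
    project_name="sdk-docs",              # optional: bind this instance to a project
    api_key="prompt",                      # prompt for (and cache) your API key
    endpoint="https://api.gretel.cloud",   # any configure_session keyword works here
    validate=True,                         # verify the credentials on instantiation
)
```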
Each Gretel
instance can be bound to a single project. This is relevant when you submit project-based jobs like training or generating synthetic data with a Gretel Model.
You have three options for setting the current project:
Use the project_name
keyword argument when you instantiate a Gretel
object, as we demonstrated above. If the project does not exist, a new one will be created.
Use the set_project
method. For example, gretel.set_project("sdk-docs")
. Again, if the project does not exist, a new one will be created.
Do not set the project. In this case, a random project will be created if/when you run a submit_*
method. This behavior is described in the Train and Generate Jobs section.
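As a quick sketch of the three options above (the project names are placeholders):

```python
# Option 1: bind a project at instantiation (created if it does not exist)
gretel = Gretel(project_name="sdk-docs", api_key="prompt")

# Option 2: set or switch the project later (also created if it does not exist)
gretel.set_project("sdk-docs")

# Option 3: set no project; a randomly named project is created when a submit_* method runs
gretel = Gretel(api_key="prompt")
```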
Project names must be unique, but display names do not need to be, so your project name will often differ from its display name. The display name is what is surfaced in the UI.
To look up your project name in the Console:
Navigate to the Projects tab
Select your project
Click on "Settings", down and to the right of the project name
The box labeled "Name" is your project name.
Data Designer is a general purpose system for building datasets to improve your AI models. Developers can describe the attributes of the dataset they want and iterate on the generated data through fast previews and detailed evaluations.
With Data Designer, you get:
Speed: Generate preview datasets in minutes, production datasets in hours
Quality: Built-in evaluation metrics ensure accuracy and relevance
Simplicity: Automated workflows replace complex manual processes
Scale: Move from proof-of-concept to production without rebuilding
Data-centric AI: Unlock true data experimentation with rapid iteration on use-case-specific data.
Learn how to use Data Designer by exploring the YAML and SDK configuration docs below.
If you're looking for hands-on examples, check out our Example Notebooks, where you'll find:
✅ Structured Outputs – Generate complex, nested synthetic data ✅ Evaluation Sets – Create high-quality AI evaluation datasets ✅ Multi-turn Chat – Build user-assistant dialogue datasets ✅ Text-to-SQL & Text-to-Code – Generate SQL & Python code datasets
💡 Check out the full list and interactive notebooks in our Example Notebooks section →
Start building with synthetic data in just 3 lines of code 🚀
The SDK's high-level interface makes interacting with Gretel's APIs simple and intuitive. Training state-of-the-art deep generative models from scratch only takes a few lines of code:
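For example (the data path is a placeholder for your own training file):

```python
from gretel_client import Gretel

gretel = Gretel(project_name="my-project", api_key="prompt")
trained = gretel.submit_train("tabular-actgan", data_source="path/to/your/training_data.csv")
```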
In this section, we will provide an overview of the key concepts you need to know to start building with the high-level interface:
The high-level SDK interface is built on top of the lower-level Gretel Python SDK. This means it is compatible with all existing code, and the lower-level SDK can always be used for features that are not yet covered by the high-level SDK.
Note that the high-level interface is in active development, and it currently only supports Synthetics. We have plans to add support for Transform and Workflows soon.
Quickly assess the quality of your synthetic data
When you train a model, Gretel automatically creates a Synthetic Data Quality Report to help you assess how well the synthetic data captures the statistical properties of the training data.
The report is stored as an attribute of the returned job-results object:
The report attribute is itself an object with useful methods and attributes:
You can download the synthetic data used in the report as follows:
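A sketch putting these pieces together; the report attribute and method names below (quality_scores, display_in_notebook, fetch_report_synthetic_data) follow the high-level SDK reference and should be treated as assumptions if your version differs:

```python
# Summary of the Synthetic Data Quality Report
print(trained.report)                      # high-level quality summary
print(trained.report.quality_scores)       # per-metric scores (assumed attribute name)
trained.report.display_in_notebook()       # render the full report in a notebook

# Download the synthetic data that was used to create the report
df_report_synth = trained.fetch_report_synthetic_data()
```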
Methods for submitting jobs to Gretel workers
With the Gretel
object instance ready to go, you can use its submit_*
methods to submit model training and data generation jobs. Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.
The submit_train
method submits a model training job based on the given model configuration. The data source for the training job is passed in using the data_source
argument and may be a file path or pandas DataFrame
:
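For example, passing a pandas DataFrame as the training data (the file path is a placeholder):

```python
import pandas as pd

df_train = pd.read_csv("path/to/training_data.csv")   # placeholder path

trained = gretel.submit_train(
    base_config="tabular-actgan",   # or a path to a custom config file
    data_source=df_train,           # file path or pandas DataFrame
)
```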
We trained an ACTGAN model by setting base_config="tabular-actgan"
. You can replace this base config with the path to a custom config file, or you can select any of the config names listed here (excluding the .yml
extension). The returned trained
object is a dataclass
that contains the training job results such as the Gretel model object, synthetic data quality report, training logs, and the final model configuration.
The base configuration can be modified using keyword arguments with the following rules:
Nested model settings can be passed as keyword arguments in the submit_train
method, where the keyword is the name of the config subsection and the value is a dictionary with the desired subsection's parameter settings. For example, this is how you update settings in ACTGAN's params
and privacy_filters
subsections, where epochs
, discriminator_dim
, similarity
, and outliers
are nested settings:
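For example (the specific parameter values here are illustrative, not recommendations):

```python
trained = gretel.submit_train(
    base_config="tabular-actgan",
    data_source=df_train,
    # nested settings are passed as dicts keyed by config subsection
    params={"epochs": 800, "discriminator_dim": [1024, 1024, 1024]},
    privacy_filters={"similarity": "high", "outliers": "medium"},
)
```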
Non-nested model settings can be passed directly as keyword arguments in the submit_train
method. For example, this is how you update Gretel GPT's pretrained_model
and column_name
, which are not nested within a subsection:
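For example (the base config name, model name, and column name are illustrative and should be checked against the current config templates):

```python
trained = gretel.submit_train(
    base_config="natural-language",       # Gretel GPT base config (assumed name)
    data_source=df_text,                  # DataFrame containing a "text" column (placeholder)
    # non-nested settings are passed directly as keyword arguments
    pretrained_model="gretelai/mpt-7b",   # illustrative pretrained model
    column_name="text",                   # column containing the training text
)
```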
Once you have models in your Gretel Project, you can use any of them to generate synthetic data using the submit_generate
method:
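For example, generating 1,000 records from the model trained above (the synthetic_data attribute name follows the SDK reference; treat it as an assumption if your version differs):

```python
generated = gretel.submit_generate(trained.model_id, num_records=1000)

# The returned dataclass contains the generated synthetic data
df_synth = generated.synthetic_data
print(df_synth.head())
```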
Above we use the model_id
attribute of a completed training job, but you are free to use the model_id
of any model within the current project. If the model has additional generate
settings (e.g., temperature
when generating text), you can pass them as keyword arguments to the submit_generate
method. The returned generated
object is a dataclass
that contains results from the generation job, including the generated synthetic data.
In the previous example, we unconditionally generated num_records
records. To conditionally generate synthetic data, use the seed_data
argument:
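A sketch of conditional generation; the column name below is a placeholder for whichever field you want to seed:

```python
import pandas as pd

# Seed 50 records where the seeded column's value is "seed"
seed = pd.DataFrame({"my_seed_column": ["seed"] * 50})   # placeholder column name

generated = gretel.submit_generate(trained.model_id, seed_data=seed)
```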
The code above will conditionally generate 50 records in which the seeded field's value is "seed".
If you do not want to wait for a job to complete, you can set wait=False
when calling submit_train
or submit_generate
. In this case, the method will return immediately after the job starts:
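For example:

```python
# Submit the training job and return immediately without waiting for completion
trained = gretel.submit_train("tabular-actgan", data_source=df_train, wait=False)
```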
Some things to know if you use this option:
You can still monitor the job progress in the Gretel Console.
You can check the job status using the job_status
attribute of the returned object: print(trained.job_status)
.
You can continue waiting for the job to complete by calling the wait_for_completion
method of the returned object: trained.wait_for_completion()
.
If you are not waiting when the job completes, you must call the refresh
method of the returned object to fetch the job results: trained.refresh()
.
Our Transform product allows you to remove PII from data, and you can submit these transform jobs from the high-level SDK. The default behavior is to use a model to classify the data and then replace the detected entities with fake values.
You can fetch results from previous training and generation jobs using the fetch_*_job_results
methods:
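For example (the method names follow the high-level SDK reference; the IDs are placeholders):

```python
# Fetch results from a previous training job
trained = gretel.fetch_train_job_results(model_id="MODEL_ID")

# Fetch results from a previous generation job
generated = gretel.fetch_generate_job_results(model_id="MODEL_ID", record_id="RECORD_ID")
```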
To fetch transform results, you can do the following; the transformed output can also be accessed as a DataFrame:
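As a sketch only: the transform-specific method and attribute names below are assumptions patterned after the other fetch_* methods, not confirmed API, so check the SDK reference for the exact names:

```python
# Hypothetical names for illustration; consult the SDK reference for the exact API
transform_result = gretel.fetch_transform_results(model_id="MODEL_ID")   # assumed method name
df_transformed = transform_result.transformed_df                          # assumed attribute name
```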
The Evaluate job analyzes the quality of synthetic data and generates the Data Quality Report.
The submit_evaluate method submits an Evaluate job based on the given evaluate model configuration. The data source for the job (typically your synthetic data) is passed in using the data_source argument, and the original (reference) data is passed with ref_data; both may be a file path or a pandas DataFrame:
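A sketch of a basic Evaluate job; the base config name is an assumption based on the published config templates:

```python
evaluated = gretel.submit_evaluate(
    base_config="evaluate/default",   # assumed evaluate config name
    data_source=df_synth,             # synthetic data to evaluate
    ref_data=df_train,                # original (reference) data
)
```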
The test (holdout) data source for the membership inference attack (MIA) metric is passed with the optional test_data argument; it may be a file path or a pandas DataFrame:
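Extending the previous sketch with a holdout set:

```python
evaluated = gretel.submit_evaluate(
    base_config="evaluate/default",
    data_source=df_synth,
    ref_data=df_train,
    test_data=df_holdout,   # optional holdout set used for the MIA-based privacy metrics
)
```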
All about inference pricing
Navigator (and all Gretel inference APIs) is billed by characters. Both input (prompt and input data) and output characters count toward usage.
Navigator pricing is as follows:
1 Gretel credit = 100,000 characters. This character count includes both input and output.
Every Gretel user receives 15 free credits each month, which amounts to 1.5 million free characters.
All inference is billed by character. This includes the playground (which you can find in the Gretel Console) as well as the inference SDK and the Navigator batch job SDK.
We log characters and amount billed for each inference call so that it's easy for you to track.
In the playground, go to the Logs section on the right-hand side.
To track usage and billing for batch jobs and inference SDK, visit the Usage page for your account. For more information on billing and usage, visit this page
Input and output characters both count toward what is billed, and we round to the nearest 10 characters when billing. Pricing follows the characters-per-credit rate shown above.
If you have any questions, you can reach out to us at support@gretel.ai or visit our pricing page
Video tutorials and walkthroughs for popular use cases with Gretel Navigator
Here are some use case videos to get started:
Use case-based notebooks for Gretel Navigator.
Follow these links for guides on:
The Data Designer configuration is the primary interface customers use to build their dataset and inject diversity into it.
Special System Instruction: Customers can use this to specify a prompt that provides guidance to the entire Navigator system when it generates data.
Categorical Seed Columns: Navigator Data Designer uses data seeds to inject diversity into the dataset. You can define seeds as key-value pairs so that the columns you want to generate can use these seeds as context to generate data related to specific concepts. Seed columns support subcategories, which allow you to specify topics related to a specific seed.
Generated Data Columns: These are the columns you want to generate from scratch in your dataset; for example, Text and Code are the two data columns you would generate in a Text-to-Code dataset. For each data column, you can provide a detailed generation prompt to guide how that column should be generated.
Post Processors: We offer two types of post-processing for the data you generate. Validation checks the correctness of the data generated in a specific column; in this beta we support Python and SQL validation to ensure that generated code is valid Python or SQL. Evaluation explains how readable, relevant, and diverse the generated data is, and is performed using LLMs on individual records as well as on the entire dataset.
Diversity in data is at the core of successfully generating a large-scale synthetic dataset. Data Designer introduces the concept of a "Data Seed", a key-value pair used to inject diversity into the dataset. Data Designer uses these seed values to guide the generation process and ensure maximal diversity in the dataset.
There are three ways to define your seeds (the third, generating seed values from sample records, is described later in this guide):
Specify them in your config: As shown above, you can provide the seed values you are interested in directly in your config or Python script.
Let Data Designer create seed values: Sometimes you may want Data Designer to generate the values for a specific seed. This is especially useful when you have "nested seeds". For example, in a Text-to-Python dataset you might use code complexity as the seed and want an LLM to generate a description for each complexity level.
Data Designer provides a high level YAML interface to declaratively define your dataset.
Once you define the configuration in YAML, you can use the Gretel SDK to load the configuration and then generate data.
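A sketch of loading a YAML configuration with the SDK; the import path and from_config signature reflect the preview-era Data Designer client and should be treated as assumptions:

```python
from gretel_client.navigator import DataDesigner   # assumed preview import path

# Load a Data Designer configuration defined in YAML
designer = DataDesigner.from_config("data_designer_config.yaml", api_key="prompt")
```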
Once you have defined a DataDesigner
object, you can generate your dataset.
You can generate a quick preview of your dataset, assess the data generated, and adjust your config if needed.
Display a record
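For example (method names such as generate_dataset_preview and display_sample_record reflect the preview SDK and are assumptions if your version differs):

```python
# Generate a fast preview of the dataset and inspect it
preview = designer.generate_dataset_preview()

# Display a single sample record from the preview
preview.display_sample_record()
```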
Once you are happy with your configuration, you can submit a batch job to generate as many records as you want!
Batch jobs may take a while to complete depending on how much data you create. Batch jobs create a Gretel Workflow that has an ID and you can use that ID to fetch your dataset.
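A sketch of submitting a batch job and fetching the resulting dataset (method names are assumptions based on the preview SDK):

```python
# Submit a batch job to generate the full dataset; this runs as a Gretel Workflow
batch_job = designer.submit_batch_workflow(num_records=1000)

# Fetch the dataset once the workflow completes
df = batch_job.fetch_dataset(wait_for_completion=True)
```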
If you prefer not to use YAML, you can use the Gretel SDK to define your Data Designer workflow; here is a simple example.
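A minimal sketch of defining the same kind of workflow directly in Python; the constructor arguments and add_* method names reflect the preview SDK and are assumptions:

```python
designer = DataDesigner(
    api_key="prompt",
    model_suite="apache-2.0",
    special_system_instructions="You are an expert Python programmer who writes clear, idiomatic code.",
)

# Seed columns inject diversity into the dataset
designer.add_categorical_seed_column(
    name="topic",
    values=["data structures", "web scraping", "APIs"],
)

# Generated data columns are created from prompts that can reference seed values
designer.add_generated_data_column(
    name="instruction",
    generation_prompt="Write a natural-language request for a Python program about {topic}.",
)
designer.add_generated_data_column(
    name="code",
    generation_prompt="Write Python code that fulfills this request: {instruction}",
)
```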
Once you have defined a DataDesigner
object, you can generate your dataset.
You can generate a quick preview of your dataset, assess the data generated, and adjust your config if needed.
Display a record
Once you are happy with your configuration, you can submit a batch job to generate as many records as you want!
Batch jobs may take a while to complete depending on how much data you create. Batch jobs create a Gretel Workflow that has an ID and you can use that ID to fetch your dataset.
Transform unstructured model responses into strictly typed, schema-validated data. Structured Outputs ensures every response matches your predefined schema specifications, making integrations reliable and predictable.
Schema Enforcement: Responses automatically conform to your JSON schema definitions.
Developer Experience: Instead of writing long prompts with strict guidelines on model outputs, use simple Pydantic objects to define your outputs.
When using the Gretel SDK, you can specify structured data outputs by using the data_config
parameter on the DataDesigner object. This parameter can take either a JSON schema or a Pydantic BaseModel
.
In the case of pydantic
types, you can also make use of Field
to define extra instruction information that will be passed along to the LLM behind the scenes. This can help you get optimal performance out of generations:
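For example, a plain Pydantic model whose Field descriptions act as per-field generation guidance (the schema itself is illustrative):

```python
from pydantic import BaseModel, Field

class SupportTicket(BaseModel):
    # Field descriptions are passed along to the LLM as extra guidance
    title: str = Field(description="Short, specific summary of the customer's issue")
    severity: int = Field(ge=1, le=5, description="Severity from 1 (low) to 5 (critical)")
    resolution: str = Field(description="One or two sentences describing how the issue was resolved")
```

A model like this can then be supplied to Data Designer through the data_config parameter described above.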
Code generation is handled in much the same way: specify the "code" type and provide the "syntax" for the desired language.
Here's a quick demo creating a fruit salad recipe!
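A sketch of what the fruit salad demo might look like; the Pydantic schema is illustrative, and the exact shape of the data_config argument is an assumption about the preview API:

```python
from typing import List
from pydantic import BaseModel, Field

class Fruit(BaseModel):
    name: str = Field(description="Name of the fruit")
    quantity: int = Field(description="How many pieces go into the salad")

class FruitSalad(BaseModel):
    title: str = Field(description="A catchy name for the fruit salad")
    fruits: List[Fruit] = Field(description="The fruits used in the salad")
    instructions: str = Field(description="Short preparation instructions")

# Assumed parameter shape for passing a structured output schema to Data Designer
designer.add_generated_data_column(
    name="fruit_salad",
    generation_prompt="Create a fun fruit salad recipe.",
    data_config={"type": "structured", "params": {"model": FruitSalad}},
)
```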
Start here to learn how to generate natural language as well as tabular data. Helpful as an introductory guide!
Intro
Introducing real-time inference API and high-level Python SDK support. Synthesize data in only 4 lines of code!
Intro
Learn how to use the Gretel Navigator SDK to create new datasets or edit and augment existing datasets from a natural language prompt.
Intro
Use Navigator to create and facilitate further research into safeguards for completion models.
Advanced
Use Navigator to simplify testing against a wide variety of queries that may be encountered in a production environment.
Advanced
Here is an example of a Data Designer configuration for building a Text-to-Python dataset:
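The sketch below illustrates the overall shape of such a configuration; the field names follow the components described next, but the values, seed categories, and post-processor settings are placeholders rather than the published blueprint:

```yaml
model_suite: apache-2.0

special_system_instructions: >-
  You are an expert Python developer who writes clear, idiomatic code.

categorical_seed_columns:
  - name: industry_sector
    values: [Healthcare, Finance, Technology]
    subcategories:
      - name: code_complexity
        values: [Beginner, Intermediate, Advanced]

generated_data_columns:
  - name: text
    generation_prompt: >-
      Write a natural-language request for a Python program related to
      {industry_sector}, suitable for a {code_complexity} developer.
  - name: code
    generation_prompt: >-
      Write Python code that fulfills this request: {text}

post_processors:
  - validator: code
    settings:
      code_lang: python
      code_columns: [code]
  - evaluator: text_to_python
    settings:
      text_column: text
      code_column: code
```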
Model Suite: Model Suites are curated collections of models designed to easily navigate the challenges of model selection, regulatory compliance, and legal rights over generated data. We support two model suites - apache-2.0
and llama-3.x.
For more on model suites, see the Model Suite documentation page.
We provide Blueprint configurations for common use cases, like Text-to-Code. You can view the available Blueprints in our documentation.
Generate seeds from sample records: Sometimes you may not know the best way to define seeds for your dataset, but you might have some examples of the data you want. You can provide Data Designer a few records of your data, and Data Designer will figure out the best seeds to use. This capability is a powerful way to quickly go from a few records to an entire dataset. Learn more about this in our "Sample-to-Dataset" blueprint.
Model Suites: To learn more about model suites, check out the Model Suite documentation page.