Notebooks for common Gretel use cases.
Follow along with these use cases to familiarize yourself with core Gretel features. These examples provide a starting point for common use cases which you can modify to suit your specific needs.
To help decide which approach may be best for you, you can use this flow chart.
Note: The Data Designer functionality demonstrated in this notebook is currently in Preview. To access these features, please join the waitlist.
Use the structured outputs feature to generate synthetic data with complex, nested data structures, with support for both Pydantic and JSON schema definitions.
Create multi-turn user-assistant dialogues tailored for fine-tuning language models.
Use Gretel's Navigator SDK to generate or edit tabular data from a user-provided prompt.
Generate synthetic daily oil price data using the DoppelGANger GAN for time-series data.
Generate secure, high-quality synthetic numeric, categorical, time-sequence, and text data.
Create synthetic data with Gretel, ensuring compliance, secure sharing, and actionable insights for AI and machine learning in healthcare.
Safely leverage sensitive or proprietary text data for downstream use cases.
Ensure data quality and privacy by applying flexible evaluations to real and synthetic datasets.
Create pipelines that connect to your data sources and automate synthetic data generation.
Synthetically generate a high-quality and diverse evaluation dataset for measuring the quality of your agent.
Create diverse, large-scale synthetic datasets tailored to your needs from nothing but a few samples.
Create a synthetic dataset of Python code examples.
Create a synthetic dataset of natural language prompts and SQL code examples.
Synthetically generate data from text and PDFs, and evaluate the quality and diversity of outputs.
Create safe, scalable synthetic data for training AI to understand and execute tool commands.
How to use Model-as-a-Service.
How to safely fine-tune LLMs on sensitive medical text for healthcare AI applications.
Enhance finance chatbots with privacy-first synthetic data to boost performance while ensuring compliance with privacy regulations.
Create an end-to-end RAG chatbot and synthetic evals with Gretel.
A practical guide to synthetic data generation with Gretel.
Take the first step on your journey with synthetic data.
Begin your journey with Gretel and start creating privacy guaranteed synthetic data today.
Start by following our Quickstart guide to install Gretel and train a basic model using the console or a notebook.
Follow along with Gretel Blueprints which cover some common foundational use cases.
Review our specific use case Use Case Examples which you can test out and modify for your own needs.
After following along with the recommended journey above you can dive into the Gretel Fundamentals section to understand the core Gretel concepts you'll be working with regularly.
Start generating synthetic data in minutes.
Create and share data with best-in-class accuracy and privacy guarantees with Gretel.
Sign up for a free account at https://console.gretel.ai.
Retrieve your API key.
For more detailed instructions, see Environment Setup.
Gretel's Console provides an easy way to create synthetic data from a prompt or your existing datasets without writing any code. Check out our Console setup guide to start using Gretel via our Console.
Follow along with Gretel Blueprints which cover some common foundational use cases.
Review our specific use case Use Case Examples which you can test out and modify for your own needs.
Dive into the Gretel Fundamentals section to understand the core Gretel concepts you'll be working with regularly.
Gretel's core concepts.
These fundamentals will cover the core functionality that you should understand when working with Gretel. Before going further, you should have followed our getting started guide and installed and configured the Gretel Client.
Here are the core fundamentals you will be familiar with after going through the next few sections:
Architecture. Review a summary of Gretel's core system components.
Deployment options. Gretel Cloud empowers you to train models and generate synthetic data without needing to manage complex operating systems or GPU configurations. Gretel Hybrid enables you to deploy the Gretel Data Plane into your own cloud tenant, providing all of Gretel's incredible features and benefits without the need for data to leave the boundaries of your own enterprise network.
Projects. Gretel Projects can be thought of as repositories that hold models. Projects are created by single users and can be shared with various permissions.
Inputs and Outputs. Gretel Models support a number of input and output data formats. For concepts related to input and output data sources like relational databases or object stores, see the Workflows and connectors section.
Creating models. Create models and train them against your source data sets.
Running models. Running models will let you generate unlimited amounts of synthetic data.
Model types. This overview page will give you a glimpse into the different possibilities when creating and training models with Gretel.
Workflows and connectors. Workflows and connectors provide an easy way to connect to sources and sinks for working with synthetic data generation at scale.
Where do Gretel Models run?
Gretel jobs run within the Gretel Data Plane. Gretel provides two deploy options for the Gretel Data Plane that you may utilize depending on your requirements.
Gretel Cloud is a comprehensive, fully managed service for synthetic data generation and it operates within Gretel's cloud compute infrastructure, allowing Gretel to handle all concerns related to compute, automation, and scalability. Gretel Cloud provides a seamless solution that simplifies the technical demands of setting up your own machine learning cloud infrastructure.
When you create your Gretel account you're given instant access to Gretel Cloud and Gretel Cloud Workers, so you can start your first model training job instantly.
Gretel Hybrid operates within your own cloud tenant and is deployed on Kubernetes. Gretel Hybrid is supported on GCP, Azure, and AWS through the use of the managed Kubernetes services offered by these cloud providers. Gretel Hybrid interfaces with the Gretel Control Plane API for job scheduling and job related metadata but customer owned data will never egress from your cloud environment. Gretel Hybrid is particularly well suited for handling sensitive or regulated data that cannot leave your cloud tenant's boundaries. Gretel Hybrid combines the benefits of using your infrastructure for training synthetic data models with Gretel’s advanced tools, offering a balance of control and convenience.
To learn more about Gretel Hybrid, check out the Gretel Hybrid section in our documentation.
The developer platform for synthetic data.
With Gretel, developers can get started in minutes with open source reference examples and simple APIs for generating unlimited amounts of synthetic data, labeling personally identifiable information, or anonymizing and removing biases from data. Gretel services are controlled by a simple web-based interface and run in Gretel’s managed cloud service or within your own private cloud environment.
After reviewing our Getting Started guide, check out the Gretel Fundamentals section to learn about the core concepts you'll encounter frequently when using Gretel.
New to the Gretel SDK? Start here!
A foundational series of notebooks for fine-tuning and generating synthetic data with the Gretel SDK.
Create a Gretel Account and generate an API Key to get started!
Sign up for Gretel using your work email or existing Google or GitHub accounts in the Gretel Console. All new accounts automatically get added to the free Developer plan. Learn more about our free and paid plans on the pricing page.
To use the Gretel CLI (Command Line Interface), you'll need your API Key. Get it by clicking the API Key menu in the sidebar, and then copy it to your clipboard. You'll also need this key for running notebooks that use the Gretel Cloud APIs. You can regenerate your key at any time using the secondary actions menu (the three dots).
You can interact with Gretel through the interface of your choice: the Gretel Console, CLI, or Python SDK. For more information on setting up each interface, check out the following pages:
Getting to know the Gretel Console
The Gretel Console provides a fast and easy way to generate synthetic data, classify and redact PII and use our AI models without having to download or install any tools. Sign up for a free Developer account and choose a use case in the dashboard to get started.
Use cases allow you to train and run any of our models in four steps. Just launch one of the cards and upload your training dataset, or use the sample dataset we provide. A configuration file is already selected for you so you don't have to tweak any parameters. Our auto-parameters and auto-privacy settings will tune the configuration to your training dataset, ensuring the highest chance of success.
While the model is running, you can track progress in the log window, train a new model, or try another use case.
When your model has completed training, you'll see your SQS (Synthetic Quality Score) and be able to download the full report along with your synthetic data from the Downloads page. We automatically generate some records for you as part of model training, and you can easily generate more using the Generate button in the Model Header.
Projects can be created, filtered and sorted from the Projects page. Select a project to create a new model in that project.
You can also manage your Account in the Console, view Documentation and Announcements, and create a support ticket if you need additional help.
The Members section inside each project allows you to quickly share that project with collaborators. The following access permissions are supported: Read-only, Read/Write, Administrator, Co-Owner.
Here's a quick walkthrough of creating synthetic data in the Gretel Console.
Once the model has been trained, you can use the SQS and Privacy Levels to determine whether the data meets your quality standards. If so, quickly generate more data whenever you want, or fine tune the configuration settings to improve your scores. See our tips on improving synthetic data quality.
Learn to use and manage projects that allow you to store and collaborate on data.
Gretel Projects can be thought of as repositories that hold models. Projects are created by single users and can be shared with various permissions:
Read: Users may access data artifacts (such as synthetic data and reports)
Write: Users may create and run models.
Administrator: Users may add other users to a project.
Owner: Full control.
The most important thing to note about Projects is that the name
attribute of a project is globally unique. If you are familiar with services like Simple Storage Service (S3), then Project naming will feel very similar since S3 bucket names are also globally unique within a specific service provider (such as AWS).
Projects have the following attributes you should be familiar with:
name
: A globally unique name for the project. When you create a project without specifying a specific name, Gretel will generate one for you. This will be a randomized name based on your username
and a unique hash slug. If you specify a name
that is already used, Project creation will fail.
display_name
: This can be any descriptive name for the Project that will control how the Project is listed and displayed in the Gretel Console. It is non-unique.
description
: This optional field can be used to provide a user-friendly description of the Project.
Next, let's look at creating and using Projects from the Gretel CLI.
At any point, you can get help on project management in the CLI by running:
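For example:

    gretel projects --help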
You can create a project with auto-naming by running:
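A sketch (confirm the subcommand with gretel projects --help):

    gretel projects create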
This will return a confirmation message containing the full Project object, including the generated project name.
Now, you may use this Project name as a reference in future operations.
You may also specify other Project attributes at creation time. For example, let's try selecting a unique project name and setting a display name for the console:
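A sketch; the flag names are assumptions you can confirm with gretel projects create --help:

    gretel projects create --name my-awesome-project --display-name "My Awesome Project"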
The CLI will return the newly created Project object. If you follow the Console link, you will see your new project listed by its display name.
If the Project name
you choose is not available, the CLI will return an error.
To delete a project, either the name or project-id is required.
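For example, to delete the project created above (the flag name is an assumption; check gretel projects delete --help):

    gretel projects delete --project my-awesome-project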
The Gretel Python SDK gives more flexibility and control around Project management. Within the SDK, the Projects module and class should be the primary orientation point for doing most of your work with Gretel.
The SDK differs in that when creating or accessing Projects, you will be given an instance of a Project class that you can interact with. Let's take a look.
Similar to our CLI interface, you can create a project with no input attributes:
Similarly, you can provide Project attributes to the create_project()
method:
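A sketch using the SDK's project helpers; the keyword arguments are assumed to mirror the Project attributes described above:

    from gretel_client.projects import create_project

    # Auto-named project
    project = create_project()

    # Project with explicit attributes
    project = create_project(
        name="my-awesome-project",
        display_name="My Awesome Project",
        desc="A project for experimenting with Gretel",
    )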
As mentioned earlier, Project names are globally unique. However, we have created a utility in the SDK that allows users to "share" identical project names such that any user could have their own version of a project called "test" or "foo".
This helper will either create a new project or fetch an existing one, giving you back a Project instance. Additionally, the display name of the project will automatically be set for you based on the name you provide. Let's take a look:
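For example:

    from gretel_client.projects import create_or_get_unique_project

    # Creates the project on first run, fetches the same project on later runs
    project = create_or_get_unique_project(name="my-new-awesome-project")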
In this mode, every user could use the exact my-new-awesome-project
string and a unique slug for that user will be appended to the Project name. This may be especially useful if you are re-running Notebooks or routines and do not want to use a combination of create_project()
and get_project()
to determine if a project already exists or not.
In certain occasions, you may want to create a Project only for the purposes of creating a model and extracting the specific outputs (Synthetic Data, Synthetic Quality Report, etc). Once you have extracted the data you need, you can delete the Project, which will then delete all of the models and artifacts related to those models.
For this use case, there is a temporary project context manager you can use. Once the context handler exits, the Project will be deleted:
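A sketch, assuming the tmp_project helper in gretel_client.projects:

    from gretel_client.projects import tmp_project

    with tmp_project() as project:
        # create models and download the artifacts you need here
        print(project.name)
    # the temporary project (and its models) is deleted on exit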
If you already have a Gretel Project, in order to run model operations, you will need to load an instance of the Project class in the SDK. We'll use our example Project from above: my-awesome-project
to show how to do this.
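For example:

    from gretel_client.projects import get_project

    project = get_project(name="my-awesome-project")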
To delete a project from the SDK, you utilize the delete()
method on a Project instance:
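For example:

    project.delete()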
Once you delete a project, the class instance is not usable anymore. If you try and do any meaningful operation with it, you'll receive a GretelProjectError
such as:
GretelProjectError: Cannot call method. The project has been marked for deletion.
Use our flagship synthetics model, Tabular Fine-Tuning, to generate synthetic tabular data (text, numeric, categorical, and time-sequence) with optional differential privacy guarantees.
Fine-tune LLMs to generate synthetic text.
Generate synthetic numeric and categorical data for high-dimensional datasets.
Data Designer: Define desired attributes, generate synthetic data, and refine through fast previews and detailed evaluations.
While use case flows make it easy to train new models, you can also create one from scratch. Start by creating a new project. Click the Projects button in the sidebar, or use the new project button in the top navigation bar.
Get up and running with Gretel's CLI and SDK.
The Gretel CLI and Python SDK are made available through both PyPi (most common) and GitHub.
We require using Python 3.9+ when using the CLI and SDK. You can download Python 3.9 (or newer) here and install manually, or you may wish to install Python 3.9+ from your terminal. If you are working with a new Python installation or environment you should also verify that pip is installed.
To get started, you will need to setup your environment and install the appropriate packages.
The most straightforward way to install the gretel-client
CLI and SDK is with pip:
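This installs (or upgrades) the gretel-client package from PyPI, assuming pip points at your Python 3.9+ environment:

    pip install -U gretel-client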
The -U
flag ensures the most recent version is installed. Occasionally we will ship a Release Candidate (RC) version of the package. These are generally safe to install; you can opt in to them by adding the --pre
flag.
If you wish to have the most recent development features, you may also choose to install directly from GitHub with the following command. This may be suggested from our Customer Success team if you are testing new features that have not been fully released yet.
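As a sketch, assuming the client is published from the gretelai/gretel-python-client repository (confirm the repository URL before running this):

    pip install -U git+https://github.com/gretelai/gretel-python-client.git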
If you are using Gretel Hybrid to run Gretel jobs on your own cloud infrastructure, the Gretel CLI and SDK will require your cloud provider's respective Python libraries. To install these dependencies run the relevant command below.
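As a sketch, the provider-specific dependencies can typically be installed as pip extras; the extra names below are assumptions, so confirm them against the gretel-client package documentation:

    # Install the extra that matches your cloud provider (extra names assumed)
    pip install -U "gretel-client[aws]"
    pip install -U "gretel-client[azure]"
    pip install -U "gretel-client[gcp]"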
After installing the package, you should configure authentication with Gretel Cloud. This will be required in order to create and utilize any models.
If you are installing Gretel on a system that you own or wholly control, we highly recommend configuring the CLI and SDK once with our configuration assistant. After doing this, you will be able to use the CLI and SDK without authenticating before each command.
To begin the CLI configuration process, use the command:
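This launches the interactive configuration assistant described below:

    gretel configure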
This will walk you through some prompts. You may press <ENTER>
to accept the default, which is shown in square brackets for each prompt. The individual prompts are described below.
Press <ENTER> to accept the default value for the Endpoint (https://api.gretel.cloud).
The Artifact Endpoint is only required for Gretel Hybrid users. If you are using Gretel Cloud, press <ENTER>
to accept the default value of cloud
. If you are a Gretel Hybrid user, the configured value should be the URI for the Sink Bucket created during the Gretel Hybrid deployment. This would be the resource identifier for an Amazon S3 Bucket, Azure Storage Container, or Google Cloud Storage Bucket.
Amazon S3 Example: s3://your-sink-bucket
Azure Storage Example: azure://your-sink-bucket
Google Cloud Storage Example: gcs://your-sink-bucket
The Default Runner is set to cloud
. Press <ENTER>
to accept the default value unless you are a Gretel Hybrid user or are running Gretel locally on your own machine(s). We recommend keeping cloud
as the default runner, which will utilize Gretel Cloud's auto-scaling GPU and CPU fleet to create and utilize models.
If you are a Gretel Hybrid user set this value to hybrid
to utilize hybrid runners.
If you need to run compute on your own machine(s) set this value to local
.
When prompted for your Gretel API Key, paste the key you created in the Gretel Console.
When prompted for your Default Project, you may optionally enter a Project Name or press <ENTER>
to accept the default.
Finally, you can test your configuration using the command:
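The command below prints the account associated with your API key; the subcommand name is an assumption, so check gretel --help if it differs:

    gretel whoami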
If the configuration is correct, the command will return your account details.
At this point, you are authenticated with Gretel, and can use the CLI without needing to re-authenticate. If you run into trouble, feel free to contact us for help!
There are a few different options to configure your Gretel Cloud connection through the SDK.
If you are using an ephemeral environment (such as Google Colab) and only wish to configure your connection for the duration of your Python session, you can configure it like this:
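A minimal sketch that reads the key from an environment variable (the variable name is just an example):

    import os

    from gretel_client import configure_session

    # Configure this Python session only; the key is read from the environment
    # rather than hard-coded into the notebook or script.
    configure_session(api_key=os.environ["GRETEL_API_KEY"], validate=True)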
Never commit code with your Gretel API key exposed! Generally you should load your Gretel API key in from some secure secrets manager or an environment variable.
See below for additional options, such as API key prompting, which is useful when writing notebooks that others will run.
Prompting
If you wish to maintain code that others may use, you can also use the following modification for configuring your session with Gretel Cloud. By using the prompt
value, you'll be prompted to enter your API key.
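A sketch of this prompting pattern:

    from gretel_client import configure_session

    # "prompt" asks for the API key interactively; cache="yes" stores it for
    # later sessions on the same machine.
    configure_session(api_key="prompt", cache="yes", validate=True)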
Hybrid Support
If you want to configure your session to run in Hybrid mode, run the following as part of configure_session
:
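A sketch; the parameter names are assumed to mirror the CLI configuration fields above, and the bucket URI is a placeholder:

    from gretel_client import configure_session

    configure_session(
        api_key="prompt",
        default_runner="hybrid",                    # run jobs in your own data plane
        artifact_endpoint="s3://your-sink-bucket",  # hybrid sink bucket (placeholder)
    )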
The hybrid environment configuration will apply to everything run with the Gretel client, including libraries like Gretel Trainer and Gretel Relational.
See additional storage setup instructions per cloud provider here.
Gretel Python Client docs can be found here.
The Gretel Client uses cloud provider specific libraries to interact with the underlying object storage via the smart_open
library. If you're a Gretel Hybrid user you may need to configure your environment with proper credentials for your specific cloud provider.
Gretel release notes are organized by release vehicle.
Platform Release Notes cover releases for the Gretel Data Plane and Control Plane (Gretel Cloud and Hybrid).
Python SDKs
The gretel-synthetics
SDK is a source-available Python package that allows permissive use of Gretel maintained generative models.
The gretel-client
SDK is a Python interface to Gretel APIs.
Console Release Notes summarizes releases for our Console web app, updated on a weekly basis.
Get familiar with Gretel's architectural components.
Gretel has three architectural components that you will want to be familiar with:
Gretel Control Plane: The control plane for scheduling work such as creating models and generating, classifying, or transforming data. This includes the Gretel REST API, Console and CLI tool. The REST API is hosted as a service and is used to manage accounts, projects, and metadata for projects, workflows, and models.
Regardless of where Gretel Workers run, they will connect to Gretel's REST API to communicate timing information, errors, and additional metadata. If you use workers in your own environment, no training data or sensitive information will be sent back to Gretel's API.
Gretel Data Plane: Containers that consume Gretel Configurations and handle requests to process records. When a worker consumes a Gretel Configuration, it creates a re-usable model. Additionally, workers can utilize existing models to generate, transform, and classify records. The data plane also includes several controller microservices that are responsible for detecting queued jobs and scheduling the required worker containers. Gretel Cloud's managed data plane will execute all of your workloads by default. Gretel Hybrid allows customers to deploy their own Gretel Data Plane into their preferred cloud environment which will enable customers to utilize all of Gretel's incredible features without the need for data to leave the boundaries of your cloud tenant. See Deployment Options for more details.
Gretel Configurations: Declarative objects that are used to create models. Gretel offers several configuration templates to help you get started with popular use cases such as creating synthetic datasets or anonymizing PII. These configurations are sent to the Gretel REST API to create models. These models can then be used to generate, transform, and classify data. Further information can be found in the Model Configurations page.
These components work together to enable developers to build robust and flexible privacy engineering systems.
The Gretel Control Plane is responsible for creating and managing projects, models, workflows, and job scheduling. The Control Plane is accessible via our REST API. We also consider other core Gretel components part of the control plane, such as the Gretel Console and Gretel CLI which are both responsible for interacting with the Control Plane API.
The primary object within Gretel that you will be working with is a Project. Projects are like repositories that contain models, workflows, and other associated data. You can invite other users to a project and control their permissions.
The following primitives exist within a Gretel Project:
Project Artifacts: These are datasets that can be uploaded and stored with your project. These artifacts are typically datasets that can be used to create models. Project artifacts can be uploaded by anyone with “write” access to a project. Additionally, project artifacts will be kept with the project until they are explicitly deleted. When using the Gretel Console or CLI you use Gretel Cloud Workers by default, and project artifacts will automatically be created for you from your training data. Project artifacts will have a specific structure. If your training data is called my-data.csv
then an example artifact key might be: gretel_89bdba626464477aaeeef96fc8b2b613_my-data.csv
. This key can be used as a data source for training or running models.
Models: Models are created on source datasets. You configure a model to be created using a Gretel Configuration which allows you to specify a source dataset, model type, and various parameters. You can train a model to generate synthetic data, transform records, or classify records. For each model that is created, the following artifacts are created:
A model archive, which can be referenced to generate, transform, and classify data at scale.
A model report. For synthetic models, this will be the Gretel Synthetic Report. For transforms, this will be a Gretel Transform Report.
Sample data. A small sample of synthesized or transformed data will be created as part of the model creation process.
Model Servers: After a model has been created, you may run that model as many times as you like to generate, transform, and classify new data. The result of the model server will be an output dataset that can be shared or used for your downstream use case.
Uploading project artifacts, model creation, and model server creation can only be done by Project members that have “write” access or higher.
Whether you are utilizing Gretel's managed data plane (Gretel Cloud) or deploying your own data plane (Gretel Hybrid), the Data Plane is responsible for running jobs created via the Gretel Control Plane. The Data Plane consists of two primary components: Gretel Workers that create and run models, and the controller microservices responsible for creating and scheduling Gretel Worker containers. Gretel Workers are containerized applications that are designed to communicate directly with Gretel Cloud. All communications will occur over HTTPS (Port 443) to api.gretel.cloud
. If you are running your own Gretel Data Plane (using Gretel Hybrid), your environment will need open outbound communication with the Control Plane API.
Workers are stateful and will transition through different statuses during their run time. Additionally, during their run time, the workers will periodically check in with Gretel Cloud to transmit usage information (for billing), status updates, generalized run logs, and error / troubleshooting diagnostic information.
A Gretel Worker can exist in one of the following states:
created
- A request for a worker has been made. This is the default state for a worker and will stay in this state until a worker is launched. By default, a user may have up to 10 created workers. This essentially serves as your “queue” for creating or running models.
pending
- This state indicates that the scheduling service has obtained the request and is provisioning a worker for your model or model server.
active
- A worker is creating a model, generating, or processing records. Once a worker is in this state it will begin periodically sending control plane and logging information back to the Gretel Control Plane.
completed
- A worker successfully completed its job. If it was a Gretel Cloud Worker, all model or server artifacts have been uploaded and stored in Gretel Cloud. If using a Gretel Hybrid worker, then all artifacts should have been written to the private location specified when starting the job.
error
- A worker encountered an error. Basic error and troubleshooting information should have been sent to the Gretel Control Plane.
cancelled
- A user has cancelled the worker. When a worker is cancelled, the worker will promptly shut down operation and cease all processing.
lost
- A worker will be marked as lost if the Gretel Control Plane has been unable to communicate with the worker after some period of time.
In the event of an error
, cancelled
, or lost
status, a worker cannot recover from this state. A new model or server will have to be created once the underlying issue is fixed.
To create a model, a Gretel worker is launched and will download a configuration from the Gretel Control Plane. Once the configuration is loaded, the worker will obtain the training data and begin creating a synthetic, transform, or classification model.
To run a model, a Gretel worker is launched which we consider a "model server". Depending on the model type, a model server can be used to generate, transform, or classify data.
Workers can be automatically launched for you in Gretel Cloud. This is the default mode when uploading a configuration from the Console or the CLI. In cloud mode, once a request for a model is received, Gretel will provision a worker for you and the model and associated artifacts (such as quality reports, sample data, etc) will also be stored in Gretel Cloud. You may download these artifacts at any time. With a model created and stored in Gretel Cloud, model servers can be created to utilize the model and generate, transform, or classify data.
Gretel configurations are declarative objects that specify how a model should be created. Configurations can be authored in YAML or JSON. To help you get started, we have several Configuration Templates. You may download and edit these templates as necessary or directly reference them when using the CLI (see our tutorials on using the templates directly for model creation). You can also edit configurations directly in the Gretel Console, using the Config Editor.
The configuration file is the primary way to specify how a model can be created. When a model is requested to be created, a copy of this configuration will be sent to the Gretel Control Plane. Regardless of where a Gretel Worker is run, this configuration will be stored in Gretel's Control Plane and associated with the model.
When a Gretel Worker is scheduled (in our cloud or your own environment), it will contact Gretel Cloud and download a copy of the configuration and then start the model creation process.
All Gretel models follow a similar configuration file format structure.
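As a sketch, a minimal synthetics configuration might look like the following; the exact sections and parameters vary by model type, so start from one of the published templates rather than writing a config from scratch:

    schema_version: "1.0"
    name: my-synthetics-config
    models:
      - synthetics:
          data_source: __tmp__
          params:
            epochs: 100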
To learn more about the configurations, please see the Model Configurations documentation.
Please see our pricing page for details on our various plans. You can get started completely free with 15 credits on our Developer Plan. The following limits apply:
Maximum Queued Jobs (10). This is the maximum number of jobs that can be in a created
state. If you are using Gretel Cloud workers, these jobs are automatically queued to start. While a worker is in this state, you may delete it or cancel it at any time. When this number is exceeded, API calls will return a 4xx
error when attempting to create new models or model servers.
Maximum Running Workers (4). This is the maximum number of jobs that can be in an active
state. When using Gretel Cloud workers, if this limit is exceeded, Gretel will wait for work to complete and then automatically start a new job from the queue of created
jobs. When running local workers, if the worker starts and the limit is exceeded, the job will be put into an error
state.
Maximum Worker Duration (1 hour). This is the maximum amount of time a worker can be in an active
state either creating or serving a model. If the job exceeds this limit, the job will be put into an error
state.
Once a Gretel Model is created, you may utilize that model to generate synthetic data as many times as needed. Because you may use a model to classify and transform data as well, we generically refer to the running of a model as a Record Handler.
Compared to model creation, running a model does not require a standalone Gretel Configuration. There are three input types that you should be aware of when it comes to running models:
A Model ID (or other reference to a Model, like a Model
instance in the SDK)
A number of parameters, which are essentially key-value pairs. These will vary depending on the specific type of model you are running.
Optionally, one or more input data files. Depending on the model, the input data may serve various purposes. One example of using an input data file with a record handler is providing a set of pre-conditioned inputs (smart seeds) for the model to use during generation. For example, a synthetics record handler accepts parameters such as:
num_records
: How many synthetic records to generate
max_invalid
: How many records can fail validation before the job stops
Models can be run using the gretel models run [OPTIONS] command. At any time, you can get help on these commands by running:
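This prints the available options for running models:

    gretel models run --help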
In order to run a model, you will need to know or access its Model ID. When passing model run parameters to the CLI, you should use the --param
option for each param such that it matches a --param KEY VALUE
pattern.
Given a previously created model, let's generate 100 additional records:
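A sketch using a placeholder Model ID; the num_records parameter matches the example above:

    gretel models run \
      --model-id your-model-id \
      --param num_records 100 \
      --output more-syn-data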
When this job completes, the artifacts will be downloaded to the more-syn-data
directory. For this particular job you should see logs.json.gz
which are the job logs and your new synthetic data in the data.gz
artifact.
If the model type supports conditioning (i.e. smart seeding), then you may provide this set of partial records or smart seeds using the --in-data
flag.
When providing a data source for running a model, the job will often use the number of records in the data set to determine how many synthetic records to create. In this case, parameters like num_records
will be ignored.
In order to run a model from the SDK, you will need a Model
instance. Once you have that instance, you can create and submit a record handler object in a very similar way to model creation. When submitting a record handler to Gretel Cloud, you may track the state of the job the same way as a model.
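A sketch, assuming model is a completed Model instance fetched as shown earlier:

    import pandas as pd
    from gretel_client.helpers import poll

    # Create and submit a record handler to generate 100 more records
    record_handler = model.create_record_handler_obj(params={"num_records": 100})
    record_handler.submit_cloud()
    poll(record_handler)

    # Read the generated synthetic data back into a DataFrame
    synthetic_df = pd.read_csv(
        record_handler.get_artifact_link("data"), compression="gzip"
    )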
Gretel's models can help you transform and synthesize your sensitive data to generate provably-private versions.
Gretel offers the following synthetics models:
Data types: Numeric, categorical, text, JSON, event-based
Differential privacy: Optional
Formerly known as: Navigator Fine Tuning
Data types: Text
Differential privacy: Optional
Formerly known as: Gretel GPT
Data types: Numeric, categorical
Differential privacy: NOT supported
Formerly known as: ACTGAN
Data types: Numeric, categorical
Differential privacy: Required; you cannot run without differential privacy
Scaling synthetic data generation.
Gretel Workflows provide an easy-to-use, config-driven API for automating and operationalizing Gretel. Using Connectors, you can connect Gretel Workflows to various data sources such as S3 or MySQL and schedule recurring jobs to make it easy to securely share data across your organization.
A Gretel Workflow is constructed of actions that connect to various services including object stores and databases. These actions are then composed to create a pipeline for processing data with Gretel. In a typical workflow:
A source action is configured to extract data from a source, such as S3 or MySQL.
The extracted source data is passed as inputs to Gretel Models. Using Workflows you can chain together different types of models based on specific use cases or privacy needs.
A destination action writes output data from the models to a sink.
A Workflow is typically created for a specific use case or data source and can be compared with a data pipeline or DAG.
A model in Gretel is an algorithm that can be used to generate, transform, or label data.
Powered by data, models can be thought of as the building blocks of machine learning. This page walks through the basics of initializing and training models for synthetic data, data transformations, and data classification.
When creating a model, Gretel Cloud performs the following steps:
Load the Gretel Configuration
Upload the training data to Gretel Cloud
Gretel Cloud provisions a worker and begins model training
When the job is completed, several Model Artifacts, including output data and reports, can then be downloaded client-side.
We'll show how to use both the CLI and SDK to create Gretel models in their own sections below.
Gretel Configurations generally start as declarative YAML files, which can then be provided to the SDK, CLI, or Gretel Console for starting a model creation job. Between the CLI and SDK, however, there are some differences (and similarities) on how you can define and provide a Gretel Configuration.
The CLI and SDK can work with Gretel Configurations that are YAML files. The CLI and SDK can access files on-disk or through remote URIs (HTTPS, S3, etc).
The SDK can also load Gretel Configurations as Python dictionaries as an alternative to YAML. This way, you may either load a configuration from disk or a template, and then manipulate it as necessary. Here's an example of this:
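The sketch below loads a template with the SDK's read_model_config helper; the dictionary key path shown is an assumption based on the default synthetics template layout:

    from gretel_client.projects.models import read_model_config

    # Load the "synthetics/default" template as a Python dictionary
    config = read_model_config("synthetics/default")

    # Tweak a parameter before creating the model (key path assumed)
    config["models"][0]["synthetics"]["params"]["epochs"] = 50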
Data sources may be either files on disk or files that can be accessed via a remote URI (such as HTTPS or S3). In both cases, you should provide a string value to the file on disk or the remote path.
The SDK will accept Pandas DataFrames as input data. When a DataFrame is provided, the SDK will temporarily write the DataFrame to disk and upload it to Gretel Cloud. When the operation is complete, the temporary file on disk will be deleted. When showing SDK usage below, we will use the DataFrame input data method.
For this example, we will download the sample data to disk so you may observe the full artifact creation process:
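A sketch using pandas with a placeholder URL (substitute the sample dataset of your choice):

    import pandas as pd

    # Placeholder URL - point this at the sample dataset you want to use
    SAMPLE_DATA_URL = "https://example.com/sample-data.csv"

    df = pd.read_csv(SAMPLE_DATA_URL)
    df.to_csv("sample-data.csv", index=False)  # saved locally for the CLI example below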
Regardless of the model type, creating a Gretel model through the CLI will be done through the gretel models create ...
command set.
At any time you can get the help menu by running:
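This prints the available options for model creation:

    gretel models create --help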
Given our data set, and a synthetics configuration shortcut (synthetics/default
) let's create a model:
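A sketch assuming the sample-data.csv file saved earlier and an existing project named my-awesome-project; confirm flag names with gretel models create --help:

    gretel models create \
      --config synthetics/default \
      --in-data sample-data.csv \
      --output my-synthetic-data \
      --project my-awesome-project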
By default, the CLI will attach to the job as it runs in Gretel Cloud and you will start to see verbose logging output as the job runs.
If you terminate this command, i.e. by sending a keyboard interrupt, this will cancel the job. If you wish to run the job in a "detached" mode, you may use the --wait
flag and give some low number of seconds to attach to the job such as --wait 5
. After 5 seconds the CLI will detach and the job will continue to run in Gretel Cloud.
Once the model is completed, the CLI will download the artifacts that were created as part of the model. You should be able to see these in the directory you specified in the --output
parameter, so in this example, artifacts should be saved to the my-synthetic-data
directory.
Additionally, the CLI will print the Model ID.
You will need this ID when re-using this model to generate synthetic data. Next, let's look at the downloaded artifacts.
data_preview.gz
contains the synthetic data that was created as part of the model creation process
report.html.gz
contains the Synthetic Quality Score report as a human readable HTML file
report_json.json.gz
contains the data from the SQS report but in a JSON consumable format
logs.json.gz
contains the model creation logs; these may be useful if you ever contact Gretel support
When the CLI stays attached to the Gretel Cloud job, artifacts will automatically be downloaded to the provided --output
directory. If you have disconnected the CLI from Gretel Cloud, for example using the --wait
option, then you may download the artifacts manually with the CLI at a later time.
Next, we'll walk through creating models with the SDK. While the SDK can utilize local files data sources and remote URI data sources, for this example, we will show how you can use a Pandas DataFrame as your data source.
Once we have our Project
instance, we will want to do a few things:
We use the Project
instance to create a Model
instance by using a specific create_model_obj()
factory method. This factory method takes both our Gretel Configuration and data source (a DataFrame) as params.
With the Model
instance created, we have to actually submit it to Gretel Cloud
Next we can poll
the Model
instance for completion
Finally we can download all of the Model Artifacts
Let's see it all in action...
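A sketch of the end-to-end flow, carrying over the DataFrame and project from the earlier examples:

    import pandas as pd
    from gretel_client.helpers import poll
    from gretel_client.projects import create_or_get_unique_project
    from gretel_client.projects.models import read_model_config

    df = pd.read_csv("sample-data.csv")
    project = create_or_get_unique_project(name="my-awesome-project")

    # 1. Build the Model instance from a configuration and a DataFrame data source
    model = project.create_model_obj(
        model_config=read_model_config("synthetics/default"),
        data_source=df,
    )

    # 2. Submit the job to Gretel Cloud
    model.submit_cloud()

    # 3. Poll until the job completes (logs are streamed as it runs)
    poll(model)

    # 4. Download the Model Artifacts (data preview, SQS report, logs)
    model.download_artifacts("my-synthetic-data")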
In the above example, our Model
instance was in memory the entire time. If you ever lose that instance or restart your Python interpreter, you can create and hydrate a new Model
instance right from your Project
instance:
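A sketch using a placeholder Model ID (the real ID is printed when the model is created):

    model = project.get_model(model_id="your-model-id")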
In the next section, we'll discuss how to utilize existing models to generate synthetic data.
In the example below, we will create a record handler for a Gretel LSTM (synthetics
) model that we previously created. The Gretel LSTM model utilizes two parameters: num_records and max_invalid, described above.
- Gretel’s flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.
- Gretel’s model for generating privacy-preserving synthetic text using your choice of top performing open-source models.
- Gretel’s model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.
- Gretel’s model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.
You can learn more about Gretel Synthetics models .
Gretel’s model combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Use this data classification to detect a variety of such as PII, in both structured and unstructured text.
We generally recommend combining Gretel Transform with Gretel Synthetics using to redact or replace sensitive data before training a synthetics model.
You can learn more about Gretel Transform .
For more information, please refer to the full .
Both the CLI and SDK can reference configurations through "template shortcuts." For various models and use cases, Gretel maintains a set of configuration templates. A template can be referenced by using a directory/filename
pattern (no file extension required). So the string synthetics/default
will automatically fetch and use the default synthetics configuration template.
The various supported data source formats are covered on the Inputs and Outputs page. This section will cover how these data sources can be provided to the CLI and SDK.
First, you'll need to create a Project
instance to work with. Creating a Project
instance is covered in the Projects section.
Supported input and output formats
Gretel Models support a number of input and output data formats which are outlined on this page. Gretel also provides a way for you to connect directly to your source and destination data sources using Gretel Connectors.
Gretel Models support input datasets in the following formats:
CSV (Comma Separated Values)
CSV data input is supported for Synthetics, Transform and Classify jobs.
The first row of the CSV file will be treated as column names, and these are required for processing.
JSON (JavaScript Object Notation)
The files may be formatted as a single JSON doc, or as JSONLines (where each line is a separate JSON doc).
Processing JSONL files is much more efficient for larger datasets; therefore, we recommend it over regular JSON.
The JSON documents may be flat or contain nested objects. While there is no limit to the number of levels in a nested document, the more complex the structure, the longer the data will take to process.
JSON datasets are currently supported for Classify and Transform only. Support for Synthetics is coming soon.
Parquet is also supported as an input format. The following compression algorithms for column data are supported: snappy, gzip, brotli, zstd.
Parquet datasets are currently supported for Classify and Transform only. Support for Synthetics is coming soon.
Uploading Parquet datasets as project artifacts is currently only supported in the Gretel CLI and SDK. The ability to upload these in the Gretel Console is coming soon.
Results are automatically output in the same format as the input dataset.
For JSON datasets in Classify, there will be an additional field for each detected entity: json_path
. This field contains the JSONPath location of that detected entity within the JSON document. See below for a sample classify result on a JSON dataset.
For Transform, the output will be written in the same format as the input; however, whitespace and the order of fields from the input will not be preserved.
In CSV files, field names correspond to the column name. JSON data doesn't have columns, but we still want to be able to reference fields for reporting purposes. Therefore, field names are created by referencing the dot-delimited path from the root of the document to each scalar value (the leaf). In the example below, the field that contains the value test@example.com
will be referenced as: user.emails.address
.
Note that in the example above, the array index is omitted. Thus the values inside the array will be aggregated together since typically all elements inside an array have the same schema. This method of field naming works well for JSON datasets that have a uniform schema across all the records. The naming convention could vary in the case of optional fields, etc.
For Classify, the result structure for Parquet datasets will be the same as that of JSON datasets. Since Parquet data can be nested in a similar way as JSON data, each detected entity will contain a json_path
field.
For Transform, the output will use the same schema and Parquet version as the input file.
Field names that appear in Classify and Transform reports when processing Parquet files correspond to column names in the Parquet schema. For columns that contain nested data, field names are constructed in the same way as for JSON data (see above).
If you would like us to import a different format, let us know.
Gretel's Platform is comprised of control plane and data plane components.
The Gretel Data Plane is responsible for processing user-provided prompts and/or datasets and generating synthetic data.
The Gretel Control Plane includes Gretel's APIs, job scheduling, and workflow management.
Gretel generally releases platform updates every Tuesday. We do sometimes release out-of-band to address critical bug fixes, security updates, or pre-releases for future features and capabilities.
Gretel follows a CalVer versioning schema. The schema is YYYY.MM.N:
YYYY: Calendar year.
MM: Month of year.
N: Monotonically increasing release number for the given month, so 2024.6.1
is the first release in June of 2024.
Gretel automatically upgrades Gretel Cloud to support enhancements and upgrades to the platform. All users get the same updates at the same time. Gretel uses CalVer internally to track changes, and release notes are organized by these CalVer numbers to more easily communicate the changes that are delivered.
Gretel Hybrid splits the control and data planes such that:
Gretel maintains and runs the control plane in Gretel Cloud. Gretel control plane updates are automatically shipped by Gretel for both Gretel Hybrid and Gretel Cloud.
The data plane is customer managed within customer cloud accounts. Depending on your Hybrid setup, you will need to update varying container images. More on this below.
The container images used on Gretel Hybrid can be split into three categories:
Management containers. Images are prefixed with gcc-
. There are three core management containers that run on the Hybrid cluster. These containers are responsible for managing model jobs and workflows.
Workflow container. This container image is named workflow
. These containers are used when running Gretel Workflows and handle things such as source and sink actions.
Model container. This container image is named model
. These are containers that run the actual Gretel models for generating synthetic data.
If your Hybrid deployment directly uses Gretel's container registry or a pull through cache the workflow
and model
container images are automatically updated and pulled for you upon release. These containers are spawned by the management containers during model jobs and workflow runs.
Gretel's container images have several shared internal libraries. We have consolidated the number of total images to make upgrades easier. We highly recommend upgrading all container images at the same time based on release version numbers. This mirrors how we update Gretel Cloud.
If you need to explicitly pull images by tag and cannot use the latest
tag, then you should use the appropriate CalVer version number for the image tag.
Feature: Enable sampling of Person
objects for seeding datasets in Data Designer, based on publicly available statistics datasets including the US Census.
Feature: Adds a Workflows Task to support splitting off a holdout set from a training Dataset.
Task: Update SQS from [0, 100] to [0, 10].
Fix: Corrects an issue with the CLI polling for a model run.
Task: Use Multi-Modal Report to evaluate the model by default. In order to use an older version of the report for the Evaluate model, please set task.type=sqs
.
Fix: Security fixes for our Java and Python images
Task: Removes the use of a local docker agent for running models
Task: The combined models image is always used now via an API call
Task: Go applications are now built using Go version 1.23.6
Fix: More informative error messages for asserting generation_prompt
template expectations when adding columns in DataDesigner
.
Task: Move the gretel agent python code out of the gretel-client
Feature: Allows disabling cleanup of artifacts in hybrid if explicitly "disabled"
Feature: Allows a gender to be specified for a transform persona
Fix: Fixes a bug that prevented workflow level evaluate running for gretel_model outputs.
Task: Switch our Azure Navigator-Tabular models from using gpt3.5 --> gpt-4o-mini
Task: Moves the validation logic to right before we create, so project_guids can still be used
Fix: Removed unnecessary prompt templates for DataDesigner
that led to inconsistent data quality.
Feature: Add support for doing PII Replay for specified columns. This can be used in conjunction with specified entity types or in place of them.
Task: Deprecates Classification and Regression MQS reports.
Fix: Ensure the bounds when sampling are taken into account for our dataset when doing an inference attack
Task: Add ability to independently toggle LLM-based and regexp-based NER.
Feature: CLI and SDK Enterprise Tenant Selection.
Task: Add direnv venv support.
Task: Allow specifying repository for hybrid supervisor image.
Fix: Add 10s timeout when validating connections/workflow actions.
Task: Default ner_optimized
to True
.
Fix: Fix client integration tests with missing custom deps.
Task: Update GenerateColumnFromTemplate
task.
Task: Push Qwen coder and instruct images.
Fix: Update test harness default case.
Fix: Fixes an issue with DataDesignerWorkflow.from_yaml
.
Fix: Don't validate workflow connections that reference hybrid connections.
Task: Update SDK's Project.search_model
and CLI's models search
to include more parameters.
Fix: Change blank AWS access keys/ secret keys to be an error state when creating a connection.
Task: Update default NER threshold to .7.
Fix: Categorize an error during classify to make it clearer what went wrong.
Fix: Give an earlier and clearer warning when using an invalid project name with the high level SDK.
Fix: Lower the packaging version to a version that works with ctgan.
Fix: Check deprecated access key parameter of creds when doing validation.
Fix: Update tmp_project to allow passing the hybrid cluster guid, consistent with create_project.
Fix: Add better error codes to some SQL exceptions.
Fix: Adds some additional headers that Azure serverless needs for talking to Navigator Tabular.
Feature: Improve edit mode stability in Navigator Tabular. Added ability to pass sample_data
in the generate()
method in Gretel SDK.
Task: Update getModels to include more query parameters.
Fix: Fix occasional crash in navft when training with columns of non-native python types like Datetime
, Timestamp
and Decimal
.
Feature: Support for Amazon Nova suite of models in Data Designer.
Task: Support addition of categorical seed columns for seed generation in DataDesignerFromSampleRecords
.
Fix: Issue where we sent a deprecated max_tokens
field to our TGI LLMs, causing us to ignore the field.
Feature: Improved data designer seed generation.
Task: Scheduled Workflows are only available as a paid feature.
Fix: Catch and add a more accurate error code to Workflow OOM errors.
Feature: Adds an Evaluate action to Gretel Workflows. This allows you to generate a single SQS report using inputs from multiple Workflow steps. For example, you could generate a report comparing raw training data from your S3 bucket against data that has been transformed and synthesized. You can also use the Holdout action to feed in an additional holdout set which is then used by our Privacy Metrics.
Feature: Adds an AWS Bedrock adapter for Navigator Tabular to the Gretel SDK.
Fix: Fixes JSON and JSONL support for files encoded with UTF-16/32.
Fix: Fixes automatic prompt naming for saved prompts.
Fix: Addresses some usability issues with the bedrock Navigator model
Task: Part of the gretel-client release.
Feature: Add the new Sample-to-Dataset tasks and workflow into the DataDesigner module.
Fix: Drop windows test support in gretel-client; file paths are brittle there.
Task: Update the pypi action for package releases.
Task: Remove some user details from the Projects API.
Fix: Increase privacy protection level/privacy configuration in report when using differential privacy with Navigator Fine Tuning.
Feature: Change how we interface with data designer to define evaluation tasks (yaml and sdk).
Fix: Fixes some minor logging in the trainer SDK
Feature: The AIDD interface now has a with_person_samplers
method for creating latent person samplers.
Fix: Handle NotFound
return code for /projects/:id
Fix: Parameters for sampling-based data sources are now autogenerated to the client.
Fix: Fixes default type of AIDD evaluation report attribute.
Feature: Implements the magic
interface for the new v2 data_designer
SDK. Try it out by calling data_designer.magic.add_column("my_column", "my_column_description")
.
Feature: AIDD add_column
now can take a concrete column type as input.
Feature: Columns (except for sampler and seed columns, which can't depend on other columns) are now represented with a DAG, ensuring that steps are run in the necessary order.
Feature: New ExpressionColumn
added, which provides a new implementation of expressions. Expressions are now provided as straight jinja2 templates.
Internal config updates.
Task: Removes deprecated /auth/email endpoint.
Fix: Block further local IP addresses for data sources
Fix: Fix Multi-Modal Report for Evaluate Models Rendering bugs
Fix: Fix an issue with building gretel-synthetics
Feature: The Gretel SDK now supports AWS Bedrock/Sagemaker and Azure Models-as-a-Service for Navigator Tabular. Users can bring their own client configurations and create a Navigator adapter. Once the adapter is created users can generate and edit tabular data.
Fix: Give clarity in tv2 hybrid when an LLM is not deployed
Feature: This PR adds a new, public facing workflow action. This action splits the source dataset into a main training set that continues to be used in the workflow and a holdout test set that is saved until the very end when we calculate Privacy Metrics.
Task: Remove unnecessary backports.cached_property
dependency.
Fix: Fix bug when iterating through workflow messages.
Fix: Provide transform report via the SDK
Task: Relational synthetics is being deprecated. A warning message has been added to inform users of this change in workflow task logs.
Feature: Adds Azure Fine Tuning support to the Gretel SDK. Synthetic data can be formatted into OpenAI fine-tuning and inference formats and end-to-end fine-tuning can be managed directly from the Gretel SDK.
Feature: Introduce auto for config parameters delta and max_sequences_per_example in Navigator Fine Tuning.
Feature: Add DP-FT capabilities to NavFT, mostly leveraging utilities that already existed for GPT-x.
Task: Add PII Replay to SQS Report for an Evaluate model
Feature: The release adds a new navigator module to gretel_client, which provides interfaces for Gretel's new Navigator Task Execution framework. The framework is in beta and will be available to select customers.
Feature: Attempt to find hybrid LLMs if none specified for TV2 classify
Fix: Support reading transform configs in high level sdk
Fix: Parse and validate JudgeWithLLM Task
Fix: Solves an issue with inference configs in production
Fix: Sets the proper domain name for serverless endpoints
Fix: Fixes privacy filtering for certain datasets.
Fix: Allow the client to optionally disable SSL verification for testing purposes.
Fix: Fixes updating connections when done in a hybrid context
Feature: Added a BigQuery integration module that provides Gretel <> BigFrames native support in the Gretel Python SDK
Feature: Add model license information to the /v1/inference/models endpoint (if available)
Fix: Update error handling for generation failures in the ACTGAN model.
Task: More specific email login rejection messages.
Fix: Updating error message for validation in Relational Workflows.
Fix: Fix an issue where the incorrect total is returned from the /v1/workflows/runs/tasks/search endpoint.
Fix: Make downloading a model from the HF hub more resilient, by increasing retries.
Fix: Ensure non-standard encoded characters can be extracted and loaded from Workflow database connectors.
Fix: Increased HTTP client timeout defaults for workflows to 30 seconds.
Fix: Fix race condition when performing model status updates.
Feature: Add additional error codes for workflows
Fix: Fix variable assignment error in LSTM model training when evaluation is skipped.
Feature: Support for JSON columns in MySQL and Postgres connectors has been removed
Feature: gretel_tabular workflow action no longer attempts JSON column normalization
Feature: gretel_tabular workflow action limits tables with JSON columns to NavFT and Transform (v1 and v2) models
Feature: Model training times in gretel_tabular workflow actions are now faster via reducing data preview size and deferring evaluation.
Fix: When using the gretel-inference-llm Helm chart, users can pass either apiKey or apiKeySecretRef in their values.yaml. Previously, when apiKey was provided, we attempted to create a k8s Secret but failed due to a YAML templating error; this is now fixed.
Fix: Fix bug in Navigator FT where generation would sometimes fail when group_training_examples_by is set.
Fix: Adds security context to initContainer used in the inference-llm chart
Task: Allow pulling the base image in the warm pool
Fix: Fix trust_remote_code for GPT-x
Task: Add datadog tracing http enabled
Task: Gate m1 features via configcat
Task: Add default llama suite config
Task: Update go-license logic to use pkg.go.dev
Task: Improve Jarvis API observability
Fix: Jarvis SQL templating issue
Task: Add httproutes for each LLM
Task: Add provenance and new GretelMetadata field, separate out types
Fix: Fix evaluation errors
Task: Add call_task method to Task interface
Fix: Use internal name for the gateway
Task: Add more restrictive limiter for get_model logs lambda
Fix: Fix an issue with improperly logging out a console session
Task: Change up query logic for record handlers to use one complete status call
Task: Remove notifications for github workflow runs
Task: Update transform V2 report style
Task: Add an optional configuration option passthroughImageFormat that allows for preserving the image name provided when calling image registries
Fix: Reintroduce blocking username changes
Fix: Fix an indexing bug for hybrid workflow image resolution
Fix: Increase the max number of tokens allowed for intent planning
Fix: Fix image name handling for the supervisor container by the gcc-controller
Fix: Fix the handling of group_training_examples_by in Navigator FT to work for multiple fields
Fix: Fix bug in GPT-X model loading
Feature: Enable fine-tuned GPT-x models to be run using vllm in generation, by use of the use_vllm generation parameter
Task: Remove dependency on registry authentication for gcc-controller
Feature: Add HTML report to Transform V2. In Hybrid mode, this is written to the output bucket along with the json report
Fix: Fix an issue where fake(seed=...) no longer worked for transform_v2 configs
Feature: Add date_time_shift, date_format, and date_time_format functions to transform_v2
Fix: Fix issue in gretel-hybrid's Azure Terraform module which did not respect the skip_kubernetes_resources flag. The Kubernetes namespace will no longer be managed by Terraform if the flag value evaluates to true
Feature: Update Navigator FT generation logging to include more detail on format errors in the invalid records
Fix: Fix occasional NavigatorFT crash due to mishandling of carriage returns resulting in possible malformed input file errors
Fix: Fix Privacy Metrics AIA graph height for a small number of columns in the SQS Report
Fix: Add error logging in NavFT for group_by/order_by
Feature: Add flag in gretel-data-plane Helm chart for conditionally disabling Argo Workflows controller resources. This allows for using an existing Argo Workflows controller deployment that has permissions to run Workflows. Default behavior keeps the deployment of the Argo Workflows controller resources
Task: Add the ability for navigator to suggest a prompt name
Fix: Increase log retry and add a sleep for retrieving workflow logs
Feature: Print the model ID when doing a trainer run to help with debugging
Feature: Set a few more prefilled endpoints that can be used for LLM templating
Fix: Fix an issue with Project invites not handling email case sensitivity. Project invites sent to e.g. guest@greteluser.com and Guest@greteluser.com should no longer create duplicate invites, and users receiving Project invites should no longer be missing invites.
Feature: Remove an unused CRD from our public chart
Note: With this release, we've switched to YYYY.MM.DD versioning
Fix: Allow classify within TV2 hybrid-only if deployed_llm_name is set
Feature: Add Privacy Metrics to Evaluate
Fix: Fix edit-in-place prompt and create mode for Navigator
Fix: Update the combined models image API response
Fix: Support auto param for NavFT num_input_records_to_sample to automatically choose a reasonable value for this training time param.
Fix: Update error messages for NavFT max token related errors.
Fix: Adds back Navigator validation for properly coercing non-str values into string values for the tabular data we return.
Feature: Add ner_optimize setting to Tv2 for configuring GPUs. If ner_optimize is set to true, a GPU will be configured; if false, a GPU won't be configured.
Fix: Bug where a timezone offset included in an input to Tv2 date_shift caused an error.
Feature: Workflow tasks that were active at the time of workflow run cancellation are now assigned cancelled status instead of errored status.
Feature: Allow parquet files to be uploaded for NavigatorFT jobs.
Fix: An issue that could lead to "Token count exceeds the limit" error in the Navigator batch jobs.
Add Gretel uploaded_data_source Action
More flexible data generation in the Navigator
Support nav-ft in gretel_tabular
Bugfix: Create project with adding _user_id for NotFoundException
Bugfix: Fix test for gretel-python-client
Bugfix: Fix increased Navigator FT runtime after privacy metrics release
Add Privacy Metrics to Report
Restrict project invites to external users based on the domain policy
Gretel's Console web application provides a flexible low-code interface for getting started with Gretel, and serves as the interface for managing your models, workflows, team, and billing.
Console is generally deployed daily, Monday - Thursday, certain holidays excepted.
Release Notes for Console are published every Monday for the previous week.
Bugfix: jobs controller issues related to models/models
Support Tv2 column classification in hybrid deployments
Bugfix: fix handling of none-like values in Navigator-Fine-Tuning
Bugfix: Improve handling of not-nullable zero-values (empty strings, 0-integers) in workflows
Bugfix: fix global locales in Tv2 configs
Bugfix: prevent recursion error in TabularDP
Add membership inference attack score to reports
Grant project access to domain owners for workflows and connections
Bugfix: Fix agent resolution of models image
Bugfix: Add transform (v1) and classify to new models image
Improved handling of column types in MySQL and BigQuery connectors
Set home to /run directory when running hybrid not as root
Update GPT-x DP fine tuning to use Poisson Sampler
Add text entity report to Tv2
Support resolving data from multiple sources in workflow actions
Support filtering by model_id and/or model_type on /v1/inference/models endpoint
Add quasi_identifier_count privacy_metrics to synthetic model configs
Add inference attack score to synthetics reports
Bugfix: Properly set default globals and classify configuration values in Tv2
Bugfix: Resolve permissions issues running hybrid GPT-x and Tv2 jobs as non-root
Bugfix: Validation of BigQuery connector with unspecified dataset
Bugfix: Properly recognize valid MSSQL identity column types
Bugfix: Fix model-model image resolution for tagged images
Improved searching of projects via new owned_by query parameter added to /projects endpoint
Improved error messaging related to JSONL errors in Navigator
Bugfix: Hybrid consolidated Model container did not have proper CUDA paths set, causing GPT-x jobs to fail
Improvements to Tv2 logging. When Tv2 is processing long NER text blobs, ensure that progress is being reported on regular intervals
Bugfix: Navigator inference requests would sometimes fail when using the Google Gemini Pro model
Text SQS update to the semantic similarity score. This update ensures that the score is penalized for an increasing number of synthetic records that are not semantically similar to any training records
Bugfix: Properly configure java.nio for Databricks connector
GPT-x now uses flash attention to speed up training and inference
Bugfix: Invalid model configs would sometimes result in 500 errors. 
Updates to NavFT to ensure rope_scaling_factor is consistent between model training and inference
Improvements to Navigator prompt templating
Increase NavFT rope_scaling_factor upper bound to 6 (from 2)
Hybrid deployments default to use new consolidated Model and Workflow docker images
Improvements to Workflow error logging
Bugfix: Workflows using Azure Blob Storage would sometimes commit 0 byte blocks causing write failures
Improvement: Updates to button styling throughout Console
Improvement: Consolidated styling of Lists
Fix: Prevent duplicate Transform_V2 run and ensure results are displayed.
Fix: No longer attempt to show ephemeral training inputs after training is complete.
Fix: Adjusted alignment of status chip in Workflow Runs list.
Fix: In editor, updated styling so non-clickable items aren't styled as a link.
Fix: Correct styling for Playground batch generate alert
Fix: Hide toast message for local file upload which says validation is not applicable
Feature: Added the ability to preview uploaded datasets within workflow builder.
No customer-facing updates in this release period
Improvement: Set a new default row count on Navigator, getting results to the user faster
Bugfix: Fix a double click on the Tabular/Natural toggle on Navigator causing the page to be in a broken state
Improvement: Add a button to upload one's own model config within workflow builder
Bugfix: Fix workflow builder submit validation not being able to revalidate, causing the user to not be able to resubmit
Feature: Persist main sidebar state across sessions and sync across app instances
Improvement: Add a "Choose a different file" button to Workflow Builder uploads
Bugfix: formatting of uploaded files in some cases when creating a workflow
Bugfix: Fix an issue with building the workflow properly when you select your model type before defining an input
Improvement: Moved the Save button to the bottom of the Creation Tiles in the Workflow Builder page
Feature: Whether the main sidebar navigation is collapsed or not will persist across sessions and sync across app instances
Improvement: Update the Gretel wordmark
Improvement: Remove deprecated route that was redirecting to the "From Scratch" blueprint flow
Fix: Address an issue with properly building a workflow after selecting a model type before defining an input
Fix: Improve file name detection when uploading files using the Workflow Builder
Improvement: The “Clear Prompt” button has been removed from the Navigator Playground prompt window to prevent unintentional clearing. A better alternative is coming soon.
Fix: Fixed a bug in workflow creation causing invalid configs for Hybrid projects
Fix: Updated the "Download" tooltip for Model Records, which incorrectly said all non-records were "data-previews".
Feature: Added a "License" button to Navigator which links to the legal license governing the current model
Improvement: Navigator Playground's saved prompts now have a default name, and the user is prevented from saving a prompt with an empty name.
Feature: New workflow creation experience that simplifies the process into a single page
Feature: Users can now save prompts they submit to the Navigator playground
Fix: Changes to the model config template didn't update the underlying config
Feat: Enable filtering by status from within the Model Project list view
No customer-facing updates in this release period
Feat: Updated Blueprint cards to use our new Categorical Label component instead of the Chip component for showing which cards are Notebook cards or Newly added cards.
Feat: Updated Project list to use our new Categorical Label component instead of the Chip component for showing whether projects are Cloud or Hybrid
Feat: Updated connection list item to use Categorical label instead of Chip for Source/Destination label
Fix: Download CSV button in Edit tabular dataset mode was not working; it works now!
Feature: Suggest users to invite to a project based on team membership
Fix: Improved state handling for the Navigator "Model" selector
Fix: Fixed design related issue where Error page was not using Gretel styling
Improvement: Change the way users' names are shown in Console to make it easier to identify users
Fix: Ensure users can continue through use cases when the default cloud output is selected
Feature: Add updated clear prompt button to Navigator
Fix: Improve error handling when user has an invite from a user that can't be found (e.g., the user was deleted)
Hybrid
Fix: Correctly set workflow output type to connection when in a hybrid project, fixes workflow creation issue
Fix: Allow hybrid-only users to create projects, if they don't have project creation restricted
Fix: Workflow builder yaml validation in Advanced tab now works for Hybrid Projects
Feature: Add Data privacy metrics for GPTx models
Feature: Add functionality for prefilling new saved prompt name with an AI generated suggestion.
Improvement: Connection creation UI updated for Hybrid projects
show New Connection button on project page
remove cloud project alert from connection creation wizard
Fix: Fixed issue where Error page was not using Gretel styling.
Fix: Allow hybrid only users to create projects
Fix: Ensure users can continue through use cases when the default cloud output is selected
Feature Launch: Released model improvements, an updated config template, and updated blueprint for Navigator Fine Tuning. It is now the default model recommended in the console as part of the General Availability launch.
Bugfix: project admins should not see the option to update other project admin permissions
Bugfix: invalid date in the org member table wasn't rendered properly
Remove Text/SQL toggle from playground
Bugfix: When using the Navigator FT model in workflows, we previously weren't setting the default num_records field (for gretel_model actions) or the default num_records_multiplier field (for gretel_tabular actions).
Update how Console decides whether to use gretel_model or gretel_tabular when helping the user build a Workflow Config via the Blueprints flow. This change fixes some potential for bugs in the final config, and better aligns with current backend capabilities.
Bugfix: Improve handling of models that could have data privacy metrics, but the options were disabled by the user.
Improve flexibility around what connection types are allowed to be used when creating a workflow. We previously constrained the allowed connection types (e.g., S3, Azure, BigQuery) when creating a workflow based on the type of model selected. This is no longer necessary, and so we've removed these constraints.
Bugfix: Don't attempt to render new data privacy metrics for models that don't support this metric.
Minor UX improvement for score displays in Models List.
Release notes for the Gretel Platform, June 2024
Add support for setting crawl limits when configuring Gretel Workflow object storage connectors. To set a limit, configure limit on your object storage source connector.
Improvements to Workflow config validation. Workflow action names are now validated to ensure uniqueness within a Workflow config.
Gretel BigQuery connections can now be created without specifying a dataset. You can instead configure the BigQuery dataset by passing bq_dataset when configuring a bigquery_source action.
Bugfix to database subsetting. When collecting batches of data, those batches previously needed to contain the same set of columns. This constraint would sometimes break subsetting if columns were sparsely populated.
Hybrid Model docker images have now been consolidated into a single Model image.
Hybrid Workflow docker images have now been consolidated into a single Workflow image.
Intermediate Workflow artifacts are now immediately cleaned up when a Workflow completes. When a Workflow is configured with a sink, any intermediate model artifacts produced by the Workflow are cleaned up and removed when the Workflow completes.
GPT-x, update config validation to limit epsilon
to be between 0.1 and 100.
GPT-x, ensure sampling probability is never larger than 1.0.
Bugfix: When writing objects to Azure Blob Storage, block sizes were written in chunks that were too small, leading to errors when writing larger objects. Objects are now written in larger 25 MB blocks.
Standardize Tv2 column properties. The column object can be used to access specific properties of a column that is being evaluated in Tv2. See the Tv2 reference for more details.
Update Tv2 to maintain referential integrity. By default, the gretel_tabular action when using Tv2 will ensure that PK/FK columns are not transformed. By setting run.encode_keys: true within the action, keys will be transformed to integers or UUIDs.
Bugfix in gretel_tabular where null foreign keys could be included when using subsetting.
Bugfix for Synthetic Quality Score for field correlation stability when missing values are in the data.
Bugfix for enforcing Teams runtime limits (max objects crawled, max bytes processed) on Workflows. These limits were previously being loaded from specific users; this is now fixed so the limits are loaded by Team if the user is a member of one.
Check out the blog for even more details!
This model is available via the models-navigator_ft container for Hybrid customers.
Improve error messages within Gretel Navigator
Added new partial_mask() filter to Tv2.
Update model names within Gretel Navigator
Bug fix for Gretel Navigator edit mode when adding numerical columns.
For GPT-x, the delta hyperparameter will only be automatically updated if dp: true. Previously it was updated regardless of whether DP was enabled, which was unnecessary.
Improvements to the SQS Text Statistical Score for measuring quality of synthetic natural language data.
Improved prompt validation for Gretel Navigator
When using Tv2 with gretel_tabular, columns will no longer be re-ordered to match their original order. Attempting to do so caused issues when Tv2 configs add or remove columns.
Tv2 NER will utilize GPUs when available.
Databricks destination connector optimizations.
Better handling for foreign key columns with null values in gretel_tabular.
Gretel configurations are declarative objects that specify how a model should be created. Configurations can be authored in YAML or JSON.
All Gretel models follow the same high-level configuration file format structure. All configurations include schema_version and name keys, as well as a models array that is keyed by a [model_id]. Within the [model_id] object, all model configurations have a data_source key.
[model_id] is replaced with the type of model you wish to train (e.g. navigator_ft, gpt_x, actgan, tabular_dp, or transform_v2).
The mapping between Gretel models and configuration model_id values is:
Tabular Fine-Tuning: navigator_ft
Text Fine-Tuning: gpt_x
Tabular GAN: actgan
Tabular DP: tabular_dp
Transform: transform_v2
data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
Supported storage formats include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.
Note: Some models have specific data source format requirements
data_source: __tmp__ can be used when the source file is specified elsewhere using:
the --in_data parameter via the CLI,
the corresponding parameter via the SDK, or
the dataset button via the Console.
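To make the shared structure concrete, here is a minimal illustrative sketch of a config file; the name, comments, and chosen model type are placeholders, and each model adds its own keys beneath [model_id]:

```yaml
# Illustrative sketch only -- real configs add model-specific parameters.
schema_version: "1.0"
name: my-model-config            # placeholder name
models:
  - navigator_ft:                # replace with gpt_x, actgan, tabular_dp, or transform_v2
      data_source: __tmp__       # or a path/URL to a CSV, JSON, or JSONL file
      # model-specific keys (e.g. params) go here
```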
Each Gretel model has different additional keys within the model_id object and unique configuration parameters specific to that model. For details on the configuration parameters for each model, see the specific model page:
LLM-based AI system supporting multi-modal data.
Gretel Tabular Fine-Tuning (navigator_ft) is an AI system combining a Large-Language Model pre-trained specifically on tabular datasets with learned schema based rules. It can train on datasets of various sizes (we recommend 10,000 or more records) and generate synthetic datasets with unlimited records.
navigator_ft excels at matching the correlations (both within a single record and across multiple records) and distributions in its training data across multiple tabular modalities, such as numeric, categorical, free text, JSON, and time series values.
navigator_ft is particularly useful when:
Your dataset contains both numerical / categorical data AND free text data
You want to reduce the chance of replaying values from the original dataset, particularly rare values
Your dataset is event-driven, oriented around some column that groups rows into closely related events in a sequence
The config below shows all the available training and generation parameters for Tabular Fine-Tuning. Leaving all parameters unspecified (we will use defaults) is a good starting point for training on datasets with independent records, while the group_training_examples_by parameter is required to capture correlations across records within a group. The order_training_examples_by parameter is strongly recommended if records within a group follow a logical order, as is the case for time series or sequential events.
data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV, JSONL, or Parquet format.
group_training_examples_by (str or list of str, optional) - Column(s) to group training examples by. This is useful when you want the model to learn inter-record correlations for a given grouping of records.
order_training_examples_by (str, optional) - Column to order training examples by. This is useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide group_training_examples_by.
params - Parameters that control the model training process:
num_input_records_to_sample (int or auto, required, defaults to auto) - This parameter is a proxy for training time. It sets the number of records from the input dataset that the model will see during training. It can be smaller (we downsample), larger (we resample), or the same size as your input dataset. Setting this to the same size as your input dataset is effectively equivalent to training for a single epoch. A starting value to experiment with is 25,000. If set to auto, we will automatically choose an appropriate value.
batch_size (int, required, defaults to 1) - The batch size per device for training. Recommended to increase this when differential privacy is enabled. However, if the value is too high, you could get an out-of-memory error. A good size to start with is 8.
gradient_accumulation_steps (int, required, defaults to 8) - Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory.
learning_rate (float, required, defaults to 0.0005) - The initial learning rate for the AdamW optimizer.
warmup_ratio (float, required, defaults to 0.05) - Ratio of total training steps used for a linear warmup from 0 to the learning rate.
weight_decay (float, required, defaults to 0.01) - The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer.
lora_alpha_over_r (float, required, defaults to 1.0) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2.
lora_r (int, required, defaults to 32) - The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters.
lora_target_modules (list of str, required, defaults to ["q_proj", "k_proj", "v_proj", "o_proj"]) - The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
rope_scaling_factor (int, required, defaults to 1) - Scale the base LLM's context length by this factor using RoPE scaling to handle datasets with more columns, or datasets containing groups with more than a few records. If you hit the error for maximum tokens, you can try increasing the rope_scaling_factor. Maximum is 6, and you may first want to try increasing to 2.
max_sequences_per_example (int, optional, defaults to auto) - This controls how examples are assembled for training and is automatically set to a suitable value with auto (default).
use_structured_generation (bool, optional, default false) - With DP, the model might have trouble learning the tabular format, so structured generation helps produce more valid records.
privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.
dp (bool, optional, default false) - Flag to turn on differentially private fine-tuning when a data source is provided.
epsilon (float, optional, default 8) - Privacy loss parameter for differential privacy. Lower values indicate higher privacy.
per_sample_max_grad_norm (float, optional, default 0.1) - Clipping norm for gradients per sample to ensure privacy. For each data sample, the gradient norm (magnitude of the gradient vector) is calculated. If it exceeds per_sample_max_grad_norm, it is scaled down to this threshold. This ensures that no single sample's gradient contributes more than a set maximum amount to the overall update.
generate - Parameters that control model inference:
num_records (int, required, defaults to 5000) - Number of records to generate. If you want to generate more than 50,000 records, we recommend breaking the generation job into smaller batches, which you can run in parallel.
temperature (float, required, defaults to 0.75) - The value used to control the randomness of the generated data. Higher values make the data more random.
repetition_penalty (float, required, defaults to 1.2) - The value used to control the likelihood of the model repeating the same token.
top_p (float, required, defaults to 1.0) - The cumulative probability cutoff for sampling tokens.
stop_params (optional) - Optional mechanism to stop generation if too many invalid records are being created. This helps guard against extremely long generation jobs that likely do not have the potential to generate high-quality data. To turn this parameter on, you must set two parameters:
invalid_record_fraction (float, required) - The fraction of invalid records generated by the model that will stop generation after the patience limit is reached.
patience (int, required) - Number of consecutive generations where the invalid_record_fraction is reached before stopping generation.
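As a hedged illustration, a Tabular Fine-Tuning config using the defaults documented above might look like the sketch below; all values simply restate the listed defaults, and the commented grouping/ordering columns are placeholders:

```yaml
schema_version: "1.0"
name: tabular-ft-example
models:
  - navigator_ft:
      data_source: __tmp__
      # group_training_examples_by: account_id   # placeholder grouping column
      # order_training_examples_by: event_time   # placeholder ordering column
      params:
        num_input_records_to_sample: auto
        batch_size: 1
        gradient_accumulation_steps: 8
        learning_rate: 0.0005
        warmup_ratio: 0.05
        weight_decay: 0.01
        lora_alpha_over_r: 1.0
        lora_r: 32
        lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
        rope_scaling_factor: 1
      privacy_params:
        dp: false
        epsilon: 8
        per_sample_max_grad_norm: 0.1
      generate:
        num_records: 5000
        temperature: 0.75
        repetition_penalty: 1.2
        top_p: 1.0
```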
If running this system in hybrid mode, the following instance specifications are recommended:
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required): Minimum Nvidia A10G, L4, RTX4090 or better CUDA compliant GPU with 24GB+ RAM and Ada or newer architecture. For faster training and generation speeds and/or rope_scaling_factor values above 2, we recommend GPUs with 40+GB RAM such as NVIDIA A100 or H100.
The default context length for the underlying model in Tabular Fine-Tuning can handle datasets with roughly 50 columns (fewer if modeling inter-row correlations using group_training_examples_by). Similarly, the default context length can handle event-driven data with sequences up to roughly 20 rows. To go beyond that, increase rope_scaling_factor. Note that the exact threshold (where the job will crash) depends on the number of tokens needed to encode each row, so decreasing the length of column names, abbreviating values, or reducing the number of columns can also help.
navigator_ft is a great first option to try for most datasets. However, for unique datasets or needs, other models may be a better fit. For heavily numerical tables or use cases requiring 1 million records or more to be generated (navigator_ft can generate batches of up to 130,000 records at a time), we recommend using actgan. It will typically be much faster at generating results in these scenarios. For text-only datasets where you are willing to trade off generation time for an additional quality boost, we recommend using gpt_x.
Given the model is an LLM, mappings from the training data often persist in the synthetic output, but there is no guarantee. If you require mappings across columns to persist, we recommend doing pre-processing to concatenate the columns or post-processing to filter out rows where the mappings did not persist.
Pre-trained models such as the underlying model in Tabular Fine-Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
This section covers the model training and generation APIs shared across all Gretel models.
Gretel offers the following synthetics models:
Data types: Numeric, categorical, text, JSON, event-based
Differential privacy: Optional
Formerly known as: Navigator Fine Tuning
Data types: Text
Differential privacy: Optional
Formerly known as: GPT
Data types: Numeric, categorical
Differential privacy: NOT supported
Formerly known as: ACTGAN
Data types: Numeric, categorical
Differential privacy: Required; you cannot run without differential privacy
This section compares features of different generative data models supported by Gretel APIs.
✅ = Supported
✖️ = Not yet supported
All Gretel Synthetics models follow a similar configuration file format structure. Here is an example model-config.yaml
[model_id] is replaced with the type of model you wish to train (e.g. navigator_ft, gpt_x, actgan, tabular_dp).
data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
Supported storage formats include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.
data_source: __tmp__ can be used when the source file is specified elsewhere using:
the --in_data parameter via the CLI,
the corresponding parameter via the SDK, or
the dataset button via the Console.
The params object contains key-value pairs that represent the available parameters that will be used to train a synthetic data model on the data_source.
Use the following CLI command to create and train a synthetic model.
--in_data is optional if data_source is specified in the config, and can be used to override the value in the config.
--in_data is required if data_source: __tmp__ is used in the config.
--name is optional, and can be used to override the name specified in the config.
Designate project
Create model object and submit for training
During training, the following model artifacts are created:
Use the gretel models run command to generate data from a synthetic model.
--model-id supports both a model uid and the JSON that models create outputs.
There are many different --param options, depending on the model.
The num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
--in_data is optional and used for conditional data generation when supported by the model.
Create and submit record handler
There are many different params options, depending on the model.
The num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
View results
Model type: Generative pre-trained transformer for text generation
Gretel Text Fine-Tuning simplifies the process of training popular Large Language Models (LLMs) to generate synthetic text. It offers support for differentially private training, ensuring data privacy, and includes automated quality reporting with Gretel's Text Synthetic Quality Score (SQS). This allows you to create labeled examples to train or test other machine learning models, fine-tune the model on your data, or prompt it with examples for inference.
To prompt the base model directly without fine-tuning, set data_source to null at initialization.
When fine-tuning Gretel Text Fine-Tuning models, these constraints apply:
Use 100+ examples if possible. With fewer than 100, just prompt the base model directly.
Providing only 1-5 records will cause an error.
If your training dataset is in a multi-column format, you MUST set the column_name.
data_source (required) - Use __tmp__ or a valid CSV, JSON, or JSONL file. Leave blank to skip fine-tuning and use the base LLM weights, for few-shot or zero-shot generation.
column_name (optional) - Column with text for training if multi-column input. This parameter is required if multi-column input is used.
params - Controls the model training process.
batch_size (optional, default 4) - Batch size per GPU/TPU/CPU. Lower if out of memory.
epochs (optional, default 3) - Number of training epochs.
weight_decay (optional, default 0.01) - Weight decay for the AdamW optimizer, between 0 and 1.
warmup_steps (optional, default 100) - Warmup steps for linear learning rate increase.
lr_scheduler (optional, default linear) - Learning rate scheduler type.
learning_rate (optional, default 0.0002) - Initial AdamW learning rate.
max_tokens (optional, default 512) - Max input length in tokens.
validation (optional) - Validation set size. An integer is an absolute number of samples.
gradient_accumulation_steps (optional, default 8) - Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory.
lora_r (optional, default 8) - Rank of the matrices that are updated. A lower value means fewer trainable model parameters.
lora_alpha_over_r (optional, default 1) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, values of 0.5, 1 or 2 work well.
target_modules (optional, default null) - List of module names or regex expression of the module names to replace with LoRA. When unspecified, modules will be chosen according to the model architecture (e.g. Mistral, Llama).
privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.
dp (optional, default false) - Flag to turn on differentially private fine-tuning when a data source is provided.
epsilon (optional, default 8) - Privacy loss parameter for differential privacy. Specify the maximum value available for model fine-tuning.
entity_column_name (optional, default null) - Column representing the unit of privacy, e.g. name or id. When null, record-level differential privacy will be maintained, i.e. the final model does not change by much when the input dataset changes by one record. When specified as e.g. user_id, user-level differential privacy is maintained.
generate (optional) - Controls generated outputs during training.
num_records (optional, default 10) - Number of outputs.
maximum_text_length (optional, default 100) - Max tokens per output.
General Configuration
schema_version (optional): Defines the version of the configuration schema.
name (optional): Name of the model configuration.
Models
models (required): List of model configurations.
gpt_x: Configuration for a specific model instance.
data_source (required): URLs or paths to the data files (CSV, JSON, JSONL). For temporary data, use __tmp__.
pretrained_model (optional): Pretrained LLM model to use. Defaults to "gretelai/gpt-auto".
prompt_template (optional): Template for prompting the model.
column_name (optional): Name of the column with text data if using multi-column input. Required parameter if using multi-column input.
validation (optional): Size of the validation set, specified as an integer (absolute number of samples).
Training Parameters
params (optional): Configuration for training parameters.
batch_size (default 4): Number of samples per batch per GPU/TPU/CPU.
epochs (optional): Number of complete passes through the training dataset.
steps (default 750): Total number of training steps to perform.
weight_decay (default 0.01): Weight decay coefficient for the AdamW optimizer, a regularization parameter.
warmup_steps (default 100): Number of steps for learning rate warmup.
lr_scheduler (default linear): Type of learning rate scheduler.
learning_rate (default 0.0002): Initial learning rate for the AdamW optimizer.
max_tokens (default 512): Maximum number of tokens for each input sequence.
gradient_accumulation_steps (default 8): Number of steps to accumulate gradients before updating model parameters.
Parameter-Efficient Fine-Tuning (PEFT) Parameters
peft_params (optional): Parameters for fine-tuning using PEFT.
lora_r (default 8): Rank of the low-rank adaptation matrix in LoRA.
lora_alpha_over_r (default 1.0): Scaling factor for the LoRA adaptation.
target_modules (optional): Specific modules to apply LoRA adaptation.
Privacy Parameters
privacy_params (optional): Configuration for differential privacy (DP).
dp (default false): Enable differentially private training using DP-SGD.
epsilon (default 8.0): Privacy budget parameter for DP.
delta (default "auto"): Privacy parameter for DP, usually a very small number.
per_sample_max_grad_norm (default 1.0): Clipping norm for gradients per sample to ensure privacy.
entity_column_name (optional): Column name for entity-level differential privacy.
Generation Parameters
generate (optional): Parameters controlling the generation of synthetic text.
num_records (default 10): Number of records to generate.
seed_records_multiplier (default 1): Multiplier for the number of rows emitted per prompt in prompt-based generation.
maximum_text_length (default 100): Maximum number of tokens per generated text.
top_p (default 0.89876): Probability threshold for nucleus sampling (top-p).
top_k (default 43): Number of highest probability tokens to keep for top-k sampling.
num_beams (default 1): Number of beams for beam search. Use 1 to disable beam search.
do_sample (default true): Enable sampling if true, otherwise use greedy search.
do_early_stopping (default true): Enable early stopping in beam search if true.
typical_p (default 0.8): Typical probability mass to consider in sampling.
temperature (default 1.0): Sampling temperature. Higher values increase randomness.
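As a hedged sketch, the parameters above could be assembled into a config like the following; values mirror the listed defaults, and the data source is a placeholder:

```yaml
schema_version: "1.0"
name: text-ft-example
models:
  - gpt_x:
      data_source: __tmp__          # placeholder; leave unset/null to prompt the base model
      pretrained_model: gretelai/gpt-auto
      column_name: null             # required if the input has multiple columns
      params:
        batch_size: 4
        epochs: 3
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: linear
        learning_rate: 0.0002
        max_tokens: 512
        gradient_accumulation_steps: 8
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1
      privacy_params:
        dp: false
        epsilon: 8
        delta: auto
      generate:
        num_records: 10
        maximum_text_length: 100
```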
Training Configuration: Define your data source and configure model parameters. Optionally, enable privacy settings.
Data Generation: Supports unconditional and prompt-based text generation. Configure generation parameters to control output features.
Make sure to set data_source and pretrained_model as per your requirements. Use column_name for specifying the text column in multi-column data inputs.
The Gretel Text Fine-Tuning model supports fine-tuning and inference of commercially viable large language models. Specific model information can be found on each model card linked below.
Supported Models
gretelai/gpt-auto: Automatically selects the best available LLM for model training
mistralai/Mistral-7B-Instruct-v0.2
meta-llama/Meta-Llama-3-8B-Instruct
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required). Minimum Nvidia A10G, RTX3090 or better CUDA compliant GPU with 24GB+ RAM is required to run basic language models. For fine-tuning on datasets with more than 1,000 examples, a NVIDIA A100 or H100 with 40+GB RAM is recommended.
Large-scale language models such as Gretel Text Fine-Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information on each model, please read the model cards linked under "model information".
Hello Navigator Fine-Tuning! Our newest multi-modal model is live!
Gretel has example configurations that may be helpful as starting points for creating your model.
For example, to generate realistic stock prices in a daily stock price dataset, we would set group_training_examples_by to "stock" and order_training_examples_by to "date". This ensures that correlations within each stock ticker are maintained across multiple days, and the daily price and volume fluctuations are reasonable.
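A minimal sketch of just those keys (column names taken from the example above; the rest of the config is omitted):

```yaml
models:
  - navigator_ft:
      data_source: __tmp__
      group_training_examples_by: stock   # keep each ticker's records together
      order_training_examples_by: date    # preserve day-to-day ordering within each ticker
```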
lr_scheduler (str, required, defaults to cosine) - The scheduler type to use. See the documentation of SchedulerType for all possible values.
Tabular Fine-Tuning - Gretel's flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.
Text Fine-Tuning - Gretel's model for generating privacy-preserving synthetic text using your choice of top performing open-source models.
Tabular GAN - Gretel's model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.
Tabular DP - Gretel's model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.
Need help choosing the right synthetic model? Check out our detailed model comparison based on real-world datasets.
Some models have specific data source format requirements.
Parameters are specific to each model type. See a full list of supported parameters on each model's page.
Gretel has example configurations that may be helpful as starting points for creating your model.
Initialize a model to begin using Gretel Text Fine-Tuning. Use the gpt_x tag to select this model. Here is a sample config to create and fine-tune a Gretel Text Fine-Tuning model. All Gretel models use a common interface for training synthetic data models from their config. See the reference for how to create and train a model.
pretrained_model (optional, defaults to meta-llama/Meta-Llama-3-8B-Instruct) - Gretel supports PEFT and LoRA for fast adaptation of pre-trained LLMs. Use a causal language model from the Hugging Face Hub.
peft_params - Gretel Text Fine-Tuning uses Low-Rank Adaptation (LoRA) of LLMs, which makes fine-tuning more efficient by drastically reducing the number of trainable parameters, updating weights of smaller matrices through low-rank decomposition.
delta (optional, default auto) - Probability of accidentally leaking information. It is typically set to be much less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.2. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
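For example, the 500-record case above could be expressed as the following sketch (epsilon shown at its default):

```yaml
privacy_params:
  dp: true
  epsilon: 8
  delta: 0.000004   # roughly 1/n^2 for n = 500 training records
```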
| Tag | navigator_ft | gpt_x | actgan | tabular_dp | timeseries_dgan |
| --- | --- | --- | --- | --- | --- |
| Type | Language Model | Language Model | Generative Adversarial Network | Statistical | Generative Adversarial Network |
| Model | Pre-trained Transformer | Pre-trained Transformer | GAN | Probabilistic Graphical Model | GAN |
| Privacy filters | ✖️ | ✖️ | ✅ | ✖️ | ✖️ |
| Privacy metrics | ✅ | ✖️ | ✅ | ✅ | ✖️ |
| Differential privacy | ✖️ | ✅ | ✖️ | ✅ | ✖️ |
|  | ✅ | ✅ | ✅ | ✅ | ✖️ |
| Tabular | ✅ | ✖️ | ✅ | ✅ | ✅ |
| Time-series | ✅ | ✖️ | ✖️ | ✖️ | ✅ |
| Natural language | ✅ | ✅ | ✖️ | ✖️ | ✖️ |
| Conditional generation | ✖️ | ✅ | ✅ | ✖️ | ✖️ |
| Pre-trained | ✅ | ✅ | ✖️ | ✖️ | ✖️ |
| Gretel cloud | ✅ | ✅ | ✅ | ✅ | ✅ |
| Hybrid cloud | ✅ | ✅ | ✅ | ✅ | ✅ |
| Requires GPU | ✅ | ✅ | ✅ | ✖️ | ✅ |
data_preview.gz - A preview of your synthetic dataset in CSV format.
logs.json.gz - Log output from the synthetic worker that is helpful for debugging.
report.html.gz - HTML report that offers deep insight into the quality of the synthetic model.
report-json.json.gz - A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically.
The gretel-synthetics
Python package release notes can be found on GitHub.
The gretel-client
Python package release notes can be found on GitHub.
Adversarial model that supports tabular data, structured numerical data, and high column count data.
The Gretel Tabular GAN model API provides access to a generative data model for tabular data. Gretel Tabular GAN supports advanced features such as conditional data generation. Tabular GAN works well with datasets featuring primarily numeric data, high column counts, and highly unique categorical fields.
This model can be selected using the actgan model tag. Below is an example configuration that may be used to create a Gretel Tabular GAN model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.
The configuration below contains additional options for training a Gretel Tabular GAN model, with the default options displayed.
data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV, JSON, or JSONL format.
embedding_dim (int, required, defaults to 128) - Size of the random sample passed to the Generator (z vector).
generator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the Residuals. Adding more numbers to this list will create more Residuals, one for each number. This is equivalent to increasing the depth of the Generator.
discriminator_dim (List(int), required, defaults to [256, 256]) - Size of the output samples for each of the discriminator linear layers. A new Linear layer will be created for each number added to this list.
generator_lr (float, required, defaults to 2e-4) - Learning rate for the Generator.
generator_decay (float, required, defaults to 1e-6) - Weight decay for the Generator's Adam optimizer.
discriminator_lr (float, required, defaults to 2e-4) - Learning rate for the discriminator.
discriminator_decay (float, required, defaults to 1e-6) - Weight decay for the discriminator's Adam optimizer.
batch_size (int, required, defaults to 500) - Determines the number of examples the model sees each step. Importantly, this must be a multiple of 10 as specified by the Tabular GAN training scheme.
epochs (int, required, defaults to 300) - Number of training iterations the model will undergo during training. A larger number will result in longer training times, but potentially higher quality synthetic data.
binary_encoder_cutoff (int, required, defaults to 150) - Number of unique categorical values in a column before encoding switches from One Hot to Binary Encoding for the specific column. Decrease this number if you have Out of Memory issues. Will result in faster training times with a potential loss in performance in a few select cases.
binary_encoder_nan_handler (str, optional, defaults to mode) - Method for handling invalid generated binary encodings. When generating data, it is possible the model outputs binary encodings that do not map to a real category. This parameter specifies what value to use in this case. Possible choices are: "mode". Note that this will not replace all NaNs, and the generated data can have NaNs if the training data has NaNs.
cbn_sample_size (int, optional, defaults to 250,000) - If set, clustering for continuous-valued columns is performed on a sample of the data records. This option can significantly reduce training time on large datasets with only negligible impact on performance. When setting this option to 0 or to a value larger than the data size, no subsetting will be performed.
discriminator_steps (int, required, defaults to 1) - The discriminator and Generator take a different number of steps per batch. The original WGAN paper took 5 discriminator steps for each Generator step. In this case we default to 1, which follows the original Tabular GAN implementation.
log_frequency (bool, required, defaults to True) - Determines the use of log frequency of categorical counts during conditional sampling. In some cases, switching to False improves performance.
verbose (bool, required, defaults to False) - Whether to print training progress during training.
pac (int, required, defaults to 10) - Number of samples to group together when applying the discriminator. Must evenly divide batch_size.
data_upsample_limit (int, optional, defaults to 100) - If the training data has fewer than this many records, the data will be automatically upsampled to the specified limit. Setting this to 0 will disable upsampling.
auto_transform_datetime (bool, optional, defaults to False) - When enabled, every column will be analyzed to determine if it is made up of DateTime objects. For each column that is detected, Tabular GAN will automatically convert DateTimes to Unix Timestamps (epoch seconds) for model training and then, after sampling, convert them back into a DateTime string.
conditional_vector_type (str, required, defaults to single_discrete) - Controls conditional vector usage in the model architecture, which influences the effectiveness and flexibility of the native conditional generation. Possible choices are: "single_discrete", "anyway". single_discrete is the original CTGAN architecture. anyway will improve efficiency of conditional generation by guiding the model towards the requested seed values.
conditional_select_mean_columns (float, optional) - Target number of columns to select for conditioning during training. Only used when conditional_vector_type=anyway. Use if the typical number of seed columns required for conditional generation is known. The model will be better at conditional generation when using approximately this many seed columns. If set, conditional_select_column_prob must be empty.
conditional_select_column_prob (float, optional) - Probability of selecting a column for conditioning during training. Only used when conditional_vector_type=anyway. If set, conditional_select_mean_columns must be empty.
reconstruction_loss_coef (float, required, defaults to 1.0) - Multiplier on reconstruction loss. Higher values should provide more efficient conditional generation. Only used when conditional_vector_type=anyway.
force_conditioning (bool or auto, required, defaults to auto) - When True, skips rejection sampling and directly sets the requested seed values in generated data. Conditional generation will be faster when enabled, but may reduce quality of generated data. If True with single_discrete, all correlation between seed columns and generated columns is lost! auto chooses a preferred value for force_conditioning based on the other configured parameters; the logs will show what value was chosen.
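As a hedged illustration, an actgan config restating the defaults documented above might look like the sketch below; the generate block's num_records is an illustrative generation count (num_records is the generation parameter supported by all synthetic models):

```yaml
schema_version: "1.0"
name: tabular-gan-example
models:
  - actgan:
      data_source: __tmp__
      params:
        embedding_dim: 128
        generator_dim: [256, 256]
        discriminator_dim: [256, 256]
        generator_lr: 0.0002
        generator_decay: 0.000001
        discriminator_lr: 0.0002
        discriminator_decay: 0.000001
        batch_size: 500
        epochs: 300
        binary_encoder_cutoff: 150
        discriminator_steps: 1
        log_frequency: true
        pac: 10
        conditional_vector_type: single_discrete
        force_conditioning: auto
      generate:
        num_records: 5000   # illustrative generation count
```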
Differential privacy is currently not supported for the Gretel Tabular GAN model.
To use conditional data generation (smart seeding), you can provide an input CSV containing the columns and values you want to seed with during data generation. (No changes are needed at model creation time.) Column names in the input file should be a subset of the column names in the training data used for model creation. All seed column data types (string, int, float) are supported when conditional_vector_type=anyway, and conditional generation is more efficient, so that setting is preferred when conditional generation is a priority. Conditional generation with string data type seed columns only is also available when conditional_vector_type=single_discrete.
Example CLI command to seed the data generation from a trained Tabular GAN model:
Example CLI to generate 1000 additional records from a trained Tabular GAN model:
The underlying model used is an Anyway Conditional Tabular Generative Adversarial Network (ACTGAN). There are Generator and Discriminator models that are trained adversarially. The model is initialized from random weights and trained on the customer provided dataset. This model is an extension of the popular CTGAN model. These algorithmic extensions improve speed, accuracy, memory usage, and conditional generation.
More details about the original underlying model can be found in the authors' excellent paper: https://arxiv.org/abs/1907.00503
Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required). Minimum Nvidia T4 or similar CUDA compliant GPU with 16GB+ RAM is required to run basic language models.
In general, this model trains faster in wall-clock time than comparable LSTMs, but often performs worse on text or high cardinality categorical variables.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
Tabular GAN technical limitations:
When force_conditioning=False (the default with conditional_vector_type=single_discrete), conditional generation may not produce a record for every seeded row. So you might only get 90 records back after using a seed file with 100 records with smart seeding. Use conditional_vector_type=anyway to increase the likelihood of generating all requested seed rows. The parameter force_conditioning=True is also available to guarantee a row is generated for all seed rows, but with the possibility of lower data quality.
Statistical model for synthetic data generation with strong differential privacy guarantees.
The Gretel Tabular DP model API provides access to a probabilistic graphical model for generating synthetic tabular data with strong differential privacy guarantees. Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables.
This model can be selected using the tabular_dp model tag. Below is an example configuration to create a Gretel Tabular DP model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.
data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV format.
epsilon (float, required, defaults to 1) - Privacy loss parameter for differential privacy.
delta (float or auto, required, defaults to auto) - Probability of accidentally leaking information. It is typically set to be less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.5. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
infer_domain (bool, required, defaults to True) - Whether to determine the data domain (i.e. min/max for continuous attributes, number of categories for categorical attributes) exactly from the training data. If False, the domain must be provided in the config via the domain parameter.
domain - Domain of each attribute in the dataset. For numeric variables, only the min and max should be specified (int or float). For categorical variables, only the number of categories should be specified (int). See below for an example of a configuration with domain specified for a dataset containing three variables: state, age, and capital gains.
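A sketch of such a configuration is shown below. The exact schema for expressing the domain is an assumption; it is meant only to illustrate providing min/max bounds for numeric columns and category counts for categorical columns:

```yaml
models:
  - tabular_dp:
      data_source: __tmp__
      params:
        epsilon: 1
        infer_domain: false            # domain is supplied explicitly below
        domain:
          state: 50                    # categorical: number of categories
          age: [18, 90]                # numeric: [min, max]
          capital_gains: [0.0, 100000.0]
```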
To reference the default tabular-dp configuration in a workflow, use the following, e.g.
Example CLI script to generate 1000 additional records from a trained Tabular DP model:
The underlying model is a probabilistic graphical model (PGM), which is estimated using low dimensional distributions measured with differential privacy. This model follows three steps:
Automatically select a subset of correlated pairs of variables using a differentially private algorithm.
Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.
Estimate a PGM that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
More details about the model can be found in the paper Winning the NIST Contest: A scalable and general approach to differentially private synthetic data.
If running this system in local mode (on-premises), the following instance type is recommended. Note that a GPU is not required.
CPU: Minimum 4 cores, 16GB RAM.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
Conditional generation is not supported.
Privacy Filters are not supported. This is because privacy filters directly utilize training records to provide privacy protections. The process does not involve any addition of calibrated noise. Hence, enabling privacy filters would invalidate the differential privacy guarantee.
Gretel Tabular DP is not appropriate for time series data where maintaining correlations across sequential records is important, as the underlying model has an assumption of independence between records.
Gretel Tabular DP is not appropriate for text data if novel text is desired in the synthetic data. Use Gretel GPT to generate differentially private synthetic text.
Gretel Transform combines data classification with data transformation to easily detect and anonymize or mutate sensitive data.
Gretel Transform offers custom transformation logic, an expanded library of detectable and fakeable entities, and PII and custom entity detections.
Gretel Transform is a general-purpose programmatic dataset editing tool. Most commonly, Gretel customers use it to:
De-identify datasets, for example by detecting Personally Identifiable Information (PII) and replacing it with fake PII of the same type.
Pre-process datasets before using them to train a synthetic data model, for example to remove low-quality records (such as records containing too many blank values) or columns containing UUIDs or hashes, which are not relevant for synthetic data models since they contain no discernible correlations or distributions for the model to learn.
Post-process synthetic data generated from a synthetic data model, for example to validate that the generated records respect business-specific rules, and drop or fix any records that don't.
As with other Gretel models, you can configure Transform using YAML. Transform config files consist of two sections:
globals, which contains default parameter values (such as the locale and seed used to generate fake values) and user-defined variables applicable throughout the config.
steps, which lists transformation steps applied sequentially. Transformation steps can define variables (vars), and manipulate columns (add, drop, and rename) and rows (drop and update). In practice, most Transform configs contain a single step, but more steps can be useful if, for example, the value of column B depends on the original (non-transformed) value of column A, but column A must also eventually be transformed. In that case, the first step could set the new value of column B, leaving column A unchanged, before ultimately setting the new value of column A in the second step.
Below is an example config which shows this config structure in action:
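A config along these lines would look roughly like the following sketch. The key names follow the Transform reference later on this page; treat it as illustrative rather than exact:

```yaml
globals:
  locales:
    - en_CA
    - fr_CA
steps:
  # Step 1: add row_index, drop invalid rows, populate row_index, fake phone numbers
  - columns:
      add:
        - name: row_index
    rows:
      drop:
        - condition: row.user_id | isna
      update:
        - name: row_index
          value: index
        - entity: phone_number
          value: fake.phone_number()
  # Step 2: drop the sensitive user_id column and rename the phone number columns
  - columns:
      drop:
        - name: user_id
      rename:
        - name: phone_number_1
          value: cell_phone
        - name: phone_number_2
          value: home_phone
```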
The config above:
Sets the default locale for fake values to Canada (English) and Canada (French). When multiple locales are provided, a random one is chosen from the list for each fake value.
Adds a new column named row_index, initially containing only blank values.
Drops invalid rows, which we define here as rows containing blank user_id values. condition is a Jinja template expression, which allows for custom validation logic.
Sets the value of the new row_index column to the index of the record in the original dataset (this can be helpful for use cases where the ability to "reverse" transformations or maintain a mapping between the original and transformed values is important).
Replaces all values within columns detected as containing phone numbers (including phone_number_1 and phone_number_2) with fake phone numbers having area codes in Canada, since the default locale is set to en_CA and fr_CA in the globals section. fake is a Faker object supporting all standard Faker providers.
Drops the sensitive user_id column. Note that this is done in the second step, since that column is needed in the first step to drop invalid rows.
Renames the phone_number_1 and phone_number_2 columns to cell_phone and home_phone, respectively.
To get started with building your own Transform config for de-identification or pre/post processing datasets, see the Examples page for starter configs for several use cases, and the Reference page for the full list of supported transformation steps, template expression syntax, and detectable entities.
Below are a few complete sample configs to help you quickly get started with some of the most common Transform use cases.
Fall back to hashing entities not supported by Faker. If you don't require NER, remove the last rule (type: text -> fake_entities); doing so makes this config run more than 10x faster if your dataset contains free text columns.
If you need to preserve certain ID columns for auditability or to maintain relationships between tables, you can explicitly exclude these columns from any transformation rules.
You can use the built-in Python re library for regex operations. Below, we go a step further by listing all regular expressions we are looking to replace, along with their Faker function mappings, in the regex_to_faker variable, then iterating through them to replace all of their occurrences in all free text columns.
Transform can be used to post-process synthetic data to increase accuracy, for example by dropping invalid rows according to custom business logic, or by ensuring calculated field values are accurate.
We published a guide containing best practices for cleaning and pre-processing real-world data that can help train better synthetic data models. The config below automates several steps from this guide, and can be chained in a Workflow to run ahead of synthetic model training.
Below is a template to help you get started writing your own Transform config. It includes common examples, the complete list of Supported Entities, and helper text to guide you as you write your own Transform configuration.
Adversarial model for time series data.
The Gretel DGAN model API provides access to a generative data model for time-series data. This model supports time varying features, fixed attributes, categorical variables, and works well with many time sequence examples to train on.
This model can be selected using the timeseries_dgan model tag. An example configuration is provided below, but note that you will often need to update some of the options to match your input data. The DGAN model supports two input formats, wide and long, which we explain in detail in the Data format section. These formats and related parameters tell the DGAN model how to parse your data source as time-series. The training data (data source) is a table, for example a CSV file, using the common interface to train or fine-tune all Gretel models. See the reference example on how to Create and Train a Model.
The DGAN model will generate synthetic time-series of a particular length, determined by the max_sequence_len parameter. The training examples must also be that same length. As with all machine learning models, the more examples of these sequences are available to train the model, the better the model's performance. We provide several config parameters to tell the DGAN model how to convert your input CSV into these training example sequences.
We support 2 data styles to provide time-series data to the DGAN model: long and wide.
This is the most versatile data format to use. We assume the input table has 1 time point per row and use the config options to specify attributes, features, etc. For example, stock price data in this format might look like the following table:
| Date | Sector | Symbol | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|---|---|
| 2022-06-01 | 0 | AAPL | 125 | 135 | 115 | 126 | 100000 |
| 2022-06-02 | 0 | AAPL | 126 | 140 | 121 | 137 | 500000 |
| ... | 0 | ... | ... | ... | ... | ... | ... |
| 2022-06-30 | 0 | AAPL | 185 | 193 | 170 | 177 | 250000 |
| 2022-06-01 | 1 | V | 222 | 233 | 213 | 214 | 50000 |
| 2022-06-02 | 1 | V | 214 | 217 | 200 | 203 | 75000 |
| ... | 1 | ... | ... | ... | ... | ... | ... |
| 2022-06-30 | 1 | V | 234 | 261 | 212 | 236 | 150000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
Here, we use each stock (symbol) to split the data into examples. Each example time-series corresponds to max_sequence_len rows in the input. Each generated example in the synthetic data is then like a new stock, with a sequence of prices that exhibits similar types of behavior as observed in the training data.
In addition to the price information that changes each day, we also have an attribute, Sector, that is fixed for each example. The model can utilize this if certain sectors' stocks tend to be more volatile than others. In this case, Sector is also a discrete variable, and it must already be ordinal encoded in the input data passed to Gretel's APIs. So 0 might correspond to the technology sector, and 1 to the financial sector. Consider using sklearn's OrdinalEncoder to convert a string column.
Use the following config snippet for this type of setup, updating the column names as needed for your data:
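A sketch for the stock example above is shown below. The column names match the sample table; the max_sequence_len and sample_len values are illustrative, and the exact placement of the data-format options relative to params is an assumption:

```yaml
schema_version: "1.0"
models:
  - timeseries_dgan:
      data_source: __tmp__
      params:
        df_style: long
        example_id_column: Symbol    # one example per stock symbol
        time_column: Date
        attribute_columns: [Sector]  # fixed per example, already ordinal encoded
        discrete_columns: [Sector]
        max_sequence_len: 30         # illustrative; must match rows per example
        sample_len: 3                # must evenly divide max_sequence_len
```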
If there's not a good column to split the data into examples, we support automatic splitting when no example_id_column is provided (though attributes are not supported in this mode). We'll split the input data (after sorting on time_column, if provided) into chunks of the required length.
When using the auto splitting feature, note that the generated data will have an additional column, called example_id, with integer values. These values show how you should group the generated data for analyses. Temporal correlations within the same example_id value will match the training data, but any comparisons across different example_id values are not meaningful. So it's not recommended to concatenate all the generated examples into one very long sequence; there will be discontinuities every max_sequence_len rows, because each example is generated independently.
When using the long data style, variable sequence lengths are supported. So, when the number of rows in the input for each stock symbol is variable, data must be supplied in long format. The wide data style (described below) is not compatible with modeling variable sequence lengths.
Wide is an alternative data style for when there is exactly one feature (time varying variable). We assume each example is one row in the input table. Let's use just the closing price, but otherwise the same underlying data as in the long data style example above:
| Sector | 2022-06-01 | 2022-06-02 | ... | 2022-06-30 |
|---|---|---|---|---|
| 0 | 126 | 137 | ... | 177 |
| 1 | 213 | 203 | ... | 236 |
| ... | ... | ... | ... | ... |
With the sequence being represented as columns, each row is now one training example. Again we have the Sector attribute that is already ordinal encoded. The model doesn't need the Symbol column because no splitting into examples is required, so it should be dropped before sending the data to Gretel. The following config snippet will work with the above input:
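A sketch for the wide-format example above is shown below; as with the long-format sketch, the exact option placement and the max_sequence_len value are assumptions:

```yaml
schema_version: "1.0"
models:
  - timeseries_dgan:
      data_source: __tmp__
      params:
        df_style: wide
        attribute_columns: [Sector]
        discrete_columns: [Sector]
        max_sequence_len: 30    # illustrative; equals the number of time columns
        sample_len: 3
```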
Full list of configuration options for the DGAN model.
Data parameters:
df_style (string, required, defaults to 'long') - Either 'wide' or 'long', indicating the format style of the input data.
example_id_column (string, optional, defaults to null) - Column name used to split long style data into examples. Effectively performs a group-by operation on this column, and each group becomes an example. If null, the rows are automatically split into training examples based on max_sequence_len. Note the generated synthetic data will contain an example_id column when this automatic splitting is used.
attribute_columns (list of strings, optional, defaults to null) - Column names of fixed attributes that do not vary over time for each example sequence. Used by both 'wide' and 'long' formats. If null, the model will not use any attributes. Note that in 'long' format, these columns must be constant within each example, so there must be a 1-to-1 mapping from values in the example_id_column to each attribute column. Because of this, auto splitting (when example_id_column is null) does not currently support attribute columns.
feature_columns (list of strings, optional, defaults to null) - Column names of features, the variables that vary over time. Used by both 'wide' and 'long' formats. If specified, only these columns will be used as features. If null, then all columns in the input data that are not used in other column parameters will be the features.
time_column (string, optional, defaults to null) - Column name of date or time values to sort by before creating example sequences in 'long' format. If time_column='auto', a column that looks like a date or time will be selected automatically. If null, the order from the input data is used. Generated synthetic data will contain this column using an arbitrary set of values from one training example, so if different examples have different time ranges (e.g., because auto splitting was used), one sequence of time values will be used for all synthetic data.
discrete_columns (list of strings, optional, defaults to null) - Column names (either attributes or features) to model as categorical variables. DGAN will automatically model any string type columns as categorical variables, in addition to columns explicitly listed here.
max_sequence_len (int, required) - Maximum length of generated synthetic sequences and training example sequences. Sequences may be of variable length (i.e. some sequences may be shorter than max_sequence_len), and synthetic sequences will follow a similar pattern of lengths as the training data. To have DGAN automatically choose a good max_sequence_len and sample_len based on the training data (when example_id_column is provided), set both max_sequence_len and sample_len to auto.
sample_len (int, required) - Number of time points to produce from 1 RNN cell in the generator. Must evenly divide max_sequence_len. When max_sequence_len is smaller (<20), sample_len=1 is recommended. For longer sequences, the model often learns better when max_sequence_len/sample_len is between 10 and 20. Increasing sample_len is also an option if DGAN is running out of memory (receiving sigkill errors from the Gretel API), as it should lead to fewer parameters and a smaller memory footprint for the model. If using max_sequence_len: auto, then sample_len can also be set to auto.
data_source (str, required) - Input data, must point to a valid and accessible file URL. Often set automatically by the CLI (--in-data), or may use a local file with the SDK and upload_data_source=True.
Model structure parameters
apply_feature_scaling (bool, required, defaults to True) - Automatically scale continuous variables (in attributes or features) to the appropriate range as specified by normalization. If False, the input data must already be scaled to the appropriate range ([-1,1] or [0,1]) or the model will not work.
apply_example_scaling (bool, required, defaults to True) - Internally rescale continuous features in each example and model the range for each example. This helps the model learn when different examples have very different value ranges. E.g., for stock price data, there may be penny stocks (prices usually between $0.001 and $1), stocks in the $1-$100 range, others in the $100-$1000 range, and the Berkshire Hathaways in the $100,000 to $1,000,000 range.
normalization (string, required, defaults to 'MINUSONE_ONE') - Defines the internal range of continuous variables. Supported values are 'MINUSONE_ONE', where continuous variables are in [-1,1] and tanh activations are used, and 'ZERO_ONE', where continuous variables are in [0,1] and sigmoid activations are used. Also see apply_feature_scaling.
use_attribute_discriminator (bool, required, defaults to True) - Use a second discriminator that only operates on the attributes as part of the GAN. Helps ensure the attribute distributions are accurate. Also see attribute_loss_coef.
attribute_noise_dim (int, required, defaults to 10) - Width of the noise vector in the GAN generator used to create the attributes.
feature_noise_dim (int, required, defaults to 10) - Width of the noise vector in the GAN generator used to create the features.
attribute_num_layers (int, required, defaults to 3) - Number of hidden layers in the feed-forward MLP that creates attributes in the GAN generator.
attribute_num_units (int, required, defaults to 100) - Number of units in each layer of the feed-forward MLP that creates attributes in the GAN generator.
feature_num_layers (int, required, defaults to 1) - Number of LSTM layers in the RNN that creates features in the GAN generator.
feature_num_units (int, required, defaults to 100) - Number of units in each LSTM layer that creates features in the GAN generator.
Training parameters
batch_size (int, required, defaults to 1000) - Size of batches for training and generation. Larger values should run faster, so try increasing this if training is taking a long time. If batch_size is too large for the model setup, the memory footprint for training may exceed available RAM and cause crashes (sigkill errors from the Gretel API).
epochs (int, required, defaults to 400) - Number of epochs to train (iterations through the training data while optimizing parameters).
gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss.
attribute_gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss for the attribute discriminator (if enabled with use_attribute_discriminator).
attribute_loss_coef (float, required, defaults to 1.0) - When use_attribute_discriminator is True, the coefficient on the attribute discriminator loss when combined with the generic discriminator loss. Try increasing this parameter if the attribute discriminator is enabled but the attribute distributions of the generated data do not match the training data.
generator_learning_rate (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN generator.
discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN discriminator.
attribute_discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN attribute discriminator (if enabled with use_attribute_discriminator).
discriminator_rounds (int, required, defaults to 1) - Number of optimization steps of the discriminator(s) to perform for each batch. Some GAN literature uses 5 or 10 for this parameter to improve model performance.
generator_rounds (int, required, defaults to 1) - Number of optimization steps of the generator to perform for each batch.
Differential privacy is currently not supported for the Gretel DGAN model.
Conditional data generation (smart seeding) is currently not supported for the Gretel DGAN model.
Sample CLI to generate 1000 additional examples from a trained DGAN model:
Also see the example on how to Generate data from a model.
The underlying model is DoppelGANger, a generative adversarial network (GAN) specifically constructed for time series data. The model is initialized from random weights and trained on the provided dataset using the Wasserstein GAN loss. We use our PyTorch implementation of DoppelGANger in gretelai/gretel-synthetics based on the original paper by Lin et al. Additional details about the model can be found in that paper: http://arxiv.org/abs/1909.13403
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required). Minimum Nvidia T4 or similar CUDA compliant GPU with 16GB+ RAM is recommended to run the DGAN model.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
As an open beta model, there are several technical limitations:
Model training is sometimes unstable; if you see poor performance, retraining a few times with the same data and config can sometimes lead to notably better results.
All training and generated sequences must be exactly the same length (max_sequence_len).
Synthetic quality report is not supported.
DGAN does not model missing data (NaNs) for continuous variables. DGAN will handle some NaNs in the input data by replacing missing values via interpolation. However, if there are too many missing values, the model may not have enough data or examples to train and will throw an error. NaN or missing values will never be generated for continuous variables. (This does not apply to categorical variables, where missing values are fully supported and modeled as just another category.)
We have fine-tuned GLiNER on the entity types shown in the table below, although Gretel Transform will attempt to classify any arbitrary entity type specified.
| Entity | Label | Description | Regulations |
|---|---|---|---|
| account_number | Account Number | A unique identifier for a financial account, such as a bank account or credit card. | GDPR, HIPAA, CPRA |
| address | Address | A physical address, including street, city, state, and/or country. | GDPR, HIPAA |
| api_key | API Key | A unique identifier that authenticates a user, developer, or program to an application programming interface (API). | GDPR, CPRA |
| bank_routing_number | Bank Routing Number | An American bank association routing number. | GDPR, HIPAA, CPRA |
| biometric_identifier | Biometric Identifier | A unique physical characteristic of an individual used to identify them. | GDPR, HIPAA, CPRA |
| certificate_license_number | Certificate License Number | A unique, traceable number assigned to a certificate or license. | GDPR, HIPAA, CPRA |
| city | City | A city in the world. | GDPR, HIPAA, CPRA |
| company_name | Company Name | A company name. | |
| coordinate | GPS Coordinate | A combination of latitude and longitude into a single tuple. | GDPR, HIPAA |
| country | Country | A country in the world. | |
| credit_card_number | Credit Card Number | A credit card number, 12 to 19 digits long, used for payment transactions globally. | GDPR, HIPAA, CPRA |
| customer_id | Customer ID | A unique code or number that identifies a customer or entity. | |
| cvv | Credit Card Verification Value | A unique three or four digit number on a payment card that helps prevent fraud. | GDPR, HIPAA, CPRA |
| date | Date | A date. This includes most date formats, as well as the names of common world holidays. | HIPAA |
| date_of_birth | Date of Birth | A date of birth. | GDPR, HIPAA, CPRA |
| date_time | Date Time | A date and timestamp. This includes most date/time formats. | HIPAA |
| device_identifier | Device Identifier | A unique string of numbers and letters that identifies a device, such as a mobile phone or computer. | GDPR, HIPAA, CPRA |
| email_address | Email Address | An email address identifies the mailbox that emails are sent to or from. The maximum length of the domain name is 255 characters, and the maximum length of the local-part is 64 characters. | GDPR, HIPAA, CPRA |
| employee_id | Employee ID | An ID number associated with an employee to identify them within their system. | |
| first_name | First Name | A first name for a person. | GDPR, HIPAA, CPRA |
| health_plan_beneficiary_number | Health Plan Beneficiary Number | A unique number assigned to an individual by their health insurance provider to identify them within their system. | HIPAA |
| ipv4 | IP Address (version 4) | An Internet Protocol (IP) address for IPv4. | GDPR, HIPAA, CPRA |
| ipv6 | IP Address (version 6) | An Internet Protocol (IP) address for IPv6. | GDPR, HIPAA, CPRA |
| last_name | Last Name | A last name for a person. | GDPR, HIPAA, CPRA |
| license_plate | License Plate Number | A license plate number used to identify a vehicle. | GDPR |
| medical_record_number | Medical Record Number | A unique identifier for a patient's medical records in a healthcare system. | HIPAA |
| name | Name | A full person name, which can include first names, middle names or initials, and last names. | GDPR, HIPAA, CPRA |
| national_id | National ID | A unique identifier issued by a government to track its citizens and residents. | GDPR, HIPAA, CPRA |
| password | Password | A password used to log in to a computer network. | CPRA |
| phone_number | Phone Number | A telephone number. | GDPR, HIPAA, CPRA |
| pin | Personal Identification Number | A numerical code issued with a payment card that is required to be entered to complete various financial transactions. | GDPR |
| postcode | Postal Code | Postal code used by the United States Postal Service. | GDPR, HIPAA |
| ssn | US Social Security Number | A 9-digit number issued to US citizens, permanent residents, and temporary residents. The Social Security number has effectively become the United States national identification number. | GDPR, HIPAA, CPRA |
| state | USA State | A state in the United States of America. | GDPR |
| street_address | Street Address | A physical street address. | GDPR, HIPAA, CPRA |
| swift_bic | Business Identifier Code | A SWIFT code is the same as a Bank Identifier Code (BIC): a unique identification code for a particular bank, used when transferring money between banks (particularly for international wire transfers) and for exchanging other messages. | GDPR, HIPAA, CPRA |
| tax_id | Tax ID | A Taxpayer Identification Number (TIN) is an identification number used by the Internal Revenue Service (IRS) in the administration of tax laws. | GDPR, HIPAA, CPRA |
| time | Time | A timestamp of a specific time of day. | |
| unique_identifier | Unique ID | A Universally Unique Identifier (UUID). | |
| url | URL | A Uniform Resource Locator (URL). | GDPR, HIPAA, CPRA |
| user_name | User Name | A username used to uniquely identify a user on a computer network. | CPRA |
| vehicle_identifier | Vehicle Identification Number | A VIN is composed of 17 characters (digits and capital letters) and acts as a unique identifier for a vehicle. | GDPR, CPRA |
Transform supports the following transformation types for entities:
Fake: Replaces the value with synthetic data
Note that only entities supported by Faker can be faked.
Hash: Anonymizes by converting data to a unique alphanumeric value
Normalize: Ensures data consistency by removing spaces and punctuation marks
Expression: Allows custom transformations
Null: No transformation applied
Terminology and core concepts that make up Gretel Workflows.
A Workflow is the top-level organizational unit for a Workflow config. Workflows are part of projects and share the same project permissions; see Permissions for more details. Projects can have multiple workflows.
A Workflow is typically created for a specific use case or data source. You can think of a Workflow like a data pipeline or DAG.
The core configuration interface is a YAML config. You can edit and create Workflow YAML configs from the Console, SDK or CLI. These configs define what the workflow does, and when.
For a more detailed reference please see the Config Syntax docs.
Workflows are composed of many Workflow Actions. Actions are configured with inputs and produce outputs that determine the execution flow of the Workflow.
Each Workflow Action is responsible for integrating with some service and performing some processing on its set of inputs. These services could be external data stores (e.g. for reading source data or writing synthetic data), or Gretel (e.g. for training and running models).
Connections are used to authenticate a Gretel Action to an external service such as GCS or Snowflake. Each action is tied to at most one external service, and needs to be configured with a connection for the appropriate service.
For more detail on connections, including a full list of available connector types, see Connectors.
Triggers are managed as a property on the workflow config and can be used to schedule Workflows.
See Scheduled Workflows for more information.
A Workflow Run represents the concrete execution of a Workflow. When a Workflow is either manually triggered or triggered from a schedule, a Workflow Run is created.
To use data extracted by a connector as training input to a Gretel model, we need to understand how data is passed between Workflow Actions. Each Workflow Action produces a set of outputs that can be referenced by downstream actions as inputs.
These inputs are configured in each action's config block as template expressions. The properties of these inputs may take a number of different forms depending on the type of data being worked with.
The file data structure holds information about a data file, such as a CSV in object storage. Its properties are:
data (string) - the data handle
filename (string) - the stem of the file (e.g. events.csv)
source_filename (string) - the name of the file with any path prefix (e.g. sources/events.csv)
The table data structure holds information about a table extracted from a relational database or data warehouse. Its properties are:
data (string) - the data handle
name (string) - the name of the table
A dataset is an umbrella data structure containing collections of files and tables, as well as metadata like table relationships used internally by various actions. All actions output exactly one dataset. Its properties are:
files - list of file objects
tables - list of table objects
Some actions natively work with files, such as actions interfacing with object stores. Others natively work with tables, such as those connecting to relational databases. A dataset will contain both a file and a table representation of every data source. This allows you to create workflows that extract data from one kind of data source but write to a different type of destination.
file and table names are formatted with downstream compatibility in mind. An object store source action will preserve file names as-is and create database-friendly names for the corresponding table representation. Similarly, a database source action will preserve table names as-is and create file storage-friendly names for the corresponding file representation.
All Gretel Workflow actions output a dataset object that can then be referenced from a template expression in subsequent actions. Some actions require an entire dataset as input, while others require finer-grained inputs like file names and data handles. Each action documents its required inputs.
For more detail on template expression syntax, see the Config Syntax docs.
Automate creating, training and running Gretel Models.
Gretel Workflows offers two action types for working with Gretel models: gretel_model and gretel_tabular. Both take a collection of data output from a source action and create and run jobs for each file or table in the dataset. The main difference between these two actions is that gretel_tabular understands relationships between tables in a dataset (if any exist) and can guarantee referential integrity between tables is maintained in the output.
Reference docs for Gretel Models.
Gretel provides a number of different model types which may be utilized directly or combined via workflows. This page will outline the different categories of models that Gretel offers.
Gretel configurations are declarative objects that specify how a model should be created. Configurations can be authored in YAML or JSON. Each of the below models will be declared and configured via a model configuration.
For more information, please refer to the Model Configurations documentation. For more information about each of the specific model types, refer to their individual sections.
Gretel offers the following synthetics models:
Tabular Fine-Tuning - Gretel’s flagship LLM-based model for generating privacy-preserving, real-world quality synthetic data across numeric, categorical, text, JSON, and event-based tabular data with up to ~50 columns.
Data types: Numeric, categorical, text, JSON, event-based
Differential privacy: Optional
Formerly known as: Navigator Fine Tuning
Text Fine-Tuning - Gretel’s model for generating privacy-preserving synthetic text using your choice of top performing open-source models.
Data types: Text
Differential privacy: Optional
Formerly known as: GPT
Tabular GAN - Gretel’s model for quickly generating synthetic numeric and categorical data for high-dimensional datasets (>50 columns) while preserving relationships between numeric and categorical columns.
Data types: Numeric, categorical
Differential privacy: NOT supported
Formerly known as: ACTGAN
Tabular DP - Gretel’s model for generating differentially-private data with very low epsilon values (maximum privacy). It is best for basic analytics use cases (e.g. pairwise modeling), and runs on CPU. If your use case is training an ML model to learn deep insights in the data, Tabular Fine-Tuning is your best option.
Data types: Numeric, categorical
Differential privacy: Required; you cannot run without differential privacy
You can learn more about Gretel Synthetics models here.
Gretel’s Transform model combines data classification with data transformation to easily detect and anonymize or mutate sensitive data. Gretel’s data classification can detect a variety of Supported Entities such as PII, which can be used for defining transforms.
We generally recommend combining Gretel Transform with Gretel Synthetics using workflows to redact or replace sensitive data before training a synthetics model.
You can learn more about Gretel Transform here.
You can use the flow chart below to help determine whether Transform, Synthetics (with or without Differential Privacy), or the combination is best for your use case.
If you decided that you should use Synthetics as part of your use case, you can use the next flow chart to help determine which Synthetics model may be best.
Use Gretel's privacy protection mechanisms to prevent adversarial attacks and better meet your data sharing needs.
In addition to the privacy inherent in the use of synthetic data, we can add supplemental protection by means of Gretel's privacy mechanisms. These file configuration settings help to ensure that the generated data is safe from adversarial attacks.
There are three privacy protection mechanisms:
Differential Privacy: Differential privacy is supported with Tabular Fine-Tuning (numeric, categorical, and free text data), Text Fine-Tuning (free text data only), and Tabular DP (numeric and categorical data only, when a very small ε < 5 is required). To enable differential privacy for Tabular Fine-Tuning and Text Fine-Tuning, set dp: true. Tabular DP always runs with differential privacy.
Similarity Filters: Similarity filters ensure that no synthetic record is overly similar to a training record. Overly similar training records can be a severe privacy risk, as adversarial attacks commonly exploit such records to gain insights into the original data. Similarity filtering is enabled by the privacy_filters.similarity configuration setting. Similarity filters are available for Gretel Tabular GAN.
Outlier Filters: Outlier filters ensure that no synthetic record is an outlier with respect to the training dataset. Outliers revealed in the synthetic dataset can be exploited by membership inference attacks, attribute inference attacks, and a wide variety of other adversarial attacks, making them a serious privacy risk. Outlier filtering is enabled by the privacy_filters.outliers configuration setting. Outlier filters are available for Gretel Tabular GAN.
Synthetic model training and generation are driven by a configuration file. Here is an example configuration with differential privacy enabled for Tabular Fine-Tuning.
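As a rough sketch, such a configuration might look like the following. The navigator_ft tag and the privacy_params nesting are assumptions; dp: true is the documented switch, and the epsilon value is illustrative:

```yaml
schema_version: "1.0"
models:
  - navigator_ft:              # Tabular Fine-Tuning (assumed model tag)
      data_source: __tmp__
      privacy_params:          # assumed nesting
        dp: true               # enable differential privacy (documented)
        epsilon: 8             # illustrative privacy budget
```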
Here is an example configuration with privacy filters set for Gretel Tabular GAN.
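A sketch of such a configuration is shown below. The actgan tag and the filter levels are assumptions; privacy_filters.similarity and privacy_filters.outliers are the documented settings:

```yaml
schema_version: "1.0"
models:
  - actgan:                    # Tabular GAN (assumed model tag)
      data_source: __tmp__
      privacy_filters:
        similarity: medium     # illustrative level
        outliers: medium       # illustrative level
```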
Your Data Privacy Score is calculated by measuring the protection of your data against simulated adversarial attacks.
Values can range from Excellent to Poor, and we provide a list detailing whether your Data Privacy Score is sufficient for a given data-sharing use case.
We provide a summary of the protection level against Membership Inference Attacks and Attribute Inference Attacks.
For each metric, we provide a breakdown of the attack results that contributed to the score.
Membership Inference Protection is a measure of how well-protected your data is from membership inference attacks. A membership inference attack is a type of privacy attack on machine learning models where an adversary aims to determine whether a particular data sample was part of the model's training dataset. By exploiting the differences in the model's responses to data points from its training set versus those it has never seen before, an attacker can attempt to infer membership. This type of attack can have critical privacy implications, as it can reveal whether specific individuals' data was used to train the model. To simulate this attack, we take a 5% holdout from the training data prior to training the model. Based on directly analyzing the synthetic output, a high score indicates that your training data is well-protected from this type of attack. The score is based on 360 simulated attacks, and the percentages indicate how many fell into each protection level.
Attribute Inference Protection is a measure of how well-protected your data is from attribute inference attacks. An attribute inference attack is a type of privacy attack on machine learning models where an adversary seeks to infer missing attributes or sensitive information about individuals from their data that was used to train the model. By leveraging the model's output, the attacker can attempt to predict unknown attributes of a data sample. This type of attack poses significant privacy risks, as it can uncover sensitive details about individuals that were not intended to be revealed by the data owners. Based on directly analyzing the synthetic output, an overall high score indicates that your training data is well-protected from this type of attack. For a specific attribute, a high score indicates that even when other attributes are known, that specific attribute is difficult to predict.
The Gretel Transform model can be applied to multiple related tables in a database at once, providing structured transformations without losing referential integrity across tables.
This functionality is executed through Gretel Workflows.
Use a native connector to extract data from your source.
Train and run models via the gretel_tabular action.
Optionally, write output data to a destination sink.
Optionally, write output reports to an object store of your choice.
The gretel_tabular action can be used to train and generate records from Gretel Models. It helps maintain referential integrity between related tables, and also allows specifying different model configs for different tables. This functionality is currently available only via the SDK. Read about Gretel Tabular.
The example notebooks above use a special connection, sample_mysql_telecom, which connects to a demo telecommunications database:
Automate and operationalize synthetic data using Gretel Workflows
Gretel Workflows provide an easy to use, config driven API for automating and operationalizing Gretel. Using Connectors, you can connect Gretel Workflows to various data sources such as S3 or MySQL and schedule recurring jobs to make it easy to securely share data across your organization.
A Gretel Workflow is constructed of actions that connect to various services including object stores and databases. These actions are then composed to create a pipeline for processing data with Gretel. In the example above:
A source action is configured to extract data from a source, such as S3 or MySQL.
The extracted source data is passed as inputs to Gretel Models. Using Workflows you can chain together different types of models based on specific use cases or privacy needs.
A destination action writes output data from the models to a sink.
Log into the Gretel Console.
Navigate to the Workflows page using the menu item in the left side bar and follow the instructions to create a new workflow.
The wizard-based flow will guide you through model selection, data source and destination creation, and workflow configuration.
Once completed, all workflow runs can be viewed for a particular workflow via the Workflow page, or for all workflows and models on the Activity page.
For more detailed step-by-step instructions, see Managing Workflows.
Workflows are configured using YAML. Below is an example workflow config that crawls an Amazon S3 bucket and creates an anonymized synthetic copy of the bucket contents in a destination bucket.
This second example workflow config connects to a MySQL database, creates a synthetic version of the database, and writes it to an output MySQL database.
Next, we'll dive deeper into the components that make up Workflows. You may also want to check out a list of supported sources and sinks here: Connectors.
Specifying primary and foreign keys on data sourced from object stores (where such metadata does not exist as it does in a relational database)
Removing a foreign key to break a cyclic table relationship
Renaming tables
The dataset_editor action provides a way to apply alterations like these and more to datasets. It accepts a dataset from some other action as input, and outputs a modified version of that dataset for downstream actions to consume.
A table relationship is used to relate two tables. The most common example is a foreign key constraint in a relational database.
A table_relationship contains the following properties:
For example, a relational database storing users and their sessions might have a foreign key user_id on the sessions table pointing to the users.id column. That key can be represented as a table_relationship:
To add or remove relationships via the dataset_editor action, use the add_table_relationships and remove_table_relationships attributes, both of which accept a table_relationship list.
There are two ways to rename tables. First, tables can be renamed individually:
Alternatively, common prefixes and suffixes can be added or removed in bulk. This is particularly useful for renaming tables sourced from object storage.
Note that both these renaming mechanisms only apply to tables in the dataset; the corresponding file representations in the dataset are unaffected.
Tables in a dataset can be removed entirely:
Actions downstream of drop-tables will have no awareness of the extraneous_data table.
Dropping a table from a dataset also drops the corresponding file representation.
Primary keys can be specified on tables.
The dataset editor can combine multiple datasets into one, allowing a single downstream action to operate on data extracted from disparate sources.
Table names must be unique across all datasets. The rename_all_tables option (see "Renaming tables" above) can be used to resolve name conflicts.
Note that actions only accept a single input action (input: s3-read in the example above). To use outputs from multiple actions in a single config, the other actions must be transitive dependencies via the defined input action. In this particular example, the s3-read action would need to include input: mysql-extract; the S3 action does not use any outputs from MySQL, but defining it as a dependency ensures outputs from both actions are accessible to the dataset editor action.
All dataset modifications above can be performed in a single action. The order of operations is:
datasets.rename_all_tables (and merge, if there are multiple datasets)
drop_tables
rename_tables
set_primary_keys
remove_table_relationships
add_table_relationships
The example below lists these in order for convenience; the actual order of these keys in your yaml config does not matter.
Transform configurations consist of (optional) global parameters followed by a sequence of transformation steps. Rows, columns, and values not touched by any transformation step are maintained as-is in the output. In other words, Transform configs are implicitly "passthrough".
Below is a "kitchen sink" config showing most of Transform capabilities. Don't worry if it looks overwhelming. We will dissect each step in the reference below.
The entire globals section is optional. You can use it to re-configure the following default entity detection and transformation settings:
classify: Dictionary of classification configuration parameters. Note that classification is only performed once for each model, and currently only maps entire columns to entities (searching for entities within free text fields, similarly to Transform's use_nlp option, is not currently supported in Transform). Subsequent model runs will assume the schema remains unchanged, and continue to use the column-to-entity mapping detected during the first run. NOTE: this sends column headers and a sample of data to Gretel Navigator or a hybrid-deployed Gretel Inference LLM to perform the classification.
enable: Boolean specifying whether to perform classification. Defaults to true when running within Gretel Cloud; defaults to false otherwise. When false, sets column.entity to none for all columns. When true, classification accuracy currently necessitates sending column names and a few (equal to num_samples) randomly selected values from each column to the Gretel Cloud.
num_samples: Number of randomly selected values from each column to use for classification. Defaults to 3, but you can set it to a higher number for more accurate classification, or a lower number if you have privacy or security concerns with sending randomly sampled values from your dataset. Setting num_samples: 0 will use only column names as the input to classification.
ner: Named entity recognition settings.
seed: Integer seed value used to generate fake values consistently. Defaults to null. When the seed is set to null, a random integer is generated at the beginning of each Transform run and used as the seed to transform values consistently within the current run (subsequent runs will generate their own random seed). This means rerunning with a null seed can cause inconsistent transforms (i.e. Alice -> Bob for the first run, Alice -> Jane for the second). If you set the seed to a specific number, transforms will be consistent across runs (i.e. Alice -> Bob always). The seed also doubles as a salt for the hash function. While there are privacy benefits to inconsistent transformations, we recommend setting a fixed seed for consistent transformation for use cases involving downstream synthetic data generation or analysis on the transformed dataset.
You can also access global constants in transformation steps. For example, a transformation step with value: globals.locales | first will set that field's value to the first locale in the list of locales.
steps contains an ordered list of data transformation actions to be executed in the same order as they are defined in the Transform config.
Each step can optionally contain a vars section, which defines custom variables to be used in any Jinja expression within the step. Unlike globals, vars are scoped to an individual step, and are initialized using Jinja expressions that are evaluated at the beginning of each step.
The columns section of each step contains transformations applying to an entire column at once: adding a new column, dropping (removing) a column, and renaming a column.
You can add a new blank column (which you can later fill in using a rows update action) by specifying its name and an optional position. If position is left unspecified, the new column is added as the last column. Initially all values in the new column will be null, but you can populate them using a rows.update rule. For example, the config section below adds a primary_key column, positions it as the first column in the dataset, and then populates it with the index of the row:
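A minimal sketch of that section, assuming the columns/rows syntax described on this page:

```yaml
steps:
  - columns:
      add:
        - name: primary_key
          position: 0          # make it the first column
    rows:
      update:
        - name: primary_key
          value: index         # zero-based row index
```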
To drop a column, specify its name in a columns drop action. For example, the config section below drops the FirstName and LastName columns:
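A minimal sketch of such a drop rule, using the same assumed syntax:

```yaml
steps:
  - columns:
      drop:
        - name: FirstName
        - name: LastName
```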
You can also drop columns based on a condition expressed as a Jinja template. condition has access to the entire Transform Jinja environment, as well as a few additional objects:
column: Dictionary containing the following column properties. For example, condition: column.entity in vars.entities_to_drop drops all columns matching the list of PII entities defined in the entities_to_drop variable.
name: the name or header of the column in the dataset.
entity: the detected PII entity type of the column, or none if the column does not match any PII entity type from the list under globals.classify.entities.
type: the detected data type of the column, one of "empty", "numeric", "categorical", "binary", "text", or "other".
position: zero-indexed position of the column in the dataset. For a dataset with 10 columns, column.position is equal to 0 for the first column and 9 for the last column.
You can rename a column by specifying its current name (name) and new name (value). For example, the config section below renames the MiddleName column to MiddleInitial:
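A minimal sketch of that rename rule:

```yaml
steps:
  - columns:
      rename:
        - name: MiddleName
          value: MiddleInitial
```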
Each step can also contain a rows section, listing transformation rules that process the dataset row by row. The two currently supported operations are drop and update, respectively allowing for selective removal of rows or modification of row data based on specified rules.
The drop operation within the rows section is used to remove rows from the dataset that meet certain conditions. The specified condition must be a valid Jinja template expression. Rows that satisfy the condition are excluded from the resulting dataset.
For instance, to exclude rows where the user_id column is empty, the configuration can be specified as follows:
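A minimal sketch, using the isna filter described later on this page:

```yaml
steps:
  - rows:
      drop:
        - condition: row.user_id | isna
```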
You can use more complex Jinja expressions for conditions that involve multiple columns, logical operators, or functions. condition has access to the entire Transform Jinja environment, as well as a few additional objects:
row: Dictionary of the row's contents. For example, row.user_id refers to the value of the user_id column within that row.
index: Zero-based index of the row in the dataset. Note that the index of a row may change during processing if previous steps delete or add rows. For example, the rule below drops every other record from the dataset:
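A minimal sketch of such a rule (assuming even-indexed rows are the ones dropped):

```yaml
steps:
  - rows:
      drop:
        - condition: index % 2 == 0   # drops rows 0, 2, 4, ...
```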
The update operation allows you to modify the values of specific rows. It can be used to set new values for columns, generate fake data, anonymize sensitive information, or apply any transformation that can be expressed as a Jinja template.
Each update operation must contain one of name, entity, type or condition, which are different ways to specify what to update, as well as value, which contains the updated value. name and entity must be strings or lists of strings, while condition and value are Jinja templates.
You can also optionally specify a fallback_value to be used if evaluating value throws an error. We recommend doing this when passing dynamic inputs to functions in value (for example, setting the Faker locale based on the contents of another column), preferably with a simple template (e.g. using static parameter values) for fallback_value to avoid further errors. In the event that both value and fallback_value fail to parse, the value will be set to the error message to aid with debugging.
condition, value, and fallback_value in row update rules have access to the row drop Jinja environment, including vars, row, and index, as well as a few additional objects:
column: Dictionary referring to the current column whose value is being changed. The properties of the column that can be accessed are:
name: the name of the column.
entity: the name of an entity that is in the column.
type: a Gretel extracted generic type for the column, one of: empty, numeric, categorical, text, binary, or other.
dtype: the Pandas dtype of the column (object, int32, etc.).
position: the numerical (index) position of the column in the table.
this: Literal referring to the current value that is being changed. For example, value: this is a no-op which leaves the current value unchanged, while value: this | sha256 replaces the current value with its SHA-256 hash.
Here's how the update operation works, with examples:
Setting a static value
The rule below sets the value of the column named status_column to the string processed for all rows.
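A minimal sketch; the string is quoted so the Jinja template evaluates to a literal rather than a variable name:

```yaml
steps:
  - rows:
      update:
        - name: status_column
          value: '"processed"'
```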
Incrementing an index
In the example below, we use the index special variable to set the value of the column row_index to the index of the record in the dataset, e.g. for a dataset containing 100 rows, the value of row_index for the last row will be 99.
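A minimal sketch of that rule:

```yaml
steps:
  - rows:
      update:
        - name: row_index
          value: index
```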
Generating fake PII
The example below replaces values in all columns detected to contain email addresses with fake email addresses. Notice that unlike previous examples, where the update rule was conditioned on name (the name of a column), the rule below is conditioned on entity (the type of entity contained within a column), which may match multiple columns. For example, if the dataset contains personal_email and work_email columns, the rule below will replace the contents of both with fake email addresses.
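A minimal sketch; the entity name matches the supported entities table above, and the Faker call is an assumption:

```yaml
steps:
  - rows:
      update:
        - entity: email_address
          value: fake.email()
```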
Modifying based on a condition
You can also conditionally update rows using flexible Jinja conditions. These conditions may match any number of columns and any number of rows (unlike name and entity conditions, which apply to all rows).
For example, you can set the value of the flag_for_review column to true for all rows where the value of the amount column is greater than 1,000:
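A sketch of such a rule; here the condition uses the column object (described below) to target the flag_for_review column, which is an assumption about how condition-based updates select columns:

```yaml
steps:
  - rows:
      update:
        - condition: column.name == "flag_for_review" and row.amount > 1000
          value: "true"
```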
Transform incorporates a classification feature to detect personal identifiable information (PII) within data. This feature simplifies selecting and transforming specific types of PII by tagging each column with its appropriate entity, if any.
Here is an example configuration that uses classification for detecting these 3 entities and applying transformations:
Since these align with Faker built-in entities, we could also write a single rule that applies to all detected entities:
With this setting, Transform will first classify entities in the dataset, then replace detected entities with faker-generated ones for each row in the dataset.
If your list of entities contains custom entities not supported by Faker, you can leverage fallback_value to apply other transformations. For example, the policy below attempts to fake all entities, and falls back to hashing unsupported entities. Since iban is supported by Faker while employee_id is not, the output of this policy will be fake IBAN values in the IBAN column, and hashes of the actual employee IDs in the employee ID column.
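A minimal sketch of such a policy, using the fake and hash filters described in the Jinja filters section below:

```yaml
steps:
  - rows:
      update:
        - entity: [iban, employee_id]
          value: column.entity | fake      # fake the detected entity type
          fallback_value: this | hash      # hash values whose entity Faker cannot generate
```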
If instead you wish to replace unsupported entities with the entity name between brackets, you could set fallback_value: "<" + column.entity + ">". You could also generate custom fake values; for example, to replace all entities not supported by Faker with the letter "E" followed by a random 6-digit number, you could set fallback_value: "E" + fake.pyint(100000, 999999) | string, or use Jinja's concatenation operator ~, which automatically converts integers to strings: fallback_value: "E" ~ fake.pyint(100000, 999999).
Similarly to column classification, Transform supports flexible Named Entity Recognition (NER) functionality including the ability to detect and transform custom entity types.
To get started, list the entities to detect under the globals.ner.entities section and use one of the four built-in NER transformation filters:
redact_entities replaces detected entities with the entity type. For example, "I met Sally" becomes "I met <first_name>".
fake_entities replaces detected entities with randomly generated fake values using the Faker function corresponding to the entity type. For example, "I met Sally" could become "I met Joe". When using fake_entities, ensure the name of the entity in the globals.classify.entities section exactly matches the name of a Faker function. Entities without a matching Faker function are redacted by default, and you can customize the fallback behavior using the on_error parameter, e.g. fake_entities(on_error="hash") hashes the non-Faker-matching entities instead of redacting them.
hash_entities replaces detected entities with salted hashes of their value. For example, "I met Sally" may become "I met 515acf74f".
label_entities is similar to redact_entities, but also includes the entity value. For example, "I met Sally" becomes "I met <entity type="first_name" value="Sally">". This can be useful for downstream post-processing (such as highlighting detected entities within the original text, or applying more complex replacement logic for specific entity types), both within Transform and externally.
You can tweak the ner_threshold
parameter if you notice too many or too few detections. You can think of the NER threshold as the level of confidence required in the model's detection before labeling an entity. Increasing the NER threshold decreases the number of detected entities, while decreasing the NER threshold increases the number of detected entities. Values between 0.5 and 0.8 are good starting points for avoiding false positives. Values below 0.5 are good if you don't want any leaked entities.
The sample config below shows how to apply fake_entities
(falling back to redact_entities
) for a list of custom entity types across all free text fields:
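A sketch of such a config; the custom entity names and the notes column are placeholders, and how you target free-text columns may differ in your setup:

```yaml
globals:
  ner:
    ner_threshold: 0.7
    entities: [first_name, last_name, employer]     # employer stands in for a custom entity type
steps:
  - rows:
      update:
        - name: notes                                     # hypothetical free-text column
          value: this | fake_entities(on_error="redact")  # fake detected entities, redact the rest
```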
Additionally, if you would like to speed up Named Entity Recognition by having it run on hardware with a GPU, you can set the globals.ner.ner_optimized
flag to true
:
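For example (shown here alongside an entity list):

```yaml
globals:
  ner:
    ner_optimized: true          # run NER on GPU-backed hardware
    entities: [first_name, last_name]
```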
Once you've done that, you can specify the Gretel Inference LLM model via Transform's globals.classify.deployed_llm_name
configuration field. This name should match the gretelLLMConfig.modelName
defined in the Gretel Inference LLM's values.yml
.
Here's how to perform the above PII detection using mistral-7b
deployed in your Gretel Hybrid Cluster:
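A sketch of such a config; the entity list is illustrative:

```yaml
globals:
  classify:
    deployed_llm_name: mistral-7b    # must match gretelLLMConfig.modelName in the LLM chart's values.yml
    entities: [phone_number, email, iban]
steps:
  - rows:
      update:
        - condition: column.entity is not none
          value: column.entity | fake
```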
Every Jinja environment in Transform can access the objects below:
Transform extends the capabilities of the standard Jinja filters with its own specific set. These include:
hash
: Computes the SHA-256 hash of a value. For example, this | hash
returns a hash of the value in the matched column in a row update rule. It can also take in its own salt, i.e. this | hash(salt="my-salt")
, but by default it uses the seed
value of the run as the salt. If the seed is unset, the hash will be different for the same values across runs.
isna
: Returns true
if a value is null or missing.
fake
: Invokes the Faker library to generate fake data of the entity that's passed to the filter. This is useful if the entity name is dynamic, e.g. column.type | fake
is equivalent to fake.first_name()
if column.type
is equal to "first_name"
.
lookup_locales
: Maps a pycountry Country to a list of Faker locales for that country. For example "Canada" | lookup_country | lookup_locales
returns ["en_CA", "fr_CA"]
.
normalize
: Removes special characters and converts Unicode strings to an ASCII representation.
tld
: Maps a pycountry Country object to its corresponding top-level domain. For example, "France" | lookup_country | tld
evaluates to .fr
.
Workflows are configured using YAML and can be managed from the Gretel Console, SDK, or CLI.
Workflows are configured using three top-level blocks: name
, trigger
, and actions
.
The name
field sets the name of the workflow. This name is used as the canonical reference to the workflow. Workflow names do not need to be unique to a project, but should be descriptive enough to uniquely describe the purpose of the workflow.
Triggers may be used to schedule recurring workflows using standard cron syntax. To schedule a workflow to run once daily, a workflow trigger might look like this:
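A sketch of a once-daily trigger (the exact key nesting under trigger is an assumption):

```yaml
trigger:
  cron:
    pattern: "0 0 * * *"   # once a day at midnight UTC
```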
The actions
block configures each step in the workflow.
Each action definition carries the same top-level configuration envelope with the following fields:
Template expressions are used to dynamically configure actions based on the result of a preceding action. Template expressions are denoted by curly braces, i.e. {<template-expression>}
.
Action outputs are accessed via the following form:
For example, a dataset output from a MySQL source action would be referenced like this:
You can append attribute components to the expression to dive into the output data structure. For example, to get the filename of each object from an Azure blob storage source action:
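A sketch covering the three forms described above; the action names mysql-read and azure-read are placeholders:

```yaml
# General form:
#   {outputs.<action name>.<output name>}

# A dataset output from a MySQL source action:
training_data: "{outputs.mysql-read.dataset}"

# Appending attribute components, e.g. the filename of each object
# crawled by an Azure Blob storage source action:
filename: "{outputs.azure-read.dataset.files.filename}"
```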
Consider the following workflow config:
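(A sketch of the kind of config the walkthrough below describes; connection IDs, the project ID, and the blueprint reference are placeholders.)

```yaml
name: sample-s3-workflow
actions:
  - name: s3-read
    type: s3_source
    connection: c_source_bucket              # placeholder connection ID
    config:
      bucket: my-source-bucket
      glob_filter: "*.csv"
  - name: model-train-run
    type: gretel_model
    input: s3-read
    config:
      project_id: proj_xxxxxxxx              # placeholder project ID
      model: synthetics/tabular-actgan       # placeholder blueprint reference
      run_params: {}
      training_data: "{outputs.s3-read.dataset.files.data}"
  - name: s3-write
    type: s3_destination
    connection: c_destination_bucket         # placeholder connection ID
    input: model-train-run
    config:
      bucket: my-destination-bucket
      filename: "{outputs.s3-read.dataset.files.filename}"
      input: "{outputs.model-train-run.dataset.files.data}"
```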
In this config the s3-read
action outputs a dataset
object. In the next action - model-train-run
- we use the template expression {outputs.s3-read.dataset.files.data}
to define the training_data used for that action. When executing the workflow, the expression is resolved to a concrete set of values based on the outputs of s3-read
.
If the s3-read
action finds two files, a.csv
and b.csv
, we will enumerate two concrete instances of the model-train-run
config with:
training_data: <data handle to a.csv>
training_data: <data handle to b.csv>
Each instance of the config will get passed into the model-train-run
action, resulting in two trained models, one model for a.csv
and another for b.csv
.
Additionally, an action config can include multiple template expressions referring to different lists. For example, the s3-write
action above is configured with two template expressions, one referencing the original source filename, the other referencing synthesized data. The workflow runtime will automatically resolve these expressions to align such that again, there are two concrete instances of the s3-write
config enumerated, with:
filename: "a.csv"
input: <data handle to the synthetic output from the model trained on a.csv>
filename: "b.csv"
input: <data handle to the synthetic output from the model trained on b.csv>
Workflows can be scheduled on a recurring basis
Using the trigger
field of a workflow config, you can configure your Workflow to run on a schedule with the cron
setting.
The following workflow config is configured to run every two hours:
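A sketch (the key nesting under trigger is an assumption; actions are omitted for brevity):

```yaml
name: run-every-two-hours
trigger:
  cron:
    pattern: "0 */2 * * *"   # at minute 0 of every second hour
actions: []
```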
You may use one of several pre-defined schedules in place of a cron expression.
Each workflow can only have a single active run at a time. If a workflow is still running while a subsequent scheduled workflow reaches the evaluation window, the next workflow run is prevented from launching until the current run completes.
Transform a dataset by applying a consistent model to all tables in the dataset. Note that the model config can be specified as a full object...
...or a reference to a blueprint template can be provided via from
:
You can apply different model configs to different tables by supplying table-specific configs:
To pass a subset of tables through unaltered by the model (e.g. for static reference data), specify tables to skip:
Instead of providing a specific model config, you can instruct the gretel_tabular
action to run trials to identify the best model config for each table. This is accomplished via the autotune
option inside model_config
fields (at either the root train
level to apply to all tables, or inside a table_specific_config
to apply to only a subset of tables).
Autotune objects accept the following fields:
Using all autotune defaults:
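A sketch of the train block in that case:

```yaml
train:
  model_config:
    autotune:
      enabled: true    # all other autotune fields fall back to their defaults
```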
...or a Tuner config can be spelled out explicitly:
Integrate Gretel with your existing data services using Workflows.
Connections are used to authenticate Workflow Actions. Each action is compatible with a specific type of connection. For example, the s3_source
action requires an s3
connection.
When creating a connection, you must select a project for the connection to reside in. The connection will inherit all the project permissions and user memberships.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example S3 connection
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection button in the top right corner.
Select the project where the connection will be stored.
Next, fill in your credentials and select Add Connection. The example below shows the flow for creating an Amazon S3 connection. All connections follow the same pattern.
The connection can now be used in a workflow.
Select the option to connect to an external data source, and choose the connection you created above, or create a new one.
Then run the following command from
Navigate to the Connections page using the menu item in the left sidebar.
Go to the Connection you'd like to update and click the three vertical dots (aka overflow actions).
Select Update Connection.
Modify the name and/or credentials and select Save. All workflows that use this connection will automatically use the new information.
Expanding on the example from Creating Connections
On the Connections list page, select the three dots (aka overflow actions) to the right of the connection you want to delete.
Select Delete
Expanding on the example from Creating Connections
Workflows can be managed from the Console, CLI or SDK.
To manage workflows from the Console, select the Workflows tab from the left side navigation bar. This will bring you to a list of Workflows where you can view more details for each Workflow.
Using the CLI you can view commands for working with workflows by running
Workflows can be created either from the Gretel Console or CLI.
Workflows are organized under projects and share the same permissions as the project they are owned by.
You can share a Workflow by sharing the project it is owned by. If a workflow references models or connections in a different project, be sure you have the appropriate level of access to that project.
The gretel_model
action can be used to train and generate records from Gretel Models. It is a good choice for non-relational data that can share the same model config. For relational data and providing table-specific models, you should use the gretel_tabular action instead.
Dataset outputs can only be referenced "as-is" in action configs, for example {outputs.extract.dataset}
. However, there are cases where a dataset output by one action needs to be edited before it can be used by a downstream action. Some examples include:
You can find additional Transform configuration templates .
entities
: List of PII entities that Transform's classification model will attempt to detect. Defaults to the following commonly used entities: [name, first_name, last_name, company, email, phone_number, address, street_address, city, administrative_unit, country, postcode]
. For best practices around customizing this list, see .
locales
: List of default Faker locales to use for fake value generation. Defaults to ["en_US"]
. fake
will randomly choose a locale from this list each time it generates a fake value, except when initialized with explicit locales, e.g. fake(["fr_FR"]).first_name()
. For a list of valid locales, see Faker's .
These expressions can leverage data
(a pandas DataFrame containing the entire dataset) to implement custom aggregations. For example, the config section below creates a new percent_of_total
column by storing the total
in vars
then dividing the value of each individual row by vars.total
:
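A rough sketch of what that could look like, assuming data behaves like a pandas DataFrame and the per-row value lives in a hypothetical amount column (the exact step layout may differ):

```yaml
steps:
  - vars:
      total: 'data["amount"].sum()'        # aggregate computed once over the whole dataset
    columns:
      add:
        - name: percent_of_total
    rows:
      update:
        - name: percent_of_total
          value: row.amount / vars.total   # divide each row's amount by the stored total
```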
vars
: Dictionary of variables defined under the vars
section of the current step
. For example, vars.total
refers to the value of the total
variable defined .
dtype
: Pandas data type of the column.
vars
: Dictionary of variables defined under the vars
section of the current step
. For example, vars.total
refers to the value of the total
variable defined .
You can use the built-in Faker implementation to generate fake entities. See for a list of supported entities and parameters.
Note: Column classification requires access to an LLM endpoint. When running within Gretel Cloud, this will use Gretel Navigator
. For Gretel Hybrid, classification needs to use a separately deployed LLM within your cluster. For full documentation on how to set up an LLM, see .
The classification model is capable of recognizing a variety of pre-defined and custom entities. While you can use arbitrary strings as entity names, it is beneficial to align with Faker entities if you plan to pass entity names to the fake
filter in order to generate fake values of the same entity.
For example, to detect and replace phone numbers, email addresses, employee IDs, and International Bank Account Numbers (IBAN), include phone_number
, email
, and iban
in the list of entities under globals.classify.entities
. These exactly match Faker's phone_number, email, and iban methods.
If you are running Transform in Gretel Hybrid and want to use classification, you'll need to first ensure you've installed the Gretel Inference LLM chart in your cluster. For full instructions on that installation, see .
fake
: Instantiation of the Faker generator, which defaults to the locale and seed specified in the globals
section. You can override these defaults by passing parameters, such as fake(locale="it_IT", seed=42)
, which will generate data using the Italian locale and 42 as the consistency seed.
random
is Python's built-in random module. For example, you could call random.randint(1, 10)
to generate an integer between 1 and 10.
Variables can be modified by filters. Filters are separated from the variable by a pipe symbol (|) and may have optional arguments in parentheses. Multiple filters can be chained. The output of one filter is applied to the next. Transform can use any of the standard Jinja filters, and also extends them with a few Gretel-specific filters:
lookup_country
: Attempts to map a country name to its corresponding pycountry Country object.
partial_mask(prefix: int, padding: str, suffix: int)
: This filter is similar to the MSSQL partial()
functionality. Given a value, this filter will retain the first N characters as the prefix, the last N characters as the suffix, and apply the padding between the prefix and suffix. If the original value is too short and would be leaked in the prefix, suffix, or a combination of the two, then the prefix and suffix are automatically adjusted to prevent this. For very short values, for example a single character value, only the padding may be returned. Example usage: value: this | partial_mask(2, "XXXXXX", 2)
date_parse
: Takes a string value and parses it into a Python datetime object. Date formats are those supported by Python's method.
date_shift
: Takes a date, either as a string or a date object, and randomly shifts it on an interval about the date. For example 2023-01-01 | date_shift('-5y', '+5y')
will result in a date object between 2018-01-01
and 2028-01-01
. Supports the same interval formats as Python's .
date_time_shift
: Takes a date, either as a string, a date or datetime object, and randomly shifts it on an interval about the date. For example 2023-01-01 00:00 | date_time_shift('-5y', '+5y')
will result in a datetime object between 2018-01-01 00:00
and 2028-01-01 00:00
. Supports the same interval formats as Python's .
date_format
: Takes a date and formats it per the passed in format. The default format is "%Y-%m-%d"
. Supports all formats for .
date_time_format
: Takes a datetime and formats it per the passed in format. The default format is "%Y-%m-%d %H:%M:%S"
. Supports all formats for .
For more detailed documentation please refer to the docs.
See the section for type
and config
details for actions that work with sources and sinks. See for type
and config
details for actions that interface with Gretel.
Workflows can be scheduled using cron expressions. Some examples include:
The gretel_tabular
action can be used to transform multiple tables while preserving referential integrity between those tables. gretel_tabular
also allows specifying different model configs for different tables, and even instructing Gretel to find optimal model configs for your data via .
By default, gretel_tabular
uses the default blueprint, but a different blueprint can be referenced...
With Gretel Workflows, you can train and run one or more Gretel models by connecting directly to your data sources and destinations. We support the following integrations for data inputs and outputs.
If you're creating Connections in a Hybrid environment, follow along here:
Select the .
Data sources also can be configured during a blueprint flow. Go to the or page and select a use case, for example "Generate synthetic data". This will start a guided flow to help you create a workflow.
To update the connection, follow the steps from Creating Connections to create a credentials file for the updated connection.
First, create a file on your computer containing a YAML workflow config. Then run the following command
Log into the Gretel Console, and navigate to the Workflows page. Select the New Workflow button.
Next, select the project in which you'd like to create the workflow. For first time users, a Default Project will automatically be created.
Now, select the model type. This depends on the use case. For example, if the goal is to generate synthetic data with differential privacy guarantees, choose Tabular DP.
The next step is selecting the remote data source. Since workflows are meant to be run automatically, you can't manually upload a data source. When creating and evaluating models, we recommend creating a model directly. That model can be referenced in the workflow config when it's time to operationalize your data generation.
Existing connections will show up automatically in the dropdown. If there are no connections, select New connection to define one. Add a descriptive connection name (separated by hyphens), and enter your credentials.
Provide data source and file name details. Gretel supports multiple files being processed at once. All files will create the same model type that was selected earlier in the flow.
Configure the destination. Generated data can be uploaded to the Gretel Cloud for easier access and sharing. It can also be output to a remote connection; either the same one that was configured as the data source or an entirely new one.
The final step is reviewing the workflow configuration. For an example of a workflow config, see the section below.
The workflow configuration can be edited from this page, and the model type updated. Once the workflow has been created, it will appear on the Workflows screen. Click the workflow list item to run the workflow. Workflow run activity details will be displayed, along with detailed logs for each step.
When a workflow has successfully completed, all generated artifacts will be available in the remote destination. This includes the generated data, quality and utility reports, and log files.
When creating a new workflow, select Run Now in the scheduling step.
Existing workflows can be run manually by navigating to the Workflow detail page, and selecting Run workflow now in the top right.
Building on the previous example from Creating Workflows
Workflows can be edited by navigating to that workflow, and clicking the configuration tab. Use the YAML code editor to modify workflow parameters, and select Done when completed. The new configuration will take effect for all subsequent runs of that workflow. To test the changes, select the Run workflow now button in the top right.
Building on the previous example from Creating Workflows
By default, processed files are output to the configured bucket path using the of the for the model run. If you want to customize the filename or path you can modify the destination action from YAML config after completing the wizard.
With the source and destination defined, select whether the workflow should run manually or on a schedule. We provide some pre-defined schedule types, but you can also create your own schedule using a cron expression. about cron expressions, or for help creating one.
project_id
The project to create the model in.
model
A reference to a blueprint or config location. If a config location is used, it must be addressable by the workflow action.
This field is mutually exclusive to model_config
.
model_config
Specify the model config as a dictionary. Accepts any valid model config.
This field is mutually exclusive to model
.
run_params
Parameters to run or generate records. If this field is omitted, the model will be trained, but no records will get generated for the model.
training_data
Data to use for training. This should be a reference to the output from a previous action.
dataset
A dataset object containing the outputs from the models created by this action.
table
string
The name of the table containing data (typically an id column) pointing to records on another table, e.g. "the table with the foreign key"
constrained_columns
string list
The columns populated with identifiers from the other table, e.g. "the foreign key column(s)"
referred_table
string
The name of the table containing data records being referenced by table
referred_columns
string list
The columns to which constrained_columns
point
name
An identifier for the action. Action names must be unique within the scope of a workflow.
type
The specific action type, e.g. s3_source
or gretel_model
. (See below)
connection
Pass a Connection ID to authenticate the action. This field is required for actions that connect to external services such as S3 or BigQuery.
input
Specify a preceding action as input to the current action.
config
The type-specific config.
| Cron expression | Description |
| --- | --- |
| 0 * * * * | Every hour at the beginning of the hour. |
| 0 2 * * 1-5 | 2:00 AM from Monday to Friday. |
| 0 0 * * 0 | Midnight (00:00) every Sunday. |
| 30 3 15 * * | 3:30 AM on the 15th day of every month. |

| Schedule | Description | Equivalent cron expression |
| --- | --- | --- |
| @yearly | Run once a year, midnight, Jan. 1st | 0 0 1 1 * |
| @monthly | Run once a month, midnight, first of month | 0 0 1 * * |
| @weekly | Run once a week, midnight between Sat/Sun | 0 0 * * 0 |
| @daily | Run once a day, midnight | 0 0 * * * |
| @hourly | Run once an hour, beginning of hour | 0 * * * * |
project_id
The project to create the model in.
train
(Training details, see following fields)
train.dataset
Data to use for training, including relationships between tables (if applicable). This should be a reference to a dataset output from a previous action.
train.model_config
A yaml object that accepts a few different shapes (detailed below): 1) a complete Gretel model config; 2) a reference to a blueprint or config location (from
); 3) an autotune
configuration.
train.skip_tables
(List of tables to pass through unaltered to outputs, see following fields)
train.skip_tables.table
The name of a table to skip, i.e. omit from model training and pass through unaltered.
train.table_specific_configs
(List of table-specific training details, see following fields)
train.table_specific_configs.tables
A list of table names to which the other fields in this object apply.
train.table_specific_configs.model_config
An alternative to the global default train.model_config
value defined above.
run
(Run details, see following fields)
run.encode_keys
(Transform models only.) Whether to transform primary and foreign key columns. Defaults to false
.
dataset
A dataset object containing the outputs from the models created by this action.
enabled
This boolean field must be explicitly set to true
to enable config tuning.
trials_per_table
Optionally specify the number of trials to run for each table. Defaults to 4.
metric
The metric to optimize for. Defaults to synthetic_data_quality_score
; also accepts field_correlation_stability
, field_distribution_stability
, principal_component_stability
.
tuner_config
The specific Gretel Tuner config to use. Like model_config
, this accepts either full configuration objects, or references to blueprints via from
.
Read
Users can list and view connection metadata.
Write
Users can access connections in a Workflow.
Administrator
Users can create, edit and delete connections.
Owner
Full control.
Read
Users can view workflows, runs, and logs.
Write
Users can create new workflows.
User can edit existing workflows.
Users can manually trigger existing workflows.
Users can delete existing workflows.
Administrator
Users can share workflows with other users.
Co-Owner
Full control.
Connect Gretel to object storage based services.
Gretel Workflows support connecting to the following object storage services
Object storage source actions will incrementally crawl buckets searching for files that have changed between runs. Crawled files can then be configured as inputs to Gretel Models.
A glob filter can be configured to ensure files matching a specific pattern are used as sources. Files not matching the pattern will be excluded from the crawl.
A glob filter is evaluated against the filename or key of the object.
The character *
is used to match any number of characters, excluding slashes.
Passing **
recursively matches any number of nested directories.
Checks are case-sensitive
Examples
| Glob filter | Object key | Match? |
| --- | --- | --- |
| *.txt | data.txt | Yes, any txt file in the current path will be matched. |
| *.png | data.json | No, json files do not have a png ending. |
| my/path/*.txt | my/path/data.txt | Yes, any txt files under my/path are matched. |
| **/*.csv | my/path/data.csv | Yes, any csv file is recursively matched. |
| ** | data.csv | Yes, all files are recursively matched. |
| */** | data.csv | No, any files in the root directory are excluded. |
In addition to a glob filter, a source action can be configured to crawl in a specific path. Configuring a path will narrow the set of objects that the bucket crawler will list or search.
Object storage destination actions can be configured to write the synthetic data outputs of a Gretel Model back to object storage.
Each object storage destination action can be configured to mirror the directory structure of the source bucket or can be configured to create new directory layouts.
For a list of supported file types, please refer to Inputs and Outputs.
Connect Gretel to database management systems.
Gretel workflows support connecting to the following databases:
MySQL
PostgreSQL
MS SQL Server
Oracle Database
Gretel database connectors can be used in Gretel Workflows with the gretel_tabular
action to operationalize synthetic data into your data pipeline.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
When reading from a database connector, the source action can extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the {database}_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with a {database}_destination
action.
Gretel database connectors can be used in Workflows with the gretel_tabular
action type. They are not compatible with gretel_model
.
Destination database must exist and contain placeholder tables/schema
While referential integrity can be maintained up to a certain extent, this functionality works best on single tables and we recommend processing either individual tables or views.
Connect Gretel to data warehouse platforms.
Gretel workflows support connecting to the following data warehouse platforms:
Gretel data warehouse connectors can be used in Gretel Workflows to operationalize synthetic data into your data pipeline.
When reading from a data warehouse connector, the source action can extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the {data_warehouse}_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with a {data_warehouse}_destination
action.
Do not use your input data warehouse connection as an output connector. This action can result in the unintended overwriting of existing data.
Connect to your Google Cloud Storage buckets.
Prerequisites to create a Google Cloud storage based workflow. You will need
A connection to Google Cloud Storage.
A source bucket.
(optional) A destination bucket. This can be the same as your source bucket, or omitted entirely.
Google Cloud Storage related actions require creating a gcs
connection. The connection must be configured with the correct permissions for each Gretel Action.
For specific permissions, please refer to the Minimum Permissions section under each corresponding action.
Gretel GCS connections require the following fields
private_key_json
This private key JSON blob is used to authenticate Gretel with GCS object storage resources.
In order to generate a private key you will first need to create a service account, and then download the key for that service account.
After the service account has been created, you can attach bucket specific permissions to the service account.
Please see each action's Minimum Permissions section for a list of permissions to attach to the service account.
Type
gcs_source
Connection
gcs
The gcs_source
action can be used to read an object from a GCS bucket into Gretel Models.
This action works as an incremental crawler. Each time a workflow is run the action will crawl new files that have landed in the bucket since the last crawl.
For details how the action more generally works, please see Reading Objects.
bucket
Bucket to crawl data from. Should only include the name, such as my-gretel-source-bucket
.
glob_filter
path
Prefix to crawl objects from. If no path
is provided, the root of the bucket is used.
recursive
Default false
. If set to true
the action will recursively crawl objects starting from path
.
dataset
The associated service account must have the following permissions for the configured bucket
storage.objects.list
storage.objects.get
Type
gcs_destination
Connection
gcs
The gcs_destination
action may be used to write gretel_model
outputs to Google Cloud Storage buckets.
For details how the action more generally works, please see Writing Objects.
bucket
The bucket to write objects back to. Only include the name of the bucket, e.g. my-gretel-bucket
.
path
Defines the path prefix to write the object into.
filename
Name of the file to write data back to. This file name will be appended to the path
if one is configured.
input
None
The associated service account must have the following permissions for the configured destination bucket
storage.objects.create
storage.objects.delete
(supports replacing an existing file in the bucket)
Create a synthetic copy of your Google Cloud Storage bucket. The following config will crawl a bucket, train and run a synthetic model, then write the outputs of the model back to a destination bucket while maintaining the same folder structure of the source bucket.
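A sketch of such a workflow; it mirrors the S3 example shown earlier, with the connection ID, project ID, and blueprint reference as placeholders:

```yaml
name: gcs-synthetic-copy
actions:
  - name: gcs-read
    type: gcs_source
    connection: c_gcs_connection             # placeholder connection ID
    config:
      bucket: my-gretel-source-bucket
      glob_filter: "*.csv"
      recursive: true
  - name: model-train-run
    type: gretel_model
    input: gcs-read
    config:
      project_id: proj_xxxxxxxx              # placeholder project ID
      model: synthetics/tabular-actgan       # placeholder blueprint reference
      run_params: {}
      training_data: "{outputs.gcs-read.dataset.files.data}"
  - name: gcs-write
    type: gcs_destination
    connection: c_gcs_connection
    input: model-train-run
    config:
      bucket: my-gretel-destination-bucket
      filename: "{outputs.gcs-read.dataset.files.filename}"
      input: "{outputs.model-train-run.dataset.files.data}"
```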
Check out this Benchmark report, running Gretel models on popular ML datasets, indexed by industry
You can use a Benchmark report like the one shown here to evaluate which Gretel model is best for your synthetic data goals.
For example, Gretel Tabular Fine-Tuning consistently generates synthetic data with high Synthetic Data Quality Score (SQS) on multiple types of tabular data, and Gretel ACTGAN is great for particularly long or wide datasets.
The publicly available datasets used in this results leaderboard were sourced from the following ML dataset repositories: UCI, Kaggle, and HuggingFace.
Tabular Fine-Tuning
84
83
1034.299
4.9 MB
tabular_mixed
21
41188
Tabular GAN
89
86
1080.329
4.9 MB
tabular_mixed
21
41188
Tabular Fine-Tuning
95
85
669.117
371 KB
tabular_mixed
17
4521
Tabular GAN
87
87
148.713
371 KB
tabular_mixed
17
4521
Tabular Fine-Tuning
93
53
2126.556
89 KB
time_series
16
750
Tabular GAN
60
97
62.725
89 KB
time_series
16
750
Tabular Fine-Tuning
87
94
1334.324
2.4 MB
tabular_numeric
24
16519
Tabular GAN
78
95
444.495
2.4 MB
tabular_numeric
24
16519
Tabular Fine-Tuning
95
75
368.91
52 KB
tabular_numeric
7
1728
Tabular GAN
86
74
67.322
52 KB
tabular_numeric
7
1728
Tabular Fine-Tuning
83
97
507.772
5.6 MB
tabular_numeric
5
103886
Tabular GAN
87
77
868.921
5.6 MB
tabular_numeric
5
103886
Tabular Fine-Tuning
93
91
874.715
1.9 MB
tabular_mixed
14
19158
Tabular GAN
91
91
417.622
1.9 MB
tabular_mixed
14
19158
Tabular Fine-Tuning
91
73
2008.613
274 KB
tabular_mixed
37
1470
Tabular GAN
75
89
98.267
274 KB
tabular_mixed
37
1470
Tabular Fine-Tuning
89
95
2277.711
11.4 MB
time_series
29
19735
Tabular GAN
75
84
653.85
11.4 MB
time_series
29
19735
Tabular Fine-Tuning
87
90
1805.024
1.7 MB
tabular_mixed
33
7043
Tabular GAN
80
91
265.678
1.7 MB
tabular_mixed
33
7043
Tabular Fine-Tuning
90
81
1144.394
822 KB
time_series
15
9357
Tabular GAN
69
86
214.324
822 KB
time_series
15
9357
Tabular Fine-Tuning
82
78
376.864
4 KB
tabular_numeric
5
150
Tabular GAN
78
58
56.388
4 KB
tabular_numeric
5
150
Tabular Fine-Tuning
92
90
775.856
90 KB
tabular_numeric
12
1599
Tabular GAN
66
92
60.729
90 KB
tabular_numeric
12
1599
Tabular Fine-Tuning
94
88
742.663
281 KB
tabular_numeric
12
4898
Tabular GAN
82
89
120.683
281 KB
tabular_numeric
12
4898
Tabular Fine-Tuning
89
65
2170.874
3 MB
tabular_numeric
28
21643
Tabular GAN
87
88
710.705
3 MB
tabular_numeric
28
21643
Tabular Fine-Tuning
90
79
815.833
3.6 MB
tabular_mixed
15
32561
Tabular GAN
92
80
683.149
3.6 MB
tabular_mixed
15
32561
Tabular Fine-Tuning
92
53
723.425
18 KB
tabular_numeric
14
303
Tabular GAN
73
73
47.015
18 KB
tabular_numeric
14
303
Tabular Fine-Tuning
83
76
649.074
19 KB
tabular_numeric
11
699
Tabular GAN
75
78
47.237
19 KB
tabular_numeric
11
699
Tabular Fine-Tuning
85
84
12805.043
75 MB
tabular_numeric
55
581012
Tabular GAN
92
81
3135.347
75 MB
tabular_numeric
55
581012
Tabular GAN
86
50
94297.403
311 MB
tabular_numeric
1349
27000
Tabular GAN
83
77
22133.246
743 MB
tabular_mixed
42
4898430
Tabular GAN
83
50
154719.489
421 MB
tabular_numeric
967
63360
Tabular Fine-Tuning
93
98
1353.606
154 MB
tabular_numeric
15
1446956
Tabular GAN
92
99
10106.598
154 MB
tabular_numeric
15
1446956
Tabular Fine-Tuning
99
87
511.645
24 MB
tabular_numeric
11
1000000
Tabular GAN
95
85
5350.631
24 MB
tabular_numeric
11
1000000
Tabular Fine-Tuning
99
89
445.614
614 KB
tabular_numeric
11
25010
Tabular GAN
91
90
419.063
614 KB
tabular_numeric
11
25010
Tabular Fine-Tuning
67
94
1608.793
262 MB
tabular_numeric
12
5749132
Tabular GAN
85
92
33233.547
262 MB
tabular_numeric
12
5749132
Tabular Fine-Tuning
92
92
570.957
38 MB
tabular_mixed
9
1017209
Tabular GAN
89
89
5040.424
38 MB
tabular_mixed
9
1017209
Connect to your Azure Blob containers.
Prerequisites to create an Azure Blob based workflow. You will need
A connection to Azure Blob.
A source container.
A destination container. This can be the same as your source container.
Azure Blob related actions require creating an azure
connection. The connection must be configured with the correct permissions for each Gretel Action.
For specific permissions, please refer to the Minimum Permissions section under each corresponding action.
There are three ways to authenticate a Gretel Azure Blob Connection, each method requires different fields for connection creation:
name
Display name of your choosing used to identify your connection within Gretel.
account_name
Name of the Storage Account.
access_key
default_container
Default container to crawl data from. Different containers can be chosen at the azure_source
and azure_destination
actions.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. connection_target_type
is optional; if omitted, the connection can be used for both source and destination action. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Azure Blob connection using access key credentials:
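A sketch of such a file; which fields live under config versus credentials is an assumption, so check it against your connection type reference:

```yaml
# azure_connection.yaml (illustrative)
type: azure
name: my-azure-connection
config:
  account_name: mystorageaccount
  default_container: my-default-container
credentials:
  access_key: "<storage account access key>"
```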
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection
button.
Step 1, choose the Type for the Connection - Azure Blob.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
name
Display name of your choosing used to identify your connection within Gretel.
account_name
Name of the Storage Account.
client_id
Application (client) ID.
tenant_id
Directory (tenant) ID.
username
Email of the Service Account.
entra_password
Password of the Service Account.
default_container
Default container to crawl data from. Different containers can be chosen at the azure_source
and azure_destination
actions.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. connection_target_type
is optional; if omitted, the connection can be used for both source and destination action. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Azure Blob connection using Entra ID credentials:
Now that you've created the credentials file, use the CLI to create the connection
Console support for creating Azure Blob connections using Entra ID is coming soon. For now, you can create connections using Entra ID via CLI or SDK and then use those connections in Console.
name
Display name of your choosing used to identify your connection within Gretel.
account_name
Name of the Storage Account.
sas_token
default_container
Default container to crawl data from. Different containers can be chosen at the azure_source
and azure_destination
actions.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. connection_target_type
is optional; if omitted, the connection can be used for both source and destination action. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Azure Blob connection file using a SAS token:
Now that you've created the credentials file, use the CLI to create the connection
Console support for creating Azure Blob connections using SAS Tokens is coming soon. For now, you can create connections using SAS Tokens via CLI or SDK and then use those connections in Console.
Type
azure_source
Connection
azure
The azure_source
action can be used to read an object from an Azure Blob container into Gretel Models.
This action works as an incremental crawler. Each time a workflow is run the action will crawl new files that have landed in the container since the last crawl.
For details how the action more generally works, please see Reading Objects.
container
Container to crawl data from. If empty, will default to default_container
.
glob_filter
path
Prefix to crawl objects from. If no path
is provided, the root of the container is used.
recursive
Default false
. If set to true
the action will recursively crawl objects starting from path
.
dataset
The associated service account must have the following permissions for the configured container
Storage Blob Data Reader role permissions, or higher
The SAS Token must have the following permissions for the configured container or storage account
List
Read
The SAS Token added for the storage account needs to have Container and Object allowed resource types.
Type
azure_destination
Connection
azure
The azure_destination
action may be used to write gretel_model
or gretel_tabular
outputs to Azure Blob containers.
For details how the action more generally works, please see Writing Objects.
container
Container to write data to. If empty, will default to default_container
.
path
Defines the path prefix to write the object into.
filename
Name of the file to write data back to. This file name will be appended to the path
if one is configured.
input
None
The associated service account must have the following permissions for the configured container
Storage Blob Data Contributor role permissions, or higher
The SAS Token must have the following permissions for the configured container or storage account
Create
List
Write
The SAS Token added for the storage account needs to have Container and Object allowed resource types.
Create a synthetic copy of your Azure Blob container. The following config will crawl a container, train and run a synthetic model, then write the outputs of the model back to a destination container while maintaining the same folder structure of the source container.
Connect Gretel to your Amazon S3 buckets.
This guide will walk you through connecting source and destination S3 buckets to Gretel. Source buckets will be crawled and used as training inputs to Gretel models. Model outputs get written to the configured S3 destination.
Prerequisites to create an Amazon S3 based workflow. You will need
A connection to Amazon S3.
A source bucket.
(optional) A destination bucket. This can be the same as your source bucket, or omitted entirely.
Amazon S3 related actions require creating an s3
connection. The connection must be configured with the correct IAM permissions for each Gretel Action.
You can configure the following properties for a connection
access_key_id
Unique identifier used to authenticate and identify the user.
secret_access_key
Secret value used to sign requests.
The following policy can be used to enable access for all S3 related actions
More granular permissions for each action can be found in the action's respective Minimum Permissions section.
The following documentation provides instruction for creating IAM users and access keys from your AWS account.
You can configure your Gretel S3 connector to use an IAM role for authorization. Using IAM roles you can grant Gretel systems access to your bucket without sharing any static access keys.
Before setting up your IAM role, you must first locate the Gretel Project ID for the project you wish to create the connection in. You will use the project id as the external id for the IAM role.
You may find your Gretel Project ID from the Console, SDK or CLI using the following instructions:
Using the CLI you can query for projects by name and use the project_guid
field to retrieve the external id for the IAM role.
Navigate to the Projects page, and select Copy UID from the project drop-down on the right.
This should automatically copy the project id to your clipboard.
Running the snippet above should yield an output such as
Now that you have the external id, you will need to create an AWS IAM role. To create the role, navigate to your AWS IAM Console, select the Roles page from the left menu, select Create Role and follow the instruction for either Gretel Cloud or Gretel Hybrid below:
From the Role Creation dialog
Select AWS account as the Trusted entity type.
Select Another AWS account and enter Gretel's AWS account ID, 074762682575
.
Check Require external ID and enter the Gretel Project ID from the previous step as the External ID.
Select Next and add the appropriate IAM policies for the bucket.
The final trust policy on your IAM role should look similar to
For more information about delegating permissions to an AWS IAM user, please reference the following AWS documentation:
From the Role Creation dialog, select Custom trust policy as the Trusted entity type, and enter the following config:
Be sure to replace the following values:
<your-aws-account>
with your AWS Account ID.
<hybrid-deployment-name>
with the name of your Gretel Hybrid deployment. By default this is set to gretel-hybrid-env
. You can find this value by checking the deployment_name
variable from your Gretel Hybrid Terraform module.
<your gretel project id>
with your Gretel Project ID from the previous step.
Now that you have the role configured, you can create a Gretel connection using the role ARN from the previous step.
Using the role ARN from the previous steps, create a file on your local computer with the following contents
Then use the Gretel CLI to create the connection from the credentials file
Once you've created the connection, you may delete the local credentials file.
From the Gretel Console, navigate to the Create Connection dialog, select S3, select the Role ARN authentication method, and enter the role ARN created in the previous steps.
Type
s3_source
Connection
s3
The s3_source
action can be used to read an object from a S3 data source into Gretel Models.
Each time the source action is run from a workflow, the action will crawl new files that have landed in the bucket since the last crawl.
For details how the action more generally works, please see Reading Objects.
bucket
Bucket to crawl data from. Should only include the name, such as my-gretel-source-bucket
.
glob_filter
path
Prefix to crawl objects from. If no path
is provided, the root of the bucket is used.
recursive
Default false
. If set to true
the action will recursively crawl objects beginning from the configured path
.
dataset
The following permissions must be attached to the AWS connection in order to read objects from a s3 bucket
Type
s3_destination
Connection
s3
An S3 bucket can be configured as a destination for model outputs. This bucket can be the same bucket as the source, or a different bucket may be specified. If no destination is specified, generated data can be accessed from the model itself.
The s3_destination
action may be used to write gretel_model
outputs to S3 destination buckets.
For details how the action more generally works, please see Writing Objects.
bucket
The bucket to write objects back to. Please only include the name of the bucket, e.g. my-gretel-bucket
.
path
Defines the path prefix to write the object into.
filename
This is the name of the file to write data back to. This file name will be appended to the path
if one is configured.
input
None
The following permissions must be attached to the AWS connection in order to write objects to a destination bucket.
path
The path
property from the source configuration may be used in conjunction with the destination path
to move file locations while preserving file names.
For example, if a source bucket is configured with path=data/
and the destination bucket configured with path=processed-data/
, a source file data/records.csv
will get written to the destination as processed-data/records.csv
.
Create a synthetic copy of your Amazon S3 bucket. The following config will crawl a S3 bucket, train and run a synthetic model, then write the outputs of the model back to a destination S3 bucket while maintaining the same name and folder structure of the source bucket.
Connect to your MySQL databases.
Prerequisites to create a MySQL based workflow. You will need
A source MySQL connection.
(optional) A list of tables OR SQL queries.
(optional) A destination MySQL connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
A mysql
connection is created using the following parameters:
name
Display name of your choosing used to identify your connection within Gretel.
my-mysql-connection
username
Unique identifier associated with specific account authorized to access database.
john
password
Security credential to authenticate username.
...
host
Fully qualified domain name (FQDN) used to establish connection to database server.
myserver.example.com
port
Port number; If left empty, the default value - 3306
- will be used.
3306
database
Name of database to connect to.
mydatabase
(optional) params
Optional JDBC URL parameters that can be used for advanced configuration.
TrustServerCertificate=True&useSSL=false
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example MySQL connection:
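A sketch of such a file, using the parameters listed above; which fields live under config versus credentials is an assumption:

```yaml
# mysql_connection.yaml (illustrative)
type: mysql
name: my-mysql-connection
config:
  host: myserver.example.com
  port: 3306
  database: mydatabase
  username: john
credentials:
  password: "<database password>"
```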
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection
button.
Step 1, choose the Type for the Connection - MySQL.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
Type
mysql_source
Connection
mysql
The mysql_source
action reads data from your MySQL database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the mysql_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the mysql_destination
action.
The mysql_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
Example Source Action YAML
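A sketch of a full-extraction source action (the connection ID is a placeholder):

```yaml
actions:
  - name: mysql-read
    type: mysql_source
    connection: c_mysql_source    # placeholder connection ID
    config:
      sync:
        mode: full                # extract all records from every table in the database
```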
Selected Tables
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
Example Source Action YAML
SQL Query/Queries
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the mysql_source
action always provides a single output, dataset
.
dataset
The output of a mysql_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
Type
mysql_destination
Connection
mysql
The mysql_destination
action can be used to write gretel_tabular
action outputs to MySQL destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the mysql_destination
action always takes the same input, dataset
.
dataset
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
DML command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from MySQL, the DDL is extracted using a SHOW CREATE TABLE
statement. If the source table is from a non-MySQL source, the destination table schema is inferred based on the column types of the database.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing adhoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Create a synthetic version of your MySQL database.
The following config will extract the entire database, train and run a synthetic model, then write the outputs of the model back to a destination MySQL database while maintaining referential integrity.
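A sketch of such a workflow; the connection IDs, project ID, blueprint reference, and the exact keys used to wire the dataset into the destination are assumptions:

```yaml
name: synthesize-mysql-database
actions:
  - name: mysql-read
    type: mysql_source
    connection: c_mysql_source               # placeholder source connection ID
    config:
      sync:
        mode: full
  - name: synthesize
    type: gretel_tabular
    input: mysql-read
    config:
      project_id: proj_xxxxxxxx              # placeholder project ID
      train:
        dataset: "{outputs.mysql-read.dataset}"
        model_config:
          from: synthetics/tabular-actgan    # placeholder blueprint reference
      run: {}
  - name: mysql-write
    type: mysql_destination
    connection: c_mysql_destination          # placeholder destination connection ID
    input: synthesize
    config:
      sync:
        mode: replace                        # drop and recreate destination tables
      dataset: "{outputs.synthesize.dataset}"
```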
Create a synthetic version of selected tables from your MySQL database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination MySQL database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your MySQL database
The following config will execute a SQL query against your MySQL database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Connect to your Oracle database.
Prerequisites to create an Oracle Database based workflow. You will need
A source Oracle Database connection.
(optional) A list of tables OR SQL queries.
(optional) A destination Oracle Database connection OR object storage connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
An oracle
connection is created using the following parameters:
name
Display name of your choosing used to identify your connection within Gretel.
my-oracle-connection
username
Unique identifier associated with specific account authorized to access database. The connection will be to this user's schema.
john
password
Security credential to authenticate username.
...
host
Fully qualified domain name (FQDN) used to establish connection to database server.
myserver.example.com
port
Optional Port number; If left empty, the default value - 1521
- will be used.
1521
service_name
Name of database service to connect to.
my_service_name
(optional) instance_name
Optional Name of specific database instance for this connection.
instance_id
(optional) params
Optional JDBC URL parameters that can be used for advanced configuration.
key1=value1;key2=value2
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Oracle Database connection:
Now that you've created the credentials file, use the CLI to create the connection
Navigate to the Connections page using the menu item in the left sidebar.
Click the New Connection
button.
Step 1, choose the Type for the Connection - Oracle Database.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
In Oracle, the CREATE SCHEMA
command does not create a new, standalone schema. Instead, one creates a user. When the user is created, a schema is also automatically created for that user. When the user logs in, that schema is used by default for the session. In order to prevent name clashes or data accidents, we encourage you to create separate Oracle users for the Source and Destination connections.
The Oracle source action requires enough access to read from tables and access schema metadata. The following SQL script will create an Oracle user suitable for a Gretel Oracle source.
The following SQL script will create an Oracle user suitable for a Gretel Oracle destination. It will write to its own schema.
For more details please check your installation's version and see Oracle documents on CREATE USER.
Type
oracle_source
Connection
oracle
The oracle_source
action reads data from your Oracle database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the oracle_source
action is used to train models and generate data with the gretel_tabular
action, and can be written to an output database with the oracle_destination
action. Your generated data can also be written to object storage connections, for more information see Writing to Object Storage.
The oracle_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
Example Source Action YAML
Selected Tables
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
Example Source Action YAML
SQL Query/Queries
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
Example Source Action YAML
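As a rough, hypothetical sketch only: a SQL-query extraction could be configured along the lines of the YAML below. The exact action schema (how sync.mode and the queries list are nested, and how the connection is referenced) is an assumption based on the parameter descriptions above, and the query itself is a placeholder.

```yaml
# Hypothetical fragment of a workflow config (layout is assumed, not the official example).
actions:
  - name: oracle-read
    type: oracle_source
    connection: c_1                       # ID of your source Oracle connection
    config:
      sync:
        mode: full
      queries:
        - name: active_users              # becomes the name of the resulting table
          query: SELECT id, name, email FROM users WHERE active = 1
```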
Whether you are extracting an entire database, selected tables, or querying against a database, the oracle_source
action always provides a single output, dataset
.
dataset
The output of an oracle_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
Type
oracle_destination
Connection
oracle
The oracle_destination
action can be used to write gretel_tabular
action outputs to Oracle destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the oracle_destination
action always takes the same input, dataset
.
dataset
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from Oracle, the DDL is extracted using the GET_DDL
interface from the DBMS_METADATA
package. If the source table is from a non-Oracle source, the destination table schema is inferred based on the column types of the source schema (if present) or the data.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Example Destination Action YAML
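For orientation, a destination step might be shaped roughly like the sketch below. The field layout and the dataset reference syntax are assumptions based on the inputs described above, not a verbatim copy of the official example.

```yaml
# Hypothetical fragment of a workflow config (layout and reference syntax are assumed).
actions:
  - name: oracle-write
    type: oracle_destination
    connection: c_2                       # ID of your destination Oracle connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      sync:
        mode: replace
      dataset: '{outputs.synthesize.dataset}'
```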
You can also write your output dataset to an object storage connection like Amazon S3 or Google Cloud Storage. Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the {object_storage}_destination
action always takes the same inputs: filename, input, and path. Additionally, S3 and GCS take bucket, and Azure Blob takes container.
filename
This is the name(s) of the file(s) to write data back to. File name(s) will be appended to the path
if one is configured.
This is typically a reference to the output from the previous action, e.g. {outputs.<action-name>.dataset.files.filename}
input
Data to write to the file. This should be a reference to the output from the previous action, e.g. {outputs.<action-name>.dataset.files.data}
path
Defines the path prefix to write the object(s) into.
[S3 and GCS only] bucket
The bucket to write object(s) to. Please only include the name of the bucket, e.g. my-gretel-bucket
.
[Azure Blob only] container
The container to write object(s) to. Please only include the name of the container, e.g. my-gretel-container
.
Example Destination Action YAML
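Since the original example is not reproduced here, the following is a hypothetical sketch of an Amazon S3 destination. The filename and input references follow the syntax given above; the action name, connection reference, and overall layout are assumptions.

```yaml
# Hypothetical s3_destination action fragment (layout assumed; values are placeholders).
actions:
  - name: s3-write
    type: s3_destination
    connection: c_s3                      # ID of your Amazon S3 connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      bucket: my-gretel-bucket
      path: synthetic/
      filename: '{outputs.synthesize.dataset.files.filename}'
      input: '{outputs.synthesize.dataset.files.data}'
```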
Create a synthetic version of your Oracle database.
The following config will extract the entire Oracle database, train and run a synthetic model, then write the outputs of the model back to a destination Oracle database while maintaining referential integrity.
Create a synthetic version of selected tables from your Oracle database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination Oracle database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your Oracle database and write to S3
The following config will execute a SQL query against your Oracle database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table. Finally, the generated data will be written to an Amazon S3 bucket.
Create a synthetic version of your Oracle database and write the results to GCS.
The following config will extract the entire Oracle database, train and run a synthetic model, then write the output tables to an output Google Cloud Storage bucket while maintaining referential integrity.
Connect to your MS SQL Server databases.
Prerequisites to create an MS SQL-based workflow. You will need:
A source MS SQL connection.
(optional) A list of tables OR SQL queries.
(optional) A destination MS SQL connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
An mssql connection is created using the following parameters:
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example MS SQL connection:
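As above, the snippet below is a hypothetical sketch of the credentials file using the mssql parameters; all values are placeholders and the config/credentials split is an assumption.

```yaml
# Hypothetical MS SQL connection credentials file (all values are placeholders).
type: mssql
name: my-mssql-connection
config:
  username: john
  host: myserver.example.com
  database: mydatabase
  schema: dbo
credentials:
  password: "..."
```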
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - MS SQL Server.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The mssql_source
action reads data from your MS SQL database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the mssql_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the mssql_destination
action.
The mssql_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the mssql_source
action always provides a single output, dataset
.
The output of a mssql_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The mssql_destination
action can be used to write gretel_tabular
action outputs to MS SQL destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the mssql_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using a schema inferred from the input dataset.
When the schema is inferred from the input dataset, certain column types or constraints may not be maintained from the source table. If you want to maintain the same schema from your source database, please use sync mode truncate
.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Create a synthetic version of your MS SQL database.
The following config will extract the entire MS SQL database, train and run a synthetic model, then write the outputs of the model back to a destination MS SQL database while maintaining referential integrity.
Create a synthetic version of selected tables from your MS SQL database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination MS SQL database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your MS SQL database
The following config will execute a SQL query against your MS SQL database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Writing a well-formatted, clear prompt can get you a long way toward high quality tabular results, and often resolve errors you may be experiencing. Follow these guidelines to get the best from Navigator.
For generating tabular data, make sure your prompt is at least 25 characters.
Do not submit spam or prompts irrelevant to a tabular dataset (like "hello") to Navigator's tabular data format. If you're looking for question-and-answer style data, try submitting a prompt to our text chat interface (select "Natural Language" in the Playground)
The more detail you include about what the output should and shouldn't look like, the better your results will be. This includes:
List the columns you want the data to have.
Describe each column, including the format you want the data to follow (e.g. YYYY-MM-DD for dates), the range of values if applicable, and the context.
If there is a mismatch between the text prompt and sample data you provide, this can confuse the model and cause errors. For best results, always make sure your text prompt and sample data match.
This helps the model parse the example table. The text prompt can be as simple as an instruction to generate more data following the example data. Example:
Generate 30 rows of data exactly like the following table
This includes more information than SELECT. You can also combine them both, for example
If you want to generate a table with multiple columns, use a bulleted list and a short, clear description of the data you want in each column. Example:
Create a U.S. flight passenger dataset with the following columns:
- Traveler ID: a 6-character alphanumeric ID
- Departing city: a city in the U.S.
- Arrival city: a city in the U.S.
- Duration: duration of the flight, in minutes
- Number of seats: seats on the flight
Navigator can generate roughly 20-30 columns' worth of data, depending on the length of the column names. If you need more columns, consider generating in pieces using edit mode, then joining the pieces afterwards.
Resources for Gretel Navigator (now in GA!)
Navigator is Gretel's first AI system designed to generate, edit, and augment tabular data using natural language or code. It's a tool for creating and enhancing datasets in a more intuitive and interactive way.
We’re rapidly adding new features and improvements to Navigator, so we appreciate your patience and feedback.
If you’ve already tried Navigator and haven’t been able to get the results you expected (or even if you are), we’re here to help. Our primary goal is to better understand what you are trying to achieve, and we’d love to work with you to create high quality data for your use case.
Access artifacts in your project.
Gretel Workflows can read from and write to your Gretel Project. The actions below can be particularly useful alternatives if you have local data you want to run through a workflow, or don't have a destination to write output data to.
The read_project_artifact
action can be used to read in existing Gretel Project Artifacts as inputs to other actions.
The write_project_artifact
action can be used to write an action output to a Gretel Project.
None.
Train a Gretel Model from an existing project artifact and write the output to your project.
Read from and write to Databricks.
Prerequisites to create a Databricks-based workflow. You will need:
A source Databricks connection.
(optional) A list of tables OR SQL queries.
(optional) A destination Databricks connection.
Do not use your input Databricks connection as an output connector. This action can result in the unintended overwriting of existing data.
Before creating the Databricks connection on Gretel, please ensure that the compute cluster has been started (i.e. Spark Cluster or SQL Warehouse) to ensure that validation doesn't timeout.
A databricks
connection is created using the following parameters:
To generate a personal access token, you will first need to create a service principal and then generate a personal access token for that service principal.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Databricks connection:
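The snippet below is a hypothetical sketch built from the Databricks parameters described in this section; values are placeholders, and where each field lives under config versus credentials is an assumption.

```yaml
# Hypothetical Databricks connection credentials file (all values are placeholders).
type: databricks
name: my-databricks-connection
config:
  server_hostname: account_identifier.cloud.databricks.com
  http_path: /sql/1.0/warehouses/foo
  catalog: MY_CATALOG
  schema: MY_SCHEMA
credentials:
  personal_access_token: dapi...          # service principal token; assumed to live under credentials
```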
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - Databricks.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The Databricks source action requires enough access to read from tables and access schema metadata.
Add the following permissions to the Service Principal that was created above in order to be able to read data.
Ensure that the user/service principal is part of the ownership group for the destination catalog or schema.
The Databricks destination action requires enough permissions to write to the destination schema.
Add the following permissions to the Service Principal that was created above in order to be able to write data.
The databricks_source
action reads data from your Databricks database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the databricks_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the databricks_destination
action.
The databricks_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the databricks_source
action always provides a single output, dataset
.
The output of a databricks_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The databricks_destination
action can be used to write gretel_tabular
action outputs to Databricks destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the databricks_destination
action always takes the same input, dataset
.
Example Destination Action YAML
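A hypothetical sketch of a databricks_destination step is shown below; the layout and reference syntax are assumed. It includes the Unity Catalog staging volume input described with the destination parameters in this guide.

```yaml
# Hypothetical fragment of a workflow config (layout and reference syntax are assumed).
actions:
  - name: databricks-write
    type: databricks_destination
    connection: c_2                       # ID of your destination Databricks connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      sync:
        mode: replace
      volume: my_staging_volume           # Unity Catalog volume used to stage data before writing
      dataset: '{outputs.synthesize.dataset}'
```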
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from Databricks, the DDL is extracted using the GET_DDL
metadata function. If the source table is from a non-Databricks source, the destination table schema is inferred based on the column types of the database.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity.
Create a synthetic version of your Databricks database.
The following config will extract the entire Databricks database, train and run a synthetic model, then write the outputs of the model back to a destination Databricks database while maintaining referential integrity.
Create a synthetic version of selected tables from your Databricks database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination Databricks database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your Databricks database
The following config will execute a SQL query against your Databricks database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Gretel Navigator FAQ
What types of data can I work with using Navigator? Gretel Navigator is designed to support tabular data containing any combination of numeric, categorical, and text modalities. This flexibility allows you to work seamlessly across various types of datasets, catering to a broad range of data generation and augmentation tasks.
What can I do with Navigator? You can generate tabular data from natural language or SQL prompts, edit existing datasets, augment data, fill in missing values, experiment interactively in the console, and generate and edit data at scale using our batch API and SDK.
Why is my feedback important? Your feedback helps us prioritize our development roadmap. By sharing your experience and suggestions, you directly contribute to shaping the future features and improvements of Navigator.
What about larger datasets and advanced features? We're committed to rapidly increasing the scale of datasets that Navigator can handle and are continuously working on enhancing the AI's capabilities. Expect regular updates and improvements based on user feedback.
Can I use Navigator to work with my existing datasets? Absolutely! Navigator is designed to assist in editing and augmenting existing datasets. You can fill in missing values, make corrections, or extend your datasets using natural language prompts.
Is Navigator a model or an application? It's actually both. Navigator is a compound AI system that leverages multiple transformer-based models, including Gretel's own fine-tuned LLM.
How does Gretel Navigator overcome the limitations of traditional LLMs in data generation tasks? Traditional LLMs are limited by their context windows and struggle with tasks that exceed these limits or require precise mathematical operations. Gretel Navigator overcomes these by using an agent-based approach that plans tasks, delegates operations beyond the scope of LLMs, and ensures high-quality output without the complexities for the user.
Can I run Gretel Navigator in my own cloud or VPC? Currently, Navigator runs inside Gretel's managed cloud. We are working to make it available in any public cloud, including AWS, Azure, and Google Cloud, through a serverless offering. Contact us if you have questions or need any additional details.
What else is coming for Navigator? Data quality and diversity, as well as advanced agent capabilities and some LLM model updates, are still under development.
Are there safety checks for prompts submitted to Navigator?
What data sources is Navigator trained on? Navigator is trained on high-quality, structured and semi-structured tabular datasets with permissible licenses that have been curated and organized across over 20 industry verticals, including Healthcare, Biotechnology, Finance, Telecommunications, Government, Pharma, Retail, and others. Goals of model training include familiarizing the model with industry-specific dataset formats, teaching data correlations found in analytics and machine learning datasets, and improving task performance for filling in missing values, cleaning data, and generating data at scale for analytics and machine learning use cases.
What large language model (LLM) does Gretel Navigator use for generating tabular data? Gretel Navigator uses a mixture of expert models including foundation models and Gretel's fine-tuned model specialized in generating tabular data. Data generation requests may utilize a combination of models to compare and optimize performance.
Can you share the details of each LLM that Gretel Navigator uses? Certainly! There are currently five options available for customers:
An easy way to try out Navigator (for free!) is to start in the Gretel Console.
Try out an example prompt in the playground, or use your own. Then click "Generate".
Click the 3 dots to download your dataset, or click "Batch Data" to generate more than 100 records
You can use Navigator operationally through the Gretel SDK.
If you're feeling stuck, you may find the following use cases helpful to get started.
Let’s say we want to generate data that represents consumer packaged goods inventory.
We can workshop prompt ideas first using the "Natural language" option of Navigator. You can find this tab in the playground.
Try asking:
What are common headers of a consumer packaged goods (CPG) inventory dataset?
or
Help me write a prompt for a large language model (LLM) to create a dataset that represents consumer packaged goods (CPG) inventory
We can use the responses to start with a prompt and then narrow it down to be more specific for the data we're looking for.
This feature is available in the playground as well as the SDK. In playground, select the option to add columns to an existing dataset.
Upload your dataset (csv or jsonl format), then use the prompt template to describe the new columns you want to add.
To do this via the SDK, make sure to write a clear prompt describing the new column you want to add to your data.
Select "Add columns to existing datasets"
Upload your CSV or JSON(L) file into the box
Ensure the uploaded file looks correct in the output section
Edit the prompt as appropriate to add the columns you'd like, with detailed description of rules for generating each column as appropriate
Click generate
Connect to your PostgreSQL databases.
Prerequisites to create a PostgreSQL-based workflow. You will need:
A source PostgreSQL connection.
(optional) A list of tables OR SQL queries.
(optional) A destination PostgreSQL connection.
For the source database connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.
Do not use your input database connection as an output connector. This action can result in the unintended overwriting of existing data.
A postgres
connection is created using the following parameters:
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example PostgreSQL connection:
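The snippet below is a hypothetical sketch using the postgres parameters described in this guide; values are placeholders, and the config/credentials split is an assumption.

```yaml
# Hypothetical PostgreSQL connection credentials file (all values are placeholders).
type: postgres
name: my-postgres-connection
config:
  username: john
  host: myserver.example.com
  port: 5432
  database: mydatabase
  schema: public
credentials:
  password: "..."
```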
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - PostgreSQL.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The postgres_source
action reads data from your PostgreSQL database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the postgres_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the postgres_destination
action.
The postgres_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the postgres_source
action always provides a single output, dataset
.
The output of a postgres_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The postgres_destination
action can be used to write gretel_tabular
action outputs to PostgreSQL destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the postgres_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using a schema inferred from the input dataset.
When the schema is inferred from the input dataset, certain column types or constraints may not be maintained from the source table. If you want to maintain the same schema from your source database, please use sync mode truncate
.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity, causing the action to fail.
Create a synthetic version of your PostgreSQL database.
The following config will extract the entire database, train and run a synthetic model, then write the outputs of the model back to a destination PostgreSQL database while maintaining referential integrity.
Create a synthetic version of selected tables from your PostgreSQL database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination PostgreSQL database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your PostgreSQL database
The following config will execute a SQL query against your PostgreSQL database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
Read from and write to BigQuery.
Prerequisites to create a BigQuery-based workflow. You will need:
A source BigQuery connection.
(optional) A list of tables OR SQL queries.
(optional) A destination BigQuery connection.
Do not use your input data warehouse connection as an output connector. This action can result in the unintended overwriting of existing data.
Google BigQuery related actions require creating a bigquery
connection. The connection must be configured with the correct permissions for each Gretel Workflow Action.
For specific permissions, please refer to the Minimum Permissions section under each corresponding action.
Gretel bigquery
connections require the following fields:
In order to generate a private key you will first need to create a service account, and then download the key for that service account.
After the service account has been created, you can attach dataset specific permissions to the service account.
Please see each action's Minimum Permissions section for a list of permissions to attach to the service account.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example BigQuery connection credential file:
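The snippet below is a hypothetical sketch built from the bigquery connection fields described in this section; all values are placeholders, and the placement of private_json_key under credentials is an assumption.

```yaml
# Hypothetical BigQuery connection credentials file (all values are placeholders).
type: bigquery
name: my-bigquery-connection
config:
  connection_target_type: source
  project_id: my-project-id
  service_account_email: service-account-name@my-project-id.iam.gserviceaccount.com
  dataset: my-dataset-name
credentials:
  private_json_key: >
    { "type": "service_account", "project_id": "my-project-id", ... }
```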
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - BigQuery.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
The bigquery_source
action reads data from your BigQuery dataset. It can be used to extract:
the entire dataset, OR
selected tables from the dataset, OR
the results of SQL query/queries against the dataset.
Each time the workflow is run the source action will extract the most recent data from the source database.
The bigquery_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Dataset
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire dataset, selected tables, or querying against a dataset, the bigquery_source
action always provides a single output, dataset
.
The output of a bigquery_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a dataset.
The associated service account must have the following permissions for the configured dataset:
bigquery.datasets.get
The bigquery_destination
action can be used to write gretel_tabular
action outputs to BigQuery destination datasets.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the bigquery_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
. Each sync mode will configure a write and create disposition that determines how rows are inserted, and how destination tables are created.
When sync.mode
is configured with truncate
Records are written with WRITE_TRUNCATE
The destination table must already exist in the destination dataset.
When sync mode is configured with replace
Records are written with WRITE_TRUNCATE
The destination table is created if necessary with CREATE_IF_NEEDED
When sync.mode
is configured with append
Records are appended with WRITE_APPEND
The destination table is created if necessary with CREATE_IF_NEEDED
The associated service account must have the following permissions for the configured dataset:
bigquery.datasets.create
bigquery.datasets.delete
(supports replacing an existing file in the dataset)
Example Destination Action YAML
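For orientation only, a BigQuery destination step might look roughly like the sketch below; the layout and reference syntax are assumptions based on the inputs and sync modes described above.

```yaml
# Hypothetical fragment of a workflow config (layout and reference syntax are assumed).
actions:
  - name: bigquery-write
    type: bigquery_destination
    connection: c_2                       # ID of your destination BigQuery connection
    input: synthesize                     # name of the upstream gretel_tabular action
    config:
      sync:
        mode: append                      # rows appended (WRITE_APPEND); table created if needed
      dataset: '{outputs.synthesize.dataset}'
```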
Create a synthetic version of your BigQuery dataset.
The following config will extract the entire BigQuery dataset, train and run a synthetic model, then write the outputs of the model back to a destination BigQuery dataset while maintaining referential integrity.
Create a synthetic version of selected tables from your BigQuery dataset
The following config will extract two tables from your dataset, train and run a synthetic model, then write the outputs of the model back to a destination BigQuery dataset while maintaining any key relationships between the tables.
Create a synthetic version of table(s) formed by querying your BigQuery dataset and write to Google Cloud Storage
The following config will execute a SQL query against your BigQuery dataset to create a table containing data from across the dataset. Then, it will train and run a synthetic model to generate a synthetic table. Finally, the generated data will be written to a Google Cloud Storage bucket.
Connect to your Snowflake Data Warehouse.
Prerequisites to create a Snowflake-based workflow. You will need:
A source Snowflake connection.
(optional) A list of tables OR SQL queries.
(optional) A destination Snowflake connection.
Do not use your input data warehouse connection as an output connector. This action can result in the unintended overwriting of existing data.
There are two ways to authenticate a Gretel Snowflake connection; each method requires different fields when creating the connection:
A snowflake
connection authenticated via username/password is created using the following parameters:
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Snowflake connection:
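The snippet below is a hypothetical sketch of a username/password Snowflake connection file built from the parameters in this guide; values are placeholders, and the config/credentials split is an assumption.

```yaml
# Hypothetical Snowflake connection credentials file (all values are placeholders).
type: snowflake
name: my-snowflake-connection
config:
  host: account_identifier.snowflakecomputing.com
  username: john
  database: MY_DATABASE
  warehouse: MY_WAREHOUSE
  schema: MY_SCHEMA
  params: role=MY_ROLE
credentials:
  password: "..."
```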
Now that you've created the credentials file, use the CLI to create the connection
Click the New Connection
button.
Step 1, choose the Type for the Connection - Snowflake.
Step 2, choose the Project for your Connection.
Step 3, fill in the credentials and select Add Connection
.
External OAuth is currently only supported via CLI/SDK.
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example Snowflake External OAuth connection:
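The snippet below is a hypothetical sketch of an External OAuth connection file using the OAuth fields listed in this guide; every value, including the OAuth endpoint and scope, is a placeholder, and the field placement is an assumption.

```yaml
# Hypothetical Snowflake External OAuth connection file (all values are placeholders).
type: snowflake
name: my-snowflake-connection
config:
  host: account_identifier.snowflakecomputing.com
  username: john
  database: MY_DATABASE
  warehouse: MY_WAREHOUSE
  oauth_client_id: my-oauth-client-id
  oauth_grant_type: password
  oauth_scope: my-oauth-scope
  oauth_url: https://idp.example.com/oauth2/token
credentials:
  password: "..."
```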
Now that you've created the credentials file, use the CLI to create the connection
The Snowflake source action requires enough access to read from tables and access schema metadata. The following SQL script will create a Snowflake user suitable for a Gretel Snowflake source.
The Snowflake destination action requires enough permissions to write to the destination schema.
If your destination database and schema do not already exist, create those first.
Next configure a user for the Snowflake destination. This user must have OWNERSHIP
permissions in order to write data to the destination schema.
The following SQL script will create a Snowflake user suitable for a Gretel Snowflake destination.
The snowflake_source
action reads data from your Snowflake database. It can be used to extract:
an entire database, OR
selected tables from a database, OR
the results of SQL query/queries against a database.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the snowflake_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the snowflake_destination
action.
The snowflake_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Database
Example Source Action YAML
Selected Tables
Example Source Action YAML
SQL Query/Queries
Example Source Action YAML
Whether you are extracting an entire database, selected tables, or querying against a database, the snowflake_source
action always provides a single output, dataset
.
The output of a snowflake_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a database.
The snowflake_destination
action can be used to write gretel_tabular
action outputs to Snowflake destination databases.
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the snowflake_destination
action always takes the same input, dataset
.
Example Destination Action YAML
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
.
When sync.mode
is configured with truncate
, records are first truncated from the destination table using the TRUNCATE TABLE
command.
When sync mode is configured with truncate
the destination table must already exist in the database.
When sync.mode
is configured with replace
, the destination table is first dropped and then recreated using the schema from the source table.
If the source table is from Snowflake, the DDL is extracted using the GET_DDL
metadata function. If the source table is from a non-Snowflake source, the destination table schema is inferred based on the column types of the database.
When sync mode is configured with replace
the destination table does not need to exist in the destination.
When sync.mode
is configured with append
, the destination action will simply insert records into the table, leaving any existing records in place.
When using the append
sync mode, referential integrity is difficult to maintain. It's only recommended to use append
mode when syncing ad hoc queries to a destination table.
If append
mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity.
Create a synthetic version of your Snowflake database.
The following config will extract the entire Snowflake database, train and run a synthetic model, then write the outputs of the model back to a destination Snowflake database while maintaining referential integrity.
Create a synthetic version of selected tables from your Snowflake database
The following config will extract two tables from your database, train and run a synthetic model, then write the outputs of the model back to a destination Snowflake database while maintaining any key relationships between the tables.
Create a synthetic version of a dataset formed by querying your Snowflake database
The following config will execute a SQL query against your Snowflake database to create a table containing data from across the database. Then, it will train and run a synthetic model to generate a synthetic table.
A glob filter may be used to match file names matching a specific pattern. Please see the for more details.
A containing file and table representations of the found objects.
Data to write to the file. This should be a to the output from a previous action.
A glob filter may be used to match file names matching a specific pattern. Please see the for more details.
A containing file and table representations of the found objects.
Data to write to the file. This should be a to the output from a previous action.
A glob filter may be used to match file names matching a specific pattern. Please see the for more details.
A containing file and table representations of the found objects.
Data to write to the file. This should be a to the output from a previous action.
A to the data extracted from the database, including tables and relationships/schema.
A to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
A to the data extracted from the database, including tables and relationships/schema.
A to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
Navigate to the Connections page using the menu item in the left sidebar.
If you encounter issues towards the end of batch generation, consider generating in smaller batches or , which can give you finer control over error handling
If you're having issues with a relatively small prompt, reach out to
and select "Navigator".
More advanced features such as data editing and augmentation are available via the Gretel SDK. Get started with a .
Use and ping us on the on Discord!
If you encounter issues when using Navigator, reach out to us at
Navigate to the Connections page using the menu item in the left sidebar.
Check out this which demonstrates an end to end flow of running a Workflow using the Databricks Connector. Optionally, to run this notebook on Databricks, you can directly into Databricks by providing the URL to the notebook.
How can I get started? Log in or create a free Gretel account, and access Navigator here: or using the SDK through your .
How can I provide feedback or report bugs for Navigator? Your input is crucial. Please use , file requests or bugs through the console or or join the to share feedback and communicate directly with our team.
How much does Navigator cost? Navigator is billed by character input and output. 1 Gretel credit = 100,000 characters. Every user receives 15 free credits monthly, which is the equivalent of 1.5 million characters free! Learn more about character counting , and about credits and pricing .
How can I learn to use Navigator effectively? Start with the , , and and . You can also reach out to us if you have more questions.
At Gretel, we are committed to promoting fair and equitable use of our AI systems. We firmly stand against any hateful, discriminatory, or otherwise harmful content. All prompts submitted to Navigator undergo safety and alignment checks to ensure they adhere to our guidelines, utilizing the safety checks built into the LLMs. Content flagged as potentially harmful will be reviewed by our security team, and violators may have their access revoked. We take these measures seriously to maintain a safe and respectful environment for all users. For more information on what constitutes acceptable use, please visit our guidelines at .
and select "Navigator".
Pro tip: review the page to get the best results from your input
Create a if you don't already have one in order to get your API key.
Start from an or create your own
Pro tip: After submitting a prompt to Navigator, you can further refine the results by adding sample data from the output. Select "Add an example to improve result" and choose "Import current output". You can make edits to the output to match what you're looking for.
For best results, describe the data you're looking to create in a clear manner: like using bullet points and clear descriptions for each column. Review the doc for best practices.
You can see an example of adding columns in the .
Navigate to the Connections page using the menu item in the left sidebar.
Navigate to the Connections page using the menu item in the left sidebar.
When combined in a workflow, the data extracted from the bigquery_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the bigquery_destination
action. Your generated data can also be written to , for more information see .
The BigQuery destination action uses a to write records into destination tables.
For more information on how job dispositions behave, please reference writeDisposition
and createDisposition
from .
You can also write your output dataset to an object storage connection like . Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the {object_storage}_destination
action always takes the same inputs - filename
and input
, and path
. Additionally, S3 and GCS take bucket
while Azure Blob takes container
.
Navigate to the Connections page using the menu item in the left sidebar.
A snowflake connection authenticated via External OAuth is created using the following parameters:
name: Display name of your choosing used to identify your connection within Gretel. Example: my-mssql-connection
username: Unique identifier associated with the specific account authorized to access the database. Example: john
password: Security credential to authenticate the username. Example: ...
host: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: myserver.example.com
port: Port number; if left empty, the default value of 1433 will be used. Example: 1433
database: Name of the database to connect to. Example: mydatabase
schema: (optional) Name of the specific schema. Example: dbo
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: TrustServerCertificate=True
Type
mssql_source
Connection
mssql
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and relationships/schema.
Type
mssql_destination
Connection
mssql
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
project_id
The project id the artifact is located in.
artifact_id
The id of the artifact to read.
dataset
A dataset with exactly one item (the project artifact) represented as both a file and table.
project_id
The project to create the artifact in.
artifact_name
The name of the artifact.
data
Reference to a data handle.
name: Display name of your choosing used to identify your connection within Gretel. Example: my-databricks-connection
server_hostname: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: account_identifier.cloud.databricks.com
http_path: The HTTP path of the cluster. Example: /sql/1.0/warehouses/foo
personal_access_token: Security credential to authenticate the Databricks account (36 characters). Example: dapi....
catalog: Name of the catalog to connect to. Example: MY_CATALOG
schema: Name of the schema. Example: MY_SCHEMA
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: role=MY_ROLE
Type
databricks_source
Connection
databricks
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and relationships/schema.
Type
databricks_destination
Connection
databricks
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
volume
Unity Catalog volume where the destination data will be staged temporarily before writing to tables. A volume name must be specified in the destination action YAML in order for the write to succeed.
auto
Auto-selected model
This setting automatically selects the best model from the list below to generate high-quality data at scale. Note: please read each description carefully to understand specific constraints of each model and, if applicable, to make a different model selection when using Navigator to best suit your use case.
Gretel Custom Model (Industry fine-tuned)
gretelai/Mistral-7B-Instruct-v0.2/industry
Gretel's proprietary LLM
Gretel's proprietary model is based on Mistral-7b and fine-tuned by Gretel on curated and synthetic industry-specific datasets from 10+ verticals. Data generated from this LLM is owned by the user and can be used for any downstream task without licensing concerns.
Gretel Llama-3.1-8B-Instruct
gretelai/Llama-3.1-8B-Instruct
Gretel's LLM + Llama 3.1 model
Built with Llama 3.1. Gretel's LLM and Llama 3.1 are both used in this option, which offers high quality and data available for commercial use. For more information, please see the Llama 3.1 official license and policy on GitHub.
Gretel Azure GPT-3.5 Turbo
gretelai-azure/gpt-3.5-turbo
Gretel's LLM + Azure OpenAI models
Gretel's LLM along with Azure OpenAI models are both leveraged. This option offers excellent free text capabilities and speed, but data generated from this model may have certain restrictions. Please see Azure's documentation for possible restrictions.
Gretel Google Gemini Pro
gretelai-google/gemini-pro
Gretel's LLM + Google Gemini Pro models
Both Gretel's LLM along with Google Gemini Pro models are leveraged in this option. This option offers excellent free text capabilities and speed, but data generated from this model may have certain restrictions. Please read Google's documentation to understand possible restrictions.
name: Display name of your choosing used to identify your connection within Gretel. Example: my-postgres-connection
username: Unique identifier associated with the specific account authorized to access the database. Example: john
password: Security credential to authenticate the username. Example: ...
host: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: myserver.example.com
port: Port number; if left empty, the default value of 5432 will be used. Example: 5432
database: Name of the database to connect to. Example: mydatabase
schema: (optional) Name of the specific schema. Example: public
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: TrustServerCertificate=True&useSSL=false
Type
postgres_source
Connection
postgres
sync.mode
full
- extracts all records from tables in database
(coming soon) subset
- extract percentage of records from tables in database
sync.mode
full
- extracts all records from selected tables in database
(coming soon) subset
- extract percentage of records from selected tables in database
Sequence of mappings that lists the table(s) in the database to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected database
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and relationships/schema.
Type
postgres_destination
Connection
postgres
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
name: Display name of your choosing used to identify your connection within Gretel. Example: my-bigquery-connection
connection_target_type: source or destination, depending on whether you want to read data from or write data to the connection. Example: source
project_id: ID of the Google project containing your dataset. Example: my-project-id
service_account_email: The service account email associated with your private key. Example: service-account-name@my-project-id.iam.gserviceaccount.com
dataset: Name of the dataset to connect to. Example: my-dataset-name
private_json_key: Private key JSON blob used to authenticate Gretel. Example: { "type": "service_account", "project_id": "my-project-id", "private_key_id": "Oabc1def2345678g90123h456789012h34561718", "private_key": "-----BEGIN PRIVATE KEY-----/ ... }
Type
bigquery_source
Connection
bigquery
sync.mode
full
- extracts all records from tables in dataset
(coming soon) subset
- extract percentage of records from tables in dataset
sync.mode
full
- extracts all records from selected tables in dataset
(coming soon) subset
- extract percentage of records from selected tables in dataset
Sequence of mappings that lists the table(s) in the dataset to extract. name
- table name
name
- name of query; will be treated as name of resulting table
query
- SQL statement used to query connected dataset
Additional name
and query
mappings can be provided to include multiple SQL queries
dataset
A reference to the data extracted from the database, including tables and (if defined) relationships/schema.
Type
bigquery_destination
Connection
bigquery
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
name: Display name of your choosing used to identify your connection within Gretel. Example: my-snowflake-connection
host: Fully qualified domain name (FQDN) used to establish the connection to the database server. Example: account_identifier.snowflakecomputing.com
username: Unique identifier associated with the specific account authorized to access the database. Example: john
password: Security credential to authenticate the username. Example: ...
database: Name of the database to connect to. Example: MY_DATABASE
warehouse: Name of the warehouse. Example: MY_WAREHOUSE
schema: (optional) Name of the schema. Example: MY_SCHEMA
params: (optional) JDBC URL parameters that can be used for advanced configuration. Example: role=MY_ROLE
name: Display name of your choosing, used to identify your connection within Gretel. Example: my-snowflake-connection
host: Fully qualified domain name (FQDN) used to establish a connection to the database server. Example: account_identifier.snowflakecomputing.com
username: Unique identifier associated with the account authorized to access the database. Example: john
password: Security credential used to authenticate the username. Example: ...
database: Name of the database to connect to. Example: MY_DATABASE
warehouse: Name of the warehouse. Example: MY_WAREHOUSE
oauth_client_id: Unique identifier associated with the authentication application.
oauth_grant_type: Method through which the OAuth token will be acquired. Example: password
oauth_scope: Scope given to the requested token.
oauth_url: Endpoint from which the access token is fetched.
(optional) schema: Name of the schema. Example: MY_SCHEMA
(optional) params: JDBC URL parameters that can be used for advanced configuration. Example: role=MY_ROLE
Type: snowflake_source
Connection: snowflake
sync.mode: full - extracts all records from tables in the database; subset (coming soon) - extracts a percentage of records from tables in the database
sync.mode: full - extracts all records from selected tables in the database; subset (coming soon) - extracts a percentage of records from selected tables in the database
Table mappings: a sequence of mappings that lists the table(s) in the database to extract, where name is the table name.
Query mappings: name is the name of the query and is treated as the name of the resulting table; query is the SQL statement used to query the connected database. Additional name and query mappings can be provided to include multiple SQL queries.
dataset: a reference to the data extracted from the database, including tables and relationships/schema.
Type: snowflake_destination
Connection: snowflake
dataset: a reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode: replace - overwrites any existing data in the table(s) at the destination; append - adds generated data to existing table(s) (only supported for query-created tables without primary keys)
Real-time data generation with Gretel Navigator
The previous sections on the Gretel SDK were focused on running batch jobs, which are project-based and do not support real-time interaction. In this section, we will introduce the Navigator inference API, which makes it easy to generate high-quality synthetic tabular and text data – in real time – with just a few lines of code, powered by Gretel Navigator.
Navigator currently supports two data generation modes: tabular
and natural_language
. In both modes, you can choose the backend model that powers the generation, which we'll describe in more detail below.
The Gretel object has a factories
attribute that provides helper methods for creating new objects that interact with Gretel's non-project-based APIs. Let's use the factories
attribute to fetch the available backend models that power Navigator's tabular
data generation:
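As a minimal sketch (the helper name get_navigator_model_list follows the SDK reference; treat it as an assumption if your SDK version differs):

```python
from gretel_client import Gretel

gretel = Gretel(api_key="prompt")

# List the backend models available for Navigator's tabular mode
print(gretel.factories.get_navigator_model_list("tabular"))
```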
This will print the list of available models. The first is gretelai/auto, which automatically selects the current default model; the default will change over time as models continue to evolve.
To initialize the Navigator Tabular inference API, we use the initialize_navigator_api
method. Then, we can generate synthetic data in real time using its generate
method:
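For example, using the gretel object created above (the prompt text, column names, and record count are placeholders):

```python
# Initialize the Navigator Tabular inference API
tabular = gretel.factories.initialize_navigator_api("tabular", backend_model="gretelai/auto")

# Generate synthetic tabular data in real time from a natural-language prompt
df = tabular.generate(
    prompt="Generate customer support tickets with columns: ticket_id, topic, description, priority.",
    num_records=25,
)
print(df.head())
```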
You can augment an existing dataset using the edit
method:
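A sketch of editing an existing dataset, assuming df is a pandas DataFrame and that edit accepts a seed_data argument (an assumption based on the SDK reference):

```python
# Add a new column to an existing dataset using a natural-language instruction
df_edited = tabular.edit(
    prompt="Add a column named 'sentiment' with values positive, neutral, or negative.",
    seed_data=df,
)
```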
Finally, Navigator's tabular
mode supports streaming data generation. To enable streaming, simply set the stream
parameter to True
:
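A streaming sketch; here we assume that with stream=True the generate method yields records as they are produced:

```python
# Stream records as they are generated instead of waiting for the full table
for record in tabular.generate(
    prompt="Generate e-commerce product listings with name, category, and price.",
    num_records=100,
    stream=True,
):
    print(record)
```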
Navigator's natural_language
mode gives you access to state-of-the-art LLMs for generating text data. Let's fetch the available backend models that power Navigator's natural_language
data generation:
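Along the same lines as the tabular example above (again assuming the get_navigator_model_list helper):

```python
# List the backend models available for Navigator's natural_language mode
print(gretel.factories.get_navigator_model_list("natural_language"))
```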
Similar to the tabular
mode, this will print the list of available models, the first of which will be gretelai/gpt-auto
, which automatically selects the current default model.
To initialize the Navigator Natural Language inference API, we again use the initialize_navigator_api
method. Then, we can generate synthetic text data in real time using its generate
method:
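A sketch of real-time text generation; the temperature and max_tokens parameters are assumptions about the supported generation settings:

```python
# Initialize the Navigator Natural Language inference API
llm = gretel.factories.initialize_navigator_api("natural_language", backend_model="gretelai/gpt-auto")

# Generate synthetic text in real time
text = llm.generate(
    "Write a short product review for a pair of noise-cancelling headphones.",
    temperature=0.7,   # assumed optional sampling parameter
    max_tokens=150,    # assumed optional length limit
)
print(text)
```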
Documentation for the batch job SDK, for using Navigator at scale.
Initialize Navigator Batch with a model config:
Use these helper functions to incorporate the batch SDK into your own workflows:
Example:
A one-stop shop for interacting with Gretel’s APIs, models, and artifacts
The Gretel
object provides a streamlined interface for Gretel's SDK:
Your Gretel session is configured upon instantiation of a Gretel
object. To customize your session (e.g., with custom endpoints for a Hybrid deployment), pass any keyword argument of the configure_session function to the Gretel
initialization method:
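For example (the endpoint shown is the default Gretel Cloud endpoint and is included only to illustrate where a custom or Hybrid endpoint would go):

```python
from gretel_client import Gretel

gretel = Gretel(
    project_name="sdk-docs",              # optional: bind this instance to a project
    api_key="prompt",                      # prompt for (and cache) your API key
    endpoint="https://api.gretel.cloud",   # any configure_session keyword works here
    validate=True,                         # verify the credentials on instantiation
)
```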
Each Gretel
instance can be bound to a single project. This is relevant when you submit project-based jobs like training or generating synthetic data with a Gretel Model.
You have three options for setting the current project:
Use the project_name
keyword argument when you instantiate a Gretel
object, as we demonstrated above. If the project does not exist, a new one will be created.
Use the set_project
method. For example, gretel.set_project("sdk-docs")
. Again, if the project does not exist, a new one will be created.
Do not set the project. In this case, a random project will be created if/when you run a submit_*
method. This behavior is described in the Train and Generate Jobs section.
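As a quick sketch of the three options above (the project names are placeholders):

```python
# Option 1: bind a project at instantiation (created if it does not exist)
gretel = Gretel(project_name="sdk-docs", api_key="prompt")

# Option 2: set or switch the project later (also created if it does not exist)
gretel.set_project("sdk-docs")

# Option 3: set no project; a randomly named project is created when a submit_* method runs
gretel = Gretel(api_key="prompt")
```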
Project names must be unique, but display names do not need to be, so your project name will often differ from its display name. The display name is what is surfaced in the UI.
To look up your project name in the Console:
Navigate to the Projects tab
Select your project
Click on "Settings", down and to the right of the project name
The box labeled "Name" is your project name.
Data Designer is a general purpose system for building datasets to improve your AI models. Developers can describe the attributes of the dataset they want and iterate on the generated data through fast previews and detailed evaluations.
With Data Designer, you get:
Speed: Generate preview datasets in minutes, production datasets in hours
Quality: Built-in evaluation metrics ensure accuracy and relevance
Simplicity: Automated workflows replace complex manual processes
Scale: Move from proof-of-concept to production without rebuilding
Data-centric AI: Unlock true data experimentation with rapid iteration on use-case-specific data.
Learn how to use Data Designer by exploring the YAML and SDK configuration docs below.
If you're looking for hands-on examples, check out our Example Notebooks, where you'll find:
✅ Structured Outputs – Generate complex, nested synthetic data ✅ Evaluation Sets – Create high-quality AI evaluation datasets ✅ Multi-turn Chat – Build user-assistant dialogue datasets ✅ Text-to-SQL & Text-to-Code – Generate SQL & Python code datasets
💡 Check out the full list and interactive notebooks in our Example Notebooks section →
Start building with synthetic data in just 3 lines of code 🚀
The SDK's high-level interface makes interacting with Gretel's APIs simple and intuitive. Training state-of-the-art deep generative models from scratch only takes a few lines of code:
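For example (the data path is a placeholder for your own training file):

```python
from gretel_client import Gretel

gretel = Gretel(project_name="my-project", api_key="prompt")
trained = gretel.submit_train("tabular-actgan", data_source="path/to/your/training_data.csv")
```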
In this section, we will provide an overview of the key concepts you need to know to start building with the high-level interface:
The high-level SDK interface is built on top of the lower-level Gretel Python SDK. This means it is compatible with all existing code, and the lower-level SDK can always be used for features that are not yet covered by the high-level SDK.
Note that the high-level interface is in active development, and it currently only supports Synthetics. We have plans to add support for Transform and Workflows soon.
Quickly assess the quality of your synthetic data
When you train a model, Gretel automatically creates a Synthetic Data Quality Report to help you assess how well the synthetic data captures the statistical properties of the training data.
The report is stored as an attribute of the returned job-results object:
The report attribute is itself an object with useful methods and attributes:
You can download the synthetic data used in the report as follows:
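A sketch putting these pieces together; the report attribute and method names below (quality_scores, display_in_notebook, fetch_report_synthetic_data) follow the high-level SDK reference and should be treated as assumptions if your version differs:

```python
# Summary of the Synthetic Data Quality Report
print(trained.report)                      # high-level quality summary
print(trained.report.quality_scores)       # per-metric scores (assumed attribute name)
trained.report.display_in_notebook()       # render the full report in a notebook

# Download the synthetic data that was used to create the report
df_report_synth = trained.fetch_report_synthetic_data()
```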
Methods for submitting jobs to Gretel workers
With the Gretel
object instance ready to go, you can use its submit_*
methods to submit model training and data generation jobs. Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.
The submit_train
method submits a model training job based on the given model configuration. The data source for the training job is passed in using the data_source
argument and may be a file path or pandas DataFrame
:
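For example, passing a pandas DataFrame as the training data (the file path is a placeholder):

```python
import pandas as pd

df_train = pd.read_csv("path/to/training_data.csv")   # placeholder path

trained = gretel.submit_train(
    base_config="tabular-actgan",   # or a path to a custom config file
    data_source=df_train,           # file path or pandas DataFrame
)
```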
We trained an ACTGAN model by setting base_config="tabular-actgan"
. You can replace this base config with the path to a custom config file, or you can select any of the config names listed here (excluding the .yml
extension). The returned trained
object is a dataclass
that contains the training job results such as the Gretel model object, synthetic data quality report, training logs, and the final model configuration.
The base configuration can be modified using keyword arguments with the following rules:
Nested model settings can be passed as keyword arguments in the submit_train
method, where the keyword is the name of the config subsection and the value is a dictionary with the desired subsection's parameter settings. For example, this is how you update settings in ACTGAN's params
and privacy_filters
subsections, where epochs
, discriminator_dim
, similarity
, and outliers
are nested settings:
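For example (the specific parameter values here are illustrative, not recommendations):

```python
trained = gretel.submit_train(
    base_config="tabular-actgan",
    data_source=df_train,
    # nested settings are passed as dicts keyed by config subsection
    params={"epochs": 800, "discriminator_dim": [1024, 1024, 1024]},
    privacy_filters={"similarity": "high", "outliers": "medium"},
)
```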
Non-nested model settings can be passed directly as keyword arguments in the submit_train
method. For example, this is how you update Gretel GPT's pretrained_model
and column_name
, which are not nested within a subsection:
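For example (the base config name, model name, and column name are illustrative and should be checked against the current config templates):

```python
trained = gretel.submit_train(
    base_config="natural-language",       # Gretel GPT base config (assumed name)
    data_source=df_text,                  # DataFrame containing a "text" column (placeholder)
    # non-nested settings are passed directly as keyword arguments
    pretrained_model="gretelai/mpt-7b",   # illustrative pretrained model
    column_name="text",                   # column containing the training text
)
```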
Once you have models in your Gretel Project, you can use any of them to generate synthetic data using the submit_generate
method:
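For example, generating 1,000 records from the model trained above (the synthetic_data attribute name follows the SDK reference; treat it as an assumption if your version differs):

```python
generated = gretel.submit_generate(trained.model_id, num_records=1000)

# The returned dataclass contains the generated synthetic data
df_synth = generated.synthetic_data
print(df_synth.head())
```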
Above we use the model_id
attribute of a completed training job, but you are free to use the model_id
of any model within the current project. If the model has additional generate
settings (e.g., temperature
when generating text), you can pass them as keyword arguments to the submit_generate
method. The returned generated
object is a dataclass
that contains results from the generation job, including the generated synthetic data.
In the previous example, we unconditionally generated num_records
records. To conditionally generate synthetic data, use the seed_data
argument:
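A sketch of conditional generation; the column name below is a placeholder for whichever field you want to seed:

```python
import pandas as pd

# Seed 50 records where the seeded column's value is "seed"
seed = pd.DataFrame({"my_seed_column": ["seed"] * 50})   # placeholder column name

generated = gretel.submit_generate(trained.model_id, seed_data=seed)
```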
The code above will conditionally generate 50 records in which the seeded field's value is "seed".
If you do not want to wait for a job to complete, you can set wait=False
when calling submit_train
or submit_generate
. In this case, the method will return immediately after the job starts:
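For example:

```python
# Submit the training job and return immediately without waiting for completion
trained = gretel.submit_train("tabular-actgan", data_source=df_train, wait=False)
```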
Some things to know if you use this option:
You can still monitor the job progress in the Gretel Console.
You can check the job status using the job_status
attribute of the returned object: print(trained.job_status)
.
You can continue waiting for the job to complete by calling the wait_for_completion
method of the returned object: trained.wait_for_completion()
.
If you are not waiting when the job completes, you must call the refresh
method of the returned object to fetch the job results: trained.refresh()
.
Our Transform product allows you to remove PII from data, and you can submit these transform jobs from the high-level SDK. The default behavior is to use a model to classify the data and then replace the detected entities with fake values.
You can fetch results from previous training and generation jobs using the fetch_*_job_results
methods:
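For example (the method names follow the high-level SDK reference; the IDs are placeholders):

```python
# Fetch results from a previous training job
trained = gretel.fetch_train_job_results(model_id="MODEL_ID")

# Fetch results from a previous generation job
generated = gretel.fetch_generate_job_results(model_id="MODEL_ID", record_id="RECORD_ID")
```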
To fetch transform results, you can do the following; the transformed output can also be accessed as a DataFrame:
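As a sketch only: the transform-specific method and attribute names below are assumptions patterned after the other fetch_* methods, not confirmed API, so check the SDK reference for the exact names:

```python
# Hypothetical names for illustration; consult the SDK reference for the exact API
transform_result = gretel.fetch_transform_results(model_id="MODEL_ID")   # assumed method name
df_transformed = transform_result.transformed_df                          # assumed attribute name
```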
The Evaluate job analyzes the quality of synthetic data and generates the Data Quality Report.
The submit_evaluate method submits an Evaluate job based on the given evaluate model configuration. The data source for the job (typically your synthetic data) is passed in using the data_source argument, and the original (reference) data is passed with ref_data; both may be a file path or a pandas DataFrame:
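A sketch of a basic Evaluate job; the base config name is an assumption based on the published config templates:

```python
evaluated = gretel.submit_evaluate(
    base_config="evaluate/default",   # assumed evaluate config name
    data_source=df_synth,             # synthetic data to evaluate
    ref_data=df_train,                # original (reference) data
)
```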
The test (holdout) data source for the membership inference attack (MIA) metric is passed with the optional test_data argument; it may be a file path or a pandas DataFrame:
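Extending the previous sketch with a holdout set:

```python
evaluated = gretel.submit_evaluate(
    base_config="evaluate/default",
    data_source=df_synth,
    ref_data=df_train,
    test_data=df_holdout,   # optional holdout set used for the MIA-based privacy metrics
)
```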
All about inference pricing
Navigator (and all Gretel inference APIs) is billed by characters. Both input (prompt and input data) and output characters count toward usage.
Navigator pricing is as follows:
1 Gretel credit = 100,000 characters. This character count includes both input and output.
Every Gretel user receives 15 free credits each month, which amounts to 1.5 million free characters.
All inference is billed by character. This includes the playground (which you can find in the Gretel Console) as well as the inference SDK and the Navigator batch job SDK.
We log characters and amount billed for each inference call so that it's easy for you to track.
In the playground, go to the Logs section on the right-hand side.
To track usage and billing for batch jobs and inference SDK, visit the Usage page for your account. For more information on billing and usage, visit this page
Input and output characters both count toward what is billed, and we round to the nearest 10 characters when billing. Pricing follows the characters-per-credit rate shown above.
If you have any questions, you can reach out to us at support@gretel.ai or visit our pricing page
Video tutorials and walkthroughs for popular use cases with Gretel Navigator
Here are some use case videos to get started:
Use case-based notebooks for Gretel Navigator.
Follow these links for guides on:
The Data Designer configuration is the primary interface customers use to build their dataset and inject diversity into it.
Special System Instruction: Customers can use this to specify a prompt that provides guidance to the entire Navigator system when it generates data.
Categorical Seed Columns: Navigator Data Designer uses data seeds to inject diversity into the dataset. You can define seeds as key-value pairs so that the columns you want to generate can use these seeds as context to generate data related to specific concepts. Seed columns support subcategories, which allow you to specify topics related to a specific seed.
Generated Data Columns: These are the columns you want to generate from scratch in your dataset; for example, Text and Code are the two data columns you would generate in a Text-to-Code dataset. For each data column, you can provide a detailed generation prompt to guide how that column should be generated.
Post Processors: We offer two types of post-processing for the data you generate. Validation checks the correctness of the data generated in a specific column; in this beta we support Python and SQL validation to ensure that generated code is valid Python or SQL. Evaluation explains how readable, relevant, and diverse the generated data is, and is performed using LLMs on individual records as well as on the entire dataset.
Diversity in data is at the core of successfully generating a large-scale synthetic dataset. Data Designer introduces the concept of a "Data Seed", a key-value pair used to inject diversity into the dataset. Data Designer uses these seed values to guide the generation process and ensure maximal diversity in the dataset.
There are three ways to define your seeds (the third, generating seed values from sample records, is described later in this guide):
Specify them in your config: As shown above, you can provide the seed values you are interested in directly in your config or Python script.
Let Data Designer create seed values: Sometimes you may want Data Designer to generate the values for a specific seed. This is especially useful when you have "nested seeds". For example, in a Text-to-Python dataset you might use code complexity as the seed and want an LLM to generate a description for each complexity level.
Data Designer provides a high level YAML interface to declaratively define your dataset.
Once you define the configuration in YAML, you can use the Gretel SDK to load the configuration and then generate data.
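A sketch of loading a YAML configuration with the SDK; the import path and from_config signature reflect the preview-era Data Designer client and should be treated as assumptions:

```python
from gretel_client.navigator import DataDesigner   # assumed preview import path

# Load a Data Designer configuration defined in YAML
designer = DataDesigner.from_config("data_designer_config.yaml", api_key="prompt")
```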
Once you have defined a DataDesigner
object, you can generate your dataset.
You can generate a quick preview of your dataset, assess the data generated, and adjust your config if needed.
Display a record
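For example (method names such as generate_dataset_preview and display_sample_record reflect the preview SDK and are assumptions if your version differs):

```python
# Generate a fast preview of the dataset and inspect it
preview = designer.generate_dataset_preview()

# Display a single sample record from the preview
preview.display_sample_record()
```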
Once you are happy with your configuration, you can submit a batch job to generate as many records as you want!
Batch jobs may take a while to complete depending on how much data you create. Batch jobs create a Gretel Workflow that has an ID and you can use that ID to fetch your dataset.
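A sketch of submitting a batch job and fetching the resulting dataset (method names are assumptions based on the preview SDK):

```python
# Submit a batch job to generate the full dataset; this runs as a Gretel Workflow
batch_job = designer.submit_batch_workflow(num_records=1000)

# Fetch the dataset once the workflow completes
df = batch_job.fetch_dataset(wait_for_completion=True)
```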
If you prefer not to use YAML, you can use the Gretel SDK to define your Data Designer workflow; here is a simple example.
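A minimal sketch of defining the same kind of workflow directly in Python; the constructor arguments and add_* method names reflect the preview SDK and are assumptions:

```python
designer = DataDesigner(
    api_key="prompt",
    model_suite="apache-2.0",
    special_system_instructions="You are an expert Python programmer who writes clear, idiomatic code.",
)

# Seed columns inject diversity into the dataset
designer.add_categorical_seed_column(
    name="topic",
    values=["data structures", "web scraping", "APIs"],
)

# Generated data columns are created from prompts that can reference seed values
designer.add_generated_data_column(
    name="instruction",
    generation_prompt="Write a natural-language request for a Python program about {topic}.",
)
designer.add_generated_data_column(
    name="code",
    generation_prompt="Write Python code that fulfills this request: {instruction}",
)
```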
Once you have defined a DataDesigner
object, you can generate your dataset.
You can generate a quick preview of your dataset, assess the data generated, and adjust your config if needed.
Display a record
Once you are happy with your configuration, you can submit a batch job to generate as many records as you want!
Batch jobs may take a while to complete depending on how much data you create. Batch jobs create a Gretel Workflow that has an ID and you can use that ID to fetch your dataset.
Transform unstructured model responses into strictly typed, schema-validated data. Structured Outputs ensures every response matches your predefined schema specifications, making integrations reliable and predictable.
Schema Enforcement: Responses automatically conform to your JSON schema definitions.
Developer Experience: Instead of writing long prompts with strict guidelines on model outputs, use simple Pydantic objects to define your outputs.
When using the Gretel SDK, you can specify structured data outputs by using the data_config
parameter on the DataDesigner object. This parameter can take either a JSON schema or a Pydantic BaseModel
.
In the case of pydantic
types, you can also make use of Field
to define extra instruction information that will be passed along to the LLM behind the scenes. This can help you get optimal performance out of generations:
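For example, a plain Pydantic model whose Field descriptions act as per-field generation guidance (the schema itself is illustrative):

```python
from pydantic import BaseModel, Field

class SupportTicket(BaseModel):
    # Field descriptions are passed along to the LLM as extra guidance
    title: str = Field(description="Short, specific summary of the customer's issue")
    severity: int = Field(ge=1, le=5, description="Severity from 1 (low) to 5 (critical)")
    resolution: str = Field(description="One or two sentences describing how the issue was resolved")
```

A model like this can then be supplied to Data Designer through the data_config parameter described above.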
Code generation is handled in much the same way: specify the "code" type and provide the "syntax" for the desired language.
Here's a quick demo creating a fruit salad recipe!
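A sketch of what the fruit salad demo might look like; the Pydantic schema is illustrative, and the exact shape of the data_config argument is an assumption about the preview API:

```python
from typing import List
from pydantic import BaseModel, Field

class Fruit(BaseModel):
    name: str = Field(description="Name of the fruit")
    quantity: int = Field(description="How many pieces go into the salad")

class FruitSalad(BaseModel):
    title: str = Field(description="A catchy name for the fruit salad")
    fruits: List[Fruit] = Field(description="The fruits used in the salad")
    instructions: str = Field(description="Short preparation instructions")

# Assumed parameter shape for passing a structured output schema to Data Designer
designer.add_generated_data_column(
    name="fruit_salad",
    generation_prompt="Create a fun fruit salad recipe.",
    data_config={"type": "structured", "params": {"model": FruitSalad}},
)
```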
Start here to learn how to generate natural language as well as tabular data. Helpful as an introductory guide!
Intro
Introducing real-time inference API and high-level Python SDK support. Synthesize data in only 4 lines of code!
Intro
Learn how to use the Gretel Navigator SDK to create new datasets or edit and augment existing datasets from a natural language prompt.
Intro
Use Navigator to create and facilitate further research into safeguards for completion models.
Advanced
Use Navigator to simplify testing against a wide variety of queries that may be encountered in a production environment.
Advanced
Here is an example of a Data Designer configuration for building a Text-to-Python dataset:
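The sketch below illustrates the overall shape of such a configuration; the field names follow the components described next, but the values, seed categories, and post-processor settings are placeholders rather than the published blueprint:

```yaml
model_suite: apache-2.0

special_system_instructions: >-
  You are an expert Python developer who writes clear, idiomatic code.

categorical_seed_columns:
  - name: industry_sector
    values: [Healthcare, Finance, Technology]
    subcategories:
      - name: code_complexity
        values: [Beginner, Intermediate, Advanced]

generated_data_columns:
  - name: text
    generation_prompt: >-
      Write a natural-language request for a Python program related to
      {industry_sector}, suitable for a {code_complexity} developer.
  - name: code
    generation_prompt: >-
      Write Python code that fulfills this request: {text}

post_processors:
  - validator: code
    settings:
      code_lang: python
      code_columns: [code]
  - evaluator: text_to_python
    settings:
      text_column: text
      code_column: code
```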
Model Suite: Model Suites are curated collections of models designed to easily navigate the challenges of model selection, regulatory compliance, and legal rights over generated data. We support two model suites - apache-2.0
and llama-3.x.
For more on model suites, see the Model Suite documentation page.
We provide Blueprint configurations for common use cases, like Text-to-Code. You can view the available Blueprints in our documentation.
Generate seeds from sample records: Sometimes you may not know the best way to define seeds for your dataset, but you might have some examples of the data you want. You can provide Data Designer a few records of your data, and Data Designer will figure out the best seeds to use. This capability is a powerful way to quickly go from a few records to an entire dataset. Learn more about this in our "Sample-to-Dataset" blueprint.
Model Suites: To learn more about model suites, check out the Model Suite documentation page.