
Running Gretel Hybrid

Gretel Hybrid provides customers with the flexibility to deploy their own data plane within their preferred cloud tenant. When you choose this option, Gretel Cloud's only role is job and workflow orchestration, ensuring all your data and models remain in your own tenant.
We will walk you through the following steps to deploy Gretel Hybrid.
Please note that once Hybrid is set up, job management tasks, such as training and running Gretel models, are available exclusively through the SDK/CLI interfaces. You will still be able to view your projects and model activity in the Gretel Console.

Data stored in Gretel Cloud when using Hybrid mode

When using Gretel Hybrid, you must configure your SDK to work in Hybrid mode and have a valid bucket to store output artifacts.
The Gretel Console provides view-only access to models created in Hybrid mode. When running in Hybrid mode, the following data is stored in Gretel Cloud:
  • Project names and descriptions
  • Model configuration (The YAML configuration created for each model)
  • Model name and ID
  • Model status (created, active, completed, etc.)
  • Model run ID (when using a model to create more data)
  • Model run status (created, active, completed, etc.)
  • The email address of the user that created a model
  • The email address of the user that ran a model
  • Model creation and model run logs. These logs only include metadata and error information.
  • Names of data sources and results (file names only; no data is stored)
An example of viewing a hybrid job using the Gretel Transform API:
Logs are the only artifacts stored in Gretel Cloud. Data source and generated result names can be viewed, but data is not stored in Gretel Cloud.
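As a rough illustration of what that metadata looks like from the SDK, the sketch below fetches an existing hybrid model and prints its status. The project name and model ID are placeholders, and full client configuration is covered in the Gretel Client section further down.
from gretel_client import configure_session
from gretel_client.projects import get_project

# Sketch only: configure a hybrid session, then look up an existing model.
configure_session(
    api_key="prompt",
    default_runner="hybrid",
    artifact_endpoint="s3://my-sink-bucket",  # your sink bucket
)

project = get_project(name="my-hybrid-project")  # placeholder project name
model = project.get_model("your-model-id")       # placeholder model ID
print(model.status)                              # created, active, completed, ...
Only this kind of status and log metadata is fetched from Gretel Cloud; the data itself stays in your buckets.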
The following data is not stored in Gretel Cloud when using Hybrid mode:
  • Model training data. This will be stored and accessed from your own object storage (buckets you create).
  • Model training artifacts. These will be written to your object storage (buckets you create) instead. This includes:
    • The trained model archive / weights
    • Quality and privacy reports
    • Sample data generated during training
  • Model run artifacts. These will be written to your object storage instead. This includes:
    • Generated data
    • Model run reports (if applicable)
These instructions will walk you through the following steps for each cloud provider:
  • Installing necessary command line tooling
  • Setting up your data source and sink buckets
  • Creating and managing your Kubernetes cluster with the required configuration and access controls
  • Testing your deployment with sample jobs

High-level architecture

Prerequisites

Before getting started, you’ll need to install some tools on your system. If you’re using macOS, we recommend installing Homebrew.

Kubectl

You’ll need kubectl to communicate with your Kubernetes cluster.

macOS

brew update
brew install kubectl
kubectl version --client
See the Kubernetes Docs for other installation methods.

Helm

You’ll need helm to configure your Kubernetes cluster.

macOS

brew install helm
helm version
You should use Helm v3.10.2 or later to avoid compatibility issues.
See the Helm Docs for other installation methods.

Gretel Client

Install and configure your client before proceeding further, and ensure session configuration is set as follows. The hybrid environment configuration will apply to everything run with the Gretel client, including libraries like Gretel Trainer and Gretel Relational.
The system that you are running Gretel SDKs from should have access to the artifact_endpoint below, which is an object storage bucket. This bucket should be the SINK_BUCKET that you configure in the respective cloud-specific setup guides.
from gretel_client import configure_session

configure_session(
    api_key="prompt",  # for Notebook environments
    validate=True,
    default_runner="hybrid",
    artifact_endpoint="s3://my-sink-bucket",  # or gcs://, azure://
)
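With the session configured this way, a standard SDK workflow submits jobs to your hybrid cluster. The sketch below is illustrative only: the project name, blueprint, and source path are placeholder assumptions, and it assumes a recent gretel-client where model.submit() follows the session’s default runner.
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

# Illustrative sketch; replace the placeholder names and paths with your own.
project = create_or_get_unique_project(name="my-hybrid-project")
model = project.create_model_obj(
    model_config="synthetics/default",              # any blueprint or local YAML path
    data_source="s3://my-source-bucket/train.csv",  # training data stays in your bucket
)
model.submit()  # dispatched using the session's default runner ("hybrid")
poll(model)     # only logs and metadata are reported back to Gretel Cloud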
The Gretel Client uses cloud-provider-specific libraries to interact with the underlying object storage via the smart_open library.
S3
When using S3, the Gretel Client will look for default credentials already configured on your system. Docs for configuring S3 credentials can be found here.
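For example, the standard AWS environment variables (or an AWS profile / ~/.aws/credentials file) are picked up automatically; the values in this sketch are placeholders.
import os

# Placeholder values; a named AWS profile or ~/.aws/credentials works equally well.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # region of your sink bucket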
GCS
When using GCS, the Gretel Client will look for default credentials already configured on your machine. Docs for configuring GCS credentials can be found here.
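For example, pointing GOOGLE_APPLICATION_CREDENTIALS at a service account key file is one common way to provide default credentials (gcloud auth application-default login also works); the path here is a placeholder.
import os

# Placeholder path to a service account key with access to your buckets.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"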
Azure
There is no standard way to configure credentials for Azure. The Gretel Client will look for credentials in the AZURE_STORAGE_CONNECTION_STRING or OAUTH_STORAGE_ACCOUNT_NAME environment variables.
To fetch a connection string for AZURE_STORAGE_CONNECTION_STRING, you can run the following command from your terminal using the Azure CLI.
az storage account show-connection-string \
  --name "${STORAGE_ACCOUNT_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --query="connectionString"
Be sure to replace STORAGE_ACCOUNT_NAME and RESOURCE_GROUP with the appropriate values for your storage container.
OAUTH_STORAGE_ACCOUNT_NAME may be used to configure the Gretel Client with system-assigned managed identities; it should contain the name of the storage account associated with your storage container.
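As a sketch, either variable can be set in the environment before configuring the session; both values below are placeholders.
import os

# Option 1: connection string fetched with the `az` command above (placeholder value).
os.environ["AZURE_STORAGE_CONNECTION_STRING"] = "<connection-string-from-az-cli>"

# Option 2: rely on a system-assigned managed identity instead (placeholder account name).
# os.environ["OAUTH_STORAGE_ACCOUNT_NAME"] = "<your-storage-account-name>"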

Choose your cloud provider