Architecture

Architecture Diagram

Data sent to Gretel's control plane when using Hybrid mode

When running in Hybrid mode, the following data will be stored in Gretel's control plane and may be passed between your Gretel Hybrid environment and the Gretel API.

  • Project names and descriptions

  • Model configuration (The YAML configuration created for each model)

  • Model name and ID

  • Model status (created, active, completed, etc.)

  • Model run ID (when using a model to create more data)

  • Model run status (created, active, completed, etc.)

  • Workflow IDs, Workflow Run IDs and Workflow Task IDs

  • Workflow Task Statuses and overall Workflow Run Status

  • The email address of the user that created a model

  • The email address of the user that ran a model

  • Model creation and model run logs. These logs only include metadata and error information.

  • Workflow Task logs. These logs include metadata and error information, and allow users to view logs in the Console.

  • Names of data sources and results (file names only; no data is stored)

The following data is not stored in Gretel's control plane when using Hybrid mode.

  • Model training data. This is stored in, and accessed from, your own object storage (buckets you create).

  • Model training artifacts. These will be written to your object storage (buckets you create) instead. This includes:

    • The trained model archive / weights

    • Quality and privacy reports

    • Sample data generated during training

  • Model run artifacts. These will be written to your object storage instead. This includes:

    • Generated data

    • Model run reports (if applicable)

[Image: an example of viewing a Hybrid job using the Gretel Transform API]

Outbound Network Requirements

Gretel Hybrid relies on outbound connections to reach the Gretel API and to pull container images. No inbound network connections are required for Gretel Hybrid to function. The endpoints below must be reachable from the network associated with the Kubernetes cluster hosting Gretel Hybrid.

  • api.gretel.cloud (HTTPS / TCP 443) - The Gretel API. This must be reachable by all Gretel pods running within your Kubernetes cluster for the purposes of job scheduling and orchestration.

  • artifacts.gretel.cloud (HTTPS / TCP 443) - This endpoint provides presigned S3 URLs for pulling certain base model artifacts when a model training job starts. This must be reachable by all Gretel pods running within your Kubernetes cluster.

  • 074762682575.dkr.ecr.us-west-2.amazonaws.com (HTTPS / TCP 443) - Gretel's Container Registry hosted on AWS ECR. This must be reachable by Kubernetes nodes so that pod container images may be pulled.

  • s3.amazonaws.com (HTTPS / TCP 443) - AWS S3 is the persistent storage that backs ECR. This endpoint must be reachable by Kubernetes nodes so that they can pull Gretel container images.

  • s3-us-west-2.amazonaws.com (HTTPS / TCP 443) - AWS S3 is the persistent storage that backs ECR. This endpoint must be reachable by Kubernetes nodes so that they can pull Gretel container images.
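
One way to confirm the outbound requirements above is a quick TCP reachability check run from inside the cluster network. The following is a minimal sketch, not a Gretel-provided tool: the names `REQUIRED_ENDPOINTS`, `tcp_connect`, and `check_endpoints` are hypothetical and exist only for this illustration. It only verifies that a TCP connection to port 443 can be opened; it does not validate TLS, authentication, or proxy behavior.

```python
import socket
from typing import Callable, Dict, Iterable

# Hostnames copied from the list above; each is reached over HTTPS (TCP 443).
REQUIRED_ENDPOINTS = [
    "api.gretel.cloud",
    "artifacts.gretel.cloud",
    "074762682575.dkr.ecr.us-west-2.amazonaws.com",
    "s3.amazonaws.com",
    "s3-us-west-2.amazonaws.com",
]


def tcp_connect(host: str, port: int, timeout: float = 5.0) -> None:
    """Open and immediately close a TCP connection.

    Raises OSError (which includes DNS failures and timeouts) on failure.
    """
    with socket.create_connection((host, port), timeout=timeout):
        pass


def check_endpoints(
    hosts: Iterable[str],
    port: int = 443,
    connect: Callable[[str, int], None] = tcp_connect,
) -> Dict[str, bool]:
    """Return a mapping of host -> True if a connection could be opened."""
    results: Dict[str, bool] = {}
    for host in hosts:
        try:
            connect(host, port)
            results[host] = True
        except OSError:
            # Unreachable, blocked by egress rules, or name resolution failed.
            results[host] = False
    return results
```

Calling `check_endpoints(REQUIRED_ENDPOINTS)` returns a dictionary keyed by hostname. Because the Gretel API and artifacts endpoints must be reachable by pods while the ECR and S3 endpoints must be reachable by nodes, it may be worth running the check from both a pod and a node. The `connect` parameter is injectable so the logic can be exercised without real network access.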
