Deploying On-Prem
This guide walks you step by step through deploying the Gretel S3 Connector and on-prem workers into your AWS environment. With this connector, your data never leaves your environment.

Getting Started

Requirements:
  • Gretel API Key
  • AWS Account and VPC
We strongly recommend not using an AWS root user when setting up Gretel's S3 connector in your own AWS environment.
Gretel's S3 connector is deployed via a CloudFormation template and supports most AWS regions.
Jump ahead to the step-by-step instructions below, or continue reading for a brief overview of the components that will be deployed.
Estimated time to deploy the stack: 20 minutes. You should have a basic understanding of AWS and CloudFormation to complete this guide.
Please contact [email protected] for access to the template.

Template Overview

The Gretel S3 Connector is deployed into your AWS account via a set of nested CloudFormation templates. The connector pipeline can be broken into five components, which are automatically deployed via the CloudFormation template. In the next few sections, we describe each component and then walk you through how to deploy the pipeline using the AWS CloudFormation console.

Connector Service

The connector service is responsible for reading source S3 objects, producing Gretel jobs, and writing transformed or synthesized data back to the destination bucket. The connector service is deployed as a single EC2 instance managed under an autoscaling group.

Event Infrastructure

S3 object processing is triggered via an EventBridge notification from the source bucket. As new objects land in the source bucket, the connector receives notifications for them and pushes the objects to a Gretel Worker for processing. The CloudFormation template will provision an EventBridge rule and an SQS queue for the connector to subscribe to.
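As a concrete illustration, each SQS message the connector consumes carries an EventBridge event describing the new object. The sketch below pulls the bucket and key out of such an event; the field layout follows the S3 "Object Created" EventBridge event schema, and the object key and size are hypothetical (the bucket name is borrowed from the sample logs later in this guide):

```python
# A trimmed S3 "Object Created" EventBridge event, i.e. the payload
# delivered to the connector's SQS queue. Key and size are hypothetical.
event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "gretel-connector-test-source-usw2"},
        "object": {"key": "data/users.csv", "size": 1048576},
    },
}

def extract_object(evt: dict) -> tuple[str, str]:
    """Return the (bucket, key) pair referenced by an S3 EventBridge event."""
    detail = evt["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

bucket, key = extract_object(event)
print(f"new object: s3://{bucket}/{key}")
```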

Worker Cluster

The worker cluster is responsible for running Gretel models and processing source S3 data. The worker cluster runs on-prem and is horizontally scalable. To scale worker instances up or down, adjust the auto scaling group provisioned by the connector CloudFormation template.

Artifact Storage

With the on-prem Gretel connector, your data never leaves your cloud. To achieve this, the CloudFormation template deploys an intermediate S3 bucket that temporarily stores source and destination artifacts produced by the connector and worker instances. These objects are stored only temporarily and are deleted once the pipeline no longer needs them.
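The template manages this cleanup itself (note the s3:PutLifecycleConfiguration permission in the IAM policy later in this guide). Conceptually, expiring intermediate artifacts can be expressed as an S3 lifecycle rule like the following sketch, where the rule ID and the one-day expiration window are hypothetical:

```json
{
  "Rules": [
    {
      "ID": "expire-intermediate-artifacts",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 1 }
    }
  ]
}
```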

Configuration Management

There are various parameters, secrets, and configurations that must be managed in order to run the pipeline. These values are configured automatically based on parameters provided to the CloudFormation template. After the template has been deployed, these values may be managed by updating the respective Secrets Manager or Parameter Store resources.

Preparing your S3 Buckets

To set up the Gretel S3 connector, you must provide an existing source and destination bucket.
The source bucket must be configured to send Amazon EventBridge notifications. To configure these events, please navigate to the "Properties" tab on the source S3 bucket.
Next, scroll down to the "Event notifications" section, and click the "Edit" button next to the "Amazon EventBridge" heading.
Select "On" and click "Save Changes". Your bucket is now ready to serve as a source for the Gretel S3 Connector.
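If you prefer to script this step, enabling EventBridge delivery corresponds to putting the following notification configuration on the source bucket, for example with `aws s3api put-bucket-notification-configuration --bucket <source-bucket> --notification-configuration file://eventbridge.json`:

```json
{
  "EventBridgeConfiguration": {}
}
```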

Launching via CloudFormation

IAM Permissions

The following policy may be used to create a role for provisioning the CloudFormation connector stack.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:CreateAutoScalingGroup",
        "autoscaling:DeleteAutoScalingGroup",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:UpdateAutoScalingGroup",
        "cloudformation:CreateChangeSet",
        "cloudformation:DeleteStack",
        "cloudformation:DescribeStackResource",
        "cloudformation:DescribeStackResources",
        "cloudformation:ExecuteChangeSet",
        "cloudwatch:DeleteDashboards",
        "cloudwatch:GetDashboard",
        "cloudwatch:PutDashboard",
        "cloudwatch:PutMetricData",
        "ec2:CreateLaunchTemplate",
        "ec2:CreateTags",
        "ec2:DeleteLaunchTemplate",
        "ec2:DescribeImages",
        "ec2:DescribeInstances",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeLaunchTemplates",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeTags",
        "ec2:DescribeVolumes",
        "ec2:RunInstances",
        "kms:GenerateDataKey",
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:GetFunction",
        "lambda:InvokeFunction",
        "logs:CreateLogGroup",
        "logs:DeleteLogGroup",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:DescribeMetricFilters",
        "logs:ListTagsLogGroup",
        "logs:PutRetentionPolicy",
        "resource-groups:CreateGroup",
        "resource-groups:DeleteGroup",
        "resource-groups:ListGroups",
        "secretsmanager:CreateSecret",
        "secretsmanager:DeleteSecret",
        "secretsmanager:GetSecretValue",
        "secretsmanager:UpdateSecret",
        "ssm:DeleteParameter",
        "ssm:GetParameters",
        "ssm:PutParameter"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:AddRoleToInstanceProfile",
        "iam:AttachRolePolicy",
        "iam:CreateInstanceProfile",
        "iam:CreatePolicy",
        "iam:CreateRole",
        "iam:DeleteInstanceProfile",
        "iam:DeletePolicy",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DetachRolePolicy",
        "iam:GetInstanceProfile",
        "iam:GetPolicy",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListInstanceProfiles",
        "iam:ListInstanceProfilesForRole",
        "iam:ListPolicyVersions",
        "iam:PassRole",
        "iam:PutRolePolicy",
        "iam:RemoveRoleFromInstanceProfile",
        "iam:TagInstanceProfile",
        "iam:TagPolicy",
        "iam:TagRole"
      ],
      "Resource": [
        "arn:aws:iam::*:instance-profile/<name-of-stack>*",
        "arn:aws:iam::*:role/<name-of-stack>*",
        "arn:aws:iam::*:policy/<name-of-stack>*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:DeleteRule",
        "events:DescribeRule",
        "events:DisableRule",
        "events:EnableRule",
        "events:PutRule",
        "events:PutTargets",
        "events:RemoveTargets",
        "s3:CreateBucket",
        "s3:DeleteBucket",
        "s3:GetBucketAcl",
        "s3:GetBucketPolicyStatus",
        "s3:GetBucketPublicAccessBlock",
        "s3:PutBucketPublicAccessBlock",
        "s3:PutLifecycleConfiguration",
        "sqs:AddPermission",
        "sqs:ChangeMessageVisibility",
        "sqs:CreateQueue",
        "sqs:DeleteMessage",
        "sqs:DeleteQueue",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl",
        "sqs:ReceiveMessage",
        "sqs:RemovePermission",
        "sqs:SendMessage",
        "sqs:SetQueueAttributes",
        "ssm:DeleteParameter",
        "ssm:GetParameters",
        "ssm:PutParameter"
      ],
      "Resource": "*"
    }
  ]
}
This role includes sufficient permissions to both provision and delete the connector stack.

Create the Stack

Navigate to "CloudFormation -> Stacks -> Create Stack", enter the Connector CloudFormation template URL, then click "Next".

Stack Parameters

With the CloudFormation template selected, you will now be prompted to fill out parameters for the stack.
  • Bucket Configuration
    • SourceBucket optional - The name of the source bucket that contains the raw data. This bucket must be different from the DestinationBucket. If no bucket is provided, one will be created on your behalf.
    • DestinationBucket optional - The name of the destination bucket which will receive transformed or synthesized data from the pipeline. If no bucket is provided, one will be created on your behalf.
  • Gretel Config
    • GretelApiKey - Your Gretel API key. This key can be found by navigating to the Gretel Console.
    • Project optional - Gretel Project ID. Each connector and worker cluster is scoped to a single project. This project should only be used for the connector pipeline. If no project is provided, one will be created automatically.
    • Model - A pre-trained model ID or path to a model config.
      • If you choose to use a pre-trained model, this model must be owned by the configured project and be trained in the cloud.
      • If you configure the connector to use a model config, a new model will be trained per object. Valid model configs include
        • S3 URIs
        • HTTP links
        • Managed Gretel configs, e.g. transform/default. These configs can be found in Gretel's Blueprints GitHub repo.
    • GretelEndpoint - Gretel API endpoint. The default value populated in the template should be used.
  • Worker Cluster Config
    • CpuWorkerInstanceCount - This determines the number of Gretel Worker instances to provision in the auto scaling group.
    • Subnets - Configures where connector and worker instances will be provisioned. Please note: these instances need outbound access to https://api.gretel.cloud. They do not need ingress access.
    • WorkerSecurityGroup - Connector and worker EC2 instance security groups.
After all the appropriate parameters have been filled out, click "Next".
In the next step, you will be asked to configure any additional stack options. After this has been completed, click "Next".
On the final page of the CloudFormation wizard you will be given an opportunity to review all the parameters. After you have reviewed these, click "Create Stack" to deploy the connector.
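The same deployment can also be scripted with the AWS CLI by passing the parameters described above in a parameters file, for example `aws cloudformation create-stack --stack-name <name-of-stack> --template-url <template-url> --parameters file://params.json` (plus any --capabilities flags your account requires). The parameter keys below come from this guide; every value is a hypothetical placeholder:

```json
[
  { "ParameterKey": "SourceBucket", "ParameterValue": "my-source-bucket" },
  { "ParameterKey": "DestinationBucket", "ParameterValue": "my-destination-bucket" },
  { "ParameterKey": "GretelApiKey", "ParameterValue": "<your-gretel-api-key>" },
  { "ParameterKey": "Model", "ParameterValue": "transform/default" },
  { "ParameterKey": "CpuWorkerInstanceCount", "ParameterValue": "2" },
  { "ParameterKey": "Subnets", "ParameterValue": "subnet-<id>" },
  { "ParameterKey": "WorkerSecurityGroup", "ParameterValue": "sg-<id>" }
]
```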

Verifying the Pipeline is Configured Correctly

After the stack has been deployed, you may verify everything is running correctly by checking both connector and worker application logs.
Connector application logs are sent to CloudWatch and can be found under the /gretel/{stack-name}/application prefix.

Connector

If the connector has been deployed successfully, you will see log messages indicating that the connector has subscribed to the SQS queue and is awaiting new work to arrive.
INFO: Starting container 074762682575.dkr.ecr.us-east-2.amazonaws.com/gretelai/connector:dev
[main] INFO cloud.gretel.connectors.Connector - Starting connector with config /etc/gretel/connector.yaml
[main] INFO cloud.gretel.connectors.jobs.JobsController - Configured Gretel Endpoint `https://api.gretel.cloud`
[main] INFO cloud.gretel.connectors.jobs.JobsController - Instantiated job controller mode MODEL, project `s3-connector-demo`, model `Optional[transform/default]`
[main] INFO cloud.gretel.connectors.jobs.JobsController - Configured artifact endpoint [email protected]08
[main] INFO cloud.gretel.connectors.Connector - Assembling the pipeline
[main] INFO cloud.gretel.connectors.adapters.s3.S3Factory - Source config Source[fileType=null, bucket=gretel-connector-test-source-usw2, pathPrefix=null, globFilter=null, trigger=Trigger[endpoint=https://sqs.us-west-2.amazonaws.com/...]]
[main] INFO cloud.gretel.connectors.adapters.s3.S3Factory - Sink config Sink[bucket=gretel-connector-test-destination-usw2, pathPrefix=null]
[main] INFO cloud.gretel.connectors.Connector - Starting the connector pipeline
[main] INFO cloud.gretel.connectors.tasks.TaskManager - Starting error handler
[main] INFO cloud.gretel.connectors.adapters.s3.S3Sink - Starting sink class cloud.gretel.connectors.adapters.s3.S3Sink
[pool-4-thread-1] INFO cloud.gretel.connectors.tasks.TaskManager - Active Tasks: 0, Queued Results: 0, Failed Tasks: 0, Failed Writes: 0
[main] INFO cloud.gretel.connectors.adapters.s3.S3EventSource - Starting source class cloud.gretel.connectors.adapters.s3.S3EventSource
[pool-6-thread-1] INFO cloud.gretel.connectors.adapters.s3.S3Sink - Started sink for gretel-connector-test-destination-usw2
[pool-5-thread-1] INFO cloud.gretel.connectors.adapters.s3.S3EventSource - Started source, polling https://sqs.us-west-2.amazonaws.com/...
[pool-5-thread-1] INFO cloud.gretel.connectors.adapters.s3.sqs.SQSPoller - Polling on https://sqs.us-west-2.amazonaws.com/...
[pool-4-thread-1] INFO cloud.gretel.connectors.tasks.TaskManager - Active Tasks: 0, Queued Results: 0, Failed Tasks: 0, Failed Writes: 0

Workers

Worker instances have started successfully if you see log messages similar to the following:
INFO: Starting Gretel agent using driver docker
INFO - gretel_client.agents.agent - Agent started, waiting for work to arrive

Security and Encryption

All EBS volumes attached to EC2 instances are encrypted by default.
All outbound traffic sent to Gretel is TLS encrypted.

Limitations

  • The S3 connector stack, source bucket, and destination bucket must be in the same region.
  • S3 objects must be less than 100 MB. This limit may increase in the future.
  • Encrypted source and destination buckets are not currently supported.

Pricing

Gretel jobs are billed at the rates detailed on our pricing page, https://gretel.ai/pricing.