S3 Connector
This document provides a general overview of using AWS S3 as both the source and the sink for a Gretel Connector.
See Deploying On-Prem for step-by-step instructions on deploying a Gretel S3 Connector with Local Workers in your AWS environment.

Overview

The Gretel S3 Connector can be configured to continuously watch for new objects in a source S3 bucket, call Gretel Workers to transform the records within those objects (for example, to replace or encrypt PII), and write the results to a destination S3 bucket.

Config

Below is an example S3 Connector config. In this pipeline, every CSV object in the my-connector-source bucket whose key is prefixed with sandbox is transformed to remove PII. Once an object has been de-identified, it is written to the my-connector-destination bucket and prefixed with output/sandbox.
version: 1
sources:
  - name: my_s3_source
    type: s3
    config:
      bucket: my-connector-source
      path_prefix: sandbox
      glob_filter: "*.csv"
      trigger:
        type: sqs
        endpoint: https://sqs.us-east-2.amazonaws.com/123456789012/s3-connector-inbound
sinks:
  - name: my_s3_sink
    type: s3
    config:
      bucket: my-connector-destination
      path_prefix: output/sandbox
connectors:
  - name: default
    version: dev
    max_active: 1
    source: my_s3_source
    sink: my_s3_sink
    model: transform/default
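
In the config above, each connector refers to its source and sink by name. As a minimal sketch of a pre-flight sanity check (not part of Gretel itself), the Python below mirrors the example config as a dictionary and verifies that every connector points at a defined source and sink. The function name `check_references` is hypothetical, and the nesting of the dictionary is an assumption based on the example.

```python
def check_references(config: dict) -> list[str]:
    """Return a list of problems; an empty list means all references resolve."""
    source_names = {s["name"] for s in config.get("sources", [])}
    sink_names = {s["name"] for s in config.get("sinks", [])}
    problems = []
    for connector in config.get("connectors", []):
        name = connector.get("name")
        if connector.get("source") not in source_names:
            problems.append(f"connector {name!r}: unknown source {connector.get('source')!r}")
        if connector.get("sink") not in sink_names:
            problems.append(f"connector {name!r}: unknown sink {connector.get('sink')!r}")
    return problems


# The example config from this page, written out as a Python dict
# (assumed nesting; a YAML parser would normally produce this).
config = {
    "version": 1,
    "sources": [{
        "name": "my_s3_source",
        "type": "s3",
        "config": {
            "bucket": "my-connector-source",
            "path_prefix": "sandbox",
            "glob_filter": "*.csv",
            "trigger": {
                "type": "sqs",
                "endpoint": "https://sqs.us-east-2.amazonaws.com/123456789012/s3-connector-inbound",
            },
        },
    }],
    "sinks": [{
        "name": "my_s3_sink",
        "type": "s3",
        "config": {"bucket": "my-connector-destination", "path_prefix": "output/sandbox"},
    }],
    "connectors": [{
        "name": "default",
        "version": "dev",
        "max_active": 1,
        "source": "my_s3_source",
        "sink": "my_s3_sink",
        "model": "transform/default",
    }],
}

problems = check_references(config)
```

For the example config, `problems` is an empty list because both names resolve. A misspelled source or sink name would show up as one entry per broken reference.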

Source Config

  • bucket - The name of the source bucket to ingest data from.
  • path_prefix - Objects matching this prefix will be processed through the connector. If the object does not match the prefix, that object will be skipped.
  • glob_filter - Selects only objects that match a specific glob pattern. This is useful for filtering objects by file type. If an object does not match the filter, it will be omitted from processing. Glob filters follow standard Unix-style pathname pattern expansion.
  • trigger - The S3 connector is built to continuously poll for new objects arriving in a bucket. A trigger config defines where to poll for new objects. Currently, only SQS triggers are supported.
    • type: sqs - This configures the connector to continuously poll an SQS queue for new S3 change events.
    • endpoint - Specifies the SQS endpoint to poll for new events.
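
To illustrate how the source settings above fit together, here is a minimal sketch that extracts object keys from an S3 event notification (the JSON body delivered to the SQS queue) and then applies a prefix and glob check. The function names `object_keys_from_event` and `should_process` are hypothetical, and matching the glob against the full object key is an assumption; the connector's actual matching rules may differ.

```python
import json
from fnmatch import fnmatchcase
from urllib.parse import unquote_plus


def object_keys_from_event(message_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification JSON body."""
    event = json.loads(message_body)
    pairs = []
    # Messages without a "Records" list (such as S3's initial test event)
    # simply yield no pairs.
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            # S3 URL-encodes object keys in event notifications.
            pairs.append((bucket, unquote_plus(key)))
    return pairs


def should_process(key: str, path_prefix: str, glob_filter: str) -> bool:
    """Assumed semantics: the key starts with the prefix AND matches the glob."""
    return key.startswith(path_prefix) and fnmatchcase(key, glob_filter)
```

Under these assumptions, a key such as `sandbox/users.csv` passes a `sandbox` prefix and a `*.csv` filter, while `sandbox/users.parquet` and `other/users.csv` are skipped.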

Sink Config

  • bucket - The destination bucket to write objects back to.
  • path_prefix - Rewrites the source object path to the specified path prefix.
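
The exact rewrite rule is not spelled out here, so the following is only one plausible reading: the source path_prefix is replaced by the sink path_prefix and the rest of the key is kept. The helper `destination_key` is hypothetical. If the connector instead prepends the sink prefix to the full source key, the result would differ.

```python
def destination_key(source_key: str, source_prefix: str, sink_prefix: str) -> str:
    """Swap the source prefix for the sink prefix (assumed behavior)."""
    if source_key.startswith(source_prefix):
        remainder = source_key[len(source_prefix):]
    else:
        # Keys outside the source prefix are kept whole under the sink prefix.
        remainder = source_key
    # Join with exactly one slash between the prefix and the remainder.
    return sink_prefix.rstrip("/") + "/" + remainder.lstrip("/")


example = destination_key("sandbox/users.csv", "sandbox", "output/sandbox")
```

With the example config's prefixes, `example` is `output/sandbox/users.csv` under this interpretation.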

Supported File Types

Connectors support all of the same file types that are supported by the Gretel CLI, with a few limitations:
  • Compressed CSV and JSON files are supported, but will arrive in the destination bucket uncompressed.
  • If file types or schemas are inconsistent within a single pipeline (for example, the source S3 bucket contains both CSV and Parquet files), you must train a new model per file type or schema. If data sources are consistent, the same model can be reused. Please see Specifying a Model for more information on how to configure the connector model.