Amazon S3
Connect Gretel to your Amazon S3 buckets.
This guide will walk you through connecting source and destination S3 buckets to Gretel. Source buckets will be crawled and used as training inputs to Gretel models. Model outputs get written to the configured S3 destination.
Prerequisites to create an Amazon S3 based workflow. You will need:
A connection to Amazon S3.
A source bucket.
(optional) A destination bucket. This can be the same as your source bucket, or omitted entirely.
Amazon S3 related actions require creating an s3 connection. The connection must be configured with the correct IAM permissions for each Gretel Action.
You can configure the following properties for a connection
All credentials sent to Gretel are encrypted both in transit and at rest.
The following policy can be used to enable access for all S3 related actions
More granular permissions for each action can be found in the action's respective Minimum Permissions section.
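As an illustration of what a combined policy for the S3 actions might look like, the Python sketch below builds an IAM policy document that grants list, read, and write access to a single bucket. The bucket name my-gretel-bucket and the specific action list (s3:ListBucket, s3:GetObject, s3:PutObject) are assumptions made for this example, not Gretel's published policy; check each action's Minimum Permissions section for the authoritative list.

```python
import json


def build_s3_policy(bucket: str) -> dict:
    """Build an IAM policy document granting list, read, and write access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading and writing are granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "my-gretel-bucket" is a hypothetical bucket name used for illustration.
policy = build_s3_policy("my-gretel-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the AWS IAM policy editor. If the source and destination are different buckets, one option is to generate a statement pair per bucket.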
The following documentation provides instructions for creating IAM users and access keys from your AWS account.
You can configure your Gretel S3 connector to use an IAM role for authorization. Using IAM roles you can grant Gretel systems access to your bucket without sharing any static access keys.
Using an IAM role is supported for both Gretel Cloud and Gretel Hybrid on AWS.
Before setting up your IAM role, you must first locate the Gretel Project ID for the project you wish to create the connection in. You will use the Project ID as the external ID for the IAM role.
You may find your Gretel Project ID from the Console, SDK or CLI using the following instructions:
Using the CLI, you can query for projects by name and use the project_guid field to retrieve the external ID for the IAM role.
Now that you have the external ID, you will need to create an AWS IAM role. To create the role, navigate to your AWS IAM Console, select the Roles page from the left menu, select Create Role and follow the instructions for either Gretel Cloud or Gretel Hybrid below:
From the Role Creation dialog
Select AWS account as the Trusted entity type.
Select Another AWS account and enter Gretel's AWS account ID, 074762682575.
Check Require external ID and enter the Gretel Project ID from the previous step as the External ID.
Select Next and add the appropriate IAM policies for the bucket.
The final trust policy on your IAM role should look similar to
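The role creation steps above (AWS account as the trusted entity, Gretel's account ID, and a required external ID) correspond to a standard cross-account trust policy. The Python sketch below assembles such a document. The project ID value proj_example123 is a hypothetical placeholder, and the exact policy generated by the AWS console for your role may differ in detail.

```python
import json

# Gretel's AWS account ID, as given in the role creation steps.
GRETEL_AWS_ACCOUNT_ID = "074762682575"


def build_trust_policy(project_id: str) -> dict:
    """Build a trust policy letting the Gretel account assume the role with an external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The trusted entity is the Gretel AWS account.
                "Principal": {"AWS": f"arn:aws:iam::{GRETEL_AWS_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                # The role can only be assumed when the caller supplies this external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": project_id}},
            }
        ],
    }


# "proj_example123" is a placeholder; substitute your own Gretel Project ID.
trust_policy = build_trust_policy("proj_example123")
print(json.dumps(trust_policy, indent=2))
```

Comparing this output with the trust relationship shown on the role's page in the IAM Console is a quick way to confirm that the account ID and external ID were entered correctly.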
For more information about delegating permissions to an AWS IAM user, please reference the following AWS documentation:
Now that you have the role configured, you can create a Gretel connection using the role ARN from the previous step.
Using the role ARN from the previous steps, create a file on your local computer with the following contents
Then use the Gretel CLI to create the connection from the credentials file
Once you've created the connection, you may delete the local credentials file.
The s3_source action can be used to read an object from an S3 data source into Gretel Models.
Each time the source action is run from a workflow, the action will crawl new files that have landed in the bucket since the last crawl.
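To make the incremental behaviour concrete, the following Python sketch models a crawl that only returns objects it has not seen before. It is a conceptual illustration only: how Gretel actually tracks previously crawled files is not described here, and the function name crawl_new_objects and the in-memory seen set are assumptions made for the example.

```python
def crawl_new_objects(bucket_keys, seen_keys):
    """Return the keys that have not been crawled yet and record them as seen."""
    new_keys = sorted(key for key in bucket_keys if key not in seen_keys)
    seen_keys.update(new_keys)
    return new_keys


seen = set()

# First run: every object in the bucket is new.
first = crawl_new_objects(["data/a.csv", "data/b.csv"], seen)

# Second run: only the object that landed since the last crawl is returned.
second = crawl_new_objects(["data/a.csv", "data/b.csv", "data/c.csv"], seen)

print(first)
print(second)
```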
The following permissions must be attached to the AWS connection in order to read objects from an S3 bucket
An S3 bucket can be configured as a destination for model outputs. This bucket can be the same bucket as the source, or a different bucket may be specified. If no destination is specified, generated data can be accessed from the model itself.
The s3_destination action may be used to write gretel_model outputs to S3 destination buckets.
Outputs: None
The following permissions must be attached to the AWS connection in order to write objects to a destination bucket.
path
The path property from the source configuration may be used in conjunction with the destination path to move file locations while preserving file names.
For example, if a source bucket is configured with path=data/ and the destination bucket configured with path=processed-data/, a source file data/records.csv will get written to the destination as processed-data/records.csv.
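The path mapping in this example can be sketched in a few lines of Python. This is an illustration of the described behaviour under the assumption that the source prefix is replaced by the destination prefix; the helper name destination_key is hypothetical and is not part of any Gretel API.

```python
def destination_key(source_key: str, source_path: str, destination_path: str) -> str:
    """Swap the source path prefix for the destination path prefix, keeping the rest of the key."""
    if not source_key.startswith(source_path):
        raise ValueError(f"{source_key!r} is not under source path {source_path!r}")
    # Everything after the source prefix (file name and any sub-folders) is preserved.
    relative = source_key[len(source_path):]
    return destination_path + relative


result = destination_key("data/records.csv", "data/", "processed-data/")
print(result)  # processed-data/records.csv
```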
Create a synthetic copy of your Amazon S3 bucket. The following config will crawl an S3 bucket, train and run a synthetic model, then write the outputs of the model back to a destination S3 bucket while maintaining the same name and folder structure as the source bucket.
For details how the action more generally works, please see .
access_key_id: Unique identifier used to authenticate and identify the user.
secret_access_key: Secret value used to sign requests.
Type: s3_source
Connection: s3
Inputs:
bucket: Bucket to crawl data from. Should only include the name, such as my-gretel-source-bucket.
glob_filter: A glob filter may be used to match file names matching a specific pattern. Please see the Glob Filter Reference for more details.
path: Prefix to crawl objects from. If no path is provided, the root of the bucket is used.
recursive: Default false. If set to true, the action will recursively crawl objects beginning from the configured path.
Outputs:
dataset: A dataset object containing file and table representations of the found objects.
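The interplay of path, recursive, and glob_filter can be approximated with the Python sketch below. It uses the standard library's fnmatch for pattern matching and applies the pattern to the file name only; both choices are assumptions made for illustration, since the exact matching rules are defined in the Glob Filter Reference. The function name select_keys is hypothetical.

```python
from fnmatch import fnmatch


def select_keys(keys, path="", recursive=False, glob_filter="*"):
    """Pick the object keys a crawl would visit for the given source settings."""
    selected = []
    for key in keys:
        # Only objects under the configured prefix are considered.
        if not key.startswith(path):
            continue
        relative = key[len(path):]
        # Without recursion, objects in sub-folders of the prefix are skipped.
        if not recursive and "/" in relative:
            continue
        # Assumption: the glob is matched against the file name alone.
        filename = relative.rsplit("/", 1)[-1]
        if fnmatch(filename, glob_filter):
            selected.append(key)
    return selected


keys = ["data/a.csv", "data/b.json", "data/2024/c.csv", "other/d.csv"]

flat = select_keys(keys, path="data/", glob_filter="*.csv")
deep = select_keys(keys, path="data/", recursive=True, glob_filter="*.csv")

print(flat)
print(deep)
```

With recursive left at its default of false, only CSV files directly under data/ are selected; setting it to true also picks up files in nested folders.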
Type: s3_destination
Connection: s3
Inputs:
bucket: The bucket to write objects back to. Please only include the name of the bucket, e.g. my-gretel-bucket.
path: Defines the path prefix to write the object into.
filename: This is the name of the file to write data back to. This file name will be appended to the path if one is configured.
input: Data to write to the file. This should be a reference to the output from a previous action.