Amazon S3

Connect Gretel to your Amazon S3 buckets.

This guide will walk you through connecting source and destination S3 buckets to Gretel. Source buckets will be crawled and used as training inputs to Gretel models. Model outputs get written to the configured S3 destination.

Getting Started

Prerequisites to create an Amazon S3 based workflow. You will need:

  1. A connection to Amazon S3.

  2. A source bucket.

  3. (optional) A destination bucket. This can be the same as your source bucket, or omitted entirely.

Configuring a Connection

Amazon S3 related actions require creating an s3 connection. The connection must be configured with the correct IAM permissions for each Gretel Action.

You can configure the following properties for a connection:

access_key_id

Unique identifier used to authenticate and identify the user.

secret_access_key

Secret value used to sign requests.

All credentials sent to Gretel are encrypted both in transit and at rest.

The following policy can be used to enable access for all S3 related actions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GretelS3Source",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::your-source-bucket-here",
        "arn:aws:s3:::your-source-bucket-here/*"
      ]
    },
    {
      "Sid": "GretelS3Destination",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts",
        "s3:ListBucketMultipartUploads",
        "s3:CreateMultipartUpload",
        "s3:UploadPart",
        "s3:CompleteMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::your-destination-bucket-here/*"
      ]
    }
  ]
}
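
As a convenience, if you save the policy above to a local file it can be attached to the IAM user that owns your access keys with the AWS CLI. This is only a sketch; the user name, policy name, and file name below are placeholders.

# Attach the combined source/destination policy to an IAM user.
# "gretel-s3-user", "GretelS3Access", and "gretel-s3-policy.json" are placeholder names.
aws iam put-user-policy \
  --user-name gretel-s3-user \
  --policy-name GretelS3Access \
  --policy-document file://gretel-s3-policy.json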

More granular permissions for each action can be found in the action's respective Minimum Permissions section.

Creating Access Keys

The following documentation provides instructions for creating IAM users and access keys from your AWS account.
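
If you prefer the command line, an access key pair can also be created for an existing IAM user with the AWS CLI; the user name below is a placeholder.

# Create an access key pair for an existing IAM user.
# "gretel-s3-user" is a placeholder user name; store the returned secret securely.
aws iam create-access-key --user-name gretel-s3-user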

Creating an IAM Role

You can configure your Gretel S3 connector to use an IAM role for authorization. Using IAM roles you can grant Gretel systems access to your bucket without sharing any static access keys.

Using an IAM role is supported for both Gretel Cloud and Gretel Hybrid on AWS.

Before setting up your IAM role, you must first locate the Gretel Project ID for the project you wish to create the connection in. You will use the Project ID as the external ID for the IAM role.

You may find your Gretel Project ID from the Console, SDK or CLI using the following instructions:

Using the CLI, you can query for projects by name and use the project_guid field as the external ID for the IAM role.

$ gretel projects search --query "My Test S3 Workflow"
[
    {
        "name": "proj-6aa9a",
        "project_id": "6268f03b6da43339ff37756a",
        "project_guid": "proj_28N5smcmkGnD6H5pd17tZwfYkQ1",
        "display_name": "My Test S3 Workflow",
        "desc": "Workflow demo project",
        "console_url": "https://console.gretel.ai/proj_2eK7enJMH6fffItDp1dS4Ywa4tz"
    }
]
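
If you have jq installed, the project_guid can be pulled out of that response directly. This is only a convenience and assumes the JSON output shape shown above.

# Extract the project_guid, which is used as the external ID for the IAM role.
gretel projects search --query "My Test S3 Workflow" | jq -r '.[0].project_guid'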

Now that you have the external ID, you will need to create an AWS IAM role. To create the role, navigate to your AWS IAM Console, select the Roles page from the left menu, select Create Role, and follow the instructions for either Gretel Cloud or Gretel Hybrid below:

From the Role Creation dialog:

  1. Select AWS account as the Trusted entity type.

  2. Select Another AWS account and enter Gretel's AWS account ID, 074762682575.

  3. Check Require external ID and enter the Gretel Project ID from the previous step as the External ID.

  4. Select Next and add the appropriate IAM policies for the bucket.

The final trust policy on your IAM role should look similar to:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "074762682575"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<your gretel project id, eg proj_28N5smcmkGnD6H5pd17tZwfYkQ1>"
                }
            }
        }
    ]
}
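
If you prefer to script the role setup, the same result can be achieved with the AWS CLI. The role name, policy name, and file names below are placeholders, and the trust policy above is assumed to be saved locally as trust-policy.json.

# Create the IAM role using the trust policy shown above.
# "s3-gretel-source-access" is a placeholder role name.
aws iam create-role \
  --role-name s3-gretel-source-access \
  --assume-role-policy-document file://trust-policy.json

# Attach the S3 permissions policy from earlier in this guide.
aws iam put-role-policy \
  --role-name s3-gretel-source-access \
  --policy-name GretelS3Access \
  --policy-document file://gretel-s3-policy.json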

For more information about delegating permissions to an AWS IAM user, please reference the following AWS documentation:

Now that the role is configured, you can create a Gretel connection that references it. Using the role ARN from the previous step, create a file on your local computer with the following contents:

{
    "type": "s3",
    "name": "my-s3-source-bucket",
    "config": {
        "role_arn": "arn:aws:iam::123456789012:role/s3-gretel-source-access"
    }
}

Then use the Gretel CLI to create the connection from the credentials file:

gretel connections create --project [project id] --from-file [credential_file.json]
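
For example, assuming --project accepts the project GUID retrieved earlier and that the connection file above was saved as s3-connection.json (a placeholder file name):

# "s3-connection.json" is a placeholder name for the credentials file created above.
gretel connections create \
  --project proj_28N5smcmkGnD6H5pd17tZwfYkQ1 \
  --from-file s3-connection.json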

Once you've created the connection, you may delete the local credentials file.

S3 Source

Type

s3_source

Connection

s3

The s3_source action can be used to read objects from an S3 data source into Gretel models.

Each time the source action is run from a workflow, the action will crawl new files that have landed in the bucket since the last crawl.

Inputs

bucket

Bucket to crawl data from. Should only include the name, such as my-gretel-source-bucket.

glob_filter

Glob filter used to select which objects to crawl, e.g. *.csv.

path

Prefix to crawl objects from. If no path is provided, the root of the bucket is used.

recursive

Default false. If set to true, the action will recursively crawl objects beginning from the configured path.

Outputs

dataset

The dataset of files crawled from the bucket. Downstream actions can reference these files, e.g. {outputs.s3-crawl.dataset.files.data} in the example at the end of this guide.
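
Putting these inputs together, a source action definition might look like the following sketch. The action, connection, and bucket names are placeholders.

# Sketch of an s3_source action; names and values are placeholders.
- name: s3-crawl
  type: s3_source
  connection: c_1
  config:
    bucket: my-gretel-source-bucket
    glob_filter: "*.csv"
    path: data/
    recursive: false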

Minimum Permissions

The following permissions must be attached to the AWS connection in order to read objects from an S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GretelS3Source",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-here",
        "arn:aws:s3:::your-bucket-here/*"
      ]
    }
  ]
}

S3 Destination

Type

s3_destination

Connection

s3

An S3 bucket can be configured as a destination for model outputs. This bucket can be the same bucket as the source, or a different bucket may be specified. If no destination is specified, generated data can be accessed from the model itself.

The s3_destination action may be used to write gretel_model outputs to S3 destination buckets.

Inputs

bucket

The bucket to write objects back to. Please only include the name of the bucket, e.g. my-gretel-bucket.

path

Defines the path prefix to write the object into.

filename

This is the name of the file to write data back to. This file name will be appended to the path if one is configured.

input

The data to write to the destination bucket, typically a reference to the output of an upstream action, e.g. {outputs.model-train-run.dataset.files.data}.

Outputs

None

Minimum Permissions

The following permissions must be attached to the AWS connection in order to write objects to a destination bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GretelS3Destination",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts",
        "s3:ListBucketMultipartUploads",
        "s3:CreateMultipartUpload",
        "s3:UploadPart",
        "s3:CompleteMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-here/*"
      ]
    }
  ]
}

Options for configuring the path

The path property from the source configuration may be used in conjunction with the destination path to move file locations while preserving file names.

For example, if a source bucket is configured with path=data/ and the destination bucket is configured with path=processed-data/, a source file data/records.csv will get written to the destination as processed-data/records.csv.
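
In workflow config terms, that example corresponds to the following fragments; the bucket names are placeholders, and the source action is assumed to be named s3-crawl as in the example below.

# s3_source config fragment
config:
  bucket: my-source-bucket
  path: data/

# s3_destination config fragment: preserves the crawled file name under a new prefix
config:
  bucket: my-destination-bucket
  path: processed-data/
  filename: "{outputs.s3-crawl.dataset.files.filename}"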

Examples

Create a synthetic copy of your Amazon S3 bucket. The following config will crawl an S3 bucket, train and run a synthetic model, then write the outputs of the model back to a destination S3 bucket while maintaining the same name and folder structure of the source bucket.

name: sample-s3-workflow

actions:
  - name: s3-crawl
    type: s3_source
    connection: c_1
    config:
      bucket: my-analytics-bucket
      glob_filter: "*.csv"
      path: metrics/

  - name: model-train-run
    type: gretel_model
    input: s3-crawl
    config:
      project_id: proj_1
      model: synthetics/default
      run_params:
        params:
          num_records_multiplier: 1.0
      training_data: "{outputs.s3-crawl.dataset.files.data}"

  - name: s3-sync
    type: s3_destination
    connection: c_1
    input: model-train-run
    config:
      bucket: my-synthesized-analytics-bucket
      input: "{outputs.model-train-run.dataset.files.data}"
      filename: "{outputs.s3-crawl.dataset.files.filename}"
      path: metrics/
