Google Cloud Storage

Connect to your Google Cloud Storage buckets.

Getting Started

To create a Google Cloud Storage based workflow, you will need:

  1. A connection to Google Cloud Storage.

  2. A source bucket.

  3. (optional) A destination bucket. This can be the same as your source bucket, or omitted entirely.

Configuring a Google Cloud Storage Connection

Google Cloud Storage related actions require creating a gcs connection. The connection must be configured with the correct permissions for each Gretel Action.

For specific permissions, please refer to the Minimum Permissions section under each corresponding action.

Gretel GCS connections require the following fields:

private_key_json

This private key JSON blob is used to authenticate Gretel with GCS object storage resources.

Create a Service Account

To generate a private key, first create a service account, then download a key for that service account.

Configure Bucket IAM Permissions

After the service account has been created, you can attach bucket-specific permissions to it.

Please see each action's Minimum Permissions section for a list of permissions to attach to the service account.

GCS Source

Type

gcs_source

Connection

gcs

The gcs_source action can be used to read an object from a GCS bucket into Gretel Models.

This action works as an incremental crawler. Each time a workflow runs, the action crawls the files that have landed in the bucket since the last crawl.

For details on how the action works more generally, please see Reading Objects.

Inputs

bucket

Bucket to crawl data from. Should only include the name, such as my-gretel-source-bucket.

glob_filter

A glob filter that restricts the crawl to file names matching a specific pattern. Please see the Glob Filter Reference for more details.

path

Prefix to crawl objects from. If no path is provided, the root of the bucket is used.

recursive

Defaults to false. If set to true, the action will recursively crawl objects starting from path.
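The inputs above can be combined into a single action definition. The following is a minimal sketch rather than a definitive configuration: the action name, the connection ID c_1, and the bucket and path values are placeholders, and the structure assumes the same workflow layout as the example at the end of this page.

```yaml
# Hypothetical fragment: action name, connection ID, and bucket are placeholders.
- name: gcs-crawl
  type: gcs_source
  connection: c_1                      # ID of an existing gcs connection
  config:
    bucket: my-gretel-source-bucket    # bucket name only
    path: metrics/                     # prefix to start crawling from
    glob_filter: "*.csv"               # only match CSV file names
    recursive: true                    # also crawl nested objects under path
```

With recursive left at its default of false, only objects directly under path would be considered.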

Outputs

dataset

A dataset object containing file and table representations of the found objects.

Minimum Permissions

The associated service account must have the following permissions for the configured bucket:

  • storage.objects.list

  • storage.objects.get
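One way to grant exactly these two permissions is a custom IAM role bound to the service account. The sketch below is a role definition file in the YAML format accepted by gcloud iam roles create with the --file flag. The title and description are placeholder values and are not part of Gretel's documentation.

```yaml
# Hypothetical custom role for the gcs_source action; title and description are placeholders.
title: "Gretel GCS Source Reader"
description: "Minimum permissions for the Gretel gcs_source action"
stage: "GA"
includedPermissions:
  - storage.objects.list   # enumerate objects while crawling
  - storage.objects.get    # read object contents
```

After the role exists, it can be granted to the service account on the source bucket only, which keeps the grant bucket-specific.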

GCS Destination

Type

gcs_destination

Connection

gcs

The gcs_destination action may be used to write gretel_model outputs to Google Cloud Storage buckets.

For details on how the action works more generally, please see Writing Objects.

Inputs

bucket

The bucket to write objects back to. Only include the name of the bucket, e.g. my-gretel-bucket.

path

Defines the path prefix to write the object into.

filename

Name of the file to write data back to. This file name will be appended to the path if one is configured.

input

Data to write to the file. This should be a reference to the output from a previous action.
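Put together, a destination action might look like the sketch below. This is an illustration under assumptions: the action names, connection ID, bucket, and path are placeholders, and the output references follow the pattern used in the example at the end of this page.

```yaml
# Hypothetical fragment: action names, connection ID, bucket, and path are placeholders.
- name: gcs-write
  type: gcs_destination
  connection: c_1                  # ID of an existing gcs connection
  input: model-train-run           # upstream action that produces the data
  config:
    bucket: my-gretel-bucket       # bucket name only
    path: synthetic/               # prefix the file name is appended to
    filename: "{outputs.gcs-crawl.dataset.files.filename}"
    input: "{outputs.model-train-run.dataset.files.data}"
```

Reusing the file names from the source crawl keeps the written objects aligned with the files they were generated from.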

Outputs

None

Minimum Permissions

The associated service account must have the following permissions for the configured destination bucket:

  • storage.objects.create

  • storage.objects.delete (supports replacing an existing file in the bucket)
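As with the source action, these permissions could be packaged as a custom IAM role. The following is a sketch of a role definition file in the YAML format accepted by gcloud iam roles create with the --file flag. The title and description are placeholders chosen for illustration.

```yaml
# Hypothetical custom role for the gcs_destination action; title and description are placeholders.
title: "Gretel GCS Destination Writer"
description: "Minimum permissions for the Gretel gcs_destination action"
stage: "GA"
includedPermissions:
  - storage.objects.create   # write new objects
  - storage.objects.delete   # needed when replacing an existing object
```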

Examples

Create a synthetic copy of your Google Cloud Storage bucket. The following config crawls a bucket, trains and runs a synthetic model, then writes the model outputs to a destination bucket while preserving the folder structure of the source bucket.

name: sample-gcs-workflow

actions:
  - name: gcs-crawl
    type: gcs_source
    connection: c_1
    config:
      bucket: my-analytics-bucket
      glob_filter: "*.csv"
      path: metrics/

  - name: model-train-run
    type: gretel_model
    input: gcs-crawl
    config:
      project_id: proj_1
      model: synthetics/default
      run_params:
        params:
          num_records_multiplier: 1.0
      training_data: "{outputs.gcs-crawl.dataset.files.data}"

  - name: gcs-sync
    type: gcs_destination
    connection: c_1
    input: model-train-run
    config:
      bucket: my-synthesized-analytics-bucket
      input: "{outputs.model-train-run.dataset.files.data}"
      filename: "{outputs.gcs-crawl.dataset.files.filename}"
      path: metrics/
