Config Syntax

Workflows are configured using YAML and can be managed from the Gretel Console, the SDK, or the CLI.

Config Structure

Workflows are configured using three top-level blocks: name, trigger, and actions.

Name

The name field sets the name of the workflow. This name is used as the canonical reference to the workflow. Workflow names do not need to be unique within a project, but should be descriptive enough to identify the workflow's purpose.

name: anonymize-redshift-analytics-schema

Trigger

Triggers may be used to schedule recurring workflows using standard cron syntax. To schedule a workflow to run once daily, a workflow trigger might look like this:

trigger:
  cron:
    pattern: "@daily"

For more detailed documentation, please refer to the Scheduled Workflows docs.

Actions

The actions block configures each step in the workflow.

actions:
  - name: s3-read
    type: s3_source
    connection: c_1
    config:
      bucket: my-bucket
      glob_filter: "*.csv"
      path: prod-analytics-daily/

Every action definition shares the same top-level configuration envelope, consisting of the following fields:

name

An identifier for the action. Action names must be unique within the scope of a workflow.

type

The specific action type, e.g. s3_source or gretel_model (see below).

connection

Pass a Connection ID to authenticate the action. This field is required for actions that connect to external services such as S3 or BigQuery.

input

Specify a preceding action as input to the current action.

config

The type-specific config.

See the Connectors section for type and config details for actions that work with sources and sinks. See Working with Models for type and config details for actions that interface with Gretel.
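
Putting the envelope together, here is a minimal sketch of a two-action workflow. The action names and config values are illustrative, reusing the s3_source and gretel_model types shown on this page:

actions:
  - name: extract                # unique within this workflow
    type: s3_source              # action type
    connection: c_1              # Connection ID used to authenticate to S3
    config:                      # type-specific config
      bucket: my-bucket
      glob_filter: "*.csv"
      path: raw/

  - name: synthesize
    type: gretel_model
    input: extract               # consume outputs from the preceding action
    config:
      project_id: proj_1
      model: synthetics/tabular-actgan
      training_data: "{outputs.extract.dataset.files.data}"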

Template Expressions

Template expressions are used to dynamically configure actions based on the result of a preceding action. Template expressions are denoted by curly braces, i.e. {<template-expression>}.

Accessing Action Outputs

Action outputs are accessed via the following form:

{outputs.<action-name>.<output-name>[.<attr>...]}

For example, a dataset output from a MySQL source action would be referenced like this:

{outputs.mysql-extract.dataset}

You can append attribute components to the expression to dive into the output data structure. For example, to get the filename of each object from an Azure blob storage source action:

{outputs.blob-crawl.dataset.files.filename}
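
Inside an action config, such an expression is used as an ordinary field value. A minimal sketch, reusing the blob-crawl expression above (the surrounding config field is illustrative):

config:
  filename: "{outputs.blob-crawl.dataset.files.filename}"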

Enumerating Template Expression Values

Consider the following workflow config:

name: sample-s3-workflow

actions:
  - name: s3-read
    type: s3_source
    connection: c_1
    config:
      bucket: my-analytics-bucket
      glob_filter: "*.csv"
      path: metrics/

  - name: model-train-run
    type: gretel_model
    input: s3-read
    config:
      project_id: proj_1
      model: synthetics/tabular-actgan
      run_params:
        params:
          num_records_multiplier: 1.0
      training_data: "{outputs.s3-crawl.dataset.files.data}"

  - name: s3-write
    type: s3_destination
    connection: c_1
    input: model-train-run
    config:
      bucket: my-synthetic-bucket
      path: metrics/
      filename: "{outputs.s3-read.dataset.files.filename}"
      input: "{outputs.model-train-run.dataset.files.data}"

In this config, the s3-read action outputs a dataset object. In the next action, model-train-run, we use the template expression {outputs.s3-read.dataset.files.data} to define the training_data for that action. When the workflow executes, the expression resolves to a concrete set of values based on the outputs of s3-read.

If the s3-read action finds two files, a.csv and b.csv, we will enumerate two concrete instances of the model-train-run config with:

  • training_data: <data handle to a.csv>

  • training_data: <data handle to b.csv>

Each instance of the config is passed to the model-train-run action, resulting in two trained models: one for a.csv and another for b.csv.
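
Conceptually, the two enumerated configs resolve as sketched below; the angle-bracket placeholders stand in for data handles the runtime supplies at execution time:

config:                              # instance 1, trained on a.csv
  project_id: proj_1
  model: synthetics/tabular-actgan
  run_params:
    params:
      num_records_multiplier: 1.0
  training_data: <data handle to a.csv>

config:                              # instance 2, trained on b.csv
  project_id: proj_1
  model: synthetics/tabular-actgan
  run_params:
    params:
      num_records_multiplier: 1.0
  training_data: <data handle to b.csv>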

Additionally, an action config can include multiple template expressions referring to different lists. For example, the s3-write action above is configured with two template expressions: one referencing the original source filename, the other referencing the synthesized data. When resolving these expressions, the workflow runtime automatically aligns them, so that again two concrete instances of the s3-write config are enumerated:

  • filename: "a.csv" input: <data handle to the synthetic output from the model trained on a.csv>

  • filename: "b.csv" input: <data handle to the synthetic output from the model trained on b.csv>
