Working with Data

To use data extracted by a connector as training input to a Gretel model, we need to understand how data is passed between Workflow Actions. Each Workflow Action produces a set of outputs that can be referenced by downstream actions as inputs.

These inputs are set in each action's config block as template expressions. The properties available on an input depend on the type of data being worked with.

Data Types

File

The file data structure holds information about a data file, such as a CSV in object storage.

data

string, the data handle

filename

string, the name of the file without any path prefix (e.g. events.csv)

source_filename

string, the name of the file with any path prefix (e.g. sources/events.csv)

Table

The table data structure holds information about a table extracted from a relational database or data warehouse.

data

string, the data handle

name

string, the name of the table

Dataset

A dataset is an umbrella data structure containing collections of files and tables, as well as metadata like table relationships used internally by various actions. All actions output exactly one dataset.

files

file list

tables

table list
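The three structures above can be pictured as plain records. The following is a minimal sketch for orientation only, not Gretel's implementation: the class names mirror the data types described here, the handle values are made-up placeholders, and the internal metadata a dataset carries (such as table relationships) is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    data: str             # the data handle
    filename: str         # file name without path prefix, e.g. "events.csv"
    source_filename: str  # file name with any path prefix, e.g. "sources/events.csv"


@dataclass
class Table:
    data: str  # the data handle
    name: str  # the name of the table


@dataclass
class Dataset:
    # A dataset groups the file and table representations of the data.
    files: List[File] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)


# Placeholder values, for illustration only.
dataset = Dataset(
    files=[
        File(
            data="<file-handle>",
            filename="events.csv",
            source_filename="sources/events.csv",
        )
    ],
    tables=[Table(data="<table-handle>", name="sources_events_csv")],
)
```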

Some actions natively work with files, such as actions interfacing with object stores. Others natively work with tables, such as those connecting to relational databases. A dataset will contain both a file and a table representation of every data source. This allows you to create workflows that extract data from one kind of data source but write to a different type of destination.

File and table names are formatted with downstream compatibility in mind. An object store source action will preserve file names as-is, and create database-friendly names for the corresponding table representation.

file
    source_filename: path/to/data.csv
    filename: data.csv

table
    name: path_to_data_csv

Similarly, a database source action will preserve table names as-is, and create file storage-friendly names for the corresponding file representation.

table
    name: user_events

file
    source_filename: user_events.csv
    filename: user_events.csv
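The two naming examples above can be reproduced with a few lines of Python. This is a sketch under assumptions: the exact rules each source action applies are not specified here, so the helper names, the "replace every run of non-alphanumeric characters with an underscore" rule, and the default csv extension are all hypothetical choices that happen to match the examples.

```python
import re
from pathlib import PurePosixPath


def filename_from_path(source_filename: str) -> str:
    """Drop any path prefix: "path/to/data.csv" becomes "data.csv"."""
    return PurePosixPath(source_filename).name


def table_name_from_path(source_filename: str) -> str:
    """Assumed rule: collapse each run of non-alphanumeric characters to "_"."""
    return re.sub(r"[^0-9A-Za-z]+", "_", source_filename).strip("_")


def file_fields_from_table(name: str, extension: str = "csv") -> dict:
    """Assumed rule: a table maps to a file named "<table>.<extension>"."""
    filename = f"{name}.{extension}"
    return {"source_filename": filename, "filename": filename}


# Object store source: the file name is kept, the table name is derived.
print(filename_from_path("path/to/data.csv"))
print(table_name_from_path("path/to/data.csv"))

# Database source: the table name is kept, the file names are derived.
print(file_fields_from_table("user_events"))
```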

Referencing outputs via Template Expressions

All Gretel Workflow actions output a dataset object that can then be referenced from a template expression in subsequent actions. Some actions require an entire dataset as input, while others require finer-grained inputs like file names and data handles. Each action documents its required inputs.
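To make the difference between a whole-dataset input and a finer-grained input concrete, the sketch below resolves a dotted reference against the output of an upstream action. It is only an illustration of the idea: the action name extract, the dotted reference format, and the resolve helper are hypothetical, and the real template expression syntax is the one described in the Config Syntax docs.

```python
from typing import Any


def resolve(reference: str, outputs: dict) -> Any:
    """Walk a dotted reference such as "extract.dataset.files.filename".

    When a step lands on a list, the rest of the path is applied to each item.
    """

    def walk(value: Any, parts: list) -> Any:
        if not parts:
            return value
        if isinstance(value, list):
            return [walk(item, parts) for item in value]
        return walk(value[parts[0]], parts[1:])

    return walk(outputs, reference.split("."))


# Hypothetical output of an upstream action named "extract".
outputs = {
    "extract": {
        "dataset": {
            "files": [
                {
                    "data": "<file-handle>",
                    "filename": "events.csv",
                    "source_filename": "sources/events.csv",
                }
            ],
            "tables": [{"data": "<table-handle>", "name": "sources_events_csv"}],
        }
    }
}

# An action that needs the entire dataset.
whole_dataset = resolve("extract.dataset", outputs)

# An action that only needs file names or data handles.
filenames = resolve("extract.dataset.files.filename", outputs)
handles = resolve("extract.dataset.files.data", outputs)
```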

For more detail on template expression syntax, see the Config Syntax docs.
