Working with Data
To use data extracted by a connector as training input to a Gretel model, we need to understand how data is passed between Workflow Actions. Each Workflow Action produces a set of outputs that can be referenced by downstream actions as inputs.
These inputs are configured in each action's config block as template expressions. The properties of an input take different forms depending on the type of data being worked with.
The file data structure holds information about a data file, such as a CSV in object storage.
The table data structure holds information about a table extracted from a relational database or data warehouse.
A dataset is an umbrella data structure containing collections of files and tables, as well as metadata like table relationships used internally by various actions. All actions output exactly one dataset.
Some actions natively work with files, such as actions interfacing with object stores. Others natively work with tables, such as those connecting to relational databases. A dataset will contain both a file and a table representation of every data source. This allows you to create workflows that extract data from one kind of data source but write to a different type of destination.
file and table names are formatted with downstream compatibility in mind. An object store source action will preserve file names as-is, and create database-friendly names for the corresponding table representation.
Similarly, a database source action will preserve table names as-is, and create file storage-friendly names for the corresponding file representation.
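The naming behavior described above can be pictured with a small Python sketch. The exact rules Gretel applies are not stated here, so the normalization below (replacing unsupported characters with underscores, lowercasing table names, prefixing names that start with a digit) is purely an assumption for illustration, and both function names are hypothetical.

```python
import re
from pathlib import PurePosixPath


def table_name_from_file(source_filename: str) -> str:
    """Derive a database-friendly identifier from a file path (hypothetical rule)."""
    # Drop any directory prefix and the final extension.
    stem = PurePosixPath(source_filename).stem
    # Replace anything that is not a letter, digit, or underscore, then lowercase.
    name = re.sub(r"[^0-9a-zA-Z_]", "_", stem).lower()
    # Many databases reject identifiers that start with a digit.
    return f"t_{name}" if name[:1].isdigit() else name


def file_name_from_table(table_name: str, extension: str = "csv") -> str:
    """Derive a file-storage-friendly name from a table name (hypothetical rule)."""
    # Keep letters, digits, underscores, dots, and hyphens; replace the rest.
    safe = re.sub(r"[^0-9a-zA-Z_.-]", "_", table_name)
    return f"{safe}.{extension}"


if __name__ == "__main__":
    print(table_name_from_file("sources/events.csv"))
    print(file_name_from_table("order items"))
```

The point of the sketch is only the direction of the mapping: the native name is kept untouched, and the derived name is made safe for the other kind of destination.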
All Gretel Workflow actions output a dataset object that can then be referenced from a template expression in subsequent actions. Some actions require an entire dataset as input, while others require finer-grained inputs like file names and data handles. Each action documents its required inputs.
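To make the difference between "an entire dataset" and "finer-grained inputs" concrete, here is a minimal Python sketch that treats a dataset as a plain dictionary and pulls out file names and data handles. The dictionary shape follows the properties listed on this page; the handle values, the sample names, and both helper functions are hypothetical placeholders, and this is not Gretel's template expression syntax.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical dataset output from an upstream action.
# The handle strings are placeholders, not real Gretel data handles.
dataset: Dict[str, List[Dict[str, Any]]] = {
    "files": [
        {"data": "file-handle-1", "filename": "events.csv",
         "source_filename": "sources/events.csv"},
        {"data": "file-handle-2", "filename": "users.csv",
         "source_filename": "sources/users.csv"},
    ],
    "tables": [
        {"data": "table-handle-1", "name": "events"},
        {"data": "table-handle-2", "name": "users"},
    ],
}


def file_inputs(ds: Dict[str, List[Dict[str, Any]]]) -> List[Tuple[str, str]]:
    """Return (filename, data handle) pairs for every file in the dataset."""
    return [(f["filename"], f["data"]) for f in ds["files"]]


def table_handle(ds: Dict[str, List[Dict[str, Any]]], name: str) -> str:
    """Return the data handle of the table with the given name."""
    for table in ds["tables"]:
        if table["name"] == name:
            return table["data"]
    raise KeyError(f"no table named {name!r} in dataset")


if __name__ == "__main__":
    print(file_inputs(dataset))
    print(table_handle(dataset, "users"))
```

An action that needs the whole dataset would receive the full dictionary, while one that needs finer-grained inputs would receive only values like those returned by the helpers.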
For more detail on template expression syntax, see the Config Syntax docs.
file
data: string, the data handle
filename: string, the stem of the file (e.g. events.csv)
source_filename: string, the name of the file with any path prefix (e.g. sources/events.csv)

table
data: string, the data handle
name: string, the name of the table

dataset
files: file list
tables: table list
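The properties listed above can be mirrored as Python dataclasses, which may help when reasoning about what a downstream action can reference. This is only a sketch of the shape: the class names and sample values are assumptions, and the real dataset also carries internal metadata (such as table relationships) that is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    data: str             # the data handle
    filename: str         # e.g. "events.csv"
    source_filename: str  # file name with any path prefix, e.g. "sources/events.csv"


@dataclass
class Table:
    data: str  # the data handle
    name: str  # the name of the table


@dataclass
class Dataset:
    files: List[File] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)


# Hypothetical dataset with one data source, represented both as a file and as a table.
# The handle strings are placeholders.
example = Dataset(
    files=[File(data="file-handle-1", filename="events.csv",
                source_filename="sources/events.csv")],
    tables=[Table(data="table-handle-1", name="events")],
)

if __name__ == "__main__":
    print(example.files[0].source_filename)
    print(example.tables[0].name)
```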