Editing Datasets
Datasets and their attributes can only be referenced "as-is" in action configs, for example {outputs.extract.dataset}
. However, there are cases where a dataset output by one action needs to be edited before it can be used by a downstream action. Some examples include:
Specifying primary and foreign keys on data sourced from object stores (where such metadata does not exist as it does in a relational database)
Removing a foreign key to break a cyclic table relationship
Renaming tables
The dataset_editor
action provides a way to apply alterations like these and more to datasets. It accepts a dataset from some other action as input, and outputs a modified version of that dataset for downstream actions to consume.
Adding and removing table relationships
A table relationship is used to relate two tables. The most common example is a foreign key constraint in a relational database.
A table_relationship
contains the following properties:
table | string | The name of the table containing data (typically an id column) pointing to records on another table, e.g. "the table with the foreign key" |
constrained_columns | string list | The columns populated with identifiers from the other table, e.g. "the foreign key column(s)" |
referred_table | string | The name of the table containing data records being referenced by |
referred_columns | string list | The columns to which |
For example, a relational database storing users
and their sessions
might have a foreign key user_id
on the sessions
table pointing to the users.id
column. That key can be represented as a table_relationship
:
To add or remove relationships via the dataset_editor
action, use the add_table_relationships
and remove_table_relationships
attributes, both of which accept a table_relationship
list.
Renaming tables
There are two ways to rename tables. First, tables can be renamed individually:
Alternatively, common prefixes and suffixes can be added or removed in bulk. This is particularly useful for renaming tables sourced from object storage.
Note that both these renaming mechanisms only apply to tables in the dataset; the corresponding file representations in the dataset are unaffected.
Removing tables
Tables in a dataset can be removed entirely:
Actions downstream of drop-tables
will have no awareness of the extraneous_data
table.
Dropping a table from a dataset also drops the corresponding file representation.
Setting primary keys
Primary keys can be specified on tables.
Combining datasets
The dataset editor can combine multiple datasets into one, allowing a single downstream action to operate on data extracted from disparate sources.
Table names must be unique across all datasets. The rename_all_tables
option (see "Renaming tables" above) can be used to resolve name conflicts.
Note that actions only accept a single input action (input: s3-read
in the example above). To use outputs from multiple actions in a single config, the other actions must be transitive dependencies via the defined input action. In this particular example, the s3-read
action would need to include input: mysql-extract
; the S3 action does not use any outputs from MySQL, but defining it as a dependency ensures outputs from both actions are accessible to the dataset editor action.
Putting it all together
All dataset modifications above can be performed in a single action. The order of operations is:
datasets.rename_all_tables
(and merge if there are multiple datasets)drop_tables
rename_tables
set_primary_keys
remove_table_relationships
add_table_relationships
The example below lists these in order for convenience; the actual order of these keys in your yaml config does not matter.
Last updated