Editing Datasets

Most of the time, datasets and/or their attributes should be referenced "as-is" in action configs. However, there are cases where datasets need to be edited, for example:

  • Specifying primary and foreign keys on data sourced from object stores (where such metadata does not exist as it does in a relational database)

  • Removing a foreign key to break a cyclic relationship between multiple tables

The dataset_editor action provides a way to apply alterations like these to datasets. It accepts a dataset from some other action as input, and outputs a modified version of that dataset for downstream actions to consume.

Adding and removing table relationships

A table relationship is used to relate two tables. The most common example is a foreign key constraint in a relational database.

A table_relationship may contain the following properties

table

string

The name of the table containing data (typically an id column) pointing to records on another table, e.g. "the table with the foreign key"

constrained_columns

string list

The columns populated with identifiers from the other table, e.g. "the foreign key column(s)"

referred_table

string

The name of the table containing data records being referenced by table

referred_columns

string list

The columns to which constrained_columns point

For example, a relational database storing users and their sessions might have a foreign key user_id on the sessions table pointing to the users.id column. That key can be represented as a table_relationship:

table: sessions
constrained_columns: [user_id]
referred_table: users
referred_columns: [id]

To add or remove relationships via the dataset_editor action, use the add_table_relationships and remove_table_relationships attributes, both of which accept a table_relationship list.

type: dataset_editor
name: add-relationships
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  add_table_relationships:
    - table: "sessions"
      constrained_columns: ["user_id"]
      referred_table: "users"
      referred_columns: ["id"]

Removing or renaming tables

Tables in a dataset can be removed entirely, or renamed.

type: dataset_editor
name: alter-tables
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  drop_tables:
    - table: "extraneous_data"
  rename_tables:
    - from: "original_name"
      to: "new_name"

Setting primary keys

Primary keys can be specified on tables.

type: dataset_editor
name: define-pks
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  set_primary_keys:
    - table: "users"
      primary_key: ["id"]

Combining operations

All dataset modifications above can be performed in a single action. The order of operations is:

  1. drop_tables

  2. rename_tables

  3. set_primary_keys

  4. remove_table_relationships

  5. add_table_relationships

type: dataset_editor
name: alter-dataset
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  drop_tables:
    - table: "extraneous_data"
  rename_tables:
    - from: "original_name"
      to: "new_name"
  set_primary_keys:
    - table: "users"
      primary_key: ["id"]
  add_table_relationships:
    - table: "sessions"
      constrained_columns: ["user_id"]
      referred_table: "users"
      referred_columns: ["id"]

Last updated