Editing Datasets

Datasets and their attributes can only be referenced "as-is" in action configs, for example {outputs.extract.dataset}. However, there are cases where a dataset output by one action needs to be edited before it can be used by a downstream action. Some examples include:

  • Specifying primary and foreign keys on data sourced from object stores (where such metadata does not exist as it does in a relational database)

  • Removing a foreign key to break a cyclic table relationship

  • Renaming tables

The dataset_editor action provides a way to apply alterations like these and more to datasets. It accepts a dataset from some other action as input, and outputs a modified version of that dataset for downstream actions to consume.

Adding and removing table relationships

A table relationship is used to relate two tables. The most common example is a foreign key constraint in a relational database.

A table_relationship contains the following properties:

table

string

The name of the table containing data (typically an id column) pointing to records on another table, e.g. "the table with the foreign key"

constrained_columns

string list

The columns populated with identifiers from the other table, e.g. "the foreign key column(s)"

referred_table

string

The name of the table containing data records being referenced by table

referred_columns

string list

The columns to which constrained_columns point

For example, a relational database storing users and their sessions might have a foreign key user_id on the sessions table pointing to the users.id column. That key can be represented as a table_relationship:

table: sessions
constrained_columns: [user_id]
referred_table: users
referred_columns: [id]

To add or remove relationships via the dataset_editor action, use the add_table_relationships and remove_table_relationships attributes, both of which accept a table_relationship list.

type: dataset_editor
name: add-relationships
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  add_table_relationships:
    - table: "sessions"
      constrained_columns: ["user_id"]
      referred_table: "users"
      referred_columns: ["id"]

Renaming tables

There are two ways to rename tables. First, tables can be renamed individually:

type: dataset_editor
name: rename-tables
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  rename_tables:
    - from: "original_name"
      to: "new_name"
    - from: "source_name"
      to: "preferred_name"

Alternatively, common prefixes and suffixes can be added or removed in bulk. This is particularly useful for renaming tables sourced from object storage.

type: dataset_editor
name: rename-tables
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
      rename_all_tables:
        remove_prefix: "source_"
        remove_suffix: "_csv"
type: dataset_editor
name: rename-tables
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
      rename_all_tables:
        add_prefix: "gretel_"

Note that both these renaming mechanisms only apply to tables in the dataset; the corresponding file representations in the dataset are unaffected.

Removing tables

Tables in a dataset can be removed entirely:

type: dataset_editor
name: drop-tables
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  drop_tables:
    - table: "extraneous_data"

Actions downstream of drop-tables will have no awareness of the extraneous_data table.

Dropping a table from a dataset also drops the corresponding file representation.

Setting primary keys

Primary keys can be specified on tables.

type: dataset_editor
name: define-pks
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
  set_primary_keys:
    - table: "users"
      primary_key: ["id"]

Combining datasets

The dataset editor can combine multiple datasets into one, allowing a single downstream action to operate on data extracted from disparate sources.

type: dataset_editor
name: combine
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
    - dataset: "{outputs.mysql-extract.dataset}"

Table names must be unique across all datasets. The rename_all_tables option (see "Renaming tables" above) can be used to resolve name conflicts.

Note that actions only accept a single input action (input: s3-read in the example above). To use outputs from multiple actions in a single config, the other actions must be transitive dependencies via the defined input action. In this particular example, the s3-read action would need to include input: mysql-extract; the S3 action does not use any outputs from MySQL, but defining it as a dependency ensures outputs from both actions are accessible to the dataset editor action.

Putting it all together

All dataset modifications above can be performed in a single action. The order of operations is:

  1. datasets.rename_all_tables (and merge if there are multiple datasets)

  2. drop_tables

  3. rename_tables

  4. set_primary_keys

  5. remove_table_relationships

  6. add_table_relationships

The example below lists these in order for convenience; the actual order of these keys in your yaml config does not matter.

type: dataset_editor
name: alter-dataset
input: s3-read
config:
  datasets:
    - dataset: "{outputs.s3-read.dataset}"
      rename_all_tables:
        remove_suffix: "_csv"
  drop_tables:
    - table: "extraneous_data"
  rename_tables:
    - from: "original_name"
      to: "new_name"
  set_primary_keys:
    - table: "users"
      primary_key: ["id"]
  add_table_relationships:
    - table: "sessions"
      constrained_columns: ["user_id"]
      referred_table: "users"
      referred_columns: ["id"]

Last updated