Gretel Tabular

The gretel_tabular action can be used to synthesize or transform multiple tables while preserving referential integrity between them. gretel_tabular also allows specifying different model configs for different tables, and even instructing Gretel to find optimal model configs for your data via Gretel Tuner.

Inputs

project_id

The project to create the model in.

train

(Training details, see following fields)

train.dataset

The dataset to train on, typically a reference to an upstream action's output, e.g. "{outputs.mysql-read.dataset}".

train.model_config

A YAML object that accepts a few different shapes (detailed below): 1) a complete Gretel model config; 2) a reference to a blueprint or config location (via from); 3) an autotune configuration.

train.skip_tables

(List of tables to pass through unaltered to outputs, see following fields)

train.skip_tables.table

The name of a table to skip, i.e. omit from model training and pass through unaltered.

train.table_specific_configs

(List of table-specific training details, see following fields)

train.table_specific_configs.tables

A list of table names to which the other fields in this object apply.

train.table_specific_configs.model_config

An alternative to the global default train.model_config value defined above.

run

(Run details, see following fields)

run.encode_keys

(Transform models only.) Whether to transform primary and foreign key columns. Defaults to false.
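
For example, key encoding can be enabled by adding a run block alongside train; this is a minimal sketch that reuses the transform blueprint reference from the examples below:

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "transform/default"
  run:
    encode_keys: true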

Outputs

dataset

The transformed or synthesized dataset produced by the action, which downstream actions can reference.
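
Downstream actions can consume this output using the same template syntax used for inputs; in this sketch the destination action's type and name are placeholders rather than a specific connector:

name: write-output
type: my-destination   # placeholder for whichever destination action you use
input: model-train-run
config:
  dataset: "{outputs.model-train-run.dataset}"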

Example Configs

Transform a dataset by applying a single, consistent model to all of its tables. Note that the model config can be specified as a full object...

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      schema_version: "1.0"
      name: "xform"
      models:
        - transform_v2:
            data_source: __tmp__
            globals:
              classify:
                enable: true
                num_samples: 3
            steps:
              - rows:
                  update:
                    - condition: column.entity is not none
                      value: column.entity | fake

...or a reference to a blueprint template can be provided via from:

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "transform/default"

You can apply different model configs to different tables by supplying table-specific configs:

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "transform/default"
    table_specific_configs:
      - tables: ["users"]
        model_config:
          schema_version: "1.0"
          name: "users-xform"
          models:
            - transform_v2:
                data_source: __tmp__
                globals:
                  classify:
                    enable: false
                steps:
                  - rows:
                      update:
                        - name: "ssn"
                          value: this | hash | truncate(10)

To pass a subset of tables through to the output unaltered (e.g. static reference data), specify tables to skip:

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "transform/default"
    skip_tables:
      - table: countries
      - table: states

Autotune (Gretel Tuner)

Instead of providing a specific model config, you can instruct the gretel_tabular action to run trials to identify the best model config for each table. This is accomplished via the autotune option inside model_config fields (at either the root train level, to apply to all tables, or inside a table_specific_configs entry, to apply to only a subset of tables).

Autotune objects accept the following fields:

enabled

This boolean field must be explicitly set to true to enable config tuning.

trials_per_table

Optionally specify the number of trials to run for each table. Defaults to 4.

metric

The metric to optimize for. Defaults to synthetic_data_quality_score; also accepts field_correlation_stability, field_distribution_stability, and principal_component_stability.

tuner_config

The specific Gretel Tuner config to use. Like model_config, this accepts either a full configuration object or a reference to a blueprint via from.

Example configs with Autotune

Using all autotune defaults:

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
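
The defaults described above can be overridden in the same autotune block; for instance, this sketch runs more trials per table and optimizes for a different metric (the specific values are illustrative):

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
        trials_per_table: 8
        metric: field_distribution_stability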

By default, gretel_tabular uses the tuner/tabular-actgan blueprint Tuner config, but a different blueprint can be referenced...

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
        tuner_config:
          from: "synthetics/tabular-lstm"

...or a Tuner config can be spelled out explicitly:

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
        tuner_config:
          base_config: synthetics/tabular-actgan
          params:
            batch_size:
              fixed: 500
            epochs:
              choices: [100, 500]
            generator_lr:
              log_range: [0.00001, 0.001]
            discriminator_lr:
              log_range: [0.00001, 0.001]
            embedding_dim:
              choices: [64, 128, 256]
            generator_dim:
              choices:
                - [512, 512, 512, 512]
                - [1024, 1024]
                - [1024, 1024, 1024]
                - [2048, 2048]
                - [2048, 2048, 2048]
            discriminator_dim:
              choices:
                - [512, 512, 512, 512]
                - [1024, 1024]
                - [1024, 1024, 1024]
                - [2048, 2048]
                - [2048, 2048, 2048]
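
As noted at the start of this section, autotune can also be scoped to a subset of tables by placing it inside a table_specific_configs entry rather than the top-level model_config. A minimal sketch, with illustrative table names and blueprint reference:

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      from: "synthetics/tabular-actgan"
    table_specific_configs:
      - tables: ["events", "transactions"]
        model_config:
          autotune:
            enabled: true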
