Gretel Tabular

Like gretel_model, the gretel_tabular action trains Gretel models and generates records from them. Its primary advantage is that it maintains referential integrity between related tables, so it is recommended for workflows involving relational databases or data warehouses. gretel_tabular also lets you specify different model configs for different tables, and can instruct Gretel to find optimal model configs for your data via Gretel Tuner.

Inputs

project_id

The project to create the model in.

train

(Training details, see following fields)

train.dataset

Data to use for training, including relationships between tables (if applicable). This should be a reference to a dataset output from a previous action.

train.model

(Deprecated, prefer train.model_config) A reference to a blueprint or config location. If a config location is used, it must be addressable by the workflow action. This field is mutually exclusive with train.model_config.

train.model_config

A YAML object that accepts one of three shapes (detailed below): 1) a complete Gretel model config; 2) a reference to a blueprint or config location (from); 3) an autotune configuration.

train.skip_tables

(List of tables to pass through unaltered to outputs, see following fields)

train.skip_tables.table

The name of a table to skip, i.e. omit from model training and pass through unaltered.

train.table_specific_configs

(List of table-specific training details, see following fields)

train.table_specific_configs.tables

A list of table names to which the other fields in this object apply.

train.table_specific_configs.model_config

An alternative to the global default train.model_config value defined above.

run

(Run details, see following fields)

run.encode_keys

(Transform models only.) Whether to transform primary and foreign key columns. Defaults to false.

run.num_records_multiplier

(Synthetics models only.) Parameter for scaling output table size. Defaults to 1.0.
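As a minimal sketch of the run options, the config below scales the synthetic output with num_records_multiplier. It assumes that a value of 2.0 requests roughly twice as many records per table as the source data; the action name, input name, and project ID are placeholders in the style of the examples on this page.

```yaml
type: gretel_tabular
name: model-train-run        # placeholder action name
input: mysql-read            # placeholder upstream action
config:
  project_id: proj_1         # placeholder project ID
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "synthetics/tabular-actgan"
  run:
    # Synthetics models only. Assumed to scale each output table
    # to about 2x the size of the corresponding source table.
    num_records_multiplier: 2.0
```

run.encode_keys is set in the same run block, but it applies only to Transform models, so it would not be combined with num_records_multiplier in a single config.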

Outputs

dataset

A dataset object containing the outputs from the models created by this action.

Example Configs

Generate a synthetic database by applying a consistent synthetics model to all tables in the dataset. Note that the model config can be specified as a full object...

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      schema_version: "1.0"
      name: "tabular-actgan"
      models:
        - synthetics:
            data_source: __tmp__
            params:
              epochs: auto
              vocab_size: auto
              learning_rate: auto
              batch_size: auto
              rnn_units: auto
            privacy_filters:
              outliers: auto
              similarity: auto
  run:
    num_records_multiplier: 1.0

...or a reference to a blueprint template can be provided via from:

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "synthetics/tabular-actgan"
  run:
    num_records_multiplier: 1.0

You can apply different model configs to different tables by supplying table-specific configs:

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "synthetics/tabular-actgan"
    table_specific_configs:
      - tables: ["users"]
        model_config:
          from: "synthetics/tabular-differential-privacy"
  run:
    num_records_multiplier: 1.0

To pass a subset of tables through unaltered by the model (e.g. for static reference data), specify tables to skip:

type: gretel_tabular
name: model-train-run
input: mysql-read
config:
  project_id: proj_1
  train:
    dataset: "{outputs.mysql-read.dataset}"
    model_config:
      from: "synthetics/tabular-actgan"
    skip_tables:
      - table: countries
      - table: states
  run:
    num_records_multiplier: 1.0

Autotune (Gretel Tuner)

Instead of providing a specific model config, you can instruct the gretel_tabular action to run trials to identify the best model config for each table. To do so, set the autotune option inside a model_config field, either at the root train level to apply to all tables, or inside a table_specific_configs entry to apply to only a subset of tables.

Autotune objects accept the following fields:

enabled

This boolean field must be explicitly set to true to enable config tuning.

trials_per_table

Optionally specify the number of trials to run for each table. Defaults to 4.

metric

The metric to optimize for. Defaults to synthetic_data_quality_score; also accepts field_correlation_stability, field_distribution_stability, and principal_component_stability.

tuner_config

The specific Gretel Tuner config to use. Like model_config, this accepts either full configuration objects, or references to blueprints via from.

Example configs with Autotune

Using all autotune defaults:

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
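The optional autotune fields can be set alongside enabled. The following sketch overrides both defaults; the value 8 for trials_per_table is an arbitrary illustration, and the metric is one of the accepted values listed above.

```yaml
name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
        # Illustrative value; the default is 4 trials per table.
        trials_per_table: 8
        # Optimize for this metric instead of the default
        # synthetic_data_quality_score.
        metric: field_distribution_stability
```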

By default, gretel_tabular uses the tuner/tabular-actgan blueprint Tuner config, but a different blueprint can be referenced...

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
        tuner_config:
          from: "synthetics/tabular-lstm"

...or a Tuner config can be spelled out explicitly:

name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    model_config:
      autotune:
        enabled: true
        tuner_config:
          base_config: synthetics/tabular-actgan
          params:
            batch_size:
              fixed: 500
            epochs:
              choices: [100, 500]
            generator_lr:
              log_range: [0.00001, 0.001]
            discriminator_lr:
              log_range: [0.00001, 0.001]
            embedding_dim:
              choices: [64, 128, 256]
            generator_dim:
              choices:
                - [512, 512, 512, 512]
                - [1024, 1024]
                - [1024, 1024, 1024]
                - [2048, 2048]
                - [2048, 2048, 2048]
            discriminator_dim:
              choices:
                - [512, 512, 512, 512]
                - [1024, 1024]
                - [1024, 1024, 1024]
                - [2048, 2048]
                - [2048, 2048, 2048]
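Autotune can also be limited to a subset of tables by placing it inside a table_specific_configs entry, as described at the start of this section. The sketch below assumes that tables not listed in the entry fall back to the root model_config; the table names users and orders are hypothetical.

```yaml
name: synthesize
type: gretel_tabular
input: extract
config:
  project_id: proj_1
  train:
    dataset: "{outputs.extract.dataset}"
    # Default config, assumed to apply to every table not listed below.
    model_config:
      from: "synthetics/tabular-actgan"
    table_specific_configs:
      # Hypothetical table names; only these tables are tuned.
      - tables: ["users", "orders"]
        model_config:
          autotune:
            enabled: true
```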
