BigQuery

Read from and write to BigQuery.

Getting Started

To create a BigQuery-based workflow, you will need:

  1. A source BigQuery connection.

  2. (optional) A list of tables OR SQL queries.

  3. (optional) A destination BigQuery connection.

Do not use your input data warehouse connection as an output connection. Doing so can unintentionally overwrite existing data.

Create a BigQuery Connection

Google BigQuery-related actions require creating a bigquery connection. The connection must be configured with the correct permissions for each Gretel Workflow Action.

For specific permissions, please refer to the Minimum Permissions section under each corresponding action.

Gretel bigquery connections require the following fields:

Connection Creation Parameters

name
Display name of your choosing, used to identify your connection within Gretel.
Example: my-bigquery-connection

connection_target_type
source or destination, depending on whether you want to read data from or write data to the connection.
Example: source

project_id
ID of the Google Cloud project containing your dataset.
Example: my-project-id

service_account_email
The service account email associated with your private key.
Example: service-account-name@my-project-id.iam.gserviceaccount.com

dataset
Name of the dataset to connect to.
Example: my-dataset-name

private_key_json
Private key JSON blob used to authenticate Gretel.
Example: { "type": "service_account", "project_id": "my-project-id", "private_key_id": "Oabc1def2345678g90123h456789012h34561718", "private_key": "-----BEGIN PRIVATE KEY----- ... }

Create a Service Account

To generate a private key, first create a service account, then download a key for that service account.
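As a minimal sketch using the gcloud CLI (the service account name gretel-bigquery-source, the project ID my-project-id, and the key filename are placeholders for your own values):

# Create a service account for Gretel to authenticate as
gcloud iam service-accounts create gretel-bigquery-source \
    --project=my-project-id \
    --display-name="Gretel BigQuery source"

# Download a JSON key for that service account; the contents of key.json
# become the private_key_json field of the Gretel connection
gcloud iam service-accounts keys create key.json \
    --iam-account=gretel-bigquery-source@my-project-id.iam.gserviceaccount.com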

Configure Dataset IAM Permissions

After the service account has been created, you can attach dataset-specific permissions to the service account.

Please see each action's Minimum Permissions section for a list of permissions to attach to the service account.
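As a rough sketch, one way to grant access is a project-level IAM binding with the gcloud CLI. The role below (roles/bigquery.dataViewer, suitable for a source connection) is an assumption; choose whichever role covers the Minimum Permissions for the action you are using, or scope the grant to a single dataset via the BigQuery console or the bq tool instead:

# Grant a BigQuery role to the service account at the project level
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:gretel-bigquery-source@my-project-id.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataViewer"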

Creating Connections

First, create a file on your local computer containing the connection credentials. This file should include type, name, connection_target_type, config, and credentials. The config and credentials fields contain the fields specific to the connection being created.

Below is an example BigQuery connection credential file:

{
    "type": "bigquery",
    "name": "my-bigquery-connection",
    "connection_target_type": "source",
    "config": {
        "project_id": "my-project-id",
        "service_account_email": "service-account-name@my-project-id.iam.gserviceaccount.com",
        "dataset": "my-dataset"
    },
    "credentials": {
        "private_key_json": "..."
    }
}

Now that you've created the credentials file, use the CLI to create the connection:

gretel connections create --project [project id] --from-file [credential_file.json]

BigQuery Source Action

Type: bigquery_source

Connection: bigquery

The bigquery_source action reads data from your BigQuery dataset. It can be used to extract:

  • the entire dataset, OR

  • selected tables from the dataset, OR

  • the results of one or more SQL queries against the dataset.

Each time the workflow is run, the source action extracts the most recent data from the source dataset.

When combined in a workflow, the data extracted by the bigquery_source action is used to train models and generate synthetic data with the gretel_tabular action, and can be written to an output database with the bigquery_destination action. Your generated data can also be written to object storage connections; for more information, see Writing to Object Storage.

Inputs

The bigquery_source action takes slightly different inputs depending on the type of data you wish to extract. The sections below show the input config parameters and an example action YAML for each type of extraction.

Entire Dataset

sync.mode

full - extracts all records from tables in the dataset

(coming soon) subset - extracts a percentage of records from tables in the dataset

Example Source Action YAML

actions:
  - name: bigquery-read
    type: bigquery_source
    connection: conn_1
    config:
      sync:
        mode: full

Outputs

Whether you are extracting an entire dataset, selected tables, or querying against a dataset, the bigquery_source action always provides a single output, dataset.

dataset

A reference to the data extracted from the database, including tables and (if defined) relationships/schema.

The output of a bigquery_source action can be used as the input to a gretel_tabular action in order to transform and/or synthesize a dataset.
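For example, the source output can be referenced from a gretel_tabular action as shown below (the action names, Gretel project ID, and model are placeholders, mirroring the full workflow example later on this page):

  - name: synthesize
    type: gretel_tabular
    input: bigquery-read
    config:
      project_id: proj_1
      train:
        model: "synthetics/tabular-actgan"
        dataset: {outputs.bigquery-read.dataset}
      run:
        num_records_multiplier: 1.0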

Minimum Permissions

The associated service account must have the following permissions for the configured dataset:

  • bigquery.datasets.get

BigQuery Destination Action

Type: bigquery_destination

Connection: bigquery

The bigquery_destination action can be used to write gretel_tabular action outputs to BigQuery destination datasets.

Inputs

Whether you are writing an entire dataset, selected tables, or table(s) created via SQL query, the bigquery_destination action always takes the same input, dataset.

dataset

A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.

sync.mode

truncate - overwrites existing data in destination table(s); the table(s) must already exist in the destination dataset

replace - overwrites any existing data in table(s) at the destination, creating the table(s) if necessary

append - adds generated data to existing table(s); only supported for query-created tables without primary keys

Example Destination Action YAML

actions:
...
  - name: bigquery-write
    type: bigquery_destination
    connection: conn_2
    input: synthesize
    config:
      sync:
        mode: replace
      dataset: {outputs.synthesize.dataset}

Sync Modes

The BigQuery destination action uses a load job to write records into destination tables.

There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode field on a destination config.

sync.mode may be one of truncate, replace, or append. Each sync mode configures a write disposition and a create disposition that determine how rows are inserted and how destination tables are created.

When sync.mode is configured with truncate

  • Records are written with WRITE_TRUNCATE

  • The destination table must already exist in the destination dataset.

When sync.mode is configured with replace

  • Records are written with WRITE_TRUNCATE

  • The destination table is created if necessary with CREATE_IF_NEEDED

When sync.mode is configured with append

  • Records are appended with WRITE_APPEND

  • The destination table is created if necessary with CREATE_IF_NEEDED

For more information on how job dispositions behave, please reference writeDisposition and createDisposition in the BigQuery JobConfigurationLoad docs.
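For instance, writing into a table that already exists in the destination dataset (preserving its schema) can use truncate. This sketch mirrors the destination example above with only the sync mode changed:

actions:
...
  - name: bigquery-write
    type: bigquery_destination
    connection: conn_2
    input: synthesize
    config:
      sync:
        mode: truncate
      dataset: {outputs.synthesize.dataset}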

Minimum Permissions

The associated service account must have the following permissions for the configured dataset:

  • bigquery.datasets.create

  • bigquery.datasets.delete (supports replacing an existing table in the dataset)

Writing to Object Storage

You can also write your output dataset to an object storage connection like Google Cloud Storage. Whether you are writing an entire dataset, selected tables, or table(s) created via SQL query, the {object_storage}_destination action always takes the same inputs: filename, input, and path. Additionally, S3 and GCS take bucket, while Azure Blob takes container.

Example Destination Action YAML

actions:
...
  - name: gcs-write
    type: gcs_destination
    connection: conn_2
    input: <action-name>
    config:
      bucket: my-gretel-bucket
      filename: {outputs.<action-name>.dataset.files.filename}
      input: {outputs.<action-name>.dataset.files.data}
      path: gretel

Example Workflow Configs

Create a synthetic version of your BigQuery dataset.

The following config will extract the entire BigQuery dataset, train and run a synthetic model, then write the outputs of the model back to a destination BigQuery dataset while maintaining referential integrity.

name: sample-bigquery-workflow-full-dataset

actions:
  - name: bigquery-read
    type: bigquery_source
    connection: conn_1
    config:
      sync:
        mode: full

  - name: synthesize
    type: gretel_tabular
    input: bigquery-read
    config:
      project_id: proj_1
      train:
        model: "synthetics/tabular-actgan"
        dataset: {outputs.bigquery-read.dataset}
      run:
        num_records_multiplier: 1.0

  - name: bigquery-write
    type: bigquery_destination
    connection: conn_2
    input: synthesize
    config:
      sync:
        mode: replace
      dataset: {outputs.synthesize.dataset}
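To submit this workflow, you can use the Gretel CLI. The exact subcommand and flags below are an assumption (and the config filename is a placeholder), so check gretel workflows --help for the current syntax:

gretel workflows create --project [project id] --config sample-bigquery-workflow.yaml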
