BigQuery
Read from and write to BigQuery.
Getting Started
Prerequisites to create a BigQuery based workflow. You will need
A source BigQuery connection.
(optional) A list of tables OR SQL queries.
(optional) A destination BigQuery connection.
Do not use your input data warehouse connection as an output connector. This action can result in the unintended overwriting of existing data.
Create a BigQuery Connection
Google BigQuery related actions require creating a bigquery
connection. The connection must be configured with the correct permissions for each Gretel Workflow Action.
For specific permissions, please refer to the Minimum Permissions section under each corresponding action.
Gretel bigquery
connections require the following fields:
Connection Creation Parameters
name
Display name of your choosing used to identify your connection within Gretel.
my-bigquery-connection
connection_target_type
source
or destination
depending on whether you want to read data from or write data to the connection.
source
project_id
ID of the Google project containing your dataset.
my-project-id
service_account_email
The service account email associated with your private key.
service-account-name@my-project-id.iam.gserviceaccount.com
dataset
Name of the dataset to connect to.
my-dataset-name
private_json_key
Private key JSON blob used to authenticate Gretel.
{ "type": "service_account", "project_id": "my-project-id", "private_key_id": "Oabc1def2345678g90123h456789012h34561718", "private_key": "-----BEGIN PRIVATE KEY-----/ ... }
Create a Service Account
In order to generate a private key you will first need to create a service account, and then download the key for that service account.
Configure Bucket IAM Permissions
After the service account has been created, you can attach dataset specific permissions to the service account.
Please see each action's Minimum Permissions section for a list of permissions to attach to the service account.
Creating Connections
First, create a file on your local computer containing the connection credentials. This file should also include type
, name
, config
, and credentials
. The config
and credentials
fields should contain fields that are specific to the connection being created.
Below is an example BigQuery connection credential file:
Now that you've created the credentials file, use the CLI to create the connection
BigQuery Source Action
Type
bigquery_source
Connection
bigquery
The bigquery_source
action reads data from your BigQuery dataset. It can be used to extract:
the entire dataset, OR
selected tables from the dataset, OR
the results of SQL query/queries against the dataset.
Each time the workflow is run the source action will extract the most recent data from the source database.
When combined in a workflow, the data extracted from the bigquery_source
action is used to train models and generate synthetic data with the gretel_tabular
action, and can be written to an output database with the bigquery_destination
action. Your generated data can also be written to object storage connections, for more information see Writing to Object Storage.
Inputs
The bigquery_source
action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.
Entire Dataset
sync.mode
full
- extracts all records from tables in dataset
(coming soon) subset
- extract percentage of records from tables in dataset
Example Source Action YAML
Outputs
Whether you are extracting an entire dataset, selected tables, or querying against a dataset, the bigquery_source
action always provides a single output, dataset
.
dataset
A reference to the data extracted from the database, including tables and (if defined) relationships/schema.
The output of a bigquery_source
action can be used as the input to a gretel_tabular
action in order to transform and/or synthesize a dataset.
Minimum Permissions
The associated service account must have the following permissions for the configured dataset:
bigquery.datasets.get
BigQuery Destination Action
Type
bigquery_destination
Connection
bigquery
The bigquery_destination
action can be used to write gretel_tabular
action outputs to Snowflake destination databases.
Inputs
Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the bigquery_destination
action always takes the same input, dataset
.
dataset
A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.
sync.mode
replace
- overwrites any existing data in table(s) at destination
append
- add generated data to existing table(s); only supported for query-created tables without primary keys
Example Destination Action YAML
Sync Modes
The BigQuery destination action uses a load job to write records into destination tables.
There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode
field on a destination config.
sync.mode
may be one of truncate
, replace
, or append
. Each sync mode will configure a write and create disposition that determines how rows are inserted, and how destination tables are created.
When sync.mode
is configured with truncate
Records are written with
WRITE_TRUNCATE
The destination table must already exist in the destination dataset.
When sync mode is configured with replace
Records are written with
WRITE_TRUNCATE
The destination table is created if necessary with
CREATE_IF_NEEDED
When sync.mode
is configured with append
Records are appended with
WRITE_APPEND
The destination table is created if necessary with
CREATE_IF_NEEDED
For more information on how job dispositions behave, please reference writeDisposition
and createDisposition
from JobConfigurationQuery docs.
Minimum Permissions
The associated service account must have the following permissions for the configured dataset:
bigquery.datasets.create
bigquery.datasets.delete
(supports replacing an existing file in the dataset)
Writing to Object Storage
You can also write your output dataset to an object storage connection like Google Cloud Storage. Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the {object_storage}_destination
action always takes the same inputs - filename
and input
, and path
. Additionally, S3 and GCS take bucket
while Azure Blob takes container
.
Example Destination Action YAML
Example Workflow Configs
Create a synthetic version of your BigQuery dataset.
The following config will extract the entire BigQuery dataset, train and run a synthetic model, then write the outputs of the model back to a destination BigQuery dataset while maintaining referential integrity.
Last updated