Databricks

Read from and write to Databricks.

Getting Started

Prerequisites to create a Databricks based workflow. You will need

  1. A source Databricks connection.

  2. (optional) A list of tables OR SQL queries.

  3. (optional) A destination Databricks connection.

Do not use your input Databricks connection as an output connector. This action can result in the unintended overwriting of existing data.

Create a Connection

Before creating the Databricks connection on Gretel, please ensure that the compute cluster has been started (i.e. Spark Cluster or SQL Warehouse) to ensure that validation doesn't timeout.

A databricks connection is created using the following parameters:

Connection Creation Parameters

name

Display name of your choosing used to identify your connection within Gretel.

my-databricks-connection

server_hostname

Fully qualified domain name (FQDN) used to establish connection to database server.

account_identifier.cloud.databricks.com

http_path

The http path of the cluster.

/sql/1.0/warehouses/foo

personal_access_token

Security credential to authenticate databricks account (36 characters)

dapi....

catalog

Name of catalog to connect to.

MY_CATALOG

schema

Name of schema.

MY_SCHEMA

(optional) params

Optional JDBC URL parameters that can be used for advanced configuration.

role=MY_ROLE

Create a Service Principal & Personal Access Token

In order to generate a personal access token you will first need to create a service principal, and then generate a personal access token for that service account.

Creating Connections

First, create a file on your local computer containing the connection credentials. This file should also include type, name , config, and credentials. The config and credentials fields should contain fields that are specific to the connection being created.

Below is an example Databricks connection:

{
    "type": "databricks",
    "name": "my-databricks-connection",
    "config": {
        "server_hostname": "account_identifier.cloud.databricks.com",
        "http_path": "/sql/1.0/warehouses/foo",
        "catalog": "MY_WAREHOUSE",
        "schema": "MY_SCHEMA",
        "params": "role=MY_ROLE"
    },
    "credentials": {
        "personal_access_token": "dapi..."
    }
}

Now that you've created the credentials file, use the CLI to create the connection

gretel connections create --project [project id] --from-file [credential_file.json]

Permissions

Source Connection Permissions

The Databricks source action requires enough access to read from tables and access schema metadata.

Add the following permissions to the Service Principal that was created above in order to be able to read data.

Destination Connection Permissions

Ensure that the user/server principal is a part of the ownership group for the Destination catalog or schema.

The Databricks destination action requires enough permissions to write to the destination schema.

Add the following permissions to the Service Principal that was created above in order to be able to write data.

Databricks Source Action

Type

databricks_source

Connection

databricks

The databricks_source action reads data from your Databricks database. It can be used to extract:

  • an entire database, OR

  • selected tables from a database, OR

  • the results of SQL query/queries against a database.

Each time the workflow is run the source action will extract the most recent data from the source database.

When combined in a workflow, the data extracted from the databricks_source action is used to train models and generate synthetic data with the gretel_tabular action, and can be written to an output database with the databricks_destination action.

For the source Databricks connection, we recommend using a backup or clone with read-only permissions, instead of connecting directly to your production database.

Inputs

The databricks_source action takes slightly different inputs depending on the type of data you wish to extract. Flip through the tabs below to see the input config parameters and example action YAMLs for each type of extraction.

Entire Database

sync.mode

full - extracts all records from tables in database

(coming soon) subset - extract percentage of records from tables in database

Example Source Action YAML

actions:
  - name: extract-database
    type: databricks_source
    connection: conn_1
    config:
      sync:
          mode: full

Outputs

Whether you are extracting an entire database, selected tables, or querying against a database, the databricks_source action always provides a single output, dataset.

dataset

A reference to the data extracted from the database, including tables and relationships/schema.

The output of a databricks_source action can be used as the input to a gretel_tabular action in order to transform and/or synthesize a database.

Databricks Destination Action

Type

databricks_destination

Connection

databricks

The databricks_destination action can be used to write gretel_tabular action outputs to Databricks destination databases.

Inputs

Whether you are writing an entire database, selected tables, or table(s) created via SQL query, the databricks_destination action always takes the same input, dataset.

dataset

A reference to the table(s) generated by Gretel and (if applicable) the relationship schema extracted from the source database.

sync.mode

replace - overwrites any existing data in table(s) at destination

append - add generated data to existing table(s); only supported for query-created tables without primary keys

volume

Unity Catalog volume where the destination data will be staged temporarily before writing to tables. A volume name must be specified in the destination action YAML in order for the write to succeed..

Example Destination Action YAML

actions:
...
  - name: databricks-write
    type: databricks_destination
    connection: conn_2
    input: synthesize
    config:
      sync:
        mode: replace
      dataset: {outputs.synthesize.dataset}
      volume: "MY_VOLUME_NAME"

Sync Modes

There are multiple strategies for writing records into the destination database. These strategies are configured from the sync.mode field on a destination config.

sync.mode may be one of truncate, replace, or append.

Sync Mode: Truncate

When sync.mode is configured with truncate, records are first truncated from the destination table using the TRUNCATE TABLE DML command.

When sync mode is configured with truncate the destination table must already exist in the database.

Sync Mode: Replace

When sync.mode is configured with replace, the destination table is first dropped and then recreated using the schema from the source table.

If the source table is from Databricks, the DDL is extracted using the GET_DDL metadata function. If the source table is from a non Databricks source, the destination table schema is inferred based on the column types of the database.

When sync mode is configured with replace the destination table does not need to exist in the destination.

To respect foreign key constraints and referential integrity, tables without foreign keys are inserted first, and tables with foreign key references are inserted last.

When applying table DML for truncate or replace, operations are applied in reverse insertion order. This is to ensure records aren't deleted with incoming foreign key references.

It's also important to note: all table data is first dropped from the database before inserting new records back in. These operations are not atomic, so there may be periods of time when the destination database is in an incomplete state.

Sync Mode: Append

When sync.mode is configured with append, the destination action will simply insert records into the table, leaving any existing records in place.

When using the append sync mode, referential integrity is difficult to maintain. It's only recommended to use append mode when syncing adhoc queries to a destination table.

If append mode is configured with a source that syncs an entire database, it's likely the destination will be unable to insert records while maintaining foreign key constraints or referential integrity.

Example Workflow Configs

Create a synthetic version of your Databricks database.

The following config will extract the entire Databricks database, train and run a synthetic model, then write the outputs of the model back to a destination Databricks database while maintaining referential integrity.

name: sample-databricks-workflow-full-db

actions:
  - name: databricks-read
    type: databricks_source
    connection: conn_1
    config:
      sync:
          mode: full

  - name: synthesize
    type: gretel_tabular
    input: databricks-read
    config:
      project_id: proj_1
      train:
        model: "synthetics/tabular-actgan"
        dataset: {outputs.databricks-read.dataset}
      run:
        num_records_multiplier: 1.0

  - name: databricks-write
    type: databricks_destination
    connection: conn_2
    input: synthesize
    config:
      sync:
        mode: replace
      dataset: {outputs.synthesize.dataset}
      volume: "MY_VOLUME_NAME"

Demo Notebook

Check out this sample notebook which demonstrates an end to end flow of running a Workflow using the Databricks Connector. Optionally, to run this notebook on Databricks, you can import the notebook directly into Databricks by providing the URL to the notebook.

Last updated