Object Storage

Connect Gretel to object storage services.

Gretel Workflows support connecting to the following object storage services

Reading Objects

Object storage source actions will incrementally crawl buckets searching for files that have changed between runs. Crawled files can then be configured as inputs to Gretel Models.
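The idea of an incremental crawl can be illustrated with a minimal sketch. It assumes each listed object carries some fingerprint, such as an ETag or last-modified timestamp, that changes when the file changes. The function name `changed_keys` and the sample listings are hypothetical and are not part of Gretel's API; how Gretel tracks changes internally is not described here.

```python
from typing import Dict, List


def changed_keys(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return keys that are new, or whose fingerprint differs from the last run."""
    return sorted(key for key, fp in current.items() if previous.get(key) != fp)


# Hypothetical listings: object key -> fingerprint (assumed to change with the file).
first_run = {"data/a.csv": "v1", "data/b.csv": "v1"}
second_run = {"data/a.csv": "v1", "data/b.csv": "v2", "data/c.csv": "v1"}

# On the first crawl nothing has been seen yet, so every object is returned.
print(changed_keys({}, first_run))
# On the next crawl only the modified and the newly added object are returned.
print(changed_keys(first_run, second_run))
```

In this sketch an unchanged object such as `data/a.csv` is skipped on the second run, which is the behavior the paragraph above describes.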

Each crawled object is passed as an input to the configured Gretel Model. Some models need a minimum number of records to train, while others may not work on datasets that are too large.

For best results, ensure each object contains an appropriate number of records to successfully train and run the downstream model.

Glob Filter and Path Configurations

A glob filter can be configured to ensure files matching a specific pattern are used as sources. Files not matching the pattern will be excluded from the crawl.

  • A glob filter is evaluated against the filename or key of the object.

  • The character * matches any number of characters, excluding slashes.

  • Passing ** recursively matches any number of nested directories.

  • Checks are case-sensitive.

Examples

Filter           File                 Match

*.txt            data.txt             Yes, any txt file in the current path is matched.

*.png            data.json            No, json files do not contain a png ending.

my/path/*.txt    my/path/data.txt     Yes, any txt file under my/path is matched.

**/*.csv         my/path/data.csv     Yes, any csv file is recursively matched.

**               data.csv             Yes, all files are recursively matched.

*/**             data.csv             No, any files in the root directory are excluded.
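The matching rules and examples above can be approximated with a short sketch that translates a glob filter into a regular expression. This is not Gretel's implementation; the helper names `glob_to_regex` and `matches` are hypothetical. One assumption goes beyond the examples: a leading `**/` is treated as matching zero or more directories, which the examples neither confirm nor rule out.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob filter into a compiled, case-sensitive regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            # Assumption: "**/" spans zero or more nested directories.
            parts.append("(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            # "**" matches anything, including slashes.
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            # "*" matches any run of characters except slashes.
            parts.append("[^/]*")
            i += 1
        else:
            # Every other character must match literally.
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, key: str) -> bool:
    """Return True if the whole object key matches the glob filter."""
    return glob_to_regex(pattern).fullmatch(key) is not None


examples = [
    ("*.txt", "data.txt"),
    ("*.png", "data.json"),
    ("my/path/*.txt", "my/path/data.txt"),
    ("**/*.csv", "my/path/data.csv"),
    ("**", "data.csv"),
    ("*/**", "data.csv"),
]
for pattern, key in examples:
    print(f"{pattern!r} vs {key!r}: {matches(pattern, key)}")
```

Because the regex is compiled without the ignore-case flag, `*.txt` does not match `DATA.TXT` in this sketch, in line with the case-sensitivity rule.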

In addition to a glob filter, a source action can be configured to crawl in a specific path. Configuring a path will narrow the set of objects that the bucket crawler will list or search.

It's recommended to configure a narrow bucket path when possible. This reduces the number of objects the crawler must list and speeds up each crawl.
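To illustrate why a narrow path helps, the following sketch filters a hypothetical bucket listing by path prefix. The function `list_under_path` and the sample keys are assumptions made for illustration. Object stores such as Amazon S3 accept a prefix on list requests, so in practice the narrowing can happen on the service side rather than in client code as shown here.

```python
from typing import Iterable, List


def list_under_path(keys: Iterable[str], path: str) -> List[str]:
    """Return only the keys that live under the given path prefix."""
    prefix = path.strip("/")
    if not prefix:
        # An empty path means the whole bucket is crawled.
        return list(keys)
    prefix += "/"
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical bucket contents.
bucket = [
    "logs/2023/app.log",
    "exports/daily/users.csv",
    "exports/daily/orders.csv",
    "exports/weekly/summary.csv",
]

narrow = list_under_path(bucket, "exports/daily")
print(f"{len(narrow)} of {len(bucket)} objects listed: {narrow}")
```

Here the crawl only has to consider the two objects under `exports/daily` instead of the entire bucket.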

Writing Objects

Object storage destination actions can be configured to write the synthetic data outputs of a Gretel Model back to object storage.

Each object storage destination action can be configured to mirror the directory structure of the source bucket or can be configured to create new directory layouts.
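The difference between mirroring the source layout and creating a new one can be sketched as a mapping from source keys to destination keys. The function names, the `synthetic` destination root, and the flat layout are all hypothetical examples, not Gretel configuration options.

```python
import posixpath


def mirrored_key(source_key: str, dest_root: str) -> str:
    """Keep the source directory structure under the destination root."""
    return posixpath.join(dest_root, source_key)


def flat_key(source_key: str, dest_root: str) -> str:
    """One possible new layout: drop the source directories, keep the filename."""
    return posixpath.join(dest_root, posixpath.basename(source_key))


source_key = "my/path/data.csv"
print(mirrored_key(source_key, "synthetic"))  # synthetic/my/path/data.csv
print(flat_key(source_key, "synthetic"))      # synthetic/data.csv
```

Mirroring keeps it easy to trace each synthetic output back to its source object, while a new layout can group outputs however downstream consumers expect.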

Limitations

For a list of supported file types, please refer to Inputs and Outputs.
