Inputs and Outputs

Supported input and output formats

Gretel supports a number of input and output data formats which are outlined on this page. Gretel also provides a way for you to connect directly to your source and destination data sources using Gretel Connectors.

Input Formats

Gretel supports input datasets in the following formats:

CSV (Comma Separated Values)
- The first row of the CSV file will be treated as column names, and these are required for processing.
JSON (JavaScript Object Notation)
- The files may be formatted as a single JSON doc, or as JSONLines (where each line is a separate JSON doc).
- Processing JSONL files is much more efficient for larger datasets, therefore we recommend it over regular JSON.
- The JSON documents may be flat or contain nested objects. While there is no limit to the number of levels in a nested document, the more complex the structure, the longer the data will take to process.
Apache Parquet
- The following compression algorithms for column data are supported: snappy, gzip, brotli, zstd.

When using the console, we recommend uploading files no larger than 500MB. We don't impose any limits on training data size, but larger uploads could be hampered by connectivity issues or timeouts.

Output Formats

Results are automatically output in the same format as the input dataset.

JSON Outputs

The output will be written in the same format as the input, however whitespaces and order of fields from the input will not be preserved.

Field Names for JSON Data

In CSV files, field names correspond to the column name. JSON data doesn't have columns, but we still want to be able to reference fields for reporting purposes. Therefore, field names are created by referencing the dot-delimited path from the root of the document to each scalar value (the leaf). In the example below, the field that contains the value [email protected] will be referenced as: user.emails.address.

{
  "user": {
    "emails": [
      {"address": "[email protected]"}
    ]
  }
}

Note that in the example above, the array index is omitted. Thus the values inside the array will be aggregated together since typically all elements inside an array have the same schema. This method of field naming works well for JSON datasets that have a uniform schema across all the records. The naming convention could vary in the case of optional fields, etc.

Parquet Outputs

The output will use the same schema and Parquet version as the input file.

Field Names for Parquet Data

Field names that appear in reports when processing Parquet files correspond to column names in the Parquet schema. For columns that contain nested data, field names are constructed in the same way as for JSON data (see above).

PreviousProjects NextGretel Connectors

Last updated 2 months ago

Was this helpful?