Gretel supports datasets in the following formats:
CSV (Comma Separated Values)
CSV data input is supported for Synthetics, Transform and Classify workflows.
The first row of the CSV file will be treated as column names, and these are required for processing.
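A quick local check of the header row before uploading might look like the sketch below. It uses only Python's standard csv module; the function name read_column_names and the sample data are hypothetical, and Gretel's own validation may differ.

```python
import csv
import io


def read_column_names(csv_text: str) -> list[str]:
    """Return the first row of a CSV document, interpreted as column names."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if not header:
        raise ValueError("CSV is empty; a header row with column names is required")
    names = [name.strip() for name in header]
    if any(not name for name in names):
        raise ValueError("CSV header contains an empty column name")
    return names


# Hypothetical sample data, for illustration only.
sample = "id,name,email\n1,Ada,ada@example.com\n"
print(read_column_names(sample))
```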
JSON
JSON files may be formatted as a single JSON document or as JSON Lines (JSONL), where each line is a separate JSON document.
Processing JSONL files is much more efficient for larger datasets, so we recommend JSONL over regular JSON.
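If a dataset is stored as a single JSON document, it can be converted to JSONL before upload. The sketch below assumes the document is a top-level array of records, which is an assumption about your data rather than a Gretel requirement; the function name json_array_to_jsonl is hypothetical.

```python
import json


def json_array_to_jsonl(json_text: str) -> str:
    """Convert a JSON array of records into JSONL, one record per line."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        # Assumption: the single JSON document is an array of records.
        raise ValueError("expected a top-level JSON array of records")
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical sample data, for illustration only.
print(json_array_to_jsonl('[{"id": 1}, {"id": 2}]'), end="")
```

For very large files this loads the whole document into memory; a streaming parser would be a better fit in that case.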
The JSON documents may be flat or contain nested objects. While there is no limit to the number of levels in a nested document, the more complex the structure, the longer the data will take to process.
JSON datasets are currently supported for Classify and Transform only. Support for Synthetics is coming soon.
Parquet
For Parquet files, the following compression algorithms for column data are supported: snappy, gzip, brotli, zstd.
Parquet datasets are currently supported for Classify and Transform only. Support for Synthetics is coming soon.
Uploading Parquet datasets as project artifacts is currently only supported in the Gretel CLI and SDK. The ability to upload these in the Gretel Console is coming soon.
When using the Gretel Console, we recommend uploading files no larger than 500MB. We don't impose any limits on training data size, but larger uploads could be hampered by connectivity issues or timeouts.
Results are automatically output in the same format as the input dataset.
For JSON datasets in Classify, there will be an additional field for each detected entity: json_path. This field contains the JSONPath location of that detected entity within the JSON document. See below for a sample classify result on a JSON dataset.
For Transform, the output will be written in the same format as the input; however, whitespace and the order of fields from the input will not be preserved.
Field Names for JSON Data
In CSV files, field names correspond to column names. JSON data doesn't have columns, but we still want to be able to reference fields for reporting purposes. Therefore, field names are created by referencing the dot-delimited path from the root of the document to each scalar value (the leaf). In the example below, the field that contains the email address value will be referenced as: user.emails.address.
Note that in the example above, the array index is omitted. As a result, the values inside the array are aggregated under a single field name, since all elements inside an array typically have the same schema. This method of field naming works well for JSON datasets that have a uniform schema across all records. When the schema is not uniform, for example when some fields are optional, the set of field names can vary from record to record.
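The naming scheme described above can be sketched in a few lines of Python. This is an illustration only, not Gretel's implementation: the function collect_fields and the sample record (including its email addresses) are hypothetical.

```python
import json
from collections import defaultdict


def collect_fields(node, path="", out=None):
    """Group scalar (leaf) values by dot-delimited field name, skipping array indices."""
    if out is None:
        out = defaultdict(list)
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            collect_fields(value, child, out)
    elif isinstance(node, list):
        # Array indices are omitted, so every element shares the parent's path.
        for item in node:
            collect_fields(item, path, out)
    else:
        out[path].append(node)
    return out


# Hypothetical record, for illustration only.
record = json.loads("""
{"user": {"name": "Ada",
          "emails": [{"address": "ada@example.com"},
                     {"address": "ada@work.example"}]}}
""")

for field, values in collect_fields(record).items():
    print(field, values)
```

Both addresses end up under the single field name user.emails.address, which is the aggregation behavior described for arrays.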
JSON support is currently available only for Classify and Transform workflows. Synthetics support is coming soon.
For Classify, the result structure for Parquet datasets will be the same as that of JSON datasets. Since Parquet data can be nested in a similar way to JSON data, each detected entity will contain a json_path field.
For Transform, the output will use the same schema and Parquet version as the input file.
Field Names for Parquet Data
Field names that appear in Classify and Transform reports when processing Parquet files correspond to column names in the Parquet schema. For columns that contain nested data, field names are constructed in the same way as for JSON data (see above).
If you would like us to import a different format, let us know.