Gretel supports datasets in the following formats:
CSV (Comma Separated Values)
CSV data input is supported for Synthetics, Transforms and Classify workflows.
The first row of the CSV file will be treated as column names, and these are required for processing.
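As a minimal sketch of a local pre-upload check, the snippet below reads the first row of a CSV file and confirms that every column has a non-empty name. The helper `read_column_names` is a hypothetical name used for illustration; it is not part of any Gretel client, and it does not reproduce Gretel's own validation.

```python
import csv
import io


def read_column_names(csv_file):
    """Return the column names from the first row of a CSV file object.

    Hypothetical helper: raises ValueError when the header row is missing
    or when any column name is blank.
    """
    reader = csv.reader(csv_file)
    header = next(reader, None)  # None when the file is empty
    if not header or any(not name.strip() for name in header):
        raise ValueError("first row must contain a non-empty name for every column")
    return [name.strip() for name in header]


# Illustrative in-memory data; a real file would be opened with newline="".
sample = io.StringIO("name,email\nJane,jane@example.com\n")
columns = read_column_names(sample)
print(columns)
```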
JSON (JavaScript Object Notation)
JSON files may be formatted as a single JSON document, or as JSON Lines (JSONL), where each line is a separate JSON document.
Processing JSONL files is much more efficient for larger datasets, so we recommend JSONL over regular JSON.
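If a dataset is stored as a single JSON document, it can be rewritten as JSONL before upload. The sketch below assumes the top-level value is an array of records (a lone object is treated as one record); `json_array_to_jsonl` is a hypothetical helper name, not a Gretel function.

```python
import io
import json


def json_array_to_jsonl(src, dst):
    """Write each record of a JSON document to dst as one JSON object per line.

    Assumes the top-level value in src is an array of records; a single
    object is written as one line. Returns the number of records written.
    """
    records = json.load(src)
    if not isinstance(records, list):
        records = [records]
    for record in records:
        # json.dumps emits no raw newlines, so each record stays on one line.
        dst.write(json.dumps(record) + "\n")
    return len(records)


# Illustrative in-memory data; real usage would pass open file objects.
src = io.StringIO('[{"id": 1, "user": {"name": "a"}}, {"id": 2, "user": {"name": "b"}}]')
dst = io.StringIO()
count = json_array_to_jsonl(src, dst)
lines = dst.getvalue().splitlines()
```

Note that `json.load` reads the whole document into memory, so this conversion is only practical for files that fit in memory.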
The JSON documents may be flat or contain nested objects. While there is no limit to the number of levels in a nested document, the more complex the structure, the longer the data will take to process.
JSON datasets are currently supported for Classify and Transform only. Support for Synthetics is coming soon.
When using the console, we recommend uploading files no larger than 500MB. We don't impose any limits on training data size, but larger uploads could be hampered by connectivity issues or timeouts.
Results are automatically output in the same format as the input dataset.
For JSON datasets in Classify, there will be an additional field for each detected entity: json_path. This field contains the JSONPath location of that detected entity within the JSON document. See below for a sample classify result on a JSON dataset.
For Transforms, the output is written in the same format as the input; however, whitespace and field order from the input are not preserved.
Field Names for JSON Data
In CSV files, field names correspond to the column name. JSON data doesn't have columns, but we still want to be able to reference fields for reporting purposes. Therefore, field names are created by referencing the dot-delimited path from the root of the document to each scalar value (the leaf). In the example below, the field that contains the value [email protected] will be referenced as: user.emails.address.
Note that in the example above, the array index is omitted. As a result, the values inside the array are aggregated under a single field name, since all elements of an array typically share the same schema. This method of field naming works well for JSON datasets that have a uniform schema across all records. When records contain optional fields or otherwise differ in structure, the set of field names can vary from record to record.
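The naming convention described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Gretel's implementation: the function `field_names` and the sample record (including its placeholder addresses) are hypothetical.

```python
import json
from collections import defaultdict


def field_names(node, prefix=""):
    """Yield (field_name, scalar_value) pairs for a parsed JSON value.

    Field names are dot-delimited paths from the root to each leaf.
    Array indices are omitted, so all elements of an array share one name.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from field_names(value, path)
    elif isinstance(node, list):
        for item in node:
            # No index is appended to the path for array elements.
            yield from field_names(item, prefix)
    else:
        yield prefix, node


# Hypothetical record used only for illustration.
record = json.loads("""
{"user": {"name": "Jane",
          "emails": [{"type": "work", "address": "jane@example.com"},
                     {"type": "home", "address": "jane@example.org"}]}}
""")

# Aggregate the values that fall under the same field name.
fields = defaultdict(list)
for name, value in field_names(record):
    fields[name].append(value)

for name, values in fields.items():
    print(name, values)
```

In this sketch both addresses are collected under `user.emails.address`, because the position of each element within the `emails` array is not part of the field name.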
If you would like us to import a different format, let us know.