Inputs and Outputs

Supported input and output formats


Gretel supports a number of input and output data formats, which are outlined on this page. Gretel also provides a way for you to connect directly to your data sources and destinations using Gretel Connectors.

Input Formats

Gretel supports input datasets in the following formats:

  1. CSV (Comma Separated Values)

    • The first row of the CSV file will be treated as column names, and these are required for processing.

  2. JSON (JavaScript Object Notation)

    • The files may be formatted as a single JSON doc, or as JSONLines (JSONL), where each line is a separate JSON doc.

    • Processing JSONL files is much more efficient for larger datasets, so we recommend JSONL over regular JSON.

    • The JSON documents may be flat or contain nested objects. While there is no limit to the number of levels in a nested document, the more complex the structure, the longer the data will take to process.

  3. Parquet (Apache Parquet)

    • The following compression algorithms for column data are supported: snappy, gzip, brotli, zstd.

When using the console, we recommend uploading files no larger than 500MB. We don't impose any limits on training data size, but larger uploads could be hampered by connectivity issues or timeouts.
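Because JSONL stores one document per line, a large dataset can be streamed record by record instead of being parsed as one giant array. The sketch below, using only Python's standard library, shows one way to convert records into JSONL before uploading; the function name and the sample records are illustrative, not part of the Gretel SDK.

```python
import io
import json

def to_jsonl(records, fp):
    """Write an iterable of JSON documents as JSONLines: one doc per line."""
    count = 0
    for record in records:
        fp.write(json.dumps(record) + "\n")
        count += 1
    return count

# Hypothetical example: two records that would otherwise sit in one JSON array.
buf = io.StringIO()
count = to_jsonl([{"user": "a"}, {"user": "b"}], buf)
jsonl_text = buf.getvalue()
```

Each output line is a complete, independently parseable JSON document, which is what makes line-by-line processing possible.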

Output Formats

Results are automatically output in the same format as the input dataset.

JSON Outputs

The output will be written in the same format as the input; however, whitespace and the order of fields from the input will not be preserved.

Field Names for JSON Data

In CSV files, field names correspond to column names. JSON data doesn't have columns, but we still want to be able to reference fields for reporting purposes. Therefore, field names are created by referencing the dot-delimited path from the root of the document to each scalar value (the leaf). In the example below, the field that contains the value test@example.com will be referenced as: user.emails.address.

{
  "user": {
    "emails": [
      {"address": "test@example.com"}
    ]
  }
}

Note that in the example above, the array index is omitted. Thus the values inside the array will be aggregated together since typically all elements inside an array have the same schema. This method of field naming works well for JSON datasets that have a uniform schema across all the records. The naming convention could vary in the case of optional fields, etc.
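The naming convention above can be sketched as a short recursive walk over a document: descend through dict keys, pass through lists without recording an index, and record a path at each scalar leaf. This is an illustrative sketch of the convention as described, not Gretel's actual implementation.

```python
def field_names(doc, prefix=""):
    """Collect dot-delimited paths to scalar values, omitting array indices."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            paths |= field_names(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(doc, list):
        # Array indices are omitted, so all elements aggregate under one path.
        for item in doc:
            paths |= field_names(item, prefix)
    else:
        paths.add(prefix)
    return paths

record = {"user": {"emails": [{"address": "test@example.com"}]}}
names = field_names(record)
```

Running this on the example document yields the single field name user.emails.address, and adding more entries to the emails array would not change the set of names, since every element shares the same path.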

Parquet Outputs

The output will use the same schema and Parquet version as the input file.

Field Names for Parquet Data

Field names that appear in reports when processing Parquet files correspond to column names in the Parquet schema. For columns that contain nested data, field names are constructed in the same way as for JSON data (see above).
