Generating Data

Data Generation in Data Designer

Bringing Your Data Design to Life

Once you've set up your Data Designer with appropriate seeds and column definitions, you're ready for the exciting part: generating data! This guide explains how to preview your design, create full datasets, and access your generated data.

The Data Generation Process

Data Designer follows this straightforward workflow when generating data:

  1. Design Phase: Define your data schema by adding columns and establishing their relationships

  2. Preview Phase: Generate a small sample for validation

  3. Iteration Phase: Refine your design based on preview results

  4. Batch Generation: Scale up to create large datasets

You must create at least one non-LLM-generated column before you can create an LLM-generated column. This enforces a synthetic data generation best practice: seeding your LLM generation with non-LLM columns helps ensure diversity and high-quality data. A minimal example of this pattern is sketched below.
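The sketch below assumes the category sampler described on the Column Types page; the column names and values are illustrative, not part of the API.

# A non-LLM sampler column provides seed values (assumed category sampler syntax)
aidd.add_column(
    name="product_category",
    type="category",
    params={"values": ["Electronics", "Home & Kitchen", "Toys"]}
)

# An LLM-generated column can then reference the seed column in its prompt
aidd.add_column(
    name="product_name",
    prompt="Generate a realistic product name for the {{product_category}} category."
)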

Understanding the Generation Workflow

1. Design Phase

During this first phase, you define what data you want to generate by adding columns, setting up relationships, and establishing constraints.

Key activities:

  • Adding columns of various types (sampling-based, LLM-based)

  • Setting up person samplers

  • Defining constraints between columns

  • Creating templates that reference other columns

Data Designer automatically analyzes your column definitions to determine the correct generation order based on how columns reference each other.
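For instance, two of the activities above, setting up a person sampler and creating a template column that references it, look roughly like the sketch below. The sampler name, locale, and prompt are illustrative assumptions; see Generate Realistic Personal Details for the exact sampler parameters.

# Rough sketch: register a person sampler (parameters are illustrative)
aidd.with_person_samplers({"customer": {"locale": "en_US"}})

# A template column can reference the sampled person's attributes
aidd.add_column(
    name="support_ticket",
    prompt="Write a short support ticket opened by {{customer.first_name}} {{customer.last_name}}."
)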

2. Preview Phase

The preview phase generates a small dataset (typically 10 records) to help you validate your design:

# Validate that you have the right config
aidd.validate()

# Generate a preview
preview = aidd.preview()

This quick process lets you see your design in action without waiting for a full dataset to be generated.

Inspecting Preview Results

Data Designer provides several ways to examine your preview results:

# Method 1: Display a sample record with formatted output
preview.display_sample_record()

# Method 2: Access the preview dataset as a pandas DataFrame
preview_df = preview.dataset.df

These inspection methods help you assess whether your design is producing the expected data. You'll often go through multiple design-preview-iterate cycles before you're ready to generate a full dataset.
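Because the preview is a regular pandas DataFrame, standard pandas inspection also works for a quick look across all records (the column name below is illustrative):

# Inspect the preview with ordinary pandas operations
print(preview_df.head())                        # first few generated records
print(preview_df.columns.tolist())              # confirm all expected columns are present
print(preview_df["product_name"].nunique())     # rough check on value diversity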

3. Iteration Phase

Based on preview results, you can refine your design by modifying columns, adjusting parameters, or changing templates:

# Modify a column definition
aidd.delete_column("product_description")
aidd.add_column(
    name="product_description",
    prompt="Write a more detailed description for {{product_name}}."
)

# Preview again
preview = aidd.preview()

This iterative cycle helps you optimize your design before generating a full dataset.

4. Batch Generation

Once your design meets your requirements, you can scale up to create a full dataset:

# Generate the full dataset
workflow_run = aidd.create(
    num_records=1000,
    name="my_dataset"
)

Parameters for Batch Generation

  • num_records: The number of records to generate

  • name: A descriptive name for your workflow run (helps with identification later)

  • wait_for_completion:

    • True: The function will block until the job completes

    • False: The function will return immediately, and you can check status later
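For example, a non-blocking run using the parameters above (a sketch; the record count and name are illustrative, and parameter names may vary by SDK version):

# Kick off a large generation job without blocking the current process
workflow_run = aidd.create(
    num_records=50000,
    name="product_catalog_v2",
    wait_for_completion=False   # return immediately; check status later
)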

Checking Job Status

If you didn't wait for completion, you can check the status later:

# Check if the job is completed
if workflow_run.is_completed:
    print("Job completed successfully!")
else:
    print(f"Job status: {workflow_run.status}")
    # Wait for completion if desired
    workflow_run.wait_until_done()

After successful generation, you can access your data as follows:

# Access the generated dataset as a pandas DataFrame
generated_df = workflow_run.dataset.df
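From here the result is an ordinary pandas DataFrame, so it can be persisted or post-processed as usual (the file name below is illustrative):

# Save the generated dataset to disk for downstream use
generated_df.to_csv("my_dataset.csv", index=False)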

If you didn't wait for completion or need to reconnect to a previous job:

from gretel_client.navigator_client import Gretel

# Initialize Gretel client
gretel = Gretel(api_key="YOUR_API_KEY")

# Get the workflow run by name
workflow_run = gretel.get_workflow_run("your-project-id", "product_catalog")

# Wait for completion if it's still running
if not workflow_run.is_completed:
    workflow_run.wait_until_done()

# Access the dataset
generated_df = workflow_run.dataset.df

Saving your Data Designer Object

You can save your Data Designer object as a configuration by running the following code:

config = aidd.config.to_dict()

You can create a new Data Designer object from an existing config as follows:

aidd = gretel.data_designer.from_config(config=config)
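Since the config is a plain dictionary, it can also be persisted and reloaded, for example with JSON (a sketch, assuming the dictionary is JSON-serializable; the file name is illustrative):

import json

# Persist the Data Designer configuration to disk
with open("aidd_config.json", "w") as f:
    json.dump(config, f, indent=2)

# Later: reload it before passing it back to from_config
with open("aidd_config.json") as f:
    config = json.load(f)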

Best Practices for Data Generation

  1. Always Preview First: Validate your design with a preview before generating a full dataset.

  2. Start Small: Begin with a small number of records to test your design before scaling up.

  3. Name Jobs Clearly: Use descriptive workflow run names to help identify your jobs later.

  4. Monitor Performance: For large datasets, monitor the job status and resources.

  5. Process in Batches: For very large datasets, consider generating and processing in smaller batches.

Note: If you're looking for a more automated approach to creating data designs with less configuration, check out the Magic SDK documentation.
