Generating Data

Data Generation in Data Designer

Bringing Your Data Design to Life

Once you've set up your Data Designer with appropriate seeds and column definitions, you're ready for the exciting part: generating data! This guide explains how to preview your design, create full datasets, and access your generated data.

The Data Generation Process

Data Designer follows this straightforward workflow when generating data:

  1. Design Phase: Define your data schema by adding columns and establishing their relationships

  2. Preview Phase: Generate a small sample for validation

  3. Iteration Phase: Refine your design based on preview results

  4. Batch Generation: Scale up to create large datasets

You must create at least one non-LLM-generated column before you can create an LLM-generated column. This follows synthetic data generation best practice: seeding LLM generation with sampled values helps ensure diverse, high-quality data.

Understanding the Generation Workflow

1. Design Phase

During this first phase, you define what data you want to generate by adding columns, setting up relationships, and establishing constraints.

Key activities:

  • Adding columns of various types (sampling-based, LLM-based)

  • Setting up person samplers

  • Defining constraints between columns

  • Creating templates that reference other columns

Data Designer automatically analyzes your column definitions to determine the correct generation order based on how columns reference each other.
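To build intuition for how a generation order can be derived from column references, here is a minimal sketch. It is not Data Designer's actual implementation: it assumes references are written as {{column_name}} in prompt templates, and the `columns` dictionary and `generation_order` helper are hypothetical names made up for illustration.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical column definitions: name -> prompt template
# (None stands for a sampled column that references nothing).
columns = {
    "product_description": "Write a description for {{product_name}} in {{category}}.",
    "product_name": "Invent a product name for the {{category}} category.",
    "category": None,
}


def generation_order(columns):
    """Order columns so that each one comes after the columns it references."""
    graph = {}
    for name, template in columns.items():
        # Collect every {{reference}} that appears in the template.
        refs = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template or ""))
        # Keep only references that point at known columns.
        graph[name] = refs.intersection(columns)
    # static_order() yields dependencies before the columns that need them
    # and raises graphlib.CycleError if columns reference each other in a loop.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(columns)
print(order)
```

In this sketch the sampled column is generated first, then the name that depends on it, then the description that depends on both.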

2. Preview Phase

The preview phase generates a small dataset (typically 10 records) to help you validate your design:

# Validate that you have the right config
aidd.validate()

# Generate a preview
preview = aidd.preview()

This quick process lets you see your design in action without waiting for a full dataset generation.

Inspecting Preview Results

Data Designer provides several ways to examine your preview results:

# Method 1: Display a sample record with formatted output
preview.display_sample_record()

# Method 2: Access the preview dataset as a pandas DataFrame
preview_df = preview.dataset.df

These inspection methods help you assess whether your design is producing the expected data. You'll often go through multiple design-preview-iterate cycles before you're ready to generate a full dataset.
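Beyond reading individual records, a quick per-column summary can reveal empty or repetitive output early. The following is a rough sketch, not part of the Data Designer API: `summarize_records` and `is_empty` are hypothetical helpers, and `sample_records` is made-up data standing in for what `preview_df.to_dict("records")` would return.

```python
def is_empty(value):
    """Treat None, NaN, and blank strings as empty."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return isinstance(value, str) and value.strip() == ""


def summarize_records(records):
    """Count empty and distinct values for every column in a list of dicts."""
    column_names = sorted({key for record in records for key in record})
    summary = {}
    for column in column_names:
        values = [record.get(column) for record in records]
        present = [v for v in values if not is_empty(v)]
        summary[column] = {
            "empty": len(values) - len(present),
            "distinct": len({str(v) for v in present}),
        }
    return summary


# Made-up stand-in for preview_df.to_dict("records").
sample_records = [
    {"product_name": "Trail Lamp", "product_description": "A compact lamp."},
    {"product_name": "Trail Lamp", "product_description": ""},
    {"product_name": "River Mug", "product_description": None},
]

summary = summarize_records(sample_records)
for column, stats in summary.items():
    print(column, stats)
```

A high empty count or a low distinct count for an LLM-generated column is a hint that its prompt or its seed columns need another iteration.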

3. Iteration Phase

Based on preview results, you can refine your design by modifying columns, adjusting parameters, or changing templates:

# Modify a column definition
aidd.delete_column("product_description")
aidd.add_column(
    name="product_description",
    prompt="Write a more detailed description for {{product_name}}."
)

# Preview again
preview = aidd.preview()

This iterative cycle helps you optimize your design before generating a full dataset.

4. Batch Generation

Once your design meets your requirements, you can scale up to create a full dataset:

# Generate the full dataset
workflow_run = aidd.create(
    num_records=1000,
    name="my_dataset"
)

Parameters for Batch Generation

  • num_records: The number of records to generate

  • workflow_run_name: A descriptive name for your job (helps with identification later)

  • wait_for_completion:

    • True: The function will block until the job completes

    • False: The function will return immediately, and you can check status later

Checking Job Status

If you didn't wait for completion, you can check the status later:

# Check if the job is completed
if workflow_run.is_completed:
    print("Job completed successfully!")
else:
    print(f"Job status: {workflow_run.status}")
    # Wait for completion if desired
    workflow_run.wait_until_done()

After successful generation, you can access your data as follows:

# Access the generated dataset as a pandas DataFrame
generated_df = workflow_run.dataset.df

If you didn't wait for completion, or you need to reconnect to a previous job, retrieve the workflow run through the Gretel client:

from gretel_client.navigator_client import Gretel

# Initialize Gretel client
gretel = Gretel(api_key="YOUR_API_KEY")

# Get the workflow run by name
workflow_run = gretel.get_workflow_run("your-project-id", "product_catalog")

# Wait for completion if it's still running
if not workflow_run.is_completed:
    workflow_run.wait_until_done()

# Access the dataset
generated_df = workflow_run.dataset.df

Saving your Data Designer Object

You can save your Data Designer object as a configuration by running the following code:

config = aidd.config.to_dict()

You can create a new Data Designer object from an existing config as follows:

# Keep a reference to the restored Data Designer object
aidd = gretel.data_designer.from_config(config=config)
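To reuse a configuration across sessions, the dictionary can be written to disk and read back later. This is a minimal sketch that assumes the dictionary returned by aidd.config.to_dict() contains only JSON-serializable values, which this guide does not confirm. The `save_config` and `load_config` helpers are hypothetical, and the `config` value below is a made-up stand-in.

```python
import json
import tempfile
from pathlib import Path


def save_config(config, path):
    """Write a configuration dictionary to a JSON file."""
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_config(path):
    """Read a configuration dictionary back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up stand-in for the result of aidd.config.to_dict().
config = {"columns": [{"name": "category", "type": "category"}]}

config_path = Path(tempfile.gettempdir()) / "data_designer_config.json"
save_config(config, config_path)
restored = load_config(config_path)
print(restored == config)
```

The restored dictionary can then be passed to from_config in the same way as the in-memory one.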

Best Practices for Data Generation

  1. Always Preview First: Validate your design with a preview before generating a full dataset.

  2. Start Small: Begin with a small number of records to test your design before scaling up.

  3. Name Jobs Clearly: Use descriptive workflow run names to help identify your jobs later.

  4. Monitor Performance: For large datasets, monitor the job status and resources.

  5. Process in Batches: For very large datasets, consider generating and processing in smaller batches.
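For the batching advice above, one simple approach is to split the total record count into fixed-size chunks and submit one job per chunk. This is a sketch under stated assumptions: `plan_batches` is a hypothetical helper, and the batch size of 1000 is an arbitrary example, not a documented limit.

```python
def plan_batches(total_records, batch_size):
    """Split a total record count into a list of per-batch record counts."""
    if total_records < 0 or batch_size <= 0:
        raise ValueError("total_records must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(total_records, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)  # final, smaller batch
    return sizes


batches = plan_batches(2500, 1000)
print(batches)
```

Each entry could then be used as the num_records value of a separate batch generation job, with the resulting DataFrames combined afterwards.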

Note: If you're looking for a more automated approach to creating data designs with less configuration, check out the Magic SDK documentation.
