Data Designer is currently in early preview. You can sign up for access here.
Data Designer is a general-purpose system for building datasets to improve your AI models. Developers describe the attributes of the dataset they want, then iterate on the generated data through fast previews and detailed evaluations.
With Data Designer, you get:
Speed: Generate preview datasets in minutes, production datasets in hours
Quality: Built-in evaluation metrics ensure accuracy and relevance
Here we will walk through a simple example that uses a YAML configuration to define your data generation workflow.
Note: This is a simple configuration that may not yield high-quality data. It is used for illustration purposes only.
model_suite: apache-2.0
special_system_instructions: >-
  You are an expert at generating consistent event log entries. Your job is to create realistic event data.
categorical_seed_columns:
  - name: event_type
    values: [Login, Logout, PageView]
  - name: user_type
    values: [Anonymous, Registered]
  - name: device_type
    values: [Mobile, Desktop, Tablet]
  - name: action_status
    values: [Success, Failure]
generated_data_columns:
  - name: timestamp
    generation_prompt: >-
      Generate a realistic timestamp for an {event_type} event within the last 24 hours.
      Format: YYYY-MM-DD HH:MM:SS
  - name: event_details
    generation_prompt: >-
      Create event details for a {event_type} by a {user_type} user on {device_type} with status {action_status}.
      Include basic information like device ID and session duration if applicable.
    columns_to_list_in_prompt: all_categorical_seed_columns
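The curly-brace placeholders in each generation_prompt refer to seed columns by name, so a typo in a placeholder is easy to miss. As a minimal sketch (not part of the Gretel SDK), the check below mirrors the relevant parts of the configuration as a plain Python dictionary and reports any placeholder that no seed column declares. The helper name `find_unknown_placeholders` is hypothetical, and the sketch assumes placeholders reference only categorical seed columns.

```python
from string import Formatter

# Plain-dict mirror of the seed columns and prompts from the configuration.
config = {
    "categorical_seed_columns": [
        {"name": "event_type", "values": ["Login", "Logout", "PageView"]},
        {"name": "user_type", "values": ["Anonymous", "Registered"]},
        {"name": "device_type", "values": ["Mobile", "Desktop", "Tablet"]},
        {"name": "action_status", "values": ["Success", "Failure"]},
    ],
    "generated_data_columns": [
        {
            "name": "timestamp",
            "generation_prompt": (
                "Generate a realistic timestamp for an {event_type} event "
                "within the last 24 hours. Format: YYYY-MM-DD HH:MM:SS"
            ),
        },
        {
            "name": "event_details",
            "generation_prompt": (
                "Create event details for a {event_type} by a {user_type} user "
                "on {device_type} with status {action_status}."
            ),
        },
    ],
}


def find_unknown_placeholders(cfg):
    """Map each generated column to the placeholders that match no seed column."""
    known = {col["name"] for col in cfg["categorical_seed_columns"]}
    problems = {}
    for col in cfg["generated_data_columns"]:
        # Formatter().parse yields (literal, field_name, spec, conversion) tuples;
        # field_name is None or empty for plain text segments.
        fields = {f for _, f, _, _ in Formatter().parse(col["generation_prompt"]) if f}
        unknown = sorted(fields - known)
        if unknown:
            problems[col["name"]] = unknown
    return problems


# An empty dict means every placeholder resolves to a declared seed column.
print(find_unknown_placeholders(config))
```

Running a check like this before requesting a preview catches misspelled column names locally, without spending a generation run on them.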
Model Suites: To learn more about model suites, check out this page!
Load your Config
Once you define the configuration in YAML, you can use the Gretel SDK to load the configuration and then generate data.
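A loading step might look like the sketch below. It assumes the configuration was saved as config.yaml (a hypothetical file name) and that the SDK exposes a `DataDesigner.from_config` class method that accepts a config path plus session keyword arguments. Treat that call as an assumption and confirm it against the SDK reference for your installed version. The wrapper `load_data_designer` is a hypothetical helper, not an SDK function.

```python
from pathlib import Path


def load_data_designer(config_path, session_kwargs):
    """Build a DataDesigner object from a YAML config file (hypothetical helper)."""
    path = Path(config_path)
    if not path.is_file():
        raise FileNotFoundError(f"Config file not found: {path}")

    # Imported lazily so the path check above works without the SDK installed.
    from gretel_client.navigator import DataDesigner

    # Assumption: from_config takes the config path and session keyword arguments.
    return DataDesigner.from_config(str(path), **session_kwargs)


# Example call (requires the Gretel SDK, a valid API key, and an existing config.yaml):
# data_designer = load_data_designer(
#     "config.yaml",
#     {"api_key": "<YOUR_API_KEY>", "endpoint": "https://api.gretel.cloud"},
# )
```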
If you prefer not to use YAML, you can define your Data Designer workflow directly with the Gretel SDK. Here is a simple example.
Step 1: Define your Model Suite and System Prompt
from gretel_client.navigator import DataDesigner

session_kwargs = {
    "api_key": "<YOUR_API_KEY>",
    "endpoint": "https://api.gretel.cloud",
}

model_suite = "apache-2.0"
# Same system prompt as in the YAML configuration, so both approaches
# describe the same event log workflow.
special_system_instructions = """
You are an expert at generating consistent event log entries. Your job is to
create realistic event data.
"""
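Step 2: Create the Data Designer Object and Add Seed Columns
# Assumption: the constructor accepts the Step 1 settings as keyword arguments.
data_designer = DataDesigner(
    session_kwargs=session_kwargs,
    model_suite=model_suite,
    special_system_instructions=special_system_instructions,
)

# Seed columns mirror the YAML config; the keyword names are likewise an assumption.
data_designer.add_categorical_seed_column(name="event_type", values=["Login", "Logout", "PageView"])
data_designer.add_categorical_seed_column(name="user_type", values=["Anonymous", "Registered"])
data_designer.add_categorical_seed_column(name="device_type", values=["Mobile", "Desktop", "Tablet"])
data_designer.add_categorical_seed_column(name="action_status", values=["Success", "Failure"])

Step 3: Add Generated Data Columns
# Adjacent string literals are concatenated, so the first line ends with a space.
data_designer.add_generated_data_column(
    name="timestamp",
    generation_prompt=(
        "Generate a realistic timestamp for an {event_type} event within the last 24 hours. "
        "Format: YYYY-MM-DD HH:MM:SS"
    ),
)
data_designer.add_generated_data_column(
    name="event_details",
    generation_prompt=(
        "Create event details for a {event_type} by a {user_type} user on {device_type} with status {action_status}. "
        "Include basic information like device ID and session duration if applicable."
    ),
)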
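To see what a generation prompt might look like once seed values are substituted, the following sketch samples one value per seed column and fills the template with `str.format`. It is an illustration only: how Data Designer samples seeds and renders prompts internally is not described here, and `sample_seed_row` and `render_prompt` are hypothetical helper names.

```python
import random

# Seed columns and prompt copied from the example workflow.
SEED_COLUMNS = {
    "event_type": ["Login", "Logout", "PageView"],
    "user_type": ["Anonymous", "Registered"],
    "device_type": ["Mobile", "Desktop", "Tablet"],
    "action_status": ["Success", "Failure"],
}

EVENT_DETAILS_PROMPT = (
    "Create event details for a {event_type} by a {user_type} user "
    "on {device_type} with status {action_status}. "
    "Include basic information like device ID and session duration if applicable."
)


def sample_seed_row(seed_columns, rng):
    """Pick one value per seed column, independently and uniformly at random."""
    return {name: rng.choice(values) for name, values in seed_columns.items()}


def render_prompt(template, seed_row):
    """Substitute the sampled seed values into the prompt template."""
    return template.format(**seed_row)


rng = random.Random(0)  # fixed seed so repeated runs print the same prompt
row = sample_seed_row(SEED_COLUMNS, rng)
print(render_prompt(EVENT_DETAILS_PROMPT, row))
```

With 3 x 2 x 3 x 2 = 36 possible seed combinations, rendering a few prompts this way is a quick check that every combination reads sensibly before generating data.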
Generate Data
Whether you followed the YAML approach or the SDK approach, you should have a DataDesigner Python object that you can use to generate your dataset.
Preview Data
You can generate a quick preview of your dataset, assess the data generated, and adjust your config if needed.
preview = data_designer.generate_dataset_preview()
[17:11:23] [INFO] Generating dataset preview
[17:11:24] [INFO] Step 1: Load data seeds
[17:11:24] [INFO] Step 2: Sample data seeds
[17:11:24] [INFO] Step 3: Generate column from template >> generating timestamp
[17:11:25] [INFO] Step 4: Generate column from template >> generating event details
[17:11:27] [INFO] Your dataset preview is ready for a peek!
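One simple way to assess a preview is to check that generated values follow the format the prompt asked for. The sketch below assumes the preview records have already been exported to a list of dictionaries. How to extract records from the preview object is not covered here, and the `records` list is hand-written sample data, not real output. It checks each timestamp against YYYY-MM-DD HH:MM:SS and against the 24-hour window. `check_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS


def check_timestamp(value, now, window=timedelta(hours=24)):
    """Return None if the timestamp is valid, else a short description of the problem."""
    try:
        parsed = datetime.strptime(value.strip(), TIMESTAMP_FORMAT)
    except ValueError:
        return "wrong format"
    if not (now - window <= parsed <= now):
        return "outside the time window"
    return None


# Fixed reference time so the example is reproducible; use datetime.now() in practice.
now = datetime(2025, 1, 15, 17, 0, 0)

# Hand-written sample records standing in for exported preview rows.
records = [
    {"event_type": "Login", "timestamp": "2025-01-15 09:30:00"},
    {"event_type": "Logout", "timestamp": "15/01/2025 10:00"},
    {"event_type": "PageView", "timestamp": "2025-01-10 08:00:00"},
]

for record in records:
    problem = check_timestamp(record["timestamp"], now)
    print(record["timestamp"], "->", problem or "ok")
```

If many rows fail a check like this, tightening the wording of the corresponding generation prompt is a natural next adjustment to the config.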