Seeding your Dataset

Diversity is at the core of generating a large-scale synthetic dataset successfully. Data Designer introduces the concept of a "Data Seed": a key-value pair used to inject diversity into the dataset. Data Designer uses the seed values to guide the data generation process and maximize diversity in the dataset.
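
To make the idea concrete, the sketch below shows one way key-value seeds could diversify records: each record draws one value per seed column, and the drawn values are placed into the generation prompt. It is a plain-Python illustration, not Data Designer's implementation; `sample_seeds`, `build_prompt`, and the `device` column are hypothetical names introduced only for this example.

```python
import random

# Hypothetical seed columns: each key is a seed name and each value is the
# list of allowed seed values (the same name/values shape used in this guide).
SEED_COLUMNS = {
    "event_type": ["Login", "Logout", "PageView"],
    "device": ["Desktop", "Mobile"],  # assumed extra column, for illustration only
}


def sample_seeds(seed_columns, rng):
    """Draw one value per seed column for a single record."""
    return {name: rng.choice(values) for name, values in seed_columns.items()}


def build_prompt(seeds):
    """Turn the sampled seeds into a prompt that conditions generation."""
    conditions = ", ".join(f"{name}={value}" for name, value in seeds.items())
    return f"Generate one log record where {conditions}."


rng = random.Random(0)  # fixed RNG seed so the sketch is reproducible
records = [sample_seeds(SEED_COLUMNS, rng) for _ in range(5)]
prompts = [build_prompt(record) for record in records]

for prompt in prompts:
    print(prompt)
```

Because every record is conditioned on a different combination of seed values, the prompts (and therefore the generated records) vary along the dimensions you chose as seeds.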

Defining your Seeds

There are three ways to define your seeds:

  • Specify them in your config: As shown below, you can provide the seed values you are interested in directly in your config or Python script.

# YAML Config

categorical_seed_columns:
  - name: event_type
    values: [Login, Logout, PageView]

# Python Script

data_designer.add_categorical_seed_column(
    name="event_type",
    values=["Login", "Logout", "PageView"]
)

  • Let Data Designer create seed values: Sometimes you may want Data Designer to generate the values for a specific seed. This is useful when you have "Nested Seeds". For example, in a Text-to-SQL dataset you might use code complexity as the seed and have an LLM generate a description for each complexity level.

# YAML Config

categorical_seed_columns:
  - name: sql_complexity
    values: [Beginner, Intermediate, Advanced]
    subcategories:
      - name: sql_complexity_description
        description: The complexity level of the given SQL complexity and SQL concept.
        num_new_values_to_generate: 1

# Python Script

data_designer.add_categorical_seed_column(
    name="sql_complexity",
    values=["Beginner", "Intermediate", "Advanced"],
    subcategories=[
        {
            "name": "sql_complexity_description",
            "description": "The complexity level of the given SQL complexity and SQL concept.",
            "num_new_values_to_generate": 2
        }
    ],
)

  • Generate seeds from sample records: Sometimes you may not know the best way to define seeds for your dataset, but you have some examples of the data you want. You can give Data Designer a few records of your data, and it will work out which seeds to use. This is a quick way to go from a few records to an entire dataset. Learn more about this in our "Sample-to-Dataset" blog.

# Python Script

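import pandas as pd
from gretel_client.navigator import DataDesignerFactory

# Session settings; replace the placeholder with your Gretel API key.
session_kwargs = {
    "api_key": "<YOUR_API_KEY>",
    "endpoint": "https://api.gretel.cloud",
}

NUM_SAMPLES = 10
MODEL_SUITE = "apache-2.0"

# Load the example data and draw a few random records to seed from.
df = pd.read_csv("https://gretel-datasets.s3.us-west-2.amazonaws.com/realestate_data_london_2024_nov.csv")
sample_records = df.sample(NUM_SAMPLES).to_dict(orient="records")
df.head()  # preview the data (renders when run in a notebook)

# Data Designer derives the seeds from the sample records.
data_designer = DataDesignerFactory.from_sample_records(
    sample_records=sample_records,
    model_suite=MODEL_SUITE,
    **session_kwargs,
)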

Seeding from sample records is only supported when using the Gretel SDK, not in the YAML config.
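
For intuition about what "figuring out seeds" from sample records can mean, here is a rough heuristic in plain Python: treat text columns whose values repeat and take only a few distinct values as candidate categorical seeds. This is an assumption-laden sketch, not how Data Designer selects seeds; `candidate_seed_columns`, the `max_unique` threshold, and the made-up rows (which are not taken from the real estate CSV above) are all hypothetical.

```python
def candidate_seed_columns(sample_records, max_unique=5):
    """Return {column: sorted unique values} for low-cardinality text columns."""
    # Collect the values seen for each column across all records.
    columns = {}
    for record in sample_records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)

    seeds = {}
    for name, values in columns.items():
        # Only consider columns that contain text in every record.
        if not all(isinstance(value, str) for value in values):
            continue
        unique = sorted(set(values))
        # Keep columns with few distinct values and at least one repeat.
        if len(unique) <= max_unique and len(unique) < len(values):
            seeds[name] = unique
    return seeds


# Made-up rows for illustration only.
sample_records = [
    {"property_type": "Flat", "borough": "Camden", "price": 650000},
    {"property_type": "House", "borough": "Hackney", "price": 900000},
    {"property_type": "Flat", "borough": "Camden", "price": 480000},
    {"property_type": "Flat", "borough": "Islington", "price": 720000},
]

seeds = candidate_seed_columns(sample_records)
print(seeds)
```

Each entry in the result has the same name/values shape as a categorical seed column, so it could be reviewed by hand and then passed to `add_categorical_seed_column`.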
