Seeding your Dataset
Diversity in data is at the core of successfully generating a large-scale synthetic dataset. Data Designer introduces the concept of a "Data Seed": a key-value pair used to inject diversity into the dataset. Data Designer uses the seed values to guide the data generation process and maximize diversity in the dataset.
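As a rough illustration of the idea (not Data Designer's actual API or internals), the plain-Python sketch below treats each seed as a name mapped to candidate values and spreads records evenly across every combination. The seed names, the values, and the helpers `sample_seed_combinations` and `build_prompt` are all hypothetical and chosen only for this example.

```python
import itertools
import random

# Hypothetical seeds: each key is a seed name, each value is its list of seed values.
seeds = {
    "code_complexity": ["beginner", "intermediate", "advanced"],
    "topic": ["string handling", "file I/O", "sorting"],
}


def sample_seed_combinations(seeds, n, rng_seed=0):
    """Return n seed assignments, cycling through all combinations evenly."""
    keys = list(seeds)
    # Build every combination of seed values as a dict of {seed name: value}.
    combos = [dict(zip(keys, values)) for values in itertools.product(*seeds.values())]
    # Shuffle so the order of combinations is not tied to the config order.
    rng = random.Random(rng_seed)
    rng.shuffle(combos)
    return [combos[i % len(combos)] for i in range(n)]


def build_prompt(assignment):
    """Turn one seed assignment into a generation prompt (illustrative wording)."""
    return "Write a Python task about {topic} at {code_complexity} level.".format(**assignment)


records = sample_seed_combinations(seeds, n=9)
prompts = [build_prompt(record) for record in records]
```

With three values per seed there are nine combinations, so requesting nine records covers each combination once. A real generator would pass such prompts to a model; here they are only built as strings.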
Defining your Seeds
There are 3 ways to define your seeds:
Specify them in your config: As shown above, you can provide the seed values you are interested in directly in your config or Python script.
Let Data Designer create seed values: Sometimes you may want Data Designer to generate the values for a specific seed. This is useful when you have "Nested Seeds". For example, in a Text-to-Python dataset you might use code complexity as the seed and have an LLM generate descriptions for each complexity level.
Generate Seeds from sample records: Sometimes you may not know the best way to define seeds for your dataset, but you have some examples of the data you want. You can give Data Designer a few records of your data, and it will work out the best seeds to use. This capability is a powerful way to go quickly from a few records to an entire dataset. Learn more about this in our "Sample-to-Dataset" blog.
Seeding from sample records is only supported when using the Gretel SDK, not in the YAML config.
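The "Nested Seeds" idea from the second option can be sketched in plain Python as a parent seed whose values each carry their own generated sub-values. This is only a conceptual sketch, not the Gretel SDK: `generate_subvalues` is a hypothetical stand-in for the LLM call that would write the descriptions, and the field names are assumptions.

```python
import random


def generate_subvalues(parent_value, count):
    """Hypothetical stand-in for an LLM call that would write descriptions."""
    return [f"{parent_value} description {i + 1}" for i in range(count)]


def build_nested_seed(parent_values, per_parent=2):
    """Map each parent seed value to its generated sub-values."""
    return {value: generate_subvalues(value, per_parent) for value in parent_values}


def sample_nested(nested, rng):
    """Pick a parent value first, then one of the sub-values nested under it."""
    parent = rng.choice(list(nested))
    child = rng.choice(nested[parent])
    return {"code_complexity": parent, "description": child}


nested = build_nested_seed(["beginner", "intermediate", "advanced"])
rng = random.Random(0)
samples = [sample_nested(nested, rng) for _ in range(5)]
```

Sampling the parent before the child keeps every description consistent with its complexity level, which is the point of nesting one seed under another.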