Data Designer Configuration
The data designer configuration is the primary interface customers will use to build their dataset and inject diversity into it.
Here is an example of a Data Designer configuration for building a Text-to-Python dataset (a full notebook for this example is also linked).
Key Details:
Model Suite: Model Suites are curated collections of models designed to help you navigate the challenges of model selection, regulatory compliance, and legal rights over generated data. We support two model suites:
apache-2.0
llama-3.x
For more on model suites, view this page.
Special System Instruction: Customers can use this to specify a prompt that guides the entire Navigator system when it generates data.
Categorical Seed Columns: Navigator Data Designer uses data seeds to inject diversity into the dataset. You can define seeds as key-value pairs so that the columns you generate can use them as context to produce data related to specific concepts. Seed columns support subcategories, which let you specify topics related to a specific seed.
Generated Data Columns: These are the columns you want generated from scratch in your dataset. For example, Text and Code are the two data columns you want to generate in a Text-to-Code dataset. For each data column, you can provide a detailed generation prompt to guide how that column is generated.
Post Processors: We offer two types of post-processing for the data you generate. Validation checks the correctness of the data generated in a specific column. In this beta, we support Python and SQL validation to ensure that generated code is valid Python or SQL. Evaluation assesses how readable, relevant, and diverse the generated data is. Evaluation is performed by LLMs on specific records and on the entire dataset.
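To make the pieces above concrete, here is a minimal Python sketch of how they might fit together. It is not the Navigator SDK or its configuration schema: the class names (`SeedColumn`, `DataColumn`, `DesignerConfig`), the field names, the helper functions, and the example seed values are all hypothetical, chosen only for illustration. The syntax check uses Python's standard `ast` module as a stand-in for a Python validation post processor; how Navigator validates code internally is not described here.

```python
import ast
import itertools
from dataclasses import dataclass, field


@dataclass
class SeedColumn:
    """Hypothetical categorical seed: each value maps to optional subcategories."""
    name: str
    values: dict[str, list[str]]


@dataclass
class DataColumn:
    """Hypothetical generated column with a prompt that may reference seeds."""
    name: str
    generation_prompt: str


@dataclass
class DesignerConfig:
    """Hypothetical container mirroring the key details listed above."""
    model_suite: str
    special_system_instructions: str
    seed_columns: list[SeedColumn] = field(default_factory=list)
    data_columns: list[DataColumn] = field(default_factory=list)


def seed_combinations(config: DesignerConfig) -> list[dict[str, str]]:
    """Return every combination of top-level seed values, one dict per combination."""
    names = [seed.name for seed in config.seed_columns]
    pools = [list(seed.values) for seed in config.seed_columns]
    return [dict(zip(names, combo)) for combo in itertools.product(*pools)]


def is_valid_python(code: str) -> bool:
    """Syntax-only check; it does not execute or test the code."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


config = DesignerConfig(
    model_suite="apache-2.0",
    special_system_instructions="You write clear, idiomatic Python.",
    seed_columns=[
        SeedColumn("domain", {"Healthcare": ["Patient Records"],
                              "Finance": ["Fraud Detection"]}),
        SeedColumn("complexity", {"Beginner": [], "Advanced": []}),
    ],
    data_columns=[
        DataColumn("text", "Write a {complexity} Python task about {domain}."),
        DataColumn("code", "Write Python code that solves the task."),
    ],
)

# Each seed combination yields a differently grounded prompt for the "text" column.
combos = seed_combinations(config)
prompts = [config.data_columns[0].generation_prompt.format(**c) for c in combos]
```

In this sketch, two seed columns with two values each produce four distinct prompts, which is the sense in which seeds inject diversity: the same generation prompt is grounded in a different concept for each combination. Subcategories are stored but not expanded here, to keep the example short.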
We provide Blueprint configurations for common use cases, like Text-to-Code. For more on Blueprints, read these docs.