Data Designer Configuration

The Data Designer configuration is the primary interface customers use to build a dataset and inject diversity into it.

Here is an example Data Designer configuration for building a Text-to-Python dataset (a full notebook is also available).

model_suite: apache-2.0

special_system_instructions: You are an expert at writing, analyzing, and editing Python code. Your job is to assist the user with their Python-related tasks.

categorical_seed_columns:
  - name: industry_sector
    values:
      - Healthcare
      - Finance
    subcategories:
      - name: topic
        values:
          Healthcare:
            - Electronic Health Records (EHR) Systems
            - Telemedicine Platforms
          Finance:
            - Fraud Detection Software
            - Automated Trading Systems
  
generated_data_columns:
  - name: text
    generation_prompt: Write a prompt for a text-to-code dataset that 
                       is related to {topic} in the {industry_sector} 
                       sector.
  - name: code
    generation_prompt: Write Python code that will be paired with the
                       following prompt: {text}

post_processors:
  - validator: code
    settings:
      code_lang: python
      code_columns: code
  - evaluator: text_to_python
    settings:
      text_column: text
      code_column: code
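
The {topic} and {industry_sector} placeholders in the generation prompts refer to the seed columns defined earlier in the configuration. The following Python sketch illustrates the idea only; it is not Navigator's implementation. The `seeds` dictionary, `seed_rows`, and `render_prompts` are hypothetical names, and the sketch assumes placeholders are filled by plain string substitution, one prompt per sector and topic pair.

```python
# Hypothetical in-memory mirror of the categorical_seed_columns section:
# each industry_sector value maps to its list of topic subcategory values.
seeds = {
    "Healthcare": [
        "Electronic Health Records (EHR) Systems",
        "Telemedicine Platforms",
    ],
    "Finance": [
        "Fraud Detection Software",
        "Automated Trading Systems",
    ],
}

# The generation prompt for the "text" column, joined onto one line.
TEXT_PROMPT = (
    "Write a prompt for a text-to-code dataset that is related to "
    "{topic} in the {industry_sector} sector."
)


def seed_rows(seeds):
    """Yield one seed record per (industry_sector, topic) pair."""
    for sector, topics in seeds.items():
        for topic in topics:
            yield {"industry_sector": sector, "topic": topic}


def render_prompts(template, seeds):
    """Fill the template's placeholders from every seed record."""
    return [template.format(**row) for row in seed_rows(seeds)]


prompts = render_prompts(TEXT_PROMPT, seeds)
for prompt in prompts:
    print(prompt)
```

With two sectors and two topics each, the sketch produces four distinct prompts, which shows how a small set of seeds spreads the generated records across different concepts.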

Key Details:

  • Model Suite: Model suites are curated collections of models that simplify model selection, regulatory compliance, and legal rights over generated data. We support two model suites: apache-2.0 and llama-3.x. For more on model suites, view this page.

  • Special System Instructions: Customers can use this field to provide a prompt that guides the entire Navigator system when it generates data.

  • Categorical Seed Columns: Navigator Data Designer uses data seeds to inject diversity into the dataset. You define seeds as key-value pairs (a column name and its list of values), and the generation prompts for your data columns can reference them as context, so the generated data relates to specific concepts. Seed columns support subcategories, which let you specify topics related to each seed value.

  • Generated Data Columns: These are the columns you want generated from scratch in your dataset. For example, text and code are the two data columns you want to generate in a Text-to-Code dataset. For each data column, you can provide a detailed generation prompt to guide how that column is generated.

  • Post Processors: We offer two types of post-processing for the data you generate. Validation checks the correctness of the data generated in a specific column; in this beta we support Python and SQL validation, which ensures that the generated code is valid Python or SQL. Evaluation assesses how readable, relevant, and diverse the generated data is. It is performed by LLMs on specific records and on the entire dataset.
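
To make the validation step more concrete, here is a minimal Python sketch of a syntax check over a code column. It is an illustration under stated assumptions, not the validator Navigator runs: `is_valid_python` and the sample `records` are hypothetical, and the check covers syntax only, using the standard library's `ast.parse`. Whether the hosted validator checks more than syntax is not stated here.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python, False otherwise.

    This checks syntax only; it does not execute the code or verify
    that it does what the paired prompt asks for.
    """
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


# Hypothetical generated records with "text" and "code" columns.
records = [
    {"text": "Add two numbers", "code": "def add(a, b):\n    return a + b\n"},
    {"text": "Broken example", "code": "def add(a, b) return a + b"},
]

# One validity flag per record, in the same order as the records.
flags = [is_valid_python(record["code"]) for record in records]
print(flags)
```

A flag like this can be used to filter or regenerate records whose code column does not parse before the dataset is used downstream.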

We provide Blueprint configurations for common use cases, like Text-to-Code. For more on Blueprints, read these docs.
