Column Types

Data Designer supports various column types that determine how data is generated. This guide explains the different column types available and how to use them.

Two Ways to Define Columns

Data Designer offers two approaches to define columns:

  1. Simplified API: Direct parameter passing with string type names

  2. Typed API: More verbose but provides better type checking and IDE support

Both approaches offer the same functionality - choose the style that works best for your needs.

Simplified API Example

The simplified approach is concise and easy to use:

# Simplified API approach
aidd.add_column(
    name="product_category",
    type="category",
    params={"values": ["Electronics", "Clothing", "Home Goods"]}
)

Typed API Example

The typed API provides better code completion and type checking:

from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P

# Typed API approach
aidd.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(values=["Electronics", "Clothing", "Home Goods"])
    )
)

When to Use Each Approach

Choose the Simplified API when:

  • You prefer concise, readable code

  • You're working on quick prototypes or simple designs

  • You don't need IDE autocompletion for parameters

Choose the Typed API when:

  • You want code completion and type checking in your IDE

  • You're working on complex designs where type safety helps prevent errors

  • You need clarity about available parameters and their types

  • You're collaborating with a team and want more self-documenting code

Both approaches use the same underlying implementation, so you can mix and match them as needed.
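One way to picture why the two styles can be mixed is a small normalization step that turns either input into the same internal object. The sketch below is an illustration only: SketchColumn and normalize are hypothetical names, not part of Data Designer, and the real implementation may work differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchColumn:
    """Hypothetical internal column spec (not a real Data Designer class)."""
    name: str
    type: str
    params: Dict[str, Any] = field(default_factory=dict)


def normalize(column: Optional[SketchColumn] = None, **kwargs: Any) -> SketchColumn:
    """Return a SketchColumn from either a typed object or keyword arguments."""
    if column is not None:
        return column  # typed style: already the internal form
    return SketchColumn(**kwargs)  # simplified style: build it from kwargs


values = ["Electronics", "Clothing", "Home Goods"]
from_kwargs = normalize(name="product_category", type="category", params={"values": values})
from_object = normalize(SketchColumn("product_category", "category", {"values": values}))
print(from_kwargs == from_object)
```

Both calls end up with an equal object, which is the property that makes the styles interchangeable.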

Column Type Categories

Data Designer columns fall into these main categories:

  1. Sampling-based columns: Generate data through statistical sampling methods

  2. Expression columns: Generate data by evaluating expressions

  3. LLM-based columns: Generate data using large language models

Sampling-Based Column Types

Category

Creates categorical values from a defined set of options.

Simplified API:

aidd.add_column(
    name="product_category",
    type="category",
    params={
        "values": ["Electronics", "Clothing", "Home Goods", "Books"],
        "weights": [0.4, 0.3, 0.2, 0.1],  # Optional: probability weights
        "description": "Product category classification"  # Optional
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home Goods", "Books"],
            weights=[0.4, 0.3, 0.2, 0.1],  # Optional: probability weights
            description="Product category classification"  # Optional
        )
    )
)
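To see what the weights imply for the generated data, here is a minimal standard-library sketch of weighted categorical sampling. It does not use Data Designer and is not its implementation; it only shows that the observed frequencies should approach the given weights.

```python
import random

values = ["Electronics", "Clothing", "Home Goods", "Books"]
weights = [0.4, 0.3, 0.2, 0.1]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(values, weights=weights, k=10_000)

# Share of each category among the samples
frequencies = {v: samples.count(v) / len(samples) for v in values}
for value, freq in frequencies.items():
    print(f"{value}: {freq:.3f}")
```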

Subcategory

Creates values associated with a parent category.

Simplified API:

aidd.add_column(
    name="product_subcategory",
    type="subcategory",
    params={
        "category": "product_category",  # Parent category column
        "values": {
            "Electronics": ["Smartphones", "Laptops", "Headphones"],
            "Clothing": ["Shirts", "Pants", "Dresses", "Shoes"],
            "Home Goods": ["Kitchen", "Bathroom", "Bedroom"]
        }
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="product_category",  # Parent category column
            values={
                "Electronics": ["Smartphones", "Laptops", "Headphones"],
                "Clothing": ["Shirts", "Pants", "Dresses", "Shoes"],
                "Home Goods": ["Kitchen", "Bathroom", "Bedroom"]
            }
        )
    )
)
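The dependency between the two columns can be sketched in plain Python: the parent value is drawn first, and the subcategory is then drawn only from the list mapped to that parent. This is an illustrative sketch, not Data Designer's code; sample_row is a hypothetical helper.

```python
import random

subcategories = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Clothing": ["Shirts", "Pants", "Dresses", "Shoes"],
    "Home Goods": ["Kitchen", "Bathroom", "Bedroom"],
}

rng = random.Random(0)


def sample_row() -> dict:
    """Draw a parent category, then a subcategory that belongs to it."""
    category = rng.choice(list(subcategories))
    subcategory = rng.choice(subcategories[category])
    return {"product_category": category, "product_subcategory": subcategory}


rows = [sample_row() for _ in range(100)]
print(rows[0])
```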

UUID

Generates unique identifiers.

Simplified API:

aidd.add_column(
    name="order_id",
    type="uuid",
    params={
        "prefix": "ORD-",  # Optional: adds a prefix
        "short_form": True,  # Optional: uses a shorter format
        "uppercase": True  # Optional: uses uppercase letters
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="order_id",
        type=P.SamplerType.UUID,
        params=P.UUIDSamplerParams(
            prefix="ORD-",  # Optional: adds a prefix
            short_form=True,  # Optional: uses a shorter format
            uppercase=True  # Optional: uses uppercase letters
        )
    )
)
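The three options can be pictured with Python's uuid module. The sketch below is an assumption about their effect, not the library's implementation: make_id is a hypothetical helper, and treating short_form as "keep the first 8 hex characters" is a guess made for illustration.

```python
import uuid


def make_id(prefix: str = "", short_form: bool = False, uppercase: bool = False) -> str:
    """Build an identifier from a random UUID (illustrative only)."""
    hex_id = uuid.uuid4().hex  # 32 hex characters, no dashes
    if short_form:
        hex_id = hex_id[:8]  # assumed meaning of "short form"
    if uppercase:
        hex_id = hex_id.upper()
    return prefix + hex_id


order_id = make_id(prefix="ORD-", short_form=True, uppercase=True)
print(order_id)
```

Note that truncating a UUID, as this sketch does, raises the chance of collisions in large datasets.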

Numerical Samplers

Uniform Distribution

Simplified API:

aidd.add_column(
    name="product_rating",
    type="uniform",
    params={"low": 1, "high": 5},
    convert_to="int"  # Optional: converts to integer
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="product_rating",
        type=P.SamplerType.UNIFORM,
        params=P.UniformSamplerParams(low=1, high=5),
        convert_to="int"  # Optional: converts to integer
    )
)

Gaussian Distribution

Simplified API:

aidd.add_column(
    name="item_weight",
    type="gaussian",
    params={"mean": 50, "stddev": 10}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="item_weight",
        type=P.SamplerType.GAUSSIAN,
        params=P.GaussianSamplerParams(mean=50, stddev=10)
    )
)

Poisson Distribution

Simplified API:

aidd.add_column(
    name="number_of_pets",
    type="poisson",
    params={"mean": 2}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="number_of_pets",
        type=P.SamplerType.POISSON,
        params=P.PoissonSamplerParams(mean=2)
    )
)

Bernoulli Distribution

Simplified API:

aidd.add_column(
    name="is_in_stock",
    type="bernoulli",
    params={"p": 0.8}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="is_in_stock",
        type=P.SamplerType.BERNOULLI,
        params=P.BernoulliSamplerParams(p=0.8)
    )
)

Bernoulli Mixture Distribution

Simplified API:

aidd.add_column(
    name="bern_exp",
    type="bernoulli_mixture",
    params={"p": 0.4, "dist_name": "expon", "dist_params": {"scale": 10}}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="bern_exp",
        type=P.SamplerType.BERNOULLI_MIXTURE,
        params=P.BernoulliMixtureSamplerParams(p=0.4, dist_name="expon", dist_params={"scale": 10})
    )
)
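A rough sketch of what a Bernoulli mixture can look like, using the values from the simplified example. It assumes that each sample equals a draw from the named distribution with probability p and equals 0 otherwise; that reading is an assumption and should be checked against the library reference. bernoulli_mixture_expon is a hypothetical helper.

```python
import random

rng = random.Random(0)


def bernoulli_mixture_expon(p: float, scale: float) -> float:
    """With probability p draw from an exponential distribution, else return 0 (assumed semantics)."""
    if rng.random() < p:
        return rng.expovariate(1.0 / scale)  # exponential with mean `scale`
    return 0.0


samples = [bernoulli_mixture_expon(p=0.4, scale=10) for _ in range(10_000)]
zero_share = sum(1 for x in samples if x == 0.0) / len(samples)
nonzero = [x for x in samples if x > 0.0]
nonzero_mean = sum(nonzero) / len(nonzero)
print(f"share of zeros: {zero_share:.3f}, mean of non-zero values: {nonzero_mean:.2f}")
```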

Binomial Distribution

Simplified API:

aidd.add_column(
    name="items_returned",
    type="binomial",
    params={"n": 10, "p": 0.1}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="items_returned",
        type=P.SamplerType.BINOMIAL,
        params=P.BinomialSamplerParams(n=10, p=0.1)
    )
)
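For intuition about the Bernoulli and binomial parameters, here is a standard-library sketch that is not Data Designer's implementation. A Bernoulli draw is a single success/failure with probability p, and a binomial draw counts the successes in n independent Bernoulli trials.

```python
import random

rng = random.Random(0)


def bernoulli(p: float) -> int:
    """Return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def binomial(n: int, p: float) -> int:
    """Count successes across n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))


is_in_stock = [bernoulli(0.8) for _ in range(10_000)]
items_returned = [binomial(10, 0.1) for _ in range(10_000)]

stock_rate = sum(is_in_stock) / len(is_in_stock)
mean_returned = sum(items_returned) / len(items_returned)
print(f"in-stock rate: {stock_rate:.3f}, mean items returned: {mean_returned:.2f}")
```

The mean of the binomial samples should sit near n * p, which is 1 for the example parameters.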

Scipy Sampler

Use this sampler to access any statistical distribution available in scipy.stats.

Simplified API:

aidd.add_column(
    name="log_gaussian", 
    type="scipy", 
    params={
        "dist_name": "lognorm", 
        "dist_params": {
            "s": 0.9,   # sigma 
            "scale": 8, # exp(mean) 
        }
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="log_gaussian",
        type=P.SamplerType.SCIPY,
        params=P.ScipySamplerParams(dist_name="lognorm", dist_params={"s": 0.9, "scale": 8})
    )
)
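The meaning of the lognorm parameters can be checked without SciPy. In scipy.stats.lognorm, s is the standard deviation of the underlying normal distribution and scale is exp(mean) of it, so scale is also the median of the samples. The sketch below reproduces that parameterization with the standard library's random.lognormvariate, as a stand-in for the SciPy call.

```python
import math
import random
import statistics

s = 0.9      # sigma of the underlying normal distribution
scale = 8.0  # exp(mean) of the underlying normal distribution

rng = random.Random(0)
mu = math.log(scale)  # mean of the underlying normal distribution
samples = [rng.lognormvariate(mu, s) for _ in range(20_001)]

median = statistics.median(samples)
print(f"sample median: {median:.2f} (expected to be close to scale = {scale})")
```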

Date and Time

DateTime

Simplified API:

aidd.add_column(
    name="order_date",
    type="datetime",
    params={"start": "2023-01-01", "end": "2023-12-31"}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="order_date",
        type=P.SamplerType.DATETIME,
        params=P.DatetimeSamplerParams(start="2023-01-01", end="2023-12-31")
    )
)

TimeDelta

Simplified API:

aidd.add_column(
    name="delivery_date",
    type="timedelta",
    params={
        "dt_min": 1,  # Minimum days
        "dt_max": 7,  # Maximum days
        "reference_column_name": "order_date"  # Reference date column
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="delivery_date",
        type=P.SamplerType.TIMEDELTA,
        params=P.TimedeltaSamplerParams(
            dt_min=1,  # Minimum days
            dt_max=7,  # Maximum days
            reference_column_name="order_date"  # Reference date column
        )
    )
)
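The relationship between the two date columns can be sketched with the datetime module. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, offsets are whole days, and treating dt_max as inclusive is an assumption.

```python
import random
from datetime import date, timedelta

rng = random.Random(0)


def sample_order_date(start: date, end: date) -> date:
    """Pick a date uniformly between start and end (inclusive)."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))


def sample_delivery_date(order_date: date, dt_min: int, dt_max: int) -> date:
    """Offset the reference date by a random number of days in [dt_min, dt_max]."""
    return order_date + timedelta(days=rng.randint(dt_min, dt_max))


rows = []
for _ in range(100):
    order = sample_order_date(date(2023, 1, 1), date(2023, 12, 31))
    rows.append({"order_date": order, "delivery_date": sample_delivery_date(order, 1, 7)})
print(rows[0])
```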

Person

Defines person samplers that create realistic person entities.

Simplified API:

# Define person samplers
aidd.with_person_samplers(
    {
        "customer": {
            "sex": "Female",  # Optional
            "locale": "en_US"  # Optional
        }
    }
)

Typed API:

# Define person samplers
aidd.with_person_samplers(
    {
        "customer": P.PersonSamplerParams(
            sex="Female",  # Optional
            locale="en_US",  # Optional
        )
    }
)

Expression Columns

The Expression column type computes values using expressions involving other columns.

Basic Expressions

Simplified API:

aidd.add_column(
    name="final_price",
    type="expression",
    expr="{{ base_price }} * {{ new_price }}"  # Or "{{ base_price * new_price }}"
)

Typed API:

aidd.add_column(
    C.ExpressionColumn(
        name="total_price",
        expr="{{ quantity }} * {{ unit_price }}"  # Or "{{ quantity * unit_price }}"
    )
)

Person Attribute Expressions

Simplified API:

aidd.with_person_samplers(
    {
        "customer": P.PersonSamplerParams(),
    },
)

aidd.add_column(
    name="customer_full_name",
    type="expression",
    params={"expr": "{{ customer.first_name }} {{ customer.last_name }}"}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="person",  # This creates a nested object with all person attributes
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
            age_range=[22, 65],
            state="CA"
        )
    )
)

LLM-Based Column Types

LLM Generated Content

Generates text data using large language models based on prompts.

There are three types of LLM columns: llm-text, llm-code, and llm-structured.

The default type is llm-text. If you are generating code with an LLM, use llm-code and pass the code language to output_format for formatting. If you are defining structured outputs for the LLM responses, use llm-structured and pass a Pydantic or JSON schema to the output_format argument.

Simplified API:

aidd.add_column(
    name="product_description",
    type="llm-text",  # or "llm-code", "llm-structured"
    model_alias="text",  # Optional (default: text)
    prompt="Generate a detailed description for a {{product_category}} product.",
    system_prompt="You are a professional product copywriter.",  # Optional
    # output_format=".."  # Optional
)

Typed API:

aidd.add_column(
    C.LLMGenColumn(
        name="product_description",
        output_type="text",  # or "code", "structured"
        model_alias="text",
        prompt="Generate a detailed description for a {{product_category}} product.",
        system_prompt="You are a professional product copywriter.",  # Optional
        # output_format=".."  # Optional
    )
)

For details on using Structured Outputs for LLM generated content, read this section.
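As a rough illustration of the kind of JSON schema an llm-structured column could be given, the sketch below builds one as a plain Python dictionary. The field names (title, price_usd, features) are hypothetical, and whether output_format accepts such a dictionary directly is an assumption to verify against the structured outputs documentation.

```python
import json

# Hypothetical schema for a structured product description
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price_usd": {"type": "number"},
        "features": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "price_usd"],
}

# Serializing confirms the schema is valid JSON
print(json.dumps(product_schema, indent=2))
```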

Data Designer supports text, code, and judge as default model aliases. An llm-judge column uses the judge alias by default. You can also define custom model aliases with the generation parameters you want; see the model configuration section for details.

LLM Judge

Evaluates data quality using large language models.

Simplified API:

from gretel_client.data_designer.judge_rubrics import TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE, PYTHON_RUBRICS

aidd.add_column(
    name="code_quality",
    type="llm-judge",
    prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
    rubrics=PYTHON_RUBRICS
)

Typed API:

aidd.add_column(
    C.LLMJudgeColumn(
        name="code_quality",
        prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
        rubrics=PYTHON_RUBRICS
    )
)

Code Validation

Validates code in another column.

Simplified API:

aidd.add_column(
    name="code_validation_result",
    type="code-validation",
    code_lang="python",  # Language to validate
    target_column="code_implementation"  # Column containing code
)

Typed API:

aidd.add_column(
    C.CodeValidationColumn(
        name="code_validation_result",
        code_lang="python",  # Language to validate
        target_column="code_implementation"  # Column containing code
    )
)
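For a sense of what validating Python code can mean at its simplest, the sketch below checks whether a string parses as Python. This is only a syntax check written for illustration; is_valid_python is a hypothetical helper, and the checks Data Designer actually runs may go further.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


good = "def add_one(x):\n    return x + 1\n"
bad = "def add_one(x) return x + 1"
print(is_valid_python(good), is_valid_python(bad))
```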

Using Conditional Parameters

Data Designer supports conditional parameters, which replace a column's default parameters when a condition on other column values holds:

Simplified API:

aidd.add_column(
    name="pet_type",
    type="category",
    params={"values": ["dog", "cat", "fish"], "weights": [0.5, 0.3, 0.2]},
    conditional_params={
        "number_of_pets == 0": {"values": ["none"]}
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="pet_type",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(values=["dog", "cat", "fish"], weights=[0.5, 0.3, 0.2]),
        conditional_params={
            "number_of_pets == 0": P.CategorySamplerParams(values=["none"])
        }
    )
)
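The selection logic can be sketched as follows: for each row, the first condition that holds supplies the parameters, and the defaults apply otherwise. This is an illustration, not the library's implementation. params_for and sample_pet_type are hypothetical helpers, and eval is used here only because the condition strings are trusted literals.

```python
import random

default_params = {"values": ["dog", "cat", "fish"], "weights": [0.5, 0.3, 0.2]}
conditional_params = {"number_of_pets == 0": {"values": ["none"]}}


def params_for(row: dict) -> dict:
    """Return the first conditional params whose condition holds, else the defaults."""
    for condition, params in conditional_params.items():
        # Evaluate the condition with the row's columns as variables (trusted input only)
        if eval(condition, {"__builtins__": {}}, dict(row)):
            return params
    return default_params


def sample_pet_type(row: dict, rng: random.Random) -> str:
    """Sample a value using whichever parameters apply to this row."""
    params = params_for(row)
    return rng.choices(params["values"], weights=params.get("weights"), k=1)[0]


rng = random.Random(0)
print(sample_pet_type({"number_of_pets": 0}, rng))
print(sample_pet_type({"number_of_pets": 2}, rng))
```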

Reference Table

Simplified API Type   Typed API Equivalent         Description
-------------------   --------------------------   --------------------------------
"category"            P.SamplerType.CATEGORY       Categorical values
"subcategory"         P.SamplerType.SUBCATEGORY    Dependent categories
"uuid"                P.SamplerType.UUID           Unique identifiers
"uniform"             P.SamplerType.UNIFORM        Uniform distribution
"gaussian"            P.SamplerType.GAUSSIAN       Normal distribution
"poisson"             P.SamplerType.POISSON        Poisson distribution
"bernoulli"           P.SamplerType.BERNOULLI      Binary outcomes
"binomial"            P.SamplerType.BINOMIAL       Number of successes
"datetime"            P.SamplerType.DATETIME       Date/time values
"timedelta"           P.SamplerType.TIMEDELTA      Time intervals
"expression"          C.ExpressionColumn           Computed expressions
"llm-text"            C.LLMTextColumn              LLM-generated text content
"llm-structured"      C.LLMStructuredColumn        LLM-generated structured content
"llm-code"            C.LLMCodeColumn              LLM-generated code content
"llm-judge"           C.LLMJudgeColumn             LLM-based evaluation
"code-validation"     C.CodeValidationColumn       Code validation

Choosing the Right Approach

Data Designer offers flexibility in how you define your columns. Both approaches are fully supported, so you can choose the style that best fits your needs.

Key points to remember:

  1. Same functionality: Both approaches provide access to the same features

  2. Interchangeable: You can mix both styles in the same project

  3. Simplified == concise: The simplified API is more concise

  4. Typed == safer: The typed API offers better IDE support and type checking

For quick experiments, the simplified API might be more convenient. For larger projects, the additional safety of the typed API can help prevent errors.
