Column Types

Data Designer supports various column types that determine how data is generated. This guide explains the different column types available and how to use them.

Two Ways to Define Columns

Data Designer offers two approaches to define columns:

  1. Simplified API: Direct parameter passing with string type names

  2. Typed API: More verbose but provides better type checking and IDE support

Both approaches offer the same functionality - choose the style that works best for your needs.

Simplified API Example

The simplified approach is concise and easy to use:

# Simplified API approach
aidd.add_column(
    name="product_category",
    type="category",
    params={"values": ["Electronics", "Clothing", "Home Goods"]}
)

Typed API Example

The typed API provides better code completion and type checking:

from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P

# Typed API approach
aidd.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(values=["Electronics", "Clothing", "Home Goods"])
    )
)

When to Use Each Approach

Choose the Simplified API when:

  • You prefer concise, readable code

  • You're working on quick prototypes or simple designs

  • You don't need IDE autocompletion for parameters

Choose the Typed API when:

  • You want code completion and type checking in your IDE

  • You're working on complex designs where type safety helps prevent errors

  • You need clarity about available parameters and their types

  • You're collaborating with a team and want more self-documenting code

Both approaches use the same underlying implementation, so you can mix and match them as needed.

Column Type Categories

Data Designer columns fall into these main categories:

  1. Sampling-based columns: Generate data through statistical sampling methods

  2. Expression columns: Generate data by evaluating expressions

  3. LLM-based columns: Generate data using large language models

Sampling-Based Column Types

Category

Creates categorical values from a defined set of options.

Simplified API:

aidd.add_column(
    name="product_category",
    type="category",
    params={
        "values": ["Electronics", "Clothing", "Home Goods", "Books"],
        "weights": [0.4, 0.3, 0.2, 0.1],  # Optional: probability weights
        "description": "Product category classification"  # Optional
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home Goods", "Books"],
            weights=[0.4, 0.3, 0.2, 0.1],  # Optional: probability weights
            description="Product category classification"  # Optional
        )
    )
)
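To make the effect of the optional weights concrete, here is a minimal sketch in plain Python. It is not Data Designer's implementation; it only mimics weighted categorical sampling with the standard library, and the helper name sample_category is hypothetical.

```python
import random
from collections import Counter


def sample_category(values, weights=None, n=1, seed=None):
    """Draw n values; weights act as relative probabilities (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.choices(values, weights=weights, k=n)


values = ["Electronics", "Clothing", "Home Goods", "Books"]
weights = [0.4, 0.3, 0.2, 0.1]

draws = sample_category(values, weights, n=10_000, seed=0)
counts = Counter(draws)

# The observed shares should sit close to the configured weights.
for value in values:
    print(f"{value}: {counts[value] / len(draws):.2f}")
```

With enough rows, each category's share of the draws approaches its weight. Omitting weights in this sketch makes all values equally likely.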

Subcategory

Creates values associated with a parent category.

Simplified API:

aidd.add_column(
    name="product_subcategory",
    type="subcategory",
    params={
        "category": "product_category",  # Parent category column
        "values": {
            "Electronics": ["Smartphones", "Laptops", "Headphones"],
            "Clothing": ["Shirts", "Pants", "Dresses", "Shoes"],
            "Home Goods": ["Kitchen", "Bathroom", "Bedroom"]
        }
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategoryParams(
            category="product_category",  # Parent category column
            values={
                "Electronics": ["Smartphones", "Laptops", "Headphones"],
                "Clothing": ["Shirts", "Pants", "Dresses", "Shoes"],
                "Home Goods": ["Kitchen", "Bathroom", "Bedroom"]
            }
        )
    )
)
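The parent/child relationship can be sketched in plain Python as a lookup followed by a draw. This is an illustration of the idea only, not Data Designer's code, and sample_subcategory is a hypothetical helper.

```python
import random

SUBCATEGORIES = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Clothing": ["Shirts", "Pants", "Dresses", "Shoes"],
    "Home Goods": ["Kitchen", "Bathroom", "Bedroom"],
}


def sample_subcategory(parent_value, mapping, rng):
    """Pick a subcategory from the list registered for the parent value."""
    return rng.choice(mapping[parent_value])


rng = random.Random(0)
rows = []
for _ in range(5):
    category = rng.choice(list(SUBCATEGORIES))  # stand-in for the parent column
    rows.append(
        {
            "product_category": category,
            "product_subcategory": sample_subcategory(category, SUBCATEGORIES, rng),
        }
    )

for row in rows:
    print(row)
```

In this sketch, a parent value with no entry in the mapping raises a KeyError, so every value the parent column can produce needs its own list.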

UUID

Generates unique identifiers.

Simplified API:

aidd.add_column(
    name="order_id",
    type="uuid",
    params={
        "prefix": "ORD-",  # Optional: adds a prefix
        "short_form": True,  # Optional: uses a shorter format
        "uppercase": True  # Optional: uses uppercase letters
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="order_id",
        type=P.SamplerType.UUID,
        params=P.UUIDSamplerParams(
            prefix="ORD-",  # Optional: adds a prefix
            short_form=True,  # Optional: uses a shorter format
            uppercase=True  # Optional: uses uppercase letters
        )
    )
)
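The three options can be pictured with a small standard-library sketch. The exact output format Data Designer produces is not specified here, so the behavior below is an assumption: in particular, treating "short form" as the first 8 hex characters is a guess made for illustration, and make_id is a hypothetical helper.

```python
import uuid


def make_id(prefix="", short_form=False, uppercase=False):
    """Build an identifier string from a random UUID (hypothetical helper)."""
    token = uuid.uuid4().hex  # 32 lowercase hex characters, no dashes
    if short_form:
        token = token[:8]  # Assumption: "short" means the first 8 hex characters
    if uppercase:
        token = token.upper()
    return f"{prefix}{token}"


order_id = make_id(prefix="ORD-", short_form=True, uppercase=True)
print(order_id)
```

A shortened identifier carries fewer random bits than a full UUID, so collisions become more likely as the dataset grows.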

Numerical Samplers

Uniform Distribution

Simplified API:

aidd.add_column(
    name="product_rating",
    type="uniform",
    params={"low": 1, "high": 5},
    convert_to="int"  # Optional: converts to integer
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="product_rating",
        type=P.SamplerType.UNIFORM,
        params=P.UniformSamplerParams(low=1, high=5),
        convert_to="int"  # Optional: converts to integer
    )
)

Gaussian Distribution

Simplified API:

aidd.add_column(
    name="item_weight",
    type="gaussian",
    params={"mean": 50, "stddev": 10}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="item_weight",
        type=P.SamplerType.GAUSSIAN,
        params=P.GaussianSamplerParams(mean=50, stddev=10)
    )
)

Poisson Distribution

Simplified API:

aidd.add_column(
    name="number_of_pets",
    type="poisson",
    params={"mean": 2}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="number_of_pets",
        type=P.SamplerType.POISSON,
        params=P.PoissonSamplerParams(mean=2)
    )
)

Bernoulli Distribution

Simplified API:

aidd.add_column(
    name="is_in_stock",
    type="bernoulli",
    params={"p": 0.8}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="is_in_stock",
        type=P.SamplerType.BERNOULLI,
        params=P.BernoulliSamplerParams(p=0.8)
    )
)

Bernoulli Mixture Distribution

Simplified API:

aidd.add_column(
    name="bern_exp",
    type="bernoulli_mixture",
    params={"p": 0.4, "dist_name": "expon", "dist_params": {"scale": 10}}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="bern_exp",
        type=P.SamplerType.BERNOULLI_MIXTURE,
        params=P.BernoulliMixtureSamplerParams(p=0.4, dist_name="expon", dist_params={"scale": 10})
    )
)
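The page does not spell out the mixture's semantics, so the following sketch rests on an assumption: with probability p a value is drawn from the named distribution, and otherwise the value is 0. It uses p = 0.4 and an exponential distribution with scale 10 (mean 10), written with the standard library instead of scipy. The helper bernoulli_mixture is hypothetical.

```python
import random


def bernoulli_mixture(p, scale, rng):
    """Assumed semantics: Exponential(scale) with probability p, else 0."""
    if rng.random() < p:
        # expovariate takes a rate, which is 1 / scale
        return rng.expovariate(1.0 / scale)
    return 0.0


rng = random.Random(0)
samples = [bernoulli_mixture(0.4, 10, rng) for _ in range(10_000)]

zero_share = sum(1 for s in samples if s == 0.0) / len(samples)
print(f"share of zeros: {zero_share:.2f}")
```

Under this reading, the column is useful for quantities that are often exactly zero and otherwise continuous, such as a discount amount.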

Binomial Distribution

Simplified API:

aidd.add_column(
    name="items_returned",
    type="binomial",
    params={"n": 10, "p": 0.1}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="items_returned",
        type=P.SamplerType.BINOMIAL,
        params=P.BinomialSamplerParams(n=10, p=0.1)
    )
)
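A binomial draw counts the successes in n independent trials that each succeed with probability p. The sketch below shows that relationship with the standard library; it illustrates the distribution, not Data Designer's sampler, and the helper name binomial is hypothetical.

```python
import random


def binomial(n, p, rng):
    """Count successes across n independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


rng = random.Random(0)
samples = [binomial(10, 0.1, rng) for _ in range(10_000)]

mean = sum(samples) / len(samples)
# The theoretical mean is n * p = 1.0; the sample mean should be close to it.
print(f"mean items returned: {mean:.2f}")
```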

Scipy Sampler

Use this sampler to access any statistical methods available in scipy.stats.

Simplified API:

aidd.add_column(
    name="log_gaussian", 
    type="scipy", 
    params={
        "dist_name": "lognorm", 
        "dist_params": {
            "s": 0.9,   # sigma 
            "scale": 8, # exp(mean) 
        }
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="log_gaussian",
        type=P.SamplerType.SCIPY,
        params=P.ScipySamplerParams(dist_name="lognorm", dist_params={"s": 0.9, "scale": 8})
    )
)
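The comments in the example map the lognorm parameters to sigma and exp(mean). As a check on that mapping, the sketch below draws from the same distribution with the standard library: a log-normal variable with shape s and scale is one whose logarithm is normal with mean ln(scale) and standard deviation s. The variable names are illustrative only.

```python
import math
import random
import statistics

s = 0.9    # shape parameter: sigma of the underlying normal
scale = 8  # scale parameter: exp(mu) of the underlying normal

mu = math.log(scale)
sigma = s

rng = random.Random(0)
samples = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]

# The median of a log-normal distribution equals exp(mu), i.e. the scale.
median = statistics.median(samples)
print(f"sample median: {median:.2f} (scale = {scale})")
```

Because the median equals the scale, setting scale is a convenient way to place the "typical" value, while s controls how heavy the right tail is.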

Date and Time

DateTime

Simplified API:

aidd.add_column(
    name="order_date",
    type="datetime",
    params={"start": "2023-01-01", "end": "2023-12-31"}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="order_date",
        type=P.SamplerType.DATETIME,
        params=P.DatetimeSamplerParams(start="2023-01-01", end="2023-12-31")
    )
)

TimeDelta

Simplified API:

aidd.add_column(
    name="delivery_date",
    type="timedelta",
    params={
        "dt_min": 1,  # Minimum days
        "dt_max": 7,  # Maximum days
        "reference_column_name": "order_date"  # Reference date column
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="delivery_date",
        type=P.SamplerType.TIMEDELTA,
        params=P.TimedeltaSamplerParams(
            dt_min=1,  # Minimum days
            dt_max=7,  # Maximum days
            reference_column_name="order_date"  # Reference date column
        )
    )
)
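The way a timedelta column builds on a datetime column can be sketched as follows. This is a plain-Python illustration with hypothetical helpers; it assumes the offset is a whole number of days with both bounds inclusive, which the page does not confirm.

```python
import random
from datetime import datetime, timedelta


def sample_datetime(start, end, rng):
    """Draw a timestamp uniformly between two ISO dates."""
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    span_seconds = (end_dt - start_dt).total_seconds()
    return start_dt + timedelta(seconds=rng.uniform(0, span_seconds))


def sample_offset(reference, dt_min, dt_max, rng):
    """Add a random number of days to the reference timestamp."""
    # Assumption: whole days, with dt_min and dt_max both inclusive
    return reference + timedelta(days=rng.randint(dt_min, dt_max))


rng = random.Random(0)
order_date = sample_datetime("2023-01-01", "2023-12-31", rng)
delivery_date = sample_offset(order_date, 1, 7, rng)

print("order_date:   ", order_date.isoformat())
print("delivery_date:", delivery_date.isoformat())
```

Deriving the second column from the first keeps the pair consistent: a delivery date can never precede its order date.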

Person

Defines person samplers that create realistic person entities.

Simplified API:

# Define person samplers
aidd.with_person_samplers(
    {
        "customer": {
            "sex": "Female",  # Optional
            "locale": "en_US"  # Optional
        }
    }
)

Typed API:

# Define person samplers
aidd.with_person_samplers(
    {
        "customer": P.PersonSamplerParams(
            sex="Female",  # Optional
            locale="en_US",  # Optional
        )
    }
)

Expression Columns

The Expression column type computes values using expressions involving other columns.

Basic Expressions

Simplified API:

aidd.add_column(
    name="final_price",
    type="expression",
    expr="{{ base_price }} * {{ new_price }}" # Or "{{ base_price * new_price }}"
) 

Typed API:

aidd.add_column(
    C.ExpressionColumn(
        name="total_price",
        expr="{{quantity}} * {{unit_price}}" # Or "{{ quantity * unit_price }}"
    )
)

Person Attribute Expressions

Simplified API:

aidd.with_person_samplers(
    {
        "customer": P.PersonSamplerParams(),
    },
)

aidd.add_column(
    name="customer_full_name",
    type="expression",
    params={"expr": "{{ customer.first_name }} {{ customer.last_name }}"}
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="person",  # This creates a nested object with all person attributes
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
            age_range=[22, 65],
            state="CA"
        )
    )
)
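The dotted references in an expression such as "{{ customer.first_name }} {{ customer.last_name }}" read attributes from the nested person object. The sketch below imitates only that lookup with a small regular expression. It is not a template engine and not how Data Designer evaluates expressions; the render helper and the sample row are made up for illustration.

```python
import re

# Matches Jinja-style placeholders such as {{ customer.first_name }}
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, row):
    """Replace each placeholder with the value found by following its dotted path."""

    def lookup(match):
        value = row
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Made-up row: a nested "customer" object with person attributes
row = {"customer": {"first_name": "Ada", "last_name": "Lovelace"}}

full_name = render("{{ customer.first_name }} {{ customer.last_name }}", row)
print(full_name)
```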

LLM-Based Column Types

LLM Generated Content

Generates text data using large language models based on prompts.

There are three types of LLM columns: llm-text, llm-code, and llm-structured.

The default type is llm-text. If you are generating code using an LLM, use the type llm-code and use output_format to specify the code language for formatting. If you are defining structured outputs for the LLM responses, use llm-structured and provide a Pydantic or JSON schema to the output_format argument. For details on using Structured Outputs for LLM generated content, read the Structured Outputs section.

Simplified API:

aidd.add_column(
    name="product_description",
    type="llm-text",  # "llm-code", "llm-structured"
    model_alias="text",  # Optional (default: text)
    prompt="Generate a detailed description for a {{product_category}} product.",
    system_prompt="You are a professional product copywriter.",  # Optional 
    # output_format=".." # Optional
)

Typed API:

aidd.add_column(
    C.LLMGenColumn(
        name="product_description",
        output_type="text",  # "code", "structured"
        model_alias="text",
        prompt="Generate a detailed description for a {{product_category}} product.",
        system_prompt="You are a professional product copywriter.",  # Optional
        # output_format=".." # Optional
    )
)

Data Designer supports text, code, and judge as default model aliases. An llm-judge column uses the judge alias by default. You can define your own custom model aliases with the generation parameters you want; learn more about how to do that in the model configuration section.

LLM Judge

Evaluates data quality using large language models.

Simplified API:

from gretel_client.data_designer.judge_rubrics import TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE, PYTHON_RUBRICS

aidd.add_column(
    name="code_quality",
    type="llm-judge",
    prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
    rubrics=PYTHON_RUBRICS
)

Typed API:

aidd.add_column(
    C.LLMJudgeColumn(
        name="code_quality",
        prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
        rubrics=PYTHON_RUBRICS
    )
)

Code Validation

Validates code in another column.

Simplified API:

aidd.add_column(
    name="code_validation_result",
    type="code-validation",
    code_lang="python",  # Language to validate
    target_column="code_implementation"  # Column containing code
)

Typed API:

aidd.add_column(
    C.CodeValidationColumn(
        name="code_validation_result",
        code_lang="python",  # Language to validate
        target_column="code_implementation"  # Column containing code
    )
)
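As a rough picture of what validating a Python snippet can involve, the sketch below checks only whether the code parses. Data Designer's validator may run other or additional checks; the helper validate_python and the result field names is_valid and error are hypothetical.

```python
import ast


def validate_python(code):
    """Return a small result record saying whether the code parses."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"is_valid": False, "error": f"line {exc.lineno}: {exc.msg}"}
    return {"is_valid": True, "error": None}


good = validate_python("def add(a, b):\n    return a + b\n")
bad = validate_python("def add(a, b)\n    return a + b\n")  # missing colon

print(good)
print(bad)
```

A parse check catches syntax errors only. Code that parses can still fail at runtime, which is why a validation column is often paired with an LLM judge column.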

Using Conditional Parameters

The Data Designer supports conditional parameters that change based on other column values:

Simplified API:

aidd.add_column(
    name="pet_type",
    type="category",
    params={"values": ["dog", "cat", "fish"], "weights": [0.5, 0.3, 0.2]},
    conditional_params={
        "number_of_pets == 0": {"values": ["none"]}
    }
)

Typed API:

aidd.add_column(
    C.SamplerColumn(
        name="pet_type",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(values=["dog", "cat", "fish"], weights=[0.5, 0.3, 0.2]),
        conditional_params={
            "number_of_pets == 0": P.CategorySamplerParams(values=["none"])
        }
    )
)
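The selection logic behind conditional parameters can be sketched in plain Python: evaluate each condition against the row built so far and fall back to the default parameters when none match. This is an illustration under assumptions, not Data Designer's implementation. In particular, "first matching condition wins" is assumed, the helpers are hypothetical, and eval is used only to keep the sketch short.

```python
import random

BASE_PARAMS = {"values": ["dog", "cat", "fish"], "weights": [0.5, 0.3, 0.2]}
CONDITIONAL_PARAMS = {"number_of_pets == 0": {"values": ["none"]}}


def resolve_params(row, base, conditional):
    """Return the params of the first condition that holds for this row."""
    for condition, params in conditional.items():
        # Illustration only: evaluate with no builtins and the row as variables
        if eval(condition, {"__builtins__": {}}, row):
            return params
    return base


def sample_pet_type(row, rng):
    params = resolve_params(row, BASE_PARAMS, CONDITIONAL_PARAMS)
    return rng.choices(params["values"], weights=params.get("weights"), k=1)[0]


rng = random.Random(0)
no_pets = sample_pet_type({"number_of_pets": 0}, rng)
two_pets = sample_pet_type({"number_of_pets": 2}, rng)

print(no_pets)   # always "none", because the condition matches
print(two_pets)  # one of "dog", "cat", "fish"
```

The condition refers to number_of_pets, so that column has to exist in the row before pet_type is sampled.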

Reference Table

| Simplified API Type | Typed API Equivalent | Description |
| --- | --- | --- |
| "category" | P.SamplerType.CATEGORY | Categorical values |
| "subcategory" | P.SamplerType.SUBCATEGORY | Dependent categories |
| "uuid" | P.SamplerType.UUID | Unique identifiers |
| "uniform" | P.SamplerType.UNIFORM | Uniform distribution |
| "gaussian" | P.SamplerType.GAUSSIAN | Normal distribution |
| "poisson" | P.SamplerType.POISSON | Poisson distribution |
| "bernoulli" | P.SamplerType.BERNOULLI | Binary outcomes |
| "binomial" | P.SamplerType.BINOMIAL | Number of successes |
| "datetime" | P.SamplerType.DATETIME | Date/time values |
| "timedelta" | P.SamplerType.TIMEDELTA | Time intervals |
| "expression" | C.ExpressionColumn | Computed expressions |
| "llm-text" | C.LLMTextColumn | LLM-generated text content |
| "llm-structured" | C.LLMStructuredColumn | LLM-generated structured content |
| "llm-code" | C.LLMCodeColumn | LLM-generated code content |
| "llm-judge" | C.LLMJudgeColumn | LLM-based evaluation |
| "code-validation" | C.CodeValidationColumn | Code validation |

Choosing the Right Approach

Data Designer offers flexibility in how you define your columns. Both approaches are fully supported, so you can choose the style that best fits your needs.

Key points to remember:

  1. Same functionality: Both approaches provide access to the same features

  2. Interchangeable: You can mix both styles in the same project

  3. Simplified == concise: The simplified API is more concise

  4. Typed == safer: The typed API offers better IDE support and type checking

For quick experiments, the simplified API might be more convenient. For larger projects, the additional safety of the typed API can help prevent errors.
