Generate Realistic Personal Details

Person Objects in Data Designer

Data Designer can generate realistic synthetic person data: individuals with complete demographic profiles, including names, contact information, addresses, and more. These synthetic personas are useful for tasks ranging from testing user databases to populating applications with realistic sample data.

Creating Person Samplers

Person samplers generate realistic person entities with various attributes. You can create them using the with_person_samplers method:

aidd.with_person_samplers(
    {
        "customer": {"sex": "Female", "locale": "en_US"},
        "employee": {"sex": "Female", "locale": "en_GB"},
        "random_person": {},  # Default settings
    },
    keep_person_columns=True,  # False by default
)

Each sampler creates a different person object that you can reference throughout your data design.

Configuration Options

Person samplers accept these configuration parameters:

sex: Specify "Male" or "Female" (optional)
locale: Language and region code (optional, e.g., "en_US", "fr_FR", "de_DE")
city: City within the specified locale (optional)
age_range: Age range for filtering (default: ages above 18 only)
state: US state code, only valid when locale is set to "en_US" (optional)
keep_person_columns (default: False): An argument to with_person_samplers itself rather than a key in an individual sampler's configuration. When set to False, all person columns are dropped from the final dataset.

Note: When using a US locale ("en_US"), you can filter on age range, sex, city, and state. For non-US locales, filtering is limited to age range, sex, and city only.

You can choose either city or state when filtering, not both.
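
The filtering rules above can be checked before a configuration is passed to with_person_samplers. The following is a minimal sketch, not part of the Data Designer API: the helper name validate_sampler_config and its messages are invented for illustration. It also assumes that state requires locale to be set explicitly to "en_US", because this page does not say what the default locale is.

```python
# Hypothetical helper for illustration; Data Designer performs its own validation.
ALLOWED_KEYS = {"sex", "locale", "city", "age_range", "state"}


def validate_sampler_config(config: dict) -> list:
    """Return a list of problems found in one person-sampler config dict."""
    problems = []
    unknown = set(config) - ALLOWED_KEYS
    if unknown:
        problems.append(f"unknown keys: {sorted(unknown)}")
    if "sex" in config and config["sex"] not in ("Male", "Female"):
        problems.append('sex must be "Male" or "Female"')
    if "city" in config and "state" in config:
        problems.append("choose either city or state, not both")
    # Assumption: a missing locale is treated as "not en_US" for this check.
    if "state" in config and config.get("locale") != "en_US":
        problems.append('state is only valid with locale "en_US"')
    return problems


ok = validate_sampler_config({"sex": "Female", "locale": "en_US"})
conflict = validate_sampler_config({"locale": "en_US", "city": "Austin", "state": "TX"})
```

An empty list means the configuration is consistent with the rules listed in this section; it does not guarantee that the service will accept it.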

Locale Support and Data Quality

Important Quality Difference Between Locales:

US Locale (en_US): For locale="en_US", Data Designer uses Gretel's proprietary Probabilistic Generative Model (PGM), trained on US census demographic data. The PGM preserves relationships between attributes (e.g., age, occupation, education level), so the generated profiles are realistic, demographically accurate, and internally coherent.
Other Locales: For non-US locales, Data Designer falls back to the Faker library. Faker produces reasonable values for basic attributes such as names and addresses, but it does not maintain the same level of demographic accuracy or attribute relationships as the PGM, so data quality is notably lower than for US-based personas.

If demographic accuracy and realism are important for your use case, consider using the en_US locale whenever possible.

Examples

US-Based Realistic Personas

aidd.with_person_samplers({
    "us_customer": {"locale": "en_US", "sex": "Female"}
})

This will generate high-quality, demographically accurate US-based person data using the PGM.

International Personas (Faker-based)

aidd.with_person_samplers({
    "french_customer": {"locale": "fr_FR"},
    "german_customer": {"locale": "de_DE"},
    "spanish_customer": {"locale": "es_ES"}
})

These will use Faker to generate person data for the respective locales.
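
When many locales are needed, the sampler dictionary can be built programmatically. This sketch only constructs the plain dictionary and records which generator this page says each locale uses. The data_source helper and the sampler names are hypothetical, and the final call is left as a comment because it needs a configured Data Designer instance.

```python
# Locale codes come from the examples on this page; the sampler names are arbitrary.
locales = {"us": "en_US", "french": "fr_FR", "german": "de_DE", "spanish": "es_ES"}

# One sampler config per locale, in the dictionary shape shown above.
samplers = {f"{name}_customer": {"locale": code} for name, code in locales.items()}


def data_source(locale: str) -> str:
    """Generator this page describes for a locale (hypothetical helper)."""
    return "PGM" if locale == "en_US" else "Faker"


sources = {name: data_source(cfg["locale"]) for name, cfg in samplers.items()}

# aidd.with_person_samplers(samplers)  # requires an initialized Data Designer instance
```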

Accessing Person Attributes

Person objects have many attributes you can reference in your data generation:

Field Name         Type        Description
-----------------  ----------  ------------------------------------------------------------
first_name         str         Person's first name
middle_name        str | None  Person's middle name
last_name          str         Person's last name
sex                Sex         Person's sex (enum type)
age                int         Person's age
zipcode            str         Zipcode/Postal Code
street_number      int | str   Street number (can be numeric or alphanumeric)
street_name        str         Name of the street
unit               str         Unit/apartment number (US locale only)
city               str         City name
state              str | None  State (US locale only)
county             str | None  County (US locale only)
country            str         Country name
ethnic_background  str | None  Ethnic background (US locale only)
marital_status     str | None  Marital status
education_level    str | None  Education level
bachelors_field    str | None  Field of bachelor's degree
occupation         str | None  Occupation
uuid               str | None  Unique identifier
locale             str         Locale setting
phone_number       str | None  Generated phone number based on location (None for age < 18)
email_address      str | None  Generated email address (None for age < 18)
birth_date         date        Calculated birth date based on age
ssn                str | None  SSN (US locale only)
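
Several of these fields are typed str | None, so code that consumes person data should tolerate missing values. The sketch below uses plain dictionaries with made-up values to stand in for person records. How Data Designer exposes records to your own code is not covered here, and contact_line is a hypothetical helper.

```python
def contact_line(person: dict) -> str:
    """Build a one-line contact summary, tolerating None in optional fields."""
    name_parts = [person["first_name"], person.get("middle_name"), person["last_name"]]
    name = " ".join(part for part in name_parts if part)
    # phone_number and email_address are None for persons under 18.
    email = person.get("email_address") or "no email"
    phone = person.get("phone_number") or "no phone"
    return f"{name} <{email}> ({phone})"


# Made-up records for illustration only.
adult = {"first_name": "Ana", "middle_name": None, "last_name": "Silva",
         "email_address": "ana.silva@example.com", "phone_number": "555-0100"}
minor = {"first_name": "Leo", "middle_name": "J", "last_name": "Silva",
         "email_address": None, "phone_number": None}

adult_line = contact_line(adult)
minor_line = contact_line(minor)
```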

Using Person Data in Columns

There are two main ways to use person data in your dataset:

1. Creating Columns from Person Attributes

Extract specific attributes from a person into separate columns:

aidd.add_column(
    name="first_name",
    type="expression",
    expr="{{customer.first_name}}"
)

aidd.add_column(
    name="last_name",
    type="expression",
    expr="{{customer.last_name}}"
)

aidd.add_column(
    name="email",
    type="expression",
    expr="{{customer.email_address}}"
)

2. Referencing Person Attributes in Prompts

Use person attributes in prompt templates for LLM-generated columns:

aidd.add_column(
    name="customer_profile",
    prompt="""
    Create a customer profile summary for:
    Name: {{customer.first_name}} {{customer.last_name}}
    Age: {{customer.age}}
    Occupation: {{customer.occupation}}
    Education: {{customer.education_level}}
    
    The summary should be professional and highlight their background and potential needs.
    """
)
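
Both approaches use the same dotted reference syntax, {{sampler_name.attribute}}. To make the lookup semantics concrete, here is a toy renderer built only on the Python standard library. It is an assumption about how such references resolve, not Data Designer's actual template engine, and the render function and sample values are invented for illustration.

```python
import re
from types import SimpleNamespace

# Matches {{name}} or {{name.attr}} with optional inner whitespace.
_REF = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{a.b}} reference with the matching value from context."""
    def lookup(match):
        head, *rest = match.group(1).split(".")
        value = context[head]          # e.g. the "customer" person object
        for attr in rest:              # e.g. "first_name"
            value = getattr(value, attr)
        return str(value)
    return _REF.sub(lookup, template)


# Stand-in for a sampled person; real person objects carry the fields listed above.
customer = SimpleNamespace(first_name="Ada", last_name="Lovelace", age=36)

full_name = render("{{customer.first_name}} {{customer.last_name}}", {"customer": customer})
prompt_line = render("Age: {{ customer.age }}", {"customer": customer})
```

Here full_name evaluates to "Ada Lovelace" and prompt_line to "Age: 36", which mirrors how an expression column and a prompt line would be filled in for one record.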

Complete Example

Here's a full example showing person sampler usage with locale differences highlighted:

from gretel_client.navigator_client import Gretel

# Initialize Gretel client
gretel = Gretel(api_key="YOUR_API_KEY")

# Create a new Data Designer instance
aidd = gretel.data_designer.new(model_suite="apache-2.0")

# Create person samplers - note the different locales
aidd.with_person_samplers({
    "us_customer": {"sex": "Female", "locale": "en_US"},  # Uses PGM for high-quality data
    "intl_customer": {"sex": "Male", "locale": "fr_FR"}   # Uses Faker as fallback
})

# Add a unique customer ID
aidd.add_column(
    name="customer_id",
    type="uuid",
    params={"prefix": "CUST-"}
)

# US customer (PGM-based)
aidd.add_column(
    name="us_customer_name",
    type="expression",
    expr="{{us_customer.first_name}} {{us_customer.last_name}}"
)

aidd.add_column(
    name="us_customer_email",
    type="expression",
    expr="{{us_customer.email_address}}"
)

aidd.add_column(
    name="us_customer_location",
    type="expression",
    expr="{{us_customer.city}}, {{us_customer.state}}"
)

aidd.add_column(
    name="us_customer_demographics",
    type="expression",
    expr="{{us_customer.education_level}}/{{us_customer.occupation}}"
)

# International customer (Faker-based)
aidd.add_column(
    name="intl_customer_name",
    type="expression",
    expr="{{intl_customer.first_name}} {{intl_customer.last_name}}"
)

aidd.add_column(
    name="intl_customer_location",
    type="expression",
    expr="{{intl_customer.city}}, {{intl_customer.country}}"
)

# Add a support scenario category
aidd.add_column(
    name="support_scenario",
    type="category",
    params={
        "values": ["Account Access", "Billing Issue", "Technical Problem", "Feature Request"]
    }
)

# Generate a comparative customer support interaction
aidd.add_column(
    name="support_conversation",
    prompt="""
    Generate a support conversation snippet between two customers and a support agent.
    
    US Customer: {{us_customer.first_name}} {{us_customer.last_name}}
    US Customer Location: {{us_customer.city}}, {{us_customer.state}}
    US Customer Demographics: {{us_customer_demographics}}
    
    International Customer: {{intl_customer.first_name}} {{intl_customer.last_name}}
    International Customer Location: {{intl_customer.city}}, {{intl_customer.country}}
    
    Support Scenario: {{support_scenario}}
    
    Write a realistic support conversation where both customers experience the same {{support_scenario}}
    but have slightly different needs based on their backgrounds and locations.
    """
)

# Preview the results
preview = aidd.preview()
preview.display_sample_record()

Best Practices for Person Samplers

Use en_US for Maximum Quality: When demographic accuracy is important, prefer the US locale to leverage the high-quality PGM.
Create Multiple Personas: Generate different personas for different roles in your data scenarios (e.g., customers, employees, support agents).
Use Filters: Filter person objects based on sex, location, and age.
Test Different Locales: If you need international data, test the Faker-generated attributes to ensure they meet your quality requirements.
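
One simple way to test locale quality is to measure how often each attribute comes back empty. The sketch below computes the share of None values per field over a list of plain dictionaries. The sample records are made up, and how you extract records from a preview is outside the scope of this page, so null_rate_by_field is a hypothetical helper rather than part of Data Designer.

```python
def null_rate_by_field(records: list) -> dict:
    """Return, for each field, the fraction of records where it is None or absent."""
    if not records:
        return {}
    fields = sorted({key for record in records for key in record})
    return {
        field: sum(record.get(field) is None for record in records) / len(records)
        for field in fields
    }


# Made-up records resembling non-US personas, where US-only fields are None.
sample = [
    {"first_name": "Marie", "state": None, "occupation": "Engineer"},
    {"first_name": "Jean", "state": None, "occupation": None},
]
rates = null_rate_by_field(sample)
```

A field with a rate near 1.0 is effectively unavailable for that locale and should not be referenced in prompts or expression columns.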
