Gretel Text Fine-Tuning

Model type: Generative pre-trained transformer for text generation

Gretel Text Fine-Tuning simplifies the process of training popular Large Language Models (LLMs) to generate synthetic text. It supports differentially private training to protect sensitive data and includes automated quality reporting with Gretel's Text Synthetic Quality Score (SQS). Use it to create labeled examples for training or testing other machine learning models, to fine-tune on your own data, or to prompt the fine-tuned model with examples at inference time.

Step configuration

The config below shows all the available training and generation parameters for Text Fine-Tuning. It is best to use Text Fine-Tuning on datasets with only a single column of free text.

If your training dataset is in a multi-column format, you MUST set column_name when using Text Fine-Tuning (see the example after the reference config below).

schema_version: "1.0"
name: default
task:
  name: text_ft
  config:
    train:
      pretrained_model: "gretelai/gpt-auto"
      prompt_template: null
      column_name: null
      validation: null
      params:
        batch_size: 4
        epochs: null
        steps: 750
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: "linear"
        learning_rate: 0.0001
        max_tokens: 512
        gradient_accumulation_steps: 8
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1.0
        target_modules: null
      privacy_params:
        dp: false
        epsilon: 8.0
        delta: "auto"
        per_sample_max_grad_norm: 1.0
        entity_column_name: null
    generate:
      num_records: 80
      maximum_text_length: 100
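
As a concrete illustration, here is a minimal sketch of a Text Fine-Tuning training step for a multi-column dataset. The column name (review_text) and the choice of base model are illustrative assumptions, and omitted parameters are assumed to fall back to the defaults shown above.

schema_version: "1.0"
name: text-ft-reviews
task:
  name: text_ft
  config:
    train:
      # Any of the supported base models listed under Train parameters can be used here.
      pretrained_model: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
      # Required because the input dataset has multiple columns:
      column_name: "review_text"
      params:
        batch_size: 4
        max_tokens: 512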

Train parameters

  • pretrained_model (optional, defaults to gretelai/gpt-auto) - Base model used for fine-tuning. These are the models currently supported:

    • gretelai/gpt-auto - defaults to meta-llama/Meta-Llama-3-8B-Instruct

    • mistralai/Mistral-7B-Instruct-v0.2

    • TinyLlama/TinyLlama-1.1B-Chat-v1.0

    • meta-llama/Llama-3.1-8B-Instruct

  • column_name (optional) - Name of the column that contains the training text. This parameter is required when a multi-column dataset is used.

  • params - Parameters that control the model training process:

    • batch_size (optional, default 4) - Batch size per GPU/TPU/CPU. Lower if out of memory.

    • epochs (optional, default null) - Number of training epochs. In the reference config above, training length is set via steps: 750 rather than epochs.

    • weight_decay (optional, default 0.01) - Weight decay for the AdamW optimizer. Must be between 0 and 1.

    • warmup_steps (optional, default 100) - Warmup steps for linear lr increase.

    • lr_scheduler (optional, default linear) - Learning rate scheduler type.

    • learning_rate (optional, default 0.0001) - Initial learning rate for the AdamW optimizer.

    • max_tokens (optional, default 512) - Max input length in tokens.

    • validation (optional) - Validation set size; an integer value is interpreted as an absolute number of samples.

    • gradient_accumulation_steps (optional, default 8) - Number of update steps to accumulate gradients over before performing a backward/update pass. This technique increases the effective batch size without requiring more GPU memory, as illustrated below.
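
As a quick worked example of how these two settings interact (a sketch using the defaults from the reference config above):

      params:
        batch_size: 4                   # records processed per device at a time
        gradient_accumulation_steps: 8  # accumulate gradients over 8 batches before each update
        # Effective batch size per optimizer update: 4 x 8 = 32 records,
        # while only 4 records need to fit in GPU memory at once.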

  • peft_params - Gretel Text Fine-Tuning uses Low-Rank Adaptation (LoRA) of LLMs, which makes fine-tuning more efficient by drastically reducing the number of trainable parameters: only the weights of small low-rank decomposition matrices are updated.

    • lora_r (optional, default 8) - Rank of the matrices that are updated. A lower value means fewer trainable model parameters.

    • lora_alpha_over_r (optional, default 1) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, values of 0.5, 1 or 2 work well.

    • target_modules (optional, default null) - List of module names or regex expression of the module names to replace with LoRA. When unspecified, modules will be chosen according to the model architecture (e.g. Mistral, Llama).
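
For example, since lora_alpha_over_r is defined as the ratio of the LoRA scaling factor (alpha) to the rank, a sketch with slightly stronger LoRA scaling would look like the following (the values are illustrative, not recommendations):

      peft_params:
        lora_r: 8                # rank of the low-rank update matrices
        lora_alpha_over_r: 2.0   # implies LoRA alpha = 2.0 x 8 = 16; updates are scaled by alpha / r = 2.0
        target_modules: null     # modules chosen automatically based on the model architecture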

  • privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.

    • dp (optional, default false) - Flag to turn on differentially private fine-tuning when a data source is provided.

    • epsilon (optional, default 8) - Privacy loss parameter for differential privacy. This is the maximum privacy budget available for model fine-tuning.

    • delta (optional, default auto) - Probability of accidentally leaking information. It is typically set to be much less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.2. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.

    • entity_column_name (optional, default null) - Column representing the unit of privacy, e.g. name or id. When null, record-level differential privacy is maintained, i.e. the final model does not change by much when the input dataset changes by one record. When specified as e.g. user_id, user-level differential privacy is maintained.
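
Putting the privacy settings together, here is a hedged sketch of differentially private fine-tuning with user-level privacy for a 500-record dataset. The user_id column and the explicit delta are illustrative assumptions; delta: "auto" is the default and usually a reasonable starting point.

      privacy_params:
        dp: true
        epsilon: 8.0
        # 1 / n^2 with n = 500 training records: 1 / 250000 = 0.000004
        delta: 0.000004
        per_sample_max_grad_norm: 1.0
        entity_column_name: "user_id"   # user-level rather than record-level privacy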

Generate parameters

  • num_records (optional, default 80) - Number of output records

  • maximum_text_length (optional, default 100) - Max tokens per output record
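
For instance, to generate more and longer records than the defaults, the generate section can be adjusted directly (the values below are illustrative):

    generate:
      num_records: 500            # number of synthetic text records to generate
      maximum_text_length: 256    # allow up to 256 tokens per record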

Limitations and Biases

Large language models such as those fine-tuned with Gretel Text Fine-Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
