Gretel GPT

Model type: Generative pre-trained transformer for text generation

Gretel GPT simplifies the process of training popular Large Language Models (LLMs) to generate synthetic text. It offers support for differentially private training, ensuring data privacy, and includes automated quality reporting with Gretel's Text Synthetic Quality Score (SQS). This allows you to create labeled examples to train or test other machine learning models, fine-tune the model on your data, or prompt it with examples for inference.

Model creation

To prompt the base model directly without fine-tuning, set data_source to null at initialization.
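
For example, a configuration along the lines of the sketch below (the name is arbitrary; the fields follow the schema of the full example further down) skips fine-tuning entirely and generates text from the base model's weights:

schema_version: "1.0"
name: "base-model-prompting"
models:
  - gpt_x:
      data_source: null              # no fine-tuning data; use the base LLM weights as-is
      pretrained_model: "gretelai/gpt-auto"
      generate:
        num_records: 10              # number of outputs to generate
        maximum_text_length: 100     # maximum tokens per output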

When fine-tuning Gretel GPT models, these constraints apply:

  1. Use 100 or more examples if possible. If you have fewer than 100 examples, prompt the base model directly instead of fine-tuning.

  2. Providing only 1-5 records will cause an error.

  3. If your training dataset is in a multi-column format, you MUST set column_name.

schema_version: "1.0"
name: "natural-language-gpt"
models:
  - gpt_x:
      data_source:
        - "https://blueprints.gretel.cloud/sample_data/sample-banking-questions-intents.csv"
      pretrained_model: "gretelai/gpt-auto"
      prompt_template: null
      column_name: null
      validation: null
      params:
        batch_size: 4
        epochs: null
        steps: 750
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: "linear"
        learning_rate: 0.0001
        max_tokens: 512
        gradient_accumulation_steps: 8
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1.0
        target_modules: null
      privacy_params:
        dp: false
        epsilon: 8.0
        delta: "auto"
        per_sample_max_grad_norm: 1.0
        entity_column_name: null
      generate:
        num_records: 80
        seed_records_multiplier: 1
        maximum_text_length: 100
        top_p: 0.8987601335810778
        top_k: 43
        num_beams: 1
        do_sample: true
        do_early_stopping: true
        typical_p: 0.8
        temperature: null

  • data_source (required) - Use __tmp__ or point to a valid CSV, JSON, or JSONL file. Leave blank or set to null to skip fine-tuning and use the base LLM weights for few-shot or zero-shot generation.

  • pretrained_model (optional, defaults to mistralai/Mistral-7B-Instruct-v0.2) - Gretel supports PEFT and LoRA for fast adaptation of pretrained LLMs. Use a causal language model from the HuggingFace model repository.

  • column_name (optional) - Name of the column containing the training text when the input has multiple columns. This parameter is required for multi-column inputs (see the sketch at the end of this list).

  • params - Controls the model training process.

    • batch_size (optional, default 4) - Batch size per GPU/TPU/CPU. Lower if out of memory.

    • epochs (optional, default 3) - Number of training epochs.

    • steps (optional, default 750) - Total number of training steps to perform. In the example configuration above, steps is set and epochs is left null.

    • weight_decay (optional, default 0.01) - Weight decay for AdamW optimizer. 0 to 1.

    • warmup_steps (optional, default 100) - Warmup steps for linear lr increase.

    • lr_scheduler (optional, default linear) - Learning rate scheduler type.

    • learning_rate (optional, default 0.0002) - Initial AdamW learning rate.

    • max_tokens (optional, default 512) - Max input length in tokens.

    • validation (optional) - Validation set size; an integer is interpreted as the absolute number of samples. In the example configuration above, this field appears at the model level, alongside data_source.

    • gradient_accumulation_steps (optional, default 8) - Number of steps over which to accumulate gradients before performing a backward/update pass. This technique increases the effective batch size beyond what would otherwise fit into GPU memory.

  • peft_params - Gretel GPT uses Low-Rank Adaptation (LoRA) of LLMs, which makes fine-tuning more efficient by drastically reducing the number of trainable parameters by updating weights of smaller matrices through low-rank decomposition.

    • lora_r (optional, default 8) - Rank of the matrices that are updated. A lower value means fewer trainable model parameters.

    • lora_alpha_over_r (optional, default 1) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, values of 0.5, 1 or 2 work well.

    • target_modules (optional, default null) - List of module names or regex expression of the module names to replace with LoRA. When unspecified, modules will be chosen according to the model architecture (e.g. Mistral, Llama).

  • privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.

    • dp (optional, default false) - Flag to turn on differentially private fine tuning when a data source is provided.

    • epsilon (optional, default 8) - Privacy loss parameter for differential privacy. Specify the maximum privacy budget available for model fine-tuning.

    • delta (optional, default auto) - Probability of accidentally leaking information. It is typically set to be much less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.2. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.

    • entity_column_name (optional, default null) - Column representing the unit of privacy, e.g. name or id. When null, record-level differential privacy is maintained, i.e. the final model does not change by much when the input dataset changes by one record. When set to e.g. user_id, user-level differential privacy is maintained.

  • generate (optional) - Controls generated outputs during training.

    • num_records (optional, default 10) - Number of outputs.

    • maximum_text_length (optional, default 100) - Max tokens per output.
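
As an illustration of the multi-column case, the sketch below (the data URL and column name are placeholders) fine-tunes on a single text column of a multi-column CSV and holds out a validation set:

schema_version: "1.0"
name: "multi-column-finetune"
models:
  - gpt_x:
      data_source:
        - "https://example.com/support-tickets.csv"   # placeholder multi-column CSV
      pretrained_model: "gretelai/gpt-auto"
      column_name: "ticket_text"     # required: the column containing the training text
      validation: 200                # hold out 200 records for validation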

Data generation

Parameters Documentation

General Configuration

  • schema_version (optional): Defines the version of the configuration schema.

  • name (optional): Name of the model configuration.

Models

  • models (required): List of model configurations.

    • gpt_x: Configuration for a specific model instance.

      • data_source (required): URLs or paths to the data files (CSV, JSON, JSONL). For temporary data, use __tmp__.

      • pretrained_model (optional): Pretrained LLM model to use. Defaults to "gretelai/gpt-auto".

      • prompt_template (optional): Template for prompting the model.

      • column_name (optional): Name of the column containing the text data when using multi-column input. Required for multi-column inputs.

      • validation (optional): Size of the validation set, specified as an integer (absolute number of samples).

Training Parameters

  • params (optional): Configuration for training parameters.

    • batch_size (default 4): Number of samples per batch per GPU/TPU/CPU.

    • epochs (optional): Number of complete passes through the training dataset.

    • steps (default 750): Total number of training steps to perform.

    • weight_decay (default 0.01): Weight decay coefficient for the AdamW optimizer, a regularization parameter.

    • warmup_steps (default 100): Number of steps for learning rate warmup.

    • lr_scheduler (default linear): Type of learning rate scheduler.

    • learning_rate (default 0.0002): Initial learning rate for the AdamW optimizer.

    • max_tokens (default 512): Maximum number of tokens for each input sequence.

    • gradient_accumulation_steps (default 8): Number of steps to accumulate gradients before updating model parameters.
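
To make the interaction between these settings concrete, the fragment below (a sketch reusing the defaults listed above) annotates the resulting effective batch size:

params:
  batch_size: 4                     # samples per device per micro-batch
  gradient_accumulation_steps: 8    # accumulate 8 micro-batches before each optimizer update
  # effective batch size per device = 4 x 8 = 32
  steps: 750                        # total optimizer updates
  warmup_steps: 100                 # learning rate ramps up linearly over the first 100 steps
  learning_rate: 0.0002
  lr_scheduler: "linear"
  max_tokens: 512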

Parameter-Efficient Fine-Tuning (PEFT) Parameters

  • peft_params (optional): Parameters for fine-tuning using PEFT.

    • lora_r (default 8): Rank of the low-rank adaptation matrix in LoRA.

    • lora_alpha_over_r (default 1.0): Scaling factor for the LoRA adaptation.

    • target_modules (optional): Specific modules to apply LoRA adaptation.
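
Because lora_alpha_over_r is the ratio of the LoRA alpha to the rank, the effective alpha is the product of the two values; in the sketch below it works out to 16:

peft_params:
  lora_r: 8               # rank of the low-rank update matrices
  lora_alpha_over_r: 2.0  # effective LoRA alpha = 2.0 x 8 = 16
  target_modules: null    # choose modules automatically based on the model architecture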

Privacy Parameters

  • privacy_params (optional): Configuration for differential privacy (DP).

    • dp (default false): Enable differentially private training using DP-SGD.

    • epsilon (default 8.0): Privacy budget parameter for DP.

    • delta (default "auto"): Privacy parameter for DP, usually a very small number.

    • per_sample_max_grad_norm (default 1.0): Clipping norm for gradients per sample to ensure privacy.

    • entity_column_name (optional): Column name for entity-level differential privacy.
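
For example, a privacy block along these lines (a sketch; user_id is a placeholder column name) enables differentially private fine-tuning at the user level:

privacy_params:
  dp: true                          # turn on DP-SGD fine-tuning
  epsilon: 8.0                      # maximum privacy budget
  delta: "auto"                     # chosen automatically to be <= 1/n^1.2 for n training records
  per_sample_max_grad_norm: 1.0     # per-sample gradient clipping norm
  entity_column_name: "user_id"     # placeholder: maintain user-level rather than record-level DP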

Generation Parameters

  • generate (optional): Parameters controlling the generation of synthetic text.

    • num_records (default 10): Number of records to generate.

    • seed_records_multiplier (default 1): Multiplier for the number of rows emitted per prompt in prompt-based generation.

    • maximum_text_length (default 100): Maximum number of tokens per generated text.

    • top_p (default 0.89876): Probability threshold for nucleus sampling (top-p).

    • top_k (default 43): Number of highest probability tokens to keep for top-k sampling.

    • num_beams (default 1): Number of beams for beam search. Use 1 to disable beam search.

    • do_sample (default true): Enable sampling if true, otherwise use greedy search.

    • do_early_stopping (default true): Enable early stopping in beam search if true.

    • typical_p (default 0.8): Typical probability mass to consider in sampling.

    • temperature (default 1.0): Sampling temperature. Higher values increase randomness.
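
Putting these together, a generate block such as the sketch below favors sampling over greedy or beam search and caps the length of each output:

generate:
  num_records: 100            # number of outputs to generate
  maximum_text_length: 100    # cap each output at 100 tokens
  do_sample: true             # sample from the token distribution instead of greedy search
  num_beams: 1                # disable beam search
  temperature: 1.0            # raise for more randomness, lower for more deterministic text
  top_p: 0.9                  # nucleus sampling: keep the smallest token set with cumulative probability 0.9
  top_k: 43                   # also restrict sampling to the 43 most likely tokens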

Usage

  • Training Configuration: Define your data source and configure model parameters. Optionally, enable privacy settings.

  • Data Generation: Supports unconditional and prompt-based text generation. Configure generation parameters to control output features.

Make sure to set data_source and pretrained_model according to your requirements, and use column_name to specify the text column for multi-column data inputs.

Model Information

The Gretel GPT model supports fine-tuning and inference of commercially viable large language models. Specific model information can be found on each model card linked below.

Supported Models

  • gretelai/gpt-auto: Automatically selects the best available LLM for model training

  • mistralai/Mistral-7B-Instruct-v0.2

  • meta-llama/Meta-Llama-3-8B-Instruct
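
To pin training or prompting to one of the listed models rather than letting gretelai/gpt-auto choose, set pretrained_model explicitly; a minimal sketch:

schema_version: "1.0"
name: "pinned-model"
models:
  - gpt_x:
      data_source: null    # or point to a CSV, JSON, or JSONL file to fine-tune
      pretrained_model: "meta-llama/Meta-Llama-3-8B-Instruct"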

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.

CPU: Minimum 4 cores, 32GB RAM.

GPU (Required): An NVIDIA A10G, RTX 3090, or better CUDA-compliant GPU with 24GB+ RAM is required to run basic language models. For fine-tuning on datasets with more than 1,000 examples, an NVIDIA A100 or H100 with 40GB+ RAM is recommended.

Limitations and Biases

Large-scale language models such as Gretel GPT may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information on each model, please read the model cards linked under "model information".
