Gretel GPT

Model type: Generative pre-trained transformer for text generation

Gretel GPT uses Large Language Models to generate synthetic text, which can be used as labeled examples to train or test other ML models. You can fine-tune the model on your own data or prompt it with examples for inference.

Model creation

Initialize a model to begin using Gretel GPT. Use the gpt_x tag to select this model. The sample configuration below can be used to create and fine-tune a Gretel GPT model. All Gretel models implement a common interface for training synthetic data models from a configuration; see the reference on how to Create and Train a Model, and the brief Python sketch following the sample configuration below.

To prompt the base model directly without fine-tuning, set data_source to None at initialization.

When fine-tuning Gretel GPT models, these constraints apply:

  1. Use 100 or more training examples where possible. With fewer than 100 examples, consider prompting the base model directly instead of fine-tuning.

  2. Providing only 1-5 records will cause an error.

  3. Providing fewer than 100 records will trigger a warning; 100+ records are recommended for fine-tuning.

  4. If your training dataset is in a multi-column format, you MUST set the column_name parameter.

schema_version: "1.0"

models:
  - gpt_x:
      data_source: __tmp__
      pretrained_model: 'gretelai/mpt-7b'
      column_name: null
      validation: null
      params:
          batch_size: 4
          epochs: 3.0
          weight_decay: 0.01
          warmup_steps: 100
          lr_scheduler: "linear"
          learning_rate: 0.0002
          max_tokens: 512
      generate:
          num_records: 80
          maximum_text_length: 100
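
The sketch below shows one way to submit this configuration for training with the Gretel Python client (gretel_client). It is a minimal example, assuming the configuration above is saved locally as gretel_gpt_config.yml and the training data as train.csv (placeholder names); exact module paths and method names follow the gretel_client projects API and may differ between SDK versions.

import yaml
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

# Authenticate with Gretel Cloud (prompts for an API key if one is not cached).
configure_session(api_key="prompt", validate=True)

# Load the sample configuration shown above.
with open("gretel_gpt_config.yml") as fh:
    config = yaml.safe_load(fh)

project = create_or_get_unique_project(name="gretel-gpt-example")

# Fine-tune on a local dataset. Passing data_source=None instead would skip
# fine-tuning and let you prompt the base model directly.
model = project.create_model_obj(model_config=config, data_source="train.csv")
model.submit_cloud()

# Wait for training to finish, streaming job logs to the console.
poll(model)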

The following parameters may be used to configure model training.

  • data_source (required) - Use __tmp__ or point to a valid and accessible file in CSV, JSON, or JSONL format. Leave blank to skip fine-tuning and use the base LLM weights for zero-shot or few-shot generation.

  • pretrained_model (optional, defaults to gretelai/mpt-7b) - Gretel supports PEFT and LoRA for fast adaptation of LLMs. Use any causal language model from the HuggingFace model repository.

  • column_name (optional) - The column containing the text to train on when the training data has multiple columns. Required for multi-column inputs.

  • batch_size (optional, default 4) - Batch size per GPU/TPU/CPU. Lower if out of memory.

  • epochs (optional, default 3) - Number of training epochs.

  • weight_decay (optional, default 0.01) - Weight decay for AdamW optimizer. 0 to 1.

  • warmup_steps (optional, default 100) - Warmup steps for linear lr increase.

  • lr_scheduler (optional, default linear) - Learning rate scheduler type.

  • learning_rate (optional, default 0.0002) - Initial AdamW learning rate.

  • max_tokens (optional, default 512) - Max input length in tokens.

  • validation (optional) - Size of the validation set. An integer value is interpreted as the absolute number of validation samples.

  • generate (optional) - Controls generated outputs during training.

    • num_records (optional, default 10) - Number of outputs.

    • maximum_text_length (optional, default 100) - Max tokens per output.

For training data with multiple columns, use the column_name parameter to specify which column to train on. Set column_name to the field name of the natural language text column, e.g. column_name: "text".
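
For example, with a hypothetical dataset whose columns are id, text, and label, the loaded configuration could be pointed at the text column as follows; this is a sketch only, and the file and column names are placeholders.

import yaml

# Load the sample configuration from above.
with open("gretel_gpt_config.yml") as fh:
    config = yaml.safe_load(fh)

# Hypothetical multi-column dataset with "id", "text", and "label" columns:
# train on the natural language "text" column.
config["models"][0]["gpt_x"]["column_name"] = "text"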

Data generation

Gretel GPT supports two modes for generating data: unconditional generation and generation with prompting.

  • Unconditional generation is only available for models that have been fine-tuned on a data source. In this mode, a specified number of text records will be generated that reflect the characteristics of the dataset used for fine-tuning.

  • Generation with prompting is available for both fine-tuned and non-fine-tuned models. In this mode, a single-column input data source containing prompts must be supplied during generation. Each record in the output dataset corresponds to one prompt record in the input data source and is generated as a continuation of that prompt. An example prompts file is sketched below.
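
As an illustration of the prompts data source, the snippet below builds a single-column CSV with pandas; the column header and file name are arbitrary placeholders.

import pandas as pd

# One prompt per row; each row seeds one generated record (or more, when
# seed_records_multiplier is greater than 1).
prompts = pd.DataFrame(
    {
        "text": [
            "Write a short product review for a coffee grinder:",
            "Write a short product review for a standing desk:",
        ]
    }
)
prompts.to_csv("prompts.csv", index=False)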

The following parameters control the generation of new records. All Gretel models implement a common interface to generate new data; see the example on how to Generate data from a model, and the sketch following the parameter list below.

  • data_source (optional) - Prompts file to use for generation with prompting. Single-column CSV/JSON/JSONL.

  • seed_records_multiplier (optional, default 1) - Multiplier to control the number of rows emitted per prompt. By default, the output dataset will contain one record for each prompt record in the input dataset. If set to a number greater than one, the size of the output will be the number of prompts times seed_records_multiplier, with all records generated from each individual prompt appearing consecutively in the output. Ignored for unconditional generation.

  • num_records (optional, default 10) - Number of outputs to generate. Ignored for generation with prompting.

  • maximum_text_length (optional, default 100) - Maximum number of tokens in each generated output.

  • top_p (optional, default 0.89876) - Probability cutoff for nucleus (top-p) filtering; only the most probable tokens whose cumulative probability reaches top_p are kept for generation.

  • top_k (optional, default 43) - Top k tokens to keep. 0 to disable.

  • num_beams (optional, default 1) - Number of beams. 1 to disable beam search.

  • do_sample (optional, default True) - Use sampling if True, greedy if False.

  • do_early_stopping (optional, default True) - Stop beams early if True.

  • typical_p (optional, default 0.8) - The amount of probability mass from the original distribution to consider in typical decoding.

  • temperature (optional, default 1.0) - Randomness amount. Higher is more random.
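
As a sketch of this interface with the Gretel Python client, the example below runs generation with prompting against a fine-tuned model. The project name, model ID, and prompts file are placeholders, and the method names follow the gretel_client projects API, which may differ between SDK versions.

import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import get_project

configure_session(api_key="prompt", validate=True)

# Look up the fine-tuned model created earlier (placeholder project name and model ID).
project = get_project(name="gretel-gpt-example")
model = project.get_model(model_id="...")

# Generation with prompting: supply a single-column prompts file as the data source.
# Omit data_source for unconditional generation from a fine-tuned model.
record_handler = model.create_record_handler_obj(
    data_source="prompts.csv",
    params={
        "num_records": 80,              # used for unconditional generation only
        "maximum_text_length": 100,
        "temperature": 1.0,
        "top_p": 0.89876,
        # "seed_records_multiplier": 1, # rows emitted per prompt
    },
)
record_handler.submit_cloud()
poll(record_handler)

# Read the generated records into a DataFrame.
generated = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print(generated.head())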

Model Information

The Gretel GPT model supports fine-tuning and inference of commercially viable large language models. Specific model information can be found on each model card linked below.

Supported Models

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.

CPU: Minimum 4 cores, 32GB RAM.

GPU (Required): An Nvidia A10G, RTX 3090, or better CUDA-compliant GPU with 24GB+ RAM is required to run basic language models. For fine-tuning on datasets with more than 1,000 examples, an Nvidia A100 or H100 with 40GB+ RAM is recommended.

Limitations and Biases

Large-scale language models such as Gretel GPT may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information on each model, please read the model cards linked under "model information".
