Gretel GPT

Model type: Generative Pre-trained Transformer for natural language text generation.
Gretel GPT is an API for creating synthetic natural language text using Large Language Models (LLMs). The generated text can serve as labeled examples for training or testing downstream machine learning models. You can either fine-tune the model on your own data or provide a few examples for the model to learn to recreate.
This API is currently offered as a preview and may change. Please contact us at [email protected] if you have any questions or would like to discuss natural language text generation in more detail.

Model creation

This model can be selected using the gpt_x model tag. Below is an example configuration that may be used to create and fine-tune a Gretel GPT model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.
The following constraints exist for training data size:
  • Training data with fewer than 5 records (but more than 0) will result in a model_error, because that is not enough data to fine-tune a model.
  • Training data with fewer than 100 records will cause a warning message to be emitted in the model logs. We highly recommend using more than 100 training records.
  • If you do not provide training data, the model will not be fine-tuned and will be saved immediately for future inference jobs.
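The size thresholds above can be checked before submitting a job. The following is a minimal sketch, assuming you already know the record count; the function name and the returned labels are hypothetical and are not part of the Gretel API.

```python
def check_training_size(num_records: int) -> str:
    """Classify a training set size using the thresholds listed above.

    The returned labels are illustrative only; they are not values
    produced by the Gretel service.
    """
    if num_records < 0:
        raise ValueError("num_records must be non-negative")
    if num_records == 0:
        # No training data: the base model is saved without fine-tuning.
        return "no_fine_tuning"
    if num_records < 5:
        # Too little data to fine-tune: the job ends in a model_error.
        return "model_error"
    if num_records < 100:
        # Training proceeds, but a warning appears in the model logs.
        return "warning"
    return "ok"


if __name__ == "__main__":
    for n in (0, 3, 50, 500):
        print(n, check_training_size(n))
```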
The following constraints exist on tokens in training and generated data:
  • The verbatim string ### is used as a universal start and stop padding string for text generation. This is a special string that Gretel uses regardless of the selected base model because different models have different special tokens.
  • If you are curating training text, for example by combining multiple fields together, you should avoid using ### as a separator and also ensure this string does not appear in training samples.
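When combining fields into a single training string, it can help to check for the reserved ### string up front. The sketch below is one possible approach; the helper names and the default " | " separator are assumptions chosen for illustration.

```python
RESERVED = "###"  # start/stop padding string described above


def build_sample(fields, separator=" | "):
    """Join several fields into one training string.

    Raises ValueError if the separator or the joined text contains
    the reserved string. The default separator is an arbitrary choice.
    """
    if RESERVED in separator:
        raise ValueError("separator must not contain '###'")
    text = separator.join(str(field).strip() for field in fields)
    if RESERVED in text:
        raise ValueError("training sample must not contain '###'")
    return text


def find_reserved(samples):
    """Return the indexes of samples that contain the reserved string."""
    return [i for i, sample in enumerate(samples) if RESERVED in sample]


if __name__ == "__main__":
    print(build_sample(["Title", "Body text"]))
    print(find_reserved(["clean text", "bad ### text"]))
```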
schema_version: "1.0"
models:
  - gpt_x:
      data_source: __tmp__
      pretrained_model: 'gretelai/mpt-7b'
      batch_size: 4
      epochs: 3.0
      weight_decay: 0.01
      warmup_steps: 100
      lr_scheduler: "linear"
      learning_rate: 0.0002
      max_tokens: 512
      column_name: null
      validation: null
      generate:
        num_records: 10
        maximum_text_length: 100
Parameters that may be used to configure model training.
  • data_source (str, required) - Either __tmp__ or a path to a valid and accessible file in CSV, JSON, or JSONL format. Leave this blank to bypass fine-tuning and use the frozen model weights, for example when few-shot or zero-shot generating from a base LLM.
  • pretrained_model (str, optional, defaults to gretelai/mpt-7b) - Must be a valid causal language model from the HuggingFace model repository. It is highly recommended to use a model for which Gretel has built and tested support for PEFT and LoRA, which allow you to adapt LLMs with significantly reduced training time.
  • batch_size (int, optional, defaults to 4) - The batch size per GPU/TPU core/CPU for training. Note: if you hit OOM (out of memory) errors with your GPU, try lowering the batch size.
  • epochs (float, optional, defaults to 3.0) - Total number of training epochs to perform while fine-tuning the model.
  • weight_decay (float, optional, defaults to 0.01) - The weight decay to apply (if not zero) to all layers except bias and LayerNorm weights in the AdamW optimizer. Must be between 0 and 1 (inclusive).
  • warmup_steps (int, optional, defaults to 100) - The number of steps used for a linear warmup from 0 to learning_rate.
  • lr_scheduler (str, optional, defaults to linear) - The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values and details.
    • Possible values include: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup.
  • learning_rate (float, optional, defaults to 0.0002) - The initial learning rate for the AdamW optimizer.
  • max_tokens (int, optional, defaults to 512) - The maximum length (in number of tokens) for any input record. The tokenizer used corresponds to the pretrained model selected.
  • column_name (str, optional, defaults to null) - The name of the column used for training. For a multi-column training data input, this parameter is used to specify which column contains the natural language text that should be used for training.
  • validation ([bool, int], optional, defaults to null) - The test size to use for validation. The integer value represents the absolute number of test samples.
  • generate (dict, optional, defaults to generating 10 records) - Section that controls the output generated during model training.
    • num_records (int, optional, defaults to 10) - The number of sample text outputs to generate during model training.
    • maximum_text_length (int, optional, defaults to 100) - Maximum number of tokens to generate (not including the prompt) in output text.
    • All the parameters from the Data generation section (excluding data_source) can be used in generate as well.
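A few of the constraints above can be checked locally before submitting a configuration. This is a rough sketch, not Gretel's own validation: the function name is hypothetical, the scheduler set contains only the values listed above (the HuggingFace SchedulerType documentation is the authoritative list), and the positivity checks on batch_size and max_tokens are assumptions.

```python
# Scheduler names listed in this document; other values may also be valid.
DOCUMENTED_SCHEDULERS = {
    "linear",
    "cosine",
    "cosine_with_restarts",
    "polynomial",
    "constant",
    "constant_with_warmup",
}


def validate_training_params(params: dict) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []

    # Documented constraint: weight_decay must be between 0 and 1 inclusive.
    weight_decay = params.get("weight_decay", 0.01)
    if not 0 <= weight_decay <= 1:
        problems.append(f"weight_decay={weight_decay} is outside [0, 1]")

    # Flag scheduler names that are not among the documented values.
    scheduler = params.get("lr_scheduler", "linear")
    if scheduler not in DOCUMENTED_SCHEDULERS:
        problems.append(f"lr_scheduler={scheduler!r} is not a documented value")

    # Assumption: these counts must be at least 1 to be meaningful.
    for name, default in (("batch_size", 4), ("max_tokens", 512)):
        value = params.get(name, default)
        if value < 1:
            problems.append(f"{name}={value} should be at least 1")

    return problems


if __name__ == "__main__":
    print(validate_training_params({"weight_decay": 1.5, "lr_scheduler": "linear"}))
```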
For training data inputs with multiple columns, use the column_name parameter to specify which column to train on. column_name should be set as the field name of the natural language text column, e.g. column_name: "text".
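To illustrate what column selection means for a multi-column CSV file, here is a small sketch using Python's standard csv module. It only mimics the idea locally; the function name is hypothetical, and the behavior when column_name is omitted for a multi-column file (raising an error) is an assumption rather than documented Gretel behavior.

```python
import csv
import io
from typing import List, Optional


def read_text_column(csv_text: str, column_name: Optional[str] = None) -> List[str]:
    """Extract the natural language text column from CSV content."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    if column_name is None:
        # Assumption: without column_name, only single-column input is unambiguous.
        if len(fields) != 1:
            raise ValueError("column_name is needed for multi-column input")
        column_name = fields[0]
    elif column_name not in fields:
        raise KeyError(f"column {column_name!r} not found in {fields}")
    return [row[column_name] for row in reader]


if __name__ == "__main__":
    data = "id,text\n1,hello world\n2,synthetic data\n"
    print(read_text_column(data, column_name="text"))
```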

Data generation

Parameters controlling the generation of new records. All Gretel models implement a common interface to generate new data. See the example on how to Generate data from a model.
  • data_source (str, optional) - A series of prompts. Must point to a valid and accessible file in single-column CSV, JSON, or JSONL format. If specified, this overrides the num_records parameter, and one record is generated for each prompt in data_source.
  • num_records (int, optional, defaults to 10) - The number of text outputs to generate.
  • maximum_text_length (int, optional, defaults to 100) - Maximum number of tokens to generate (not including the prompt) in output text.
  • top_p (float, optional, defaults to 0.89876) - If set to a float value < 1, only the smallest set of most probable tokens whose probabilities add up to top_p or higher is kept for generation.
  • top_k (float, optional, defaults to 43) - Number of highest probability vocabulary tokens to keep for top_k filtering. Set to 0 to disable top_k filtering.
  • num_beams (int, optional, defaults to 1) - Number of beams for beam search. Set to 1 to disable beam search.
  • do_sample (bool, optional, defaults to True) - Whether or not to use sampling, otherwise use greedy decoding.
  • do_early_stopping (bool, optional, defaults to True) - Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
  • typical_p (float, optional, defaults to 0.8) - The amount of probability mass from the original distribution that we wish to consider.
  • temperature (float, optional, defaults to 1.0) - The value used to modulate the next token probabilities. Higher temperatures lead to more randomness in the output.
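The interaction between temperature, top_k, and top_p can be easier to follow with a toy example. The sketch below is a simplified pure-Python illustration of these filters applied to a small list of logits. It is not the implementation used by Gretel or HuggingFace, it omits typical_p and beam search, and the function name and the order of operations (temperature, then top_k, then top_p) are assumptions made for illustration.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=43, top_p=0.89876):
    """Return {token_index: probability} after temperature, top_k, and top_p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # Token indexes ordered from most to least probable.
    order = sorted(probs, key=probs.get, reverse=True)

    # top_k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
        mass = sum(probs[i] for i in order)
        probs = {i: probs[i] / mass for i in order}

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # Renormalize the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in order)
    return {i: probs[i] / mass for i in order}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]
    print(filter_next_token_probs(toy_logits, top_k=2, top_p=1.0))
    print(filter_next_token_probs(toy_logits, temperature=2.0, top_k=0, top_p=0.9))
```

A token would then be drawn from the returned distribution when do_sample is True, for example with random.choices; with do_sample set to False, greedy decoding simply picks the most probable token.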

Model information

The Gretel GPT model uses state-of-the-art, commercially viable large language models. Specific model information can be found on each model card linked below.
Supported models

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU: Required. An NVIDIA A10G, RTX 3090, or better CUDA-compliant GPU with 24GB+ RAM is the minimum needed to run basic language models. For fine-tuning on datasets with more than 1,000 examples, an NVIDIA A100 or H100 with 40GB+ RAM is recommended.

Limitations and Biases

Large-scale language models such as Gretel GPT may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information on each model, please read the model cards linked under "Model information".