Gretel GPT
Model type: Generative pre-trained transformer for text generation
Gretel GPT simplifies the process of training popular Large Language Models (LLMs) to generate synthetic text. It offers support for differentially private training, ensuring data privacy, and includes automated quality reporting with Gretel's Text Synthetic Quality Score (SQS). This allows you to create labeled examples to train or test other machine learning models, fine-tune the model on your data, or prompt it with examples for inference.
Model creation
Initialize a model to begin using Gretel GPT. Use the gpt_x tag to select this model. All Gretel models use a common interface for training synthetic data models from their config. See the reference for how to Create and Train a Model.
To prompt the base model directly without fine-tuning, set data_source to null at initialization.
Below is a sample config to create and fine-tune a Gretel GPT model.
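The sketch below assembles the parameters documented in the next section; the name and data_source values are illustrative placeholders, and the remaining values are the documented defaults rather than recommendations.

```yaml
schema_version: "1.0"
name: "gretel-gpt-example"                 # illustrative model name
models:
  - gpt_x:
      data_source: "reviews.csv"           # placeholder path; use __tmp__, or null to skip fine-tuning
      pretrained_model: "mistralai/Mistral-7B-Instruct-v0.2"
      column_name: null                    # required when the data source has multiple columns
      params:
        batch_size: 4
        epochs: 3
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: "linear"
        learning_rate: 0.0002
        max_tokens: 512
        gradient_accumulation_steps: 8
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1
      privacy_params:
        dp: false
        epsilon: 8
        delta: "auto"
      generate:
        num_records: 10
        maximum_text_length: 100
```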
When fine-tuning Gretel GPT models, these constraints apply:
Use 100 or more examples if possible. With fewer than 100 examples, prompt the base model directly instead of fine-tuning.
Providing only 1-5 records will cause an error.
If your training dataset is in a multi-column format, you must set column_name, as shown in the snippet after this list.
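As a minimal sketch of the multi-column case, the file name and column name below are illustrative only:

```yaml
models:
  - gpt_x:
      data_source: "support_tickets.csv"   # placeholder multi-column dataset
      column_name: "ticket_text"           # column that holds the training text
```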
data_source (required) - Use __tmp__ or a valid CSV, JSON, or JSONL file. Leave blank to skip fine-tuning and use the base LLM weights for few-shot or zero-shot generation.
pretrained_model (optional, defaults to mistralai/Mistral-7B-Instruct-v0.2) - Gretel supports PEFT and LoRA for fast adaptation of LLMs. Use a causal language model from the HuggingFace model repository.
column_name (optional) - Column containing the training text when the input is multi-column. Required for multi-column input.
params - Controls the model training process.
batch_size (optional, default 4) - Batch size per GPU/TPU/CPU. Lower this value if you run out of memory.
epochs (optional, default 3) - Number of training epochs.
weight_decay (optional, default 0.01) - Weight decay for the AdamW optimizer. Value between 0 and 1.
warmup_steps (optional, default 100) - Warmup steps for the linear learning-rate increase.
lr_scheduler (optional, default linear) - Learning rate scheduler type.
learning_rate (optional, default 0.0002) - Initial learning rate for the AdamW optimizer.
max_tokens (optional, default 512) - Maximum input length in tokens.
validation (optional) - Validation set size, specified as an absolute number of samples.
gradient_accumulation_steps (optional, default 8) - Number of update steps to accumulate gradients for before performing a backward/update pass. This technique yields a larger effective batch size without increasing GPU memory use (see the sketch after this list).
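As a rough illustration of how these two settings combine, the effective batch size per optimizer update is batch_size multiplied by gradient_accumulation_steps; with the defaults that is 4 × 8 = 32:

```yaml
params:
  batch_size: 4                    # samples processed per device per forward/backward pass
  gradient_accumulation_steps: 8   # gradients accumulated over 8 passes before each optimizer update
  # effective batch size per optimizer update: 4 * 8 = 32 samples
```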
peft_params - Gretel GPT uses Low-Rank Adaptation (LoRA), which makes fine-tuning more efficient by drastically reducing the number of trainable parameters: instead of updating the full weight matrices, it trains smaller matrices obtained through low-rank decomposition. A minimal example follows this list.
lora_r (optional, default 8) - Rank of the matrices that are updated. A lower value means fewer trainable model parameters.
lora_alpha_over_r (optional, default 1) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, values of 0.5, 1, or 2 work well.
target_modules (optional, default null) - List of module names, or a regex expression matching the module names, to replace with LoRA. When unspecified, modules are chosen according to the model architecture (e.g. Mistral, Llama).
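A minimal peft_params sketch; note that the LoRA scaling factor alpha equals lora_alpha_over_r × lora_r (1 × 8 = 8 here), and the target_modules entries are illustrative attention-projection names, not required values:

```yaml
peft_params:
  lora_r: 8               # rank of the low-rank update matrices
  lora_alpha_over_r: 1    # alpha / r; alpha works out to 1 * 8 = 8
  target_modules:         # illustrative; leave null to auto-select per model architecture
    - "q_proj"
    - "v_proj"
```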
privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section (an example block follows this list).
dp (optional, default false) - Flag to turn on differentially private fine-tuning when a data source is provided.
epsilon (optional, default 8) - Privacy loss parameter for differential privacy. Specify the maximum value available for model fine-tuning.
delta (optional, default auto) - Probability of accidentally leaking information. It is typically set to be much less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.2. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality.
entity_column_name (optional, default null) - Column representing the unit of privacy, e.g. name or id. When null, record-level differential privacy is maintained, i.e. the final model does not change by much when the input dataset changes by one record. When specified as e.g. user_id, user-level differential privacy is maintained.
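A sketch of a privacy_params block with differential privacy enabled; the user_id column is an illustrative unit of privacy, and the epsilon/delta values shown are the documented defaults, not recommendations:

```yaml
privacy_params:
  dp: true                        # enable differentially private fine-tuning
  epsilon: 8                      # maximum privacy-loss budget for fine-tuning
  delta: "auto"                   # auto-set to <= 1/n^1.2 based on the dataset
  entity_column_name: "user_id"   # illustrative; switches from record-level to user-level DP
```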
generate (optional) - Controls generated outputs during training.
num_records (optional, default 10) - Number of outputs.
maximum_text_length (optional, default 100) - Maximum tokens per output.
Data generation
Parameters Documentation
General Configuration
schema_version (optional): Defines the version of the configuration schema.
name (optional): Name of the model configuration.
Models
models (required): List of model configurations.
gpt_x: Configuration for a specific model instance.
data_source (required): URLs or paths to the data files (CSV, JSON, JSONL). For temporary data, use __tmp__.
pretrained_model (optional): Pretrained LLM model to use. Defaults to "gretelai/gpt-auto".
prompt_template (optional): Template for prompting the model.
column_name (optional): Name of the column with text data if using multi-column input. Required parameter if using multi-column input.
validation (optional): Size of the validation set, specified as an integer (absolute number of samples).
Training Parameters
params (optional): Configuration for training parameters.
batch_size (default 4): Number of samples per batch per GPU/TPU/CPU.
epochs (optional): Number of complete passes through the training dataset.
steps (default 750): Total number of training steps to perform.
weight_decay (default 0.01): Weight decay coefficient for the AdamW optimizer, a regularization parameter.
warmup_steps (default 100): Number of steps for learning rate warmup.
lr_scheduler (default linear): Type of learning rate scheduler.
learning_rate (default 0.0002): Initial learning rate for the AdamW optimizer.
max_tokens (default 512): Maximum number of tokens for each input sequence.
gradient_accumulation_steps (default 8): Number of steps to accumulate gradients before updating model parameters.
Parameter-Efficient Fine-Tuning (PEFT) Parameters
peft_params (optional): Parameters for fine-tuning using PEFT.
lora_r (default 8): Rank of the low-rank adaptation matrix in LoRA.
lora_alpha_over_r (default 1.0): Scaling factor for the LoRA adaptation.
target_modules (optional): Specific modules to apply LoRA adaptation.
Privacy Parameters
privacy_params (optional): Configuration for differential privacy (DP).
dp (default false): Enable differentially private training using DP-SGD.
epsilon (default 8.0): Privacy budget parameter for DP.
delta (default "auto"): Privacy parameter for DP, usually a very small number.
per_sample_max_grad_norm (default 1.0): Clipping norm for gradients per sample to ensure privacy.
entity_column_name (optional): Column name for entity-level differential privacy.
Generation Parameters
generate (optional): Parameters controlling the generation of synthetic text.
num_records (default 10): Number of records to generate.
seed_records_multiplier (default 1): Multiplier for the number of rows emitted per prompt in prompt-based generation.
maximum_text_length (default 100): Maximum number of tokens per generated text.
top_p (default 0.89876): Probability threshold for nucleus sampling (top-p).
top_k (default 43): Number of highest probability tokens to keep for top-k sampling.
num_beams (default 1): Number of beams for beam search. Use 1 to disable beam search.
do_sample (default true): Enable sampling if true, otherwise use greedy search.
do_early_stopping (default true): Enable early stopping in beam search if true.
typical_p (default 0.8): Typical probability mass to consider in sampling.
temperature (default 1.0): Sampling temperature. Higher values increase randomness.
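A sketch of a generate block wiring these sampling controls together; the values shown are simply the documented defaults:

```yaml
generate:
  num_records: 10             # number of synthetic records to produce
  maximum_text_length: 100    # max tokens per generated text
  temperature: 1.0            # higher values increase randomness
  top_p: 0.89876              # nucleus (top-p) sampling threshold
  top_k: 43                   # keep the 43 most likely tokens at each step
  do_sample: true             # sample instead of greedy decoding
  num_beams: 1                # 1 disables beam search
```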
Usage
Training Configuration: Define your data source and configure model parameters. Optionally, enable privacy settings.
Data Generation: Supports unconditional and prompt-based text generation. Configure generation parameters to control output features.
Make sure to set data_source and pretrained_model as per your requirements. Use column_name for specifying the text column in multi-column data inputs.
Model Information
The Gretel GPT model supports fine-tuning and inference of commercially viable large language models. Specific model information can be found on each model card linked below.
Supported Models
gretelai/gpt-auto: Automatically selects the best available LLM for model training
mistralai/Mistral-7B-Instruct-v0.2
meta-llama/Meta-Llama-3-8B-Instruct
Minimum requirements
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (required): A CUDA-compliant GPU with 24GB+ RAM, minimum NVIDIA A10G or RTX 3090, is required to run basic language models. For fine-tuning on datasets with more than 1,000 examples, an NVIDIA A100 or H100 with 40GB+ RAM is recommended.
Limitations and Biases
Large-scale language models such as Gretel GPT may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information on each model, please read the model cards linked under "model information".