Gretel Text Fine-Tuning
Model type: Generative pre-trained transformer for text generation
Gretel Text Fine-Tuning simplifies the process of training popular Large Language Models (LLMs) to generate synthetic text. It offers support for differentially private training, ensuring data privacy, and includes automated quality reporting with Gretel's Text Synthetic Quality Score (SQS). This allows you to create labeled examples to train or test other machine learning models, fine-tune the model on your data, or prompt it with examples for inference.
Step configuration
The config below shows all the available training and generation parameters for Text Fine-Tuning. It is best to use Text Fine-Tuning on datasets with only a single column of free text.
If your training dataset is in a multi-column format, you MUST set column_name when using Text Fine-Tuning (see the sketch after the config below).
schema_version: "1.0"
name: default
task:
  name: text_ft
  config:
    train:
      pretrained_model: "gretelai/gpt-auto"
      prompt_template: null
      column_name: null
      validation: null
      params:
        batch_size: 4
        epochs: null
        steps: 750
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: "linear"
        learning_rate: 0.0001
        max_tokens: 512
        gradient_accumulation_steps: 8
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1.0
        target_modules: null
      privacy_params:
        dp: false
        epsilon: 8.0
        delta: "auto"
        per_sample_max_grad_norm: 1.0
        entity_column_name: null
    generate:
      num_records: 80
      maximum_text_length: 100
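For multi-column datasets, only the column named by column_name is used for training. Below is a minimal sketch of pointing the step at a free-text column, assuming a hypothetical reviews.csv file with a review_text column (both names are for illustration only):

# Sketch: select the free-text column of a multi-column CSV for training.
# "reviews.csv" and "review_text" are hypothetical names used for illustration.
import pandas as pd

df = pd.read_csv("reviews.csv")      # e.g. columns: review_id, rating, review_text
assert "review_text" in df.columns   # the column Text Fine-Tuning should train on

# Matching override in the step config above:
#   train:
#     column_name: "review_text"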
Train parameters
pretrained_model (optional, defaults to meta-llama/Llama-3.1-8B-Instruct) - Base model used for fine-tuning. These are the models currently supported:
gretelai/gpt-auto - defaults to meta-llama/Meta-Llama-3-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.2
TinyLlama/TinyLlama-1.1B-Chat-v1.0
meta-llama/Llama-3.1-8B-Instruct
column_name (optional) - Column containing the text to train on. This parameter is required when the training dataset has multiple columns.
params - Parameters that control the model training process:
batch_size (optional, default 4) - Batch size per GPU/TPU/CPU. Lower this value if you run out of memory.
epochs (optional, default 3) - Number of training epochs.
weight_decay (optional, default 0.01) - Weight decay for the AdamW optimizer, between 0 and 1.
warmup_steps (optional, default 100) - Number of warmup steps for the linear learning-rate increase.
lr_scheduler (optional, default linear) - Learning rate scheduler type.
learning_rate (optional, default 0.0002) - Initial learning rate for the AdamW optimizer.
max_tokens (optional, default 512) - Maximum input length in tokens.
validation (optional) - Validation set size, given as an absolute number of samples.
gradient_accumulation_steps (optional, default 8) - Number of update steps to accumulate gradients for before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory (see the sketch after this list).
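Gradient accumulation lets a larger effective batch fit in limited GPU memory: the optimizer steps only after several small forward/backward passes. A rough sketch of the arithmetic with the defaults above (the single-device count is our assumption):

# Sketch: effective batch size implied by the default training parameters above.
batch_size = 4                    # samples per device per forward/backward pass
gradient_accumulation_steps = 8   # passes accumulated before each optimizer step
num_devices = 1                   # assumption: a single GPU

effective_batch_size = batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)       # 32 samples contribute to each weight update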
peft_params - Gretel Text Fine-Tuning uses Low-Rank Adaptation (LoRA), which makes fine-tuning more efficient by drastically reducing the number of trainable parameters: weight updates are applied to smaller matrices obtained through low-rank decomposition (see the sketch after this list).
lora_r (optional, default 8) - Rank of the matrices that are updated. A lower value means fewer trainable model parameters.
lora_alpha_over_r (optional, default 1) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, values of 0.5, 1, or 2 work well.
target_modules (optional, default null) - List of module names, or a regex expression matching the module names, to replace with LoRA. When unspecified, modules are chosen according to the model architecture (e.g. Mistral, Llama).
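To see why a small lora_r keeps fine-tuning cheap, count what LoRA trains for one weight matrix: instead of the full d_out x d_in layer, it learns two factors B (d_out x r) and A (r x d_in) and scales their product by alpha / r, i.e. by lora_alpha_over_r. A back-of-the-envelope sketch with an illustrative layer shape (our assumption, not a Gretel default):

# Sketch: trainable parameters LoRA adds to a single linear layer.
d_out, d_in = 4096, 4096               # illustrative projection shape, not a Gretel default
lora_r = 8                             # rank of the update matrices
lora_alpha_over_r = 1.0                # scaling alpha / r applied to the low-rank update B @ A

full_params = d_out * d_in             # 16,777,216 weights if the layer itself were trained
lora_params = lora_r * (d_out + d_in)  # 65,536 weights in B (d_out x r) and A (r x d_in)
print(f"{lora_params / full_params:.2%}")   # ~0.39% of the full layer is trainable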
privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.
dp (optional, default false) - Flag to turn on differentially private fine-tuning when a data source is provided.
epsilon (optional, default 8) - Privacy loss parameter for differential privacy. Specify the maximum value available for model fine-tuning.
delta (optional, default auto) - Probability of accidentally leaking information. It is typically set to be much less than 1/n, where n is the number of training records. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.2. You can also choose your own value for delta. Decreasing delta (for example to 1/n^2, which corresponds to delta: 0.000004 for a 500-record training dataset) provides even stronger privacy guarantees, while increasing it may improve synthetic data quality (see the sketch after this list).
entity_column_name (optional, default null) - Column representing the unit of privacy, e.g. name or id. When null, record-level differential privacy is maintained, i.e. the final model does not change by much when the input dataset changes by one record. When a column such as user_id is specified, user-level differential privacy is maintained.
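As a concrete check of the delta guidance above, this sketch compares the automatic upper bound 1/n^1.2 with the stricter 1/n^2 choice for the 500-record example:

# Sketch: delta values implied by the guidance above for a 500-record training set.
n = 500                        # number of training records (hypothetical)

auto_upper_bound = 1 / n**1.2  # "auto" keeps delta at or below this value (~5.8e-4 here)
stricter_choice = 1 / n**2     # 0.000004, the stronger guarantee mentioned above

print(auto_upper_bound, stricter_choice)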
Generate parameters
num_records (optional, default 80) - Number of output records.
maximum_text_length (optional, default 100) - Maximum number of tokens per output record.
Limitations and Biases
Large-scale language models, including those produced with Gretel Text Fine-Tuning, may generate untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.