Gretel Tabular Fine-Tuning

LLM-based AI system supporting multi-modal data.

Gretel Tabular Fine-Tuning (tabular_ft) is an AI system combining a Large-Language Model pre-trained specifically on tabular datasets with learned schema based rules. It can train on datasets of various sizes (we recommend 10,000 or more records) and generate synthetic datasets with unlimited records.

tabular_ft excels at matching the correlations (both within a single record and across multiple records) and distributions in its training data across multiple tabular modalities, such as numeric, categorical, free text, JSON, and time series values.

tabular_ft is particularly useful when:

Your dataset contains both numerical / categorical data AND free text data
You want to reduce the chance of replaying values from the original dataset, particularly rare values
Your dataset is event-driven, oriented around some column that groups rows into closely related events in a sequence

Step configuration

The config below shows all the available training and generation parameters for Tabular Fine-Tuning. Leaving all parameters unspecified (we will use defaults) is a good starting point for training on datasets with independent records, while the group_training_examples_by parameter is required to capture correlations across records within a group. The order_training_records_by parameter is strongly recommended if records within a group follow a logical order, as is the case for time series or sequential events.

For example, to generate realistic stock prices in the dow_jones_index dataset, we would set group_training_examples_by to "stock" and order_training_records_by to "date". This ensures that correlations within each stock ticker are maintained across multiple days, and the daily price and volume fluctuations are reasonable.

train:
  # Optionally group records by the column(s) set below.
  # This is useful if you need to maintain correlations
  # across multiple records. Otherwise, the training
  # assumes records are independent.
  group_training_examples_by: null
  
  # Optionally order records by the column set below.
  # This is useful if your records are sequential.
  # Note that this parameter can only be used when 
  # your records are grouped using the above parameter.
  order_training_examples_by: null
  
  max_sequences_per_example: auto
    
  params:
    # The parameter below is a proxy for training time.
    # If set to 'auto', we will automatically choose an
    # appropriate value. An integer value will set the
    # number of records from the input dataset that the
    # model will see during training. It can be smaller
    # (we downsample), larger (we resample), or the same
    # size as your input dataset. A starting value to
    # experiment with is 25,000.
    num_input_records_to_sample: auto
    batch_size: 1
    gradient_accumulation_steps: 8
    learning_rate: 0.0005
    lr_scheduler: cosine
    warmup_ratio: 0.05
    weight_decay: 0.01
    lora_alpha_over_r: 1
    lora_r: 32
    lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
    rope_scaling_factor: auto
  
  privacy_params:
    dp: false
    epsilon: 8.0
    per_sample_max_grad_norm: 1.0 
  
generate:
  num_records: 1000
  temperature: 0.75
  repetition_penalty: 1.2      
  top_p: 1.0
  stop_params:
    invalid_fraction_threshold: 0.8
    patience: 3
  use_structured_generation: false

Train parameters

group_training_examples_by (str or list of str, optional) - Column(s) to group training examples by. This is useful when you want the model to learn inter-record correlations for a given grouping of records. This parameter is only useful in the context of event-driven data, where records in a single group have a sequential order that is important for the model to learn (such as hospital events grouped by patient ID, where the model needs to learn that intake always happens prior to release for a given patient). If your data has categorical columns but no event sequences, using this parameter will not provide any benefit; the model is able to learn the correlation of the category to the other columns without this parameter.
order_training_examples_by (str, optional) - Column to order training examples by. This is useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide group_training_examples_by.
max_sequences_per_example (int, optional, defaults to auto) - This controls how examples are assembled for training and automatically set to a suitable value with auto (default).
params - Parameters that control the model training process:
- num_input_records_to_sample (int or auto, required, defaults to auto) - This parameter is a proxy for training time. It sets the number of records from the input dataset that the model will see during training. It can be smaller (we downsample), larger (we resample), or the same size as your input dataset. Setting this to the same size as your input dataset is effectively equivalent to training for a single epoch. A starting value to experiment with is 25,000. If set to auto, we will automatically choose an appropriate value.
- batch_size (int, required, defaults to 1) - The batch size per device for training. Recommended to increase this when differential privacy is enabled. However, if the value is too high, you could get an out-of-memory error. A good size to start with is 8.
- gradient_accumulation_steps (int, required, defaults to 8) - Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory.
- learning_rate (float, required, defaults to 0.0005) - The initial learning rate for AdamW optimizer.
- lr_scheduler (str, required, defaults to cosine) - The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.
- warmup_ratio (float, required, defaults to 0.05) - Ratio of total training steps used for a linear warmup from 0 to the learning rate.
- weight_decay (float, required, defaults to 0.01) - The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer.
- lora_alpha_over_r (float, required, defaults to 1.0) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2.
- lora_r (int, required, defaults to 32) - The rank of the LoRA update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.
- lora_target_modules (list of str, required, defaults to ["q_proj", "k_proj", "v_proj", "o_proj"]) - The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
- rope_scaling_factor (int or auto, required, defaults to auto) - Scale the base LLM's context length by this factor using RoPE scaling to handle datasets with more columns, or datasets containing groups with more than a few records. If you hit the error for maximum tokens, you can try increasing the rope_scaling_factor. Maximum is 6, and you may first want to try increasing to 2. The auto parameter will estimate the number of tokens per example in your dataset to determine what is likely to be the appropriate value for this parameter.
privacy_params - To fine tune on a privacy-sensitive data source with differential privacy, use the parameters in this section.
- dp (bool, optional, default false) - Flag to turn on differentially private fine tuning when a data source is provided.
- epsilon (float, optional, default 8) - Privacy loss parameter for differential privacy. Lower values indicate higher privacy.
- per_sample_max_grad_norm (float, optional, default 0.1) - Clipping norm for gradients per sample to ensure privacy. For each data sample, the gradient norm (magnitude of the gradient vector) is calculated. If it exceeds per_sample_max_grad_norm, it is scaled down to this threshold. This ensures that no single sample’s gradient contributes more than a set maximum amount to the overall update.

Generate parameters

num_records (int, required, defaults to 1000) - Number of records to generate. If you want to generate more than 50_000 records, we recommend breaking the generation job into smaller batches, which you can run in parallel.
temperature (float, required, defaults to 0.75) - The value used to control the randomness of the generated data. Higher values make the data more random.
repetition_penalty (float, required, defaults to 1.2) - The value used to control the likelihood of the model repeating the same token.
top_p (float, required, defaults to 1.0) - The cumulative probability cutoff for sampling tokens.
stop_params (optional) - Optional mechanism to stop generation if too many invalid records are being created. This helps guard against extremely long generation jobs that likely do not have the potential to generate high-quality data. This parameter is enabled by default and can be disabled by setting it to null. It can also be controlled using the following two parameters:
- invalid_fraction_threshold (float, required, defaults to 0.8) - The fraction of invalid records generated by the model that will stop generation after the patience limit is reached.
- patience (int, required, defaults to 3) - Number of consecutive generations where the invalid_fraction_threshold is reached before stopping generation.
use_structured_generation (bool, optional, default false) - With DP, we might have issues learning the tabular format, so in order to get more valid records, it helps to do structured generation.

Limitations and Biases

The default context length for the underlying model in Tabular Fine-Tuning can handle datasets with roughly 50 columns (less if modeling inter-row correlations using group_training_examples_by). Similarly, the default context length can handle event-driven data with sequences up to roughly 20 rows. To go beyond that, increase rope_scaling_factor. Note that the exact threshold (where the job will crash) depends on the number of tokens needed to encode each row, so decreasing the length of column names, abbreviating values, or reducing the number of columns can also help.
tabular_ft is a great first option to try for most datasets. However, for unique datasets or needs, other models may be a better fit. For heavily numerical tables or use cases requiring 1 million records or more to be generated (tabular_ft can generate batches of up to 130,000 records at a time), we recommend using tabular_gan. It will typically be much faster at generating results in these scenarios. For text-only datasets where you are willing to trade off generation time for an additional quality boost, we recommend using text_ft.
Given the model is an LLM, mappings from the training data often persist in the synthetic output, but there is no guarantee. If you require mappings across columns to persist, we recommend doing pre-processing to concatenate the columns or post-processing to filter out rows where the mappings did not persist.
Pre-trained models such as the underlying model in Tabular Fine-Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

Troubleshooting

Tabular Fine-Tuning is only recommended when:

You want to generate synthetic data and have a sample of at least 500 rows of real data
Your dataset has (or can be reduced to) relatively few columns (<30)
There are relatively few events per sequence (<10) if you have event-driven data

Common errors

There are two common errors you might face when running Tabular Fine-Tuning: limited context window and max runtime.

Limited context window

When you train the Tabular Fine-Tuning model, we pass data into the LLM's context window repeatedly. All of the data related to a single example (i.e. one record, or all records for event-driven data) must fit inside the context window so that it can get passed in together.

When your data has many columns (>30), a single row can often exceed the context window, especially if any columns contain long free text. Similarly, if you have many events per sequence (>10) for event-driven data, passing in all of those at one time (required to learn the event sequences) can exceed the context window.

It is important to highlight that these are all rough guidelines. If your data has fewer columns, you may be able to fit more events per sequence. If your data has long free text in some columns, you may be limited at far fewer than 30 columns.

If any record (or set of records within a sequence for event-driven data) exceeds the context window, the job will not even start fine-tuning.

Possible configuration and data changes that can help this error:

You can increase the rope_scaling_factor in order to scale up the context window size (integer between 1 and 6; maximum value is 6)
- This can be effective, but note that it typically increases the runtime. Jobs on the free tier are limited to 1 hour per job, so by increasing rope_scaling_factor, it is more likely you will hit the max runtime error.
If you have sequenced data, you can try reducing the number of rows in each sequence (<8-10). Each sequence with all its included rows is assumed as a single example for the LLM. Hence, with more rows, you are more likely to exceed the context window size limit. (Note: Having a high number of columns can make this situation worse!)
You can reduce the number of columns (try <20). In particular, columns with long text tend to eat up a good chunk of the context window and are great candidates for removal.

Max runtime

In this case, the job is stopped in the middle of fine-tuning or inference due to the runtime limit per job (1 hour on the free tier). If you hit this error, you can try to reduce the time the job runs. Let’s see what data criteria causes long job runtimes and how we can prevent this.

Model phases that are time consuming:

Model fine-tuning
Inference (usually due to a low percentage of valid records, inference can become time-consuming by having to generate far more records than the target)

Possible solutions:

Reduce the number of records you are choosing to generate num_records generate parameter. You could try 500 rather than the default of 1000.
Remove any non-critical columns (try to reduce to <20). This can help the model learn the data better, which will typically increase the percentage of valid records and reduce time spent on inference.
If you have a sequenced dataset and a high number of records per sequence, reducing the number of records per sequence could result in more efficient fine-tuning. For example, you could filter out any sequences with >8 events per sequence. This may make it easier for the model to learn the sequences, increasing the percentage of valid records and reducing the time spent on inference.
Experiment with the num_input_records_to_sample parameter. Setting this value too low could make it very difficult for the model to learn the data, which could then lead to a lower percentage of valid records and increase the time spent on inference. However, at some point, there are diminishing returns. The model has learned the data quite well, and now it is spending unnecessary time re-reviewing the records (and possibly overfitting). It is possible there is a better balance point for your dataset than the default, either higher or lower. For example, you could try 5000 records instead of the default of 25000.

While removing unnecessary columns can help to avoid the limited context window and max runtime errors, increasing the number of rows can actually help the model learn better, increasing the percentage of valid records and reducing overall runtime.

PreviousSynthetics NextGretel Text Fine-Tuning

Last updated 20 days ago

Was this helpful?