Gretel Navigator Fine Tuning
LLM-based AI system supporting multi-modal data.
Gretel Navigator Fine Tuning (navigator_ft) is an AI system that combines a Large Language Model pre-trained specifically on tabular datasets with learned schema-based rules. It can train on datasets of various sizes (we recommend 10,000 or more records) and generate synthetic datasets with an unlimited number of records.
navigator_ft excels at matching the correlations (both within a single record and across multiple records) and distributions in its training data across multiple tabular modalities, such as numeric, categorical, free text, JSON, and time series values.
navigator_ft is particularly useful when:
Your dataset contains both numerical / categorical data AND free text data
You want to reduce the chance of replaying values from the original dataset, particularly rare values
Your dataset is event-driven, oriented around some column that groups rows into closely related events in a sequence
Model creation
The config below shows all the available training and generation parameters for Navigator Fine Tuning. Leaving all parameters unspecified (we will use defaults) is a good starting point for training on datasets with independent records, while the group_training_examples_by parameter is required to capture correlations across records within a group. The order_training_examples_by parameter is strongly recommended if records within a group follow a logical order, as is the case for time series or sequential events.
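The sketch below is assembled from the parameter descriptions in the next section, using the default values documented there; the surrounding schema_version/name/models wrapper follows Gretel's standard config layout, and the commented stop_params values are purely illustrative.

```yaml
schema_version: "1.0"
name: "navigator-ft"
models:
  - navigator_ft:
      data_source: __tmp__
      group_training_examples_by: null     # optional: column(s) that group related records
      order_training_examples_by: null     # optional: column that orders records within a group
      params:
        num_input_records_to_sample: auto
        batch_size: 1
        gradient_accumulation_steps: 8
        learning_rate: 0.0005
        lr_scheduler: cosine
        warmup_ratio: 0.05
        weight_decay: 0.01
        lora_alpha_over_r: 1.0
        lora_r: 32
        lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
        rope_scaling_factor: 1
        max_sequences_per_example: auto
        use_structured_generation: false
      privacy_params:
        dp: false
        epsilon: 8
        per_sample_max_grad_norm: 0.1
      generate:
        num_records: 5000
        temperature: 0.75
        repetition_penalty: 1.2
        top_p: 1.0
        # stop_params:                     # optional guard against runaway generation jobs
        #   invalid_record_fraction: 0.2   # illustrative value
        #   patience: 3                    # illustrative value
```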
For example, to generate realistic stock prices in the dow_jones_index dataset, we would set group_training_examples_by to "stock" and order_training_examples_by to "date". This ensures that correlations within each stock ticker are maintained across multiple days, and that the daily price and volume fluctuations are reasonable.
Parameter descriptions
data_source (str, required) - Use __tmp__ or point to a valid and accessible file in CSV, JSONL, or Parquet format.
group_training_examples_by (str or list of str, optional) - Column(s) to group training examples by. This is useful when you want the model to learn inter-record correlations for a given grouping of records.
order_training_examples_by (str, optional) - Column to order training examples by. This is useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide group_training_examples_by.
params - Parameters that control the model training process:
  num_input_records_to_sample (int or auto, required, defaults to auto) - This parameter is a proxy for training time. It sets the number of records from the input dataset that the model will see during training. It can be smaller (we downsample), larger (we resample), or the same size as your input dataset. Setting this to the same size as your input dataset is effectively equivalent to training for a single epoch. A good starting value to experiment with is 25,000. If set to auto, we will automatically choose an appropriate value.
  batch_size (int, required, defaults to 1) - The batch size per device for training. We recommend increasing this when differential privacy is enabled; however, if the value is too high, you could get an out-of-memory error. A good size to start with is 8.
  gradient_accumulation_steps (int, required, defaults to 8) - Number of update steps to accumulate gradients over before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory.
  learning_rate (float, required, defaults to 0.0005) - The initial learning rate for the AdamW optimizer.
  lr_scheduler (str, required, defaults to cosine) - The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.
  warmup_ratio (float, required, defaults to 0.05) - Ratio of total training steps used for a linear warmup from 0 to the learning rate.
  weight_decay (float, required, defaults to 0.01) - The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer.
  lora_alpha_over_r (float, required, defaults to 1.0) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2.
  lora_r (int, required, defaults to 32) - The rank of the LoRA update matrices. A lower rank results in smaller update matrices with fewer trainable parameters.
  lora_target_modules (list of str, required, defaults to ["q_proj", "k_proj", "v_proj", "o_proj"]) - The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
  rope_scaling_factor (int, required, defaults to 1) - Scale the base LLM's context length by this factor using RoPE scaling to handle datasets with more columns, or datasets containing groups with more than a few records.
  max_sequences_per_example (int, optional, defaults to auto) - Controls how examples are assembled for training; with the default auto, it is automatically set to a suitable value.
  use_structured_generation (bool, optional, defaults to false) - With differential privacy enabled, the model may have trouble learning the tabular format; turning on structured generation helps produce more valid records.
privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section (a configuration sketch follows this list):
  dp (bool, optional, defaults to false) - Flag to turn on differentially private fine-tuning when a data source is provided.
  epsilon (float, optional, defaults to 8) - Privacy loss parameter for differential privacy. Lower values indicate higher privacy.
  per_sample_max_grad_norm (float, optional, defaults to 0.1) - Clipping norm for per-sample gradients to ensure privacy. For each data sample, the gradient norm (the magnitude of the gradient vector) is calculated; if it exceeds per_sample_max_grad_norm, it is scaled down to this threshold. This ensures that no single sample's gradient contributes more than a set maximum amount to the overall update.
generate - Parameters that control model inference:
  num_records (int, required, defaults to 5000) - Number of records to generate. If you want to generate more than 50,000 records, we recommend breaking the generation job into smaller batches, which you can run in parallel.
  temperature (float, required, defaults to 0.75) - The value used to control the randomness of the generated data. Higher values make the data more random.
  repetition_penalty (float, required, defaults to 1.2) - The value used to control the likelihood of the model repeating the same token.
  top_p (float, required, defaults to 1.0) - The cumulative probability cutoff for sampling tokens.
  stop_params (optional) - Optional mechanism to stop generation if too many invalid records are being created. This helps guard against extremely long generation jobs that are unlikely to produce high-quality data. To turn this mechanism on, you must set both of the following parameters:
    invalid_record_fraction (float, required) - The fraction of invalid records generated by the model that will stop generation once the patience limit is reached.
    patience (int, required) - Number of consecutive generations in which the invalid_record_fraction is reached before stopping generation.
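As referenced above, a minimal sketch of a differentially private setup, combining dp with the larger batch size and structured generation recommended in the parameter descriptions (only the relevant fields are shown):

```yaml
models:
  - navigator_ft:
      data_source: __tmp__
      params:
        batch_size: 8                    # a larger batch size is recommended with DP
        use_structured_generation: true  # helps produce valid records under DP
      privacy_params:
        dp: true
        epsilon: 8                       # lower values give stronger privacy guarantees
        per_sample_max_grad_norm: 0.1
```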
Minimum requirements
If running this system in hybrid mode, the following instance specifications are recommended:
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required): Minimum Nvidia A10G, L4, RTX4090 or better CUDA-compliant GPU with 24GB+ RAM and Ada or newer architecture. For faster training and generation speeds and/or rope_scaling_factor values above 2, we recommend GPUs with 40+GB RAM such as NVIDIA A100 or H100.
Limitations and Biases
The default context length for the underlying model in Navigator Fine Tuning can handle datasets with roughly 50 columns (fewer if modeling inter-row correlations using group_training_examples_by). Similarly, the default context length can handle event-driven data with sequences of up to roughly 20 rows. To go beyond that, increase rope_scaling_factor. Note that the exact threshold (where the job will crash) depends on the number of tokens needed to encode each row, so decreasing the length of column names, abbreviating values, or reducing the number of columns can also help.
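For example, a wide table or longer event sequences could be handled with an override like the sketch below (only the relevant fragment of the params block is shown):

```yaml
params:
  rope_scaling_factor: 2   # doubles the base context length; values above 2 are best paired with a 40GB+ GPU
```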
navigator_ft is a great first option to try for most datasets. However, for unique datasets or needs, other models may be a better fit. For heavily numerical tables, or for use cases requiring 1 million or more records to be generated (navigator_ft can generate batches of up to 130,000 records at a time), we recommend using actgan, which will typically be much faster at generating results in these scenarios. For text-only datasets where you are willing to trade off generation time for an additional quality boost, we recommend using gpt_x. Because the model is an LLM, mappings from the training data often persist in the synthetic output, but there is no guarantee. If you require mappings across columns to persist, we recommend pre-processing to concatenate the columns or post-processing to filter out rows where the mappings did not persist.
Pre-trained models such as the underlying model in Navigator Fine Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.