Gretel Navigator Fine Tuning

LLM-based AI system supporting multi-modal data.

Gretel Navigator Fine Tuning (navigator_ft) is an AI system that combines a large language model pre-trained specifically on tabular datasets with learned, schema-based rules. It can train on datasets of various sizes (we recommend 10,000 or more records) and generate synthetic datasets with an unlimited number of records.

navigator_ft excels at matching the correlations (both within a single record and across multiple records) and distributions in its training data across multiple tabular modalities, such as numeric, categorical, free text, JSON, and time series values.

navigator_ft is particularly useful when:

  1. Your dataset contains both numerical / categorical data AND free text data

  2. You want to reduce the chance of replaying values from the original dataset, particularly rare values

  3. Your dataset is event-driven, oriented around some column that groups rows into closely related events in a sequence

Model creation

The config below shows all the available training and generation parameters for Navigator Fine Tuning. Leaving all parameters unspecified (we will use defaults) is a good starting point for training on datasets with independent records, while the group_training_examples_by parameter is required to capture correlations across records within a group. The order_training_examples_by parameter is strongly recommended if records within a group follow a logical order, as is the case for time series or sequential events.

For example, to generate realistic stock prices in the dow_jones_index dataset, we would set group_training_examples_by to "stock" and order_training_examples_by to "date". This ensures that correlations within each stock ticker are maintained across multiple days and that the daily price and volume fluctuations are reasonable; a minimal snippet with these two overrides follows the full config below.

schema_version: "1.0"
name: "navigator_ft"
models:
  - navigator_ft:
      data_source: __tmp__
      # Optionally group records by the column(s) set below.
      # This is useful if you need to maintain correlations  
      # across multiple records. Otherwise, the model training 
      # assumes the records are independent.
      group_training_examples_by: null
      # Optionally order records by the column set below.
      # This is useful if your records are sequential.
      # Note that this parameter can only be used when 
      # your records are grouped using the above parameter.
      order_training_examples_by: null
        
      params:
        # The parameter below is a proxy for training time.
        # If set to 'auto', we will automatically choose an
        # appropriate value. An integer value will set the
        # number of records from the input dataset that the
        # model will see during training. It can be smaller
        # (we downsample), larger (we resample), or the same
        # size as your input dataset. A starting value to
        # experiment with is 25,000.
        num_input_records_to_sample: auto
        batch_size: 1
        gradient_accumulation_steps: 8
        learning_rate: 0.0005
        lr_scheduler: cosine
        warmup_ratio: 0.05
        weight_decay: 0.01
        lora_alpha_over_r: 1
        lora_r: 32
        lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
        rope_scaling_factor: 1        
        max_sequences_per_example: auto
        use_structured_generation: false
      
      privacy_params:
        dp: false
        epsilon: 8.0
        per_sample_max_grad_norm: 1.0 
        
      generate:
        num_records: 5000
        temperature: 0.75
        repetition_penalty: 1.2      
        top_p: 1.0
        stop_params: null
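
For the dow_jones_index example above, a minimal sketch of the grouping and ordering overrides would look like the following (all other parameters are left at their defaults; only the relevant fields are shown):

schema_version: "1.0"
name: "navigator_ft"
models:
  - navigator_ft:
      data_source: __tmp__
      # Maintain correlations within each stock ticker across days.
      group_training_examples_by: "stock"
      # Order each ticker's records chronologically.
      order_training_examples_by: "date"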

Parameter descriptions

  • data_source (str, required) - Either __tmp__ or a path to a valid and accessible file in CSV, JSONL, or Parquet format.

  • group_training_examples_by (str or list of str, optional) - Column(s) to group training examples by. This is useful when you want the model to learn inter-record correlations for a given grouping of records.

  • order_training_examples_by (str, optional) - Column to order training examples by. This is useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide group_training_examples_by.

  • params - Parameters that control the model training process:

    • num_input_records_to_sample (int or auto, required, defaults to auto) - This parameter is a proxy for training time. It sets the number of records from the input dataset that the model will see during training. It can be smaller (we downsample), larger (we resample), or the same size as your input dataset. Setting this to the same size as your input dataset is effectively equivalent to training for a single epoch. A starting value to experiment with is 25,000. If set to auto, we will automatically choose an appropriate value.

    • batch_size (int, required, defaults to 1) - The batch size per device for training. We recommend increasing this when differential privacy is enabled; a good size to start with is 8. However, if the value is too high, you could run into an out-of-memory error.

    • gradient_accumulation_steps (int, required, defaults to 8) - Number of update steps to accumulate the gradients for before performing a backward/update pass. This technique increases the effective batch size without requiring additional GPU memory.

    • learning_rate (float, required, defaults to 0.0005) - The initial learning rate for the AdamW optimizer.

    • lr_scheduler (str, required, defaults to cosine) - The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.

    • warmup_ratio (float, required, defaults to 0.05) - Ratio of total training steps used for a linear warmup from 0 to the learning rate.

    • weight_decay (float, required, defaults to 0.01) - The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the AdamW optimizer.

    • lora_alpha_over_r (float, required, defaults to 1.0) - The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2.

    • lora_r (int, required, defaults to 32) - The rank of the LoRA update matrices. A lower rank results in smaller update matrices with fewer trainable parameters.

    • lora_target_modules (list of str, required, defaults to ["q_proj", "k_proj", "v_proj", "o_proj"]) - The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.

    • rope_scaling_factor (int, required, defaults to 1) - Scale the base LLM's context length by this factor using RoPE scaling to handle datasets with more columns, or datasets containing groups with more than a few records.

    • max_sequences_per_example (int or auto, optional, defaults to auto) - Controls how training examples are assembled. When set to auto (the default), an appropriate value is chosen automatically.

    • use_structured_generation (bool, optional, defaults to false) - When differential privacy is enabled, the model can struggle to learn the tabular format; enabling structured generation helps it produce more valid records.

  • privacy_params - To fine-tune on a privacy-sensitive data source with differential privacy, use the parameters in this section (a combined example appears after this parameter list).

    • dp (bool, optional, default false) - Flag to turn on differentially private fine tuning when a data source is provided.

    • epsilon (float, optional, default 8) - Privacy loss parameter for differential privacy. Lower values indicate higher privacy.

    • per_sample_max_grad_norm (float, optional, default 1.0) - Clipping norm for gradients per sample to ensure privacy. For each data sample, the gradient norm (magnitude of the gradient vector) is calculated. If it exceeds per_sample_max_grad_norm, it is scaled down to this threshold. This ensures that no single sample’s gradient contributes more than a set maximum amount to the overall update.

  • generate - Parameters that control model inference:

    • num_records (int, required, defaults to 5000) - Number of records to generate. If you want to generate more than 50,000 records, we recommend breaking the generation job into smaller batches, which you can run in parallel.

    • temperature (float, required, defaults to 0.75) - The value used to control the randomness of the generated data. Higher values make the data more random.

    • repetition_penalty (float, required, defaults to 1.2) - The value used to control the likelihood of the model repeating the same token.

    • top_p (float, required, defaults to 1.0) - The cumulative probability cutoff for sampling tokens.

    • stop_params (optional) - Optional mechanism to stop generation if too many invalid records are being created. This helps guard against extremely long generation jobs that are unlikely to produce high-quality data. To turn this mechanism on, you must set both of the following parameters (see the example after this parameter list):

      • invalid_record_fraction (float, required) - The fraction of invalid records that must be reached in a single generation for that generation to count toward the patience limit.

      • patience (int, required) - Number of consecutive generations where the invalid_record_fraction is reached before stopping generation.
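
As a hedged sketch of how the differential privacy and stop_params options described above could be enabled together, the overrides might look like the following; the batch_size, invalid_record_fraction, and patience values here are illustrative starting points rather than recommendations:

schema_version: "1.0"
name: "navigator_ft"
models:
  - navigator_ft:
      data_source: __tmp__
      params:
        # Larger batches are recommended when differential privacy is enabled.
        batch_size: 8
      privacy_params:
        dp: true
        epsilon: 8.0
        per_sample_max_grad_norm: 1.0
      generate:
        num_records: 5000
        stop_params:
          # Stop if at least half of the generated records are invalid
          # for three consecutive generations (illustrative values).
          invalid_record_fraction: 0.5
          patience: 3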

Minimum requirements

If running this system in hybrid mode, the following instance specifications are recommended:

CPU: Minimum 4 cores, 32GB RAM.

GPU (Required): Minimum NVIDIA A10G, L4, RTX 4090, or a better CUDA-compliant GPU with 24GB+ RAM and Ampere or newer architecture. For faster training and generation speeds and/or rope_scaling_factor values above 2, we recommend GPUs with 40GB+ RAM, such as the NVIDIA A100 or H100.

Limitations and Biases

  1. The default context length for the underlying model in Navigator Fine Tuning can handle datasets with roughly 50 columns (fewer if modeling inter-row correlations using group_training_examples_by). Similarly, the default context length can handle event-driven data with sequences of up to roughly 20 rows. To go beyond these limits, increase rope_scaling_factor (see the sketch after this list). Note that the exact threshold (where the job will crash) depends on the number of tokens needed to encode each row, so shortening column names, abbreviating values, or reducing the number of columns can also help.

  2. navigator_ft is a great first option to try for most datasets. However, for unique datasets or needs, other models may be a better fit. For heavily numerical tables or use cases requiring 1 million records or more to be generated (navigator_ft can generate batches of up to 130,000 records at a time), we recommend using actgan. It will typically be much faster at generating results in these scenarios. For text-only datasets where you are willing to trade off generation time for an additional quality boost, we recommend using gpt_x.

  3. Given the model is an LLM, mappings from the training data often persist in the synthetic output, but there is no guarantee. If you require mappings across columns to persist, we recommend doing pre-processing to concatenate the columns or post-processing to filter out rows where the mappings did not persist.

  4. Pre-trained models such as the underlying model in Navigator Fine Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
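
If a job runs into the context-length limits described in the first limitation above, one option is to raise rope_scaling_factor in the params block. The snippet below is an illustrative sketch; the value of 2 is only an example, and, as noted under Minimum requirements, values above 2 are best paired with a 40GB+ GPU:

params:
  # Double the base context length to accommodate wider tables
  # or longer event sequences within a group.
  rope_scaling_factor: 2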
