Gretel-LSTM
Model type: Language Model that supports tabular, time-series, and natural language text data.
The Gretel LSTM is a generative data model that works with any language or character set, and is open-sourced as part of the gretel-synthetics library. It supports advanced features such as conditional data generation (smart seeding) and differentially private learning.

Model creation

This model can be selected using the synthetics model tag. Below is an example configuration that may be used to create a Gretel LSTM model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example to train a model.
The configuration below contains additional options for training a Gretel LSTM model, with the default options displayed.
schema_version: "1.0"

models:
  - synthetics:
      data_source: __tmp__
      params:
        epochs: 100
        batch_size: 64
        vocab_size: 20000
        reset_states: False
        learning_rate: 0.001
        rnn_units: 256
        dropout_rate: 0.2
        field_cluster_size: 20
        early_stopping: True
        gen_temp: 1.0
        predict_batch_size: 64
        validation_split: False
        dp: False
        dp_noise_multiplier: 0.001
        dp_l2_norm_clip: 5.0
        dp_microbatches: 1
These parameters are the same as those available in Gretel's open source gretel-synthetics package. When one of these parameters is null, the default value from the open source package will be used. This helps ensure a similar experience when switching between open source and Gretel Cloud. An example that overrides only a subset of these parameters is shown after the list below.
  • data_source (str, required) - Must point to a valid and accessible file URL in CSV format.
  • epochs (int, optional, defaults to 100) - The number of complete passes the model makes over the training data.
  • batch_size (int, optional, defaults to 64) - Number of samples per gradient update. Using larger batch sizes can help make more efficient use of CPU/GPU parallelization, at the cost of memory.
  • vocab_size (int, optional, defaults to 20000) - The maximum vocabulary size for the tokenizer created by the unsupervised SentencePiece model. Set to 0 to use character-based tokenization.
  • reset_states (bool, optional, defaults to False) - Reset RNN model states between each generation run. This guarantees more consistent dataset creation over time, at the expense of model accuracy.
  • learning_rate (float, optional, defaults to 0.01) - The higher the learning rate, the more that each update during training matters. Note: When training with differential privacy enabled, if the updates are noisy (such as when the additive noise is large compared to the clipping threshold), a low learning rate may help with training.
  • rnn_units (int, optional, defaults to 256) - Positive integer, dimensionality of the output space for LSTM layers.
  • dropout_rate (float, optional, defaults to 0.2) - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs. Using a dropout can help to prevent overfitting by ignoring randomly selected neurons during training. 0.2 (20%) is often used as a good compromise between retaining model accuracy and preventing overfitting.
  • field_cluster_size (int, optional, defaults to 20) - The maximum number of fields (columns) to train per model batch.
  • early_stopping (bool, optional, defaults to True) - Stop training early once the model stops improving. This helps prevent over-fitting and can reduce training time.
  • gen_temp (float, optional, defaults to 1.0) - Controls the randomness of predictions by scaling the logits before applying softmax. Lower temperatures result in more predictable text; higher temperatures result in more surprising text. Experiment to find the best setting.
  • predict_batch_size (int, optional, defaults to 64) - How many records to generate in parallel. Higher values may result in increased throughput. The default of 64 should provide reasonable performance for most users.
  • validation_split (bool, optional, defaults to False) - Use a fraction of the training data as validation data. Use of a validation set is recommended as it helps prevent over-fitting and model memorization. When enabled, 20% of data will be used for validation.
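For example, a minimal sketch of a configuration that overrides only a few of these parameters and leaves the rest at their open source defaults might look like the following (the parameter choices here are illustrative, not recommendations):

schema_version: "1.0"

models:
  - synthetics:
      data_source: __tmp__
      params:
        vocab_size: 0            # use character-based tokenization instead of SentencePiece
        validation_split: True   # hold out 20% of the training data for validation
        dropout_rate: 0.3        # any parameter left unset falls back to the open source default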

Differential privacy

Differential privacy is a framework for measuring the privacy guarantees provided by an algorithm. This implementation is based on differentially private stochastic gradient descent (DP-SGD).
  • dp (bool, optional, defaults to False) - If True, train model with differential privacy enabled. This setting provides assurances that the models will encode general patterns in data rather than facts about specific training examples. These additional guarantees can usefully strengthen the protections offered for sensitive data and content, at a small loss in model accuracy and synthetic data quality. The differential privacy epsilon and delta values will be printed when training completes.
  • dp_noise_multiplier (float, optional, defaults to 0.01) - The amount of noise sampled and added to gradients during training. Generally, more noise results in better privacy, at the expense of model accuracy.
  • dp_l2_norm_clip (float, optional, defaults to 3.0) - The maximum Euclidean (L2) norm to which each gradient is clipped before it is used to update model parameters. This hyperparameter bounds the optimizer’s sensitivity to individual training points.
  • dp_microbatches (int, optional, defaults to 64) - Each batch of data is split into smaller units called micro-batches. Computational overhead can be reduced by increasing the size of micro-batches to include more than one training example. The number of micro-batches should divide evenly into the overall batch_size.
Training with differential privacy enabled provides measurable guarantees of privacy, helping to mitigate the risk of exposing sensitive training data.
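As a sketch, differential privacy is enabled by setting the dp flag in the params block and, optionally, tuning the related parameters; the values below are illustrative rather than recommended settings:

schema_version: "1.0"

models:
  - synthetics:
      data_source: __tmp__
      params:
        dp: True                     # train with DP-SGD
        dp_noise_multiplier: 0.001   # more noise generally means stronger privacy, lower accuracy
        dp_l2_norm_clip: 5.0         # per-gradient clipping threshold
        dp_microbatches: 1           # should divide evenly into batch_size
        learning_rate: 0.001         # a lower learning rate can help when DP updates are noisy

The epsilon and delta values printed when training completes can then be used to reason about the strength of the privacy guarantee.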

Smart seeding

When using conditional data generation (smart seeding), you must provide the field names you wish to use as seeds at model creation time. This is done by specifying a seed task in the model configuration.
Example configuration to enable Smart seeding:
schema_version: "1.0"

models:
  - synthetics:
      task:
        type: seed
        attrs:
          fields:
            - seed_column_X
            - seed_column_Y
            - seed_column_Z

Data generation

Parameters controlling the generation of new records. All Gretel models implement a common interface to generate new data. See the reference command line example for data generation.
schema_version: "1.0"

models:
  - synthetics:
      data_source: "optional_seeds.csv"
      generate:
        num_records: 5000
        max_invalid: 5000
  • generate.num_records (int, optional, defaults to 5000) - The number of records to generate.
  • generate.max_invalid (int, optional) - This is the number of records that can fail the Data Validation process before generation is stopped. This setting helps govern a long running data generation process where the model is not producing optimal data. The default value will be five times the number of records being generated.
  • data_source (str, optional) - Provide seed records in CSV format for conditional data generation, with columns matching those defined in the Smart seeding model configuration (see the example seed file below). This overrides the num_records parameter, generating one record for each seed row in the data_source file. Must point to a valid and accessible file URL in CSV format.
If your training data contains fewer than 5000 records, we recommend setting generate.num_records to null. When this value is null, the number of records generated will be the lesser of 5000 or the total number of records in the training data.
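For example, a hypothetical seed file matching the Smart seeding configuration above (seed columns seed_column_X, seed_column_Y, and seed_column_Z) could look like the following; one synthetic record is generated per seed row, and the values shown are placeholders:

seed_column_X,seed_column_Y,seed_column_Z
value_1,value_A,value_i
value_2,value_B,value_ii
value_3,value_C,value_iii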

Automated validators

The Gretel LSTM model provides automatic semantic data validation when generating synthetic data, which can be configured using the validators tag. When a model is trained, the following validator models are built automatically on a per-field basis:
  • Character Set: For categorical and other string values, the underlying character sets are learned. For example, if the values in a field are all hexadecimal, then the generated data will only contain [0-9a-f] values. This validator is case sensitive.
  • String Length: For categorical and other string values, the maximum and minimum string lengths of a field’s values are learned. When generating data, the generated values will be between the minimum and maximum lengths.
  • Numerical Ranges: For numerical fields, the minimum and maximum values are learned. During generation, numerical values will be between these learned ranges. The following numerical types will be learned and enforced: Float, Base2, Base8, Base10, Base16.
  • Field Data Types: For fields that are entirely integers, strings, and floats, the generated data will be of the data type assigned to the field. If a field has mixed data types, then the field may be one of any of the data types, but the above value-based validators will still be enforced.
The validators above are composed automatically to ensure that individual values in the synthetic data stay within the basic semantic constraints of the training data.

Configurable validators

In addition to the built-in validators, Gretel offers advanced validators that can be managed in the configuration, as shown in the example after this list:
  • in_set_count (int, optional, defaults to 10): This validator accumulates all of the unique values in a field. If the cardinality of the field’s values is less than or equal to this setting, the validator enforces that generated values come from the set of training values; if the cardinality is greater than the setting, the validator has no effect. For example, if there is a field called US-State with a cardinality of 50 and in_set_count is set to 50, each generated value for this field must be one of the original values. If in_set_count were set to only 40, generated values would not be constrained.
  • pattern_count (int, optional, defaults to 10): This validator builds a pattern mask for each value in a field. Alphanumeric characters are masked, while other special characters are retained. For example, 867-5309 is masked to ddd-dddd, and f32-sk-39d is masked to add-aa-dda, where a represents any A-Za-z character and d represents any digit. Much like the previous validator, if the cardinality of learned patterns is less than or equal to this setting, patterns will be enforced during data generation; if the unique pattern count is above the setting, enforcement is skipped.
  • use_numeric_iqr (bool, optional, defaults to True): Enables interquartile range (IQR) based validation for all numeric fields. When enabled, the IQR of each field’s values is calculated and used to validate generated values. This validator is useful when the training data may contain undesirable outliers that skew the minimum and maximum values of a field. Numeric outliers in the synthetic data can impact both quality and privacy, and outlier values can be exploited by membership inference and other adversarial attacks.
  • open_close_chars (string or list of strings, optional, defaults to null): This validator may be used when values contain specific opening and closing characters around other values. For example, if there is a field named "Age" with a value of 143 (Months), the inclusion of the ( and ) characters around the "Months" string should be enforced. This validator will check for and enforce multiple nested open/close characters, and it can check any 2-tuple combination of open/close characters, so a more advanced usage might be to enforce a synthetic value such as Foo [Bar(baz), Fiz(bunch)]. If you wish to use this validator, you have two options:
    • The value for this setting can be set to default, which will automatically look for and enforce the following open/close pairs: "", (), [], {}.
    • A list of strings, where each string must be exactly 2 characters long. In this mode you can define custom open/close characters. Here is an example of using custom open/close chars: open_close_chars: ["()", "[]", "{}"]
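A minimal sketch of managing these validators in the model configuration is shown below. This assumes the validators tag sits alongside params under the synthetics model; the values shown are illustrative:

schema_version: "1.0"

models:
  - synthetics:
      data_source: __tmp__
      validators:
        in_set_count: 10
        pattern_count: 10
        use_numeric_iqr: True
        open_close_chars: ["()", "[]", "{}"]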

Model information

The underlying model used is a Long Short-Term Memory (LSTM) recurrent neural network. This model is initialized from random weights and trained on a dataset as an autoregressive language model, using cross-entropy loss.

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.
  • CPU: Minimum 4 cores, 32GB RAM.
  • GPU (required): Nvidia T4 or a similar CUDA-compliant GPU with 16GB+ RAM is required to train and run this model.

Limitations and Biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture, and likely repeat, any biases that exist in the training data. We recommend having a human review the dataset used to train models before using it in production.