Model Configuration
Learn how to create and modify a synthetic data model configuration before model training to support different data types and privacy protections.

Synthetic Model Configuration

To train a synthetic data model, we can add the models object to our configuration. The models object takes a list of keyed objects, named by the type of model we wish to train. For a synthetic data model, we use synthetics. The minimal configuration required is below:
schema_version: "1.0"
name: "my-awesome-model"
models:
  - synthetics:
      data_source: gretel_a63772a737c4412f9314fb998fa480e2_foo.csv
It is assumed that a project artifact was already uploaded for this particular configuration, but data_source can be any valid URL that is accessible by the client. By default, no extra objects or parameters are required. Gretel uses default settings that will work well for a variety of datasets.
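As a sketch, the data_source could instead point to any file reachable by the client over HTTP. The URL below is a hypothetical placeholder, not a real dataset:

```yaml
schema_version: "1.0"
name: "my-awesome-model"
models:
  - synthetics:
      # Any URL accessible by the client works as a data source.
      # This URL is a hypothetical example.
      data_source: https://example.com/datasets/my-training-data.csv
```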
Next, let’s explore the additional parameters that can be used. The configuration below contains additional options for training a synthetic model, with the default options displayed. Learn more about the hyper-parameter settings in our open-source docs.
schema_version: "1.0"
models:
  - synthetics:
      data_source: __tmp__
      params:
        epochs: 100
        batch_size: 64
        vocab_size: 20000
        reset_states: False
        learning_rate: 0.001
        rnn_units: 256
        dropout_rate: 0.2
        field_delimiter: null
        field_cluster_size: 20
        overwrite: True
        early_stopping: True
        gen_temp: 1.0
        predict_batch_size: 64
        validation_split: False
        dp: False
        dp_noise_multiplier: 0.001
        dp_l2_norm_clip: 5.0
        dp_microbatches: 1
      validators:
        in_set_count: 10
        pattern_count: 10
      generate:
        num_records: 5000
        max_invalid: 5000
      privacy_filters:
        outliers: medium
        similarity: medium
There are three primary sections here to be aware of: params, validators, and generate. Each is covered below.

Synthetic Model Parameters

The params object contains key-value pairs that represent the available parameters that will be used to train a machine learning model on the data_source. By default, Gretel will start with 100 epochs and automatically stop training when attributes like model loss and accuracy stop improving.
The field_delimiter parameter is a single character that serves as the delimiter between fields in your training data. If this value is null (the default), then Gretel will automatically detect and use a delimiter.
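As a sketch, if your training data uses pipe-delimited fields, you could set the delimiter explicitly rather than relying on auto-detection:

```yaml
models:
  - synthetics:
      data_source: __tmp__
      params:
        # Override automatic delimiter detection for pipe-delimited data
        field_delimiter: "|"
```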
These params are the same params available in Gretel’s Open Source synthetic package. When one of these parameters is null, the default value from the Open Source package will be used. This helps ensure a similar experience when switching between open source and Gretel Cloud.
Please see our open source documentation for a description of the other parameters.
The default model parameters are well suited for a variety of data. However, if your synthetic quality score or data quality is not optimal, please see our deep dives on parameter tuning and data pre-processing to increase performance.
Also, Gretel has configuration templates that may be helpful as starting points for creating your model.

Data Validators

Gretel provides automatic data semantic validation when generating synthetic data. When a model is trained, the following validator models are built, automatically, on a per-field basis:
  • Character Set: For categorical and other string values, the underlying character sets are learned. For example, if the values in a field are all hexadecimal, then the generated data will only contain [0-9a-f] values. This validator is case sensitive.
  • String Length: For categorical and other string values, the maximum and minimum string lengths of a field’s values are learned. When generating data, the generated values will be between the minimum and maximum lengths.
  • Numerical Ranges: For numerical fields, the minimum and maximum values are learned. During generation, numerical values will be between these learned ranges. The following numerical types will be learned and enforced: Float, Base2, Base8, Base10, Base16.
  • Field Data Types: For fields that are entirely integers, strings, and floats, the generated data will be of the data type assigned to the field. If a field has mixed data types, then the field may be one of any of the data types, but the above value-based validators will still be enforced.
The validators above are automatic. They ensure that individual values in the synthetic data stay within the basic semantic constraints of the training data.
In addition to the built-in validators, Gretel offers advanced validators that can be managed in the configuration:
  • in_set_count: This validator accumulates all of the unique values in a field. If the cardinality of the field’s values is less than or equal to this setting, the validator will enforce that generated values fall within the set of training values. If the cardinality of the field’s values is greater than the setting, the validator will have no effect.
    • For example, if a field called US-State has a cardinality of 50 and in_set_count is set to 50, then during generation each value for this field must be one of the original values. If in_set_count were set to only 40, generated values would not be constrained.
  • pattern_count: This validator builds a pattern mask for each value in a field. Alphanumeric characters are masked, while other special characters are retained. For example, 867-5309 is masked to ddd-dddd, and f32-sk-39d is masked to add-aa-dda, where a represents any A-Za-z character and d represents any digit. Much like the previous validator, if the cardinality of learned patterns is less than or equal to the setting, patterns will be enforced during data generation. If the unique pattern count exceeds the setting, enforcement will be skipped.
By default, both of the above settings are set to 10.
Very high settings for these validators will require higher memory usage.
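For example, to apply both validators to fields with higher cardinality (at the cost of additional memory), you might raise the thresholds from their defaults of 10. The values below are illustrative:

```yaml
models:
  - synthetics:
      data_source: __tmp__
      validators:
        # Enforce value membership for fields with up to 60 unique values
        in_set_count: 60
        # Enforce pattern masks for fields with up to 25 unique patterns
        pattern_count: 25
```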
  • use_numeric_iqr: When set to true, it enables IQR-based validation for all numeric fields. When enabled, it calculates the IQR for values in the field and uses that range to validate generated values.
    • This validator is useful when the training data may contain undesirable outliers that skew the min and max values in a field. Numeric outliers in the synthetic data can impact both quality and privacy. Outlier values can be exploited by Membership Inference and other adversarial attacks.
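A sketch of enabling IQR-based numeric validation alongside the default validator settings:

```yaml
models:
  - synthetics:
      data_source: __tmp__
      validators:
        in_set_count: 10
        pattern_count: 10
        # Constrain generated numeric values to the interquartile range
        # learned from the training data, suppressing outlier values
        use_numeric_iqr: true
```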

Data Generation

After a synthetic data model is trained, a synthetic dataset will be created. This dataset is used to generate the Synthetic Data Report and to create a sample synthetic dataset that you can explore on your own.
The number of records to generate is controlled by the generate.num_records key in the synthetic config. The default value is 5000 records.
Additionally, you may specify a generate.max_invalid setting. This is the number of records that can fail the data validation process before generation is stopped. This setting helps govern a long-running data generation process where the model is not producing optimal data. The default value will be five times the number of records being generated.
If your training data is less than 5000 records, we recommend setting the generate.num_records value to null. If this value is null, then the number of records generated will be the lesser of 5000 or the total number of records in the training data.
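Putting these together, a generate section for a smaller training set might look like the following sketch:

```yaml
models:
  - synthetics:
      data_source: __tmp__
      generate:
        # null: generate the lesser of 5000 or the training record count
        num_records: null
        # Stop generation if this many records fail data validation
        max_invalid: 5000
```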
Once your model is trained, you can retrieve the model artifacts (synthetic report, model archive, and sample data) and schedule the generation of larger datasets.