Gretel-DGAN
Model type: Adversarial Model for time series data.
The Gretel DGAN model API provides access to a generative data model for time-series data. This model supports time-varying features, fixed attributes, and categorical variables, and works well when many example time sequences are available for training.

Model Creation

This model can be selected using the timeseries_dgan model tag. An example configuration is provided below, but note that you will often need to update some of the options to match your input data. The DGAN model supports 2 input formats, wide and long, that we'll explain in detail in the Data format section. These formats and related parameters tell the DGAN model how to parse your data source as time-series. The training data (data source) is a table, for example a csv file, using the common interface to train or fine-tune all Gretel models. See the reference example to train a model.

Data format

The DGAN model will generate synthetic time-series of a particular length, determined by the max_sequence_len parameter. The training examples must also be that same length. As with all machine learning models, the more examples of these sequences are available to train the model, the higher the accuracy. So we have several config parameters to inform the DGAN model how to convert your input csv to many example sequences.
We support 2 data styles to provide time-series data to the DGAN model: long and wide.

Long data style

This is the most versatile data format to use. We assume the input table has 1 time point per row and use the config options to specify attributes, features, etc. For example, stock price data in this format might look like the following table:
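| Date | Sector | Symbol | Open | High | Low | Close | Volume |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2022-06-01 | 0 | AAPL | 125 | 135 | 115 | 126 | 100000 |
| 2022-06-02 | 0 | AAPL | 126 | 140 | 121 | 137 | 500000 |
| ... | 0 | ... | ... | ... | ... | ... | ... |
| 2022-06-30 | 0 | AAPL | 185 | 193 | 170 | 177 | 250000 |
| 2022-06-01 | 1 | V | 222 | 233 | 213 | 214 | 50000 |
| 2022-06-02 | 1 | V | 214 | 217 | 200 | 203 | 75000 |
| ... | 1 | ... | ... | ... | ... | ... | ... |
| 2022-06-30 | 1 | V | 234 | 261 | 212 | 236 | 150000 |
| ... | ... | ... | ... | ... | ... | ... | ... |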
Here, we use each stock (symbol) to split the data into examples. Each example time-series corresponds to max_sequence_len rows in the input. Then each generated example in the synthetic data is like a new stock with a sequence of prices that exhibits similar types of behavior as observed in the training data.
In addition to the price information that changes each day, we also have an attribute, Sector, that is fixed for each example. The model can utilize this if certain sectors' stocks tend to be more volatile than others. In this case, Sector is also a discrete variable, and it must already be ordinal encoded in the input data passed to Gretel's APIs. So 0 might correspond to the technology sector, and 1 to the financial sector. Consider using sklearn's OrdinalEncoder to convert a string column.
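The ordinal encoding described above can be sketched in plain Python. This is a minimal illustration, not Gretel's or sklearn's implementation: the sample rows, the sector labels, and the ordinal_encode helper are hypothetical, and codes are assigned in order of first appearance.

```python
# Hypothetical sample rows; "Sector" holds string labels that must become integers.
rows = [
    {"Symbol": "AAPL", "Sector": "technology"},
    {"Symbol": "V", "Sector": "financial"},
    {"Symbol": "MSFT", "Sector": "technology"},
]


def ordinal_encode(rows, column):
    """Replace the string categories in `column` with integers 0..k-1.

    Codes are assigned in order of first appearance. Returns the encoded
    rows (copies, the input is not modified) and the label-to-code mapping.
    """
    mapping = {}
    for row in rows:
        # len(mapping) is evaluated before insertion, so the first new
        # label gets 0, the second gets 1, and so on.
        mapping.setdefault(row[column], len(mapping))
    encoded = [{**row, column: mapping[row[column]]} for row in rows]
    return encoded, mapping


encoded_rows, sector_codes = ordinal_encode(rows, "Sector")
```

Keep the returned mapping so the integer codes in the synthetic output can be translated back to the original labels.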
Use the following config snippet for this type of setup, updating the column names as needed for your data:
df_style: "long"
example_id_column: "Symbol"
time_column: "Date"
attribute_columns: ["Sector"]
discrete_columns: ["Sector"]
Remember that all training sequences must be the same length, so you should have the same number of rows in the input for each stock symbol.
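Before submitting long-style data, it can help to confirm that every example has exactly max_sequence_len rows. The check below is a small sketch using only the standard library; the find_bad_lengths helper and the sample rows are hypothetical and not part of any Gretel API.

```python
from collections import Counter


def find_bad_lengths(rows, example_id_column, max_sequence_len):
    """Return {example_id: row_count} for examples whose length is wrong."""
    counts = Counter(row[example_id_column] for row in rows)
    return {
        example_id: n
        for example_id, n in counts.items()
        if n != max_sequence_len
    }


# Hypothetical input: AAPL has two rows, V has only one.
rows = [
    {"Symbol": "AAPL", "Date": "2022-06-01"},
    {"Symbol": "AAPL", "Date": "2022-06-02"},
    {"Symbol": "V", "Date": "2022-06-01"},
]

bad = find_bad_lengths(rows, "Symbol", max_sequence_len=2)
```

An empty result means every example matches the configured length; any entry in the result identifies a symbol whose rows need to be padded, trimmed, or dropped before training.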
If there's not a good column to split the data into examples, we support automatic splitting when no example_id_column is provided (though attributes are not supported in this mode). We'll split the input data (after sorting on time_column if provided) into chunks of the required length.
When using the auto splitting feature, note that the generated data will have an additional column, called example_id, with integer values. These values show how you should group the generated data for analyses. Temporal correlations within the same example_id value will match the training data, but any comparisons across different example_id values are not meaningful. So it's not recommended to concatenate all the generated examples into one very long sequence. There will be discontinuities every max_sequence_len rows, because each example is generated independently.
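To make the grouping advice concrete, the sketch below computes day-over-day changes within each example_id and never across the boundary between two examples. The generated rows and both helper functions are hypothetical; only the example_id column name comes from the description above.

```python
from collections import defaultdict

# Hypothetical generated output with two independent examples.
generated = [
    {"example_id": 0, "Close": 10.0},
    {"example_id": 0, "Close": 12.0},
    {"example_id": 1, "Close": 50.0},
    {"example_id": 1, "Close": 47.0},
]


def group_by_example(rows, id_column="example_id"):
    """Collect rows into one list per example, preserving row order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_column]].append(row)
    return dict(groups)


def per_example_changes(rows, value_column):
    """First differences of `value_column`, computed inside each example only."""
    changes = {}
    for example_id, sequence in group_by_example(rows).items():
        values = [row[value_column] for row in sequence]
        changes[example_id] = [b - a for a, b in zip(values, values[1:])]
    return changes


changes = per_example_changes(generated, "Close")
```

Differencing the concatenated column instead would also produce the jump from 12.0 to 50.0, which is an artifact of stacking two independent examples rather than a property of the data.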

Wide data style

The wide data style is an alternative when there is exactly 1 feature (time-varying variable). We assume each example is 1 row in the input table. Let's use just the closing price, but otherwise the same underlying data as in the long data style example above:
| Sector | 2022-06-01 | 2022-06-02 | ... | 2022-06-30 |
| --- | --- | --- | --- | --- |
| 0 | 126 | 137 | ... | 177 |
| 1 | 214 | 203 | ... | 236 |
| ... | ... | ... | ... | ... |
With the sequence being represented as columns, each row is now one training example. Again we have the Sector attribute that is already ordinal encoded. The model doesn't need the Symbol column because no splitting into examples is required, so it should be dropped before sending the data to Gretel. The following config snippet will work with the above input:
df_style: "wide"
attribute_columns: ["Sector"]
discrete_columns: ["Sector"]
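If your data starts out in the long style, a pivot produces the wide layout. The following is a minimal standard-library sketch; the long_to_wide helper is hypothetical, and it assumes the time values sort correctly as strings (true for ISO dates such as 2022-06-01).

```python
def long_to_wide(rows, example_id_column, time_column, value_column, attribute_columns):
    """Pivot long-style rows into one record per example.

    Each output record holds the attribute columns followed by one key per
    time point. The example id column itself is dropped, since the wide
    style does not need it.
    """
    wide = {}
    for row in sorted(rows, key=lambda r: r[time_column]):
        record = wide.setdefault(
            row[example_id_column],
            {column: row[column] for column in attribute_columns},
        )
        record[row[time_column]] = row[value_column]
    return list(wide.values())


# First two days of the long-style example, closing price only.
long_rows = [
    {"Date": "2022-06-01", "Sector": 0, "Symbol": "AAPL", "Close": 126},
    {"Date": "2022-06-02", "Sector": 0, "Symbol": "AAPL", "Close": 137},
    {"Date": "2022-06-01", "Sector": 1, "Symbol": "V", "Close": 214},
    {"Date": "2022-06-02", "Sector": 1, "Symbol": "V", "Close": 203},
]

wide_rows = long_to_wide(long_rows, "Symbol", "Date", "Close", ["Sector"])
```

Each resulting record can be written out as one csv row, with the attribute columns first and the time points as the remaining columns.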

Parameters

Full list of configuration options for the DGAN model.
schema_version: "1.0"
name: dgan-all-params
models:
  - timeseries_dgan:
      data_source: __temp__
      params:
        # Update with length of sequences in your data, or
        # desired length of generated sequences if using
        # auto splitting (example_id_column is null).
        max_sequence_len: 10
        # Update to ensure sample_len evenly divides
        # max_sequence_len.
        sample_len: 1
        attribute_noise_dim: 10
        feature_noise_dim: 32
        attribute_num_layers: 3
        attribute_num_units: 100
        feature_num_layers: 1
        feature_num_units: 100
        use_attribute_discriminator: true
        normalization: 1
        apply_feature_scaling: true
        apply_example_scaling: false
        gradient_penalty_coef: 10
        attribute_gradient_penalty_coef: 10
        attribute_loss_coef: 10
        generator_learning_rate: 0.00001
        discriminator_learning_rate: 0.00001
        attribute_discriminator_learning_rate: 0.00001
        batch_size: 100
        epochs: 5000
        discriminator_rounds: 1
        generator_rounds: 1
        # Update with column names from your data set, see
        # the "Data format" section of the documentation
        # for details on usage.
        time_column: null
        example_id_column: null
        attribute_columns: null
        feature_columns: null
        discrete_columns: null
        df_style: long
Data parameters:
  • df_style (string, required, defaults to 'long') - Either 'wide' or 'long' indicating the format style of the input data.
  • example_id_column (string, optional, defaults to null) - Column name to split data into examples for long style data. Effectively performs a group by operation on this column, and each group becomes an example. If null, the rows are automatically split into training examples based on the max_sequence_len. Note generated synthetic data will contain an example_id column when this automatic splitting is used.
  • attribute_columns (list of strings, optional, defaults to null) - Column names of fixed attributes that do not vary over time for each example sequence. Used by both 'wide' and 'long' formats. If null, the model will not use any attributes. Note that in 'long' format, this column must be constant for each example, so there must be a 1-to-1 mapping from values in the example_id_column and each attribute column. Because of this, auto splitting (when example_id_column is null) does not currently support attribute columns.
  • feature_columns (list of strings, optional, defaults to null) - Column names of features, the variables that vary over time. Used by both 'wide' and 'long' formats. If specified, only these columns will be used for features. If null, then all columns in the input data that are not used in other column parameters will be the features.
  • time_column (string, optional, defaults to null) - Column name of date or time values to sort before creating example sequences in 'long' format. If null, the order from the input data is used. Generated synthetic data will contain this column using an arbitrary set of values from one training example. So if different examples have different time ranges (e.g., because auto splitting was used), one sequence of time values will be used for all synthetic data.
  • discrete_columns (list of strings, optional, defaults to null) - Column names (either attributes or features) to model as categorical variables for DGAN. These must be ordinal (or label) encoded in your input data, so that the values are in [0,1,2,...,k-1] for k categorical values. All attribute and feature columns not listed here are assumed to be continuous.
  • max_sequence_len (int, required) - Length of generated synthetic sequences, length of all training examples. Training requires that all examples are exactly this length.
  • sample_len (int, required) - Number of time points to produce from 1 RNN cell in the generator. Must evenly divide max_sequence_len. When max_sequence_len is smaller (<20), recommended to use sample_len=1. For longer sequences, the model often learns better when max_sequence_len/sample_len is between 10 and 20. Increasing sample_len is also an option if DGAN is running out of memory (receiving sigkill errors from Gretel API) as it should lead to fewer parameters and a smaller memory footprint for the model.
  • data_source (str, required) - Input data, must point to a valid and accessible file URL. Often set automatically by CLI (--in-data) or may use local file with SDK and upload_data_source=True.
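The sample_len guidance above (use 1 for sequences shorter than 20, otherwise aim for max_sequence_len/sample_len between 10 and 20) can be turned into a small helper for listing candidate values. This is a sketch of that rule of thumb only; the sample_len_candidates function is hypothetical and is not part of the Gretel API.

```python
def sample_len_candidates(max_sequence_len, low=10, high=20):
    """List sample_len values that satisfy the documented guidance.

    Every returned value evenly divides max_sequence_len. For short
    sequences (< 20) only 1 is suggested. For longer sequences, divisors
    whose ratio max_sequence_len / sample_len lies in [low, high] are
    preferred; if none exist, all divisors are returned.
    """
    if max_sequence_len < 20:
        return [1]
    divisors = [
        d for d in range(1, max_sequence_len + 1) if max_sequence_len % d == 0
    ]
    preferred = [d for d in divisors if low <= max_sequence_len // d <= high]
    return preferred or divisors


short = sample_len_candidates(10)
medium = sample_len_candidates(30)
longer = sample_len_candidates(100)
```

When several candidates are returned, the larger ones correspond to fewer RNN steps, which the description above associates with a smaller memory footprint.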
Model structure parameters
  • apply_feature_scaling (bool, required, defaults to True) - Automatically scale continuous variables (in attributes or features) to the appropriate range as specified by normalization. If False, the input data must already be scaled to the appropriate range ([-1,1] or [0,1]) or the model will not work.
  • apply_example_scaling (bool, required, defaults to True) - Internally rescale continuous features in each example and model the range for each example. This helps the model learn when different examples have very different value ranges. E.g., for stock price data, there may be penny stocks (prices usually between $0.001 and $1), stocks in the $1-$100 range, others in $100-$1000 range, and the Berkshire Hathaways in the $100,000 to $1,000,000 range.
  • normalization (string, required, defaults to 'MINUSONE_ONE') - Defines internal range of continuous variables. Supported values are 'MINUSONE_ONE', where continuous variables are in [-1,1] and tanh activations are used, and 'ZERO_ONE', where continuous variables are in [0,1] and sigmoid activations are used. Also see apply_feature_scaling.
  • use_attribute_discriminator (bool, required, defaults to True) - Use a second discriminator that only operates on the attributes as part of the GAN. Helps ensure the attribute distributions are accurate. Also see attribute_loss_coef.
  • attribute_noise_dim (int, required, defaults to 10) - Width of noise vector in the GAN generator to create the attributes.
  • feature_noise_dim (int, required, defaults to 10) - Width of noise vector in the GAN generator to create the features.
  • attribute_num_layers (int, required, defaults to 3) - Number of hidden layers in the feed-forward MLP to create attributes in the GAN generator.
  • attribute_num_units (int, required, defaults to 100) - Number of units in each layer of the feed-forward MLP to create attributes in the GAN generator.
  • feature_num_layers (int, required, defaults to 1) - Number of LSTM layers in the RNN to create features in the GAN generator.
  • feature_num_units (int, required, defaults to 100) - Number of units in each LSTM layer to create features in the GAN generator.
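If apply_feature_scaling is set to False, continuous columns must already lie in the range implied by normalization. The sketch below shows one way to do that with min-max scaling. It assumes simple per-column min-max scaling, which may differ from what the model does internally, and the min_max_scale helper is hypothetical.

```python
def min_max_scale(values, normalization="MINUSONE_ONE"):
    """Scale a list of numbers to [-1, 1] or [0, 1] using min-max scaling."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot min-max scale a constant column")
    # Map to [0, 1] first, then stretch to [-1, 1] if requested.
    unit = [(v - lowest) / (highest - lowest) for v in values]
    if normalization == "ZERO_ONE":
        return unit
    if normalization == "MINUSONE_ONE":
        return [2.0 * u - 1.0 for u in unit]
    raise ValueError(f"unsupported normalization: {normalization!r}")


prices = [100, 150, 200]
scaled_symmetric = min_max_scale(prices)
scaled_unit = min_max_scale(prices, normalization="ZERO_ONE")
```

Keep the original minimum and maximum of each column if you scale manually, since they are needed to map generated values back to the original units.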
Training parameters
  • batch_size (int, required, defaults to 1000) - Size of batches for training and generation. Larger values should run faster, so try increasing if training is taking a long time. If batch_size is too large for the model setup, the memory footprint for training may exceed the available RAM and cause crashes (sigkill errors from Gretel API).
  • epochs (int, required, defaults to 400) - Number of epochs to train (iterations through the training data while optimizing parameters).
  • gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss.
  • attribute_gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss for the attribute discriminator (if enabled with use_attribute_discriminator).
  • attribute_loss_coef (float, required, defaults to 1.0) - When use_attribute_discriminator is True, the coefficient on the attribute discriminator loss when combined with the generic discriminator loss. Try increasing this param if attribute discriminator is enabled but the attribute distributions of the generated data do not match the training data.
  • generator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train parameters of the GAN generator.
  • discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train the parameters of the GAN discriminator.
  • attribute_discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train the parameters of the GAN attribute discriminator (if enabled with use_attribute_discriminator).
  • discriminator_rounds (int, required, defaults to 1) - Number of optimization steps of the discriminator(s) to perform for each batch. Some GAN literature mentions using 5 or 10 for this parameter to improve model performance.
  • generator_rounds (int, required, defaults to 1) - Number of optimization steps of the generator to perform for each batch.

Differential privacy

Differential privacy is currently not supported for the Gretel DGAN model.

Smart seeding

Conditional data generation (smart seeding) is currently not supported for the Gretel DGAN model.

Data generation

Sample CLI to generate 1000 additional examples from a trained DGAN model:
gretel models run \
--project <project-name> \
--model-id <model-id> \
--runner cloud \
--param num_records 1000 \
--output .
Also see the reference command line example for data generation.

Model information

The underlying model is DoppelGANger, a generative adversarial network (GAN) specifically constructed for time series data. The model is initialized from random weights and trained on the customer provided dataset using the Wasserstein GAN loss. We use our PyTorch implementation of DoppelGANger in gretelai/gretel-synthetics based on the original paper by Lin et al. Additional details about the model can be found in that paper: http://arxiv.org/abs/1909.13403

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (required): Nvidia T4 or a similar CUDA-compliant GPU with 16GB+ RAM is the recommended minimum to run the DGAN model.

Limitations and biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
As an open beta model, there are several technical limitations:
  • Model training is sometimes unstable, so if you see poor performance, retraining a few times with the same data and config can sometimes lead to notably better results from one run.
  • All training and generated sequences must be the exact same length (max_sequence_len).
  • Discrete variables must already be ordinal encoded, that is, use [0,1,2,...,k-1] for a categorical variable with k distinct values. Consider using sklearn's OrdinalEncoder.
  • Synthetic quality report is not supported.