Gretel DGAN
Adversarial model for time series data.
The Gretel DGAN model API provides access to a generative model for time-series data. The model supports time-varying features, fixed attributes, and categorical variables, and works best when many example sequences are available for training.
This model can be selected using the `timeseries_dgan` model tag. An example configuration is provided below, but note that you will often need to update some of the options to match your input data. The DGAN model supports 2 input formats, wide and long, that we'll explain in detail in the Data format section. These formats and related parameters tell the DGAN model how to parse your data source as time series. The training data (data source) is a table, for example a CSV file, provided via the common interface used to train or fine-tune all Gretel models. See the reference example for how to train a model.
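For orientation, here is a minimal sketch of what such a configuration can look like. The option values are illustrative, and the exact nesting of options may differ slightly between Gretel releases, so treat a current blueprint as authoritative:

```yaml
# Minimal DGAN config sketch (illustrative values; verify nesting
# against a current Gretel blueprint).
schema_version: "1.0"
models:
  - timeseries_dgan:
      data_source: __tmp__      # replaced by --in-data (CLI) or the SDK
      params:
        df_style: long          # or "wide"; see the Data format section
        max_sequence_len: 30    # length of each training example sequence
        sample_len: 3           # must evenly divide max_sequence_len
        epochs: 400
```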
The DGAN model will generate synthetic time series of a particular length, determined by the `max_sequence_len` parameter, and the training examples must be that same length. As with all machine learning models, the more example sequences that are available to train on, the better the model's performance. Several config parameters tell the DGAN model how to convert your input CSV into these training example sequences.
We support 2 data styles for providing time-series data to the DGAN model: long and wide.
Long data format
This is the most versatile data format to use. We assume the input table has 1 time point per row and use the config options to specify attributes, features, etc. For example, stock price data in this format might look like the following table:
Date | Sector | Symbol | Open | High | Low | Close | Volume |
---|---|---|---|---|---|---|---|
2022-06-01 | 0 | AAPL | 125 | 135 | 115 | 126 | 100000 |
2022-06-02 | 0 | AAPL | 126 | 140 | 121 | 137 | 500000 |
... | 0 | ... | ... | ... | ... | ... | ... |
2022-06-30 | 0 | AAPL | 185 | 193 | 170 | 177 | 250000 |
2022-06-01 | 1 | V | 222 | 233 | 213 | 214 | 50000 |
2022-06-02 | 1 | V | 214 | 217 | 200 | 203 | 75000 |
... | 1 | ... | ... | ... | ... | ... | ... |
2022-06-30 | 1 | V | 234 | 261 | 212 | 236 | 150000 |
... | ... | ... | ... | ... | ... | ... | ... |
Here, we use each stock (symbol) to split the data into examples. Each example time series corresponds to `max_sequence_len` rows in the input. Each generated example in the synthetic data is then like a new stock, with a sequence of prices that exhibits similar types of behavior as observed in the training data.
In addition to the price information that changes each day, we also have an attribute, Sector, that is fixed for each example. The model can take advantage of this if, for example, certain sectors' stocks tend to be more volatile than others. In this case, Sector is also a discrete variable, and it must already be ordinal encoded in the input data passed to Gretel's APIs. So 0 might correspond to the technology sector, and 1 to the financial sector. Consider using sklearn's OrdinalEncoder to convert a string column, as sketched below.
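A minimal pre-processing sketch with scikit-learn, assuming the raw data lives in a hypothetical stocks.csv with a string Sector column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Read the raw data; "stocks.csv" and the column name are illustrative.
df = pd.read_csv("stocks.csv")

# Replace string sector labels (e.g., "technology", "financial")
# with integer codes (0, 1, ...) before sending the data to Gretel.
df[["Sector"]] = OrdinalEncoder().fit_transform(df[["Sector"]])
df.to_csv("stocks_encoded.csv", index=False)
```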
Use the following config snippet for this type of setup, updating the column names as needed for your data:
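This sketch is based on the stock table above; the values are illustrative, and the exact nesting of options may vary by release:

```yaml
# Long-format data options for the stock example (illustrative).
models:
  - timeseries_dgan:
      data_source: __tmp__
      params:
        df_style: long
        time_column: Date
        example_id_column: Symbol
        attribute_columns: [Sector]
        discrete_columns: [Sector]
        feature_columns: [Open, High, Low, Close, Volume]
        max_sequence_len: 30   # 30 daily rows per symbol in this example
        sample_len: 3          # 30 / 3 = 10 RNN steps per sequence
```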
If there's not a good column to split the data into examples, we support automatic splitting when no `example_id_column` is provided (though attributes are not supported in this mode). We'll split the input data (after sorting on `time_column`, if provided) into chunks of the required length.
When using the auto splitting feature, note that the generated data will have an additional integer column called `example_id`. These values show how you should group the generated data for analyses: temporal correlations within the same `example_id` value will match the training data, but any comparisons across different `example_id` values are not meaningful. So it's not recommended to concatenate all the generated examples into one very long sequence; there will be discontinuities every `max_sequence_len` rows, because each example is generated independently. A short per-example analysis sketch follows.
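For instance, a short pandas sketch (the file name and Close column are hypothetical) that analyzes each generated sequence separately:

```python
import pandas as pd

# "synthetic.csv" stands in for the generated output produced when
# auto splitting was used.
synth = pd.read_csv("synthetic.csv")

# Only within-example temporal structure is meaningful, so compute
# statistics per example_id rather than over the concatenated frame.
for example_id, seq in synth.groupby("example_id"):
    print(example_id, len(seq), seq["Close"].mean())
```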
When using the long data style, variable sequence lengths are supported. So, when the number of rows in the input varies per stock symbol, the data must be supplied in long format. The wide data style (described below) is not compatible with modeling variable sequence lengths.
Wide data format
An alternative data style for when there is exactly 1 feature (time-varying variable). We assume each example is 1 row in the input table. Let's use just the closing price, but otherwise the same underlying data as in the long data style example above:
Sector | 2022-06-01 | 2022-06-02 | ... | 2022-06-30 |
---|---|---|---|---|
0 | 126 | 137 | ... | 177 |
1 | 213 | 203 | ... | 236 |
... | ... | ... | ... | ... |
With the sequence represented as columns, each row is now one training example. Again we have the Sector attribute, already ordinal encoded. The model doesn't need the Symbol column because no splitting into examples is required, so it should be dropped before sending the data to Gretel. The following config snippet will work with the above input:
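This sketch mirrors the wide table above; values are illustrative, and the exact nesting of options may vary by release:

```yaml
# Wide-format data options for the closing-price example (illustrative).
models:
  - timeseries_dgan:
      data_source: __tmp__
      params:
        df_style: wide
        attribute_columns: [Sector]
        discrete_columns: [Sector]
        max_sequence_len: 30   # one value column per date, June 1-30
        sample_len: 3
```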
Full list of configuration options for the DGAN model.
Data parameters
`df_style` (string, required, defaults to `'long'`) - Either `'wide'` or `'long'`, indicating the format style of the input data.
`example_id_column` (string, optional, defaults to null) - Column name used to split data into examples for long style data. Effectively performs a group-by operation on this column, and each group becomes an example. If null, the rows are automatically split into training examples based on `max_sequence_len`. Note that the generated synthetic data will contain an `example_id` column when this automatic splitting is used.
`attribute_columns` (list of strings, optional, defaults to null) - Column names of fixed attributes that do not vary over time for each example sequence. Used by both `'wide'` and `'long'` formats. If null, the model will not use any attributes. Note that in `'long'` format, each attribute column must be constant within each example, so there must be a 1-to-1 mapping from values in the `example_id_column` to values in each attribute column. Because of this, auto splitting (when `example_id_column` is null) does not currently support attribute columns.
`feature_columns` (list of strings, optional, defaults to null) - Column names of features, the variables that vary over time. Used by both `'wide'` and `'long'` formats. If specified, only these columns will be used as features. If null, all columns in the input data that are not used in other column parameters will be the features.
`time_column` (string, optional, defaults to null) - Column name of date or time values to sort by before creating example sequences in `'long'` format. If `time_column='auto'`, a column that looks like a date or time will be selected automatically. If null, the order from the input data is used. Generated synthetic data will contain this column, using an arbitrary set of values from one training example; so if different examples have different time ranges (e.g., because auto splitting was used), one sequence of time values will be used for all synthetic data.
`discrete_columns` (list of strings, optional, defaults to null) - Column names (either attributes or features) to model as categorical variables. DGAN will automatically model any string-typed columns as categorical variables, in addition to the columns explicitly listed here.
`max_sequence_len` (int, required) - Maximum length of generated synthetic sequences and training example sequences. Sequences may be of variable length (i.e., some sequences may be shorter than `max_sequence_len`), and synthetic sequences will follow a similar pattern of lengths as the training data. To have DGAN automatically choose a good `max_sequence_len` and `sample_len` based on the training data (when `example_id_column` is provided), set both `max_sequence_len` and `sample_len` to `auto` (see the sketch after this list).
`sample_len` (int, required) - Number of time points produced by 1 RNN cell in the generator. Must evenly divide `max_sequence_len`. When `max_sequence_len` is small (<20), `sample_len=1` is recommended. For longer sequences, the model often learns better when `max_sequence_len/sample_len` is between 10 and 20. Increasing `sample_len` is also an option if DGAN is running out of memory (sigkill errors from the Gretel API), as it should lead to fewer parameters and a smaller memory footprint for the model. If using `max_sequence_len: auto`, then `sample_len` can also be set to `auto`.
`data_source` (str, required) - Input data; must point to a valid and accessible file URL. Often set automatically by the CLI (`--in-data`), or a local file may be used with the SDK and `upload_data_source=True`.
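As referenced above, a sketch of the automatic mode, which slots into the `params` block of a full config (column names reuse the stock example; nesting may vary by release):

```yaml
# Let DGAN pick sequence handling from the training data (illustrative).
params:
  df_style: long
  time_column: Date
  example_id_column: Symbol   # required for max_sequence_len: auto
  max_sequence_len: auto
  sample_len: auto
```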
Model structure parameters
`apply_feature_scaling` (bool, required, defaults to True) - Automatically scale continuous variables (in attributes or features) to the appropriate range as specified by `normalization`. If False, the input data must already be scaled to the appropriate range (`[-1,1]` or `[0,1]`), or the model will not work.
`apply_example_scaling` (bool, required, defaults to True) - Internally rescale continuous features in each example and model the range for each example. This helps the model learn when different examples have very different value ranges; e.g., for stock price data, there may be penny stocks (prices usually between $0.001 and $1), stocks in the $1-$100 range, others in the $100-$1000 range, and the Berkshire Hathaways in the $100,000 to $1,000,000 range.
`normalization` (string, required, defaults to `'MINUSONE_ONE'`) - Defines the internal range of continuous variables. Supported values are `'MINUSONE_ONE'`, where continuous variables are in `[-1,1]` and tanh activations are used, and `'ZERO_ONE'`, where continuous variables are in `[0,1]` and sigmoid activations are used. Also see `apply_feature_scaling`.
`use_attribute_discriminator` (bool, required, defaults to True) - Use a second discriminator that only operates on the attributes as part of the GAN. Helps ensure the attribute distributions are accurate. Also see `attribute_loss_coef`.
`attribute_noise_dim` (int, required, defaults to 10) - Width of the noise vector used by the GAN generator to create the attributes.
`feature_noise_dim` (int, required, defaults to 10) - Width of the noise vector used by the GAN generator to create the features.
`attribute_num_layers` (int, required, defaults to 3) - Number of hidden layers in the feed-forward MLP that creates attributes in the GAN generator.
`attribute_num_units` (int, required, defaults to 100) - Number of units in each layer of the feed-forward MLP that creates attributes in the GAN generator.
`feature_num_layers` (int, required, defaults to 1) - Number of LSTM layers in the RNN that creates features in the GAN generator.
`feature_num_units` (int, required, defaults to 100) - Number of units in each LSTM layer that creates features in the GAN generator.
Training parameters
`batch_size` (int, required, defaults to 1000) - Size of batches for training and generation. Larger values should run faster, so try increasing this if training is taking a long time. If `batch_size` is too large for the model setup, the memory footprint of training may exceed available RAM and cause crashes (sigkill errors from the Gretel API).
`epochs` (int, required, defaults to 400) - Number of epochs to train (iterations through the training data while optimizing parameters).
`gradient_penalty_coef` (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss.
`attribute_gradient_penalty_coef` (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss of the attribute discriminator (if enabled with `use_attribute_discriminator`).
`attribute_loss_coef` (float, required, defaults to 1.0) - When `use_attribute_discriminator` is True, the coefficient on the attribute discriminator loss when combined with the generic discriminator loss. Try increasing this parameter if the attribute discriminator is enabled but the attribute distributions of the generated data do not match the training data.
`generator_learning_rate` (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN generator.
`discriminator_learning_rate` (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN discriminator.
`attribute_discriminator_learning_rate` (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN attribute discriminator (if enabled with `use_attribute_discriminator`).
`discriminator_rounds` (int, required, defaults to 1) - Number of optimization steps of the discriminator(s) to perform for each batch. Some GAN literature uses 5 or 10 for this parameter to improve model performance.
`generator_rounds` (int, required, defaults to 1) - Number of optimization steps of the generator to perform for each batch.
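To illustrate how the training knobs above combine, here is a hedged sketch of a `params` fragment (defaults shown except for `discriminator_rounds`; nesting may vary by release):

```yaml
# Training-parameter fragment for a DGAN config (illustrative).
params:
  epochs: 400
  batch_size: 1000
  generator_learning_rate: 0.001
  discriminator_learning_rate: 0.001
  discriminator_rounds: 5   # some GAN literature uses 5 or 10 here
  generator_rounds: 1
```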
Differential privacy is currently not supported for the Gretel DGAN model.
Conditional data generation (smart seeding) is currently not supported for the Gretel DGAN model.
Sample CLI to generate 1000 additional examples from a trained DGAN model:
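The following is a sketch using the general `gretel models run` interface; the model ID is a placeholder, and exact flag names may differ across CLI versions:

```bash
# Generate 1000 synthetic records from a previously trained model.
gretel models run \
  --model-id <your_model_id> \
  --runner cloud \
  --param num_records 1000 \
  --output .
```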
The underlying model is DoppelGANger, a generative adversarial network (GAN) constructed specifically for time-series data. The model is initialized from random weights and trained on the provided dataset using the Wasserstein GAN loss. We use our PyTorch implementation of DoppelGANger in gretelai/gretel-synthetics, based on the original paper by Lin et al. Additional details about the model can be found in the paper: http://arxiv.org/abs/1909.13403
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: minimum 4 cores, 32GB RAM.
GPU (required): an NVIDIA T4 or similar CUDA-compliant GPU with 16GB+ RAM is recommended to run the DGAN model.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using them in production.
As an open beta model, there are several technical limitations:
Model training is sometimes unstable. If you see poor performance, retraining a few times with the same data and config can sometimes lead to notably better results from one run.
In wide format, all training and generated sequences must be exactly `max_sequence_len` long; variable-length sequences are supported only via the long data format.
Synthetic quality report is not supported.
DGAN does not model missing data (NaNs) for continuous variables. DGAN will handle some NaNs in the input data by replacing missing values via interpolation; however, if there are too many missing values, the model may not have enough data or examples to train on and will throw an error. NaN or missing values will never be generated for continuous variables. (This does not apply to categorical variables, where missing values are fully supported and modeled as just another category.)
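If your data has many gaps, a simple pandas pre-processing sketch (the file and column names are hypothetical) to fill continuous columns before training:

```python
import pandas as pd

df = pd.read_csv("stocks.csv")

# Linearly interpolate interior gaps and fill leading/trailing NaNs,
# so DGAN has complete continuous features to train on.
df["Close"] = df["Close"].interpolate(limit_direction="both")
df.to_csv("stocks_filled.csv", index=False)
```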