Adversarial model for time series data.
The Gretel DGAN model API provides access to a generative data model for time-series data. This model supports time-varying features, fixed attributes, and categorical variables, and works well when many example time sequences are available to train on.
This model can be selected using the timeseries_dgan model tag. An example configuration is provided below, but note that you will often need to update some of the options to match your input data. The DGAN model supports 2 input formats, wide and long, which are explained in detail in the Data format section. These formats and related parameters tell the DGAN model how to parse your data source as time-series. The training data (data source) is a table, for example a CSV file, supplied through the common interface used to train or fine-tune all Gretel models. See the reference example on how to Create and Train a Model.
The DGAN model will generate synthetic time-series of a particular length, determined by the max_sequence_len parameter. The training examples must also be that same length. As with all machine learning models, the more example sequences are available to train the model, the higher the accuracy. Several config parameters therefore tell the DGAN model how to convert your input CSV into many example sequences.
We support 2 data styles to provide time-series data to the DGAN model: long and wide.
The long style is the most versatile data format to use. We assume the input table has 1 time point per row and use the config options to specify attributes, features, etc. For example, stock price data in this format might look like the following table:
Here, we use each stock (symbol) to split the data into examples. Each example time-series corresponds to max_sequence_len rows in the input. Each generated example in the synthetic data is then like a new stock, with a sequence of prices that exhibits the same types of behavior observed in the training data.
In addition to the price information that changes each day, we also have an attribute, Sector, that is fixed for each example. The model can utilize this if certain sectors' stocks tend to be more volatile than others. In this case, the Sector is also a discrete variable, and it must already be ordinal encoded in the input data passed to Gretel's APIs. So 0 might correspond to the technology sector, and 1 to the financial sector. Consider using sklearn's OrdinalEncoder to convert a string column.
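As a minimal, dependency-free sketch of what ordinal encoding does to a string column such as Sector, consider the helper below. The function name `ordinal_encode` and the sample sector names are illustrative assumptions, not part of Gretel's API. sklearn's OrdinalEncoder produces the same kind of integer mapping and is the better choice for real tables.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer in 0..k-1.

    Categories are numbered in sorted order, so the mapping is deterministic.
    Returns the encoded list and the category -> code mapping.
    """
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical Sector values, one per input row.
sectors = ["Technology", "Financial", "Technology", "Energy"]
encoded, mapping = ordinal_encode(sectors)
print(encoded)  # [2, 1, 2, 0]
print(mapping)  # {'Energy': 0, 'Financial': 1, 'Technology': 2}
```

Keeping the returned mapping around makes it possible to translate the integer codes in the generated data back into readable category names.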
Use the following config snippet for this type of setup, updating the column names as needed for your data:
df_style = "long"
example_id_column = "Symbol"
time_column = "Date"
attribute_columns = ["Sector"]
discrete_columns = ["Sector"]
Remember that all training sequences must be the same length, so you should have the same number of rows in the input for each stock symbol.
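A quick pre-flight check for this requirement can be sketched as follows. The helper name `check_equal_lengths` and the sample rows are hypothetical; the sketch only counts rows per example id and reports any group whose size differs from max_sequence_len.

```python
from collections import Counter


def check_equal_lengths(rows, example_id_column, max_sequence_len):
    """Return {example_id: row_count} for every group whose size differs
    from max_sequence_len. An empty dict means the data is consistent."""
    counts = Counter(row[example_id_column] for row in rows)
    return {k: n for k, n in counts.items() if n != max_sequence_len}


# Hypothetical long-style rows: AAA has 3 rows, BBB only 2.
rows = [
    {"Symbol": "AAA", "Date": "2022-06-01", "Close": 10.0},
    {"Symbol": "AAA", "Date": "2022-06-02", "Close": 10.5},
    {"Symbol": "AAA", "Date": "2022-06-03", "Close": 10.2},
    {"Symbol": "BBB", "Date": "2022-06-01", "Close": 55.1},
    {"Symbol": "BBB", "Date": "2022-06-02", "Close": 54.8},
]
bad = check_equal_lengths(rows, "Symbol", max_sequence_len=3)
print(bad)  # {'BBB': 2}
```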
If there's not a good column to split the data into examples, we support automatic splitting when no example_id_column is provided (though attributes are not supported in this mode). We'll split the input data (after sorting on time_column if provided) into chunks of the required length.
When using the auto splitting feature, note that the generated data will have an additional column, called example_id, with integer values. These values show how you should group the generated data for analyses. Temporal correlations within the same example_id value will match the training data, but any comparisons across different example_id values are not meaningful. So it's not recommended to concatenate all the generated examples into one very long sequence. There will be discontinuities every max_sequence_len rows, because each example is generated independently.
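The auto splitting behavior described above can be approximated with the sketch below. The function name `auto_split` is hypothetical, and dropping leftover rows that do not fill a whole chunk is an assumption of this sketch; the actual handling inside Gretel may differ.

```python
def auto_split(rows, max_sequence_len, time_column=None):
    """Sort rows by time_column (if given), then cut them into consecutive
    chunks of max_sequence_len rows, tagging each chunk with an example_id.

    Assumption: leftover rows that do not fill a whole chunk are dropped.
    """
    if time_column is not None:
        rows = sorted(rows, key=lambda r: r[time_column])
    n_examples = len(rows) // max_sequence_len
    examples = []
    for example_id in range(n_examples):
        start = example_id * max_sequence_len
        chunk = rows[start:start + max_sequence_len]
        examples.append([{**r, "example_id": example_id} for r in chunk])
    return examples


# Seven hypothetical daily rows, split into sequences of length 3.
rows = [{"Date": f"2022-06-{d:02d}", "Close": float(d)} for d in range(1, 8)]
examples = auto_split(rows, max_sequence_len=3, time_column="Date")
print(len(examples))                           # 2 (the 7th row is dropped)
print([r["example_id"] for r in examples[1]])  # [1, 1, 1]
```

Each inner list is one independent training example, which is why values should only be compared within the same example_id.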
The wide style is an alternative data style for when there is exactly 1 feature (time-varying variable). We assume each example is 1 row in the input table. Let's use just the closing price, but otherwise the same underlying data as in the long data style example above:
With the sequence being represented as columns, each row is now one training example. Again we have the Sector attribute that is already ordinal encoded. The model doesn't need the Symbol column because no splitting into examples is required, so it should be dropped before sending the data to Gretel. The following config snippet will work with the above input:
df_style = "wide"
attribute_columns = ["Sector"]
discrete_columns = ["Sector"]
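If your data starts out in long style, a pivot into wide style can be sketched as below. The helper name `long_to_wide` and the generated column names (`Close_0`, `Close_1`, ...) are illustrative assumptions; the example id column is deliberately left out of the result, matching the advice above to drop Symbol.

```python
def long_to_wide(rows, example_id_column, time_column, value_column, attribute_columns):
    """Pivot long-style rows (one time point per row) into wide-style rows
    (one example per row): attribute values plus one column per time step.
    The example id column itself is not kept."""
    groups = {}
    for row in rows:
        groups.setdefault(row[example_id_column], []).append(row)
    wide = []
    for group in groups.values():
        group = sorted(group, key=lambda r: r[time_column])
        wide_row = {c: group[0][c] for c in attribute_columns}
        for i, r in enumerate(group):
            wide_row[f"{value_column}_{i}"] = r[value_column]
        wide.append(wide_row)
    return wide


# Hypothetical long-style rows with an already ordinal-encoded Sector.
rows = [
    {"Symbol": "AAA", "Date": "2022-06-02", "Sector": 0, "Close": 10.5},
    {"Symbol": "AAA", "Date": "2022-06-01", "Sector": 0, "Close": 10.0},
    {"Symbol": "BBB", "Date": "2022-06-01", "Sector": 1, "Close": 55.1},
    {"Symbol": "BBB", "Date": "2022-06-02", "Sector": 1, "Close": 54.8},
]
wide = long_to_wide(rows, "Symbol", "Date", "Close", ["Sector"])
print(wide[0])  # {'Sector': 0, 'Close_0': 10.0, 'Close_1': 10.5}
```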
Full list of configuration options for the DGAN model.
# Update with length of sequences in your data, or
# desired length of generated sequences if using
# auto splitting (example_id_column is null).
# Update to ensure sample_len evenly divides max_sequence_len.
# Update with column names from your data set, see
# "Data style parameters" section of documentation
# for details on usage.
df_style (string, required, defaults to 'long') - Either 'wide' or 'long', indicating the format style of the input data.
example_id_column (string, optional, defaults to null) - Column name used to split data into examples for long style data. Effectively performs a group by operation on this column, and each group becomes an example. If null, the rows are automatically split into training examples based on max_sequence_len. Note that generated synthetic data will contain an example_id column when this automatic splitting is used.
attribute_columns (list of strings, optional, defaults to null) - Column names of fixed attributes that do not vary over time for each example sequence. Used by both 'wide' and 'long' formats. If null, the model will not use any attributes. Note that in 'long' format, each attribute column must be constant within each example, so every value in example_id_column must map to exactly one value in each attribute column. Because of this, auto splitting (when example_id_column is null) does not currently support attribute columns.
feature_columns (list of strings, optional, defaults to null) - Column names of features, the variables that vary over time. Used by both 'wide' and 'long' formats. If specified, only these columns will be used for features. If null, then all columns in the input data that are not used in other column parameters will be the features.
time_column (string, optional, defaults to null) - Column name of date or time values to sort by before creating example sequences in 'long' format. If null, the order from the input data is used. Generated synthetic data will contain this column using an arbitrary set of values from one training example. So if different examples have different time ranges (e.g., because auto splitting was used), one sequence of time values will be used for all synthetic data.
discrete_columns (list of strings, optional, defaults to null) - Column names (either attributes or features) to model as categorical variables for DGAN. These must be ordinal (or label) encoded in your input data, so the values are in [0,1,2,...,k-1] for k categorical values. All attribute and feature columns not listed here are assumed to be continuous.
max_sequence_len (int, required) - Length of generated synthetic sequences and of all training examples. Training requires that all examples are exactly this length.
sample_len (int, required) - Number of time points to produce from 1 RNN cell in the generator. Must evenly divide max_sequence_len. When max_sequence_len is smaller (<20), sample_len=1 is recommended. For longer sequences, the model often learns better when max_sequence_len/sample_len is between 10 and 20. Increasing sample_len is also an option if DGAN is running out of memory (receiving sigkill errors from Gretel API), as it should lead to fewer parameters and a smaller memory footprint for the model.
data_source (str, required) - Input data, must point to a valid and accessible file URL. Often set automatically by CLI (--in-data) or may use local file with SDK and
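For the sample_len guidance in the parameter list above, the following sketch enumerates values that satisfy both constraints: dividing max_sequence_len evenly and keeping max_sequence_len/sample_len between 10 and 20. The helper name `candidate_sample_lens` is hypothetical, and the sketch simply encodes the stated rules of thumb rather than any tuning logic used by Gretel.

```python
def candidate_sample_lens(max_sequence_len):
    """List sample_len values that evenly divide max_sequence_len and follow
    the guidance above: 1 for short sequences (<20), otherwise divisors that
    keep max_sequence_len / sample_len between 10 and 20."""
    if max_sequence_len < 20:
        return [1]
    return [
        s for s in range(1, max_sequence_len + 1)
        if max_sequence_len % s == 0 and 10 <= max_sequence_len // s <= 20
    ]


print(candidate_sample_lens(12))   # [1]
print(candidate_sample_lens(100))  # [5, 10]
print(candidate_sample_lens(365))  # []
```

An empty result, as for 365, suggests picking a nearby max_sequence_len with more divisors (360, for example) or accepting a ratio outside the 10 to 20 range.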
Model structure parameters
apply_feature_scaling (bool, required, defaults to True) - Automatically scale continuous variables (in attributes or features) to the appropriate range as specified by normalization. If False, the input data must already be scaled to the appropriate range ([-1,1] or [0,1], depending on normalization) or the model will not work.
apply_example_scaling (bool, required, defaults to True) - Internally rescale continuous features in each example and model the range for each example. This helps the model learn when different examples have very different value ranges. E.g., for stock price data, there may be penny stocks (prices usually between $0.001 and $1), stocks in the $1-$100 range, others in the $100-$1000 range, and the Berkshire Hathaways in the $100,000 to $1,000,000 range.
normalization (string, required, defaults to 'MINUSONE_ONE') - Defines internal range of continuous variables. Supported values are 'MINUSONE_ONE', where continuous variables are in [-1,1] and tanh activations are used, and 'ZERO_ONE', where continuous variables are in [0,1] and sigmoid activations are used. Also see
use_attribute_discriminator (bool, required, defaults to True) - Use a second discriminator that only operates on the attributes as part of the GAN. Helps ensure the attribute distributions are accurate. Also see
attribute_noise_dim (int, required, defaults to 10) - Width of noise vector in the GAN generator to create the attributes.
feature_noise_dim (int, required, defaults to 10) - Width of noise vector in the GAN generator to create the features.
attribute_num_layers (int, required, defaults to 3) - Number of hidden layers in the feed-forward MLP to create attributes in the GAN generator.
attribute_num_units (int, required, defaults to 100) - Number of units in each layer of the feed-forward MLP to create attributes in the GAN generator.
feature_num_layers (int, required, defaults to 1) - Number of LSTM layers in the RNN to create features in the GAN generator.
feature_num_units (int, required, defaults to 100) - Number of units in each LSTM layer to create features in the GAN generator.
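If apply_feature_scaling is turned off, continuous columns have to be scaled before training. The sketch below shows one way to do that for the two normalization values described above. It assumes simple min-max scaling over a whole column, which may not match Gretel's internal scaling exactly, and the function name `scale` is hypothetical.

```python
def scale(values, normalization="MINUSONE_ONE"):
    """Min-max scale a list of floats to [-1, 1] or [0, 1].

    Assumption: plain min-max scaling over the whole column.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    unit = [(v - lo) / (hi - lo) for v in values]  # now in [0, 1]
    if normalization == "ZERO_ONE":
        return unit
    if normalization == "MINUSONE_ONE":
        return [2.0 * u - 1.0 for u in unit]  # stretch [0, 1] to [-1, 1]
    raise ValueError(f"unsupported normalization: {normalization}")


prices = [10.0, 15.0, 20.0]
print(scale(prices, "ZERO_ONE"))      # [0.0, 0.5, 1.0]
print(scale(prices, "MINUSONE_ONE"))  # [-1.0, 0.0, 1.0]
```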
batch_size (int, required, defaults to 1000) - Size of batches for training and generation. Larger values should run faster, so try increasing it if training is taking a long time. If batch_size is too large for the model setup, the memory footprint for training may exceed available RAM and the job may crash (sigkill errors from Gretel API).
epochs (int, required, defaults to 400) - Number of epochs to train (iterations through the training data while optimizing parameters).
gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss.
attribute_gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss for the attribute discriminator (if enabled with
attribute_loss_coef (float, required, defaults to 1.0) - When use_attribute_discriminator is True, the coefficient on the attribute discriminator loss when combined with the generic discriminator loss. Try increasing this param if the attribute discriminator is enabled but the attribute distributions of the generated data do not match the training data.
generator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train parameters of the GAN generator.
discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train the parameters of the GAN discriminator.
attribute_discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train the parameters of the GAN attribute discriminator (if enabled with
discriminator_rounds (int, required, defaults to 1) - Number of optimization steps of the discriminator(s) to perform for each batch. Some GAN literature mentions using 5 or 10 for this parameter to improve model performance.
generator_rounds (int, required, defaults to 1) - Number of optimization steps of the generator to perform for each batch.
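For orientation, the commonly used form of the Wasserstein GAN discriminator loss with a gradient penalty is shown below. Reading \lambda as gradient_penalty_coef (or attribute_gradient_penalty_coef for the attribute discriminator) and \alpha as attribute_loss_coef is an assumption based on the parameter descriptions above, not a statement of the exact implementation.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim p_g}\left[ D(\tilde{x}) \right]
    - \mathbb{E}_{x \sim p_r}\left[ D(x) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}
      \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]

L_{\text{total}} = L_D + \alpha \, L_{D_{\text{attr}}}
```

Here p_r is the distribution of real training examples, p_g the distribution of generated examples, and \hat{x} is sampled along straight lines between real and generated examples. L_{D_attr} has the same form as L_D but is computed by the attribute discriminator on attributes only.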
Differential privacy is currently not supported for the Gretel DGAN model.
Conditional data generation (smart seeding) is currently not supported for the Gretel DGAN model.
Sample CLI to generate 1000 additional examples from a trained DGAN model:
gretel models run \
--project <project-name> \
--model-id <model-id> \
--runner cloud \
--param num_records 1000 \
The underlying model is DoppelGANger, a generative adversarial network (GAN) specifically constructed for time-series data. The model is initialized from random weights and trained on the customer-provided dataset using the Wasserstein GAN loss. We use our PyTorch implementation of DoppelGANger in gretelai/gretel-synthetics, based on the original paper by Lin et al. Additional details about the model can be found in that paper: http://arxiv.org/abs/1909.13403
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (Required): Minimum Nvidia T4 or similar CUDA-compliant GPU with 16GB+ RAM is recommended to run the DGAN model.
This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the data set used to train models before using in production.
As an open beta model, there are several technical limitations:
- Model training is sometimes unstable, so if you see poor performance, retraining a few times with the same data and config can sometimes lead to notably better results from one run.
- All training and generated sequences must be the exact same length (
- Synthetic quality report is not supported.