Configure the model using the `timeseries_dgan` model tag. An example configuration is provided below, but note that you will often need to update some of the options to match your input data. The DGAN model supports two input formats, wide and long, which we explain in detail in the Data format section. These formats and their related parameters tell the DGAN model how to parse your data source as time series. The training data (data source) is a table, for example a CSV file, supplied through the common interface used to train or fine-tune all Gretel models. See the reference example to train a model.
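As a hedged illustration only (verify the exact schema against the Gretel reference example; `stock_id` and `date` are placeholder column names), such a config might look like:

```yaml
# Illustrative sketch: option names mirror the parameters documented below,
# but check the structure against the Gretel reference example.
schema_version: "1.0"
models:
  - timeseries_dgan:
      data_source: __tmp__           # typically set by the CLI or SDK
      params:
        df_style: long
        example_id_column: stock_id  # placeholder column name
        time_column: date            # placeholder column name
        max_sequence_len: 20
        sample_len: 1
```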
The length of generated sequences is set by the `max_sequence_len` parameter. The training examples must also be that same length. As with all machine learning models, the more example sequences that are available to train the model, the higher the accuracy. So we provide several config parameters that tell the DGAN model how to convert your input CSV into many example sequences.

For example, with stock price data, each training example might be one stock's price history of `max_sequence_len` rows in the input. Each generated example in the synthetic data is then like a new stock, with a sequence of prices that exhibits similar behavior to that observed in the training data.
In 'long' format, auto splitting is used when no `example_id_column` is provided (though attributes are not supported in this mode). We'll split the input data (after sorting on `time_column`, if provided) into chunks of the required length. The generated synthetic data will then include an additional column, `example_id`, with integer values. These values show how you should group the generated data for analysis. Temporal correlations within the same `example_id` value will match the training data, but comparisons across different `example_id` values are not meaningful. It is therefore not recommended to concatenate all the generated examples into one very long sequence: there will be discontinuities every `max_sequence_len` rows, because each example is generated independently.
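To make auto splitting concrete, here is a small pure-Python sketch (illustrative, not Gretel's actual implementation; dropping a trailing partial chunk is an assumption here) of chunking rows into fixed-length examples and labeling them with an `example_id`:

```python
# Illustrative sketch of auto splitting (not Gretel's actual code): rows are
# split into consecutive chunks of max_sequence_len; here a trailing partial
# chunk is simply dropped.
max_sequence_len = 3
rows = [10.1, 10.4, 10.2, 55.0, 54.8, 55.3, 99.9]  # 7 rows -> 2 full examples

examples = [
    rows[i : i + max_sequence_len]
    for i in range(0, len(rows) - max_sequence_len + 1, max_sequence_len)
]

# Generated output carries an example_id so sequences can be grouped; only
# within-example temporal correlations are meaningful.
labeled = [
    (example_id, value)
    for example_id, chunk in enumerate(examples)
    for value in chunk
]
print(examples)  # [[10.1, 10.4, 10.2], [55.0, 54.8, 55.3]]
```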
`df_style` (string, required, defaults to 'long') - Either 'wide' or 'long', indicating the format style of the input data.
`example_id_column` (string, optional, defaults to null) - Column name used to split 'long' style data into examples. Effectively performs a group-by operation on this column, and each group becomes an example. If null, the rows are automatically split into training examples based on `max_sequence_len`. Note that the generated synthetic data will contain an `example_id` column when this automatic splitting is used.
`attribute_columns` (list of strings, optional, defaults to null) - Column names of fixed attributes that do not vary over time within each example sequence. Used by both 'wide' and 'long' formats. If null, the model will not use any attributes. Note that in 'long' format, each attribute column must be constant within each example, so there must be a 1-to-1 mapping from values in the `example_id_column` to values in each attribute column. Because of this, auto splitting (when `example_id_column` is null) does not currently support attribute columns.
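When using attributes in 'long' format, it can be worth verifying that 1-to-1 mapping before training. A minimal stdlib sketch (the column layout and values are made up for illustration):

```python
from collections import defaultdict

# Toy 'long' format rows: (example_id, sector attribute, price feature).
# Column names and values are made up for illustration.
rows = [
    ("AAPL", "tech", 10.1),
    ("AAPL", "tech", 10.4),
    ("XOM", "energy", 55.0),
    ("XOM", "energy", 54.8),
]

# A valid attribute column takes exactly one value per example_id group.
values_per_example = defaultdict(set)
for example_id, sector, _price in rows:
    values_per_example[example_id].add(sector)

is_valid_attribute = all(len(v) == 1 for v in values_per_example.values())
print(is_valid_attribute)  # True
```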
`feature_columns` (list of strings, optional, defaults to null) - Column names of features, the variables that vary over time. Used by both 'wide' and 'long' formats. If specified, only these columns are used as features. If null, all columns in the input data that are not used by other column parameters are treated as features.
`time_column` (string, optional, defaults to null) - Column name of date or time values used to sort rows before creating example sequences in 'long' format. If null, the order of the input data is used. The generated synthetic data will contain this column, using an arbitrary set of values taken from one training example. So if different examples have different time ranges (e.g., because auto splitting was used), a single sequence of time values will be used for all synthetic data.
`discrete_columns` (list of strings, optional, defaults to null) - Column names (either attributes or features) to model as categorical variables in DGAN. These must be ordinal (label) encoded in your input data, so that the values are in `[0, 1, 2, ..., k-1]` for k categories. All attribute and feature columns not listed here are assumed to be continuous.
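For example, a categorical column can be ordinal encoded with a few lines of Python before training (libraries such as scikit-learn's `OrdinalEncoder` do the same job):

```python
# Ordinal (label) encode a categorical column into values 0..k-1 before
# training a model that expects integer-coded categories.
raw = ["red", "green", "blue", "green", "red"]

categories = sorted(set(raw))  # ['blue', 'green', 'red'], so k = 3
to_code = {cat: i for i, cat in enumerate(categories)}
encoded = [to_code[v] for v in raw]
print(encoded)  # [2, 1, 0, 1, 2]

# Keep the mapping so generated values can be decoded back to labels.
decoded = [categories[code] for code in encoded]
```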
`max_sequence_len` (int, required) - Length of generated synthetic sequences and of all training examples. Training requires that every example is exactly this length.
`sample_len` (int, required) - Number of time points produced by one RNN cell in the generator. Must evenly divide `max_sequence_len`. When `max_sequence_len` is small (< 20), we recommend `sample_len=1`. For longer sequences, the model often learns better when `max_sequence_len / sample_len` is between 10 and 20. Increasing `sample_len` is also an option if DGAN is running out of memory (sigkill errors from the Gretel API), as it leads to fewer parameters and a smaller memory footprint for the model.
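A quick way to enumerate valid `sample_len` values for a given `max_sequence_len`, and to flag those in the suggested 10-20 ratio range:

```python
# sample_len must evenly divide max_sequence_len; for longer sequences, a
# ratio max_sequence_len / sample_len between 10 and 20 often works well.
max_sequence_len = 100

valid = [d for d in range(1, max_sequence_len + 1) if max_sequence_len % d == 0]
suggested = [d for d in valid if 10 <= max_sequence_len // d <= 20]

print(valid)      # [1, 2, 4, 5, 10, 20, 25, 50, 100]
print(suggested)  # [5, 10] -> ratios of 20 and 10
```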
`data_source` (str, required) - Input data; must point to a valid and accessible file URL. Often set automatically by the CLI (`--in-data`), or a local file may be used with the SDK.
`apply_feature_scaling` (bool, required, defaults to True) - Automatically scale continuous variables (in attributes or features) to the appropriate range, as specified by `normalization`. If False, the input data must already be scaled to the appropriate range (`[-1,1]` or `[0,1]`, depending on `normalization`) or the model will not work.
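If `apply_feature_scaling` is False, you must scale continuous columns yourself. A minimal min-max scaling sketch for the `[0,1]` (`'ZERO_ONE'`) case:

```python
def min_max_scale(values):
    """Scale a continuous column to [0, 1], as 'ZERO_ONE' normalization expects."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

prices = [120.0, 150.0, 90.0, 180.0]
scaled = min_max_scale(prices)
print(scaled)  # values now lie in [0, 1], with min -> 0.0 and max -> 1.0
```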
apply_example_scaling(bool, required, defaults to True) - Internally rescale continuous features in each example and model the range for each example. This helps the model learn when different examples have very different value ranges. E.g., for stock price data, there may be penny stocks (prices usually between $0.001 and $1), stocks in the $1-$100 range, others in $100-$1000 range, and the Berkshire Hathaways in the $100,000 to $1,000,000 range.
`normalization` (string, required, defaults to `'MINUSONE_ONE'`) - Defines the internal range of continuous variables. Supported values are `'MINUSONE_ONE'`, where continuous variables are in `[-1,1]` and tanh activations are used, and `'ZERO_ONE'`, where continuous variables are in `[0,1]` and sigmoid activations are used. Also see `apply_feature_scaling`.

`use_attribute_discriminator` (bool, required, defaults to True) - Use a second discriminator that operates only on the attributes as part of the GAN. This helps ensure the attribute distributions are accurate. Also see `attribute_loss_coef`.
`attribute_noise_dim` (int, required, defaults to 10) - Width of the noise vector in the GAN generator used to create attributes.

`feature_noise_dim` (int, required, defaults to 10) - Width of the noise vector in the GAN generator used to create features.

`attribute_num_layers` (int, required, defaults to 3) - Number of hidden layers in the feed-forward MLP that creates attributes in the GAN generator.

`attribute_num_units` (int, required, defaults to 100) - Number of units in each layer of the feed-forward MLP that creates attributes in the GAN generator.

`feature_num_layers` (int, required, defaults to 1) - Number of LSTM layers in the RNN that creates features in the GAN generator.

`feature_num_units` (int, required, defaults to 100) - Number of units in each LSTM layer that creates features in the GAN generator.
`batch_size` (int, required, defaults to 1000) - Size of batches for training and generation. Larger values should run faster, so try increasing this if training is taking a long time. If `batch_size` is too large for the model setup, the memory footprint for training may exceed available RAM and cause crashes (sigkill errors from the Gretel API).
`epochs` (int, required, defaults to 400) - Number of epochs to train (iterations through the training data while optimizing parameters).

`gradient_penalty_coef` (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss.
`attribute_gradient_penalty_coef` (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss for the attribute discriminator (if enabled with `use_attribute_discriminator`).

`attribute_loss_coef` (float, required, defaults to 1.0) - When `use_attribute_discriminator` is True, the coefficient on the attribute discriminator loss when combined with the generic discriminator loss. Try increasing this parameter if the attribute discriminator is enabled but the attribute distributions of the generated data do not match the training data.
`generator_learning_rate` (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN generator.

`discriminator_learning_rate` (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN discriminator.
`attribute_discriminator_learning_rate` (float, required, defaults to 0.001) - Learning rate for the Adam optimizer used to train the parameters of the GAN attribute discriminator (if enabled with `use_attribute_discriminator`).

`discriminator_rounds` (int, required, defaults to 1) - Number of optimization steps of the discriminator(s) to perform for each batch. Some GAN literature uses 5 or 10 for this parameter to improve model performance.

`generator_rounds` (int, required, defaults to 1) - Number of optimization steps of the generator to perform for each batch.