Gretel DGAN

Adversarial model for time series data.

The Gretel DGAN model API provides access to a generative model for time-series data. The model supports time-varying features, fixed attributes, and categorical variables, and performs best when many example sequences are available for training.

Model Creation

Data format

The DGAN model generates synthetic time-series of a particular length, determined by the max_sequence_len parameter. The training examples must also be that same length. As with all machine learning models, the more example sequences available to train on, the better the model's performance. We provide several config parameters that tell the DGAN model how to convert your input CSV into these training sequences.

We support two data styles for providing time-series data to the DGAN model: long and wide.

Long data style

This is the most versatile data format. We assume the input table has one time point per row and use the config options to specify attributes, features, and so on. For example, stock price data in this format might look like the following table (values are illustrative):
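
Date        Symbol  Sector  Open    Close
2022-06-01  ABC     0       33.17   33.51
2022-06-02  ABC     0       33.47   33.90
2022-06-03  ABC     0       33.85   33.68
...         ...     ...     ...     ...
2022-06-01  QRS     1       195.20  194.75
2022-06-02  QRS     1       194.80  196.33
2022-06-03  QRS     1       196.10  195.10
...         ...     ...     ...     ...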

Here, we use each stock (symbol) to split the data into examples. Each example time-series corresponds to max_sequence_len rows in the input. Each generated example in the synthetic data is then like a new stock, with a sequence of prices exhibiting behavior similar to that observed in the training data.

In addition to the price information that changes each day, we also have an attribute, Sector, that is fixed for each example. The model can utilize this if, say, certain sectors' stocks tend to be more volatile than others. In this case, Sector is also a discrete variable, and it must already be ordinal encoded in the input data passed to Gretel's APIs. So 0 might correspond to the technology sector and 1 to the financial sector. Consider using sklearn's OrdinalEncoder to convert a string column.
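
A minimal sketch of that encoding with pandas and scikit-learn (the file names are hypothetical):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("stocks.csv")  # hypothetical input file

# Replace sector names (e.g., "Technology", "Financial") with integer codes
encoder = OrdinalEncoder()
df["Sector"] = encoder.fit_transform(df[["Sector"]]).astype(int)

df.to_csv("stocks_encoded.csv", index=False)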

Use the following config snippet for this type of setup, updating the column names as needed for your data:

df_style: "long"
example_id_column: "Symbol"
time_column: "Date"
attribute_columns: ["Sector"]
discrete_columns: ["Sector"]

If there is no suitable column for splitting the data into examples, we support automatic splitting when no example_id_column is provided (though attributes are not supported in this mode). We split the input data (after sorting on time_column, if provided) into chunks of the required length.

When using the auto splitting feature, note that the generated data will have an additional column, called example_id, with integer values. These values show how you should group the generated data for analyses. Temporal correlations within the same example_id value will match the training data, but any comparisons across different example_id values are not meaningful. So it's not recommended to concatenate all the generated examples into one very long sequence. There will be discontinuities every max_sequence_len rows, because each example is generated independently.
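
For example, a minimal pandas sketch of per-example analysis of auto-split output (the file name and the Close column are hypothetical):

import pandas as pd

# Generated data downloaded from Gretel; contains an example_id column
synthetic = pd.read_csv("synthetic_data.csv")

# Analyze each generated sequence on its own; comparisons across
# example_id values are not meaningful
for example_id, seq in synthetic.groupby("example_id"):
    returns = seq["Close"].pct_change()  # assumes a Close feature column
    print(example_id, returns.std())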

When using the long data style, variable sequence lengths are supported. So, when the number of rows in the input varies per stock symbol, the data must be supplied in long format; the wide data style (described below) is not compatible with modeling variable sequence lengths.

Wide data style

An alternative data style for when there is exactly one feature (time-varying variable). We assume each example is one row in the input table. Let's use just the closing price, but otherwise the same underlying data as in the long data style example above (values are illustrative):
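
Sector  2022-06-01  2022-06-02  2022-06-03  ...  2022-06-10
0       33.51       33.90       33.68       ...  34.02
1       194.75      196.33      195.10      ...  193.88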

With the sequence represented as columns, each row is now one training example. Again, we have the Sector attribute, already ordinal encoded. The model doesn't need the Symbol column because no splitting into examples is required, so drop it before sending the data to Gretel. The following config snippet will work with the above input:

df_style: "wide"
attribute_columns: ["Sector"]
discrete_columns: ["Sector"]
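
If your raw data starts in long form, here is a minimal pandas sketch of this reshaping, assuming the column names from the example above (file names are hypothetical):

import pandas as pd

# Long-format input with Date, Symbol, Sector, and Close columns
long_df = pd.read_csv("stocks_long.csv")

# One row per symbol, one column per date, cell values are closing prices
wide_df = long_df.pivot(index="Symbol", columns="Date", values="Close")

# Re-attach the fixed Sector attribute (constant within each symbol)
wide_df.insert(0, "Sector", long_df.groupby("Symbol")["Sector"].first())

# Symbol lives only in the index, so index=False drops it as required
wide_df.to_csv("stocks_wide.csv", index=False)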

Parameters

Full list of configuration options for the DGAN model.

schema_version: "1.0"
name: dgan-all-params
models:
  - timeseries_dgan:
      data_source: __temp__
      params:
        # Update with length of sequences in your data, or 
        # desired length of generated sequences if using
        # auto splitting (example_id_column is null).
        max_sequence_len: 10
        # Update to ensure sample_len evenly divides
        # max_sequence_len.
        sample_len: 1        
        attribute_noise_dim: 10
        feature_noise_dim: 32
        attribute_num_layers: 3
        attribute_num_units: 100
        feature_num_layers: 1
        feature_num_units: 100
        use_attribute_discriminator: true
        normalization: MINUSONE_ONE
        apply_feature_scaling: true
        apply_example_scaling: false
        gradient_penalty_coef: 10
        attribute_gradient_penalty_coef: 10
        attribute_loss_coef: 10
        generator_learning_rate: 0.00001
        discriminator_learning_rate: 0.00001
        attribute_discriminator_learning_rate: 0.00001
        batch_size: 100
        epochs: 5000
        discriminator_rounds: 1
        generator_rounds: 1
      # Update with column names from your data set, see
      # "Data style parameters" section of documentation
      # for details on usage.
      time_column: null
      example_id_column: null
      attribute_columns: null
      feature_columns: null
      discrete_columns: null
      df_style: long

Data parameters

  • df_style (string, required, defaults to 'long') - Either 'wide' or 'long' indicating the format style of the input data.

  • example_id_column (string, optional, defaults to null) - Column name to split data into examples for long style data. Effectively performs a group by operation on this column, and each group becomes an example. If null, the rows are automatically split into training examples based on max_sequence_len. Note generated synthetic data will contain an example_id column when this automatic splitting is used.

  • attribute_columns (list of strings, optional, defaults to null) - Column names of fixed attributes that do not vary over time for each example sequence. Used by both 'wide' and 'long' formats. If null, the model will not use any attributes. Note that in 'long' format, each of these columns must be constant within an example, so there must be a 1-to-1 mapping between values in the example_id_column and values in each attribute column. Because of this, auto splitting (when example_id_column is null) does not currently support attribute columns.

  • feature_columns (list of strings, optional, defaults to null) - Column names of features, the variables that vary over time. Used by both 'wide' and 'long' formats. If specified, only these columns will be used for features. If null, then all columns in the input data that are not used in other column parameters will be the features.

  • time_column (string, optional, defaults to null) - Column name of date or time values to sort by before creating example sequences in 'long' format. If time_column='auto', a column that looks like a date or time will be selected automatically. If null, the order from the input data is used. Generated synthetic data will contain this column using an arbitrary set of values from one training example. So if different examples have different time ranges (e.g., because auto splitting was used), one sequence of time values will be used for all synthetic data.

  • discrete_columns (list of strings, optional, defaults to null) - Column names (either attributes or features) to model as categorical variables. DGAN will automatically model any string type columns as categorical variables, in addition to columns explicitly listed here.

  • max_sequence_len (int, required) - Maximum length of generated synthetic sequences and training example sequences. Sequences may be of variable length (i.e., some sequences may be shorter than max_sequence_len), and synthetic sequences will follow a similar pattern of lengths to the training data. To have DGAN automatically choose a good max_sequence_len and sample_len based on the training data (when example_id_column is provided), set both max_sequence_len and sample_len to auto.

  • sample_len (int, required) - Number of time points produced by 1 RNN cell in the generator. Must evenly divide max_sequence_len. When max_sequence_len is small (<20), we recommend sample_len=1. For longer sequences, the model often learns better when max_sequence_len/sample_len is between 10 and 20, e.g., sample_len=5 or sample_len=10 when max_sequence_len=100 (see the sketch after this list). Increasing sample_len is also an option if DGAN is running out of memory (sigkill errors from the Gretel API), as it leads to fewer parameters and a smaller memory footprint for the model. If using max_sequence_len: auto, then sample_len can also be set to auto.

  • data_source (string, required) - Input data; must point to a valid and accessible file URL. Often set automatically by the CLI (--in-data); with the SDK, a local file may be used together with upload_data_source=True.
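
As an illustration of the sample_len guidance above, a small Python sketch that enumerates valid choices (the max_sequence_len value is hypothetical):

# Candidate sample_len values that evenly divide max_sequence_len while
# keeping max_sequence_len / sample_len in the suggested 10-20 range
max_sequence_len = 100  # hypothetical value
candidates = [s for s in range(1, max_sequence_len + 1)
              if max_sequence_len % s == 0
              and 10 <= max_sequence_len / s <= 20]
print(candidates)  # -> [5, 10]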

Model structure parameters

  • apply_feature_scaling (bool, required, defaults to True) - Automatically scale continuous variables (in attributes or features) to the appropriate range as specified by normalization. If False, the input data must already be scaled to the appropriate range ([-1,1] or [0,1]) or the model will not work.

  • apply_example_scaling (bool, required, defaults to True) - Internally rescale continuous features in each example and model the range for each example. This helps the model learn when different examples have very different value ranges. E.g., for stock price data, there may be penny stocks (prices usually between $0.001 and $1), stocks in the $1-$100 range, others in $100-$1000 range, and the Berkshire Hathaways in the $100,000 to $1,000,000 range.

  • normalization (string, required, defaults to 'MINUSONE_ONE') - Defines the internal range of continuous variables. Supported values are 'MINUSONE_ONE', where continuous variables are in [-1,1] and tanh activations are used, and 'ZERO_ONE', where continuous variables are in [0,1] and sigmoid activations are used. Also see apply_feature_scaling.

  • use_attribute_discriminator (bool, required, defaults to True) - Use a second discriminator that only operates on the attributes as part of the GAN. Helps ensure the attribute distributions are accurate. Also see attribute_loss_coef.

  • attribute_noise_dim (int, required, defaults to 10) - Width of noise vector in the GAN generator to create the attributes.

  • feature_noise_dim (int, required, defaults to 10) - Width of noise vector in the GAN generator to create the features.

  • attribute_num_layers (int, required, defaults to 3) - Number of hidden layers in the feed-forward MLP to create attributes in the GAN generator.

  • attribute_num_units (int, required, defaults to 100) - Number of units in each layer of the feed-forward MLP to create attributes in the GAN generator.

  • feature_num_layers (int, required, defaults to 1) - Number of LSTM layers in the RNN to create features in the GAN generator.

  • feature_num_units (int, required, defaults to 100) - Number of units in each LSTM layer to create features in the GAN generator.

Training parameters

  • batch_size (int, required, defaults to 1000) - Size of batches for training and generation. Larger values should run faster, so try increasing this if training is taking a long time. If batch_size is too large for the model setup, the memory footprint for training may exceed available RAM and cause crashes (sigkill errors from the Gretel API).

  • epochs (int, required, defaults to 400) - Number of epochs to train (iterations through the training data while optimizing parameters).

  • gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss.

  • attribute_gradient_penalty_coef (float, required, defaults to 10.0) - Coefficient for the gradient penalty term in the Wasserstein GAN loss for the attribute discriminator (if enabled with use_attribute_discriminator).

  • attribute_loss_coef (float, required, defaults to 1.0) - When use_attribute_discriminator is True, the coefficient on the attribute discriminator loss when combined with the generic discriminator loss. Try increasing this param if attribute discriminator is enabled but the attribute distributions of the generated data do not match the training data.

  • generator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train parameters of the GAN generator.

  • discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train the parameters of the GAN discriminator.

  • attribute_discriminator_learning_rate (float, required, defaults to 0.001) - Learning rate for Adam optimizer used to train the parameters of the GAN attribute discriminator (if enabled with use_attribute_discriminator).

  • discriminator_rounds (int, required, defaults to 1) - Number of optimization steps of the discriminator(s) to perform for each batch. Some GAN literature mentions using 5 or 10 for this parameter to improve model performance.

  • generator_rounds (int, required, defaults to 1) - Number of optimization steps of the generator to perform for each batch.

Differential privacy

Differential privacy is currently not supported for the Gretel DGAN model.

Smart seeding

Conditional data generation (smart seeding) is currently not supported for the Gretel DGAN model.

Data generation

Sample CLI to generate 1000 additional examples from a trained DGAN model:

gretel models run \
  --project <project-name> \
  --model-id <model-id> \
  --runner cloud \
  --param num_records 1000 \
  --output .

Model information

The underlying model is DoppelGANger, a generative adversarial network (GAN) specifically constructed for time series data. The model is initialized from random weights and trained on the provided dataset using the Wasserstein GAN loss. We use our PyTorch implementation of DoppelGANger in gretelai/gretel-synthetics based on the original paper by Lin et al. Additional details about the model can be found in that paper: http://arxiv.org/abs/1909.13403

Minimum requirements

If running this system in local mode (on-premises), the following instance types are recommended.

CPU: Minimum 4 cores, 32GB RAM.

GPU (required): A minimum of an Nvidia T4 or similar CUDA-compliant GPU with 16GB+ RAM is recommended to run the DGAN model.

Limitations and biases

This model is trained entirely on the examples provided in the training dataset and will therefore capture and likely repeat any biases that exist in the training set. We recommend having a human review the dataset used to train models before using it in production.

As an open beta model, there are several technical limitations:

  • Model training is sometimes unstable; if you see poor performance, retraining a few times with the same data and config can help, as some runs produce notably better results than others.

  • Training and generated sequences have a fixed maximum length (max_sequence_len); variable-length sequences (shorter than max_sequence_len) are only supported with the long data style.

  • Synthetic quality report is not supported.

  • DGAN does not model missing data (NaNs) for continuous variables. DGAN will handle some NaNs in the input data by replacing missing values via interpolation. However, if there are too many missing values, the model may not have enough data or examples to train and will throw an error. NaN or missing values will never be generated for continuous variables. (This does not apply to categorical variables, where missing values are fully supported and modeled as just another category.) If you prefer to control how gaps are filled, you can interpolate them yourself before training, as in the sketch below.
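
A minimal pandas sketch of such pre-filling, assuming the column names from the examples above (file names are hypothetical):

import pandas as pd

df = pd.read_csv("stocks.csv")  # hypothetical input file

# Linearly interpolate gaps in a continuous column, within each example
# so that values don't bleed across symbols
df["Close"] = df.groupby("Symbol")["Close"].transform(lambda s: s.interpolate())

df.to_csv("stocks_filled.csv", index=False)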
