Conditional Generation FAQ
Conditional data generation (sometimes called seeding or prompting) is a technique where a generative model is asked to generate data according to some pre-specified conditioning, such as a topic, a sentiment, or one or more field values in a tabular, text, or image-based dataset.
Some of the primary use cases for conditional data generation include:
- 1. Retaining key fields such as primary keys, labels, or classes in a synthetically generated dataset. This is a key enabler for synthesizing relational databases.
- 2. Oversampling minority classes in a dataset to address class imbalance in machine learning training sets. For example, creating additional “fraudulent”-type records from a limited set of samples to help train a financial model to better detect fraud.
- 3. Filling in missing data by asking a generative model to “fill in” missing fields in a dataset, based on data observed in the dataset or on public data, e.g., filling in synthetic user details for users who did not consent to or opted out of data collection.
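As a rough illustration of the oversampling use case, the sketch below counts the classes in a label column and works out which label values would need to be requested from a conditional model so that every class reaches the size of the largest one. The function name and the sample labels are hypothetical, and this is not tied to any particular Gretel API.

```python
from collections import Counter


def seeds_to_balance(labels):
    """Return the label values to request so each class matches the largest class."""
    counts = Counter(labels)
    if not counts:
        return []
    target = max(counts.values())
    seeds = []
    for label, count in counts.items():
        # Request one extra record per missing example of this class.
        seeds.extend([label] * (target - count))
    return seeds


# Hypothetical, heavily imbalanced training labels.
labels = ["legit"] * 8 + ["fraudulent"] * 2
seeds = seeds_to_balance(labels)
print(Counter(seeds))
```

Each entry in the resulting list would then be used as the conditioning value for one generated record.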
Gretel’s ACTGAN model offers state-of-the-art accuracy on tabular data as well as support for privacy filters. ACTGAN's conditional generation is based on rejection sampling, which samples from the model and then accepts only the generated records that match the user’s requested conditions. This means that ACTGAN should work well for conditionally sampling on up to a few categorical attributes (such as IDs or labels), but may fail to satisfy conditional requests for values that rarely occur in the input data. ACTGAN can train on and generate datasets with up to thousands of columns, with 10x less GPU memory consumption than other GAN-based models, allowing it to run on much larger and more varied datasets. Training a GAN can take minutes or hours depending on the dataset size, but generation is quite fast.
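The rejection sampling idea can be sketched in a few lines. This is a toy illustration, not ACTGAN's implementation: sample_record is a stand-in for an unconditional draw from a trained model, and the field names, class weights, and max_tries cap are all assumptions made for the example.

```python
import random


def sample_record(rng):
    # Stand-in for one unconditional draw from a trained generative model.
    return {
        "label": rng.choices(["legit", "fraudulent"], weights=[0.95, 0.05])[0],
        "amount": round(rng.uniform(1, 500), 2),
    }


def rejection_sample(conditions, n, rng, max_tries=10_000):
    """Draw records until n of them match every requested condition."""
    accepted = []
    tries = 0
    while len(accepted) < n and tries < max_tries:
        record = sample_record(rng)
        tries += 1
        # Keep the record only if all conditioned fields match exactly.
        if all(record[key] == value for key, value in conditions.items()):
            accepted.append(record)
    return accepted, tries


rng = random.Random(0)
records, tries = rejection_sample({"label": "fraudulent"}, n=5, rng=rng)
print(len(records), "accepted after", tries, "draws")
```

The sketch also shows why rare conditions are a problem: the rarer the requested value, the more draws are discarded, and a value the model never produces exhausts max_tries without returning anything.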
tabular, text, time-series
Gretel’s LSTM is a language model that works by repeatedly predicting the next token (or field) in a dataset. Because it is based on a language model, it works quite well for conditional data generation, and can even be prompted with inputs outside of the training distribution and return meaningful results.
The LSTM may struggle with high column count datasets and has high compute requirements compared with our other models. Use the LSTM if you are working with mixed tabular and text data and fewer than 25 columns.
Gretel’s Amplify model excels at working with high-dimensional tabular datasets, with support for 1000+ columns and millions of rows. It trains and generates data quickly, especially on multi-core instances, and does not require a GPU. Amplify has quite resilient conditional data generation abilities. Amplify does not offer the accuracy of ACTGAN or the LSTM, but its accuracy is quite good, and its flexibility in seeding multiple values, including numeric values, makes it a good choice to start with for conditional data generation use cases.
Gretel’s GPT (Generative Pretrained Transformer) models are language models designed around conditional data generation use cases, such as following a prompt to generate data that matches a topic or sentiment that the model has been fine-tuned on or prompted with. Gretel GPT is not designed to work with tabular data specifically: inputs to the model are a single-column CSV. However, labels can be encoded into the prompt and separated with a special character. For example, if you were fine-tuning a GPT model to generate movie reviews matching a sentiment, you could construct a training set like this:
After fine-tuning, the model can be conditioned to generate new movie reviews matching a certain sentiment or topic using the following prompt:
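As a minimal sketch of this label encoding, assuming a | separator and a single column named text (both illustrative choices, not a format confirmed by Gretel), the training rows and a matching prompt could be built like this:

```python
import csv
import io

SEPARATOR = "|"  # hypothetical separator; choose a character absent from the text


def encode_example(label, text):
    """Join a label and its text into one single-column training row."""
    return f"{label}{SEPARATOR}{text}"


def build_prompt(label):
    # The prompt stops right after the separator so the model continues with text.
    return f"{label}{SEPARATOR}"


examples = [
    ("positive", "This is the best movie ever!"),
    ("negative", "Poor acting and a boring script."),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text"])  # assumed column name for the single-column CSV
for label, text in examples:
    writer.writerow([encode_example(label, text)])

training_csv = buffer.getvalue()
prompt = build_prompt("positive")
print(training_csv)
print(prompt)
```

Whatever separator is chosen, it should be used consistently in both the training rows and the prompts.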
If you’re working with tabular data that has fewer than 20 columns of mixed data (text, categorical, numeric), try starting with the LSTM. It provides the most flexibility and great accuracy but takes longer for training and generation.
If you are working with millions of records or terabytes of data, start with Gretel Amplify. It provides the fastest training and generation, does not need a GPU, and supports conditional data generation with the ability to prompt off multiple column values, e.g.:
age: 50, height: 5'2, location: Austin, favorite_food: , favorite_band:
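A partial record like the one above can be thought of as two parts: the fields supplied as seeds and the fields left for the model to generate. The sketch below makes that split explicit; the helper name is hypothetical and the code does not call any Gretel API.

```python
def split_partial_record(record):
    """Separate provided seed values from the fields the model should fill in."""
    seeds = {key: value for key, value in record.items() if value not in (None, "")}
    to_generate = [key for key, value in record.items() if value in (None, "")]
    return seeds, to_generate


record = {
    "age": 50,
    "height": "5'2",
    "location": "Austin",
    "favorite_food": "",
    "favorite_band": "",
}

seeds, to_generate = split_partial_record(record)
print("seed values:", seeds)
print("fields to generate:", to_generate)
```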
For maximum accuracy with large tabular datasets, start with ACTGAN. Often this is a great step to improve accuracy on tabular datasets after trying Amplify. If you need to prompt the model with numeric values, use Amplify, or consider converting the values to a string value.
If you’re working with natural language text, start with GPT. It has been pretrained on millions of English-language documents, and also supports GPT models fine-tuned on other languages from the Hugging Face model hub by changing the pretrained_model parameter in the model config.
GPT is capable of few-shot learning (e.g., summarizing, translating, or creating new text on a certain topic from only a few examples on a topic and no model fine-tuning). An example few-shot learning prompt:
Review: 'This is the best movie ever!'
Review: 'Poor acting and a boring script. '
Review: 'I wish the movie never ended. I can't wait for the sequel.'
Review: 'Two thumbs down. I almost turned it off after 15 minutes.'
Review: 'Sublime acting and writing. I will definitely watch this again.'
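A few-shot prompt like this can be assembled programmatically from a list of example texts. The sketch below is one possible approach; ending the prompt with an open Review: ' line to cue the next completion is an assumption, and the function name is hypothetical.

```python
def build_few_shot_prompt(examples, prefix="Review: "):
    """Format each example on its own line, then leave one entry open for the model."""
    lines = [f"{prefix}'{text}'" for text in examples]
    lines.append(f"{prefix}'")  # assumed cue: an open quote for the model to complete
    return "\n".join(lines)


examples = [
    "This is the best movie ever!",
    "Poor acting and a boring script.",
    "I wish the movie never ended. I can't wait for the sequel.",
]

prompt = build_few_shot_prompt(examples)
print(prompt)
```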
GPT also supports zero-shot learning, answering or completing a prompt with no examples provided. Zero-shot learning can be challenging for current GPT models. When possible, consider providing a few examples like above to get the most consistent results. An example zero-shot learning prompt:
The coldest month of the year in Tokyo, Japan is
GPT models have a max_token_size limit. Depending on the GPT model used, a request can use 512 or more tokens, shared between prompt and completion. For example, with a 512-token limit, a 400-token prompt leaves at most 112 tokens for the completion. This is currently a technical limitation, but there are often creative ways to solve problems within the limit, e.g., condensing your prompt or breaking the text into smaller pieces. As a general rule of thumb, each token in the model equates to an average of 4 input characters.
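The token arithmetic above can be sketched as follows. The 4-characters-per-token figure is the rule of thumb from the text, so the estimate is approximate; the function names and the 512 default are illustrative and not part of any Gretel API.

```python
import math

CHARS_PER_TOKEN = 4  # rule-of-thumb average, not an exact tokenizer count


def estimate_tokens(text):
    """Roughly estimate how many tokens a piece of text will consume."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def completion_budget(prompt_tokens, max_tokens=512):
    """Tokens left for the completion once the prompt is accounted for."""
    return max(0, max_tokens - prompt_tokens)


prompt = "Review: 'This is the best movie ever!'"
used = estimate_tokens(prompt)
print("estimated prompt tokens:", used)
print("tokens left for completion:", completion_budget(used))
```

For an exact count, the text would need to be run through the tokenizer of the specific model in use.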
If you are working with an unsupported language or domain-specific text, such as tweets, medical documents, or chat logs, the LSTM can work quite well. The LSTM supports any language, encoding, or character set, but it requires a larger volume of text (10k+ samples) because it has not been pre-trained.
Start with one of our reference examples for conditional generation:
- Reference example for conditional generation with LSTM, Amplify, or ACTGAN - gretel-blueprints/retain_values_with_conditional_data_generation.ipynb at main · gretelai/gretel-blueprints
- Balancing gender bias in a medical dataset (while improving accuracy) - gretel-blueprints/balance_uci_heart_disease.ipynb at main · gretelai/gretel-blueprints
- Advanced example using GPT with encoded labels - gretel-blueprints/conditional_text_generation_with_gpt.ipynb at main · gretelai/gretel-blueprints
- Basic conditional generation with GPT - (Coming soon)