Conditional Generation FAQ
Conditional data generation (sometimes called seeding or prompting) is a technique where a generative model is asked to generate data that matches a pre-specified condition, such as a topic, a sentiment, or one or more field values in a tabular, text, or image-based dataset.
Some of the primary use cases for conditional data generation include:
1. Retaining key fields such as primary keys, labels, or classes in a synthetically generated dataset. This is a key enabler for synthesizing relational databases.
2. Oversampling minority classes in a dataset to address class imbalance in machine learning training sets. For example, creating additional “fraudulent” records from a limited set of samples to help a financial model better detect fraud (see the sketch after this list).
3. Filling in missing data, i.e., asking a generative model to complete missing fields based on data observed in the dataset or on public data, e.g., generating synthetic user details for users who did not consent to or opted out of data collection.
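As a concrete illustration of use case 2, the sketch below builds seed records that pin the class label to the minority value, so every generated record belongs to that class. The input data is made up, and the `model.generate(seed_df=...)` call at the end is a hypothetical placeholder; the exact API for supplying seeds varies by model.

```python
# A minimal sketch of oversampling a minority class via conditional seeds.
import pandas as pd

# A heavily imbalanced training set (illustrative data).
df = pd.DataFrame({
    "amount": [12.5, 830.0, 44.1, 9.99, 1200.0],
    "label": ["legit", "legit", "legit", "legit", "fraud"],
})

# Build seed records that fix the label to the minority class. Each seed
# row conditions the model to generate one synthetic "fraud" record.
n_extra = 100
seeds = pd.DataFrame({"label": ["fraud"] * n_extra})

# The seeds would then be passed to the trained model's generation call
# (hypothetical name), so every generated record carries label == "fraud":
#   synthetic = model.generate(seed_df=seeds)
print(seeds.head())
```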
If you’re working with natural language text, start with GPT. It has been pretrained on millions of English-language documents, and it also supports GPT models fine-tuned on other languages from the Hugging Face model hub; to use one, change the `pretrained_model` parameter in the model config.
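To illustrate what swapping the underlying pretrained model means, the sketch below loads a German GPT-2 directly from the Hugging Face hub (`dbmdz/german-gpt2` is a real hub model, used here only as an example). This shows the Hugging Face side of the swap; the exact Gretel config schema is not reproduced here.

```python
# A minimal sketch of using a non-English GPT from the Hugging Face hub.
from transformers import pipeline

# In Gretel, the equivalent swap is made via the pretrained_model field
# in the model config.
generator = pipeline("text-generation", model="dbmdz/german-gpt2")
print(generator("Das Wetter heute ist", max_new_tokens=20)[0]["generated_text"])
```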
GPT is capable of few-shot learning, e.g., summarizing, translating, or creating new text on a certain topic from only a few examples and no model fine-tuning. An example few-shot learning prompt:
Review: 'This is the best movie ever!'
Review: 'Poor acting and a boring script.'
Review: 'I wish the movie never ended. I can't wait for the sequel.'
Review: 'Two thumbs down. I almost turned it off after 15 minutes.'
Review: 'Sublime acting and writing. I will definitely watch this again.'
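To see few-shot conditioning in action, the sketch below feeds the review prompt above to a vanilla GPT-2 from Hugging Face (a stand-in for a fine-tuned model) and lets it continue the pattern with a new review. No fine-tuning is involved.

```python
# A minimal sketch of few-shot conditional generation with GPT-2.
from transformers import pipeline

few_shot_prompt = "\n".join([
    "Review: 'This is the best movie ever!'",
    "Review: 'Poor acting and a boring script.'",
    "Review: 'I wish the movie never ended. I can't wait for the sequel.'",
    "Review: 'Two thumbs down. I almost turned it off after 15 minutes.'",
    "Review: 'Sublime acting and writing. I will definitely watch this again.'",
    "Review: '",  # the model completes this line with a new review
])

generator = pipeline("text-generation", model="gpt2")
out = generator(few_shot_prompt, max_new_tokens=30, do_sample=True)
print(out[0]["generated_text"])
```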
GPT also supports zero-shot learning: answering or completing a prompt with no examples provided. Zero-shot learning can be challenging for current GPT models; when possible, provide a few examples as shown above to get the most consistent results. An example zero-shot learning prompt:
The coldest month of the year in Tokyo, Japan is
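The same setup works for a zero-shot prompt, shown in the short sketch below (again using vanilla GPT-2 as a stand-in). As noted above, expect less consistent results than with few-shot prompting.

```python
# A minimal sketch of zero-shot completion: no examples, just a prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The coldest month of the year in Tokyo, Japan is",
                max_new_tokens=10)[0]["generated_text"])
```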
If you are working with an unsupported language or with domain-specific text, such as tweets, medical documents, or chat logs, the LSTM can work quite well. The LSTM supports any language, encoding, or character set, but it requires a larger volume of text (10k+ samples) because it has not been pretrained.
If you’re working with tabular data that has fewer than 20 columns of mixed types (text, categorical, numeric), try starting with the LSTM. It provides the most flexibility and great accuracy, but training and generation take longer.
If you are working with millions of records or terabytes of data, start with Gretel Amplify. It provides the fastest training and generation, requires no GPU, and supports conditional generation prompted off multiple column values, e.g.:
age: 50, height: 5'2, location: Austin, favorite_food: , favorite_band:
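In practice, a multi-column prompt like the one above amounts to a seed row that fixes the known fields and leaves the rest for the model to fill in. The sketch below shows one way to express that with pandas; the `model.generate(seed_df=...)` call in the comment is a hypothetical placeholder, as the exact API differs by model.

```python
# A minimal sketch of conditioning on multiple column values at once.
import pandas as pd

# Each row fixes age, height, and location; favorite_food and
# favorite_band are left for the model to generate.
seeds = pd.DataFrame([
    {"age": 50, "height": "5'2", "location": "Austin"},
    {"age": 34, "height": "5'9", "location": "Chicago"},
])

# The model would complete the remaining columns for each seed row, e.g.:
#   synthetic = model.generate(seed_df=seeds)
print(seeds)
```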
For maximum accuracy on large tabular datasets, start with ACTGAN. It is often a good next step for improving accuracy after trying Amplify. If you need to prompt the model with numeric values, use Amplify, or consider converting the values to strings, as sketched below.
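The string-conversion workaround mentioned above is a one-line cast per column before training; the tiny sketch below shows it with pandas on made-up data.

```python
# A minimal sketch of casting numeric columns to strings so their values
# can be used as conditional prompts.
import pandas as pd

df = pd.DataFrame({"age": [50, 34], "income": [72000, 58000]})
df["age"] = df["age"].astype(str)        # "50", "34" can now serve as seeds
df["income"] = df["income"].astype(str)
print(df.dtypes)
```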
Start with one of our reference examples for conditional generation:
- Basic conditional generation with GPT - (Coming soon)