Model type: Generative Pre-trained Transformer for natural language text generation.
The Gretel GPT API enables conditional generation of state-of-the-art natural language text from an input prompt. The underlying model is a generative pre-trained transformer built on an open-source implementation of OpenAI's GPT-3 architecture and pre-trained on 825GB+ of mostly English text, code, and links from Wikipedia, Reddit, and other sources. GPT can be fine-tuned on domain-specific texts and datasets and used to create high-quality, coherent text. Because GPT-3 performs well in few-shot settings, it needs only a few examples to provide a relevant response.
This API is currently offered as a preview and may change. Please contact us at [email protected] if you have any questions or would like to discuss natural language text generation in more detail.
This model can be selected using the gpt_x model tag. Below is an example configuration that may be used to create and fine-tune a GPT model. All Gretel models implement a common interface to train or fine-tune synthetic data models from the model-specific config. See the reference example on how to Create and Train a Model.
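As a rough illustration, the training parameters described on this page could be gathered into a configuration like the one below. This is a minimal sketch written as a plain Python dictionary: the nesting under `models` and `gpt_x` is an assumption about the config layout rather than something this page confirms, and every value is simply the default listed in this document.

```python
# Sketch of a GPT fine-tuning configuration as a plain Python dict.
# ASSUMPTION: the "models" -> "gpt_x" nesting is illustrative only;
# the parameter names and default values are the ones documented on this page.
gpt_config = {
    "models": [
        {
            "gpt_x": {
                "data_source": "__tmp__",  # or a path to a CSV, JSON, or JSONL file
                "pretrained_model": "EleutherAI/gpt-neo-125M",
                "batch_size": 4,
                "epochs": 3,
                "weight_decay": 0.01,
                "warmup_steps": 100,
                "lr_scheduler": "linear",
                "learning_rate": 0.0002,
                "max_tokens": 512,
                "column_name": None,
                "validation": None,
                "generate": {"num_records": 10},
            }
        }
    ]
}

# Pull out the model-specific parameters for inspection.
params = gpt_config["models"][0]["gpt_x"]
print(params["pretrained_model"], params["batch_size"])
```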
Parameters that may be used to configure model training.
data_source (str, required) - __tmp__ or point to a valid and accessible file in CSV, JSON, or JSONL format.
pretrained_model (str, optional, defaults to EleutherAI/gpt-neo-125M) - Must be a valid model from the HuggingFace model repository with the letters `gpt` in the model name.
batch_size (int, optional, defaults to 4) - The batch size per GPU/TPU core/CPU for training. Note: if you hit OOM (out of memory) errors with your GPU, try lowering the batch size.
epochs (float, optional, defaults to 3) - Total number of training epochs to perform while fine-tuning the model.
weight_decay (float, optional, defaults to 0.01) - The weight decay to apply (if not zero) to all layers except bias and LayerNorm weights in the AdamW optimizer. Must be between 0 and 1 (inclusive).
warmup_steps (int, optional, defaults to 100) - The number of steps used for a linear warmup of the learning rate.
lr_scheduler (str, optional, defaults to linear) - The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values and details.
- Possible values include:
learning_rate (float, optional, defaults to 0.0002) - The initial learning rate for the optimizer.
max_tokens (int, optional, defaults to 512) - The maximum length (in number of tokens) for any input record. The tokenizer used corresponds to the pretrained model selected.
column_name (str, optional, defaults to null) - The name of the column used for training. For a multi-column training data input, this parameter is used to specify which column contains the natural language text that should be used for training.
validation ([bool, int], optional, defaults to null) - The test size to use for validation. The integer value represents the absolute number of test samples.
generate (dict, optional, defaults to generating 10 records) - Section that controls the output generated during model training.
- num_records (int, optional, defaults to 10) - The number of sample text outputs to generate during model training.
- All the parameters from the Data generation section (excluding data_source) can be used in the generate section.
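The constraints stated in the parameter list above can be checked before a training job is submitted. The following is a minimal sketch, not part of the Gretel API: `check_training_params` is a hypothetical helper, and treating the `gpt` name check as case-insensitive is an assumption.

```python
from typing import List


def check_training_params(params: dict) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper that only mirrors the constraints documented above.
    """
    problems = []

    # data_source is the only required parameter.
    if not params.get("data_source"):
        problems.append("data_source is required")

    # The model name must contain the letters "gpt".
    # ASSUMPTION: the comparison is case-insensitive.
    model = params.get("pretrained_model", "EleutherAI/gpt-neo-125M")
    if "gpt" not in model.lower():
        problems.append("pretrained_model must have 'gpt' in its name")

    # weight_decay must lie between 0 and 1, inclusive.
    weight_decay = params.get("weight_decay", 0.01)
    if not 0 <= weight_decay <= 1:
        problems.append("weight_decay must be between 0 and 1 (inclusive)")

    return problems


print(check_training_params({"data_source": "train.csv"}))
print(check_training_params({"pretrained_model": "bert-base-uncased", "weight_decay": 1.5}))
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem in one pass.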
For training data inputs with multiple columns, use the column_name parameter to specify which column to train on. Set column_name to the field name of the natural language text column.
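To make the role of column_name concrete, the sketch below picks the text column out of a small multi-column CSV. The data and the column names (`id`, `rating`, `review_text`) are made up for illustration, and the snippet only shows which field would be used for training; it does not reproduce how the service loads data.

```python
import csv
import io

# Hypothetical multi-column training data; all column names are invented.
raw = io.StringIO(
    "id,rating,review_text\n"
    "1,5,Great battery life and a sharp display.\n"
    "2,2,Stopped charging after a week.\n"
)

# column_name names the field that holds the natural language text.
column_name = "review_text"

rows = list(csv.DictReader(raw))
training_texts = [row[column_name] for row in rows]
print(training_texts)
```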
Parameters controlling the generation of new records. All Gretel models implement a common interface to generate new data. See the example on how to Generate data from a model.
data_source (str, optional) - Provide a series of prompts. If specified, this will override the num_records parameter, generating one record for each prompt in data_source. Must point to a valid and accessible file in single-column CSV, JSON, or JSONL format.
num_records (int, optional, defaults to 10) - The number of text outputs to generate.
maximum_text_length (int, optional, defaults to 42) - Maximum number of tokens to generate (not including the prompt) in output text.
top_p (float, optional, defaults to 0.89876) - If set to a float value < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
top_k (float, optional, defaults to 43) - Number of highest probability vocabulary tokens to keep for top_k filtering. Set to 0 to disable top_k filtering.
num_beams (int, optional, defaults to 1) - Number of beams for beam search. Set to 1 to disable beam search.
do_sample (bool, optional, defaults to True) - Whether or not to use sampling, otherwise use greedy decoding.
do_early_stopping (bool, optional, defaults to True) - Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
typical_p (float, optional, defaults to 0.8) - The amount of probability mass from the original distribution that we wish to consider.
temperature (float, optional, defaults to 1.0) - The value used to modulate the next token probabilities. Higher temperatures lead to more randomness in the output.
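The interaction of temperature, top_k, and top_p is easier to see on a toy vocabulary. The sketch below implements the general technique in pure Python; it is not Gretel's or HuggingFace's implementation, `filter_next_token_probs` is a hypothetical name, and the logits are invented. It scales the logits by the temperature, keeps the top_k most probable tokens, then keeps the smallest set of tokens whose cumulative probability reaches top_p.

```python
import math
import random
from typing import Dict


def filter_next_token_probs(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
) -> Dict[str, float]:
    """Return renormalized next-token probabilities after filtering."""
    # Temperature: divide the logits before the softmax. Values above 1
    # flatten the distribution, values below 1 sharpen it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # top_k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        norm = sum(p for _, p in ranked)
        ranked = [(tok, p / norm) for tok, p in ranked]

    # top_p: keep the smallest prefix whose cumulative probability
    # reaches top_p (values of 1 or more disable the filter).
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    norm = sum(p for _, p in ranked)
    return {tok: p / norm for tok, p in ranked}


# Invented logits for a four-token vocabulary.
logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
probs = filter_next_token_probs(logits, temperature=1.0, top_k=3, top_p=0.9)

# Sampling draws from the filtered distribution; greedy decoding
# would instead take the single most probable token.
sampled = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
greedy = max(probs, key=probs.get)
print(probs, sampled, greedy)
```

With these logits, top_k=3 drops the least likely token and top_p=0.9 then removes one more, so sampling chooses between the two most probable tokens.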
By default, the GPT task uses the GPT-Neo 125M model created by EleutherAI. GPT-Neo refers to the class of models, while 125M is the number of parameters of this particular pre-trained model. GPT-Neo 125M was trained on the Pile, a large-scale dataset, for 300 billion tokens over 572,300 steps. It was trained as a masked autoregressive language model, using cross-entropy loss.
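For readers unfamiliar with the training objective, cross-entropy loss for next-token prediction is the average negative log of the probability the model assigned to each correct next token. The sketch below computes it for invented probabilities; `next_token_cross_entropy` is a hypothetical helper, not code from the model's training pipeline.

```python
import math
from typing import Sequence


def next_token_cross_entropy(correct_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the correct next tokens (in nats)."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


# Invented probabilities the model assigned to the true next token
# at three positions of a sequence.
loss = next_token_cross_entropy([0.5, 0.25, 0.125])
print(loss)
```

A model that always assigned probability 1 to the correct token would score a loss of 0; lower probabilities push the loss up.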
If running this system in local mode (on-premises), the following instance types are recommended.
CPU: Minimum 4 cores, 32GB RAM.
GPU (required): Minimum Nvidia T4 or similar CUDA-compliant GPU with 16GB+ RAM to run basic language models. More advanced pre-trained language models with 1 billion+ parameters require an Nvidia V100 or A100 GPU or better with 40GB+ RAM.
Large-scale language models such as GPT-X may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information and examples, please see OpenAI's and EleutherAI's docs.
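One simple way to support human curation is to route suspicious outputs to a review queue before release. The sketch below is an illustration only: `split_for_review` and the blocklist terms are hypothetical, and a keyword match is a coarse first pass that does not replace human judgment.

```python
from typing import Iterable, List, Set, Tuple


def split_for_review(
    records: Iterable[str], blocklist: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split generated texts into (approved, needs_review) by keyword match."""
    approved, needs_review = [], []
    for text in records:
        # Normalize words: strip surrounding punctuation and lowercase.
        words = {word.strip(".,!?;:\"'").lower() for word in text.split()}
        if words & blocklist:
            needs_review.append(text)
        else:
            approved.append(text)
    return approved, needs_review


# Hypothetical generated outputs and placeholder blocklist terms.
generated = ["This is fine.", "Contains BADWORD here!"]
approved, needs_review = split_for_review(generated, {"badword"})
print(approved, needs_review)
```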