"Improving Language Understanding by Generative Pre-Training" — the paper that launched the GPT series. A 117M-parameter decoder-only Transformer (12 layers, 768 hidden dimensions, 12 attention heads) pre-trained on BookCorpus with a causal language modeling objective, then fine-tuned on downstream tasks, with a 512-token context window.

Demonstrated that generative pre-training followed by discriminative fine-tuning could achieve strong results across diverse NLP tasks, establishing the paradigm that would scale to GPT-2, GPT-3, and beyond. By Radford, Narasimhan, Salimans, and Sutskever.
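The headline dimensions above roughly account for the 117M parameter count. A back-of-the-envelope sketch, assuming the commonly cited BPE vocabulary of about 40,478 tokens (not stated above) and the standard Transformer block layout (4× feed-forward expansion, learned positional embeddings):

```python
# Rough parameter count for a GPT-1-style decoder-only Transformer.
# Dimensions from the paper: 12 layers, 768 hidden, 512 context.
# The vocabulary size (~40,478 BPE tokens) is an assumption.
n_layer, d_model, d_ff, n_ctx, n_vocab = 12, 768, 4 * 768, 512, 40_478

embeddings = n_vocab * d_model + n_ctx * d_model  # token + learned positional
attn = 4 * d_model * d_model + 4 * d_model        # Q, K, V, output projections (+ biases)
mlp = 2 * d_model * d_ff + d_ff + d_model         # two feed-forward projections (+ biases)
ln = 2 * (2 * d_model)                            # two LayerNorms per block (scale + bias)
per_layer = attn + mlp + ln

total = embeddings + n_layer * per_layer
print(f"{total / 1e6:.1f}M parameters")           # prints "116.5M parameters"
```

Under these assumptions the tally lands within rounding distance of the reported 117M, with roughly a quarter of the parameters in the token embedding matrix alone.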

Model Details

Architecture: Dense
Parameters: 117M
Context window: 512