"Improving Language Understanding by Generative Pre-Training" — the paper that launched the GPT series. A 117M-parameter decoder-only Transformer (12 layers, 768 hidden dimensions, 12 attention heads) pre-trained on BookCorpus with a causal language modeling objective, then fine-tuned on downstream tasks, with a 512-token context window.

Demonstrated that generative pre-training followed by discriminative fine-tuning could achieve strong results across diverse NLP tasks, establishing the paradigm that would scale to GPT-2, GPT-3, and beyond. By Radford, Narasimhan, Salimans, and Sutskever.
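The headline dimensions above roughly account for the 117M parameter count. A back-of-the-envelope sketch, assuming the commonly cited BPE vocabulary of about 40,478 tokens (not stated above) and the standard Transformer block layout (4× feed-forward expansion, learned positional embeddings):

```python
# Rough parameter count for a GPT-1-style decoder-only Transformer.
# Dimensions from the paper: 12 layers, 768 hidden, 512 context.
# The vocabulary size (~40,478 BPE tokens) is an assumption.
n_layer, d_model, d_ff, n_ctx, n_vocab = 12, 768, 4 * 768, 512, 40_478

embeddings = n_vocab * d_model + n_ctx * d_model  # token + learned positional
attn = 4 * d_model * d_model + 4 * d_model        # Q, K, V, output projections (+ biases)
mlp = 2 * d_model * d_ff + d_ff + d_model         # two feed-forward projections (+ biases)
ln = 2 * (2 * d_model)                            # two LayerNorms per block (scale + bias)
per_layer = attn + mlp + ln

total = embeddings + n_layer * per_layer
print(f"{total / 1e6:.1f}M parameters")           # prints "116.5M parameters"
```

Under these assumptions the tally lands within rounding distance of the reported 117M, with roughly a quarter of the parameters in the token embedding matrix alone.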

Model Details

Architecture: Dense
Parameters: 117M
Context window: 512