GPT-2
Introduced in "Language Models are Unsupervised Multitask Learners": a 1.5B-parameter decoder-only Transformer (48 layers, 1600 hidden size, 25 attention heads) trained on WebText, roughly 40GB of text scraped from outbound Reddit links with at least 3 karma. 1,024-token context. Released in stages due to concerns about misuse, making it the first major AI-safety-motivated staged release.
Demonstrated that sufficiently large language models can perform downstream tasks in a zero-shot setting, without explicit fine-tuning, achieving state-of-the-art results on 7 of 8 tested language-modeling datasets. GPT-2 became a foundational building block for the open-source community and remains one of the most-used models on HuggingFace. By Radford, Wu, Child, Luan, Amodei, and Sutskever. MIT License.
Model Details
Architecture DENSE
Parameters 1.5B
Context window 1,024
Variants
| Name | Parameters | Layers | Hidden size | Heads |
|---|---|---|---|---|
| GPT-2 Small | 124M | 12 | 768 | 12 |
| GPT-2 Medium | 355M | 24 | 1024 | 16 |
| GPT-2 Large | 774M | 36 | 1280 | 20 |
| GPT-2 XL | 1.5B | 48 | 1600 | 25 |
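The per-variant parameter counts follow directly from each model's depth and width. A back-of-envelope sanity check, assuming the standard GPT-2 layout (learned positional embeddings, tied input/output embeddings, biases on all linear layers, vocabulary of 50,257 BPE tokens):

```python
def gpt2_param_count(n_layer: int, d_model: int,
                     vocab_size: int = 50257, n_ctx: int = 1024) -> int:
    """Approximate parameter count for a GPT-2-style decoder-only Transformer."""
    # Token embeddings (shared with the output head) + learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Per block: QKV projection (3d^2 + 3d), attention output projection (d^2 + d),
    # MLP up-projection (4d^2 + 4d), MLP down-projection (4d^2 + d),
    # and two LayerNorms (2 * 2d). Total: 12d^2 + 13d.
    per_block = 12 * d_model**2 + 13 * d_model
    final_layernorm = 2 * d_model
    return embeddings + n_layer * per_block + final_layernorm

# Published (layers, hidden size) for the four released variants.
variants = {"Small": (12, 768), "Medium": (24, 1024),
            "Large": (36, 1280), "XL": (48, 1600)}
for name, (n_layer, d_model) in variants.items():
    print(f"GPT-2 {name}: {gpt2_param_count(n_layer, d_model) / 1e6:.0f}M")
```

This reproduces the table's figures: 124M, 355M, 774M, and roughly 1558M (quoted as 1.5B) for the XL model. Attention head count does not affect the total, since the heads partition the same hidden dimension.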