GPT-2
Introduced in "Language Models are Unsupervised Multitask Learners": a 1.5B-parameter decoder-only Transformer (48 layers, 1600 hidden size, 25 attention heads) trained on WebText, roughly 40GB of text scraped from outbound Reddit links with at least 3 karma. 1,024-token context. Released in stages due to concerns about misuse, making it the first major AI-safety-motivated staged release.
Demonstrated that sufficiently large language models can perform downstream tasks in a zero-shot setting, without explicit fine-tuning, achieving state-of-the-art results on 7 of 8 tested language-modeling datasets. GPT-2 became a foundational building block for the open-source community and remains one of the most-used models on HuggingFace. By Radford, Wu, Child, Luan, Amodei, and Sutskever. MIT License.
Model Details
Architecture DENSE
Parameters 1.5B
Context window 1,024
Variants
| Name | Parameters | Layers | Hidden size | Heads |
|---|---|---|---|---|
| GPT-2 Small | 124M | 12 | 768 | 12 |
| GPT-2 Medium | 355M | 24 | 1024 | 16 |
| GPT-2 Large | 774M | 36 | 1280 | 20 |
| GPT-2 XL | 1.5B | 48 | 1600 | 25 |
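The per-variant parameter counts follow directly from each model's depth and width. A back-of-envelope sanity check, assuming the standard GPT-2 layout (learned positional embeddings, tied input/output embeddings, biases on all linear layers, vocabulary of 50,257 BPE tokens):

```python
def gpt2_param_count(n_layer: int, d_model: int,
                     vocab_size: int = 50257, n_ctx: int = 1024) -> int:
    """Approximate parameter count for a GPT-2-style decoder-only Transformer."""
    # Token embeddings (shared with the output head) + learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Per block: QKV projection (3d^2 + 3d), attention output projection (d^2 + d),
    # MLP up-projection (4d^2 + 4d), MLP down-projection (4d^2 + d),
    # and two LayerNorms (2 * 2d). Total: 12d^2 + 13d.
    per_block = 12 * d_model**2 + 13 * d_model
    final_layernorm = 2 * d_model
    return embeddings + n_layer * per_block + final_layernorm

# Published (layers, hidden size) for the four released variants.
variants = {"Small": (12, 768), "Medium": (24, 1024),
            "Large": (36, 1280), "XL": (48, 1600)}
for name, (n_layer, d_model) in variants.items():
    print(f"GPT-2 {name}: {gpt2_param_count(n_layer, d_model) / 1e6:.0f}M")
```

This reproduces the table's figures: 124M, 355M, 774M, and roughly 1558M (quoted as 1.5B) for the XL model. Attention head count does not affect the total, since the heads partition the same hidden dimension.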