"Language Models are Few-Shot Learners" — 175B parameter dense Transformer (96 layers, 12288 hidden, 96 heads) trained on 300B tokens from a filtered CommonCrawl mix. 2048 token context. Demonstrated that scaling alone enables in-context few-shot learning without gradient updates.

GPT-3 was a landmark: it showed that a single model could perform translation, question answering, and code generation via prompting alone. NeurIPS 2020. One of the most-cited AI papers ever (~30K+ citations). Powered the initial ChatGPT and API products. By Brown, Mann, Ryder, Subbiah et al. Proprietary.

Model Details

Architecture DENSE
Parameters 175B
Context window 2,048
Training tokens 300B

Paper

Venue NeurIPS 2020
frontierfoundational

Related