GPT-3

"Language Models are Few-Shot Learners" — 175B parameter dense Transformer (96 layers, 12288 hidden, 96 heads) trained on 300B tokens from a filtered CommonCrawl mix. 2048 token context. Demonstrated that scaling alone enables in-context few-shot learning without gradient updates.

GPT-3 was a landmark: it showed that a single model could perform translation, question answering, and code generation via prompting alone. NeurIPS 2020. One of the most-cited AI papers ever (~30K+ citations). Powered the initial ChatGPT and API products. By Brown, Mann, Ryder, Subbiah et al. Proprietary.

Paper (arXiv)

Model Details

Architecture DENSE

Parameters 175B

Context window 2,048

Training tokens 300B

Paper

Venue NeurIPS 2020

Citations 3,029

arXiv HTML

frontierfoundational

Your notes

Model Details

Paper

Related