"Evaluating Large Language Models Trained on Code" — GPT-3 fine-tuned on publicly available code from GitHub. The largest variant, Codex-12B, solves 28.8% of HumanEval problems (72.3% with 100 samples). Introduced the HumanEval benchmark (164 hand-written Python programming problems) that became an industry standard for code evaluation.

Codex powered GitHub Copilot, the first widely-adopted AI coding assistant, transforming how millions of developers write code. The HumanEval benchmark it introduced is still used by virtually every coding model. By Chen, Tworek et al. Proprietary.

Model Details

Architecture DENSE
Parameters 12B

Paper

arXiv: 2107.03374

coding, foundational
