MLE-bench | Lab Index

75 ML engineering competitions from Kaggle testing whether AI agents can train models, prepare datasets, and run experiments autonomously. Each task is a real Kaggle competition with a defined metric and test set. Agents operate in sandboxed environments with code execution and file I/O. Scored by Kaggle medal thresholds (bronze/silver/gold).

o1-preview with AIDE scaffolding achieves at least a bronze-medal level in 16.9% of competitions. Published at ICLR 2025. A key benchmark for measuring the ML research automation capability that labs are racing toward. By Chan, Chowdhury, Jaffe et al. (OpenAI).

Paper (arXiv) | GitHub

Paper

Venue: ICLR 2025

Citations: 9


Evaluation Details

Tasks: 75

Domains: 6

Scoring: Kaggle medal thresholds measured against human private leaderboards; the headline metric is the percentage of attempts awarded any medal (bronze or above)

Human baseline: per-competition baselines from Kaggle private leaderboards; agents earn medals as if competing against the human field (there is no single aggregate percentage)

Domains: natural language processing, computer vision, signal processing, tabular, audio classification, forecasting
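
The any-medal metric in the Scoring row above can be illustrated with a minimal sketch. It assumes each competition supplies gold, silver, and bronze score cut-offs taken from its private leaderboard; the names `MedalThresholds`, `medal_for`, and `any_medal_rate` are hypothetical and not part of the MLE-bench codebase, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MedalThresholds:
    """Hypothetical per-competition score cut-offs from a private leaderboard."""
    gold: float
    silver: float
    bronze: float
    higher_is_better: bool = True  # False for loss-style metrics such as RMSE


def medal_for(score: float, t: MedalThresholds) -> Optional[str]:
    """Return the best medal the score earns, or None if it misses bronze."""
    for name, cutoff in (("gold", t.gold), ("silver", t.silver), ("bronze", t.bronze)):
        meets = score >= cutoff if t.higher_is_better else score <= cutoff
        if meets:
            return name
    return None


def any_medal_rate(results: list[tuple[float, MedalThresholds]]) -> float:
    """Percentage of attempts that earn bronze or above."""
    if not results:
        return 0.0
    medals = sum(1 for score, t in results if medal_for(score, t) is not None)
    return 100.0 * medals / len(results)


# Made-up illustrative attempts, not real MLE-bench data.
accuracy_cutoffs = MedalThresholds(gold=0.95, silver=0.92, bronze=0.90)
loss_cutoffs = MedalThresholds(gold=0.30, silver=0.35, bronze=0.38, higher_is_better=False)
attempts = [
    (0.91, accuracy_cutoffs),  # clears bronze only
    (0.40, loss_cutoffs),      # loss too high, no medal
    (0.97, accuracy_cutoffs),  # clears gold
    (0.50, accuracy_cutoffs),  # no medal
]
print(any_medal_rate(attempts))  # 50.0
```

Because every competition has its own metric and cut-offs, the sketch compares each score only against that competition's thresholds and aggregates the medal outcomes, not the raw scores.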


benchmark, evaluation, coding, agentic