MLE-bench
75 ML engineering competitions drawn from Kaggle that test whether AI agents can train models, prepare datasets, and run experiments autonomously. Each task is a real Kaggle competition with a defined metric and test set. Agents operate in sandboxed environments with code execution and file I/O. Submissions are scored against Kaggle medal thresholds (bronze/silver/gold).
o1-preview with AIDE scaffolding achieves at least a bronze medal in 16.9% of competitions. Published at ICLR 2025. A key benchmark for measuring the ML research automation capability that labs are racing toward. By Chan, Chowdhury, Jaffe et al. (OpenAI).
Paper
Evaluation Details
Tasks: 75
Domains: 6
Scoring: Kaggle medal thresholds vs. human private leaderboards; headline metric is the percentage of attempts awarded any medal (bronze or above)
Human baseline: Per-competition human baselines from Kaggle private leaderboards; agents earn medals as if competing against the human field (no single aggregate percentage)
Domains: natural language processing, computer vision, signal processing, tabular, audio classification, forecasting
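
The scoring rule above can be illustrated with a minimal Python sketch. It assumes each competition already comes with three threshold scores (gold, silver, bronze) derived from its private leaderboard, and it treats an attempt with no valid submission as earning no medal. The names `MedalThresholds`, `medal_for`, and `any_medal_rate`, along with every number in the usage example, are hypothetical and are not taken from the benchmark's own code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class MedalThresholds:
    """Hypothetical per-competition cutoffs taken from a private leaderboard."""
    gold: float
    silver: float
    bronze: float
    lower_is_better: bool = False  # True for error-style metrics such as RMSE


def medal_for(score: Optional[float], t: MedalThresholds) -> Optional[str]:
    """Return the best medal the score earns, or None if it earns none."""
    if score is None:  # assumption: no valid submission means no medal
        return None

    def meets(threshold: float) -> bool:
        return score <= threshold if t.lower_is_better else score >= threshold

    # Check from the hardest threshold down so the best medal wins.
    for name, threshold in (("gold", t.gold), ("silver", t.silver), ("bronze", t.bronze)):
        if meets(threshold):
            return name
    return None


def any_medal_rate(attempts: Sequence[Tuple[Optional[float], MedalThresholds]]) -> float:
    """Percentage of attempts that earn any medal (bronze or above)."""
    if not attempts:
        return 0.0
    medals = sum(medal_for(score, t) is not None for score, t in attempts)
    return 100.0 * medals / len(attempts)


if __name__ == "__main__":
    # Made-up thresholds and scores, for illustration only.
    accuracy = MedalThresholds(gold=0.90, silver=0.80, bronze=0.70)
    rmse = MedalThresholds(gold=0.10, silver=0.20, bronze=0.35, lower_is_better=True)
    attempts = [(0.95, accuracy), (0.50, accuracy), (0.30, rmse), (None, rmse)]
    print(f"any-medal rate: {any_medal_rate(attempts):.1f}%")
```

Keeping the metric direction inside the threshold record lets one function handle both accuracy-style and error-style competitions, which matters because each task keeps its original Kaggle metric.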