MLE-bench
75 ML engineering competitions from Kaggle that test whether AI agents can train models, prepare datasets, and run experiments autonomously. Each task is a real Kaggle competition with a defined metric and held-out test set. Agents operate in sandboxed environments with code execution and file I/O. Submissions are scored against Kaggle medal thresholds (bronze/silver/gold).
o1-preview with AIDE scaffolding achieves at least a bronze medal in 16.9% of competitions. Published at ICLR 2025. A key benchmark for measuring the ML research automation capability that labs are racing toward. By Chan et al. (OpenAI).