1,320 real-world professional tasks across 44 occupations in the 9 industries that contribute most to US GDP (real estate, government, manufacturing, healthcare, finance, etc.). Tasks produce actual deliverables such as legal briefs, financial analyses, presentations, and spreadsheets. Created by industry professionals averaging 14 years of experience. Scored via blind pairwise comparison against human expert baselines. A 220-task gold subset is open-sourced.
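As a rough illustration of how blind pairwise judgments can be rolled up into a single score, the sketch below computes a win rate from a list of per-task verdicts. The label names (`"model"`, `"human"`, `"tie"`), the `win_rate` helper, and the choice to give ties half credit by default are all assumptions made for this example; the benchmark's own aggregation may differ.

```python
from collections import Counter

VALID_LABELS = {"model", "human", "tie"}  # hypothetical verdict labels


def win_rate(judgments, tie_credit=0.5):
    """Share of comparisons credited to the model.

    Each judgment is one blind verdict: which deliverable the grader
    preferred, or "tie". Ties earn `tie_credit` (an assumed convention).
    """
    if not judgments:
        raise ValueError("no judgments to score")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["model"] + tie_credit * counts["tie"]) / len(judgments)


# Made-up verdicts for eight tasks: 3 model wins, 3 human wins, 2 ties.
sample = ["model", "human", "tie", "model", "human", "human", "tie", "model"]
print(win_rate(sample))                  # ties count as half a win
print(win_rate(sample, tie_credit=1.0))  # "wins or ties" style rate
```

Passing `tie_credit=1.0` gives a "wins or ties" rate instead, which is a more lenient way to read the same verdicts.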

Carries the highest weight (16.7%) in the AA Intelligence Index v4.0 (via GDPval-AA, Artificial Analysis' independent re-evaluation using their Stirrup agentic harness). Represents a shift toward measuring the economic value of AI capability rather than academic task performance. Inter-rater reliability is a known weakness (~66–71% agreement). By Patwardhan, Dias, Proehl et al. (OpenAI).
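To make the agreement figure concrete, here is a minimal sketch of one common way to measure inter-rater reliability: the fraction of grader pairs that gave the same verdict on the same task. The `pairwise_agreement` helper and the sample data are hypothetical, and this is not necessarily the statistic behind the ~66–71% figure.

```python
from itertools import combinations


def pairwise_agreement(ratings):
    """Fraction of same-task grader pairs whose verdicts match.

    `ratings` holds one list of verdicts per task, one verdict per grader.
    Tasks with a single grader contribute no pairs.
    """
    agree = 0
    total = 0
    for labels in ratings:
        for a, b in combinations(labels, 2):
            agree += a == b
            total += 1
    if total == 0:
        raise ValueError("need at least one task with two or more graders")
    return agree / total


# Made-up verdicts: four tasks, two or three graders each.
sample_ratings = [
    ["model", "model", "human"],  # 1 of 3 pairs agree
    ["human", "human"],           # 1 of 1 pairs agree
    ["tie", "tie"],               # 1 of 1 pairs agree
    ["model", "model"],           # 1 of 1 pairs agree
]
print(round(pairwise_agreement(sample_ratings), 3))
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa would be a stricter alternative.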

Paper

benchmark, evaluation, agentic