1,320 real-world professional tasks across 44 occupations in the 9 largest US GDP-contributing industries (real estate, government, manufacturing, healthcare, finance, etc.). Tasks produce actual deliverables: legal briefs, financial analyses, presentations, spreadsheets. Created by industry professionals averaging 14 years of experience. Scored via blind pairwise comparison against human expert baselines. 220-task gold subset open-sourced.

Carries the highest weight (20%) in the AA Intelligence Index v4.1 via GDPval-AA v2, Artificial Analysis' independent re-evaluation using their Stirrup agentic harness. Represents a shift toward measuring the economic value of AI capability rather than academic task performance. Grader agreement is a known weakness: human expert graders agree with each other about 71% of the time, and the experimental automated grader agrees with expert graders about 66% of the time. By Patwardhan, Dias, Proehl et al. (OpenAI).
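One way to read the agreement figures is as simple percent agreement: the share of comparisons on which two graders pick the same outcome. The sketch below assumes that reading; the paper's exact agreement metric may differ, and the function name `percent_agreement` and the label values are hypothetical.

```python
def percent_agreement(grader_a, grader_b):
    """Fraction of items on which two graders gave the same label.

    Assumption: both lists hold one label per comparison, in the same order.
    The labels themselves (e.g. "model", "human", "tie") are illustrative.
    """
    if len(grader_a) != len(grader_b):
        raise ValueError("graders must label the same number of items")
    if not grader_a:
        raise ValueError("need at least one labeled item")
    matches = sum(a == b for a, b in zip(grader_a, grader_b))
    return matches / len(grader_a)


# Made-up labels for illustration only, not benchmark data.
expert = ["model", "human", "tie", "human"]
auto = ["model", "human", "human", "human"]
print(percent_agreement(expert, auto))  # graders match on 3 of 4 items
```

Percent agreement does not correct for agreement expected by chance, so it is best treated as a rough indicator.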

Paper

Citations 1

Evaluation Details

Tasks: 1,320
Domains: 9
Scoring: blinded expert pairwise comparisons vs. human expert deliverables (win/tie rate) on the 220-task open-sourced gold subset; experimental automated grader (66% agreement with expert graders, vs. 71% human inter-rater agreement)
Used in: AA Intelligence Index v4.1 (as GDPval-AA v2)
Domains: real estate, manufacturing, professional/scientific/technical services, government, healthcare, finance & insurance, retail trade, wholesale trade, information
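The win/tie rate named in the Scoring entry can be illustrated with a minimal sketch. It assumes each blinded comparison is recorded as one of three hypothetical labels ("win", "tie", "loss") from the model's point of view; the label names and the function `win_tie_rate` are illustrative and are not taken from the GDPval paper or its released grader.

```python
from collections import Counter

VALID_LABELS = {"win", "tie", "loss"}  # hypothetical label set


def win_tie_rate(judgments):
    """Fraction of comparisons in which the model deliverable won or tied.

    `judgments` holds one label per blinded pairwise comparison, each from
    the model's perspective against the human expert deliverable.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["win"] + counts["tie"]) / len(judgments)


# Made-up judgments for illustration only, not benchmark results.
example = ["win", "loss", "tie", "loss", "win"]
print(win_tie_rate(example))  # 3 of 5 comparisons are wins or ties
```

Counting ties together with wins matches the "win/tie rate" wording; a stricter variant would count wins only.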
benchmark, evaluation, agentic