OfficeComprehensionBenchmark
Benchmark for document comprehension and grounded reasoning over Microsoft Office files (Word, Excel, PowerPoint). Two evaluation tracks: File Fidelity Q&A (922 queries over 244 files testing structural/visual perception of text, tables, charts, formulas, and embedded objects) and Domain Q&A (120 queries with 8,450 atomic assertions over 124 files spanning 12 industries, testing expert-level reasoning on real business documents).
1,319 rows / ~3 GB. CDLA-Permissive-2.0 for OCB-authored content; upstream source files retain their original licenses. Shaik et al. (2026).
Evaluation Details
Questions: 1,042
Tasks: 2
Domains: 12
Scoring: LLM-as-judge against weighted atomic assertion rubrics (single Azure OpenAI judge or GPT+Gemini+Claude majority vote)
Domains: finance, accounting, waste management & environmental services, healthcare & social assistance, information, manufacturing, government, educational services, wholesale/retail trade, corporate governance, energy, supply chain
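The scoring line above mentions weighted atomic assertion rubrics and a three-judge majority vote. The card does not give the exact aggregation formula, so the following is only a minimal sketch under stated assumptions: each assertion receives a binary pass/fail verdict from each judge, a strict majority decides whether it passes, and the query score is the weighted fraction of passed assertions. The names `Assertion`, `majority_vote`, and `rubric_score` are hypothetical and not part of the benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass
class Assertion:
    """One atomic rubric assertion (hypothetical structure)."""
    text: str
    weight: float


def majority_vote(verdicts):
    """Return True when strictly more than half of the judges voted pass."""
    return sum(bool(v) for v in verdicts) * 2 > len(verdicts)


def rubric_score(assertions, judge_verdicts):
    """Weighted fraction of assertions passed under majority vote.

    judge_verdicts[i] holds one boolean per judge for assertions[i].
    ASSUMPTION: the benchmark's real aggregation may differ from this.
    """
    total = sum(a.weight for a in assertions)
    if total == 0:
        return 0.0
    earned = sum(
        a.weight
        for a, verdicts in zip(assertions, judge_verdicts)
        if majority_vote(verdicts)
    )
    return earned / total


# Illustrative rubric with made-up assertions and weights.
rubric = [
    Assertion("States the correct quarterly total", 2.0),
    Assertion("Cites the right worksheet", 1.0),
    Assertion("Mentions the chart title", 1.0),
]
# One row per assertion, one verdict per judge (three judges).
verdicts = [
    [True, True, False],   # passes 2 of 3
    [False, False, True],  # fails 1 of 3
    [True, True, True],    # passes 3 of 3
]
score = rubric_score(rubric, verdicts)
print(score)
```

With a single judge, the same function applies by passing one-element verdict lists, in which case the majority vote reduces to that judge's verdict.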