CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
A 10-million-token benchmark for corpus-level analysis and reasoning: answers require global integration, comparison, and statistical aggregation across evidence dispersed through the corpus, not retrieval of a few relevant chunks. It uses a novel data-synthesis framework that decouples reasoning from textual representation, with programmatically guaranteed ground-truth answers that eliminate human annotation bias.
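The decoupling idea can be illustrated with a minimal sketch (all names and the task here are hypothetical, not from the paper): structured facts are sampled first, the ground truth is computed programmatically from those facts, and only then are the facts rendered into text scattered across many documents — so the answer is guaranteed correct without human annotation.

```python
import random

def synthesize_corpus_task(n_docs=50, seed=0):
    """Hypothetical sketch of reasoning/representation decoupling.

    Facts are sampled first (the reasoning layer), then rendered into
    text (the representation layer); the ground truth is derived from
    the facts, so it holds regardless of how the text is phrased.
    """
    rng = random.Random(seed)
    categories = ["alpha", "beta", "gamma"]
    facts = [rng.choice(categories) for _ in range(n_docs)]

    # Render each fact into its own document: evidence is dispersed,
    # so answering requires aggregating over the whole corpus.
    docs = [f"Report #{i}: the observed category was {c}." for i, c in enumerate(facts)]

    # Ground truth computed programmatically from the underlying facts.
    counts = {c: facts.count(c) for c in categories}
    answer = max(counts, key=counts.get)

    question = "Across all reports, which category is observed most often?"
    return docs, question, answer

docs, question, answer = synthesize_corpus_task()
```

A corpus-level question like this cannot be answered from any single chunk, which is what makes naive retrieval insufficient for such tasks.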
Key finding: state-of-the-art long-context LLMs degrade as input length increases, and standard RAG systems collapse entirely on corpus-level tasks, while memory-augmented agentic architectures prove more robust. The framework also demonstrates utility beyond evaluation: fine-tuning on its synthesized data yields performance improvements. By Lu, Li, Shi, Shen, Yan, and Huang (Tongyi Lab, Alibaba Group).