CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
A 10-million-token benchmark for corpus-level analysis and reasoning: answers require global integration, comparison, and statistical aggregation across evidence dispersed through the corpus, not retrieval of a few relevant chunks. It uses a novel data-synthesis framework that decouples reasoning from textual representation, with programmatically guaranteed ground-truth answers that eliminate human annotation bias.
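The decoupling idea can be illustrated with a minimal sketch (all names and the task here are hypothetical, not from the paper): structured facts are sampled first, the ground truth is computed programmatically from those facts, and only then are the facts rendered into text scattered across many documents — so the answer is guaranteed correct without human annotation.

```python
import random

def synthesize_corpus_task(n_docs=50, seed=0):
    """Hypothetical sketch of reasoning/representation decoupling.

    Facts are sampled first (the reasoning layer), then rendered into
    text (the representation layer); the ground truth is derived from
    the facts, so it holds regardless of how the text is phrased.
    """
    rng = random.Random(seed)
    categories = ["alpha", "beta", "gamma"]
    facts = [rng.choice(categories) for _ in range(n_docs)]

    # Render each fact into its own document: evidence is dispersed,
    # so answering requires aggregating over the whole corpus.
    docs = [f"Report #{i}: the observed category was {c}." for i, c in enumerate(facts)]

    # Ground truth computed programmatically from the underlying facts.
    counts = {c: facts.count(c) for c in categories}
    answer = max(counts, key=counts.get)

    question = "Across all reports, which category is observed most often?"
    return docs, question, answer

docs, question, answer = synthesize_corpus_task()
```

A corpus-level question like this cannot be answered from any single chunk, which is what makes naive retrieval insufficient for such tasks.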
Key finding: state-of-the-art long-context LLMs degrade as input length increases, and standard RAG systems collapse entirely on corpus-level tasks, while memory-augmented agentic architectures prove more robust. The framework also demonstrates utility beyond evaluation: fine-tuning on its synthesized data yields performance improvements. By Lu, Li, Shi, Shen, Yan, and Huang (Tongyi Lab, Alibaba Group).