RULER: What's the Real Context Size of Your LLM?

Long-context evaluation framework with 13 tasks across 4 categories: retrieval (single-key and multi-key needle-in-a-haystack variants), multi-hop tracing (variable tracking), aggregation (common and frequent word extraction), and question answering. Tests genuine context utilization at lengths from 4K to 1M+ tokens, going well beyond the basic single-needle retrieval test.
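To make the retrieval category concrete, here is a minimal sketch of how a multi-key needle-in-a-haystack example could be assembled. It is an illustration only, not the RULER generator: the function name `build_niah_example`, the filler text, the needle wording, and the use of a sentence count (rather than a token budget) to control length are all assumptions.

```python
import random

# Hypothetical filler and needle wording, chosen for illustration only.
FILLER = "The grass is green. The sky is blue. The sun is yellow."
NEEDLE = "One of the special magic numbers for {key} is: {value}."


def build_niah_example(num_filler=50, num_needles=4, seed=0):
    """Return (prompt, expected_answer) for one synthetic retrieval example."""
    rng = random.Random(seed)

    # Each needle maps a key to a random 7-digit value.
    needles = {
        f"key-{i}": str(rng.randint(1_000_000, 9_999_999))
        for i in range(num_needles)
    }

    # Start from repeated filler sentences, then drop each needle
    # at a random position in the haystack.
    sentences = [FILLER] * num_filler
    for key, value in needles.items():
        position = rng.randint(0, len(sentences))
        sentences.insert(position, NEEDLE.format(key=key, value=value))

    # Ask about one key; the other needles act as distractors.
    query_key = rng.choice(list(needles))
    question = f"What is the special magic number for {query_key}?"
    prompt = " ".join(sentences) + "\n" + question
    return prompt, needles[query_key]


if __name__ == "__main__":
    prompt, answer = build_niah_example(num_filler=5, num_needles=2, seed=1)
    print(prompt)
    print("expected:", answer)
```

Raising `num_filler` lengthens the haystack; a real harness would more likely size it against a token count from the target model's tokenizer.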

Revealed that many models claiming 128K+ context windows degrade significantly at those lengths once tasks go beyond simple retrieval. Became the de facto standard for long-context evaluation, used by model developers (Anthropic, Meta, NVIDIA) to report context-window capabilities; RULER@128K and RULER@1M scores appear on major model cards. By Hsieh et al. (NVIDIA Research).

Paper (arXiv)

Paper

Citations 11

arXiv HTML

Evaluation Details

Tasks: 13

Domains: 4

Scoring: recall-based accuracy (string match that checks whether the target output appears in the model's response)
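The scoring rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's own evaluation code: the function names are hypothetical, and case-insensitive comparison is an assumption.

```python
def string_match_recall(response, references):
    """Fraction of reference strings that appear in the response.

    Assumption: matching is a case-insensitive substring check.
    """
    if not references:
        raise ValueError("at least one reference string is required")
    text = response.lower()
    hits = sum(1 for ref in references if ref.lower() in text)
    return hits / len(references)


def benchmark_accuracy(responses, references_per_example):
    """Mean recall over all examples, reported as a percentage."""
    if len(responses) != len(references_per_example):
        raise ValueError("responses and references must have equal length")
    scores = [
        string_match_recall(resp, refs)
        for resp, refs in zip(responses, references_per_example)
    ]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Two of the three expected values are present, so recall is 2/3.
    print(string_match_recall("The numbers are 123 and 456.", ["123", "456", "789"]))
```

Because the check is substring presence, a verbose response is not penalized as long as it contains the target output, which is why the metric is described as recall-based.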

Saturation: effectively saturated at the standard 4K-128K configuration. Top models in the official README results table average about 96% (Jamba-1.5-Large: 96.0 average, 95.1 at 128K; Gemini-1.5 Pro: 95.8 average), and the repository has moved on to a RULERv2 pipeline.

Domains: retrieval, multi-hop tracing, aggregation, question answering

Tags: benchmark, evaluation, long-context