RHELM | Lab Index

Benchmark for realistic, heterogeneous, and evolving long-horizon memory in AI assistants. Unlike static-dialogue memory benchmarks, RHELM provides synthetic multi-source memory streams — 629 conversation sessions, 625 emails, and 1,053 attachments (Markdown/HTML) — organized around 10 character personas with temporally evolving context.

The benchmark includes 1,305 QA pairs requiring multi-hop reasoning, temporal synthesis, and hallucination detection over heterogeneous memory sources. License: CC BY 4.0.

Links: HuggingFace, Paper (arXiv), GitHub (eval harness)

Paper

arXiv HTML

Evaluation Details

Questions: 1,305

Tasks: 7

Domains: 3

Scoring: LLM-as-judge graded accuracy (%)

Domains: dialogue history QA, external source QA, hybrid context QA
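
As a rough illustration of how a graded-accuracy percentage could be aggregated from judge verdicts, here is a minimal sketch. It assumes each question yields a binary correct/incorrect verdict tagged with one of the three domains listed above. The function name `graded_accuracy` and the `(domain, is_correct)` input shape are hypothetical and not taken from the RHELM eval harness, which may grade differently (for example, with partial credit).

```python
from collections import defaultdict


def graded_accuracy(verdicts):
    """Aggregate judge verdicts into accuracy percentages.

    `verdicts` is an iterable of (domain, is_correct) pairs. This input
    shape is an assumption for illustration, not the harness's format.
    Returns (overall_percent, {domain: percent}).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in verdicts:
        totals[domain] += 1
        correct[domain] += int(bool(is_correct))

    # Per-domain accuracy as a percentage of questions judged correct.
    per_domain = {d: 100.0 * correct[d] / totals[d] for d in totals}

    # Overall accuracy is computed over all questions, not averaged by domain.
    n = sum(totals.values())
    overall = 100.0 * sum(correct.values()) / n if n else 0.0
    return overall, per_domain


if __name__ == "__main__":
    # Made-up verdicts purely to exercise the function.
    sample = [
        ("dialogue history QA", True),
        ("dialogue history QA", False),
        ("external source QA", True),
        ("hybrid context QA", True),
    ]
    overall, per_domain = graded_accuracy(sample)
    print(f"overall: {overall:.1f}%")
    for name, pct in sorted(per_domain.items()):
        print(f"{name}: {pct:.1f}%")
```

Computing the overall figure over all questions rather than as a mean of per-domain scores keeps larger domains from being under-weighted. Which convention the leaderboard uses is not stated here.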


Tags: eval, benchmark, long-context, agentic
