RHELM
Benchmark for realistic, heterogeneous, and evolving long-horizon memory in AI assistants. Unlike static-dialogue memory benchmarks, RHELM provides synthetic multi-source memory streams (629 conversation sessions, 625 emails, and 1,053 attachments in Markdown/HTML) organized around 10 character personas with temporally evolving context.
The benchmark includes 1,305 QA pairs requiring multi-hop reasoning, temporal synthesis, and hallucination detection over heterogeneous memory sources. License: CC BY 4.0.
Paper
Evaluation Details
Questions: 1,305
Tasks: 7
Number of domains: 3
Scoring: LLM-as-judge graded accuracy (%)
Domains: dialogue history QA, external source QA, hybrid context QA
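
The scoring line above reports graded accuracy as a percentage. As a rough illustration of how such a figure could be aggregated, here is a minimal sketch. It assumes each judged answer is available as a `(domain, grade)` pair with the grade normalized to the range 0 to 1; that record layout, the function name `graded_accuracy`, and the choice to average over questions rather than over domains are all assumptions for illustration, not the benchmark's documented procedure.

```python
from collections import defaultdict

# Domain names are taken from the card; everything else here is hypothetical.
DOMAINS = ("dialogue history QA", "external source QA", "hybrid context QA")


def graded_accuracy(records):
    """Aggregate judge grades into percentages.

    records: iterable of (domain, grade) pairs, where grade is assumed to be
    a judge-assigned score in [0, 1] (1.0 = fully correct).
    Returns (overall_percent, {domain: percent}).
    """
    totals = defaultdict(float)  # sum of grades per domain
    counts = defaultdict(int)    # number of judged questions per domain
    for domain, grade in records:
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain!r}")
        if not 0.0 <= grade <= 1.0:
            raise ValueError(f"grade out of range: {grade!r}")
        totals[domain] += grade
        counts[domain] += 1

    n = sum(counts.values())
    if n == 0:
        raise ValueError("no records to score")

    # Overall score is averaged over questions, so larger domains weigh more.
    overall = 100.0 * sum(totals.values()) / n
    per_domain = {d: 100.0 * totals[d] / counts[d] for d in counts}
    return overall, per_domain


if __name__ == "__main__":
    sample = [
        ("dialogue history QA", 1.0),
        ("dialogue history QA", 0.0),
        ("external source QA", 0.5),
        ("hybrid context QA", 1.0),
    ]
    overall, per_domain = graded_accuracy(sample)
    print(f"overall: {overall:.1f}%")
    for name, pct in per_domain.items():
        print(f"{name}: {pct:.1f}%")
```

Averaging over questions means a domain with more questions contributes more to the overall figure; averaging the per-domain percentages instead would weight the three domains equally.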