HakushoBench

A Japanese chart-and-table VQA benchmark built from 33 governmental white papers (hakusho), designed to assess deep, holistic chart understanding rather than local-cue lookups. It contains 2,053 manually validated VQA pairs over 2,053 images spanning 10+ image types (bar/line/pie charts, tables, dashboards, infographics), drawn from six domains including security, economy, and society.

From the LLM-jp consortium via NII. Authors: Sugiura, Kurita, Oda, Okazaki. License: Apache 2.0 (text); images are covered by Article 30-4 of the Japanese Copyright Act. Baseline results show a substantial gap on chart and table understanding between Gemini 3 Pro (top proprietary, 93.5%) and Qwen3-VL-8B (top open-weight, 58.6%).

Links: HuggingFace, Paper (arXiv), GitHub (eval harness)

Evaluation Details

Questions 2,053

Domains 6

Scoring free-form short-answer accuracy, LLM-judged. The judge (GPT-5.1, gpt-5.1-2025-11-13) marks each output correct or incorrect; there are no multiple-choice or yes/no questions.
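The scoring rule above reduces to the fraction of answers the judge marks correct. A minimal sketch of that aggregation follows; the function names `parse_verdict` and `accuracy` and the literal labels "correct"/"incorrect" are assumptions for illustration and are not taken from the benchmark's eval harness, whose judge prompt and output format may differ.

```python
from typing import Iterable


def parse_verdict(judge_output: str) -> bool:
    """Map a judge reply to a boolean (hypothetical label format)."""
    label = judge_output.strip().lower()
    if label == "correct":
        return True
    if label == "incorrect":
        return False
    # Fail loudly instead of silently counting an unparseable reply as wrong.
    raise ValueError(f"unexpected judge output: {judge_output!r}")


def accuracy(verdicts: Iterable[bool]) -> float:
    """Fraction of answers judged correct."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


# Toy example with four judged answers (not real benchmark data).
replies = ["correct", "incorrect", "Correct", "correct"]
score = accuracy(parse_verdict(r) for r in replies)
print(f"{score:.1%}")
```

Raising on unparseable judge output is a design choice in this sketch: it keeps judge formatting failures from being counted as model errors.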

Saturation Gemini 3 Pro scores 93.5%; the authors note 'limited headroom to discriminate among frontier models'. Other models remain well below that level: GPT-5.1 scores 67.9%, and the best open-weight model, Qwen3-VL-8B, scores 58.6%.
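To make the spread concrete, the gaps to the top score can be computed in percentage points from the three accuracies reported above. This is a small sketch; the dictionary layout and variable names are illustrative only.

```python
# Reported accuracies (percent), as listed on this page.
scores = {
    "Gemini 3 Pro": 93.5,
    "GPT-5.1": 67.9,
    "Qwen3-VL-8B": 58.6,
}

top_model = max(scores, key=scores.get)

# Gap to the top score in percentage points, rounded to one decimal
# to avoid floating-point noise in the subtraction.
gaps = {
    name: round(scores[top_model] - value, 1)
    for name, value in scores.items()
    if name != top_model
}

for name, gap in gaps.items():
    print(f"{top_model} leads {name} by {gap} points")
```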

Domains: security, economy, society, infrastructure, energy & environment, diplomacy

Tags: eval, benchmark, multimodal, japanese
