Addresses when agentic multimodal models should call external tools versus rely on internal knowledge. Proposes HDPO (Hybrid Decoupled Policy Optimization), which separates accuracy optimization and tool efficiency into independent channels, avoiding the single weighted-objective tradeoff that makes agents either tool-dependent or tool-avoidant.
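
To make the "independent channels" idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a GRPO-style setup in which several rollouts are sampled per prompt and each advantage is normalized within that group. It further assumes the efficiency channel compares tool usage only among rollouts that answered correctly. The names `normalize` and `decoupled_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values):
    """Group-relative advantage: subtract the group mean, divide by the group std."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every rollout scored the same, so there is no learning signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decoupled_advantages(correct, tool_calls):
    """Return (accuracy_advantages, efficiency_advantages) for one rollout group.

    correct:    list of bools, one per rollout (assumed correctness reward).
    tool_calls: list of ints, number of tool invocations per rollout.
    """
    # Accuracy channel: depends on correctness only, never on tool count.
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel (assumption): fewer tool calls is better, but the
    # comparison is made only among correct rollouts, so an agent cannot
    # earn efficiency credit by skipping tools and answering wrongly.
    correct_idx = [i for i, c in enumerate(correct) if c]
    eff_norm = normalize([-float(tool_calls[i]) for i in correct_idx])

    eff_adv = [0.0] * len(correct)
    for i, adv in zip(correct_idx, eff_norm):
        eff_adv[i] = adv
    return acc_adv, eff_adv


acc_adv, eff_adv = decoupled_advantages(
    correct=[True, True, False, False],
    tool_calls=[0, 4, 0, 6],
)
```

In this toy group, both correct rollouts receive the same accuracy advantage, and only the efficiency channel separates the rollout with zero tool calls from the one with four. The incorrect rollout with zero tool calls gets no efficiency credit, which is the failure mode a single weighted reward such as `accuracy - lambda * tool_calls` can reward by accident.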

Uses a curriculum learning progression: agents first master task resolution, then develop self-reliance. The resulting Metis model reduces tool invocations by orders of magnitude while also improving reasoning accuracy. A practical contribution toward cheaper, faster agentic systems through the elimination of unnecessary tool calls. By the Accio Team at Alibaba Group and HUST.


Tags: agentic, multimodal, efficiency, alignment