Empirical investigation into optimal compute distribution for web-agent post-training. Combines supervised fine-tuning of a Llama 3.1 8B student using a Llama 3.3 70B teacher, followed by on-policy reinforcement learning. Evaluates 1,370 hyperparameter configurations with statistical bootstrapping to identify effective settings.
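The bootstrapping step can be sketched roughly as follows: resample each configuration's benchmark scores with replacement to get a confidence interval on its mean, so that configs can be compared beyond a single point estimate. This is a minimal illustration, not the paper's implementation; the scores and configuration names are hypothetical.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean benchmark score of one config."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-seed success rates for two hyperparameter configs
config_a = [0.42, 0.45, 0.40, 0.44, 0.43]
config_b = [0.38, 0.41, 0.36, 0.39, 0.37]

ci_a = bootstrap_ci(config_a)
ci_b = bootstrap_ci(config_b)
print(ci_a, ci_b)
```

If the intervals do not overlap, the difference between the two configurations is unlikely to be resampling noise.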

The combined SFT + RL strategy consistently outperforms either approach alone on web-agent benchmarks while requiring substantially less compute than pure SFT, achieving parity with proprietary systems. By Vattikonda, Ravichandran, Penaloza, Drouin, Caccia et al.

Paper

arXiv: 2507.04103

agents · reinforcement-learning · research
