Empirical investigation into optimal compute distribution for web-agent post-training. Combines supervised fine-tuning of a Llama 3.1 8B student using a Llama 3.3 70B teacher, followed by on-policy reinforcement learning. Evaluates 1,370 hyperparameter configurations with statistical bootstrapping to identify effective settings.
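The bootstrapping step can be sketched roughly as follows: resample each configuration's benchmark scores with replacement to get a confidence interval on its mean, so that configs can be compared beyond a single point estimate. This is a minimal illustration, not the paper's implementation; the scores and configuration names are hypothetical.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean benchmark score of one config."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-seed success rates for two hyperparameter configs
config_a = [0.42, 0.45, 0.40, 0.44, 0.43]
config_b = [0.38, 0.41, 0.36, 0.39, 0.37]

ci_a = bootstrap_ci(config_a)
ci_b = bootstrap_ci(config_b)
print(ci_a, ci_b)
```

If the intervals do not overlap, the difference between the two configurations is unlikely to be resampling noise.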

The combined SFT + RL strategy consistently outperforms either approach alone on web-agent benchmarks while requiring substantially less compute than pure SFT, achieving parity with proprietary systems. By Vattikonda, Ravichandran, Penaloza, Drouin, Caccia et al.

Paper

arXiv: 2507.04103

agents · reinforcement-learning · research
