Post-training method for aligning flow-matching generative models with human preferences at any generation step count. Compresses long ODE trajectories into two-step shortcuts via strategic leaps, enabling direct gradient propagation from reward signals to early generation steps, a bottleneck for existing methods such as GRPO that optimize only final outputs.
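
A minimal sketch of the core mechanism, under stated assumptions: `velocity(x, t)` stands in for the learned flow-matching velocity field (not the paper's API), the prefix of the ODE rollout is kept out of autograd, and the two-step shortcut covers the remaining time so a differentiable reward backpropagates through just two velocity evaluations.

```python
# Minimal sketch, not the authors' implementation. Assumes `velocity(x, t)`
# is any callable mapping a batch `x` and scalar time `t` to a velocity
# tensor of the same shape as `x`.
import torch

@torch.no_grad()
def rollout(velocity, x, t_start, t_end, dt):
    """Plain Euler integration of dx/dt = v(x, t), kept out of autograd."""
    t = t_start
    while t < t_end - 1e-8:
        x = x + velocity(x, t) * dt
        t += dt
    return x

def two_step_shortcut(velocity, x_t, t):
    """Leap from (x_t, t) to t = 1 in two large steps. Both velocity
    evaluations stay inside the autograd graph, so a reward computed on
    the output reaches the step-t prediction directly."""
    half = (1.0 - t) / 2
    x_mid = x_t + velocity(x_t, t) * half
    return x_mid + velocity(x_mid, t + half) * half
```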

Randomized timestep selection stabilizes updates across the full generation schedule; weighted training favors trajectories most consistent with full-length generation paths. Outperforms GRPO and direct-gradient baselines on the Flux model. By Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng (ByteDance Seed), and Liang Zheng (ANU).
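
Continuing the sketch above, one plausible training step combining the randomized start step with a consistency weight. The uniform draw over the schedule and the exponential weight on endpoint distance are illustrative choices, not the paper's formulas; `reward` is a hypothetical differentiable reward model returning one scalar per sample.

```python
# Illustrative training step, reusing `rollout` and `two_step_shortcut`
# from the previous sketch. The weighting rule is an assumption.
import torch

def train_step(velocity, reward, noise, opt, n_steps=32):
    dt = 1.0 / n_steps
    # Randomized timestep selection: the shortcut may start anywhere
    # in the schedule, so every step receives reward gradients.
    k = int(torch.randint(0, n_steps, (1,)))
    t = k * dt
    x_t = rollout(velocity, noise, 0.0, t, dt)   # gradient-free prefix
    x_end = two_step_shortcut(velocity, x_t, t)  # differentiable leap

    # Full-length reference path from the same state, gradient-free.
    x_ref = rollout(velocity, x_t, t, 1.0, dt)

    # Weight samples by agreement between shortcut and full-length
    # endpoints, favoring trajectories consistent with the full path.
    dist = ((x_end.detach() - x_ref) ** 2).flatten(1).mean(dim=1)
    w = torch.exp(-dist)

    loss = -(w * reward(x_end)).mean()           # reward ascent
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Detaching the weight keeps it a pure importance factor: gradients flow only through the reward term, while shortcuts that diverge from the full-length path are down-weighted.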


Venue CVPR 2026
foundational, generation