Speculative Decoding
The paper "Fast Inference from Transformers via Speculative Decoding" uses a small, fast draft model to propose several tokens that the large target model then verifies in a single parallel forward pass, achieving a 2–3x speedup while producing outputs whose distribution is identical to sampling from the large model alone.
Now a standard inference optimization used by virtually every LLM serving system (vLLM, TGI, TensorRT-LLM). Published at ICML 2023 by Leviathan, Kalman, and Matias.
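The verification step can be sketched with the paper's acceptance rule: each drafted token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities for that token; on the first rejection, a replacement token is resampled from the normalized residual max(0, p − q). The sketch below is a minimal toy illustration, not the authors' code — `speculative_step`, the Dirichlet-sampled distributions, and the vocabulary size are all illustrative assumptions standing in for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(draft_probs, target_probs, draft_tokens):
    """One verification pass over k drafted tokens, left to right.

    draft_probs[i], target_probs[i]: vocab distributions at position i
    (toy stand-ins for model logits after softmax). target_probs has one
    extra entry because the target's parallel pass also scores position k.
    Returns the accepted prefix plus one token from the target/residual.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):  # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted  # stop at the first rejection
    # All k drafts accepted: take one bonus token from the target model,
    # which already computed the distribution at position k in the same pass.
    accepted.append(int(rng.choice(len(target_probs[-1]), p=target_probs[-1])))
    return accepted

# Toy usage: vocabulary of 4 tokens, draft proposes k = 3 tokens.
vocab, k = 4, 3
draft_probs = rng.dirichlet(np.ones(vocab), size=k)
target_probs = rng.dirichlet(np.ones(vocab), size=k + 1)
draft_tokens = [int(rng.choice(vocab, p=q)) for q in draft_probs]
out = speculative_step(draft_probs, target_probs, draft_tokens)
```

Each step therefore emits between 1 and k + 1 tokens for a single large-model forward pass, which is where the speedup comes from; the accept/resample rule is what makes the output distribution provably match the target model's.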
Paper
arXiv: 2211.17192
Venue: ICML 2023