TMA-Adaptive FP8 Grouped GEMM
Eliminates padding requirements in FP8 grouped GEMM on NVIDIA Hopper GPUs using a TMA descriptor pool approach. Achieves a 1.7-20.4% speedup with up to 23.8% memory reduction compared to padded implementations, improving efficiency for MoE training and inference.
Paper
arXiv: 2508.16584
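As a rough illustration of where the memory savings come from (the expert token counts, hidden size, and 128-row alignment below are hypothetical, not figures from the paper): padded grouped-GEMM implementations round each expert's token count up to a tile-aligned multiple, so activation buffers carry rows that hold no data.

```python
# Illustrative sketch only: hypothetical per-expert token counts and an
# assumed 128-row alignment requirement, showing how padding inflates
# grouped-GEMM activation memory. Not code from the paper.
ALIGN = 128                      # assumed tile/TMA alignment requirement
group_sizes = [37, 200, 5, 310]  # hypothetical tokens routed to each expert

def padded(n: int, align: int = ALIGN) -> int:
    # Round n up to the next multiple of align.
    return (n + align - 1) // align * align

rows = sum(group_sizes)                        # 552 real rows
rows_padded = sum(padded(n) for n in group_sizes)  # 896 allocated rows
waste = 1 - rows / rows_padded
print(f"real rows: {rows}, padded rows: {rows_padded}, wasted: {waste:.1%}")
```

With these made-up sizes, over a third of the allocated rows are padding; the paper's descriptor-pool approach avoids this by building TMA descriptors that match each group's actual extent instead of forcing alignment via padding.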