A multimodal diffusion model with representation alignment for generating high-fidelity Foley audio synchronized with video.

Paper

arXiv: 2508.16930

audio video generation