"A foundation model of vision, audition, and language for in-silico neuroscience." Tri-modal (video + audio + text) brain-encoding foundation model that predicts high-resolution fMRI responses for novel stimuli, tasks, and subjects. Trained on 1,000+ hours of fMRI across 720 subjects; reports several-fold accuracy improvements over linear encoding baselines.
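For readers unfamiliar with the term, a linear encoding baseline typically means a regularized linear map from stimulus features to per-voxel fMRI responses, scored by correlation on held-out data. The sketch below is a generic illustration on synthetic data, not the paper's pipeline: the function names, array shapes, and the `alpha` value are all assumptions made for this example.

```python
import numpy as np

def fit_ridge_encoder(features, responses, alpha=1.0):
    """Closed-form ridge regression from features (n_samples, n_features)
    to responses (n_samples, n_voxels). Returns weights (n_features, n_voxels)."""
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ responses)

def voxelwise_pearson(predicted, observed):
    """Pearson correlation between prediction and observation, one value per voxel."""
    pz = (predicted - predicted.mean(axis=0)) / predicted.std(axis=0)
    oz = (observed - observed.mean(axis=0)) / observed.std(axis=0)
    return (pz * oz).mean(axis=0)

# Synthetic stand-ins (hypothetical): random "stimulus features" and
# noisy linear "voxel responses". Real data would replace these arrays.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 400, 100, 16, 50
true_weights = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ true_weights + rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ true_weights + rng.normal(size=(n_test, n_voxels))

# Fit on the training split, then score each voxel on the held-out split.
weights = fit_ridge_encoder(X_train, Y_train, alpha=10.0)
scores = voxelwise_pearson(X_test @ weights, Y_test)
print(f"mean held-out correlation: {scores.mean():.2f}")
```

In practice the regularization strength is usually chosen by cross-validation, often per voxel. The fixed value here only keeps the sketch short.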

Recovers established findings from classic visual and neuro-linguistic experiments and reveals the fine-grained topography of multisensory integration, positioning AI as a unifying framework for understanding brain organization. Another scientific foundation model in the same vein as Meta's Sapiens2 human-vision family. By d'Ascoli, Rapin, Benchetrit, Brooks, Begany, Raugel, Banville, and Jean-Rémi King (Meta FAIR).

Authors: Stéphane d'Ascoli · Jérémy Rapin · Yohann Benchetrit · Teon Brooks · Katelyn Begany · Joséphine Raugel · Hubert Banville · Jean-Rémi King
science · multimodal · foundational · research