"Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers." Introduces an asymmetric Mixture-of-Transformers architecture for Vision-Language-Action (VLA) models that enables generalist vision-language models to perform embodied tasks effectively.

Part of ZGCI's embodied intelligence research direction, alongside BayesianVLA and the RLinf training framework.
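The core architectural idea, modality-specific transformer "experts" of different sizes sharing a global attention layer, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the `widths` parameter, the single-head attention, and the choice of a wide vision-language expert next to a narrow action expert are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AsymmetricMoTBlock:
    """Toy transformer block: one self-attention layer shared across all
    tokens, plus per-modality feed-forward experts of different widths
    (asymmetric) -- e.g. a wide vision-language expert and a narrow
    action expert. Hypothetical sketch, not the paper's architecture."""

    def __init__(self, d_model, widths):
        # widths: dict mapping modality name -> FFN hidden width
        s = 1.0 / np.sqrt(d_model)
        self.wq = rng.standard_normal((d_model, d_model)) * s
        self.wk = rng.standard_normal((d_model, d_model)) * s
        self.wv = rng.standard_normal((d_model, d_model)) * s
        self.ffn = {
            m: (rng.standard_normal((d_model, w)) * s,
                rng.standard_normal((w, d_model)) / np.sqrt(w))
            for m, w in widths.items()
        }

    def __call__(self, x, modality):
        # Shared attention: action tokens attend to vision-language
        # tokens (and vice versa), so context flows across modalities.
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        h = x + attn @ v
        # Asymmetric routing: each token passes through the FFN expert
        # for its own modality; experts differ in parameter count.
        out = h.copy()
        for m, (w1, w2) in self.ffn.items():
            idx = modality == m
            out[idx] = h[idx] + np.maximum(h[idx] @ w1, 0.0) @ w2
        return out

d = 32
block = AsymmetricMoTBlock(d, widths={"vl": 128, "action": 32})
tokens = rng.standard_normal((10, d))
modality = np.array(["vl"] * 7 + ["action"] * 3)
y = block(tokens, modality)
print(y.shape)  # (10, 32)
```

The asymmetry is what lets a large pretrained VLM backbone stay intact (the wide expert) while a much smaller expert handles action tokens, keeping the added embodied-control capacity cheap relative to the generalist model.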

vision, multimodal, agents, research