"Zero-Shot Text-to-Image Generation" — a 12B parameter autoregressive Transformer that generates images from text descriptions. Uses a discrete variational autoencoder (dVAE) to compress images into 32×32 grids of visual tokens, then models text and image tokens jointly.

DALL-E pioneered text-to-image generation at scale, demonstrating that Transformers could generate coherent, creative images from natural language prompts, ranging from "an armchair in the shape of an avocado" to photorealistic scenes. Published at ICML 2021 by Ramesh, Pavlov, Goh, Gray et al. Proprietary.
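
To make the joint text-and-image token stream from the opening description concrete, here is a minimal sketch of how such a sequence could be assembled. Only the 32×32 grid size comes from the description above. The vocabulary sizes (16,384 text tokens, 8,192 image tokens) are assumptions taken from the paper's reported setup, and `build_sequence` together with its id-offset convention is a hypothetical helper for illustration, not the authors' code.

```python
GRID = 32            # from the description: the dVAE emits a 32x32 grid of visual tokens
TEXT_VOCAB = 16384   # assumption: text (BPE) vocabulary size reported in the paper
IMAGE_VOCAB = 8192   # assumption: dVAE codebook size reported in the paper


def build_sequence(text_ids, image_grid):
    """Flatten the image token grid and append it to the text tokens (hypothetical helper)."""
    if len(image_grid) != GRID or any(len(row) != GRID for row in image_grid):
        raise ValueError(f"expected a {GRID}x{GRID} grid of image tokens")
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text token id out of range")
    # Raster order: read the grid row by row, left to right.
    flat = [tok for row in image_grid for tok in row]
    if any(not 0 <= v < IMAGE_VOCAB for v in flat):
        raise ValueError("image token id out of range")
    # Illustrative convention: shift image ids past the text ids so both
    # modalities can share one vocabulary without collisions.
    return list(text_ids) + [v + TEXT_VOCAB for v in flat]


text_ids = [17, 905, 4321]  # made-up ids standing in for a tokenized prompt
image_grid = [[r * GRID + c for c in range(GRID)] for r in range(GRID)]
seq = build_sequence(text_ids, image_grid)
print(len(seq))  # 1027 = 3 text tokens + 32 * 32 image tokens
```

An autoregressive model trained on sequences like this predicts each token from the ones before it, so at generation time the text tokens act as the prompt and the 1,024 image tokens are sampled one at a time before being decoded back to pixels by the dVAE.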

Model Details

Architecture: Dense
Parameters: 12B

Paper

Venue: ICML 2021
vision, foundational, multimodal

Related