DALL-E
Model from "Zero-Shot Text-to-Image Generation" — a 12B-parameter autoregressive Transformer that generates images from text descriptions. It uses a discrete variational autoencoder (dVAE) to compress images into 32×32 grids of visual tokens, then models text and image tokens jointly as a single sequence.
DALL-E pioneered text-to-image generation at scale, demonstrating that Transformers could generate coherent, creative images from natural language prompts — from "an armchair in the shape of an avocado" to photorealistic scenes. ICML 2021. By Ramesh, Pavlov, Goh, Gray et al. Proprietary.
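The joint text-and-image token stream can be illustrated with a short sketch. The constants below follow the paper (up to 256 BPE text tokens, a 16,384-entry text vocabulary, and a 32×32 grid of image tokens drawn from an 8,192-entry dVAE codebook); the function name, the pad-token choice, and the id-offset scheme are illustrative assumptions, not the paper's exact implementation.

```python
TEXT_VOCAB = 16384       # BPE text vocabulary size (per the paper)
IMAGE_VOCAB = 8192       # dVAE codebook size (per the paper)
MAX_TEXT_TOKENS = 256    # text tokens are padded/truncated to this length
IMAGE_GRID = 32          # 32×32 grid of discrete image tokens

def build_sequence(text_tokens, image_tokens):
    """Concatenate text and image tokens into one autoregressive stream.

    Image token ids are offset by TEXT_VOCAB so the two vocabularies
    occupy disjoint id ranges (an illustrative choice; the paper uses
    separate embedding tables instead).
    """
    assert len(text_tokens) <= MAX_TEXT_TOKENS
    assert len(image_tokens) == IMAGE_GRID * IMAGE_GRID
    pad_id = TEXT_VOCAB - 1  # hypothetical pad token id
    padded = list(text_tokens) + [pad_id] * (MAX_TEXT_TOKENS - len(text_tokens))
    return padded + [TEXT_VOCAB + t for t in image_tokens]

seq = build_sequence(text_tokens=[5, 17, 99], image_tokens=[0] * 1024)
print(len(seq))  # 256 text slots + 1024 image slots = 1280
```

At generation time the Transformer is conditioned on the text portion and samples the 1,024 image tokens autoregressively; the dVAE decoder then maps the 32×32 token grid back to pixels.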
Model Details
Architecture DENSE
Parameters 12B
Paper
arXiv: 2102.12092
Venue: ICML 2021