The Vision Transformer (ViT) showed that the attention mechanism behind language models also works for images, without convolutions. By cutting an image into fixed-size patches and treating each patch as a token, ViT lets one architecture model both text and vision, which is a large part of why it underpins today's multimodal systems.
Working principle
An image is divided into non-overlapping patches (e.g. 16×16 px). Each patch is flattened and linearly projected into a patch embedding. A learned positional embedding is added to each patch embedding to encode its location, and a special [CLS] token is prepended to the sequence to aggregate global information. The token sequence then passes through transformer encoder blocks, where multi-head self-attention lets every patch attend to every other patch, capturing long-range relationships that a CNN's local receptive field reaches only through deep stacks of layers.
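The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the weights are random rather than learned, the embedding width `D = 64` is chosen only to keep the example small (ViT-Base uses 768), attention is single-head, and the function names `patchify`, `embed` and `self_attention` are made up for this sketch.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def embed(image, patch_size, w_proj, cls_token, pos_embed):
    """Project patches linearly, prepend [CLS], add positional embeddings."""
    patches = patchify(image, patch_size)                 # (N, P*P*C)
    tokens = patches @ w_proj                             # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (N+1, D)
    return tokens + pos_embed

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])               # (N+1, N+1)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Assumed setup: a random 224x224 RGB "image" and random (untrained) weights.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
P, D = 16, 64
n_patches = (224 // P) ** 2                               # 14 * 14 = 196
w_proj = rng.normal(0, 0.02, (P * P * 3, D))
cls_token = rng.normal(0, 0.02, D)
pos_embed = rng.normal(0, 0.02, (n_patches + 1, D))

tokens = embed(image, P, w_proj, cls_token, pos_embed)
w_q, w_k, w_v = (rng.normal(0, 0.02, (D, D)) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, attn.shape)
```

A 224×224 image with 16×16 patches yields 196 patch tokens plus one [CLS] token, so the attention matrix is 197×197: every token has a weight for every other token in the very first layer.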
Bridging vision and language
Multimodal models pair a ViT image encoder with a text encoder. CLIP trains both with a contrastive objective so that matching image–text pairs sit close together in a shared embedding space. Vision-language models go a step further and feed ViT patch tokens into an LLM, typically through a small projection layer that maps them into the LLM's embedding space, letting it 'see'.
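To make the contrastive objective concrete, here is a rough NumPy sketch of a CLIP-style symmetric loss. It assumes the two encoders have already produced one embedding per image and per caption; the embeddings below are random stand-ins, the temperature is fixed at 0.07 for simplicity (CLIP learns it during training), and `contrastive_loss` is a name chosen for this example.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits):
    """Row-wise log-softmax with the usual max-shift for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature     # (B, B); the diagonal holds true pairs
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits)[idx, idx].mean()     # image -> text
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()   # text -> image
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings: "images" that lie close to their own captions.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 32))
aligned = text_emb + 0.1 * rng.normal(size=(8, 32))
shuffled = aligned[::-1]                   # every pair deliberately mismatched

print(contrastive_loss(aligned, text_emb))
print(contrastive_loss(shuffled, text_emb))
```

The loss for the aligned batch should come out far lower than for the shuffled one, which is the signal that pulls matching pairs together and pushes mismatched pairs apart during training.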
| Property | CNN | ViT |
|---|---|---|
| Core operation | Local convolution | Global self-attention |
| Inductive bias | Strong (locality) | Weak (learns from data) |
| Data appetite | Moderate | Large (or strong pretraining) |
| Long-range context | Indirect (deep stacks) | Direct, from layer 1 |
| Multimodal fit | Needs adaptation | Natural token interface |
Key insight: ViTs are data-hungry because they lack the locality bias of CNNs. Large-scale pretraining (or hybrid conv-stem designs) is essential for them to match or beat CNNs.
Applications
- Image–text retrieval and zero-shot classification (CLIP-style)
- Vision-language assistants that answer questions about images
- Medical imaging, remote sensing and document understanding
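As an illustration of the first application, zero-shot classification with a CLIP-style model amounts to comparing one image embedding against the text embeddings of a prompt per class. The sketch below assumes those embeddings already exist; here they are random stand-ins rather than outputs of real encoders, and `zero_shot_classify` and the temperature value are hypothetical choices for this example.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, labels, temperature=0.01):
    """Pick the label whose prompt embedding is most similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=-1, keepdims=True)
    sims = txt @ img                        # cosine similarity per class
    logits = sims / temperature
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return labels[int(np.argmax(probs))], probs

# Stand-in data: in practice each row would be the text encoder's embedding
# of a prompt such as "a photo of a dog", and image_emb the image encoder's
# embedding of the query image.
labels = ["cat", "dog", "car"]
rng = np.random.default_rng(0)
class_text_embs = rng.normal(size=(3, 32))
image_emb = class_text_embs[1] + 0.1 * rng.normal(size=32)  # resembles "dog"

predicted, probs = zero_shot_classify(image_emb, class_text_embs, labels)
print(predicted, probs.round(3))
```

No classifier head is trained here: adding a class only requires embedding one more text prompt. Image–text retrieval uses the same similarity scores, ranked over a gallery instead of a label set.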
References & further reading
- Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.
- Radford et al., “Learning Transferable Visual Models From Natural Language Supervision” (CLIP), ICML 2021.
- Vaswani et al., “Attention Is All You Need,” NeurIPS 2017.