CSE · Seminar 06 · Treating images as sequences of patches

Vision Transformers (ViT) in Multimodal AI

Vision Transformers split an image into patch tokens and process them with self-attention, providing the unified backbone that powers modern image–text multimodal models.

Tags: ViT, self-attention, CLIP, multimodal, patch embedding

The Vision Transformer (ViT) showed that the attention mechanism behind language models also works for images, without convolutions. By cutting an image into fixed-size patches and treating each as a token, ViT lets the same architecture model both text and vision, which is why it underpins today's multimodal systems.

Working principle

An image is divided into non-overlapping patches (e.g. 16×16 px). Each patch is flattened and linearly projected into a patch embedding. A learned positional embedding is added to each token to encode its location, and a special [CLS] token is prepended to aggregate global information. The token sequence then passes through transformer encoder blocks, where multi-head self-attention lets every patch attend to every other patch, capturing long-range relationships that a CNN's local receptive field reaches only after many stacked layers.
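The tokenization step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`patchify`, `linear`, `embed`), the toy sizes, and the random weights are all hypothetical stand-ins for parameters that a real ViT learns during training, and a practical implementation would use a tensor library.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # all channels of this pixel
            patches.append(flat)
    return patches

def linear(vector, weights):
    """Project a vector with a (len(vector) x dim) weight matrix."""
    dim = len(weights[0])
    return [sum(v * w_row[j] for v, w_row in zip(vector, weights)) for j in range(dim)]

def embed(image, patch, weights, cls_token, pos_embed):
    """Build the encoder input: [CLS] + patch embeddings, with positions added."""
    tokens = [cls_token] + [linear(p, weights) for p in patchify(image, patch)]
    assert len(tokens) == len(pos_embed)
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

# Toy sizes (assumptions): a 32x32 RGB image, 16x16 patches, embedding width 8.
rng = random.Random(0)
H, W, C = 32, 32, 3
P, D = 16, 8
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
num_patches = (H // P) * (W // P)

# Random stand-ins for parameters that would normally be learned.
weights = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
cls_token = [0.0] * D
pos_embed = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(num_patches + 1)]

tokens = embed(image, P, weights, cls_token, pos_embed)
print(len(tokens), len(tokens[0]))  # 5 8: four patch tokens plus [CLS], each of width 8
```

With these sizes the image yields (32/16)² = 4 patches, so the encoder sees a sequence of 5 tokens once [CLS] is prepended.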

[Figure: ViT pipeline, an image as a sequence of patch tokens. (1) Image → (2) Split into patches → (3) Patch + position embed → (4) Transformer encoder → (5) [CLS] → head]
Figure 1. ViT processing path. Global self-attention over patches replaces the local convolutions of a CNN.
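To make the global self-attention step concrete, here is a minimal single-head sketch in plain Python. It is an illustration only: a ViT encoder block runs several such heads in parallel and adds residual connections, normalization, and an MLP, none of which are shown. The helper names and the identity projection matrices are assumptions chosen to keep the example small.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Three toy tokens of width 2; identity projections are a simplifying assumption.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(tokens, identity, identity, identity)

for row in attn:
    print([round(w, 3) for w in row])  # each row is a distribution over all tokens
```

Every entry of `attn` is strictly positive, which is the point of the comparison with convolutions: each token receives information from every other token in a single layer.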

Bridging vision and language

Multimodal models pair a ViT image encoder with a text encoder. CLIP trains both with a contrastive objective so that matching image–text pairs sit close together in a shared embedding space. Vision-language models go a step further and feed ViT patch tokens into an LLM, usually after projecting them into the LLM's embedding space, letting it 'see'.
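A contrastive objective of this kind can be sketched as a symmetric cross-entropy over an image–text similarity matrix. The code below is a minimal illustration in the spirit of CLIP, not its actual implementation: the function names are hypothetical, the embeddings are hand-written toy vectors rather than encoder outputs, and the temperature is held fixed here for simplicity.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i are assumed to be a matching pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    n = len(logits)
    # Image-to-text: each row should pick out its own column.
    img_to_txt = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick out its own row.
    txt_to_img = sum(cross_entropy([logits[r][j] for r in range(n)], j) for j in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

# Toy embeddings (assumptions): two pairs that are either aligned or swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(aligned, aligned))   # near zero: matching pairs agree
print(contrastive_loss(aligned, shuffled))  # large: matching pairs disagree
```

Minimizing this loss pulls each matching pair together and pushes it away from every other pairing in the batch, which is what produces the shared embedding space.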

Table 1. CNN vs. Vision Transformer
Property           | CNN                    | ViT
Core operation     | Local convolution      | Global self-attention
Inductive bias     | Strong (locality)      | Weak (learns from data)
Data appetite      | Moderate               | Large (or strong pretraining)
Long-range context | Indirect (deep stacks) | Direct, from layer 1
Multimodal fit     | Needs adaptation       | Natural token interface

Key insight: ViTs are data-hungry because they lack the locality bias of CNNs. Large-scale pretraining (or hybrid conv-stem designs) is essential for them to match or beat CNNs.

Applications

  • Image–text retrieval and zero-shot classification (CLIP-style)
  • Vision-language assistants that answer questions about images
  • Medical imaging, remote sensing and document understanding
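
As an illustration of the first application, CLIP-style zero-shot classification can be sketched as a nearest-neighbor lookup in the shared embedding space: embed one text prompt per candidate label, then pick the label most similar to the image embedding. The vectors below are hypothetical hand-written values standing in for real encoder outputs, and the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, label_embs):
    """Return the best-matching label and the similarity score for every label."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical embeddings; a real system would obtain these from the text and image encoders.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

best, scores = zero_shot_classify(image_emb, label_embs)
print(best)  # a photo of a cat
```

No classifier is trained for these labels. Changing the label set only requires new text prompts, which is why this setting is called zero-shot.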

References & further reading

  1. Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.
  2. Radford et al., “Learning Transferable Visual Models from Natural Language Supervision (CLIP),” ICML 2021.
  3. Vaswani et al., “Attention Is All You Need,” NeurIPS 2017.