Vision Transformer

Alexey Dosovitskiy et al., 2020

Self-attention complexity: O(n²·d), where n is the number of tokens and d is the embedding dimension

Proposed by Dosovitskiy et al. in 2020, the Vision Transformer (ViT) splits an image into fixed-size, non-overlapping patches, flattens each patch and linearly projects it into a token embedding, adds positional embeddings, and processes the resulting sequence through standard Transformer encoder layers with multi-head self-attention. A learnable classification (CLS) token is prepended to the sequence; its final representation aggregates information from all patches and is passed to a classification head to produce the prediction. The visualization shows the image being split into patches, the patches being flattened and embedded as tokens, the addition of position embeddings, the multi-head self-attention connections, and the final classification output.
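
The token pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names (`patchify`, `embed_tokens`, `self_attention`), the 32×32 image, the patch size of 8, the embedding width of 16, and the random weights are all assumptions chosen for readability, and a single attention head stands in for multi-head attention. Layer normalization, the MLP block, and the classification head are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_tokens(image, patch_size, w_proj, cls_token, pos_embed):
    """Project patches to tokens, prepend the CLS token, add position embeddings."""
    patches = patchify(image, patch_size)                  # (n_patches, p*p*c)
    tokens = patches @ w_proj                              # (n_patches, d)
    tokens = np.concatenate([cls_token, tokens], axis=0)   # (n_patches + 1, d)
    return tokens + pos_embed

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])                # (n, n) score matrix
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ v, weights

# Hypothetical sizes and randomly initialized parameters, for illustration only.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch_size, d = 8, 16
n_patches = (32 // patch_size) ** 2

w_proj = rng.normal(scale=0.02, size=(patch_size * patch_size * 3, d))
cls_token = rng.normal(scale=0.02, size=(1, d))
pos_embed = rng.normal(scale=0.02, size=(n_patches + 1, d))
w_q, w_k, w_v = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))

tokens = embed_tokens(image, patch_size, w_proj, cls_token, pos_embed)
out, attn = self_attention(tokens, w_q, w_k, w_v)
cls_output = out[0]  # the CLS position, which a classification head would consume
```

The `scores` matrix has one entry per pair of tokens, so computing it costs on the order of n²·d operations for n tokens of width d. Halving the patch size quadruples n and therefore multiplies that cost by roughly sixteen.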