GPT (Autoregressive Transformer)

Alec Radford et al. (OpenAI), 2018

O(n²·d) per self-attention layer, for sequence length n and model dimension d

GPT (Generative Pre-trained Transformer) by OpenAI uses causal self-attention for autoregressive language modeling. The key constraint is a lower-triangular attention mask: each token can attend only to itself and earlier tokens, never to later ones. This enables left-to-right generation, in which each new token is predicted from all preceding context. The visualization shows the causal mask as a lower-triangular heatmap with the upper triangle blocked, and tokens generated one at a time with attention arrows pointing only leftward.
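
The masking rule can be illustrated with a minimal sketch in plain Python. This is not OpenAI's implementation: the function names `causal_mask` and `causal_attention_weights` and the `scores` input are hypothetical, chosen only for illustration, and the sketch covers just the masked softmax step for a single attention head.

```python
import math


def causal_mask(n):
    """Return an n x n mask: True where query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def causal_attention_weights(scores):
    """Row-wise softmax over raw attention scores with future positions masked.

    scores[i][j] is a hypothetical unnormalized score of query i against key j.
    """
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Keep only the allowed positions (j <= i); these are the first i + 1 keys.
        allowed = [scores[i][j] for j in range(n) if mask[i][j]]
        peak = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in allowed]
        total = sum(exps)
        # Future positions receive a weight of exactly zero.
        weights.append([e / total for e in exps] + [0.0] * (n - len(allowed)))
    return weights
```

With uniform scores for three tokens, the first token attends only to itself, the second splits its attention evenly between the first two positions, and the third spreads it over all three, which matches the lower-triangular pattern in the heatmap.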