BERT (Bidirectional Encoder)

Jacob Devlin et al., 2018

Self-attention cost per layer: O(n²·d), for sequence length n and hidden size d

Introduced by Devlin et al. at Google in 2018, BERT (Bidirectional Encoder Representations from Transformers) pre-trains a deep bidirectional Transformer encoder with masked language modeling: a random subset of input tokens (15% in the original paper) is hidden, and the model predicts each one from its context. Unlike autoregressive models, which let a token attend only to earlier positions, BERT's attention is fully bidirectional: every token attends to every other token. The visualization shows a full (non-triangular) attention heatmap and the masked language modeling task, where the model predicts the hidden word from the surrounding context.
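
The two ideas in the description, a full rather than triangular attention pattern and the masking of input tokens, can be sketched in a few lines of plain Python. This is an illustrative toy, not BERT's actual implementation: the function names `attention_mask` and `mask_tokens` are hypothetical, the tokens are whole words rather than WordPiece units, and every selected token is replaced by `[MASK]` (the original paper replaces only 80% of selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged).

```python
import random

MASK = "[MASK]"

def attention_mask(n, bidirectional=True):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 if token i may attend to token j.

    bidirectional=True gives the full matrix used by an encoder like BERT;
    bidirectional=False gives the lower-triangular (causal) matrix used by
    autoregressive models.
    """
    return [[1 if (bidirectional or j <= i) else 0 for j in range(n)]
            for i in range(n)]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide each token independently with probability mask_prob.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model would have to predict.
    Simplification: selected tokens are always replaced by [MASK].
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

if __name__ == "__main__":
    # Full (BERT-style) vs. triangular (autoregressive) attention pattern.
    for row in attention_mask(4, bidirectional=True):
        print(row)
    print()
    for row in attention_mask(4, bidirectional=False):
        print(row)
    print()

    # Masked language modeling input: the model sees `masked` and is
    # trained to recover the entries of `targets`.
    sentence = "the cat sat on the mat".split()
    masked, targets = mask_tokens(sentence, mask_prob=0.3)
    print(masked)
    print(targets)
```

The full matrix has n² nonzero entries, which is where the quadratic term in the attention cost comes from; the causal matrix keeps only the entries on or below the diagonal.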