Wing

Neural Networks & Deep Learning

1950s

Perceptron

Frank Rosenblatt, 1958

The simplest neural network: a single neuron that learns a linear decision boundary by adjusting weights based on classification errors.
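
The error-driven weight update can be sketched in a few lines. This is a minimal illustration rather than Rosenblatt's original formulation: the AND-gate data, the learning rate, and the function names are assumptions chosen for the example.

```python
def predict(w, b, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Perceptron rule: weights change only when a sample is misclassified."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(w, b, x)  # -1, 0, or +1
            if error != 0:
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly
            break
    return w, b


# Toy data (assumption): the logical AND of two inputs, which is linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 0, 0, 1]
w, b = train_perceptron(X, Y)
print([predict(w, b, x) for x in X])
```

Because AND is linearly separable, the loop stops once a full pass produces no mistakes.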

1980s

Backpropagation (MLP)

David Rumelhart, Geoffrey Hinton & Ronald Williams, 1986

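Trains a multi-layer perceptron by propagating the output error backward through the network with the chain rule, which yields the gradient of the loss with respect to every weight; the weights are then adjusted by gradient descent to reduce the loss.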
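
A hand-written backward pass for a tiny network makes the idea concrete. The sketch below assumes sigmoid units, a squared-error loss, and made-up starting weights; none of these choices come from the text above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Two-layer network: sigmoid hidden units feeding one sigmoid output."""
    w1, b1, w2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, y


def loss(params, x, target):
    return 0.5 * (forward(params, x)[1] - target) ** 2


def backprop(params, x, target):
    """Gradients of the squared error, computed from the output layer backward."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - target) * y * (1.0 - y)  # error signal at the output unit
    grad_w2 = [delta_out * hi for hi in h]
    # Each hidden unit receives the output error scaled by its outgoing weight.
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(w2, h)]
    grad_w1 = [[d * xi for xi in x] for d in delta_hid]
    return grad_w1, delta_hid, grad_w2, delta_out


def sgd_step(params, grads, lr=0.5):
    """One gradient-descent update on every weight and bias."""
    w1, b1, w2, b2 = params
    g_w1, g_b1, g_w2, g_b2 = grads
    return (
        [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(w1, g_w1)],
        [b - lr * g for b, g in zip(b1, g_b1)],
        [w - lr * g for w, g in zip(w2, g_w2)],
        b2 - lr * g_b2,
    )


# Hypothetical starting weights and a single training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, target = [1.0, 0.0], 1.0
before = loss(params, x, target)
after = loss(sgd_step(params, backprop(params, x, target)), x, target)
print(before, after)
```

A single update should lower the loss on this example; in practice the step is repeated over many examples.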

LeNet (CNN)

Yann LeCun, 1989

A pioneering convolutional neural network that introduced learnable convolution filters and pooling layers for image recognition.
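
The two building blocks can be shown on a toy image. In this sketch the filter is fixed by hand to detect a vertical edge, whereas in a trained network the filter values are learned; the image, the filter, and the choice of average pooling are all assumptions for illustration.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def avg_pool2x2(fmap):
    """Downsample by averaging each non-overlapping 2x2 block."""
    return [[(fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


# 5x5 toy image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_filter = [[-1, 1], [-1, 1]]  # responds where brightness rises left to right

feature_map = conv2d(image, edge_filter)  # 4x4 response map
pooled = avg_pool2x2(feature_map)         # 2x2 summary
print(feature_map)
print(pooled)
```

The response is nonzero only in the column where the edge sits, and pooling keeps that information at half the resolution.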

1990s

LSTM

Sepp Hochreiter & Jürgen Schmidhuber, 1997

A recurrent neural network architecture that uses gated memory cells to learn long-range dependencies in sequential data.
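
One time step of a gated memory cell can be sketched with scalars. The version below includes a forget gate, which is the now-common variant and postdates the 1997 paper; the parameter names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p holds hypothetical parameters."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # memory cell: keep old content, add new content
    h = o * math.tanh(c)     # hidden state exposed to the rest of the network
    return h, c


# Gates biased to "remember": forget gate near 1, input gate near 0.
params = {"wf": 0.0, "uf": 0.0, "bf": 10.0,
          "wi": 0.0, "ui": 0.0, "bi": -10.0,
          "wo": 1.0, "uo": 0.0, "bo": 0.0,
          "wg": 1.0, "ug": 0.0, "bg": 0.0}
h, c = 0.0, 1.0
for x in [0.5, -1.0, 2.0, 0.3, -0.7] * 4:  # 20 steps of arbitrary input
    h, c = lstm_step(x, h, c, params)
print(c)
```

With these gate settings the cell value stays close to its starting value of 1.0 across all 20 steps, which is the mechanism behind long-range memory.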

2010s

AlexNet

Alex Krizhevsky, Ilya Sutskever & Geoffrey Hinton, 2012

The deep CNN that won the 2012 ImageNet challenge (ILSVRC) and ignited the deep learning revolution, featuring ReLU activations and dropout regularization.
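
Both ingredients are short enough to sketch directly. The dropout shown here is the "inverted" variant common today, which rescales during training; the AlexNet paper instead scaled outputs at test time, so treat this as an illustration rather than the original recipe.

```python
import random


def relu(values):
    """Rectified linear unit: negative inputs become zero."""
    return [max(0.0, v) for v in values]


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale survivors."""
    if not training:
        return list(values)  # at inference time nothing is dropped
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
dropped = dropout(activations, p=0.5, rng=rng)
print(activations)
print(dropped)
print(dropout(activations, training=False))
```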

GAN (Generative Adversarial Network)

Ian Goodfellow et al., 2014

Two neural networks compete: a Generator creates fake samples while a Discriminator tries to distinguish real from fake, driving both to improve.
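
The competition is expressed through two opposing losses. The sketch below takes the Discriminator's output probabilities as given (the networks themselves are omitted) and uses the non-saturating Generator loss; the function names are illustrative.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and fake samples score near 0."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Non-saturating form: low when the Discriminator scores fakes near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical Discriminator outputs: probability that each sample is real.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, undecided)
print(generator_loss([0.1, 0.2]), generator_loss([0.8, 0.9]))
```

A Discriminator that cannot tell the two apart (all outputs 0.5) has loss 2 ln 2, while the Generator's loss falls as its samples fool the Discriminator more often.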

Variational Autoencoder

Diederik Kingma & Max Welling, 2014

A generative model that learns a compressed latent representation by encoding inputs into a distribution (μ, σ) and decoding samples back to data.
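
Two pieces of this recipe can be sketched without any network: drawing a latent sample from the encoded distribution, and the penalty that keeps that distribution near a standard normal. The sketch assumes the encoder outputs μ and log σ² (a common parameterization, not stated above) and leaves the encoder and decoder out.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]  # hypothetical encoder outputs
z = reparameterize(mu, log_var, rng)
print(z)
print(kl_to_standard_normal(mu, log_var))
```

Writing the sample as μ plus scaled noise keeps the randomness outside the parameters, so gradients can flow through μ and σ during training.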

ResNet (Residual Network)

Kaiming He, Xiangyu Zhang, Shaoqing Ren & Jian Sun, 2015

Deep residual learning uses skip connections that bypass layers, allowing gradients to flow directly and enabling training of very deep networks.
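
A skip connection simply adds a block's input to its output. The sketch below uses a hypothetical stand-in for a learned layer with very small weights and compares 50 stacked layers with and without the skip path.

```python
def residual_block(x, layer):
    """y = x + F(x): the identity path skips around the learned layer F."""
    return [xi + fi for xi, fi in zip(x, layer(x))]


def small_layer(x):
    # Hypothetical stand-in for a learned transformation with tiny weights.
    return [0.01 * xi for xi in x]


signal = [1.0, -2.0, 0.5]
plain, residual = signal, signal
for _ in range(50):
    plain = small_layer(plain)                        # no skip connection
    residual = residual_block(residual, small_layer)  # with skip connection
print(plain[0], residual[0])
```

Without the skip path the signal shrinks toward zero after 50 layers; with it the signal survives. The same identity path is what lets gradients pass backward through a deep stack.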

Transformer (Self-Attention)

Ashish Vaswani et al., 2017

The self-attention mechanism computes pairwise attention weights between all tokens, allowing each position to attend to every other position in the sequence.
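
The pairwise weights come from a softmax over scaled dot products. This sketch uses the token vectors directly as queries, keys, and values; a real Transformer first applies learned linear projections, which are omitted here as a simplifying assumption.

```python
import math


def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in tokens]
              for qi in tokens]
    weights = [softmax(row) for row in scores]  # one row of weights per token
    outputs = [[sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
               for row in weights]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token draws from every token in the sequence, itself included.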

Mixture of Experts

Noam Shazeer et al., 2017

A sparse architecture where a gating network routes each input to a small subset of specialized expert networks, so total model capacity can grow massively while only a fraction of the parameters is used for any one input.
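
Top-k routing can be sketched as follows. The experts here are hypothetical scaling functions and the gate is a plain linear layer; the noise term and load-balancing loss used in the 2017 paper are left out.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, k=2):
    """Route x to the k experts with the highest gate scores."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])  # renormalize over chosen experts
    output = [0.0] * len(x)  # assumes experts return vectors of the input size
    for p, i in zip(probs, top):
        y = experts[i](x)  # only the selected experts are evaluated
        output = [o + p * yi for o, yi in zip(output, y)]
    return output, top


calls = []  # records which experts actually ran


def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)
        return [scale * xi for xi in x]
    return expert


# Four hypothetical experts and a hand-picked gate matrix.
experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
output, chosen = moe_forward([2.0, 0.1], gate_weights, experts, k=2)
print(chosen, calls, output)
```

Only two of the four experts run for this input, which is where the compute savings come from.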

BERT (Bidirectional Encoder)

Jacob Devlin et al., 2018

BERT uses bidirectional self-attention to understand context from both directions, predicting masked tokens using the full surrounding context.
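
The masked-token objective starts with corrupting the input. The sketch below is simplified: it always substitutes `[MASK]`, whereas BERT selects about 15% of tokens and replaces 80% of those with `[MASK]`, 10% with a random token, and leaves 10% unchanged. The function name and example sentence are illustrative.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Hide random tokens; return the corrupted sequence and the hidden targets."""
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[pos] = tok  # the model must predict this from both sides
        else:
            masked.append(tok)
    return masked, targets


rng = random.Random(0)
sentence = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, rng=rng)
print(masked)
print(targets)
```

During training, the model sees `masked` in full, both left and right of each gap, and is scored only on the positions stored in `targets`.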

2020s

GPT (Autoregressive Transformer)

Alec Radford et al. (OpenAI), 2018; scaled up as GPT-3 by Tom Brown et al. (OpenAI), 2020

GPT generates tokens left-to-right using a causal (triangular) attention mask, where each token can attend only to itself and to earlier tokens in the sequence.
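
The triangular mask is easy to build and apply. In this sketch the attention scores are made-up numbers; only the masking step is shown.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: position i may look at positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Softmax over each row, giving zero weight to disallowed positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        m = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical raw attention scores for a three-token sequence.
scores = [[0.2, 1.5, -0.3], [0.7, 0.1, 2.0], [1.0, 1.0, 1.0]]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, and every entry above the diagonal is exactly zero, so no position can see a later token.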

Vision Transformer

Alexey Dosovitskiy et al., 2020

Applies the Transformer architecture directly to image patches, treating them as a sequence of tokens for image classification.
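
Turning an image into a token sequence is a matter of cutting it into patches and flattening each one. The sketch below assumes a single-channel image whose sides are divisible by the patch size, and it stops before the learned linear projection and position embeddings that follow in the full model.

```python
def patchify(image, patch):
    """Split a 2D image into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + a][left + b]
                            for a in range(patch) for b in range(patch)])
    return patches


# 4x4 toy image with pixel values 0..15, row by row.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
for token in tokens:
    print(token)
```

Each flattened patch plays the role that a word embedding plays in a text Transformer.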

Diffusion Model (DDPM)

Jonathan Ho, Ajay Jain & Pieter Abbeel, 2020

Generates data by learning to reverse a gradual noising process: starting from pure random noise, the model removes a little noise at each step until a coherent sample emerges.
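
The noising half of the process has a closed form and needs no training, so it can be sketched directly. The linear schedule from 1e-4 to 0.02 over 1000 steps follows the DDPM paper; the function names are illustrative, and the learned reverse (denoising) network is not shown.

```python
import math
import random


def linear_beta_schedule(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to step t: the signal that remains."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng=random):
    """Jump straight to step t: sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * xi + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for xi in x0]


alpha_bars = cumulative_alphas(linear_beta_schedule())
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]  # toy "data" vector
print(q_sample(x0, 0, alpha_bars, rng))    # almost unchanged
print(q_sample(x0, 999, alpha_bars, rng))  # almost pure noise
print(alpha_bars[0], alpha_bars[-1])
```

Training teaches a network to predict the noise added at each step, and generation runs the steps in reverse.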