Wing
Neural Networks & Deep Learning
Backpropagation (MLP)
David Rumelhart, Geoffrey Hinton & Ronald Williams, 1986
Trains a multi-layer perceptron by propagating errors backward through the network, adjusting weights to minimize the loss function.
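As a minimal sketch of the idea, the code below runs one forward and backward pass through a tiny MLP and then repeats plain gradient descent on a single example. The layer sizes, tanh hidden layer, squared-error loss, learning rate, and all starting values are assumptions chosen for illustration, not details from the 1986 paper.

```python
import math


def forward_backward(x, y, W1, b1, W2, b2):
    """One forward and backward pass of a tiny MLP (tanh hidden layer, linear output).

    x is a list of inputs and y a scalar target. Returns the squared-error
    loss and the gradient of that loss for every parameter.
    """
    # Forward pass: hidden activations, then a scalar prediction.
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [math.tanh(v) for v in z]
    y_hat = sum(w * hj for w, hj in zip(W2, h)) + b2
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: apply the chain rule from the output toward the input.
    d_yhat = y_hat - y                                      # dL/dy_hat
    dW2 = [d_yhat * hj for hj in h]
    db2 = d_yhat
    d_h = [d_yhat * w for w in W2]                          # error sent back to the hidden layer
    d_z = [dh * (1.0 - hj ** 2) for dh, hj in zip(d_h, h)]  # tanh'(z) = 1 - tanh(z)^2
    dW1 = [[dz * xi for xi in x] for dz in d_z]
    db1 = list(d_z)
    return loss, (dW1, db1, dW2, db2)


# Hypothetical training example and starting weights.
x, y = [0.5, -1.0], 0.25
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0]
W2, b2 = [0.2, -0.5], 0.0
lr = 0.1

losses = []
for _ in range(100):
    loss, (dW1, db1, dW2, db2) = forward_backward(x, y, W1, b1, W2, b2)
    losses.append(loss)
    # Gradient descent: move every weight a small step against its gradient.
    W1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, dW1)]
    b1 = [b - lr * g for b, g in zip(b1, db1)]
    W2 = [w - lr * g for w, g in zip(W2, dW2)]
    b2 = b2 - lr * db2
```

The list losses records the loss before each update, so comparing its first and last entries shows whether the weight adjustments are reducing the error.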
LeNet (CNN)
Yann LeCun, 1989
A pioneering convolutional neural network that introduced learnable convolution filters and pooling layers for image recognition.
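The two building blocks named above can be sketched in plain Python. This is an illustration only: the kernel is set by hand here (in a trained network its values would be learned), average pooling stands in for the subsampling step, and the function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def avg_pool(feature_map, size=2):
    """Average each non-overlapping size x size window, shrinking the map."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * size + a][j * size + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(sum(window) / (size * size))
        pooled.append(row)
    return pooled


# A 5x5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-set filter that responds to dark-to-bright transitions from left to right.
edge_kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, edge_kernel)   # 4x4 map, large where the edge is
pooled = avg_pool(feature_map)                   # 2x2 summary of the feature map
```

The same small kernel is reused at every image position, which is why a convolutional layer needs far fewer weights than a fully connected layer over the same input.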
AlexNet
Alex Krizhevsky, Ilya Sutskever & Geoffrey Hinton, 2012
The deep CNN that won the ImageNet (ILSVRC) 2012 competition and ignited the deep learning revolution, featuring ReLU activations and dropout regularization.
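The two ingredients mentioned above are small enough to sketch directly. The dropout shown is the "inverted" formulation, which rescales surviving units during training; that choice, the default rate, and the function names are assumptions of this sketch rather than a reproduction of the AlexNet code.

```python
import random


def relu(values):
    """Rectified linear unit: pass positive values through, clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def dropout(values, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1 / (1 - p).

    At evaluation time (training=False) the input is returned unchanged.
    Assumes 0 <= p < 1.
    """
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = relu([-1.5, 0.0, 0.7, 2.0])
train_out = dropout(activations, p=0.5, rng=random.Random(0))   # some units zeroed
eval_out = dropout(activations, training=False)                 # unchanged
```

Because a different random subset of units is silenced on every training pass, no single unit can rely on the presence of any other, which is the regularizing effect.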
GAN (Generative Adversarial Network)
Ian Goodfellow et al., 2014
Two neural networks compete: a Generator creates fake samples while a Discriminator tries to distinguish real from fake, driving both to improve.
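The competition can be expressed as two loss functions over the Discriminator's outputs. The sketch below assumes those outputs are probabilities strictly between 0 and 1, and it uses the non-saturating form of the Generator loss; the networks themselves are left out, and the function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, fakes near 0.

    d_real and d_fake are lists of Discriminator outputs in (0, 1).
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating Generator loss: low when the Discriminator scores fakes near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical Discriminator outputs for a batch of real and generated samples.
d_on_real = [0.9, 0.8, 0.95]
d_on_fake = [0.1, 0.3, 0.2]

d_loss = discriminator_loss(d_on_real, d_on_fake)   # small: D is winning
g_loss = generator_loss(d_on_fake)                  # large: G is being caught
```

Training alternates between lowering d_loss with respect to the Discriminator's weights and lowering g_loss with respect to the Generator's weights.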
Variational Autoencoder
Diederik Kingma & Max Welling, 2014
A generative model that learns a compressed latent representation by encoding inputs into a distribution (μ, σ) and decoding samples back to data.
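Two pieces of the description above lend themselves to a short sketch: sampling a latent vector from the encoded distribution (the reparameterization trick) and the KL term that pulls that distribution toward a standard normal. The encoder and decoder networks are omitted, the variance is parameterized as log σ² (a common convention, assumed here), and the function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way keeps the randomness in eps, so in an
    autodiff framework gradients can flow through mu and sigma.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


# Hypothetical encoder output for one input: a 2-dimensional latent distribution.
mu = [0.5, -1.0]
log_var = [0.0, -2.0]

z = reparameterize(mu, log_var, random.Random(0))   # latent sample passed to the decoder
kl = kl_to_standard_normal(mu, log_var)             # regularization term of the loss
```

The full training loss adds this KL term to a reconstruction error between the decoder output and the original input.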
ResNet (Residual Network)
Kaiming He, Xiangyu Zhang, Shaoqing Ren & Jian Sun, 2015
Deep residual learning uses skip connections that bypass layers, allowing gradients to flow directly and enabling training of very deep networks.
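A skip connection amounts to adding the block's input back to its output. The sketch below uses two small fully connected layers for the residual branch to stay short; that substitution for convolutional layers, and all names, are assumptions of this illustration.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def residual_block(x, W1, b1, W2, b2):
    """Compute relu(x + F(x)), where F is two linear layers with a ReLU in between."""
    f = linear(W2, b2, relu(linear(W1, b1, x)))
    # Skip connection: the input bypasses F and is added to its output.
    return relu([xi + fi for xi, fi in zip(x, f)])


# With all-zero weights the residual branch contributes nothing,
# so the block reduces to relu(x): the input passes straight through.
zeros = [[0.0, 0.0], [0.0, 0.0]]
passthrough = residual_block([1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

Because the block starts out close to an identity mapping, stacking many of them does not have to degrade the signal, and the addition gives gradients a direct path back to earlier layers.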
Transformer (Self-Attention)
Ashish Vaswani et al., 2017
The self-attention mechanism computes pairwise attention weights between all tokens, allowing each position to attend to every other position in the sequence.
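The pairwise weighting can be sketched as scaled dot-product attention, softmax(QKᵀ / √d_k)·V. In this illustration the query, key, and value vectors are passed in directly; the learned projections that produce them, and multiple heads, are left out, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of query, key, and value vectors.

    Returns (output, weights); weights[i][j] is how much position i attends
    to position j, and each row of weights sums to 1.
    """
    d_k = len(K[0])
    weights = [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Three hypothetical token vectors, used as queries, keys, and values alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens, tokens, tokens)
```

Each row of weights covers every position in the sequence, so the cost of this step grows with the square of the sequence length.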
Mixture of Experts
Noam Shazeer et al., 2017
A sparse architecture where a gating network routes each input to a subset of specialized expert networks, enabling massive model capacity with efficient computation.
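The routing step can be sketched as top-k gating: score every expert, keep the k best, and mix only their outputs. This is a simplified illustration; the linear gate, the toy experts, and the function name are assumptions, and refinements such as noisy gating or load-balancing losses are omitted.

```python
import math


def moe_forward(x, gate_weights, experts, k=2):
    """Route x to the top-k experts by gate score and mix their outputs.

    gate_weights has one row per expert; experts is a list of callables.
    Returns (mixed_output, indices_of_selected_experts).
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax over the selected logits only.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())

    # Only the selected experts are evaluated; the rest cost nothing.
    outputs = {i: experts[i](x) for i in top}
    dim = len(outputs[top[0]])
    mixed = [sum(exps[i] / total * outputs[i][d] for i in top) for d in range(dim)]
    return mixed, top


# Three hypothetical experts, each a simple function of the input.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

mixed, chosen = moe_forward([0.5, 2.0], gate_weights, experts, k=2)
```

Total parameter count grows with the number of experts, while the compute per input depends only on k.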
BERT (Bidirectional Encoder)
Jacob Devlin et al., 2018
BERT uses bidirectional self-attention to understand context from both directions, predicting masked tokens using the full surrounding context.
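The masked-token objective starts with a data preparation step, sketched below. It is deliberately simplified: every selected token is replaced by the mask symbol, whereas the published procedure also leaves some selected tokens unchanged or swaps in random ones. The function name and the use of None for unscored positions are assumptions of this sketch.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Hide a random subset of tokens; return (masked_input, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would be scored only on the positions it could not see.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, rng=random.Random(0))
```

To fill a masked slot, the model may use the words on both sides of it, which is what distinguishes this objective from left-to-right prediction.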
GPT (Autoregressive Transformer)
Alec Radford et al. (OpenAI), 2018
GPT generates tokens left-to-right using a causal (triangular) attention mask, where each token can only attend to previous tokens in the sequence.
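The triangular mask can be sketched on its own, independent of any particular model. The code below builds the mask and applies it to a matrix of raw attention scores; the scores are placeholders and the function names are hypothetical.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to disallowed positions."""
    visible = [s for s, ok in zip(scores, allowed) if ok]
    m = max(visible)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Apply the causal mask row by row to an n x n matrix of attention scores."""
    mask = causal_mask(len(scores))
    return [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]


# Placeholder scores for a 4-token sequence (all equal, so only the mask matters).
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
```

With equal scores, row i spreads its weight evenly over positions 0 through i and assigns nothing to later positions, which is the lower-triangular pattern the description refers to.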
Vision Transformer
Alexey Dosovitskiy et al., 2020
Applies the Transformer architecture directly to image patches, treating them as a sequence of tokens for image classification.
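The step that turns an image into a token sequence can be sketched as follows. The example uses a single-channel image stored as a list of rows; the function name and the row-major patch order are assumptions of this sketch, and the later stages (linear projection of each patch, position embeddings, the Transformer itself) are omitted.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patch vectors.

    Patches are returned left to right, top to bottom; each patch is
    flattened row by row into a list of patch_size * patch_size values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ])
    return patches


# A 4x4 image whose pixel values are 0..15, cut into four 2x2 patches.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
```

Each flattened patch plays the role that a word embedding plays in a text Transformer, so a larger patch size means a shorter sequence.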
Diffusion Model (DDPM)
Jonathan Ho, Ajay Jain & Pieter Abbeel, 2020
Generates data by learning to reverse a gradual noising process, denoising random noise step by step into a coherent pattern.
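The noising half of the process has a closed form, sketched below with a linear variance schedule. The schedule endpoints and step count are assumptions of this sketch (at least two steps are required), as are the function names. The learned part, a network that predicts the added noise, is not implemented; estimate_x0 only shows how such a prediction would be used.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (num_steps >= 2)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def estimate_x0(xt, eps_pred, t, alpha_bars):
    """Invert the noising formula given a noise estimate.

    In a trained model eps_pred would come from the denoising network.
    """
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, eps_pred)]


alpha_bars = make_alpha_bars()
x0 = [1.0, -2.0, 0.5]                       # hypothetical clean data point
xt, eps = add_noise(x0, 500, alpha_bars, random.Random(0))
recovered = estimate_x0(xt, eps, 500, alpha_bars)   # exact only because the true eps is used
```

Training teaches a network to predict eps from xt and t; generation then starts from pure noise and applies that prediction repeatedly, one step at a time.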