Mixture of Experts

Noam Shazeer et al., 2017

O(n·d·k)

Popularized by Shazeer et al. in 2017, Mixture of Experts (MoE) layers use a learned gating (router) network to select a small subset (top-k) of expert sub-networks for each input token. Only the selected experts process the token, so per-token computation stays sparse even when the total model capacity is very large. The visualization shows an input token flowing through the router, gating weights being assigned to each expert, the top-2 experts activating and computing, and their outputs being combined via a weighted sum to produce the final output.
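The routing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation from the original paper: the names `moe_forward`, `router_W`, and `make_scaling_expert` are hypothetical, the router is assumed to be a single linear map, the gate weights are assumed to be a softmax over only the selected experts' scores, and the toy experts simply scale their input. Training-time details such as router noise and load balancing are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_scaling_expert(scale):
    """Hypothetical toy expert: multiplies every component by `scale`."""
    return lambda x: [scale * v for v in x]

def moe_forward(x, router_W, experts, k=2):
    """Route one token vector `x` through its top-k experts.

    Assumption: gate weights are a softmax over the selected
    experts' router scores only, so they sum to 1.
    """
    logits = matvec(router_W, x)  # one router score per expert
    # Indices of the k highest-scoring experts, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top, gates

# Usage: 4 experts, 2-dimensional token, top-2 routing.
experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_W = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
out, top, gates = moe_forward([1.0, 0.0], router_W, experts, k=2)
print("selected experts:", top)
print("gate weights:", gates)
print("output:", out)
```

The two unselected experts are never called, which is where the sparsity comes from: the cost per token depends on k rather than on the total number of experts.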