ReLU & GELU — How a Neuron Decides
A neuron has to make one decision, billions of times a second: fire, or don't fire.
For decades, that decision was made by a smooth curve: sigmoid, or its cousin tanh. Sigmoid takes any number and squeezes it into a value between 0 and 1. Beautiful math. Painful in practice. The problem: at the extremes, the curve flattens. The gradient, the signal that tells the neuron "you should adjust," shrinks to almost nothing. Even at its steepest point, sigmoid's slope is only 0.25, so each layer can shrink the signal further. Stack 50 of these layers, and the learning signal at the bottom is essentially zero. Deep networks refused to train. The problem was identified in the early 1990s, and it kept plain deep networks out of reach for years.
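To put a number on that shrinking signal, here is a minimal sketch in plain Python. The function names are made up for illustration, and `best_case_signal` ignores the weights entirely, which is a simplifying assumption: it only tracks the factor each sigmoid contributes when it sits at its steepest point.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: squeezes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Slope of the sigmoid, s(x) * (1 - s(x)). It peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def best_case_signal(depth: int) -> float:
    """Largest gradient factor that `depth` stacked sigmoids can pass along.

    Assumption for illustration: weights are ignored, and every layer
    sits at x = 0, where the slope is at its maximum.
    """
    return sigmoid_grad(0.0) ** depth

print(sigmoid_grad(0.0))     # slope at the steepest point of the curve
print(sigmoid_grad(10.0))    # slope far out on the flat part
print(best_case_signal(50))  # what is left after 50 layers, best case
```

Even in the best case, 50 layers leave less than a millionth of a millionth of the original signal. Real networks also multiply by weights, which can offset some of this, so treat the number as intuition rather than a measurement.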
In 2010, Vinod Nair and Geoffrey Hinton showed that something brutally simple worked better. Replace the curve with a hinge. If the input is positive, pass it through unchanged. If the input is negative, output zero. That's it. The rectifier idea was not new, but their paper put it on the map. The name stuck: ReLU, the Rectified Linear Unit.
The math is one line. The effect was enormous. The gradient on the positive side stays at exactly 1. Nothing shrinks on the way down, so the signal can survive 50 layers, even 100. The deeper you stack, the more ReLU outperforms the smooth curves that came before. By 2015 it was the default. By 2020 it was everywhere.
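The one-line math, as a minimal sketch in plain Python. The helper names are illustrative. The slope at exactly zero is undefined, and treating it as 0 is a common convention assumed here.

```python
def relu(x: float) -> float:
    """The hinge: pass positive inputs through unchanged, output zero otherwise."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Slope of the hinge: 1 on the positive side, 0 on the negative side.

    Assumption: the slope at exactly x = 0 is taken to be 0.
    """
    return 1.0 if x > 0 else 0.0

def signal_through(depth: int, x: float) -> float:
    """Gradient factor after `depth` stacked ReLUs that all see the same input x.

    Weights are ignored, as a simplification for illustration.
    """
    return relu_grad(x) ** depth

print(relu(3.5), relu(-2.0))     # positive passes through, negative becomes zero
print(signal_through(100, 3.5))  # the factor on an active path after 100 layers
```

On an active path the factor is 1 at every layer, so depth alone does not shrink the signal. That is the whole trick.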
But ReLU had a quiet flaw. A neuron whose input is negative for every example will output zero every time. Its gradient is also zero, so nothing ever nudges it back. It never recovers. Engineers called this the "dying ReLU" problem. A bad initialization, a bad batch, and a large share of your network could just go silent and never come back.
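A dead neuron is easy to reproduce. The sketch below is a toy setup, not anyone's production training loop: one neuron, three made-up data points, squared error, plain gradient descent. The names `train_step`, `inputs`, and `targets` are hypothetical. The bias starts so negative that the neuron's input is below zero for every example.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def relu_grad(x: float) -> float:
    # Slope of the hinge; 0 is assumed at exactly x = 0.
    return 1.0 if x > 0 else 0.0

def train_step(w, b, inputs, targets, lr=0.1):
    """One gradient-descent step for a single neuron y = relu(w * x + b)."""
    grad_w = 0.0
    grad_b = 0.0
    for x, t in zip(inputs, targets):
        z = w * x + b                 # the neuron's input
        y = relu(z)                   # the neuron's output
        upstream = 2.0 * (y - t)      # slope of the squared error with respect to y
        local = relu_grad(z)          # zero whenever z is negative
        grad_w += upstream * local * x
        grad_b += upstream * local
    n = len(inputs)
    return w - lr * grad_w / n, b - lr * grad_b / n

# Made-up toy data, for illustration only.
inputs = [0.1, 0.5, 0.9]
targets = [0.2, 1.0, 1.8]

# A bias this negative keeps w * x + b below zero for every input.
w, b = 1.0, -5.0
for _ in range(100):
    w, b = train_step(w, b, inputs, targets)

print(w, b)  # the parameters after 100 steps
```

The error is large on every example, yet the parameters come out of the loop exactly where they went in. Start the same neuron at `b = 0.0` and it begins learning on the first step.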
In 2016, Dan Hendrycks and Kevin Gimpel asked: what if the decision to fire is probabilistic, not binary? They took the standard normal curve, the bell curve, and asked, for each input, "what's the probability that a random draw from this curve lands below this value?" Multiply the input by that probability. If the input is large and positive, fire fully. If it's slightly negative, let a small negative value trickle through. If it's deeply negative, output almost nothing. Smooth where ReLU is sharp. Gentle where ReLU kills. They called it GELU: Gaussian Error Linear Unit.
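In symbols, GELU(x) = x · Φ(x), where Φ is the standard normal CDF: the probability that a draw from the bell curve lands below x. A minimal sketch using only Python's `math` module follows. The function names are illustrative, and this is the exact form, not the faster tanh approximation some libraries offer.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Phi(x): probability that a standard normal draw lands below x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x: float) -> float:
    """phi(x): height of the bell curve at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x: float) -> float:
    """Exact GELU: the input scaled by Phi(x)."""
    return x * std_normal_cdf(x)

def gelu_grad(x: float) -> float:
    """Slope of GELU, from the product rule: Phi(x) + x * phi(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)

for x in (-4.0, -0.5, 0.0, 0.5, 4.0):
    print(x, gelu(x), gelu_grad(x))
```

Two things to look for in the printout. For slightly negative inputs the output is a small negative number, not zero. And the slope there is not zero either, so a neuron sitting just below zero still receives a learning signal.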
BERT uses GELU. So do the early GPT models. Many newer transformers use it or a close smooth relative. The hinge was good. The probabilistic hinge was better.
The lesson: a neuron doesn't need a smooth decision. It needs the right decision. For a decade, the right answer was "yes or no." Now it's "yes, probably, by this much."
— 算子次元 · One-minute AI · #13 ReLU & GELU
#AI #MachineLearning #DeepLearning #NeuralNetworks #ActivationFunction