
Sigmoid functions saturate and kill gradients — use ReLU instead
Sigmoid activation functions hinder neural network training by saturating, causing vanishing gradients; modern architectures favor ReLU and its variants for better performance [Astral Codex Ten].
Sigmoid functions, once standard in neural networks, are now known to degrade training by saturating at extreme inputs, pushing gradients toward zero and stalling learning [Astral Codex Ten]. When a sigmoid neuron outputs a value near 0 or 1, its local gradient becomes very small. Backpropagation multiplies these local gradients layer by layer, so the signal reaching the early layers shrinks toward zero; this is the vanishing gradient problem. Tomte’s analysis on Astral Codex Ten demonstrates how this flaw breaks deep network training, especially in early layers.
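A minimal sketch of the arithmetic behind saturation, using only the Python standard library. The helper names (sigmoid_grad, chained_grad_bound) are illustrative and do not come from the article; the bound ignores weight matrices and only tracks the activation derivatives, which is a simplifying assumption.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_grad_bound(depth: int) -> float:
    """Upper bound on the product of `depth` sigmoid derivatives.

    Each factor is at most 0.25, so the product is at most 0.25 ** depth.
    Weights are ignored here (an assumption made for illustration).
    """
    return 0.25 ** depth


if __name__ == "__main__":
    for x in (0.0, 2.0, 5.0, 10.0):
        print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.2e}")
    for depth in (1, 5, 10, 20):
        print(f"depth = {depth:2d}  bound = {chained_grad_bound(depth):.2e}")
```

Even in the best case, where every unit sits at x = 0, the activation derivatives alone shrink the gradient by a factor of at least four per layer. Saturated units make the shrinkage much worse.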
The issue isn’t theoretical. In practice, deep networks with sigmoid hidden units struggle on tasks with high-dimensional or noisy data, like image recognition or language modeling, where stable gradient flow is essential. Even slight input shifts can push sigmoid units into saturated zones, making learning slow or impossible. The article cites cases where networks with sigmoids plateau early, while otherwise identical models using ReLU converge faster and achieve higher accuracy.
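ReLU (Rectified Linear Unit) avoids saturation for positive inputs: its gradient there is exactly 1, so the backpropagated signal passes through without shrinking. For negative inputs the gradient is 0, which variants address in different ways. Leaky ReLU keeps a small nonzero slope on the negative side, and GELU replaces the hard corner with a smooth curve. These functions have replaced sigmoids as the default hidden-layer activation in nearly all modern architectures, from ResNets to Transformers, because they train faster and scale better.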
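The contrast can be sketched in a few lines of plain Python. The function names and the default negative slope of 0.01 are illustrative assumptions, not values taken from the article.

```python
def relu(x: float) -> float:
    """ReLU: passes positive inputs through unchanged, zeroes out the rest."""
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (no saturation for x > 0)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Leaky ReLU: like ReLU, but negative inputs are scaled by a small slope."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x: float, slope: float = 0.01) -> float:
    """Gradient of Leaky ReLU: 1 for positive inputs, `slope` otherwise."""
    return 1.0 if x > 0 else slope


if __name__ == "__main__":
    for x in (-5.0, -0.5, 0.5, 5.0, 50.0):
        print(f"x = {x:5.1f}  relu' = {relu_grad(x):.2f}  leaky' = {leaky_relu_grad(x):.2f}")
```

However large a positive input gets, the ReLU gradient stays at 1, whereas the sigmoid derivative decays toward zero as the input moves away from the origin.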
Despite this, sigmoids persist in tutorials and legacy systems, creating a gap between education and practice. Engineers building AI systems should default to ReLU or its successors unless there’s a specific need for probabilistic outputs, such as in binary classification heads.
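As a rough illustration of that default, the sketch below uses ReLU in the hidden layer and keeps a sigmoid only at the output, where a value between 0 and 1 is wanted. The forward function, the layer sizes, and the weights are all made up for the example; it uses plain Python lists rather than any particular framework.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def forward(
    x: List[float],
    w_hidden: List[List[float]],
    b_hidden: List[float],
    w_out: List[float],
    b_out: float,
) -> float:
    """One hidden layer with ReLU, then a sigmoid binary-classification head."""
    # Hidden layer: affine transform followed by ReLU.
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Output head: a single logit squashed to a probability-like value in (0, 1).
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(logit)


if __name__ == "__main__":
    # Hypothetical weights chosen only to make the example easy to follow.
    p = forward(
        x=[1.0, -2.0],
        w_hidden=[[0.5, 0.0], [0.0, 0.5]],
        b_hidden=[0.0, 0.0],
        w_out=[1.0, 1.0],
        b_out=0.0,
    )
    print(f"predicted probability: {p:.4f}")
```

The sigmoid here acts on a single output unit, so its saturation does not sit in the path of gradients flowing between hidden layers.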
Three takeaways: sigmoid saturation cripples deep learning; ReLU-based activations are now standard for reliable training; and understanding this shift prevents wasted effort on outdated designs.


