Deep Learning

Activation Functions

Think of a volume knob that doesn’t just multiply the signal linearly: it might squash loud sounds (saturation), cut off negative values (ReLU), or smoothly compress everything into a fixed range (sigmoid).

Admin User
March 15, 2026
40 min read

Activation Functions Study

Interview-Ready Notes

Organization: DataLogos
Date: 15 Mar, 2026

Activation Functions – Study Notes (Deep Learning)

Target audience: Beginners | Goal: End-to-end learning and interview-ready
Difficulty: Beginner-friendly (assumes artificial neuron / perceptron and basic calculus)
Estimated time: ~40 min read / ~1 hour with self-checks and exercises


Pre-Notes: Learning Objectives & Prerequisites

Learning Objectives

By the end of this note you will be able to:

  • Define an activation function and explain why neural networks need non-linearity.
  • Describe the core role of activations: bounded output, gradient flow, and expressiveness.
  • Compare Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax (formula, range, use case, pros/cons).
  • Explain vanishing gradient, saturation, and dying ReLU and how different activations address them.
  • Choose an appropriate activation per layer type (hidden vs output) and task (classification vs regression).
  • Answer common interview questions and avoid wrong vs correct traps (e.g., “Why not sigmoid everywhere?”).

Prerequisites

Before starting, you should know:

  • Artificial neuron: weighted sum z = w·x + b and that output = activation(z).
  • Basic calculus: derivative of a function (for gradient flow and backprop).
  • Idea of linear vs non-linear: stacking linear layers without activation still gives a linear map.

If you don’t, review Artificial Neuron & Perceptron and Calculus for Deep Learning first.

Where This Fits

  • Builds on: Artificial Neuron (where activation sits in the pipeline), Calculus (gradients for backprop).
  • Needed for: Backpropagation (gradients flow through activations), all architectures (MLP, CNN, RNN, Transformers).

This topic is part of Phase 2: Core Neural Network Internals and is used in every layer of every modern deep learning model.


1. What is an Activation Function?

An activation function is a non-linear function applied to the output of a neuron’s weighted sum (the pre-activation or net input). It takes the value z = w·x + b and produces the neuron’s output a = activation(z).

In simple words: the neuron first computes a single number (the weighted sum plus bias); the activation function transforms that number into the final output that gets passed to the next layer (or to the loss). This transformation is what allows the network to learn non-linear relationships.

Simple Intuition

Think of a volume knob that doesn’t just multiply the signal linearly: it might squash loud sounds (saturation), cut off negative values (ReLU), or smoothly compress everything into a fixed range (sigmoid). The activation function decides how each neuron “responds” to its net input—whether it stays off, fires a little, or saturates at a maximum.

Formal Definition (Interview-Ready)

An activation function is a (usually non-linear) map σ : ℝ → ℝ (or to a bounded interval) applied element-wise to the pre-activation z of a layer, giving a = σ(z). It introduces non-linearity so that stacked layers can approximate non-linear functions; without it, any depth would be equivalent to a single linear layer.


In a Nutshell

An activation function turns the neuron’s weighted sum z into its output a = σ(z). It is non-linear so that the whole network can learn complex, non-linear patterns instead of just linear combinations.


2. Why Do We Need Activation Functions?

Without activation functions, every layer would compute only linear transformations (Wx + b). Composing many linear maps gives one linear map: y = W_L … W_2 W_1 x + (bias terms). So depth would add no expressiveness—the network would behave like a single linear model.

We need activations to:

  1. Introduce non-linearity so the network can approximate non-linear decision boundaries and functions.
  2. Control output range (e.g., probabilities in [0,1] for sigmoid, or class probabilities that sum to 1 for softmax).
  3. Affect gradient flow during backpropagation; some activations avoid vanishing or exploding gradients better than others.
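The collapse of stacked linear layers into a single linear map can be checked numerically. A minimal NumPy sketch (layer sizes and weights chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same map collapses into ONE affine layer: y = Wx + b
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layer, one_layer)  # depth added nothing
```

The assertion passes for any input x: without a non-linearity between them, the two layers are exactly one linear model.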

Old vs New Paradigm

Paradigm | Role of activation
Linear model | No activation (or identity); one linear map
Single neuron + step | Step (threshold) for binary yes/no
Deep networks | Non-linear σ at each layer for expressiveness

Key Reasons

  • Universal approximation: With non-linear activations, a network with one hidden layer can approximate a wide class of continuous functions (universal approximation theorem); depth improves efficiency and generalization.
  • Gradient flow: Choice of activation affects whether gradients vanish (e.g., sigmoid in deep nets) or stay healthy (e.g., ReLU).
  • Interpretability of outputs: Sigmoid → probability; Softmax → class probabilities; identity → regression.

Real-World Relevance

Domain | Why activation matters
Hidden layers | ReLU/Leaky ReLU/GELU enable deep networks to train without vanishing gradients.
Output layer | Sigmoid (binary), Softmax (multi-class), identity (regression) match the task.
Attention / NLP | Softmax on attention scores; GELU in Transformers.

3. Core Building Block: How It Works

Where It Fits in a Neuron

For one neuron (or one unit in a layer):

  1. Inputs x₁, x₂, …, xₙ and weights w₁, …, wₙ, bias b.
  2. Pre-activation: z = Σᵢ wᵢ xᵢ + b = w·x + b.
  3. Activation: a = σ(z) → this is the output of the neuron.

For a layer, the same is applied element-wise: z = Wx + b, then a = σ(z) (σ applied to each component of z).

Mathematical Role

  • Linearity: f(ax + by) = a f(x) + b f(y). Matrix multiplication and adding a bias are linear/affine operations; composing them still gives a linear/affine map.
  • Non-linearity: σ breaks this: σ(W₂ σ(W₁ x + b₁) + b₂) is not equivalent to one linear map. So the network can represent curves, decision boundaries, and complex functions.

Interview-ready: The core building block is a = σ(z) where z = w·x + b. σ must be non-linear (for hidden layers) so that depth adds expressiveness; the derivative σ’(z) determines how gradients flow in backprop.


Diagram: Role of Activation in a Layer

flowchart LR
  subgraph layer [One layer]
    X[Input x] --> Z["z = Wx + b"]
    Z --> A["a = σ(z)"]
    A --> Next[Next layer / output]
  end

Caption: Each layer: linear part z = Wx + b, then non-linear a = σ(z). Without σ, the whole stack would be one linear map.


In a Nutshell

One layer = linear (Wx + b) then non-linear (σ(z)). The activation σ is what makes the function of the network non-linear and allows it to learn complex patterns.


Think about it: If we use σ(z) = z (identity) in every hidden layer, what kind of function does a 10-layer network compute?


4. Process: Forward Pass and Gradient Flow

Forward Pass

For each layer ℓ:

  1. z^(ℓ) = W^(ℓ) a^(ℓ−1) + b^(ℓ) (pre-activation).
  2. a^(ℓ) = σ(z^(ℓ)) (activation).

The output of the last layer is used for the loss (e.g., logits passed to cross-entropy, or a^(L) used directly with MSE).
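The per-layer recursion above can be sketched in NumPy (layer sizes and weights here are arbitrary placeholders, with ReLU in hidden layers and identity at the output):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(1)
# Hypothetical MLP with sizes 4 -> 5 -> 3 -> 2 (arbitrary for illustration)
params = [(rng.normal(size=(5, 4)), np.zeros(5)),
          (rng.normal(size=(3, 5)), np.zeros(3)),
          (rng.normal(size=(2, 3)), np.zeros(2))]

a = rng.normal(size=4)                # a^(0) = input x
for i, (W, b) in enumerate(params):
    z = W @ a + b                     # z^(l) = W^(l) a^(l-1) + b^(l)
    last = i == len(params) - 1
    a = z if last else relu(z)        # identity at output (logits), ReLU in hidden

assert a.shape == (2,)                # two output logits for the loss
```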

Backward Pass (Gradient Flow)

During backpropagation, the gradient of the loss L with respect to z^(ℓ) is:

∂L / ∂z^(ℓ) = (∂L / ∂a^(ℓ)) ⊙ σ’(z^(ℓ))

where ⊙ is the element-wise product and σ’(z) is the derivative of the activation. So:

  • If σ’(z) is small (e.g., sigmoid saturating), gradients shrink → vanishing gradient.
  • If σ’(z) is zero (e.g., ReLU for z < 0), that neuron passes no gradient → “dead” neuron if it never activates again.

Activation | Derivative (key behavior) | Effect on gradient
Sigmoid | σ’(z) = σ(z)(1−σ(z)); small when |z| large | Vanishing in deep nets
Tanh | 1 − tanh²(z); same saturation | Vanishing in deep nets
ReLU | 1 if z > 0, 0 if z < 0 | No gradient for z < 0; can “die”
Leaky ReLU | 1 or small slope (e.g. 0.01) | Small gradient for z < 0; fewer dead neurons
GELU | Smooth; non-zero for z < 0 | Better gradient flow than ReLU in practice
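The saturation behavior of these derivatives can be checked directly; a small sketch evaluating each σ’(z) at a large input:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)           # σ'(z) = σ(z)(1 − σ(z))

def d_tanh(z):
    return 1 - np.tanh(z) ** 2   # 1 − tanh²(z)

def d_relu(z):
    return float(z > 0)          # 1 for z > 0, else 0

z = 10.0  # a saturating input
assert d_sigmoid(z) < 1e-4   # sigmoid gradient ~ 0: vanishing
assert d_tanh(z) < 1e-4      # tanh saturates too
assert d_relu(z) == 1.0      # ReLU passes the gradient through...
assert d_relu(-z) == 0.0     # ...but blocks it entirely for z < 0
```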

In a Nutshell

Forward: a = σ(z). Backward: gradient is scaled by σ’(z). Small or zero σ’(z) causes vanishing or dead neurons; that’s why ReLU/Leaky ReLU/GELU are preferred in deep hidden layers.


5. Key Sub-Topics: Saturation, Vanishing Gradient, Dying ReLU

Saturation

Saturation means the activation function flattens out (derivative ≈ 0) when the input is very positive or very negative. Sigmoid and tanh saturate at both tails; ReLU saturates only for z < 0 (exactly 0 gradient).

  • Problem: When σ’(z) ≈ 0, gradients in backprop become tiny → weights in earlier layers barely update → vanishing gradient.

Vanishing Gradient

In very deep networks, gradients can become extremely small as they are multiplied by σ’(z) at each layer. Early layers get almost no update signal, so training stalls.

  • Mitigations: Use activations with non-saturating or non-zero gradient in the active region (ReLU, Leaky ReLU, GELU); Batch Normalization; Residual connections.

Dying ReLU

ReLU: a = max(0, z); derivative is 0 for z < 0. If a neuron’s weights shift so that z is always negative for all training inputs, it never fires and never gets a gradient → it stays “dead.”

  • Mitigations: Leaky ReLU (small negative slope); GELU; better initialization; lower learning rate.
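The dead-neuron condition can be illustrated with a toy batch; a sketch comparing the gradients ReLU and Leaky ReLU pass for all-negative pre-activations:

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

# A neuron whose pre-activation is negative for every training input
z_batch = np.array([-3.0, -1.2, -0.5, -7.0])

assert relu_grad(z_batch).sum() == 0.0           # no gradient ever: "dead" neuron
assert (leaky_relu_grad(z_batch) == 0.01).all()  # small gradient survives, so it can recover
```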

6. Comparison: Linear vs Non-Linear; Hidden vs Output

Linear vs Non-Linear

Aspect | No activation (linear) | With non-linear activation
Expressiveness | One linear map, regardless of depth | Can approximate non-linear functions
Depth | Useless (equivalent to 1 layer) | Each layer adds non-linearity
Use | Output layer for regression (sometimes) | Hidden layers; output for classification

Hidden Layer vs Output Layer

Layer type | Typical activations | Purpose
Hidden | ReLU, Leaky ReLU, GELU, Tanh | Non-linearity; good gradient flow
Output | Sigmoid, Softmax, Identity | Match task: probability, class probs, real value

7. Common Types and Variants

1. Sigmoid (σ(z) = 1 / (1 + e^{-z}))

  • Range: (0, 1). Use case: Binary classification output (probability of positive class). Historically used in hidden layers but avoid there in deep nets.
  • Pros: Smooth, bounded; interpretable as probability.
  • Cons: Saturates; vanishing gradient; outputs not zero-centered.

Example: Last layer of a binary classifier: P(y=1|x) = sigmoid(z).


2. Tanh (tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}))

  • Range: (−1, 1). Use case: Hidden layers when zero-centered output is desired; RNNs (historically).
  • Pros: Zero-centered; stronger gradients than sigmoid near 0.
  • Cons: Still saturates for large |z|; vanishing gradient in very deep nets.

Example: Classic choice in LSTM/RNN gates (often with sigmoid for gate, tanh for cell candidate).


3. ReLU (Rectified Linear Unit): a = max(0, z)

  • Range: [0, ∞). Use case: Default for hidden layers in most feedforward and CNN architectures.
  • Pros: Simple; no saturation for z > 0; sparse activations; fast.
  • Cons: Not zero-centered; “dying ReLU” (neurons stuck at 0); gradient = 0 for z < 0.

Example: Almost every modern CNN/MLP hidden layer.


4. Leaky ReLU: a = max(αz, z) with small α (e.g. 0.01)

  • Range: (−∞, ∞) in principle; negative side small. Use case: When dying ReLU is a concern.
  • Pros: Small gradient for z < 0 → fewer dead neurons.
  • Cons: Extra hyperparameter α; usually small gain over ReLU in practice.

Example: Some GANs and deeper MLPs where dead ReLUs are observed.


5. GELU (Gaussian Error Linear Unit)

  • Formula: GELU(z) ≈ z Φ(z) (Φ = standard normal CDF); often approximated as 0.5 z (1 + tanh(√(2/π)(z + 0.044715 z³))).
  • Use case: Transformers (BERT, GPT); increasingly used as default in NLP.
  • Pros: Smooth; non-zero for z < 0; often better performance than ReLU in large models.
  • Cons: More expensive than ReLU.

Example: BERT, GPT-2/3 use GELU in feedforward blocks.
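The tanh approximation above tracks the exact erf-based definition closely; a quick check using only the standard library (tolerance 1e-3 is an illustrative bound, not a spec):

```python
import math

def gelu_exact(z):
    # GELU(z) = z * Phi(z), Phi = standard normal CDF
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh_approx(z):
    # Tanh approximation from the formula above
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

for z in (-3.0, -1.0, -0.1, 0.0, 0.5, 2.0):
    assert abs(gelu_exact(z) - gelu_tanh_approx(z)) < 1e-3
```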


6. Softmax (for a vector z): a_i = e^{z_i} / Σ_j e^{z_j}

  • Range: (0, 1); Σᵢ aᵢ = 1. Use case: Output layer for multi-class classification (class probabilities).
  • Pros: Outputs are valid probability distribution; works with cross-entropy loss.
  • Cons: For hidden layers, usually not used (prefer ReLU/GELU); can be numerically unstable (use log-sum-exp trick).

Example: Last layer of a 10-class image classifier: 10 logits → softmax → 10 probabilities.


Quick Reference Table

Name | Formula (typical) | Range | Typical use | Gradient issue
Sigmoid | 1/(1+e^{−z}) | (0,1) | Binary out | Vanishing
Tanh | (e^z−e^{−z})/(e^z+e^{−z}) | (−1,1) | Hidden / RNN | Vanishing
ReLU | max(0,z) | [0,∞) | Hidden (default) | Dying ReLU
Leaky ReLU | max(αz,z) | (−∞,∞) | Hidden (if dying) | Fewer dead
GELU | z Φ(z) (approx) | (−∞,∞) | Transformer hidden | Generally good
Softmax | e^{z_i}/Σ_j e^{z_j} | (0,1), Σ=1 | Multi-class out | Use with cross-entropy
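The ranges in the table can be verified numerically; a short sketch sampling each function over a wide input interval:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.linspace(-10, 10, 101)
assert ((sigmoid(z) > 0) & (sigmoid(z) < 1)).all()    # range (0, 1)
assert ((np.tanh(z) > -1) & (np.tanh(z) < 1)).all()   # range (−1, 1)
assert (np.maximum(0, z) >= 0).all()                  # range [0, ∞)

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)                       # valid probability distribution
```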

8. FAQs & Common Student Struggles

Q1. Why do we need non-linearity?

Without non-linear activations, the entire network is one linear transformation. We need σ so that stacking layers can represent non-linear decision boundaries and functions (e.g., curves, XOR, images).

Interview tip: “Without activation functions, depth is useless—the network is equivalent to a single linear layer.”


Q2. Can we use different activations in different layers?

Yes. Hidden layers typically use ReLU, Leaky ReLU, or GELU. Output layer depends on the task: Sigmoid (binary), Softmax (multi-class), identity (regression).


Q3. Why is ReLU preferred over sigmoid/tanh in hidden layers?

ReLU avoids vanishing gradient in the positive region (derivative = 1), trains faster, and is cheap to compute. Sigmoid and tanh saturate for large |z|, so gradients vanish in deep networks.


Q4. What is the “dying ReLU” problem?

When z < 0 for all inputs to a ReLU neuron, it outputs 0 and its gradient is 0, so weights never update. The neuron stays “dead.” Leaky ReLU or GELU give a small non-zero gradient for z < 0.


Q5. When do we use Sigmoid vs Softmax?

Sigmoid: Binary classification; one output neuron; probability of one class. Softmax: Multi-class classification; one output per class; probabilities sum to 1.


Q6. Why is Softmax only used at the output layer?

Softmax converts logits to probabilities and is designed to pair with cross-entropy loss. In hidden layers we want non-linearity and gradient flow, not probability normalization; ReLU/GELU are better there.


Q7. What is saturation?

Saturation is when the activation function is in a flat region (derivative ≈ 0), e.g. sigmoid for very large or small z. Gradients then become very small and learning slows or stops (vanishing gradient).


Q8. Is GELU always better than ReLU?

Not always. GELU is often better in Transformers and large NLP models. For many CNNs and smaller MLPs, ReLU is still standard and cheaper. It’s a task and scale choice.


9. Applications (With How They Are Achieved)

1. Image Classification (e.g. CNNs)

Applications: Object recognition, face recognition, medical image labeling.

How activation functions achieve this:

  • Hidden layers: ReLU (or GELU) after each conv/linear layer provide non-linearity and stable gradients so deep stacks can learn hierarchies (edges → textures → parts → objects).
  • Output: Softmax turns final logits into class probabilities; cross-entropy loss trains the model.

Example: ResNet uses ReLU after every conv block; final layer is linear + Softmax for 1000-class ImageNet.


2. Binary Classification (e.g. spam, fraud)

Applications: Spam detection, click-through prediction, fraud detection.

How activation functions achieve this:

  • Hidden layers: ReLU/Leaky ReLU for non-linearity.
  • Output: Sigmoid on a single logit gives P(positive class); binary cross-entropy loss is used.

Example: A feedforward network with two hidden layers (ReLU) and one output neuron with sigmoid for “is this email spam?”


3. Natural Language Processing (Transformers)

Applications: BERT, GPT, machine translation, summarization.

How activation functions achieve this:

  • Hidden layers: GELU in feedforward blocks (and sometimes in attention) for smooth non-linearity and good gradient flow in very deep models.
  • Attention: Softmax on attention scores so they form a probability distribution over keys.
  • Output: Softmax for classification; linear or softmax for language modeling (next-token distribution).

Example: BERT uses GELU in the MLP blocks and Softmax in multi-head attention.


4. Regression (e.g. house price, demand forecasting)

Applications: Price prediction, demand forecasting, sensor prediction.

How activation functions achieve this:

  • Hidden layers: ReLU (or similar) for non-linearity.
  • Output: Identity (no activation) or a bounded activation if outputs must stay in a range; loss is usually MSE or MAE.

Example: MLP with ReLU hidden layers and linear (identity) output for scalar prediction.


10. Advantages and Limitations (With Examples)

Advantages

1. Non-linearity and expressiveness
Activations let the network approximate non-linear functions and complex decision boundaries.

Example: XOR cannot be learned by a linear model; one hidden layer with non-linear activation (e.g. tanh) can.


2. Controlled output range
Sigmoid and Softmax produce valid probabilities; ReLU keeps activations non-negative when that helps (e.g. sparse codes).

Example: Softmax output layer gives interpretable class probabilities for a doctor-facing diagnostic model.


3. Gradient flow (with good choice)
ReLU/GELU avoid saturation in the active region and help gradients reach earlier layers.

Example: Training a 50-layer ResNet is feasible because ReLU (and skip connections) keep gradients flowing.


4. Computational simplicity (ReLU)
ReLU is a max and a threshold; very fast and simple to implement.

Example: Inference on edge devices benefits from ReLU’s low cost compared to exp-based activations.


Limitations

1. Vanishing gradient (saturating activations)
Sigmoid and tanh saturate; gradients shrink in deep networks and early layers learn slowly.

Example: Old MLPs with many sigmoid hidden layers were hard to train beyond a few layers.


2. Dying ReLU
ReLU can permanently turn off neurons (zero gradient for z < 0), reducing effective capacity.

Example: In a GAN discriminator, too many dead ReLUs can make training unstable; Leaky ReLU is often used instead.


3. Non-zero-centered (Sigmoid, ReLU)
Sigmoid outputs are positive only; ReLU outputs are non-negative. This can affect optimization (less of an issue with BatchNorm).

Example: Before BatchNorm was common, tanh was sometimes preferred in hidden layers for being zero-centered.


4. Choice and tuning
Wrong choice can hurt performance or training stability (e.g. sigmoid in deep hidden layers; identity everywhere).

Example: Using sigmoid in all layers of a 10-layer network often leads to vanishing gradients and poor convergence.


11. Interview-Oriented Key Takeaways

  • An activation function σ maps pre-activation z to output a = σ(z); it must be non-linear in hidden layers so depth adds expressiveness.
  • Without activations, any deep network is equivalent to a single linear layer.
  • Hidden layers: ReLU (default), Leaky ReLU (if dying ReLU), GELU (Transformers). Output: Sigmoid (binary), Softmax (multi-class), identity (regression).
  • Vanishing gradient comes from saturation (small σ’(z)); dying ReLU from σ’(z) = 0 for z < 0.
  • Sigmoid/tanh in deep hidden layers are avoided due to saturation; ReLU/GELU preferred for gradient flow.

12. Common Interview Traps

Trap 1: “Why not use sigmoid in every layer?”

Wrong: Sigmoid is a good activation, so we can use it everywhere.

Correct: Sigmoid saturates (derivative ≈ 0 for large |z|), causing vanishing gradients in deep networks. Use ReLU (or GELU) in hidden layers; reserve sigmoid for the output of binary classification.


Trap 2: “ReLU has no drawbacks.”

Wrong: ReLU is perfect for hidden layers.

Correct: ReLU can cause dying neurons (zero gradient for z < 0). For z < 0 the gradient is 0, so if a neuron never fires again, it never gets updates. Leaky ReLU or GELU mitigate this.


Trap 3: “We don’t need activation in the output layer.”

Wrong: Output layer doesn’t need an activation.

Correct: The output layer should match the task: Sigmoid for binary probability, Softmax for multi-class probability, identity for regression. So we do use (the right) activation at the output for classification.


Trap 4: “More layers always need more expressive activations.”

Wrong: Deeper networks need fancier activations.

Correct: Depth adds expressiveness when we have non-linear activations; ReLU is already non-linear and is the default. “Fancier” (e.g. GELU) can help in specific architectures (e.g. Transformers) but isn’t strictly required for depth.


Trap 5: “Softmax can be used in hidden layers.”

Wrong: Softmax is a good non-linearity for hidden layers.

Correct: Softmax normalizes to a probability distribution and is designed for the output with cross-entropy. In hidden layers we want non-linearity and good gradient flow; ReLU/GELU are appropriate. Using Softmax in hidden layers is unusual and can hurt gradient flow.


Trap 6: “If the gradient is zero, the activation is bad.”

Wrong: Zero gradient always means a bad activation.

Correct: ReLU intentionally gives zero gradient for z < 0 (sparsity). The issue is when too many neurons are stuck at zero (dying ReLU) or when the derivative is small everywhere (saturation, e.g. sigmoid in deep nets). Context matters.


13. Simple Real-Life Analogy

An activation function is like a volume limiter and shaper: the raw signal (weighted sum) can be anything; the limiter squashes it into a range (e.g. 0–1 for probability), cuts off the negative part (ReLU), or smooths it (GELU). Without it, every layer would only “turn the volume up or down” linearly—you’d never get the rich, non-linear “sound” (decision boundaries) that the network needs.


14. Activation Functions in System Design – Interview Traps (If Applicable)

Trap 1: Same activation everywhere for simplicity

Wrong thinking: Use ReLU everywhere (including output) to keep the code simple.

Correct thinking: Output layer must match the task: Sigmoid/Softmax for classification, identity for regression. Using ReLU at the output for a classifier would give non-probability outputs and wrong loss (e.g. cross-entropy expects logits or probabilities).

Example: A binary classifier must use sigmoid (or logits + BCE) at the output, not ReLU.


Trap 2: Ignoring numerical stability (Softmax)

Wrong thinking: Implement Softmax as e^{z_i} / Σ_j e^{z_j} without care for large z.

Correct thinking: For large z, e^z overflows. Use the log-sum-exp trick: subtract max(z) before exponentiating, then Softmax is exp(z_i − max(z)) / Σ_j exp(z_j − max(z)).

Example: In production, always use a numerically stable Softmax (e.g. log_softmax for cross-entropy) to avoid NaNs on large logits.
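The overflow failure mode is easy to reproduce; a sketch contrasting the naive implementation with the max-shifted one on large logits:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                  # exp(1000) overflows to inf
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))      # log-sum-exp trick: shift by max(z) first
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)  # inf / inf -> nan
stable = softmax_stable(logits)

assert np.isnan(naive).any()      # naive version produces NaNs
assert np.isclose(stable.sum(), 1.0)  # stable version is a valid distribution
```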


Trap 3: Changing activation without re-tuning

Wrong thinking: Swap ReLU for GELU and expect the same hyperparameters to work.

Correct thinking: Different activations can change effective learning rate and gradient scale. You may need to re-tune learning rate or initialization when changing activation in a production model.

Example: Switching from ReLU to GELU in a Transformer might require a small learning rate or warmup adjustment.


15. Interview Gold Line

Activation functions are the non-linear “shape” of each layer: without them, depth is useless and the network is just one linear map; with the right choice (ReLU/GELU in hidden, Sigmoid/Softmax/identity at output), they make deep networks both expressive and trainable.


16. Code Snippets (Python)

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # clip for stability

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    z_stable = z - np.max(z, axis=-1, keepdims=True)
    return np.exp(z_stable) / np.sum(np.exp(z_stable), axis=-1, keepdims=True)

Interview tip: For Softmax, always subtract max(z) before exp to avoid overflow. In PyTorch/TensorFlow use built-in F.softmax / tf.nn.softmax which are stable.
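The snippet above covers Sigmoid, ReLU, Leaky ReLU, and Softmax; the remaining two activations from §7 (Tanh and GELU, via the tanh approximation quoted there) can be sketched similarly:

```python
import numpy as np

def tanh(z):
    return np.tanh(z)  # range (−1, 1), zero-centered

def gelu(z):
    # Tanh approximation of GELU (matches the formula quoted in §7.5)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

assert gelu(0.0) == 0.0            # GELU passes through the origin
assert abs(gelu(3.0) - 3.0) < 0.01  # ~identity for large positive z
assert abs(gelu(-10.0)) < 1e-6     # ~zero for large negative z
```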


17. Visual: Activation Comparison (Conceptual)

Comparison of common activation functions: Sigmoid, Tanh, ReLU, Softmax and their ranges and typical uses.

Caption: Same z can be passed through different σ; choice depends on layer (hidden vs output) and task (classification vs regression).


18. Self-Check and “Think About It” Prompts

Self-check 1: Why can’t a network with 10 linear layers (no activation) learn a non-linear decision boundary?

Self-check 2: What is the main disadvantage of using Sigmoid in all hidden layers of a deep network?

Self-check 3: When would you choose Leaky ReLU over ReLU?

Think about it: For a 3-class classification problem, what should the output layer activation be? How many output units?

Self-check answers (concise):
- 1: Composing 10 linear maps gives one linear map; the whole network is linear, so it can only learn linear boundaries.
- 2: Sigmoid saturates for large |z|, so σ’(z) ≈ 0 and gradients vanish; early layers get almost no update.
- 3: When you observe many “dead” ReLU neurons (e.g. activations always zero); Leaky ReLU gives a small gradient for z < 0 so neurons can recover.
- 4: Softmax; 3 output units (one logit per class), then Softmax gives 3 probabilities that sum to 1.


19. Likely Interview Questions

  • What is an activation function and why do we need it?
  • Why would a deep network without activations be equivalent to a single linear layer?
  • Compare Sigmoid, Tanh, and ReLU (formula, range, pros/cons).
  • What is the vanishing gradient problem and how do ReLU/Leaky ReLU help?
  • What is the dying ReLU problem? How do we mitigate it?
  • When do we use Sigmoid vs Softmax?
  • Why is Softmax used only at the output layer?
  • What is saturation? Which activations saturate?
  • Why is GELU used in Transformers?

20. Elevator Pitch

30 seconds:
An activation function turns a neuron’s weighted sum z into its output a = σ(z). It’s non-linear so that stacking many layers actually adds expressiveness instead of collapsing to one linear map. In hidden layers we usually use ReLU or GELU for good gradient flow; at the output we use Sigmoid for binary classification, Softmax for multi-class, or identity for regression. Sigmoid and tanh in deep hidden layers cause vanishing gradients; ReLU can cause “dying” neurons, which Leaky ReLU or GELU can mitigate.

2 minutes:
Activation functions apply a non-linear transform to each neuron’s pre-activation. Without them, any depth is equivalent to a single linear layer, so we couldn’t learn non-linear decision boundaries. For hidden layers, ReLU is the default: it’s simple, doesn’t saturate for z > 0, and avoids the vanishing gradient problem that sigmoid and tanh suffer from in deep nets. The downside of ReLU is dying neurons when z stays negative; Leaky ReLU or GELU address that. For the output layer, we match the task: Sigmoid for binary probability, Softmax for multi-class probability, identity for regression. In Transformers, GELU is common in feedforward blocks. When answering “why ReLU over sigmoid,” focus on gradient flow and saturation.


21. One-Page Cheat Sheet (Quick Revision)

Concept | Definition / rule
Activation function | a = σ(z); non-linear map applied to pre-activation z.
Why non-linearity | Without σ, depth = one linear layer; no expressiveness.
Sigmoid | 1/(1+e^{−z}); (0,1); binary output; saturates → vanishing gradient.
Tanh | (e^z−e^{−z})/(e^z+e^{−z}); (−1,1); zero-centered; saturates.
ReLU | max(0,z); [0,∞); default hidden; no saturation for z>0; dying ReLU for z<0.
Leaky ReLU | max(αz,z); small gradient for z<0; fewer dead neurons.
GELU | z Φ(z); smooth; used in Transformers.
Softmax | e^{z_i}/Σ_j e^{z_j}; (0,1), Σ=1; multi-class output only.
Hidden layer | ReLU / Leaky ReLU / GELU.
Output | Sigmoid (binary), Softmax (multi-class), identity (regression).
Vanishing gradient | Saturating σ (sigmoid/tanh) → small σ’(z) → tiny gradients in deep nets.
Dying ReLU | z<0 always → gradient 0 → neuron never updates.

22. Formula Card

Name | Formula / range
Pre-activation | z = w·x + b (or z = Wx + b for a layer)
Activation | a = σ(z)
Sigmoid | σ(z) = 1 / (1 + e^{−z}), range (0, 1)
Tanh | tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}), range (−1, 1)
ReLU | a = max(0, z), range [0, ∞)
Leaky ReLU | a = max(αz, z) (α ≈ 0.01)
Softmax | a_i = e^{z_i} / Σ_j e^{z_j}, Σᵢ aᵢ = 1
Gradient (backprop) | ∂L/∂z = (∂L/∂a) ⊙ σ’(z)
Sigmoid derivative | σ’(z) = σ(z)(1 − σ(z))

23. What’s Next and Revision Checklist

What’s Next

  • Backpropagation: Gradients ∂L/∂z depend on σ’(z); you’ll see how activation derivatives are used in the chain rule at every layer.
  • Loss Functions: Output activations (Sigmoid, Softmax) pair with cross-entropy; identity pairs with MSE for regression.
  • Optimizers: Training stability depends on gradient magnitude; activation choice affects how large or small those gradients are.
  • Architectures (CNN, RNN, Transformers): Each uses activations in specific places (e.g. ReLU in CNN, GELU in Transformer blocks).

Revision Checklist

Before an interview, ensure you can:

  1. Define activation function in one sentence (non-linear σ applied to z so that depth adds expressiveness).
  2. State why we need non-linearity (without it, network = one linear layer).
  3. Compare Sigmoid vs Tanh vs ReLU (formula, range, saturation, use case).
  4. Explain vanishing gradient and dying ReLU and which activations help.
  5. Choose activation for hidden vs output (ReLU/GELU hidden; Sigmoid/Softmax/identity output).
  6. Correct the trap: “sigmoid everywhere” (saturation in deep nets).
  7. Correct the trap: “ReLU has no drawbacks” (dying ReLU).
  8. Give a numerically stable Softmax trick (subtract max(z) before exp).

Related Notes

  • Artificial Neuron & Perceptron (where a = σ(z) lives)
  • Calculus for Deep Learning (derivatives for σ’(z))
  • Backpropagation (gradient flow through σ)
  • Loss Functions (Sigmoid/Softmax + cross-entropy)
  • Optimization Fundamentals (gradient magnitude and training)

End of Activation Functions study notes.