Deep Learning

Loss Functions

Admin User
March 15, 2026
40 min read


Organization: DataLogos
Date: 15 Mar, 2026

Loss Functions – Study Notes (Deep Learning)

Target audience: Beginners | Goal: End-to-end learning and interview-ready
Difficulty: Beginner-friendly (assumes basic algebra and idea of “prediction vs target”)
Estimated time: ~40 min read / ~1 hour with self-checks and exercises


Pre-Notes: Learning Objectives & Prerequisites

Learning Objectives

By the end of this note you will be able to:

  • Define a loss function and explain why we need it in training (feedback signal, optimization objective).
  • Describe the core role of the loss: measuring prediction error and providing gradients for backpropagation.
  • Compare regression losses (MSE, MAE, Huber) and classification losses (BCE, CCE, hinge) with formulas and use cases.
  • Explain the probabilistic view: MSE ↔︎ Gaussian, cross-entropy ↔︎ MLE for classification.
  • Choose an appropriate loss for task type (regression vs binary vs multi-class) and data properties (outliers, imbalance).
  • Answer common interview questions and avoid wrong vs correct traps (e.g., “MSE for classification,” “loss vs metric”).

Prerequisites

Before starting, you should know:

  • Model and prediction: A model takes input and produces a prediction (e.g., a number or a class probability).
  • Training goal: We want the model’s predictions to match the target (ground truth) as well as possible.
  • Basic algebra: Sums, squares, logarithms (for formulas).

If any of these are unfamiliar, review Artificial Neuron & Perceptron and the Deep Learning intro (forward pass, training loop) first.

Where This Fits

  • Builds on: Forward pass (model outputs predictions), idea of “right vs wrong” (targets).
  • Needed for: Backpropagation (gradient of the loss drives updates), Optimizers (minimize the loss), Activation functions (output layer + loss pair: e.g., Softmax + cross-entropy).

This topic is part of Phase 1: Mathematical Foundations and sits between forward pass and optimization.


1. What is a Loss Function?

A loss function (or cost function) is a scalar function that measures how wrong the model’s predictions are compared to the true targets. It takes predictions and targets as inputs and outputs a single number: the higher the loss, the worse the model is doing.

In simple words: the loss is the “score for how bad the model is” on the current batch of data. Training tries to minimize this score.

Simple Intuition

Think of a teacher grading an exam: each wrong answer adds to the “error score.” The loss function is like that grading scheme—it adds up all the mistakes (squared error, log error, etc.) into one number. The model’s job is to reduce that number by adjusting its parameters (weights).

Formal Definition (Interview-Ready)

A loss function L(ŷ, y) is a scalar-valued function that quantifies the discrepancy between the model’s predictions ŷ and the true targets y. Training minimizes the expected loss (or its empirical average over data) with respect to the model parameters, providing the objective for optimization and the gradient that backpropagation propagates.


In a Nutshell

The loss function turns “predictions vs targets” into one number we want to make smaller. It is the objective we minimize during training and the source of gradients for updating weights.


2. Why Do We Need Loss Functions?

Without a loss function, we would have no objective to optimize. The model would not know what “better” means—we need a numerical measure of error that (1) tells us how bad the current predictions are, and (2) gives us gradients so we can update weights in the right direction.

Old vs New Paradigm

Paradigm | Role of loss
Rule-based systems | No learning; no loss
Classical ML | Explicit loss (e.g., SVM hinge, log-loss)
Deep Learning | Same idea: loss drives all parameter updates

Key Reasons

  1. Objective for optimization: Gradient-based training (SGD, Adam) minimizes the loss; the loss defines what “good” means.
  2. Gradient signal: Backpropagation computes ∂L/∂(parameters); without a differentiable loss, we couldn’t train with gradient descent.
  3. Task alignment: Different tasks need different notions of error (e.g., squared error for regression, cross-entropy for classification).
  4. Probabilistic interpretation: Many losses correspond to maximum likelihood estimation under a chosen probability model (e.g., MSE ↔︎ Gaussian noise, cross-entropy ↔︎ categorical model).

Real-World Relevance

Domain | Why loss matters
Regression | MSE, MAE, or Huber match the noise model and robustness needs.
Binary classification | BCE with Sigmoid output gives valid gradients and MLE.
Multi-class | Categorical cross-entropy with Softmax is the standard.
Imbalanced data | Weighted loss or focal loss can emphasize rare classes.

3. Core Building Block: How the Loss Fits in Training

Where the Loss Sits

In one training step:

  1. Forward pass: Input x → model → prediction ŷ (e.g., logits or probabilities).
  2. Loss computation: L = L(ŷ, y) using the chosen loss function.
  3. Backpropagation: Compute ∂L/∂(all parameters).
  4. Update: Optimizer uses these gradients to update parameters (e.g., θ ← θ − η ∇L).

The loss is the bridge between “what the model predicted” and “how we change the model.”

Mathematical Role

  • Inputs: ŷ (predictions) and y (targets). For a batch, we usually average the per-sample loss: L = (1/N) Σᵢ L(ŷᵢ, yᵢ).
  • Output: A single scalar L ≥ 0 (for standard losses). We minimize L w.r.t. model parameters.
  • Gradient: ∂L/∂ŷ is used by backpropagation; the chain rule then gives ∂L/∂θ.

Interview-ready: The core building block is L(ŷ, y). It must be differentiable (at least where we evaluate it) so that ∂L/∂θ exists. The choice of loss (MSE, BCE, CCE) determines both the optimization objective and the probabilistic model we implicitly assume.
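As a concrete check of the roles above, a short sketch (toy numbers are illustrative) that computes the batch-mean MSE and verifies its gradient ∂L/∂ŷᵢ = (2/N)(ŷᵢ − yᵢ) against finite differences:

```python
import numpy as np

def mse(y_pred, y_true):
    # Batch loss: average of per-sample squared errors.
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 2.0])

# Analytic gradient of the batch-mean MSE w.r.t. each prediction:
# dL/dŷᵢ = (2/N)(ŷᵢ − yᵢ)
grad_analytic = 2 * (y_pred - y_true) / len(y_true)

# Finite-difference check of the same gradient.
eps = 1e-6
grad_numeric = np.zeros_like(y_pred)
for i in range(len(y_pred)):
    up, down = y_pred.copy(), y_pred.copy()
    up[i] += eps
    down[i] -= eps
    grad_numeric[i] = (mse(up, y_true) - mse(down, y_true)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```

In a real network the chain rule carries this ∂L/∂ŷ backward to ∂L/∂θ; here ŷ is perturbed directly to keep the check small.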


Diagram: Loss in the Training Loop

Loss in the training loop: Forward pass → Loss computation → Backpropagation → Parameter update.

Caption: The loss is computed after the forward pass and before backprop; its gradient drives the parameter update.


In a Nutshell

Loss = L(ŷ, y). It is computed after the forward pass, then backpropagation uses it to get ∂L/∂θ, and the optimizer minimizes L by updating θ.


Think about it: Why do we usually use the average loss over a batch instead of the sum? What would change if we used the sum?


4. Process: Where Loss Fits in the Training Pipeline

Step-by-Step (One Iteration)

  1. Sample a batch of (x, y).
  2. Forward pass: Compute ŷ = model(x).
  3. Loss: Compute L = (1/batch_size) Σ L(ŷᵢ, yᵢ) (e.g., MSE or cross-entropy).
  4. Backward pass: Compute ∂L/∂θ (backpropagation).
  5. Optimizer step: Update θ to decrease L (e.g., θ ← θ − η ∇L).
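The five steps above can be sketched end-to-end for a one-parameter linear model with MSE (the data, learning rate, and step count are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + small Gaussian noise.
x = rng.normal(size=(32, 1))
y = 3.0 * x + 0.1 * rng.normal(size=(32, 1))

w = np.zeros((1, 1))   # model parameter θ
lr = 0.1               # learning rate η

for step in range(100):
    y_hat = x @ w                             # 2. forward pass
    loss = np.mean((y_hat - y) ** 2)          # 3. batch-mean MSE
    grad_w = 2 * x.T @ (y_hat - y) / len(x)   # 4. backward pass (chain rule by hand)
    w -= lr * grad_w                          # 5. optimizer step: θ ← θ − η ∇L

print(w.item())  # close to 3.0
```

The loss never updates anything itself; its gradient is what the update rule consumes.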

Loss–Output Pairing (Critical for Classification)

Task | Output layer | Loss function
Regression | Identity (linear) | MSE, MAE, or Huber
Binary classification | Sigmoid | Binary Cross-Entropy (BCE)
Multi-class classification | Softmax | Categorical Cross-Entropy (CCE)

Interview tip: Using MSE for classification (with 0/1 targets) is possible but not recommended: cross-entropy has better gradient behavior and matches the probabilistic interpretation (MLE for classification).


In a Nutshell

Training loop: Forward → Loss → Backprop → Update. Match the loss to the output and task: regression → MSE/MAE; binary → BCE + Sigmoid; multi-class → CCE + Softmax.


5. Key Sub-Topics: Regression vs Classification, Probabilistic View, Gradient Behavior

Regression vs Classification Losses

Aspect | Regression losses | Classification losses
Target | Continuous (real number) | Discrete (class label)
Prediction | Scalar ŷ | Probability vector or logits
Examples | MSE, MAE, Huber | BCE, CCE, hinge
Use | Price, demand, sensor | Spam, image class, NER

Probabilistic View (MLE)

  • MSE: Minimizing MSE is equivalent to maximum likelihood under a Gaussian noise model: y = ŷ + ε, ε ~ N(0, σ²).
  • Cross-entropy: Minimizing cross-entropy is MLE for a categorical (classification) model: we maximize the log-probability of the correct class.

This links loss choice to assumptions about the data distribution.
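The MSE ↔ Gaussian link can be checked numerically: for fixed σ, the average Gaussian negative log-likelihood equals MSE/(2σ²) plus a constant, so both have the same minimizer (toy values are illustrative):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 1.8, 3.5])
sigma = 1.0

# Average Gaussian negative log-likelihood of y given ŷ:
# NLL = (1/N) Σ [ (yᵢ − ŷᵢ)² / (2σ²) + ½ log(2πσ²) ]
nll = np.mean((y_true - y_pred) ** 2 / (2 * sigma**2)
              + 0.5 * np.log(2 * np.pi * sigma**2))

mse = np.mean((y_true - y_pred) ** 2)

# NLL is MSE/(2σ²) plus a constant, so minimizing one minimizes the other.
const = 0.5 * np.log(2 * np.pi * sigma**2)
print(np.isclose(nll, mse / (2 * sigma**2) + const))  # True
```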

Gradient Behavior

  • MSE: Gradient ∂L/∂ŷ is proportional to (ŷ − y). Large errors → large gradients (sensitive to outliers).
  • MAE: Gradient magnitude is constant (more robust to outliers; less smooth at zero).
  • Cross-entropy: With Softmax, ∂L/∂logits has a simple form (p − y) (probability minus one-hot target), which gives stable gradients for classification.
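The (p − y) form for Softmax + cross-entropy can be verified against finite differences (the logits below are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def ce(z, y_onehot):
    # Cross-entropy of softmax(logits) against a one-hot target.
    return -np.sum(y_onehot * np.log(softmax(z)))

z = np.array([2.0, 0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])    # true class = 1

grad_analytic = softmax(z) - y   # the claimed (p − y) form

# Finite-difference check of dL/dlogits.
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(len(z)):
    up, down = z.copy(), z.copy()
    up[i] += eps
    down[i] -= eps
    grad_numeric[i] = (ce(up, y) - ce(down, y)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```

Note the gradient components sum to zero (probabilities and the one-hot target both sum to 1), so the update redistributes probability mass between classes.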

In a Nutshell

Regression losses (MSE, MAE) measure distance from a continuous target; classification losses (BCE, CCE) measure probability assigned to the correct class. Many standard losses correspond to MLE under a specific probabilistic model.


6. Comparison: Loss Functions at a Glance

Loss | Formula (per sample, idea) | Task | Pros | Cons / notes
MSE | (y − ŷ)² | Regression | Smooth, differentiable; MLE for Gaussian | Sensitive to outliers
MAE | |y − ŷ| | Regression | Robust to outliers | Non-smooth at 0
Huber | Squared for small |e|, linear for large | Regression | Robust + smooth | Extra hyperparameter δ
BCE | −[y log p + (1−y) log(1−p)] | Binary class | MLE; good gradients | Only for binary
CCE | −log p_true or −Σ yₖ log pₖ | Multi-class | Standard for classification | Needs Softmax output
Hinge | max(0, 1 − y·ŷ) | Binary/margin | SVM-style; sparse | Not a probability model

7. Common Types and Variants

1. Mean Squared Error (MSE)

  • What: Average of squared differences between prediction and target.
  • Formula: L = (1/N) Σᵢ (yᵢ − ŷᵢ)².
  • Use case: Regression when errors are roughly Gaussian; default for scalar outputs.
  • Example: House price prediction, demand forecasting.

2. Mean Absolute Error (MAE)

  • What: Average of absolute differences.
  • Formula: L = (1/N) Σᵢ |yᵢ − ŷᵢ|.
  • Use case: Regression when outliers are a concern; more robust than MSE.
  • Example: Sensor readings with occasional spikes.

3. Huber Loss

  • What: Squared error for small |e|, linear for large |e| (smooth blend of MSE and MAE).
  • Formula: For residual e = y − ŷ, L(e) = ½e² if |e| ≤ δ, else δ(|e| − ½δ).
  • Use case: Regression with possible outliers but smooth gradients.
  • Example: Robust regression in control or finance.
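A minimal implementation of the piecewise definition above (the data values are illustrative; note how the outlier is penalized linearly, not quadratically):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for small residuals, linear beyond |e| = delta.
    e = y_true - y_pred
    small = np.abs(e) <= delta
    return np.mean(np.where(small, 0.5 * e**2,
                            delta * (np.abs(e) - 0.5 * delta)))

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 10.0])   # last prediction is an outlier
# |e| = 0.5 → 0.125; |e| = 1.0 → 0.5; |e| = 10 → 1·(10 − 0.5) = 9.5
print(huber(y_true, y_pred))  # (0.125 + 0.5 + 9.5) / 3 = 3.375
```

With a large δ, Huber behaves like MSE everywhere; with a small δ it behaves like (scaled) MAE for most residuals.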

4. Binary Cross-Entropy (BCE)

  • What: Negative log-likelihood of the correct binary class; p = P(positive).
  • Formula: L = −[y log(p) + (1−y) log(1−p)]; batch = average over samples.
  • Use case: Binary classification with Sigmoid output.
  • Example: Spam detection, click prediction.

5. Categorical Cross-Entropy (CCE)

  • What: Negative log-probability of the true class; p = Softmax(logits).
  • Formula: L = −Σₖ yₖ log(pₖ) (one-hot y) or L = −log(p_true).
  • Use case: Multi-class classification with Softmax output.
  • Example: Image classification (e.g., ImageNet 1000 classes).

6. Focal Loss (Advanced)

  • What: Down-weights easy examples (high confidence correct); emphasizes hard examples.
  • Use case: Heavily imbalanced detection/classification (e.g., object detection).
  • Example: RetinaNet for object detection with many background boxes.

7. Hinge Loss

  • What: max(0, 1 − y·ŷ) for binary labels y ∈ {−1, +1}.
  • Use case: Margin-based learning (SVM-style); not a probability loss.
  • Example: Linear SVM; sometimes in neural nets for margin.

8. FAQs & Common Student Struggles

Q1. What is the difference between loss and metric?

Loss is what we optimize (minimize) during training; it must be differentiable so we can compute gradients. Metric (e.g., accuracy, F1, RMSE) is what we report and use for model selection; it may be non-differentiable or discrete.

Example: We minimize cross-entropy loss; we report accuracy on the validation set.

Interview tip: We often choose a metric for business success (e.g., recall) but train on a loss that is differentiable and correlated with that metric (e.g., cross-entropy or weighted BCE).


Q2. Can we use MSE for classification?

We can (e.g., MSE with 0/1 targets), but we shouldn’t as the default. Cross-entropy gives better gradients and corresponds to MLE for classification. MSE for 0/1 targets has flat gradients when the model is very wrong, slowing learning.

Correct choice: Binary classification → BCE + Sigmoid; multi-class → CCE + Softmax.


Q3. Why is log used in cross-entropy?

The log turns the product of per-sample likelihoods into a sum (log-likelihood), which is easier to optimize and numerically stable. Minimizing negative log-likelihood is equivalent to maximizing likelihood (MLE). The log also produces strong gradients when the model is confident and wrong (large penalty), which helps learning.


Q4. What is the relationship between loss and activation at the output?

The output activation must produce values that the loss expects:

  • Regression (MSE/MAE): Output = identity (any real number).
  • Binary (BCE): Output = Sigmoid (probability in (0, 1)).
  • Multi-class (CCE): Output = Softmax (probabilities summing to 1).

Using the wrong pair (e.g., ReLU output + BCE) gives incorrect gradients or semantics.


Q5. How do we handle imbalanced classes?

  • Weighted loss: Weight each sample by inverse frequency or a custom weight (e.g., higher weight for rare class).
  • Focal loss: Down-weight easy examples so the model focuses on hard/rare ones.
  • Resampling: Oversample minority or undersample majority (affects the effective loss distribution).
  • Threshold tuning: Keep cross-entropy loss but choose a decision threshold to optimize a metric (e.g., recall).
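A sketch of the weighted-loss option: BCE with a larger weight on the positive class (the weight and data below are illustrative; PyTorch exposes this idea as the pos_weight argument of BCEWithLogitsLoss):

```python
import numpy as np

def weighted_bce(y_true, p, pos_weight, eps=1e-7):
    # BCE where each positive sample's term is scaled by pos_weight.
    p = np.clip(p, eps, 1 - eps)
    per_sample = -(pos_weight * y_true * np.log(p)
                   + (1 - y_true) * np.log(1 - p))
    return np.mean(per_sample)

y_true = np.array([1.0, 0.0, 0.0, 0.0])   # rare positive class
p      = np.array([0.3, 0.1, 0.2, 0.1])   # model under-predicts the positive

plain    = weighted_bce(y_true, p, pos_weight=1.0)
weighted = weighted_bce(y_true, p, pos_weight=5.0)
print(weighted > plain)  # True: the missed positive now costs more
```

A common heuristic is pos_weight ≈ (number of negatives) / (number of positives), though it is usually tuned against the target metric.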

Q6. Why is MSE sensitive to outliers?

MSE uses squared error, so large errors contribute much more than small ones (e.g., 10² = 100 vs 1² = 1). A few outliers can dominate the loss and pull the model toward them. MAE or Huber reduce this effect.
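A quick numerical demonstration (illustrative values): one large miss multiplies MSE by orders of magnitude while only shifting MAE by a couple of units:

```python
import numpy as np

y_true  = np.array([1.0, 2.0, 3.0, 4.0])
clean   = np.array([1.1, 2.1, 2.9, 4.1])    # small errors everywhere
outlier = np.array([1.1, 2.1, 2.9, 14.0])   # one 10-unit miss

mse = lambda a, b: np.mean((a - b) ** 2)
mae = lambda a, b: np.mean(np.abs(a - b))

print(mse(y_true, clean), mse(y_true, outlier))  # the outlier dominates MSE
print(mae(y_true, clean), mae(y_true, outlier))  # MAE grows only linearly
```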


Q7. Do we minimize loss on training or validation data?

We minimize loss on the training data (or minibatches). Validation loss (and metrics) are used to monitor generalization and early stopping or model selection—we do not update parameters to minimize validation loss directly (that would be training on the validation set).


Q8. What is label smoothing?

Label smoothing replaces one-hot targets (e.g., [0, 0, 1, 0]) with smoothed values (e.g., [0.025, 0.025, 0.925, 0.025]) so the model doesn’t push probabilities to exactly 0 or 1. It can improve calibration and generalization and is common in modern NLP/vision.
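A minimal sketch of label smoothing with factor ε = 0.1 over K classes: the true class gets 1 − ε + ε/K and every class gets ε/K, so the smoothed targets still sum to 1:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    # Move eps of the probability mass from the one-hot target
    # to a uniform distribution over all K classes.
    k = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / k

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))                           # [0.025 0.025 0.925 0.025]
print(np.isclose(smooth_labels(y).sum(), 1.0))    # True: still a distribution
```

Cross-entropy against smoothed targets then penalizes over-confident predictions, since the optimum is no longer a probability of exactly 1 for the true class.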


9. Applications (With How They Are Achieved)

1. Regression (e.g., house price, demand)

Applications: Price prediction, demand forecasting, sensor prediction, time-series forecasting.

How the loss achieves this:

  • MSE (or MAE/Huber) defines “error” as squared (or absolute) difference; minimizing it pushes ŷ toward y.
  • Gradient of the loss w.r.t. ŷ (and then w.r.t. θ) tells the optimizer how to change parameters to reduce error.

Example: An MLP with linear output and MSE loss trained on (features, price) pairs to predict house prices.


2. Binary Classification (e.g., spam, fraud)

Applications: Spam detection, click-through prediction, fraud detection, medical screening.

How the loss achieves this:

  • BCE with Sigmoid output gives a valid probability; minimizing BCE is MLE for the Bernoulli model.
  • Gradients are well-behaved (no saturation like MSE for 0/1), so training is stable.

Example: Binary classifier: two-layer MLP with ReLU, one output with Sigmoid, trained with BCE for “is this email spam?”


3. Multi-Class Classification (e.g., image class, intent)

Applications: Image classification (ImageNet), intent classification, document tagging.

How the loss achieves this:

  • CCE with Softmax output assigns a probability to each class; minimizing CCE maximizes the probability of the correct class.
  • One loss value per sample (negative log of the correct class probability) is averaged over the batch.

Example: ResNet for ImageNet: conv layers + linear layer with 1000 units + Softmax, trained with CCE.


4. Imbalanced Detection (e.g., rare objects)

Applications: Object detection (many background, few objects), defect detection, rare disease screening.

How the loss achieves this:

  • Focal loss or weighted cross-entropy reduces the contribution of easy (e.g., background) examples so the model focuses on hard/rare positives.
  • Loss design directly addresses the imbalance instead of relying only on sampling.

Example: RetinaNet uses focal loss to train on images with many background anchor boxes and few object boxes.


10. Advantages and Limitations (With Examples)

Advantages

1. Single objective for training
The loss summarizes all errors into one number we can minimize with gradient descent.

Example: We don’t optimize “accuracy” directly (it’s not differentiable); we minimize cross-entropy, which is correlated with accuracy and differentiable.


2. Probabilistic interpretation
MSE and cross-entropy correspond to MLE under Gaussian and categorical models, linking loss choice to assumptions.

Example: Using MSE for regression assumes additive Gaussian noise; using CCE for classification assumes a categorical distribution over classes.


3. Task-specific design
We can choose a loss that matches the task (regression vs classification) and data (robustness to outliers, imbalance).

Example: Huber for regression with outliers; focal loss for detection with many negatives.


4. Gradient signal
A well-chosen loss gives useful gradients (e.g., cross-entropy for classification) so that backpropagation updates parameters effectively.

Example: BCE with Sigmoid gives (p − y)-like gradients; MSE for 0/1 gives flat gradients when very wrong.


Limitations

1. Loss ≠ metric
We optimize the loss, but the business cares about metrics (accuracy, recall, etc.). They don’t always align perfectly.

Example: Minimizing cross-entropy doesn’t guarantee the best F1 or recall; we may need threshold tuning or weighted loss.


2. Sensitive to scaling and outliers (e.g., MSE)
MSE is sensitive to outliers; MAE is non-smooth at zero. Choice of loss (and scaling of targets) affects stability and robustness.

Example: Predicting house prices in dollars (large numbers) vs thousands; scaling targets can change gradient scale and learning.


3. Assumptions (e.g., Gaussian, categorical)
Standard losses assume a particular noise or distribution model. Wrong assumption can hurt (e.g., MSE with heavy-tailed noise).

Example: If regression errors have heavy tails, MSE can be suboptimal; MAE or Huber may be better.


4. Class imbalance and rare events
Plain cross-entropy can be dominated by the majority class; we need weighting, focal loss, or resampling.

Example: Fraud detection with 0.1% positive rate: unweighted BCE lets the model ignore positives; weighted BCE or focal loss helps.


11. Interview-Oriented Key Takeaways

  • A loss function L(ŷ, y) measures prediction error; training minimizes it and uses its gradient for backpropagation.
  • Regression: MSE (default), MAE (robust), Huber (robust + smooth). Classification: BCE + Sigmoid (binary), CCE + Softmax (multi-class).
  • Loss is what we optimize (must be differentiable); metric is what we report (accuracy, F1, etc.).
  • MSE ↔︎ MLE under Gaussian noise; cross-entropy ↔︎ MLE for classification. Don’t use MSE for classification by default—use cross-entropy.
  • Output layer and loss must match: identity + MSE/MAE for regression; Sigmoid + BCE for binary; Softmax + CCE for multi-class.
  • For imbalance, use weighted loss, focal loss, or resampling; for outliers in regression, consider MAE or Huber.

12. Common Interview Traps

Trap 1: “We use MSE for all tasks.”

Wrong: MSE is the best loss for everything.

Correct: MSE is for regression. For classification, use BCE (binary) or CCE (multi-class) with Sigmoid or Softmax. MSE for 0/1 targets has poor gradient behavior and doesn’t match the probabilistic model for classification.


Trap 2: “Loss and metric are the same.”

Wrong: We optimize accuracy (or F1) directly.

Correct: We optimize a differentiable loss (e.g., cross-entropy). We evaluate and report a metric (accuracy, F1, recall). The metric may be non-differentiable; the loss is chosen to be correlated with the metric and to provide good gradients.


Trap 3: “Cross-entropy is only for multi-class.”

Wrong: Cross-entropy means categorical (many classes) only.

Correct: Binary cross-entropy (BCE) is for binary classification (one Sigmoid output). Categorical cross-entropy (CCE) is for multi-class (Softmax over many classes). Both are “cross-entropy” in the sense of negative log-likelihood of the correct class.


Trap 4: “We should minimize validation loss.”

Wrong: Training should minimize the validation loss to generalize better.

Correct: We minimize training loss (on training data). Validation loss is for monitoring and model selection (e.g., early stopping). Minimizing validation loss directly would mean training on the validation set and can cause overfitting to it.


Trap 5: “MAE is always better than MSE for regression.”

Wrong: MAE is better, so always use it.

Correct: MAE is more robust to outliers; MSE is smooth and corresponds to Gaussian noise. Use MAE (or Huber) when outliers are a concern; use MSE when errors are roughly Gaussian and you want smooth gradients.


Trap 6: “The output layer doesn’t need to match the loss.”

Wrong: Any output activation is fine as long as we have a loss.

Correct: The output must match what the loss expects: Sigmoid for BCE, Softmax for CCE, identity for MSE/MAE. Wrong pairing (e.g., ReLU + BCE) gives incorrect gradients or invalid probabilities.


13. Simple Real-Life Analogy

A loss function is like a scoreboard for mistakes: every wrong prediction adds a “penalty” (squared error, log error, etc.) to the total. The model’s job is to lower that score by adjusting its parameters. The gradient of the loss is like the coach’s feedback—it tells each parameter whether to go up or down and by how much.


14. Loss Functions in System Design – Interview Traps (If Applicable)

Trap 1: Optimizing the wrong objective in production

Wrong thinking: We use the same loss in production as in the paper (e.g., only cross-entropy).

Correct thinking: Business success may be a metric (e.g., recall at fixed precision, fairness). The loss we train on should align with that (e.g., weighted BCE, focal loss, or threshold tuning). Define what “good” means in production, then choose or tune the loss (and decision rule) accordingly.

Example: Fraud detection may require high recall; train with weighted BCE (higher weight on positive class) or tune the decision threshold after training.


Trap 2: Ignoring scale of targets (regression)

Wrong thinking: Use raw targets (e.g., price in dollars) with MSE without normalization.

Correct thinking: Large target values can make the loss and gradients large, affecting learning rate and stability. Normalize or standardize targets (or use a scaled loss), or tune learning rate for the scale.

Example: Predicting revenue in millions vs in dollars changes gradient magnitude; normalization or scaling avoids numerical issues.


Trap 3: No handling of class imbalance in production data

Wrong thinking: Train with default cross-entropy; production has a different class balance.

Correct thinking: If production has different class balance (or we care more about the rare class), use weighted loss, focal loss, or resampling during training so the model doesn’t ignore the minority class. Re-evaluate metrics on a distribution that reflects production.

Example: Training on 50–50 balanced data while production is 95–5 can hurt; use weighted loss or balance-aware evaluation.


15. Interview Gold Line

Loss functions define what “wrong” means: they turn predictions and targets into a single number we minimize, and their gradient drives every parameter update. Match the loss to the task (regression vs classification) and to the output (MSE with linear, BCE with Sigmoid, CCE with Softmax)—wrong pairing leads to bad gradients or wrong semantics.


16. Code Snippets (Python)

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of squared residuals (regression).
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: robust alternative to MSE.
    return np.mean(np.abs(y_true - y_pred))

def bce(y_true, p, eps=1e-7):
    # Binary cross-entropy; p = predicted P(positive), clipped to avoid log(0).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def cce_one_hot(y_true, p_probs, eps=1e-7):
    # Categorical cross-entropy: one-hot targets, Softmax probabilities.
    p = np.clip(p_probs, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(p), axis=-1))

# For single-label multi-class: y_true is an integer class index per sample
def cce_sparse(y_true_int, p_probs, eps=1e-7):
    p = np.clip(p_probs, eps, 1 - eps)
    return -np.mean(np.log(p[np.arange(len(y_true_int)), y_true_int]))

Interview tip: In PyTorch use F.mse_loss; F.binary_cross_entropy (expects probabilities) or F.binary_cross_entropy_with_logits (expects raw logits); and F.cross_entropy (expects logits, not probabilities). In TensorFlow use tf.keras.losses.MSE, binary_crossentropy, sparse_categorical_crossentropy (set from_logits=True when passing logits). When working with probabilities directly, clip them to [ε, 1−ε] to avoid log(0).
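The snippets above take probabilities as input. Frameworks usually compute the multi-class loss directly from logits via the log-sum-exp trick, which avoids both log(0) and overflow; a sketch of that route:

```python
import numpy as np

def cross_entropy_from_logits(logits, y_true_int):
    # log pₖ = zₖ − logsumexp(z): never exponentiate-then-log,
    # so large logits cannot overflow and no probability hits 0.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    n = len(y_true_int)
    return -np.mean(log_probs[np.arange(n), y_true_int])

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 5.0, -1.0]])
labels = np.array([0, 1])
print(round(cross_entropy_from_logits(logits, labels), 4))  # 0.2131
```

Compare with cce_sparse above: exponentiating huge logits first would overflow, while subtracting the row maximum keeps every intermediate value finite.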


17. Self-Check and “Think About It” Prompts

Self-check 1: Why do we minimize the loss instead of maximizing accuracy during training?

Self-check 2: For binary classification, which pair is correct: (Sigmoid + BCE) or (Sigmoid + MSE)?

Self-check 3: What is one advantage of MAE over MSE for regression? One disadvantage?

Think about it: You have a 10-class problem with 90% of samples in class 0. What loss or data strategy might you use?

Self-check answers (concise):
- 1: Accuracy is not differentiable (discrete 0/1); we need a smooth loss (e.g., cross-entropy) so we can compute gradients and use gradient descent.
- 2: Sigmoid + BCE is correct for binary classification. Sigmoid + MSE is possible but not recommended (worse gradients, wrong probabilistic interpretation).
- 3: MAE is robust to outliers (linear penalty). Disadvantage: non-smooth at zero (gradient undefined at residual = 0).
- Think about it: Use weighted cross-entropy (higher weight for minority classes), focal loss, or oversample minority classes / undersample class 0 so the model doesn’t collapse to predicting class 0 always.


18. Likely Interview Questions

  • What is a loss function and why do we need it?
  • What is the difference between loss and metric?
  • When do we use MSE vs MAE vs Huber for regression?
  • Why use cross-entropy instead of MSE for classification?
  • What output activation do we use with BCE? With CCE?
  • Explain the relationship between minimizing cross-entropy and MLE.
  • How would you handle imbalanced classes in the loss?
  • Why is MSE sensitive to outliers?
  • What is label smoothing and why use it?

19. Elevator Pitch

30 seconds:
A loss function measures how wrong the model’s predictions are compared to the targets and gives one number we minimize during training. For regression we use MSE (or MAE/Huber); for binary classification we use binary cross-entropy with Sigmoid output; for multi-class we use categorical cross-entropy with Softmax. The loss must be differentiable so we get gradients for backpropagation. We optimize the loss; we report metrics like accuracy or F1. Don’t use MSE for classification—use cross-entropy.

2 minutes:
The loss function is the objective we minimize when training a model. It takes predictions and targets and outputs a scalar: the higher, the worse. Training computes this loss after the forward pass, then backpropagation uses it to get gradients and the optimizer updates parameters. For regression, MSE is standard (smooth, MLE under Gaussian noise); MAE and Huber are used when we care about robustness to outliers. For classification, we use cross-entropy: binary cross-entropy with Sigmoid for two classes, categorical cross-entropy with Softmax for many classes. Cross-entropy corresponds to maximum likelihood and gives better gradients than MSE for 0/1 targets. The output layer must match the loss—Sigmoid for BCE, Softmax for CCE, identity for MSE. Loss is what we optimize; metrics (accuracy, F1) are what we report and may be non-differentiable. For imbalanced data we use weighted or focal loss; for regression with outliers we use MAE or Huber.


20. One-Page Cheat Sheet (Quick Revision)

Concept | Definition / rule
Loss function | L(ŷ, y); scalar measure of prediction error; we minimize it.
Role | Objective for optimization; source of gradients for backprop.
MSE | (1/N) Σ (y − ŷ)²; regression; smooth; sensitive to outliers.
MAE | (1/N) Σ |y − ŷ|; regression; robust; non-smooth at 0.
Huber | Squared for small |e|, linear for large; regression; robust + smooth.
BCE | −[y log p + (1−y) log(1−p)]; binary classification; Sigmoid output.
CCE | −Σ yₖ log pₖ or −log p_true; multi-class; Softmax output.
Loss vs metric | Loss = optimize (differentiable); metric = report (e.g., accuracy).
Output + loss | Regression: identity + MSE/MAE. Binary: Sigmoid + BCE. Multi-class: Softmax + CCE.
MSE for classification | Avoid; use cross-entropy (better gradients, MLE).
Imbalance | Weighted loss, focal loss, resampling.
Outliers (regression) | MAE or Huber.

21. Formula Card

Name | Formula
MSE | L = (1/N) Σᵢ (yᵢ − ŷᵢ)²
MAE | L = (1/N) Σᵢ |yᵢ − ŷᵢ|
BCE (one sample) | L = −[y log(p) + (1−y) log(1−p)]
CCE (one sample, one-hot y) | L = −Σₖ yₖ log(pₖ)
CCE (one sample, true class c) | L = −log(p_c)
Hinge (one sample) | L = max(0, 1 − y·ŷ), y ∈ {−1, +1}
Batch loss | L = (1/N) Σᵢ L(ŷᵢ, yᵢ)
MLE link | Minimize NLL ⇔ maximize likelihood; MSE ⇔ Gaussian, CE ⇔ categorical.

22. What’s Next and Revision Checklist

What’s Next

  • Backpropagation: The gradient of the loss ∂L/∂θ is computed by backpropagation; you’ll see how the loss gradient flows backward through the network.
  • Optimizers: They minimize the loss using θ ← θ − η ∇L (or variants like Adam); loss is the objective they consume.
  • Activation Functions: Output activations (Sigmoid, Softmax, identity) must pair with the correct loss (BCE, CCE, MSE).
  • Optimization Fundamentals: Convex vs non-convex; gradient descent minimizes the loss surface.

Revision Checklist

Before an interview, ensure you can:

  1. Define loss function in one sentence (scalar measure of prediction error we minimize).
  2. State why we need it (objective for optimization; gradient source for backprop).
  3. Compare MSE vs MAE vs Huber (formula, robustness, smoothness).
  4. Compare BCE vs CCE (binary vs multi-class; Sigmoid vs Softmax).
  5. Explain loss vs metric (optimize vs report; differentiability).
  6. Match output and loss (identity+MSE, Sigmoid+BCE, Softmax+CCE).
  7. Correct the trap: “MSE for classification” (use cross-entropy).
  8. Correct the trap: “minimize validation loss” (minimize training loss; use validation for monitoring).
  9. Suggest handling for imbalanced classes (weighted loss, focal, resampling) and regression outliers (MAE, Huber).
Related Topics

  • Artificial Neuron & Perceptron (output ŷ)
  • Activation Functions (Sigmoid, Softmax at output)
  • Backpropagation (∂L/∂θ from loss)
  • Optimization Fundamentals (minimizing L(θ))
  • Optimizers (SGD, Adam—how θ is updated using ∇L)

End of Loss Functions study notes.