Gradients for multi-class classification with Softmax

In a multi-class classification problem, the output (also called the label or class) takes one of a finite set of discrete values. In this post, the system model for multi-class classification with a linear layer followed by a softmax layer is defined. The softmax function transforms the output of the linear layer into values lying between 0 and 1, which can be interpreted as probability scores.

Next, the loss function based on categorical cross entropy is explained and the gradients of the model parameters are derived using the chain rule. The analytically computed gradients are then compared with those obtained from the deep learning framework PyTorch. Finally, we implement a training loop using gradient descent for a toy multi-class classification task with 2D Gaussian-distributed data.

As always, the CS229 Lecture Notes and the notation used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng form the key references.

Model

Let us take the example of estimating the class label $y$, which takes one of $k$ discrete values, based on a feature vector $\mathbf{x}$ having $n_x$ features, i.e. $\mathbf{x} \in \mathbb{R}^{n_x}$, given $m$ training examples.

Linear Layer

Let us assume that the variable $\mathbf{z}$ is defined as a linear function of $\mathbf{x}$. For a single training example, this can be written as:

$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$

where,

  • $\mathbf{z}$ is the output vector of size $k$, i.e. $\mathbf{z} \in \mathbb{R}^{k}$,
  • $\mathbf{W}$ is the parameter matrix of size $k \times n_x$, i.e. $\mathbf{W} \in \mathbb{R}^{k \times n_x}$,
  • $\mathbf{x}$ is the feature vector of size $n_x$, i.e. $\mathbf{x} \in \mathbb{R}^{n_x}$, and
  • $\mathbf{b}$ is the parameter vector of size $k$, i.e. $\mathbf{b} \in \mathbb{R}^{k}$.

Note:

This is the definition of the Linear layer in PyTorch (refer entry on Linear layer). It is alternatively called a Dense layer in TensorFlow (refer entry on Dense) and a fully connected layer in the deep learning literature.
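As a quick illustration, a minimal sketch of the same mapping using torch.nn.Linear (the sizes n_x = 4 and k = 3 below are arbitrary example values):

```python
import torch
import torch.nn as nn

n_x, k = 4, 3                  # number of features and number of classes (example values)
linear = nn.Linear(n_x, k)     # holds W of shape (k, n_x) and b of shape (k,)

x = torch.randn(n_x)           # a single example with n_x features
z = linear(x)                  # computes z = W x + b
print(z.shape)                 # torch.Size([3])
```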

Softmax layer

To map the real-valued vector $\mathbf{z}$ to a probability vector $\mathbf{a}$ whose elements sum up to 1, we use the softmax function (refer wiki entry on SoftMax). The softmax function is defined as,

$a_i = \dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \quad i = 1, \ldots, k$

Equivalently, this can be written as,

$\mathbf{a} = \mathrm{softmax}(\mathbf{z})$

where each $a_i$ represents the normalized exponential of the corresponding $z_i$. This ensures that

  • each element lies in the range [0, 1], i.e. $0 \le a_i \le 1$, and
  • the sum of all the elements adds up to 1, i.e. $\sum_{i=1}^{k} a_i = 1$.

This makes $\mathbf{a}$ interpretable as a probability distribution over the $k$ classes.
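For instance, a small sketch using torch.softmax on an arbitrary logit vector, showing that the output lies in [0, 1] and sums to 1:

```python
import torch

z = torch.tensor([2.0, 1.0, 0.1])   # output of the linear layer (logits), k = 3
a = torch.softmax(z, dim=0)         # softmax over the k classes

print(a)                            # every element lies in [0, 1]
print(a.sum())                      # tensor(1.) up to floating point error
```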

Derivatives

Derivative of Softmax layer

To compute the derivative of the softmax output $\mathbf{a}$ with respect to its input $\mathbf{z}$, we need to find the Jacobian matrix $\frac{\partial \mathbf{a}}{\partial \mathbf{z}}$. The Jacobian contains all partial derivatives of each output component $a_i$ with respect to each input component $z_j$.

To find the derivative for all cases, let us split into two scenarios, i.e. $i = j$ and $i \neq j$.

Derivative for case i=j

Using the product rule of derivatives,

$\dfrac{\partial a_i}{\partial z_i} = \dfrac{\partial}{\partial z_i}\left(\dfrac{e^{z_i}}{\sum_{l=1}^{k} e^{z_l}}\right) = a_i\,(1 - a_i)$

Derivative for case i ≠ j

$\dfrac{\partial a_i}{\partial z_j} = -a_i\, a_j$

Final output (matrix form)

Based on the above derivations, the derivative is defined as:

$\dfrac{\partial a_i}{\partial z_j} = a_i\,(\delta_{ij} - a_j), \quad \text{where } \delta_{ij} = 1 \text{ if } i = j \text{ and } 0 \text{ otherwise}$

In matrix form,

$\dfrac{\partial \mathbf{a}}{\partial \mathbf{z}} = \mathrm{diag}(\mathbf{a}) - \mathbf{a}\mathbf{a}^T$

Code

Python code comparing the derivative of the softmax computed using the derivation above with the one computed by the PyTorch autograd function is shown below.
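A minimal sketch of such a comparison, assuming torch.autograd.functional.jacobian is used to obtain the autograd Jacobian (the vector size and random test input are chosen just for illustration):

```python
import torch

k = 3
z = torch.randn(k, dtype=torch.float64)

# Analytical Jacobian from the derivation: diag(a) - a a^T
a = torch.softmax(z, dim=0)
jac_analytic = torch.diag(a) - torch.outer(a, a)

# Jacobian computed by PyTorch autograd
jac_autograd = torch.autograd.functional.jacobian(
    lambda t: torch.softmax(t, dim=0), z
)

print(torch.allclose(jac_analytic, jac_autograd))   # expected: True
```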

Derivative of Linear layer

To find the derivative of the linear layer $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$ with respect to the parameters $\mathbf{W}$ and $\mathbf{b}$, we must compute two partial derivatives:

  • $\frac{\partial \mathbf{z}}{\partial \mathbf{W}}$ – how the output changes with respect to the weight matrix.
  • $\frac{\partial \mathbf{z}}{\partial \mathbf{b}}$ – how the output changes with respect to the bias vector.

Derivative of Weights

To compute the derivative $\frac{\partial \mathbf{z}}{\partial \mathbf{W}}$, we evaluate how each weight parameter $W_{ij}$ affects each output dimension $z_i$. The $i$-th component of $\mathbf{z}$ is:

$z_i = \sum_{j=1}^{n_x} W_{ij}\, x_j + b_i$

The partial derivative of $z_i$ with respect to $W_{ij}$ is:

$\dfrac{\partial z_i}{\partial W_{ij}} = x_j$

where,

  • $i$ indexes the elements of the output vector $\mathbf{z}$ and
  • $j$ indexes the elements of the input vector $\mathbf{x}$.

Since each output $z_i$ depends only on the weights in the $i$-th row $\mathbf{W}_{i,:}$, the Jacobian simplifies to a matrix where each row is $\mathbf{x}^T$. This can be represented as

$\dfrac{\partial z_i}{\partial \mathbf{W}_{i,:}} = \mathbf{x}^T$

For all the rows of $\mathbf{W}$, the derivative of each $z_i$ with respect to its own row of weights is the same vector $\mathbf{x}^T$ (and zero with respect to every other row).

Derivative of Bias

The bias vector $\mathbf{b}$ is added element-wise to the output of the linear transformation $\mathbf{W}\mathbf{x}$. That is, each output component is given by:

$z_i = \sum_{j=1}^{n_x} W_{ij}\, x_j + b_i$

So the partial derivative of $z_i$ with respect to $b_j$ is:

$\dfrac{\partial z_i}{\partial b_j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$

This implies that the Jacobian matrix of $\mathbf{z}$ with respect to $\mathbf{b}$ is an identity matrix:

$\dfrac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}_k$

This tells us that the bias only affects its corresponding output component (i.e., $b_i$ only affects $z_i$).
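To sanity-check these two Jacobians numerically, here is a small hedged sketch using torch.autograd.functional.jacobian (the sizes and random values are arbitrary):

```python
import torch

k, n_x = 3, 4
W = torch.randn(k, n_x, dtype=torch.float64)
b = torch.randn(k, dtype=torch.float64)
x = torch.randn(n_x, dtype=torch.float64)

# Jacobian of z = Wx + b w.r.t. b: expected to be the k x k identity matrix
jac_b = torch.autograd.functional.jacobian(lambda b_: W @ x + b_, b)
print(torch.allclose(jac_b, torch.eye(k, dtype=torch.float64)))        # True

# Jacobian w.r.t. W has shape (k, k, n_x): z_i depends only on row i of W,
# and that dependence is x^T
jac_W = torch.autograd.functional.jacobian(lambda W_: W_ @ x + b, W)
print(torch.allclose(jac_W[0, 0, :], x))                     # dz_0 / dW_{0,:} equals x
print(torch.allclose(jac_W[0, 1, :], torch.zeros_like(x)))   # z_0 does not depend on row 1 of W
```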

Loss for multi-class classification

Maximum Likelihood Estimate

The likelihood of observing the true class $c$, given input $\mathbf{x}$, under the model is:

$P(y = c \mid \mathbf{x}) = a_c$

The log-likelihood over a dataset with $m$ examples is:

$\ell = \sum_{i=1}^{m} \log a^{(i)}_{c^{(i)}}$

where,

$a^{(i)}_{c^{(i)}}$ is the model's predicted probability for the correct class $c^{(i)}$ for the $i$-th example.

Maximizing this log-likelihood is equivalent to minimizing the negative log-likelihood:

$-\ell = -\sum_{i=1}^{m} \log a^{(i)}_{c^{(i)}}$

Connecting to Cross Entropy Loss

To map the ground-truth class label $y$ to a target vector $\mathbf{y}$, a common choice is the one-hot encoding scheme, where the true class is indicated by a 1 in the corresponding position and 0 elsewhere. For example, suppose we have $k = 3$ classes and the correct label is class 2, i.e., $y = 2$; then the one-hot encoded vector $\mathbf{y}$ becomes:

$\mathbf{y} = [0, 1, 0]^T$
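A one-line way to build such a vector in PyTorch is torch.nn.functional.one_hot; a small sketch (note that the class index is 0-based here, so index 1 corresponds to the second class):

```python
import torch
import torch.nn.functional as F

k = 3
y = torch.tensor(1)                       # class index (0-based), i.e. the second class
y_one_hot = F.one_hot(y, num_classes=k)
print(y_one_hot)                          # tensor([0, 1, 0])
```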

Cross Entropy

To compare the model's predicted probability vector $\mathbf{a}$ with the one-hot encoded true label $\mathbf{y}$, we use a metric called cross-entropy (refer wiki entry on Cross Entropy). The cross-entropy of a distribution $\mathbf{p}$ relative to the distribution $\mathbf{y}$ over a given set is defined as follows:

$H(\mathbf{y}, \mathbf{p}) = -\mathbb{E}_{\mathbf{y}}\left[\log \mathbf{p}\right]$

where,

$\mathbb{E}_{\mathbf{y}}[\cdot]$ is the expected value operator with respect to the distribution $\mathbf{y}$.

For discrete probability distributions $\mathbf{p}$ and $\mathbf{y}$ over the set of all possible outcomes or classes, this becomes

$H(\mathbf{y}, \mathbf{p}) = -\sum_{i} y_i \log p_i$

Cross Entropy Loss

In the context of training classification models, we use the cross-entropy loss as the cost function to minimize. For a single training example, to evaluate how well the predicted probability vector $\mathbf{a}$ matches the ground-truth vector $\mathbf{y}$, the cross-entropy loss is defined as:

$L(\mathbf{y}, \mathbf{a}) = -\sum_{i=1}^{k} y_i \log a_i$

where,

  • $y_i$ is the true probability of class $i$ and
  • $a_i$ is the predicted probability for class $i$.

The loss encourages the model to assign higher probability to the correct class, which indirectly lowers the probabilities of the incorrect classes. The smaller the cross-entropy loss, the closer the predicted probabilities are to the true labels.

The loss across all $m$ examples is,

$J = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} y^{(i)}_j \log a^{(i)}_j$

When $\mathbf{y}^{(i)}$ is one-hot coded, as only the term for the correct class $c^{(i)}$ is non-zero, the equation reduces to

$J = -\dfrac{1}{m} \sum_{i=1}^{m} \log a^{(i)}_{c^{(i)}}$

We can see that this cross-entropy loss is the same as the negative log-likelihood derived earlier (up to the constant scaling factor $\frac{1}{m}$), so minimizing it yields the maximum likelihood estimate.

Note :

  • The function for the cross-entropy loss is available in the PyTorch library as torch.nn.CrossEntropyLoss (refer entry on CELoss in PyTorch).
  • In the torch.nn.CrossEntropyLoss definition, we only need to provide the output of the linear layer (called logits) and the class indices as integers. The softmax and the logarithm of the probabilities are computed internally, so we do not need to apply softmax before passing the logits to this function (see the short comparison sketch below).
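A minimal sketch illustrating this behaviour, comparing a manual softmax-plus-negative-log computation against torch.nn.CrossEntropyLoss (the logits and target below are arbitrary example values):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0]])   # output of the linear layer for one example, shape (1, k)
target = torch.tensor([0])                  # class index of the true class

# Manual computation: softmax followed by the negative log-probability of the true class
a = torch.softmax(logits, dim=1)
loss_manual = -torch.log(a[0, target[0]])

# PyTorch: CrossEntropyLoss takes the raw logits and the class index directly
loss_torch = nn.CrossEntropyLoss()(logits, target)

print(loss_manual.item(), loss_torch.item())   # the two values should match
```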

Gradients with Cross Entropy (CE) Loss

The system model for multi-class classification involves multiple steps:

  • firstly, the vector $\mathbf{z}$ is defined as a linear function of $\mathbf{x}$ using the parameters $\mathbf{W}$ and $\mathbf{b}$,
  • then $\mathbf{z}$ gets transformed into an estimated probability vector $\mathbf{a}$ using the softmax function,
  • lastly, using the true label $\mathbf{y}$ and the estimated probability vector $\mathbf{a}$, the cross-entropy loss $L$ is computed.

For performing gradient descent on the parameters, the goal is to find the gradients of the loss w.r.t. the parameters $\mathbf{W}$ and $\mathbf{b}$. To find the gradients, we go in the reverse order, i.e.

  • first, the gradient of the loss w.r.t. the estimated probability vector $\mathbf{a}$, i.e. $\frac{\partial L}{\partial \mathbf{a}}$, is computed,
  • then the gradient of the probability vector w.r.t. the output of the linear function $\mathbf{z}$ is multiplied with the gradient of the loss with respect to $\mathbf{a}$, i.e. $\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial \mathbf{a}}{\partial \mathbf{z}} \cdot \frac{\partial L}{\partial \mathbf{a}}$,
  • lastly, to find the gradients of the loss w.r.t. the parameters $\mathbf{W}$ and $\mathbf{b}$, the product of all the individual gradients is used. This is written as,

$\dfrac{\partial L}{\partial \mathbf{W}} = \dfrac{\partial L}{\partial \mathbf{z}} \cdot \dfrac{\partial \mathbf{z}}{\partial \mathbf{W}}, \qquad \dfrac{\partial L}{\partial \mathbf{b}} = \dfrac{\partial L}{\partial \mathbf{z}} \cdot \dfrac{\partial \mathbf{z}}{\partial \mathbf{b}}$

The steps described above, calculating gradients in the reverse order from the loss back to the parameters, are an application of the chain rule from calculus (refer wiki entry on Chain Rule). This method is the foundation of backpropagation used in training models (refer wiki entry on Backpropagation).

Gradients of Loss with respect to Probability (dL/da)

As defined earlier, for a multi-class classification setting, the cross-entropy loss is given by:

$L = -\sum_{i=1}^{k} y_i \log a_i$

The derivative of $L$ w.r.t. $a_i$ is,

$\dfrac{\partial L}{\partial a_i} = -\dfrac{y_i}{a_i}$

So, the gradient is large if the predicted probability is small for the correct class; this penalises the model for incorrect predictions, which is desired during training. The vectorized form of the loss gradient w.r.t. the probability vector $\mathbf{a}$ is:

$\dfrac{\partial L}{\partial \mathbf{a}} = -\dfrac{\mathbf{y}}{\mathbf{a}} \quad \text{(element-wise division)}$

Equivalently,

$\dfrac{\partial L}{\partial \mathbf{a}} = \left[-\dfrac{y_1}{a_1},\; -\dfrac{y_2}{a_2},\; \ldots,\; -\dfrac{y_k}{a_k}\right]^T$

Gradients of Loss with respect to z (dL/dz)

Using the chain rule, to find the gradient of the loss with respect to $\mathbf{z}$, i.e. $\frac{\partial L}{\partial \mathbf{z}}$, we multiply the derivative of the softmax output $\frac{\partial \mathbf{a}}{\partial \mathbf{z}}$, which is a $k \times k$ matrix, with $\frac{\partial L}{\partial \mathbf{a}}$, which is of dimension $k \times 1$,

$\dfrac{\partial L}{\partial \mathbf{z}} = \dfrac{\partial \mathbf{a}}{\partial \mathbf{z}} \cdot \dfrac{\partial L}{\partial \mathbf{a}} = \left(\mathrm{diag}(\mathbf{a}) - \mathbf{a}\mathbf{a}^T\right)\left(-\dfrac{\mathbf{y}}{\mathbf{a}}\right)$

In vectorized form, $\frac{\partial L}{\partial \mathbf{z}}$ can be represented as

$\dfrac{\partial L}{\partial \mathbf{z}} = \mathbf{a} - \mathbf{y}$
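This result can be checked numerically; a small sketch comparing the autograd gradient of the loss w.r.t. z against a − y (random logits and an arbitrary one-hot target):

```python
import torch

k = 3
z = torch.randn(k, dtype=torch.float64, requires_grad=True)
y = torch.tensor([0.0, 1.0, 0.0], dtype=torch.float64)   # one-hot target

a = torch.softmax(z, dim=0)
loss = -(y * torch.log(a)).sum()    # cross-entropy loss for a single example
loss.backward()

print(torch.allclose(z.grad, (a - y).detach()))   # expected: True, i.e. dL/dz = a - y
```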

Gradients of loss with respect to Parameters (dL/dW, dL/db)

Gradients of Weights (W)

Based on the chain rule, to find the gradient of the loss with respect to the parameter $\mathbf{W}$, each element of $\frac{\partial L}{\partial \mathbf{z}}$ scales the corresponding row $\mathbf{x}^T$ from $\frac{\partial \mathbf{z}}{\partial \mathbf{W}}$,

$\dfrac{\partial L}{\partial \mathbf{W}_{i,:}} = \dfrac{\partial L}{\partial z_i}\, \mathbf{x}^T = (a_i - y_i)\, \mathbf{x}^T$

This is equivalent to the outer product,

$\dfrac{\partial L}{\partial \mathbf{W}} = (\mathbf{a} - \mathbf{y})\, \mathbf{x}^T$

Gradients of bias (b)

Recall the linear transformation: $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$, with $\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}_k$. The gradients are:

$\dfrac{\partial L}{\partial \mathbf{b}} = \dfrac{\partial \mathbf{z}}{\partial \mathbf{b}} \cdot \dfrac{\partial L}{\partial \mathbf{z}} = \mathbf{a} - \mathbf{y}$
The intuition from above equations is :

if the estimated probability $\mathbf{a}$ is close to the true value $\mathbf{y}$, then the gradient is small, and the update to the parameters is also correspondingly smaller. If you recall, the gradients for binary classification (refer post on Gradients for Binary Classification with Sigmoid) and linear regression (refer post on Gradients for Linear Regression) follow a similar intuitive explanation.

These gradients are then used in the optimizer (e.g., SGD) to update parameters and reduce the loss.

Vectorised operations (with m examples)

The $m$ training examples, each having $n_x$ features, are represented with one example per column as,

$\mathbf{X} \in \mathbb{R}^{n_x \times m}$

The output, which is a probability matrix across the $k$ classes for each of the $m$ examples, is:

$\mathbf{A} \in \mathbb{R}^{k \times m}$

The linear transformation before applying the activation function (e.g., softmax) is given by:

$\mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b}$

where, the parameters

  • $\mathbf{W} \in \mathbb{R}^{k \times n_x}$ and
  • $\mathbf{b} \in \mathbb{R}^{k}$, which is broadcast across the $m$ columns.

The softmax activation is applied column-wise to the matrix $\mathbf{Z}$ to obtain the probability outputs:

$a_{ij} = \dfrac{e^{z_{ij}}}{\sum_{l=1}^{k} e^{z_{lj}}}$

In matrix form, this is written as,

$\mathbf{A} = \mathrm{softmax}(\mathbf{Z})$

The cross-entropy loss compares the predicted probabilities $\mathbf{A}$ with the ground-truth one-hot encoded labels $\mathbf{Y} \in \mathbb{R}^{k \times m}$:

$J = -\dfrac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{k} y_{ij} \log a_{ij}$

The derivative of the cross-entropy loss with softmax activation, with respect to the input $\mathbf{Z}$ (the logits), simplifies to the per-example error term (written as $\mathrm{d}\mathbf{Z}$, following the course notation):

$\mathrm{d}\mathbf{Z} = \mathbf{A} - \mathbf{Y}$

The gradient of the loss with respect to the weight matrix is:

$\dfrac{\partial J}{\partial \mathbf{W}} = \dfrac{1}{m}\, \mathrm{d}\mathbf{Z}\, \mathbf{X}^T = \dfrac{1}{m} \left(\mathbf{A} - \mathbf{Y}\right) \mathbf{X}^T$

As the input matrix $\mathbf{X}$ has shape $n_x \times m$, the matrix product $\mathrm{d}\mathbf{Z}\, \mathbf{X}^T$ results in a matrix of shape $k \times n_x$. This captures the total gradient of the loss over all $m$ examples. Averaging over the examples is done by multiplying with $\frac{1}{m}$.

The gradient of the loss with respect to the bias vector is computed by summing the gradient over all examples using a row vector of ones:

$\dfrac{\partial J}{\partial \mathbf{b}} = \dfrac{1}{m}\, \mathrm{d}\mathbf{Z}\, \mathbf{1}^T = \dfrac{1}{m} \left(\mathbf{A} - \mathbf{Y}\right) \mathbf{1}^T$

Here, $\mathbf{1} \in \mathbb{R}^{1 \times m}$ and multiplying by $\mathbf{1}^T$ sums the gradients across all $m$ examples. The result is a $k \times 1$ vector, which matches the shape of $\mathbf{b}$.

Code (gradients)

Example code comparing the gradients computed using the derivation above with those computed by autograd from PyTorch is shown below.
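A minimal sketch of such a comparison for a small random batch, assuming the column-wise layout X ∈ R^{n_x×m}, Y ∈ R^{k×m} used above (sizes chosen only for illustration):

```python
import torch

torch.manual_seed(0)
k, n_x, m = 3, 4, 8                                            # classes, features, examples

X = torch.randn(n_x, m, dtype=torch.float64)                   # examples as columns
labels = torch.randint(0, k, (m,))
Y = torch.eye(k, dtype=torch.float64)[:, labels]               # one-hot labels, shape (k, m)

W = torch.randn(k, n_x, dtype=torch.float64, requires_grad=True)
b = torch.randn(k, 1, dtype=torch.float64, requires_grad=True)

# Forward pass: linear layer, column-wise softmax, averaged cross-entropy loss
Z = W @ X + b
A = torch.softmax(Z, dim=0)
loss = -(Y * torch.log(A)).sum() / m
loss.backward()

# Analytical gradients from the derivation
dZ = A.detach() - Y
dW = dZ @ X.T / m
db = dZ.sum(dim=1, keepdim=True) / m

print(torch.allclose(W.grad, dW))   # expected: True
print(torch.allclose(b.grad, db))   # expected: True
```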

Training for toy example with 3 classes

Below is an example of training a multi-class classifier based on the model and gradient descent described above. Synthetic training data is generated from two independent Gaussian random variables with zero mean and unit variance. The mean is shifted by (-2, -2), (+2, +2), (-2, +2) corresponding to class 0, class 1 and class 2 respectively.

The training loop is run both using the analytically computed gradients and using torch.autograd provided by PyTorch, and we can see that both are numerically very close.
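A compact sketch of such a training loop using only the analytically derived gradients (the learning rate, number of epochs and per-class sample counts below are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)
k, n_x, n_per_class = 3, 2, 100
m = k * n_per_class

# Synthetic data: unit-variance Gaussians with per-class mean shifts
means = torch.tensor([[-2.0, -2.0], [2.0, 2.0], [-2.0, 2.0]])
X = torch.cat([torch.randn(n_per_class, n_x) + means[c] for c in range(k)]).T   # shape (n_x, m)
labels = torch.arange(k).repeat_interleave(n_per_class)                          # class indices
Y = torch.eye(k)[:, labels]                                                      # one-hot targets, shape (k, m)

W = torch.zeros(k, n_x)
b = torch.zeros(k, 1)
lr = 0.1                                   # learning rate, chosen only for this example

for epoch in range(200):
    # Forward pass
    Z = W @ X + b
    A = torch.softmax(Z, dim=0)
    loss = -(Y * torch.log(A)).sum() / m

    # Gradient descent step with the analytically derived gradients
    dZ = A - Y
    W -= lr * (dZ @ X.T) / m
    b -= lr * dZ.sum(dim=1, keepdim=True) / m

preds = torch.softmax(W @ X + b, dim=0).argmax(dim=0)
print(loss.item(), (preds == labels).float().mean().item())   # final loss and training accuracy
```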

Training with Label Smoothing

In the previous section, we derived the gradients for multi-class classification using one-hot encoded targets. In the paper “Rethinking the Inception Architecture for Computer Vision” by Szegedy et al. (2016) (arXiv:1512.00567), the idea of label smoothing was introduced. The key observation is that one-hot targets, which drive the predicted probability for the correct class toward 1 and ignore the other classes in the loss function, encourage models to become overconfident.

Label smoothing combats this by replacing the hard 1 in the true class with a slightly lower value and redistributing a small probability mass $\epsilon$ uniformly across the classes. So, instead of teaching the model that one class is absolutely correct, we teach it that one class is very likely correct, allowing for some uncertainty.

For a classification problem with $k$ classes and smoothing parameter $\epsilon$, the smoothed label vector becomes:

$\mathbf{y}^{\text{smooth}} = (1 - \epsilon)\, \mathbf{y} + \dfrac{\epsilon}{k}\, \mathbf{1}$

For an example with $k = 4$ classes, the true class receives probability $1 - \epsilon + \frac{\epsilon}{4}$ and each of the other three classes receives $\frac{\epsilon}{4}$.

Even though we modify the target labels using label smoothing, the sum of the smoothed probabilities still adds up to 1. Because of this, the gradient derivations from the previous section remain valid.

Training code

For the toy training example earlier, we compare training with smoothed labels vs. one-hot coded labels. The PyTorch function torch.nn.CrossEntropyLoss (refer entry on CELoss in PyTorch) has an optional argument label_smoothing which implements label smoothing as defined earlier.
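A small sketch showing the argument in use (the logits, target and ε = 0.1 below are arbitrary example values; the label_smoothing argument requires a reasonably recent PyTorch version, 1.10 or later):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[3.0, 0.5, -1.0]])   # logits for one example, k = 3 classes
target = torch.tensor([0])                  # true class index

loss_hard = nn.CrossEntropyLoss()(logits, target)
loss_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)

print(loss_hard.item(), loss_smooth.item())  # the smoothed-label loss is typically a bit larger
```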

In the training results on the toy example, we can see that the loss is higher for the training with label smoothing, and correspondingly the misclassification rate is also slightly higher.

However, label smoothing has been shown to improve generalization in larger models trained on complex datasets. The concept was first introduced in Rethinking the Inception Architecture for Computer Vision (Szegedy et al., 2016), and was later used in the foundational paper Attention is All You Need (Vaswani et al., 2017). A broader study, When Does Label Smoothing Help? (Müller et al., 2019), analyzed its effectiveness in large models like ResNets and Transformers.

Summary

The post covers the following key aspects:

  • System model for multi-class classification with a linear layer and softmax
  • Loss function based on categorical cross entropy, and showing that minimizing it corresponds to the Maximum Likelihood Estimate
  • Computation of the gradients based on the chain rule of derivatives
  • Vectorized operations for a batch of examples, which implement the computations using efficient matrix and vector math
  • Training loop for the classification using both manual and PyTorch-based gradients
  • Explanation of the concept of label smoothing, demonstrated with a training loop

Have any questions or feedback? Feel free to drop your feedback in the comments section. 🙂
