Gradients for multi-class classification with Softmax

In a multi-class classification problem, the output (also called the label or class) takes one of a finite set of discrete values. In this post, the system model for multi-class classification with a linear layer followed by a softmax layer is defined. The softmax function transforms the output of the linear layer into values lying between 0 and 1, which can be interpreted as probability scores.

Next, the loss function based on categorical cross entropy is explained and the gradients of the model parameters are derived using the chain rule. The analytically computed gradients are then compared with those obtained from the deep learning framework PyTorch. Finally, we implement a training loop using gradient descent for a toy multi-class classification task with 2D Gaussian-distributed data.

As always, the CS229 Lecture Notes and the notation used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng form the key references.

Model

Let us take the example of estimating the class label $y$, which takes one of $k$ discrete values, based on a feature vector $\mathbf{x}$ having $n_x$ features, i.e. $\mathbf{x} \in \mathbb{R}^{n_x}$, given $m$ training examples.

Linear Layer

Let us assume that the variable $\mathbf{z}$ is defined as a linear function of $\mathbf{x}$. For a single training example, this can be written as:

$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$

where,

  • $\mathbf{z}$ is the output vector of size $k$, i.e. $\mathbf{z} \in \mathbb{R}^{k}$,
  • $\mathbf{W}$ is the parameter matrix of size $k \times n_x$, i.e. $\mathbf{W} \in \mathbb{R}^{k \times n_x}$,
  • $\mathbf{x}$ is the feature vector of size $n_x$, i.e. $\mathbf{x} \in \mathbb{R}^{n_x}$, and
  • $\mathbf{b}$ is the parameter vector of size $k$, i.e. $\mathbf{b} \in \mathbb{R}^{k}$.

Note:

This is the definition of the Linear layer in PyTorch (refer entry on Linear layer). It is alternatively called a Dense layer in TensorFlow (refer entry on Dense) and a fully connected layer in the deep learning literature.
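As a quick illustration, a minimal sketch of the same mapping using torch.nn.Linear (the sizes n_x = 4 and k = 3 below are arbitrary example values):

```python
import torch
import torch.nn as nn

n_x, k = 4, 3                  # number of features and number of classes (example values)
linear = nn.Linear(n_x, k)     # holds W of shape (k, n_x) and b of shape (k,)

x = torch.randn(n_x)           # a single example with n_x features
z = linear(x)                  # computes z = W x + b
print(z.shape)                 # torch.Size([3])
```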

Softmax layer

To map the real-valued vector $\mathbf{z}$ to a probability vector $\mathbf{a}$ whose elements sum up to 1, we use the softmax function (refer wiki entry on SoftMax). The softmax function is defined as,

$a_i = \dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \quad i = 1, \ldots, k$

Equivalently, this can be written as,

$\mathbf{a} = \mathrm{softmax}(\mathbf{z})$

where each $a_i$ represents the normalized exponential of the corresponding $z_i$. This ensures that

  • each element lies in the range [0, 1], i.e. $0 \le a_i \le 1$, and
  • the sum of all the elements adds up to 1, i.e. $\sum_{i=1}^{k} a_i = 1$.

This makes $\mathbf{a}$ interpretable as a probability distribution over the $k$ classes.
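For instance, a small sketch using torch.softmax on an arbitrary logit vector, showing that the output lies in [0, 1] and sums to 1:

```python
import torch

z = torch.tensor([2.0, 1.0, 0.1])   # output of the linear layer (logits), k = 3
a = torch.softmax(z, dim=0)         # softmax over the k classes

print(a)                            # every element lies in [0, 1]
print(a.sum())                      # tensor(1.) up to floating point error
```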

Derivatives

Derivative of Softmax layer

To compute the derivative of the softmax output $\mathbf{a}$ with respect to its input $\mathbf{z}$, we need to find the Jacobian matrix $\frac{\partial \mathbf{a}}{\partial \mathbf{z}}$. The Jacobian contains all partial derivatives of each output component $a_i$ with respect to each input component $z_j$.

To find the derivative for all cases, let us split into two scenarios, i.e. $i = j$ and $i \neq j$.

Derivative for case i=j

Using the product rule of derivatives,

$\dfrac{\partial a_i}{\partial z_i} = \dfrac{\partial}{\partial z_i}\left(\dfrac{e^{z_i}}{\sum_{l=1}^{k} e^{z_l}}\right) = a_i\,(1 - a_i)$

Derivative for case i ≠ j

$\dfrac{\partial a_i}{\partial z_j} = -a_i\, a_j$

Final output (matrix form)

Based on the above derivations, the derivative is defined as:

$\dfrac{\partial a_i}{\partial z_j} = a_i\,(\delta_{ij} - a_j), \quad \text{where } \delta_{ij} = 1 \text{ if } i = j \text{ and } 0 \text{ otherwise}$

In matrix form,

$\dfrac{\partial \mathbf{a}}{\partial \mathbf{z}} = \mathrm{diag}(\mathbf{a}) - \mathbf{a}\mathbf{a}^T$

Code

Python code comparing the derivative of the softmax computed using the derivation above with the one computed by the PyTorch autograd function is shown below.
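A minimal sketch of such a comparison, assuming torch.autograd.functional.jacobian is used to obtain the autograd Jacobian (the vector size and random test input are chosen just for illustration):

```python
import torch

k = 3
z = torch.randn(k, dtype=torch.float64)

# Analytical Jacobian from the derivation: diag(a) - a a^T
a = torch.softmax(z, dim=0)
jac_analytic = torch.diag(a) - torch.outer(a, a)

# Jacobian computed by PyTorch autograd
jac_autograd = torch.autograd.functional.jacobian(
    lambda t: torch.softmax(t, dim=0), z
)

print(torch.allclose(jac_analytic, jac_autograd))   # expected: True
```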

Derivative of Linear layer

To find the derivative of the linear layer $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$ with respect to the parameters $\mathbf{W}$ and $\mathbf{b}$, we must compute two partial derivatives:

  • $\frac{\partial \mathbf{z}}{\partial \mathbf{W}}$ – how the output changes with respect to the weight matrix.
  • $\frac{\partial \mathbf{z}}{\partial \mathbf{b}}$ – how the output changes with respect to the bias vector.

Derivative of Weights

To compute the derivative $\frac{\partial \mathbf{z}}{\partial \mathbf{W}}$, we evaluate how each weight parameter $W_{ij}$ affects each output dimension $z_i$. The $i$-th component of $\mathbf{z}$ is:

$z_i = \sum_{j=1}^{n_x} W_{ij}\, x_j + b_i$

The partial derivative of $z_i$ with respect to $W_{ij}$ is:

$\dfrac{\partial z_i}{\partial W_{ij}} = x_j$

where,

  • $i$ indexes the elements of the output vector $\mathbf{z}$ and
  • $j$ indexes the elements of the input vector $\mathbf{x}$.

Since each output $z_i$ depends only on the weights in the $i$-th row $\mathbf{W}_{i,:}$, the Jacobian simplifies to a matrix where each row is $\mathbf{x}^T$. This can be represented as

$\dfrac{\partial z_i}{\partial \mathbf{W}_{i,:}} = \mathbf{x}^T$

For all the rows of $\mathbf{W}$, the derivative of each $z_i$ with respect to its own row of weights is the same vector $\mathbf{x}^T$ (and zero with respect to every other row).

Derivative of Bias

The bias vector $\mathbf{b}$ is added element-wise to the output of the linear transformation $\mathbf{W}\mathbf{x}$. That is, each output component is given by:

$z_i = \sum_{j=1}^{n_x} W_{ij}\, x_j + b_i$

So the partial derivative of $z_i$ with respect to $b_j$ is:

$\dfrac{\partial z_i}{\partial b_j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$

This implies that the Jacobian matrix of $\mathbf{z}$ with respect to $\mathbf{b}$ is an identity matrix:

$\dfrac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}_k$

This tells us that the bias only affects its corresponding output component (i.e., $b_i$ only affects $z_i$).
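To sanity-check these two Jacobians numerically, here is a small hedged sketch using torch.autograd.functional.jacobian (the sizes and random values are arbitrary):

```python
import torch

k, n_x = 3, 4
W = torch.randn(k, n_x, dtype=torch.float64)
b = torch.randn(k, dtype=torch.float64)
x = torch.randn(n_x, dtype=torch.float64)

# Jacobian of z = Wx + b w.r.t. b: expected to be the k x k identity matrix
jac_b = torch.autograd.functional.jacobian(lambda b_: W @ x + b_, b)
print(torch.allclose(jac_b, torch.eye(k, dtype=torch.float64)))        # True

# Jacobian w.r.t. W has shape (k, k, n_x): z_i depends only on row i of W,
# and that dependence is x^T
jac_W = torch.autograd.functional.jacobian(lambda W_: W_ @ x + b, W)
print(torch.allclose(jac_W[0, 0, :], x))                     # dz_0 / dW_{0,:} equals x
print(torch.allclose(jac_W[0, 1, :], torch.zeros_like(x)))   # z_0 does not depend on row 1 of W
```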

Loss for multi-class classification

Maximum Likelihood Estimate

The likelihood of observing the true class $c$, given input $\mathbf{x}$, under the model is:

$P(y = c \mid \mathbf{x}) = a_c$

The log-likelihood over a dataset with $m$ examples is:

$\ell = \sum_{i=1}^{m} \log a^{(i)}_{c^{(i)}}$

where,

$a^{(i)}_{c^{(i)}}$ is the model's predicted probability for the correct class $c^{(i)}$ for the $i$-th example.

Maximizing this log-likelihood is equivalent to minimizing the negative log-likelihood:

$-\ell = -\sum_{i=1}^{m} \log a^{(i)}_{c^{(i)}}$

Connecting to Cross Entropy Loss

To map the ground-truth class label $y$ to a target vector $\mathbf{y}$, a common choice is the one-hot encoding scheme, where the true class is indicated by a 1 in the corresponding position and 0 elsewhere. For example, suppose we have $k = 3$ classes and the correct label is class 2, i.e., $y = 2$; then the one-hot encoded vector $\mathbf{y}$ becomes:

$\mathbf{y} = [0, 1, 0]^T$
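A one-line way to build such a vector in PyTorch is torch.nn.functional.one_hot; a small sketch (note that the class index is 0-based here, so index 1 corresponds to the second class):

```python
import torch
import torch.nn.functional as F

k = 3
y = torch.tensor(1)                       # class index (0-based), i.e. the second class
y_one_hot = F.one_hot(y, num_classes=k)
print(y_one_hot)                          # tensor([0, 1, 0])
```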

Cross Entropy

To compare the model's predicted probability vector $\mathbf{a}$ with the one-hot encoded true label $\mathbf{y}$, we use a metric called cross-entropy (refer wiki entry on Cross Entropy). The cross-entropy of a distribution $\mathbf{p}$ relative to the distribution $\mathbf{y}$ over a given set is defined as follows:

$H(\mathbf{y}, \mathbf{p}) = -\mathbb{E}_{\mathbf{y}}\left[\log \mathbf{p}\right]$

where,

$\mathbb{E}_{\mathbf{y}}[\cdot]$ is the expected value operator with respect to the distribution $\mathbf{y}$.

For discrete probability distributions $\mathbf{p}$ and $\mathbf{y}$ over the set of all possible outcomes or classes, this becomes

$H(\mathbf{y}, \mathbf{p}) = -\sum_{i} y_i \log p_i$

Cross Entropy Loss

In the context of training classification models, we use the cross-entropy loss as the cost function to minimize. For a single training example, to evaluate how well the predicted probability vector $\mathbf{a}$ matches the ground-truth vector $\mathbf{y}$, the cross-entropy loss is defined as:

$L(\mathbf{y}, \mathbf{a}) = -\sum_{i=1}^{k} y_i \log a_i$

where,

  • $y_i$ is the true probability of class $i$ and
  • $a_i$ is the predicted probability for class $i$.

The loss encourages the model to assign higher probability to the correct class, which indirectly lowers the probabilities of the incorrect classes. The smaller the cross-entropy loss, the closer the predicted probabilities are to the true labels.

The loss across all $m$ examples is,

$J = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} y^{(i)}_j \log a^{(i)}_j$

When $\mathbf{y}^{(i)}$ is one-hot coded, as only the term for the correct class $c^{(i)}$ is non-zero, the equation reduces to

$J = -\dfrac{1}{m} \sum_{i=1}^{m} \log a^{(i)}_{c^{(i)}}$

We can see that this cross-entropy loss is the same as the negative log-likelihood derived earlier (up to the constant scaling factor $\frac{1}{m}$), so minimizing it yields the maximum likelihood estimate.

Note :

  • The function for the cross-entropy loss is available in the PyTorch library as torch.nn.CrossEntropyLoss (refer entry on CELoss in PyTorch).
  • In the torch.nn.CrossEntropyLoss definition, we only need to provide the output of the linear layer (called logits) and the class indices as integers. The softmax and the logarithm of the probabilities are computed internally, so we do not need to apply softmax before passing the logits to this function (see the short comparison sketch below).
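A minimal sketch illustrating this behaviour, comparing a manual softmax-plus-negative-log computation against torch.nn.CrossEntropyLoss (the logits and target below are arbitrary example values):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0]])   # output of the linear layer for one example, shape (1, k)
target = torch.tensor([0])                  # class index of the true class

# Manual computation: softmax followed by the negative log-probability of the true class
a = torch.softmax(logits, dim=1)
loss_manual = -torch.log(a[0, target[0]])

# PyTorch: CrossEntropyLoss takes the raw logits and the class index directly
loss_torch = nn.CrossEntropyLoss()(logits, target)

print(loss_manual.item(), loss_torch.item())   # the two values should match
```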

Gradients with Cross Entropy (CE) Loss

The system model for multi-class classification involves multiple steps:

  • firstly, the vector $\mathbf{z}$ is defined as a linear function of $\mathbf{x}$ using the parameters $\mathbf{W}$ and $\mathbf{b}$,
  • then $\mathbf{z}$ gets transformed into an estimated probability vector $\mathbf{a}$ using the softmax function,
  • lastly, using the true label $\mathbf{y}$ and the estimated probability vector $\mathbf{a}$, the cross-entropy loss $L$ is computed.

For performing gradient descent on the parameters, the goal is to find the gradients of the loss w.r.t. the parameters $\mathbf{W}$ and $\mathbf{b}$. To find the gradients, we go in the reverse order, i.e.

  • first, the gradient of the loss w.r.t. the estimated probability vector $\mathbf{a}$, i.e. $\frac{\partial L}{\partial \mathbf{a}}$, is computed,
  • then the gradient of the probability vector w.r.t. the output of the linear function $\mathbf{z}$ is multiplied with the gradient of the loss with respect to $\mathbf{a}$, i.e. $\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial \mathbf{a}}{\partial \mathbf{z}} \cdot \frac{\partial L}{\partial \mathbf{a}}$,
  • lastly, to find the gradients of the loss w.r.t. the parameters $\mathbf{W}$ and $\mathbf{b}$, the product of all the individual gradients is used. This is written as,

$\dfrac{\partial L}{\partial \mathbf{W}} = \dfrac{\partial L}{\partial \mathbf{z}} \cdot \dfrac{\partial \mathbf{z}}{\partial \mathbf{W}}, \qquad \dfrac{\partial L}{\partial \mathbf{b}} = \dfrac{\partial L}{\partial \mathbf{z}} \cdot \dfrac{\partial \mathbf{z}}{\partial \mathbf{b}}$

The steps described above, calculating gradients in the reverse order from the loss back to the parameters, are an application of the chain rule from calculus (refer wiki entry on Chain Rule). This method is the foundation of backpropagation used in training models (refer wiki entry on Backpropagation).

Gradients of Loss with respect to Probability (dL/da)

As defined earlier, for a multi-class classification setting, the cross-entropy loss is given by:

$L = -\sum_{i=1}^{k} y_i \log a_i$

The derivative of $L$ w.r.t. $a_i$ is,

$\dfrac{\partial L}{\partial a_i} = -\dfrac{y_i}{a_i}$

So, the gradient is large if the predicted probability is small for the correct class; this penalises the model for incorrect predictions, which is desired during training. The vectorized form of the loss gradient w.r.t. the probability vector $\mathbf{a}$ is:

$\dfrac{\partial L}{\partial \mathbf{a}} = -\dfrac{\mathbf{y}}{\mathbf{a}} \quad \text{(element-wise division)}$

Equivalently,

$\dfrac{\partial L}{\partial \mathbf{a}} = \left[-\dfrac{y_1}{a_1},\; -\dfrac{y_2}{a_2},\; \ldots,\; -\dfrac{y_k}{a_k}\right]^T$

Gradients of Loss with respect to z (dL/dz)

Using the chain rule, to find the gradient of the loss with respect to $\mathbf{z}$, i.e. $\frac{\partial L}{\partial \mathbf{z}}$, we multiply the derivative of the softmax output $\frac{\partial \mathbf{a}}{\partial \mathbf{z}}$, which is a $k \times k$ matrix, with $\frac{\partial L}{\partial \mathbf{a}}$, which is of dimension $k \times 1$,

$\dfrac{\partial L}{\partial \mathbf{z}} = \dfrac{\partial \mathbf{a}}{\partial \mathbf{z}} \cdot \dfrac{\partial L}{\partial \mathbf{a}} = \left(\mathrm{diag}(\mathbf{a}) - \mathbf{a}\mathbf{a}^T\right)\left(-\dfrac{\mathbf{y}}{\mathbf{a}}\right)$

In vectorized form, $\frac{\partial L}{\partial \mathbf{z}}$ can be represented as

$\dfrac{\partial L}{\partial \mathbf{z}} = \mathbf{a} - \mathbf{y}$
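This result can be checked numerically; a small sketch comparing the autograd gradient of the loss w.r.t. z against a − y (random logits and an arbitrary one-hot target):

```python
import torch

k = 3
z = torch.randn(k, dtype=torch.float64, requires_grad=True)
y = torch.tensor([0.0, 1.0, 0.0], dtype=torch.float64)   # one-hot target

a = torch.softmax(z, dim=0)
loss = -(y * torch.log(a)).sum()    # cross-entropy loss for a single example
loss.backward()

print(torch.allclose(z.grad, (a - y).detach()))   # expected: True, i.e. dL/dz = a - y
```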

Gradients of loss with respect to Parameters (dL/dW, dL/db)

Gradients of Weights (W)

Based on the chain rule, to find the gradient of the loss with respect to the parameter $\mathbf{W}$, each element of $\frac{\partial L}{\partial \mathbf{z}}$ scales the corresponding row $\mathbf{x}^T$ from $\frac{\partial \mathbf{z}}{\partial \mathbf{W}}$,

$\dfrac{\partial L}{\partial \mathbf{W}_{i,:}} = \dfrac{\partial L}{\partial z_i}\, \mathbf{x}^T = (a_i - y_i)\, \mathbf{x}^T$

This is equivalent to the outer product,

$\dfrac{\partial L}{\partial \mathbf{W}} = (\mathbf{a} - \mathbf{y})\, \mathbf{x}^T$

Gradients of bias (b)

Recall the linear transformation: $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$, with $\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}_k$. The gradients are:

$\dfrac{\partial L}{\partial \mathbf{b}} = \dfrac{\partial \mathbf{z}}{\partial \mathbf{b}} \cdot \dfrac{\partial L}{\partial \mathbf{z}} = \mathbf{a} - \mathbf{y}$
The intuition from above equations is :

if the estimated probability $\mathbf{a}$ is close to the true value $\mathbf{y}$, then the gradient is small, and the update to the parameters is also correspondingly smaller. If you recall, the gradients for binary classification (refer post on Gradients for Binary Classification with Sigmoid) and linear regression (refer post on Gradients for Linear Regression) follow a similar intuitive explanation.

These gradients are then used in the optimizer (e.g., SGD) to update parameters and reduce the loss.

Vectorised operations (with m examples)

The $m$ training examples, each having $n_x$ features, are represented with one example per column as,

$\mathbf{X} \in \mathbb{R}^{n_x \times m}$

The output, which is a probability matrix across the $k$ classes for each of the $m$ examples, is:

$\mathbf{A} \in \mathbb{R}^{k \times m}$

The linear transformation before applying the activation function (e.g., softmax) is given by:

$\mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b}$

where, the parameters

  • $\mathbf{W} \in \mathbb{R}^{k \times n_x}$ and
  • $\mathbf{b} \in \mathbb{R}^{k}$, which is broadcast across the $m$ columns.

The softmax activation is applied column-wise to the matrix $\mathbf{Z}$ to obtain the probability outputs:

$a_{ij} = \dfrac{e^{z_{ij}}}{\sum_{l=1}^{k} e^{z_{lj}}}$

In matrix form, this is written as,

$\mathbf{A} = \mathrm{softmax}(\mathbf{Z})$

The cross-entropy loss compares the predicted probabilities $\mathbf{A}$ with the ground-truth one-hot encoded labels $\mathbf{Y} \in \mathbb{R}^{k \times m}$:

$J = -\dfrac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{k} y_{ij} \log a_{ij}$

The derivative of the cross-entropy loss with softmax activation, with respect to the input $\mathbf{Z}$ (the logits), simplifies to the per-example error term (written as $\mathrm{d}\mathbf{Z}$, following the course notation):

$\mathrm{d}\mathbf{Z} = \mathbf{A} - \mathbf{Y}$

The gradient of the loss with respect to the weight matrix is:

$\dfrac{\partial J}{\partial \mathbf{W}} = \dfrac{1}{m}\, \mathrm{d}\mathbf{Z}\, \mathbf{X}^T = \dfrac{1}{m} \left(\mathbf{A} - \mathbf{Y}\right) \mathbf{X}^T$

As the input matrix $\mathbf{X}$ has shape $n_x \times m$, the matrix product $\mathrm{d}\mathbf{Z}\, \mathbf{X}^T$ results in a matrix of shape $k \times n_x$. This captures the total gradient of the loss over all $m$ examples. Averaging over the examples is done by multiplying with $\frac{1}{m}$.

The gradient of the loss with respect to the bias vector is computed by summing the gradient over all examples using a row vector of ones:

$\dfrac{\partial J}{\partial \mathbf{b}} = \dfrac{1}{m}\, \mathrm{d}\mathbf{Z}\, \mathbf{1}^T = \dfrac{1}{m} \left(\mathbf{A} - \mathbf{Y}\right) \mathbf{1}^T$

Here, $\mathbf{1} \in \mathbb{R}^{1 \times m}$ and multiplying by $\mathbf{1}^T$ sums the gradients across all $m$ examples. The result is a $k \times 1$ vector, which matches the shape of $\mathbf{b}$.

Code (gradients)

Example code comparing the gradients computed using the derivation above with those computed by autograd from PyTorch is shown below.
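A minimal sketch of such a comparison for a small random batch, assuming the column-wise layout X ∈ R^{n_x×m}, Y ∈ R^{k×m} used above (sizes chosen only for illustration):

```python
import torch

torch.manual_seed(0)
k, n_x, m = 3, 4, 8                                            # classes, features, examples

X = torch.randn(n_x, m, dtype=torch.float64)                   # examples as columns
labels = torch.randint(0, k, (m,))
Y = torch.eye(k, dtype=torch.float64)[:, labels]               # one-hot labels, shape (k, m)

W = torch.randn(k, n_x, dtype=torch.float64, requires_grad=True)
b = torch.randn(k, 1, dtype=torch.float64, requires_grad=True)

# Forward pass: linear layer, column-wise softmax, averaged cross-entropy loss
Z = W @ X + b
A = torch.softmax(Z, dim=0)
loss = -(Y * torch.log(A)).sum() / m
loss.backward()

# Analytical gradients from the derivation
dZ = A.detach() - Y
dW = dZ @ X.T / m
db = dZ.sum(dim=1, keepdim=True) / m

print(torch.allclose(W.grad, dW))   # expected: True
print(torch.allclose(b.grad, db))   # expected: True
```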

Training for toy example with 3 classes

Below is an example of training a multi-class classifier based on the model and gradient descent described above. Synthetic training data is generated from two independent Gaussian random variables with zero mean and unit variance. The mean is shifted by (-2, -2), (+2, +2), (-2, +2) corresponding to class 0, class 1 and class 2 respectively.

The training loop is run both using the analytically computed gradients and using torch.autograd provided by PyTorch, and we can see that both are numerically very close.
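A compact sketch of such a training loop using only the analytically derived gradients (the learning rate, number of epochs and per-class sample counts below are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)
k, n_x, n_per_class = 3, 2, 100
m = k * n_per_class

# Synthetic data: unit-variance Gaussians with per-class mean shifts
means = torch.tensor([[-2.0, -2.0], [2.0, 2.0], [-2.0, 2.0]])
X = torch.cat([torch.randn(n_per_class, n_x) + means[c] for c in range(k)]).T   # shape (n_x, m)
labels = torch.arange(k).repeat_interleave(n_per_class)                          # class indices
Y = torch.eye(k)[:, labels]                                                      # one-hot targets, shape (k, m)

W = torch.zeros(k, n_x)
b = torch.zeros(k, 1)
lr = 0.1                                   # learning rate, chosen only for this example

for epoch in range(200):
    # Forward pass
    Z = W @ X + b
    A = torch.softmax(Z, dim=0)
    loss = -(Y * torch.log(A)).sum() / m

    # Gradient descent step with the analytically derived gradients
    dZ = A - Y
    W -= lr * (dZ @ X.T) / m
    b -= lr * dZ.sum(dim=1, keepdim=True) / m

preds = torch.softmax(W @ X + b, dim=0).argmax(dim=0)
print(loss.item(), (preds == labels).float().mean().item())   # final loss and training accuracy
```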

Training with Label Smoothing

In the previous section, we derived the gradients for multi-class classification using one-hot encoded targets. In the paper “Rethinking the Inception Architecture for Computer Vision” by Szegedy et al. (2016) (arXiv:1512.00567), the idea of label smoothing was introduced. The key observation is that one-hot targets, which drive the predicted probability for the correct class toward 1 and ignore the other classes in the loss function, encourage models to become overconfident.

Label smoothing combats this by replacing the hard 1 in the true class with a slightly lower value and redistributing a small probability mass $\epsilon$ uniformly across the classes. So, instead of teaching the model that one class is absolutely correct, we teach it that one class is very likely correct, allowing for some uncertainty.

For a classification problem with $k$ classes and smoothing parameter $\epsilon$, the smoothed label vector becomes:

$\mathbf{y}^{\text{smooth}} = (1 - \epsilon)\, \mathbf{y} + \dfrac{\epsilon}{k}\, \mathbf{1}$

For an example with $k = 4$ classes, the true class receives probability $1 - \epsilon + \frac{\epsilon}{4}$ and each of the other three classes receives $\frac{\epsilon}{4}$.

Even though we modify the target labels using label smoothing, the sum of the smoothed probabilities still adds up to 1. Because of this, the gradient derivations from the previous section remain valid.

Training code

For the toy training example earlier, we compare training with smoothed labels vs. one-hot coded labels. The PyTorch function torch.nn.CrossEntropyLoss (refer entry on CELoss in PyTorch) has an optional argument label_smoothing which implements label smoothing as defined earlier.
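A small sketch showing the argument in use (the logits, target and ε = 0.1 below are arbitrary example values; the label_smoothing argument requires a reasonably recent PyTorch version, 1.10 or later):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[3.0, 0.5, -1.0]])   # logits for one example, k = 3 classes
target = torch.tensor([0])                  # true class index

loss_hard = nn.CrossEntropyLoss()(logits, target)
loss_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)

print(loss_hard.item(), loss_smooth.item())  # the smoothed-label loss is typically a bit larger
```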

In the training results on the toy example, we can see that the loss is higher for the training with label smoothing, and correspondingly the misclassification rate is also slightly higher.

However, label smoothing has been shown to improve generalization in larger models trained on complex datasets. The concept was first introduced in Rethinking the Inception Architecture for Computer Vision (Szegedy et al., 2016), and was later used in the foundational paper Attention is All You Need (Vaswani et al., 2017). A broader study, When Does Label Smoothing Help? (Müller et al., 2019), analyzed its effectiveness in large models like ResNets and Transformers.

Summary

The post covers the following key aspects:

  • System model for multi-class classification with a linear layer and softmax
  • Loss function based on categorical cross entropy, and showing that minimizing it corresponds to the Maximum Likelihood Estimate
  • Computation of the gradients based on the chain rule of derivatives
  • Vectorized operations for a batch of examples, which implement the computations using efficient matrix and vector math
  • Training loop for the classification using both manual and PyTorch-based gradients
  • Explanation of the concept of label smoothing, demonstrated with a training loop

Have any questions or feedback? Feel free to drop your feedback in the comments section. 🙂
