In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where the output takes only two discrete values, 0 or 1, the sigmoid function can be used to transform the output of a linear regression model into a value between 0 and 1, squashing the continuous prediction into a probability-like score. This score can then be interpreted as the likelihood of the output being class 1, with a threshold (commonly 0.5) used to decide between class 0 and class 1.
In this post, the intuition for the loss function for binary classification, based on the Maximum Likelihood Estimate (MLE), is explained. We then derive the gradients for the model parameters using the chain rule. Gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, a training loop using gradient descent is implemented for a binary classification problem with two-dimensional Gaussian distributed data.
As always, the CS229 Lecture Notes and the notation used in the Deep Learning Specialization (C1W1L01) course from Dr Andrew Ng form the key references.
Model
Let us take an example of estimating $y \in \{0, 1\}$ based on a feature vector $x$ having $n_x$ features, i.e. $x \in \mathbb{R}^{n_x}$. There are $m$ examples. Let us assume that the variable $z$ is defined as a linear function of $x$. Then $z$ gets transformed into a probability score $\hat{y}$ using the sigmoid function. For a single training example, this can be written as:

$$z = w^{T}x + b$$

where,
$w$ is the weight vector of size $n_x$, i.e. $w \in \mathbb{R}^{n_x}$, and
$b$ is a scalar.

To convert the real number $z$ to a number $\hat{y}$ lying between 0 and 1, let us define

$$\hat{y} = \sigma(z)$$

where $\sigma(\cdot)$ is the sigmoid function (refer wiki entry on sigmoid function).
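A minimal sketch of this forward pass for a single training example follows; the feature values and parameters below are made up purely for illustration.

```python
import numpy as np

# hypothetical single example with n_x = 2 features
x = np.array([0.5, -1.2])
w = np.array([0.8, 0.3])   # weight vector, same size as x
b = 0.1                    # scalar bias

z = np.dot(w, x) + b                # linear part: z = w^T x + b
y_hat = 1.0 / (1.0 + np.exp(-z))    # sigmoid squashes z into (0, 1)
print(z, y_hat)
```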
Sigmoid function and its derivative
The sigmoid function, a smooth S-shaped mathematical function, is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which has the properties

$$\lim_{z \to -\infty} \sigma(z) = 0, \quad \sigma(0) = 0.5, \quad \lim_{z \to +\infty} \sigma(z) = 1$$

The derivative of the sigmoid is,

$$\frac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$$
From the plots of the sigmoid derivative, two key observations:
- Vanishing gradients: for very large or very small $z$, the derivative approaches 0, causing gradients to vanish during backpropagation; this slows or stalls learning in deep networks.
- Low maximum gradient: the maximum value of the derivative is 0.25 (at $z = 0$), which caps the gradient flow, making it harder for deep layers to effectively update their weights.
As mentioned in the article Yes you should understand backprop by Andrej Karpathy, these aspects have to be kept in mind when using sigmoid for training deeper neural networks.
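To see both observations numerically, here is a small sketch (assuming NumPy) that evaluates the sigmoid and its derivative at a few illustrative points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))             # approaches 0 and 1 at the extremes
print(sigmoid_derivative(z))  # peaks at 0.25 for z = 0, ~0 for large |z|
```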
Loss function for binary classification
Maximum Likelihood Estimation
Let us assume that the probability of the output being 1, given the input $x$ and parameters $w$, $b$, is,

$$P(y = 1 \mid x; w, b) = \hat{y}$$

Then, for binary classification, the probability of the output being 0 is,

$$P(y = 0 \mid x; w, b) = 1 - \hat{y}$$

Since $y$ can either be 0 or 1, we can compactly write the likelihood as:

$$P(y \mid x; w, b) = \hat{y}^{\,y}\,(1 - \hat{y})^{(1 - y)}$$

The likelihood function is the probability of the actual label $y$ given the prediction $\hat{y}$. When the training examples are independent and identically distributed (i.i.d.), the total likelihood for the dataset is the product of the likelihoods of the individual examples. With this assumption, for $m$ training examples, the likelihood for the parameters $w$ and $b$ is,

$$L(w, b) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1 - \hat{y}^{(i)}\right)^{(1 - y^{(i)})}$$
Log Likelihood
To avoid the product of many small numbers, we take the natural logarithm of the likelihood function. The log-likelihood for the entire dataset is the sum of the log-likelihoods for each example:

$$\log L(w, b) = \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
Negative Log Likelihood
Since optimizers like gradient descent are designed to minimize functions, we minimize the negative log-likelihood instead of maximizing the log-likelihood:

$$-\log L(w, b) = -\sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
Averaging the Loss
Averaging the loss ensures that the total loss remains on the same scale, regardless of the size of the training dataset. This is important because it allows the use of a fixed learning rate across different dataset sizes, leading to more stable and consistent optimization behaviour.
The averaged negative log-likelihood is defined as:

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
This expression is known as the Binary Cross-Entropy (BCE) Loss, which is widely used in binary classification tasks. This function is available in the PyTorch library as torch.nn.BCELoss (refer entry on BCELoss in PyTorch).
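As a quick sanity check, a small sketch comparing the averaged negative log-likelihood written out by hand against torch.nn.BCELoss; the probabilities and labels below are arbitrary illustrative values.

```python
import torch

y_hat = torch.tensor([0.9, 0.2, 0.7, 0.4])  # estimated probabilities
y     = torch.tensor([1.0, 0.0, 1.0, 0.0])  # true labels

# averaged negative log-likelihood written out explicitly
manual_bce = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

# PyTorch's built-in version (default reduction is 'mean')
torch_bce = torch.nn.BCELoss()(y_hat, y)

print(manual_bce.item(), torch_bce.item())  # expected to match closely
```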
Gradients with Binary Cross Entropy (BCE) Loss
The system model for binary classification involves multiple steps:
- firstly, the variable $z$ is defined as a linear function of $x$ using parameters $w$, $b$, i.e. $z = w^{T}x + b$
- then $z$ gets transformed into an estimated probability score $\hat{y}$ using the sigmoid function, $\hat{y} = \sigma(z)$
- lastly, using the true label $y$ and the estimated probability score $\hat{y}$, the binary cross entropy loss $\mathcal{L}(\hat{y}, y)$ is computed
For performing gradient descent on the parameters, the goal is to find the gradients of the loss w.r.t. the parameters $w$ and $b$. To find the gradients, we go in the reverse order, i.e.
- firstly, gradients of the loss $\mathcal{L}$ w.r.t. the estimated probability score $\hat{y}$
- then gradients of the probability score $\hat{y}$ w.r.t. the output of the linear function $z$
- lastly, gradients of the output of the linear function $z$ w.r.t. the parameters $w$, $b$

The product of all the individual gradients then forms the gradient of the loss w.r.t. the parameters. This is written as,

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial b}$$
Calculating the gradients in this reverse order, from the loss back to the parameters, is an application of the chain rule from calculus (refer wiki entry on Chain Rule). This method is the foundation of backpropagation used in training models (refer wiki entry on Backpropagation).
Deriving the gradients
For simplicity, take a single training example and compute the gradients step by step.
Step 1: Gradients of loss w.r.t. probability score
With the loss $\mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$, the derivative of the loss w.r.t. the sigmoid output $\hat{y}$ is,

$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$
Step 2: Gradients of probability score w.r.t. output of linear function
With $\hat{y} = \sigma(z)$ as the output of the sigmoid function, the derivative is

$$\frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1 - \hat{y})$$
Step 3: Gradients of output of linear function w.r.t. parameters
With $z = w^{T}x + b$, the derivative is,

$$\frac{\partial z}{\partial w} = x$$

Similarly,

$$\frac{\partial z}{\partial b} = 1$$
Gradients of loss w.r.t. parameters
Taking the product of the gradients from all the steps,

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = \left( \hat{y} - y \right) x$$

Similarly,

$$\frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y$$
The intuition from the above equations is: if the estimated probability $\hat{y}$ is close to the true value $y$, then the gradient is small, and the update to the parameters is correspondingly smaller. If you recall, the gradients for linear regression (refer post on Gradients for Linear Regression) follow a similar intuitive explanation.
Note: With $m$ training examples the loss is averaged, and this becomes:

$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
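A quick way to sanity-check the single-example results $\partial\mathcal{L}/\partial w = (\hat{y}-y)\,x$ and $\partial\mathcal{L}/\partial b = \hat{y}-y$ is to compare them against a finite-difference approximation; the example values below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x, y = np.array([0.5, -1.2]), 1.0
w, b = np.array([0.8, 0.3]), 0.1

y_hat = sigmoid(np.dot(w, x) + b)
dw_analytic = (y_hat - y) * x       # from the chain rule derivation
db_analytic = (y_hat - y)

# central finite differences for comparison
eps = 1e-6
dw_numeric = np.array([
    (loss(w + eps * np.eye(len(w))[i], b, x, y) -
     loss(w - eps * np.eye(len(w))[i], b, x, y)) / (2 * eps)
    for i in range(len(w))
])
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(dw_analytic, dw_numeric)  # should agree to several decimal places
print(db_analytic, db_numeric)
```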
Vectorised operations
The $m$ training examples, each having $n_x$ features, are represented as,

$$X = \begin{bmatrix} \left(x^{(1)}\right)^{T} \\ \left(x^{(2)}\right)^{T} \\ \vdots \\ \left(x^{(m)}\right)^{T} \end{bmatrix} \in \mathbb{R}^{m \times n_x}$$

The output is,

$$y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}^{T} \in \mathbb{R}^{m}$$

The parameters $w$ and $b$ are represented as,

$$w = \begin{bmatrix} w_1 & w_2 & \cdots & w_{n_x} \end{bmatrix}^{T}$$

where,
$w$ is the weight vector of size $n_x$, i.e. $w \in \mathbb{R}^{n_x}$, and
$b$ is a scalar.

The estimated output is,

$$\hat{y} = \sigma\left(Xw + b\right) \in \mathbb{R}^{m}$$

Gradients
The gradient w.r.t. $w$ can be represented in matrix operations as,

$$\frac{\partial J}{\partial w} = \frac{1}{m} X^{T} \left( \hat{y} - y \right)$$

Similarly, for the bias term $b$,

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
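Under the row-per-example convention used above (X of shape (m, n_x)), a minimal NumPy sketch of the vectorised forward pass and gradients; the data and labels below are random, purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n_x = 8, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n_x))            # one training example per row
y = (rng.random(m) > 0.5).astype(float)  # arbitrary 0/1 labels

w = np.zeros(n_x)
b = 0.0

y_hat = sigmoid(X @ w + b)   # forward pass, shape (m,)
dw = X.T @ (y_hat - y) / m   # gradient w.r.t. w, shape (n_x,)
db = np.mean(y_hat - y)      # gradient w.r.t. b, scalar
print(dw, db)
```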
Gradients computed numerically vs PyTorch
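A minimal sketch of how such a comparison can be done: compute the analytic gradients from the vectorised expressions above and check them against torch.autograd on the same data (the data here is randomly generated, just for the check).

```python
import numpy as np
import torch

rng = np.random.default_rng(1)
X_np = rng.normal(size=(16, 2))
y_np = (rng.random(16) > 0.5).astype(np.float64)
w_np = rng.normal(size=2)
b_np = 0.0

# analytic gradients from the vectorised expressions above
y_hat_np = 1.0 / (1.0 + np.exp(-(X_np @ w_np + b_np)))
dw_manual = X_np.T @ (y_hat_np - y_np) / len(y_np)
db_manual = np.mean(y_hat_np - y_np)

# same computation with PyTorch autograd
X = torch.tensor(X_np)
y = torch.tensor(y_np)
w = torch.tensor(w_np, requires_grad=True)
b = torch.tensor(b_np, dtype=torch.float64, requires_grad=True)

loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(X @ w + b), y)
loss.backward()

print(dw_manual, w.grad.numpy())   # expected to be numerically very close
print(db_manual, b.grad.item())
```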
Training – Binary Classification
Below is an example of training a binary classifier based on the model and gradient descent described above. Synthetic training data is generated from two independent Gaussian random variables with zero mean and unit variance. The mean is shifted by (-2, -2) for half the samples and by (+2, +2) for the remaining half, corresponding to class 0 and class 1 respectively.
The training loop is run both using the numerically computed gradients and using torch.autograd provided by PyTorch, and one can see that both are numerically very close.
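A condensed sketch of such a training loop using the manually computed gradients; the learning rate, iteration count, and random seed below are illustrative choices, not necessarily those used in the original experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
m = 1000  # total number of examples

# two unit-variance Gaussian clusters, means shifted to (-2,-2) and (+2,+2)
X0 = rng.normal(size=(m // 2, 2)) + np.array([-2.0, -2.0])   # class 0
X1 = rng.normal(size=(m // 2, 2)) + np.array([+2.0, +2.0])   # class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(m // 2), np.ones(m // 2)])

w, b = np.zeros(2), 0.0
learning_rate, num_iterations = 0.1, 500

for _ in range(num_iterations):
    y_hat = sigmoid(X @ w + b)      # forward pass
    dw = X.T @ (y_hat - y) / m      # gradients from the derivation above
    db = np.mean(y_hat - y)
    w -= learning_rate * dw         # gradient descent update
    b -= learning_rate * db

y_pred = (sigmoid(X @ w + b) >= 0.5).astype(float)  # threshold at 0.5
print("training error rate:", np.mean(y_pred != y))
```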
The estimated probability score indicates the likelihood that the given input corresponds to one of the classes. As can be seen in the plot Predicted Probability for Each Input, inputs close to the center point (0, 0) have a probability close to 0.5, and as we move away from the center the probabilities tend towards either 0 or 1.
To convert this probability into a class label, a decision threshold needs to be applied. In this example, as can be seen in the plot of Classification Error vs Threshold, the threshold of 0.5 corresponds to the lowest error rate.
However, there are other scenarios where a threshold of 0.5 can be inappropriate, such as imbalanced datasets or skewed class distributions. These require adjusting the threshold for better performance.
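A short sketch of how the classification error can be swept over candidate thresholds; the probabilities and labels below are illustrative stand-ins, whereas in the post they come from the trained classifier.

```python
import numpy as np

# illustrative stand-ins for the trained model's outputs
rng = np.random.default_rng(0)
y = (rng.random(1000) > 0.5).astype(float)
y_hat = np.clip(y + rng.normal(scale=0.3, size=1000), 0.0, 1.0)  # noisy probabilities

thresholds = np.linspace(0.05, 0.95, 19)
errors = [np.mean((y_hat >= t).astype(float) != y) for t in thresholds]
print("threshold with lowest error:", thresholds[int(np.argmin(errors))])
```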
Summary
The post covers the following key aspects
- Loss function based on Maximum Likelihood Estimate
- Computation of the gradients based on the chain rule of derivatives
- Vectorized operations implementing all computations using efficient matrix and vector math
- Training loop for binary classification using both manually computed and PyTorch-based gradients
Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section. 🙂