In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where the output takes only two discrete values, 0 or 1, the sigmoid function can be used to transform the output of a linear regression model into a value between 0 and 1, squashing the continuous prediction into a probability-like score. This score can then be interpreted as the likelihood of the output being class 1, with a threshold (commonly 0.5) used to decide between class 0 and class 1.
In this post, the intuition for the loss function for binary classification, based on the Maximum Likelihood Estimate (MLE), is explained. We then derive the gradients for the model parameters using the chain rule. Gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, a training loop using gradient descent is implemented for a binary classification problem with two-dimensional Gaussian distributed data.
As always, the CS229 Lecture Notes and the notation used in the Deep Learning Specialization (C1W1L01) course from Dr Andrew Ng form the key references.
Model
Let us take an example of estimating $y \in \{0, 1\}$ based on a feature vector $x$ having $n_x$ features, i.e. $x \in \mathbb{R}^{n_x}$. There are $m$ examples. Let us assume that the variable $z$ is defined as a linear function of $x$. Then $z$ gets transformed into a probability score $\hat{y}$ using the sigmoid function. For a single training example, this can be written as:

$$z = w^{T}x + b$$

where,
$w$ is the weight vector of size $n_x$, i.e. $w \in \mathbb{R}^{n_x}$, and
$b$ is a scalar.

To convert the real number $z$ to a number $\hat{y}$ lying between 0 and 1, let us define

$$\hat{y} = \sigma(z)$$

where $\sigma(\cdot)$ is the sigmoid function (refer wiki entry on sigmoid function).
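A minimal sketch of this forward pass for a single training example follows; the feature values and parameters below are made up purely for illustration.

```python
import numpy as np

# hypothetical single example with n_x = 2 features
x = np.array([0.5, -1.2])
w = np.array([0.8, 0.3])   # weight vector, same size as x
b = 0.1                    # scalar bias

z = np.dot(w, x) + b                # linear part: z = w^T x + b
y_hat = 1.0 / (1.0 + np.exp(-z))    # sigmoid squashes z into (0, 1)
print(z, y_hat)
```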
Sigmoid function and its derivative
The sigmoid function, a smooth S-shaped mathematical function, is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which has the properties

$$\lim_{z \to -\infty} \sigma(z) = 0, \quad \sigma(0) = 0.5, \quad \lim_{z \to +\infty} \sigma(z) = 1$$

The derivative of the sigmoid is,

$$\frac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$$
From the plots of the sigmoid derivative, two key observations:
- Vanishing gradients: for very large or very small $z$, the derivative approaches 0, causing gradients to vanish during backpropagation; this slows or stalls learning in deep networks.
- Low maximum gradient: the maximum value of the derivative is 0.25 (at $z = 0$), which caps the gradient flow, making it harder for deep layers to effectively update their weights.
As mentioned in the article Yes you should understand backprop by Andrej Karpathy, these aspects have to be kept in mind when using sigmoid for training deeper neural networks.
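To see both observations numerically, here is a small sketch (assuming NumPy) that evaluates the sigmoid and its derivative at a few illustrative points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))             # approaches 0 and 1 at the extremes
print(sigmoid_derivative(z))  # peaks at 0.25 for z = 0, ~0 for large |z|
```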
Loss function for binary classification
Maximum Likelihood Estimation
Let us assume that the probability of the output being 1, given the input $x$ and parameters $w$, $b$, is,

$$P(y = 1 \mid x; w, b) = \hat{y}$$

Then, for binary classification, the probability of the output being 0 is,

$$P(y = 0 \mid x; w, b) = 1 - \hat{y}$$

Since $y$ can either be 0 or 1, we can compactly write the likelihood as:

$$P(y \mid x; w, b) = \hat{y}^{\,y}\,(1 - \hat{y})^{(1 - y)}$$

The likelihood function is the probability of the actual label $y$ given the prediction $\hat{y}$. When the training examples are independent and identically distributed (i.i.d.), the total likelihood for the dataset is the product of the likelihoods of the individual examples. With this assumption, for $m$ training examples, the likelihood for the parameters $w$ and $b$ is,

$$L(w, b) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1 - \hat{y}^{(i)}\right)^{(1 - y^{(i)})}$$
Log Likelihood
To avoid the product of many small numbers, we take the natural logarithm of the likelihood function. The log-likelihood for the entire dataset is the sum of the log-likelihoods for each example:

$$\log L(w, b) = \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
Negative Log Likelihood
Since optimizers like gradient descent are designed to minimize functions, we minimize the negative log-likelihood instead of maximizing the log-likelihood:

$$-\log L(w, b) = -\sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
Averaging the Loss
Averaging the loss ensures that the total loss remains on the same scale, regardless of the size of the training dataset. This is important because it allows the use of a fixed learning rate across different dataset sizes, leading to more stable and consistent optimization behaviour.
The averaged negative log-likelihood is defined as:

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
This expression is known as the Binary Cross-Entropy (BCE) Loss, which is widely used in binary classification tasks. This function is available in the PyTorch library as torch.nn.BCELoss (refer entry on BCELoss in PyTorch).
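As a quick sanity check, a small sketch comparing the averaged negative log-likelihood written out by hand against torch.nn.BCELoss; the probabilities and labels below are arbitrary illustrative values.

```python
import torch

y_hat = torch.tensor([0.9, 0.2, 0.7, 0.4])  # estimated probabilities
y     = torch.tensor([1.0, 0.0, 1.0, 0.0])  # true labels

# averaged negative log-likelihood written out explicitly
manual_bce = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

# PyTorch's built-in version (default reduction is 'mean')
torch_bce = torch.nn.BCELoss()(y_hat, y)

print(manual_bce.item(), torch_bce.item())  # expected to match closely
```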
Gradients with Binary Cross Entropy (BCE) Loss
The system model for binary classification involves multiple steps:
- firstly, the variable $z$ is defined as a linear function of $x$ using parameters $w$, $b$, i.e. $z = w^{T}x + b$
- then $z$ gets transformed into an estimated probability score $\hat{y}$ using the sigmoid function, $\hat{y} = \sigma(z)$
- lastly, using the true label $y$ and the estimated probability score $\hat{y}$, the binary cross entropy loss $\mathcal{L}(\hat{y}, y)$ is computed
For performing gradient descent on the parameters, the goal is to find the gradients of the loss w.r.t. the parameters $w$ and $b$. To find the gradients, we go in the reverse order, i.e.
- firstly, gradients of the loss $\mathcal{L}$ w.r.t. the estimated probability score $\hat{y}$
- then gradients of the probability score $\hat{y}$ w.r.t. the output of the linear function $z$
- lastly, gradients of the output of the linear function $z$ w.r.t. the parameters $w$, $b$

The product of all the individual gradients then forms the gradient of the loss w.r.t. the parameters. This is written as,

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial b}$$
Calculating the gradients in this reverse order, from the loss back to the parameters, is an application of the chain rule from calculus (refer wiki entry on Chain Rule). This method is the foundation of backpropagation used in training models (refer wiki entry on Backpropagation).
Deriving the gradients
For simplicity, take a single training example and compute the gradients step by step.
Step 1: Gradients of loss w.r.t. probability score
With the loss $\mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$, the derivative of the loss w.r.t. the sigmoid output $\hat{y}$ is,

$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$
Step 2: Gradients of probability score w.r.t. output of linear function
With $\hat{y} = \sigma(z)$ as the output of the sigmoid function, the derivative is

$$\frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1 - \hat{y})$$
Step 3: Gradients of output of linear function w.r.t. parameters
With $z = w^{T}x + b$, the derivative is,

$$\frac{\partial z}{\partial w} = x$$

Similarly,

$$\frac{\partial z}{\partial b} = 1$$
Gradients of loss w.r.t. parameters
Taking the product of the gradients from all the steps,

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = \left( \hat{y} - y \right) x$$

Similarly,

$$\frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y$$
The intuition from the above equations is: if the estimated probability $\hat{y}$ is close to the true value $y$, then the gradient is small, and the update to the parameters is correspondingly smaller. If you recall, the gradients for linear regression (refer post on Gradients for Linear Regression) follow a similar intuitive explanation.
Note: With $m$ training examples the loss is averaged, and this becomes:

$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
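A quick way to sanity-check the single-example results $\partial\mathcal{L}/\partial w = (\hat{y}-y)\,x$ and $\partial\mathcal{L}/\partial b = \hat{y}-y$ is to compare them against a finite-difference approximation; the example values below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x, y = np.array([0.5, -1.2]), 1.0
w, b = np.array([0.8, 0.3]), 0.1

y_hat = sigmoid(np.dot(w, x) + b)
dw_analytic = (y_hat - y) * x       # from the chain rule derivation
db_analytic = (y_hat - y)

# central finite differences for comparison
eps = 1e-6
dw_numeric = np.array([
    (loss(w + eps * np.eye(len(w))[i], b, x, y) -
     loss(w - eps * np.eye(len(w))[i], b, x, y)) / (2 * eps)
    for i in range(len(w))
])
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(dw_analytic, dw_numeric)  # should agree to several decimal places
print(db_analytic, db_numeric)
```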
Vectorised operations
The $m$ training examples, each having $n_x$ features, are represented as,

$$X = \begin{bmatrix} \left(x^{(1)}\right)^{T} \\ \left(x^{(2)}\right)^{T} \\ \vdots \\ \left(x^{(m)}\right)^{T} \end{bmatrix} \in \mathbb{R}^{m \times n_x}$$

The output is,

$$y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}^{T} \in \mathbb{R}^{m}$$

The parameters $w$ and $b$ are represented as,

$$w = \begin{bmatrix} w_1 & w_2 & \cdots & w_{n_x} \end{bmatrix}^{T}$$

where,
$w$ is the weight vector of size $n_x$, i.e. $w \in \mathbb{R}^{n_x}$, and
$b$ is a scalar.

The estimated output is,

$$\hat{y} = \sigma\left(Xw + b\right) \in \mathbb{R}^{m}$$

Gradients
The gradient w.r.t. $w$ can be represented in matrix operations as,

$$\frac{\partial J}{\partial w} = \frac{1}{m} X^{T} \left( \hat{y} - y \right)$$

Similarly, for the bias term $b$,

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
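Under the row-per-example convention used above (X of shape (m, n_x)), a minimal NumPy sketch of the vectorised forward pass and gradients; the data and labels below are random, purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n_x = 8, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n_x))            # one training example per row
y = (rng.random(m) > 0.5).astype(float)  # arbitrary 0/1 labels

w = np.zeros(n_x)
b = 0.0

y_hat = sigmoid(X @ w + b)   # forward pass, shape (m,)
dw = X.T @ (y_hat - y) / m   # gradient w.r.t. w, shape (n_x,)
db = np.mean(y_hat - y)      # gradient w.r.t. b, scalar
print(dw, db)
```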
Gradients computed numerically vs PyTorch
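A minimal sketch of how such a comparison can be done: compute the analytic gradients from the vectorised expressions above and check them against torch.autograd on the same data (the data here is randomly generated, just for the check).

```python
import numpy as np
import torch

rng = np.random.default_rng(1)
X_np = rng.normal(size=(16, 2))
y_np = (rng.random(16) > 0.5).astype(np.float64)
w_np = rng.normal(size=2)
b_np = 0.0

# analytic gradients from the vectorised expressions above
y_hat_np = 1.0 / (1.0 + np.exp(-(X_np @ w_np + b_np)))
dw_manual = X_np.T @ (y_hat_np - y_np) / len(y_np)
db_manual = np.mean(y_hat_np - y_np)

# same computation with PyTorch autograd
X = torch.tensor(X_np)
y = torch.tensor(y_np)
w = torch.tensor(w_np, requires_grad=True)
b = torch.tensor(b_np, dtype=torch.float64, requires_grad=True)

loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(X @ w + b), y)
loss.backward()

print(dw_manual, w.grad.numpy())   # expected to be numerically very close
print(db_manual, b.grad.item())
```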
Training – Binary Classification
Below is an example of training a binary classifier based on the model and gradient descent described above. Synthetic training data is generated from two independent Gaussian random variables with zero mean and unit variance. The mean is shifted by (-2, -2) for half the samples and by (+2, +2) for the remaining half, corresponding to class 0 and class 1 respectively.
The training loop is run both using the numerically computed gradients and using torch.autograd provided by PyTorch, and one can see that both are numerically very close.
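A condensed sketch of such a training loop using the manually computed gradients; the learning rate, iteration count, and random seed below are illustrative choices, not necessarily those used in the original experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
m = 1000  # total number of examples

# two unit-variance Gaussian clusters, means shifted to (-2,-2) and (+2,+2)
X0 = rng.normal(size=(m // 2, 2)) + np.array([-2.0, -2.0])   # class 0
X1 = rng.normal(size=(m // 2, 2)) + np.array([+2.0, +2.0])   # class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(m // 2), np.ones(m // 2)])

w, b = np.zeros(2), 0.0
learning_rate, num_iterations = 0.1, 500

for _ in range(num_iterations):
    y_hat = sigmoid(X @ w + b)      # forward pass
    dw = X.T @ (y_hat - y) / m      # gradients from the derivation above
    db = np.mean(y_hat - y)
    w -= learning_rate * dw         # gradient descent update
    b -= learning_rate * db

y_pred = (sigmoid(X @ w + b) >= 0.5).astype(float)  # threshold at 0.5
print("training error rate:", np.mean(y_pred != y))
```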
The estimated probability score indicates the likelihood that the given input corresponds to one of the classes. As can be seen in the plot Predicted Probability for Each Input, inputs close to the center point (0, 0) have a probability close to 0.5, and as we move away from the center the probabilities tend towards either 0 or 1.
To convert this probability into a class label, a decision threshold needs to be applied. In this example, as can be seen in the plot of Classification Error vs Threshold, the threshold of 0.5 corresponds to the lowest error rate.
However, there are other scenarios where a threshold of 0.5 can be inappropriate, such as imbalanced datasets or skewed class distributions. These require adjusting the threshold for better performance.
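A short sketch of how the classification error can be swept over candidate thresholds; the probabilities and labels below are illustrative stand-ins, whereas in the post they come from the trained classifier.

```python
import numpy as np

# illustrative stand-ins for the trained model's outputs
rng = np.random.default_rng(0)
y = (rng.random(1000) > 0.5).astype(float)
y_hat = np.clip(y + rng.normal(scale=0.3, size=1000), 0.0, 1.0)  # noisy probabilities

thresholds = np.linspace(0.05, 0.95, 19)
errors = [np.mean((y_hat >= t).astype(float) != y) for t in thresholds]
print("threshold with lowest error:", thresholds[int(np.argmin(errors))])
```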
Summary
The post covers the following key aspects
- Loss function based on Maximum Likelihood Estimate
- Computation of the gradients based on the chain rule of derivatives
- Vectorized operations implementing all computations using efficient matrix and vector math
- Training loop for binary classification using both manually computed and PyTorch-based gradients
Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section. 🙂