Gradients for multi class classification with Softmax

In a multi-class classification problem, the output (also called the label or class) takes a finite set of discrete values. In this post, the system model for multi-class classification, a linear layer followed by a softmax layer, is defined. The softmax function transforms the output of the linear layer into values lying between 0 and 1, which can be interpreted as probability scores.
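
As a minimal NumPy sketch (not taken from the post itself), the softmax transform of a vector of linear-layer outputs might look like this; the logits `z` are an illustrative example:

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating for numerical stability;
    # the result is unchanged because softmax is shift-invariant
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # example logits from a linear layer
p = softmax(z)
# each entry of p lies in (0, 1) and the entries sum to 1,
# so p can be read as a probability distribution over the classes
```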

Next, the loss function based on categorical cross-entropy is explained, and the gradients for the model parameters are derived using the chain rule. The analytically computed gradients are then compared with those obtained from the deep learning framework PyTorch. Finally, we implement a training loop using gradient descent for a toy multi-class classification task with 2D Gaussian-distributed data.
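
The chain rule through softmax and categorical cross-entropy yields the well-known compact gradient with respect to the logits, `p - y`. A small sketch illustrating this (the logits and label here are made-up examples, not values from the post):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.5, 0.3, -0.8])   # logits for one sample
y = np.array([0.0, 1.0, 0.0])    # one-hot true label (class 1)

p = softmax(z)
loss = -np.sum(y * np.log(p))    # categorical cross-entropy

# the chain rule through softmax collapses to this compact form
grad_z = p - y
```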

Continue reading “Gradients for multi class classification with Softmax”

Gradients for Binary Classification with Sigmoid

In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where the output takes only two discrete values, 0 or 1, the sigmoid function can be used to transform the output of a linear regression model into a value between 0 and 1, squashing the continuous prediction into a probability-like score. This score can then be interpreted as the likelihood of the output being class 1, with a threshold (commonly 0.5) used to decide between class 0 and class 1.
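
A minimal sketch of this squash-and-threshold step (the input value is an illustrative example, not from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.8                        # output of a linear model for one sample
p = sigmoid(z)                 # probability-like score in (0, 1)
label = 1 if p >= 0.5 else 0   # threshold at 0.5 to pick class 0 or 1
```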

In this post, the intuition behind the loss function for binary classification, based on the Maximum Likelihood Estimate (MLE), is explained. We then derive the gradients for the model parameters using the chain rule. The analytically computed gradients are compared against gradients computed using the deep learning framework PyTorch. Finally, a training loop using gradient descent is implemented for a binary classification problem with two-dimensional Gaussian-distributed data.
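
The MLE view treats the label as a Bernoulli outcome, so the loss is the negative log-likelihood (binary cross-entropy), and the chain rule through the sigmoid again collapses to `p - y`. A small sketch under made-up example values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7    # linear-model output for one sample
y = 1.0    # true binary label
p = sigmoid(z)

# negative log-likelihood of a Bernoulli outcome (binary cross-entropy)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# chain rule through the sigmoid gives the compact gradient p - y
grad_z = p - y
```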

Continue reading “Gradients for Binary Classification with Sigmoid”

Gradients for linear regression

Understanding gradients is essential in machine learning, as they indicate the direction and rate of change of the loss function with respect to the model parameters. This post covers the gradients for the vanilla linear regression case, taking two loss functions, Mean Squared Error (MSE) and Mean Absolute Error (MAE), as examples.
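
For a single-feature model y = w·x + b, the MSE and MAE gradients can be written down directly. A minimal sketch with made-up toy data (not the post's data):

```python
import numpy as np

# toy data generated by w = 2, b = 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, b = 0.5, 0.0

err = w * x + b - y   # prediction error for each sample
n = len(x)

# MSE = mean(err^2); gradients via the chain rule
dw_mse = (2.0 / n) * np.sum(err * x)
db_mse = (2.0 / n) * np.sum(err)

# MAE = mean(|err|); the (sub)gradient uses sign(err)
dw_mae = np.mean(np.sign(err) * x)
db_mae = np.mean(np.sign(err))
```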

The analytically computed gradients are compared against gradients computed using the deep learning framework PyTorch. Further, using these gradients, a training loop using gradient descent is implemented for the simplest example: fitting a straight line.
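
A gradient-descent loop for the straight-line fit might be sketched as follows; the data, learning rate, and iteration count here are illustrative choices, not the post's:

```python
import numpy as np

# noisy samples of the line y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = w * x + b - y
    # MSE gradients, followed by a gradient-descent step
    w -= lr * 2.0 * np.mean(err * x)
    b -= lr * 2.0 * np.mean(err)
# w and b should end up close to 2 and 1
```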

As always, the contents of the CS229 Lecture Notes and the notations used in the course Deep Learning Specialization C1W1L01 by Dr. Andrew Ng form the key references.

Continue reading “Gradients for linear regression”