Loss functions for handling class imbalance

Multiple strategies have emerged for handling class imbalance. This post covers Weighted Cross-Entropy, Focal Loss, Asymmetric Loss, Class-Balanced Loss and Logit-Adjusted Loss.

Most real-world datasets have class imbalance, where a “majority” class dwarfs the “minority” samples. Typical examples include identifying rare pathologies in medical diagnosis, flagging anomalous transactions for fraud detection, or detecting sparse foreground objects against a vast background in computer vision.

The machine learning models we have discussed – binary classification (refer post Gradients for Binary Classification with Sigmoid) and multiclass classification (refer post Gradients for multi class classification with Softmax) – need tweaks to learn from such imbalanced datasets. Without these adjustments, a model can “cheat” by favouring the majority class, reporting a deceptively high overall accuracy even though the accuracy on the minority classes is low.

Different strategies have emerged over the years, and in this article we are covering the approaches listed below.

  1. Weighted cross entropy
    • Foundational baseline that adds a class-specific weight factor to the standard cross-entropy loss, scaling the loss of each sample inversely with the frequency of its class.
  2. Focal Loss for Dense Object Detection, Lin et al. (2017)
    • Proposes a modulating factor to the cross-entropy loss that down-weights easy/frequent examples, which indirectly forces the model to focus on hard/rare examples.
  3. Asymmetric Loss for Multi-Label Classification, Ridnik et al. (2021)
    • Extends the intuition of Focal Loss with independent hyper-parameters for positive and negative samples. This allows more aggressive down-weighting of easy/frequent examples while preserving the gradient signal for hard/rare samples.
    • Additionally, the authors introduce a probability margin that explicitly zeros out the loss from easy negative samples.
  4. Class-Balanced Loss Based on Effective Number of Samples, Cui et al. (CVPR 2019)
    • Based on the intuition that samples within a class overlap in the information they provide, the authors propose a framework to capture the diminishing benefit of adding more data samples to a class.
  5. Long-tail Learning via Logit Adjustment, Menon et al. (ICLR 2021)
    • Grounded in Bayes’ rule, the authors propose adding a class-dependent offset, derived from the class prior probabilities, to the logits. This helps the model minimise the balanced error rate (the average of per-class error rates) instead of the global error rate.
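To make the modulating-factor idea concrete, here is a minimal sketch of binary Focal Loss following Lin et al. (2017), written in PyTorch. The toy logits and targets are illustrative assumptions, not values from the post; with gamma = 0 and alpha = 0.5 it reduces to a scaled binary cross-entropy.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples by (1 - p_t)^gamma,
    where p_t is the predicted probability of the true class; alpha
    balances the positive and negative classes."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Toy example (illustrative values)
logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = focal_loss(logits, targets)
```

Confidently classified samples have p_t close to 1, so their factor (1 - p_t)^gamma is near zero and they contribute little to the gradient.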
Continue reading “Loss functions for handling class imbalance”

Word Embeddings using neural networks

The post covers various neural network based word embedding models, starting from the Neural Probabilistic Language Model of Bengio et al. (2003), then the reduction of its complexity using hierarchical softmax and Noise Contrastive Estimation, and later works such as CBoW, Skip-Gram with Negative Sampling, and GloVe, which enabled training on much larger datasets.

In machine learning, converting input data (text, images, or time series) into a vector format (also known as embeddings) forms a key building block for enabling downstream tasks. This article explores in detail the architectures of some of the neural network based word embedding models in the literature.

Papers referred :

  1. Neural Probabilistic Language Model, Bengio et al 2003
    • proposed a neural network architecture to jointly learn word feature vectors and the probability of words in a sequence.
  2. Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
    • given that computing the softmax over the vocabulary scales linearly with vocabulary size, proposed a hierarchical version of softmax to reduce the complexity from O(|V|) to O(log |V|).
  3. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012
    • recast density estimation as a binary classification problem between data samples and noise samples, avoiding the expensive normalisation term.
  4. Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013.
    • proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpus of data.
    • introduced Continuous Bag of Words (CBOW), which predicts the center word given the context, and Skip-Gram, which predicts the surrounding words given the center word.
  5. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013
    • showed that subsampling of frequent words speeds up training and improves the accuracy of the representations of less-frequent words
    • introduced Negative Sampling, a simplified variant of Noise Contrastive Estimation
  6. GloVe: Global Vectors for Word Representation, Pennington et al 2014
    • proposes that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities.
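As a concrete illustration of the Skip-Gram with Negative Sampling objective from Mikolov et al. (2013), here is a minimal PyTorch sketch of the per-pair loss. The vector dimension and the random vectors are illustrative assumptions; a real model would look these vectors up from trainable embedding tables.

```python
import torch

def sgns_loss(center_vec, pos_vec, neg_vecs):
    """Skip-gram negative-sampling objective for one (center, context) pair:
    maximise log sigma(v_pos . v_c) + sum_k log sigma(-v_neg_k . v_c).
    Returned negated, so gradient descent minimises it."""
    pos_score = torch.dot(pos_vec, center_vec)
    neg_scores = neg_vecs @ center_vec       # one score per noise word
    return -(torch.log(torch.sigmoid(pos_score))
             + torch.log(torch.sigmoid(-neg_scores)).sum())

# Toy vectors (illustrative, normally rows of embedding matrices)
torch.manual_seed(0)
d = 8
center = torch.randn(d, requires_grad=True)
pos = torch.randn(d)        # true context word
negs = torch.randn(5, d)    # 5 sampled noise words
loss = sgns_loss(center, pos, negs)
loss.backward()             # gradient flows only into the sampled vectors
```

Because only the positive word and a handful of noise words appear in the loss, each update touches a few rows of the embedding matrix instead of the full vocabulary, which is what makes training on very large corpora tractable.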
Continue reading “Word Embeddings using neural networks”

Gradients for multi class classification with Softmax

In a multi-class classification problem, the output (also called the label or class) takes values from a finite set of discrete classes. In this post, the system model for multi-class classification, consisting of a linear layer followed by a softmax layer, is defined. The softmax function transforms the output of the linear layer into values lying between 0 and 1, which can be interpreted as probability scores.

Next, the categorical cross-entropy loss function is explained, and the gradients for the model parameters are derived using the chain rule. The analytically computed gradients are then compared with those obtained from the deep learning framework PyTorch. Finally, we implement a training loop using gradient descent for a toy multi-class classification task with 2D Gaussian-distributed data.
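The gradient comparison described above can be sketched in a few lines: for cross-entropy after a softmax, the gradient of the loss with respect to the logits is softmax(z) minus the one-hot label, which autograd should reproduce. The toy logits and label below are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn.functional as F

# Analytical gradient of cross-entropy w.r.t. the logits: softmax(z) - y_onehot
z = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
y = torch.tensor(1)                          # true class index

loss = F.cross_entropy(z.unsqueeze(0), y.unsqueeze(0))
loss.backward()                              # PyTorch autograd gradient

analytic = F.softmax(z.detach(), dim=0) - F.one_hot(y, num_classes=3).float()
assert torch.allclose(z.grad, analytic, atol=1e-6)
```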

Continue reading “Gradients for multi class classification with Softmax”

Gradients for Binary Classification with Sigmoid

In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where the output takes only two discrete values, 0 or 1, the sigmoid function can be used to transform the output of a linear model into a value between 0 and 1, squashing the continuous prediction into a probability-like score. This score can then be interpreted as the likelihood of the output being class 1, with a threshold (commonly 0.5) used to decide between class 0 and class 1.

In this post, the intuition for the binary classification loss function, based on Maximum Likelihood Estimation (MLE), is explained. We then derive the gradients for the model parameters using the chain rule. Gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, a training loop using gradient descent is implemented for a binary classification problem with two-dimensional Gaussian-distributed data.
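The analytic-versus-autograd check described above can be sketched as follows: for the MLE (binary cross-entropy) loss with a sigmoid, the gradient with respect to each logit is simply sigma(z) - y. The toy values are illustrative assumptions, not taken from the post.

```python
import torch

# For BCE loss L = -sum[ y log p + (1-y) log(1-p) ] with p = sigma(z),
# the analytic gradient is dL/dz = sigma(z) - y.
z = torch.tensor([0.7, -1.2, 2.0], requires_grad=True)
y = torch.tensor([1.0, 0.0, 1.0])

p = torch.sigmoid(z)
loss = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()
loss.backward()                              # autograd gradient

analytic = torch.sigmoid(z.detach()) - y
assert torch.allclose(z.grad, analytic, atol=1e-6)
```

The compact form sigma(z) - y is why the sigmoid pairs so naturally with cross-entropy: the sigmoid's derivative cancels against the log terms in the chain rule.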

Continue reading “Gradients for Binary Classification with Sigmoid”

Gradients for linear regression

Understanding gradients is essential in machine learning, as they indicate the direction and rate of change of the loss function with respect to the model parameters. This post covers the gradients for the vanilla Linear Regression case, taking two loss functions, Mean Square Error (MSE) and Mean Absolute Error (MAE), as examples.

The gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, using these gradients, a training loop with gradient descent is implemented for the simplest example of fitting a straight line.
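The comparison described above can be sketched for the MSE case: with the model y_hat = w*x + b, the analytic gradients are dL/dw = (2/n) * sum((y_hat - y) * x) and dL/db = (2/n) * sum(y_hat - y), which autograd should match. The synthetic line y = 3x + 1 and the starting parameters are illustrative assumptions.

```python
import torch

# Fit y_hat = w*x + b with MSE; compare analytic gradients to autograd.
torch.manual_seed(0)
x = torch.randn(10)
y = 3.0 * x + 1.0                            # toy straight-line data

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

loss = ((w * x + b - y) ** 2).mean()
loss.backward()                              # autograd gradients

resid = w.detach() * x + b.detach() - y      # residuals y_hat - y
assert torch.allclose(w.grad, 2.0 * (resid * x).mean(), atol=1e-6)
assert torch.allclose(b.grad, 2.0 * resid.mean(), atol=1e-6)
```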

As always, contents from the CS229 Lecture Notes and the notation used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng form key references.

Continue reading “Gradients for linear regression”