Loss functions for handling class imbalance

Multiple strategies have emerged for handling class imbalance. This post covers Weighted Cross Entropy, Focal Loss, Asymmetric Loss, Class-Balanced Loss and Logit Adjusted Loss.

Most real-world datasets have class imbalance, where a “majority” class dwarfs the “minority” samples. Typical examples include identifying rare pathologies in medical diagnosis, flagging anomalous transactions for fraud detection, or detecting sparse foreground objects against a vast background in computer vision, to name a few.

The machine learning models we have discussed – binary classification (refer to the post Gradients for Binary Classification with Sigmoid) or multiclass classification (refer to the post Gradients for multi class classification with Softmax) – need tweaks to learn from these imbalanced datasets. Without these adjustments, the models can “cheat” by favouring the majority class and report a pseudo-high accuracy even though the class-specific accuracy is low.

Different strategies have emerged over the years, and in this article we are covering the approaches listed below.

  1. Weighted cross entropy
    • Foundational baseline, where a class-specific weight factor is applied to the standard cross-entropy loss to weight the loss based on the frequency of each class.
  2. Focal Loss for Dense Object Detection, Lin et al. (2017)
    • Proposes a modulating factor on the cross-entropy loss to down-weight easy/frequent examples, which indirectly forces the model to focus on hard/rare examples.
  3. Asymmetric Loss for Multi-Label Classification, Ridnik et al. (2021)
    • Extends the intuition of Focal Loss by having independent hyper-parameters for positive and negative samples. This allows more aggressive “pushing down” of easy/frequent examples while preserving the gradient signal for hard/rare samples.
    • Additionally, the authors introduce a probability margin that explicitly zeros out the loss from easy/frequent samples.
  4. Class-Balanced Loss Based on Effective Number of Samples, Cui et al. (CVPR 2019)
    • Based on the intuition that there are similarities among the samples, the authors propose a framework to capture the diminishing benefit as more data samples are added to a class.
  5. Long-tail Learning via Logit Adjustment, Menon et al. (ICLR 2021)
    • Starting from Bayes’ rule, the authors propose that adding a class-dependent offset based on the class prior probabilities helps the model minimise the balanced error rate (the average of the per-class error rates) instead of the global error rate.
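To make the first two ideas concrete, here is a minimal NumPy sketch (an illustrative simplification, not the exact formulation from any of the papers) of weighted binary cross entropy and focal loss for per-sample probabilities:

```python
import numpy as np

def weighted_ce(p, y, w_pos=1.0, w_neg=1.0):
    # Weighted binary cross entropy:
    # L = -[w_pos * y * log(p) + w_neg * (1 - y) * log(1 - p)]
    # w_pos/w_neg are typically set inversely proportional to class frequency.
    return -(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))

def focal_loss(p, y, gamma=2.0):
    # Focal loss (Lin et al. 2017): the modulating factor (1 - p_t)^gamma
    # down-weights easy examples (p_t close to 1); gamma=0 recovers plain CE.
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)
```

For a confidently classified positive (p = 0.9), the focal loss is two orders of magnitude smaller than plain cross entropy, which is exactly the down-weighting of easy examples the paper describes.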
Continue reading “Loss functions for handling class imbalance”

Word Embeddings using neural networks

This post covers various neural network based word embedding models: starting from the Neural Probabilistic Language Model of Bengio et al. (2003), then the reduction of complexity using Hierarchical Softmax and Noise Contrastive Estimation, and later works like CBoW, Skip Gram, GloVe and Negative Sampling which enabled training on much larger amounts of data.

In machine learning, converting the input data (text, images, or time series) into a vector format (also known as embeddings) forms a key building block for enabling downstream tasks. This article explores in detail the architecture of some of the neural network based word embedding models in the literature.

Papers referred:

  1. Neural Probabilistic Language Model, Bengio et al 2003
    • proposed a neural network architecture to jointly learn word feature vectors and the probability of words in a sequence.
  2. Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
    • given that the softmax layer for finding the probability scales with the vocabulary size |V|, proposed a hierarchical version of softmax to reduce the complexity from O(|V|) to O(log |V|).
  3. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al. 2012
  4. Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013.
    • proposed simpler neural architectures with the intuition that simpler models enable training on a much larger corpus of data.
    • introduced Continuous Bag of Words (CBOW) to predict the center word given the context, and Skip Gram to predict the surrounding words given the center word.
  5. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013
    • the speedup provided by sub-sampling of frequent words also helps to improve the accuracy of the less-frequent words
    • introduced a simplified variant of Noise Contrastive Estimation called Negative Sampling
  6. GloVe: Global Vectors for Word Representation, Pennington et al 2014
    • propose that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities.
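As a small illustration of the negative sampling objective from item 5 (a sketch of the per-pair loss only, not a full training loop), the model maximises the score of an observed (center, context) pair against k sampled noise words:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, u_pos, u_negs):
    # Skip-gram with negative sampling (Mikolov et al. 2013):
    # minimise -log sigma(u_pos . v) - sum_k log sigma(-u_neg_k . v),
    # i.e. pull the true context vector toward the center word and
    # push the k sampled noise vectors away from it.
    pos = -np.log(sigmoid(u_pos @ v_center))
    neg = -np.sum(np.log(sigmoid(-(u_negs @ v_center))))
    return pos + neg
```

With an aligned context vector and anti-aligned noise vectors the loss is near zero; flipping them makes it large, which is the gradient signal that shapes the embeddings.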
Continue reading “Word Embeddings using neural networks”

Gradients for multi class classification with Softmax

In a multi class classification problem, the output (also called the label or class) takes values from a finite set of discrete values. In this post, the system model for a multi class classification with a linear layer followed by a softmax layer is defined. The softmax function transforms the output of the linear layer into values lying between 0 and 1, which can be interpreted as probability scores.

Next, the loss function using categorical cross entropy is explained and the gradients for the model parameters are derived using the chain rule. The analytically computed gradients are then compared with those obtained from the deep learning framework PyTorch. Finally, we implement a training loop using gradient descent for a toy multi-class classification task with 2D Gaussian-distributed data.
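The key result of that derivation, namely that the gradient of the cross-entropy loss with respect to the logits reduces to the softmax output minus the one-hot label, can be checked numerically in a few lines (a NumPy sketch standing in for the PyTorch comparison in the post):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_grad(z, y):
    # Analytic gradient of -log softmax(z)[y] w.r.t. the logits z:
    # it simplifies to p - one_hot(y) via the chain rule.
    g = softmax(z).copy()
    g[y] -= 1.0
    return g
```

Comparing `ce_grad` against a central finite-difference estimate of the loss confirms the analytic expression.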

Continue reading “Gradients for multi class classification with Softmax”

Gradients for Binary Classification with Sigmoid

In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where the output takes only two discrete values, 0 or 1, the sigmoid function can be used to transform the output of a linear regression model into a value between 0 and 1, squashing the continuous prediction into a probability-like score. This score can then be interpreted as the likelihood of the output being class 1, with a threshold (commonly 0.5) used to decide between class 0 and class 1.

In this post, the intuition for the binary classification loss function based on Maximum Likelihood Estimation (MLE) is explained. We then derive the gradients for the model parameters using the chain rule. The analytically computed gradients are compared against gradients computed using the deep learning framework PyTorch. Further, a training loop using gradient descent is implemented for a binary classification problem with two-dimensional Gaussian-distributed data.
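The main payoff of that chain-rule derivation is that, for the binary cross-entropy loss, the gradient with respect to the pre-sigmoid logit collapses to a very simple expression. A quick NumPy sanity check (a sketch, independent of the PyTorch comparison in the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_grad(z, y):
    # Gradient of the binary cross-entropy loss
    # L = -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    # w.r.t. the logit z; the chain rule collapses it to sigmoid(z) - y.
    return sigmoid(z) - y
```

A finite-difference check on the loss at a sample point agrees with the analytic gradient.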

Continue reading “Gradients for Binary Classification with Sigmoid”

Gradients for linear regression

Understanding gradients is essential in machine learning, as they indicate the direction and rate of change of the loss function with respect to the model parameters. This post covers the gradients for the vanilla linear regression case, taking two loss functions, Mean Square Error (MSE) and Mean Absolute Error (MAE), as examples.

The gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, using these gradients, a training loop using gradient descent is implemented for the simplest example of fitting a straight line.
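For the straight-line example, the whole training loop fits in a few lines once the MSE gradients are written down (a minimal NumPy sketch; the learning rate and step count are illustrative choices, not values from the post):

```python
import numpy as np

def fit_line(x, y, lr=0.1, steps=500):
    # Gradient descent on MSE for the model y_hat = w*x + b.
    # dMSE/dw = (2/n) * sum(err * x),  dMSE/db = (2/n) * sum(err),
    # where err = y_hat - y.
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        err = w * x + b - y
        w -= lr * (2.0 / n) * np.sum(err * x)
        b -= lr * (2.0 / n) * np.sum(err)
    return w, b
```

On noiseless data generated from y = 3x + 1 the loop recovers the slope and intercept to high precision.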

As always, the CS229 Lecture Notes and the notation used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng form key references.

Continue reading “Gradients for linear regression”

Migrated to Amazon EC2 instance (from shared hosting)

Not being too happy with the speed of the shared hosting, I decided to move the blog to an Amazon Elastic Compute Cloud (Amazon EC2) instance. Given this is a baby step, I picked a micro instance running an Ubuntu server and installed the Apache web server, MySQL and PHP. After a bit of tweaking on the new instance, I imported the SQL database and other files from the shared hosting and pointed the A record to the new IP address. The switch happened over this weekend.

One particular issue I faced was frequent crashing of MySQL due to memory limitations. I followed a few online guides to improve the situation and the current configuration seems to be holding up (but this is a cause for worry – I need to figure out the right solution).

Anyhow, hope you like the decreased page load time! 🙂

Some helpful links from the web:

a) How to install WordPress on Amazon EC2

b) Move WordPress site from shared hosting to Amazon EC2

c) DIY: Enable CGI on your Apache server

d) Import MySQL Dumpfile, SQL Datafile Into My Database

e) Making WordPress Stable on EC2-Micro

f) how to enable mod_rewrite in apache2.2 (debian/ubuntu)

GATE-2012 ECE Q28 (electromagnetics)

Question 28 on electromagnetics from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper.

Q28. A transmission line with a characteristic impedance of 100Ω is used to match a 50Ω section to a 200Ω section. If the matching is to be done both at 429MHz and 1GHz, the length of the transmission line can be approximately

(A) 82.5cm

(B) 1.05m

(C) 1.58m

(D) 1.75m

Continue reading “GATE-2012 ECE Q28 (electromagnetics)”

Image Rejection Ratio (IMRR) with transmit IQ gain/phase imbalance

The post on IQ imbalance in the transmitter briefly discussed the effect of amplitude and phase imbalance and showed that IQ imbalance results in spectral content at the image frequency. In this article, we will quantify the power of the image relative to the desired tone (also known as the IMage Rejection Ratio, IMRR) for different values of gain and phase imbalance.
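A commonly used closed-form expression (a sketch of the standard result, assuming amplitude imbalance is expressed as a linear gain ratio g and phase imbalance as φ; the full derivation is in the linked post) evaluates the desired-to-image power ratio directly:

```python
import numpy as np

def imrr_db(gain_imbalance, phase_deg):
    # Image rejection ratio for a quadrature modulator with amplitude
    # imbalance g (linear ratio) and phase imbalance phi:
    #   IMRR = (1 + 2*g*cos(phi) + g^2) / (1 - 2*g*cos(phi) + g^2)
    # returned in dB (desired tone power over image tone power).
    g = gain_imbalance
    phi = np.deg2rad(phase_deg)
    num = 1.0 + 2.0 * g * np.cos(phi) + g * g
    den = 1.0 - 2.0 * g * np.cos(phi) + g * g
    return 10.0 * np.log10(num / den)
```

For example, with perfectly matched gain and 1° of phase imbalance, this gives roughly 41 dB of image rejection, and the rejection improves as the imbalance shrinks.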

Continue reading “Image Rejection Ratio (IMRR) with transmit IQ gain/phase imbalance”

GATE-2012 ECE Q15 (communication)

Question 15 on communication from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper.

Q15. A source alphabet consists of N symbols with the probability of the first two symbols being the same. A source encoder increases the probability of the first symbol by a small amount ε and decreases that of the second by ε. After encoding, the entropy of the source

(A) increases

(B) remains the same

(C) increases only if N=2

(D) decreases

Continue reading “GATE-2012 ECE Q15 (communication)”

GATE-2012 ECE Q7 (digital)

Question 7 on digital from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper.

Q7. The output Y of a 2-bit comparator is logic 1 whenever the 2 bit input A is greater than 2 bit input B. The number of combinations for which output is logic 1 is

(A) 4

(B) 6

(C) 8

(D) 10

Continue reading “GATE-2012 ECE Q7 (digital)”

GATE-2012 ECE Q13 (circuits)

Question 13 on analog electronics from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper.

Q13. The diodes and the capacitors in the circuit shown are ideal. The voltage  across the diode  is

[Figure: voltage clamper / peak detector circuit]

(A) 

(B)  

(C) 

(D)

Continue reading “GATE-2012 ECE Q13 (circuits)”

GATE-2012 ECE Q12 (math)

Question 12 on math from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper.

Q12. With initial condition  the solution of the differential equation,

 is

(A)

(B)

(C)

(D)

Solution

From the product rule used to find the derivative of product of two or more functions,

Applying this to the above equation, it can be seen that,

Plugging this in and integrating both sides,

.

Using the initial condition , we can solve for the unknown , i.e.
.

So the solution to the differential equation is,

Based on the above, the right choice is (D).

 

References

[1] GATE Examination Question Papers [Previous Years] from Indian Institute of Technology, Madras http://gate.iitm.ac.in/gateqps/2012/ec.pdf

[2] Wiki entry on Product rule