Word Embeddings using neural networks

This post covers various neural-network-based word embedding models: starting from the Neural Probabilistic Language Model of Bengio et al. (2003), then the reduction of its complexity using hierarchical softmax and Noise Contrastive Estimation, and later works such as CBOW, Skip-gram with Negative Sampling, and GloVe, which enabled training on much larger datasets.

In machine learning, converting the input data (text, images, or time series) into a vector representation (also known as an embedding) forms a key building block for enabling downstream tasks. This article explores in detail the architectures of some of the neural-network-based word embedding models in the literature.

Papers referred to:

  1. A Neural Probabilistic Language Model, Bengio et al. (2003)
    • proposed a neural network architecture to jointly learn word feature vectors and the probability of word sequences.
  2. Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
    • given that the softmax layer for computing the probability scales with the vocabulary size V, proposed a hierarchical version of softmax to reduce the complexity from O(V) to O(log V).
  3. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann & Hyvärinen (2012)
    • recast the estimation of an unnormalized model as a binary classification task of distinguishing data samples from noise samples, avoiding explicit computation of the normalization constant.
  4. Efficient Estimation of Word Representations in Vector Space, Mikolov et al. (2013)
    • proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpora.
    • introduced Continuous Bag of Words (CBOW), which predicts the center word given its context, and Skip-gram, which predicts the surrounding words given the center word.
  5. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. (2013)
    • showed that subsampling of frequent words both speeds up training and improves the accuracy of the representations of less-frequent words.
    • introduced Negative Sampling, a simplified variant of Noise Contrastive Estimation.
  6. GloVe: Global Vectors for Word Representation, Pennington et al. (2014)
    • proposes that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities.
Continue reading “Word Embeddings using neural networks”
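To make the Skip-gram with Negative Sampling idea concrete, here is a minimal NumPy sketch of one training step on a toy vocabulary. The vocabulary size, embedding dimension, initialisation, and word indices are all illustrative assumptions, not the original word2vec implementation; the step pushes the output vector of the observed context word towards the center word's vector and pushes sampled negative words away.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                 # toy vocabulary size and embedding dimension (assumptions)
W_in = rng.normal(scale=0.1, size=(V, d))    # input (center-word) vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # output (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.1):
    """One SGD step of skip-gram with negative sampling:
    logistic loss with label 1 for the true context word,
    label 0 for each sampled negative word."""
    v = W_in[center].copy()
    grad_v = np.zeros(d)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(W_out[w] @ v) - label    # gradient of logistic loss w.r.t. the score
        grad_v += g * W_out[w]
        W_out[w] -= lr * g * v
    W_in[center] -= lr * grad_v

# a few steps on one (center, context) pair with two negatives
before = sigmoid(W_out[1] @ W_in[0])
for _ in range(20):
    sgns_step(center=0, context=1, negatives=[2, 3])
after = sigmoid(W_out[1] @ W_in[0])
```

After a few steps, the model's probability that word 1 is a true context of word 0 should increase, which is the behaviour the objective is designed to produce.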

Gradients for multi class classification with Softmax

In a multi-class classification problem, the output (also called the label or class) takes one of a finite set of discrete values. In this post, the system model for multi-class classification with a linear layer followed by a softmax layer is defined. The softmax function transforms the output of the linear layer into values lying between 0 and 1, which can be interpreted as probability scores.

Next, the loss function using categorical cross entropy is explained, and the gradients for the model parameters are derived using the chain rule. The analytically computed gradients are then compared with those obtained from the deep learning framework PyTorch. Finally, we implement a training loop using gradient descent for a toy multi-class classification task with 2D Gaussian-distributed data.
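The key result of that derivation is that for softmax followed by categorical cross entropy, the gradient with respect to the logits is simply p − y (predicted probabilities minus the one-hot label). A minimal NumPy check against a finite-difference estimate, with an illustrative logit vector and label rather than the post's actual PyTorch comparison:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def loss(z, y):
    return -np.log(softmax(z)[y])    # categorical cross entropy, integer label y

z = np.array([0.5, -1.2, 2.0])       # illustrative logits
y = 1                                # illustrative true class

analytic = softmax(z) - np.eye(3)[y] # dL/dz = p - one_hot(y)

# central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], y) - loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
```

The two gradient estimates agree to numerical precision, which is the same sanity check the post performs with PyTorch's autograd.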

Continue reading “Gradients for multi class classification with Softmax”

Gradients for Binary Classification with Sigmoid

In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where the output takes only two discrete values, 0 or 1, the sigmoid function can be used to transform the output of a linear regression model into a value between 0 and 1, squashing the continuous prediction into a probability-like score. This score can then be interpreted as the likelihood of the output being class 1, with a threshold (commonly 0.5) used to decide between class 0 and class 1.

In this post, the intuition for the binary classification loss function, based on the Maximum Likelihood Estimate (MLE), is explained. We then derive the gradients for the model parameters using the chain rule. Gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, a training loop using gradient descent is implemented for a binary classification problem with two-dimensional Gaussian-distributed data.
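For a linear model followed by a sigmoid, ŷ = σ(w·x + b), the chain rule gives the well-known binary cross-entropy gradients dL/dw = (ŷ − y)·x and dL/db = ŷ − y. A small NumPy sketch with illustrative random data, verifying the bias gradient against a finite-difference estimate (the post itself uses PyTorch for this comparison):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(y_hat, y):
    # binary cross entropy for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
x = rng.normal(size=3)               # illustrative single input
w = rng.normal(size=3)               # illustrative weights
b, y = 0.1, 1.0                      # illustrative bias and label

y_hat = sigmoid(w @ x + b)
grad_w = (y_hat - y) * x             # chain rule: dL/dy_hat * dy_hat/dz * dz/dw
grad_b = y_hat - y

# central finite-difference check on the bias gradient
eps = 1e-6
numeric_b = (bce(sigmoid(w @ x + b + eps), y)
             - bce(sigmoid(w @ x + b - eps), y)) / (2 * eps)
```

Note the pleasant parallel with the softmax case: in both, the gradient with respect to the pre-activation is simply prediction minus label.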

Continue reading “Gradients for Binary Classification with Sigmoid”

Gradients for linear regression

Understanding gradients is essential in machine learning, as they indicate the direction and rate of change of the loss function with respect to the model parameters. This post covers the gradients for the vanilla linear regression case, taking two loss functions, Mean Square Error (MSE) and Mean Absolute Error (MAE), as examples.

The gradients computed analytically are compared against gradients computed using the deep learning framework PyTorch. Further, using these gradients, a training loop with gradient descent is implemented for the simplest example of fitting a straight line.
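A minimal version of that straight-line fit in NumPy, using the analytic MSE gradients dL/dw = mean(2·(ŷ − y)·x) and dL/db = mean(2·(ŷ − y)); the data, learning rate, and iteration count are illustrative choices, and the post's actual code may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 3.0 * x + 0.5 + 0.01 * rng.normal(size=x.size)  # noisy line: slope 3, intercept 0.5

w, b, lr = 0.0, 0.0, 0.1                            # initial parameters and learning rate
for _ in range(500):
    y_hat = w * x + b
    err = y_hat - y
    w -= lr * np.mean(2 * err * x)                  # analytic dL/dw for MSE
    b -= lr * np.mean(2 * err)                      # analytic dL/db for MSE
```

Since the MSE loss is convex in (w, b), gradient descent with a suitable learning rate recovers parameters close to the true slope and intercept.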

As always, the CS229 Lecture Notes and the notation used in the Deep Learning Specialization (C1W1L01) from Dr Andrew Ng form key references.

Continue reading “Gradients for linear regression”