Word Embeddings using neural networks

This post covers several neural network based word embedding models: starting with the Neural Probabilistic Language Model of Bengio et al. (2003), then the reduction of its complexity using hierarchical softmax and Noise Contrastive Estimation, and later work such as CBOW, Skip-gram, Negative Sampling, and GloVe, which made it practical to train on much larger datasets.

In machine learning, converting input data (text, images, or time series) into a vector format (also known as embeddings) is a key building block for downstream tasks. This article explores in detail the architectures of several neural network based word embedding models from the literature.

Papers referenced:

  1. A Neural Probabilistic Language Model, Bengio et al. (2003)
    • proposed a neural network architecture that jointly learns word feature vectors and the probability function for sequences of words.
  2. Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
    • since the softmax layer used to compute word probabilities scales linearly with vocabulary size, proposed a hierarchical version of softmax that reduces the per-word complexity from O(|V|) to O(log |V|) (see the formulation after this list).
  3. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann & Hyvärinen (2012)
    • proposed estimating unnormalized models by training a logistic classifier to distinguish data samples from noise samples, which avoids computing the expensive normalization constant.
  4. Efficient Estimation of Word Representations in Vector Space, Mikolov et al. (2013)
    • proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpora.
    • introduced Continuous Bag of Words (CBOW), which predicts the center word given its context, and Skip-gram, which predicts the surrounding words given the center word (see the sketch after this list).
  5. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. (2013)
    • showed that subsampling frequent words both speeds up training and improves the accuracy of representations for less frequent words.
    • introduced Negative Sampling, a simplified variant of Noise Contrastive Estimation (both sketched after this list).
  6. GloVe: Global Vectors for Word Representation, Pennington et al. (2014)
    • proposed that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities (see the example below).
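
To make the hierarchical softmax reduction concrete, here is the standard binary-tree formulation (the notation follows the word2vec variant of Mikolov et al., which builds on Morin & Bengio's idea). Each word w is a leaf of a binary tree over the vocabulary; n(w, j) is the j-th node on the root-to-w path of length L(w), ch(n) is a fixed child of node n, and the double-bracket indicator is 1 if its argument is true and -1 otherwise:

```latex
P(w \mid h) = \prod_{j=1}^{L(w)-1} \sigma\Big( [\![\, n(w, j{+}1) = \mathrm{ch}(n(w, j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} h \Big)
```

Since a balanced tree has depth around log2 |V|, computing P(w | h) needs only that many sigmoid evaluations instead of a full |V|-way softmax.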
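To illustrate the two word2vec training setups, the following minimal Python sketch builds CBOW and Skip-gram training examples from a toy corpus; the corpus, window size, and variable names are illustrative assumptions, not code from the papers.

```python
# Minimal sketch of CBOW vs. Skip-gram example construction (toy corpus assumed).
corpus = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
window = 2

vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in corpus]

cbow_examples, skipgram_pairs = [], []
for i, center in enumerate(ids):
    lo, hi = max(0, i - window), min(len(ids), i + window + 1)
    context = [ids[j] for j in range(lo, hi) if j != i]
    cbow_examples.append((context, center))              # predict center from (averaged) context
    skipgram_pairs.extend((center, c) for c in context)  # predict each context word from center

print(cbow_examples[2])     # context ids around "brown" -> id of "brown"
print(skipgram_pairs[:4])   # ("the","quick"), ("the","brown"), ("quick","the"), ...
```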
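The speed tricks from the second Mikolov et al. (2013) paper can be sketched in a few lines of numpy. Subsampling discards an occurrence of word w with probability 1 - sqrt(t / f(w)), where f(w) is the word's corpus frequency and t is a threshold (around 1e-5 in the paper). Negative sampling drops NCE's noise-distribution correction and uses a plain logistic loss over one observed context word and k noise words drawn from the unigram distribution raised to the 3/4 power. The toy vectors and dimensions below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_probability(freq, t=1e-5):
    """Probability of keeping an occurrence of a word with corpus frequency `freq`
    (subsampling rule from Mikolov et al. 2013)."""
    return np.minimum(1.0, np.sqrt(t / freq))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_o, u_negs):
    """-[log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c)] for one training pair."""
    pos = np.log(sigmoid(u_o @ v_c))
    neg = np.sum(np.log(sigmoid(-(u_negs @ v_c))))
    return -(pos + neg)

# Toy usage: center vector v_c, observed context vector u_o, k noise vectors.
d, k = 8, 5
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
u_negs = rng.normal(size=(k, d))                   # in practice sampled from unigram^(3/4)
print(keep_probability(np.array([0.05, 1e-6])))    # frequent word kept rarely, rare word always
print(negative_sampling_loss(v_c, u_o, u_negs))
```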
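The GloVe intuition is easiest to see in the paper's own example: for i = ice and j = steam, the ratio P(k | i) / P(k | j) is large for probe words related to ice (solid), small for words related to steam (gas), and near 1 for words related to both (water) or neither (fashion). GloVe fits word vectors so that their dot products reproduce the logarithm of co-occurrence counts, via the weighted least-squares objective sketched below. The weighting function uses the paper's values (x_max = 100, alpha = 3/4); the toy co-occurrence matrix and dimensions are assumptions.

```python
import numpy as np

def weight_fn(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare co-occurrences, caps frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    i, j = np.nonzero(X)                       # only observed co-occurrences
    pred = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return np.sum(weight_fn(X[i, j]) * err ** 2)

# Toy usage with a random co-occurrence matrix (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
V, d = 6, 4
X = rng.integers(0, 5, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```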