Word Embeddings using neural networks

This post covers several neural network based word embedding models: starting with the Neural Probabilistic Language Model of Bengio et al. (2003), then the reduction of its complexity using hierarchical softmax and Noise Contrastive Estimation, and later work such as CBOW, Skip-gram, Negative Sampling, and GloVe, which made it practical to train on much larger datasets.

In machine learning, converting input data (text, images, or time series) into a vector format (also known as embeddings) is a key building block for downstream tasks. This article explores in detail the architectures of several neural network based word embedding models from the literature.

Papers referenced:

  1. A Neural Probabilistic Language Model, Bengio et al. (2003)
    • proposed a neural network architecture that jointly learns word feature vectors and the probability function for sequences of words.
  2. Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
    • since the softmax layer used to compute word probabilities scales linearly with vocabulary size, proposed a hierarchical version of softmax that reduces the per-word complexity from O(|V|) to O(log |V|) (see the formulation after this list).
  3. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann & Hyvärinen (2012)
    • proposed estimating unnormalized models by training a logistic classifier to distinguish data samples from noise samples, which avoids computing the expensive normalization constant.
  4. Efficient Estimation of Word Representations in Vector Space, Mikolov et al. (2013)
    • proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpora.
    • introduced Continuous Bag of Words (CBOW), which predicts the center word given its context, and Skip-gram, which predicts the surrounding words given the center word (see the sketch after this list).
  5. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. (2013)
    • showed that subsampling frequent words both speeds up training and improves the accuracy of representations for less frequent words.
    • introduced Negative Sampling, a simplified variant of Noise Contrastive Estimation (both sketched after this list).
  6. GloVe: Global Vectors for Word Representation, Pennington et al. (2014)
    • proposed that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities (see the example below).
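
To make the hierarchical softmax reduction concrete, here is the standard binary-tree formulation (the notation follows the word2vec variant of Mikolov et al., which builds on Morin & Bengio's idea). Each word w is a leaf of a binary tree over the vocabulary; n(w, j) is the j-th node on the root-to-w path of length L(w), ch(n) is a fixed child of node n, and the double-bracket indicator is 1 if its argument is true and -1 otherwise:

```latex
P(w \mid h) = \prod_{j=1}^{L(w)-1} \sigma\Big( [\![\, n(w, j{+}1) = \mathrm{ch}(n(w, j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} h \Big)
```

Since a balanced tree has depth around log2 |V|, computing P(w | h) needs only that many sigmoid evaluations instead of a full |V|-way softmax.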
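To illustrate the two word2vec training setups, the following minimal Python sketch builds CBOW and Skip-gram training examples from a toy corpus; the corpus, window size, and variable names are illustrative assumptions, not code from the papers.

```python
# Minimal sketch of CBOW vs. Skip-gram example construction (toy corpus assumed).
corpus = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
window = 2

vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in corpus]

cbow_examples, skipgram_pairs = [], []
for i, center in enumerate(ids):
    lo, hi = max(0, i - window), min(len(ids), i + window + 1)
    context = [ids[j] for j in range(lo, hi) if j != i]
    cbow_examples.append((context, center))              # predict center from (averaged) context
    skipgram_pairs.extend((center, c) for c in context)  # predict each context word from center

print(cbow_examples[2])     # context ids around "brown" -> id of "brown"
print(skipgram_pairs[:4])   # ("the","quick"), ("the","brown"), ("quick","the"), ...
```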
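The speed tricks from the second Mikolov et al. (2013) paper can be sketched in a few lines of numpy. Subsampling discards an occurrence of word w with probability 1 - sqrt(t / f(w)), where f(w) is the word's corpus frequency and t is a threshold (around 1e-5 in the paper). Negative sampling drops NCE's noise-distribution correction and uses a plain logistic loss over one observed context word and k noise words drawn from the unigram distribution raised to the 3/4 power. The toy vectors and dimensions below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_probability(freq, t=1e-5):
    """Probability of keeping an occurrence of a word with corpus frequency `freq`
    (subsampling rule from Mikolov et al. 2013)."""
    return np.minimum(1.0, np.sqrt(t / freq))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_o, u_negs):
    """-[log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c)] for one training pair."""
    pos = np.log(sigmoid(u_o @ v_c))
    neg = np.sum(np.log(sigmoid(-(u_negs @ v_c))))
    return -(pos + neg)

# Toy usage: center vector v_c, observed context vector u_o, k noise vectors.
d, k = 8, 5
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
u_negs = rng.normal(size=(k, d))                   # in practice sampled from unigram^(3/4)
print(keep_probability(np.array([0.05, 1e-6])))    # frequent word kept rarely, rare word always
print(negative_sampling_loss(v_c, u_o, u_negs))
```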
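The GloVe intuition is easiest to see in the paper's own example: for i = ice and j = steam, the ratio P(k | i) / P(k | j) is large for probe words related to ice (solid), small for words related to steam (gas), and near 1 for words related to both (water) or neither (fashion). GloVe fits word vectors so that their dot products reproduce the logarithm of co-occurrence counts, via the weighted least-squares objective sketched below. The weighting function uses the paper's values (x_max = 100, alpha = 3/4); the toy co-occurrence matrix and dimensions are assumptions.

```python
import numpy as np

def weight_fn(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare co-occurrences, caps frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    i, j = np.nonzero(X)                       # only observed co-occurrences
    pred = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return np.sum(weight_fn(X[i, j]) * err ** 2)

# Toy usage with a random co-occurrence matrix (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
V, d = 6, 4
X = rng.integers(0, 5, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```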