In machine learning, converting input data (text, images, or time series) into a vector representation (also known as an embedding) is a key building block for downstream tasks. This article explores the architectures of several neural-network-based word embedding models from the literature.
Papers referenced:
- A Neural Probabilistic Language Model, Bengio et al. (2003)
- proposed a neural network architecture that jointly learns word feature vectors and the probability function for word sequences.
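The architecture from Bengio et al. can be sketched in a few lines: look up the feature vectors of the context words from a shared embedding matrix, concatenate them, pass them through a tanh hidden layer, and apply a softmax over the vocabulary. All sizes below are toy values chosen for illustration, not the paper's settings.

```python
import numpy as np

# Toy dimensions: vocab size, embedding dim, context length, hidden units
V, d, n, h = 10, 4, 3, 8
rng = np.random.default_rng(0)

C = rng.normal(size=(V, d))       # shared word feature (embedding) matrix
H = rng.normal(size=(n * d, h))   # hidden-layer weights
U = rng.normal(size=(h, V))       # output weights

def next_word_probs(context_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}): concat embeddings -> tanh -> softmax."""
    x = C[context_ids].reshape(-1)      # concatenate the n context embeddings
    a = np.tanh(x @ H)
    logits = a @ U
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

p = next_word_probs([1, 5, 7])
```

Note that the final softmax requires a score for every word in the vocabulary, which is exactly the bottleneck the hierarchical softmax paper below addresses.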
- Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
- given that the softmax layer for computing word probabilities scales linearly with vocabulary size, proposed a hierarchical version of softmax that reduces the per-word complexity from O(|V|) to O(log |V|).
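The idea behind hierarchical softmax can be shown on a toy binary tree (the tree below is hypothetical; Morin & Bengio derive theirs from WordNet): each word is a leaf, and its probability is the product of sigmoid left/right decisions along the path from the root, so only O(log |V|) inner products are needed per word.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=d)                  # hidden/context vector
inner = {"root": rng.normal(size=d),    # one parameter vector per internal node
         "left": rng.normal(size=d)}

# Toy tree: root -> {left -> {w0, w1}, w2}; each step is (node, go_left?)
paths = {"w0": [("root", 1), ("left", 1)],
         "w1": [("root", 1), ("left", 0)],
         "w2": [("root", 0)]}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def prob(word):
    """P(word) = product of binary decisions along its root-to-leaf path."""
    p = 1.0
    for node, go_left in paths[word]:
        s = sigmoid(inner[node] @ h)
        p *= s if go_left else 1.0 - s
    return p

total = sum(prob(w) for w in paths)
```

Because each internal node splits probability mass between its two children, the leaf probabilities sum to 1 by construction, with no normalization over the full vocabulary.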
- Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann & Hyvärinen (2012)
- Instead of directly estimating the data distribution, noise contrastive estimation estimates the probability of a sample being from the data versus from a known noise distribution.
- This approach was extended to neural language models in A Fast and Simple Algorithm for Training Neural Probabilistic Language Models, Mnih & Teh (2012).
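A minimal sketch of the NCE objective, using a toy 1-D unnormalized Gaussian as the "model" and a standard normal as the known noise distribution (both choices are illustrative, not from the paper): for each sample, the classifier's logit for "came from data" is the model's log-score minus log of k times the noise density, and the loss is the negative log-likelihood of labeling data as data and noise as noise.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5  # noise samples per data sample

def log_model(x, theta):            # unnormalized model log-score (toy Gaussian)
    return -0.5 * (x - theta) ** 2

def log_noise(x):                   # known noise density: standard normal
    return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

def nce_loss(x_data, x_noise, theta):
    def logit(x):                   # log odds of "sample is from the data"
        return log_model(x, theta) - (np.log(k) + log_noise(x))
    log_sig = lambda z: -np.log1p(np.exp(-z))   # log sigmoid
    return -(log_sig(logit(x_data)).sum() + log_sig(-logit(x_noise)).sum())

x_data = rng.normal(loc=2.0, size=100)
x_noise = rng.normal(size=100 * k)
loss = nce_loss(x_data, x_noise, theta=2.0)
```

The appeal for language models is that the partition function never needs to be computed: only the unnormalized score and the noise density appear in the loss.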
- Efficient Estimation of Word Representations in Vector Space, Mikolov et al. (2013)
- proposed simpler neural architectures, with the intuition that simpler models enable training on much larger corpora.
- introduced two architectures: Continuous Bag of Words (CBOW), which predicts the center word from its context, and Skip-Gram, which predicts the surrounding words from the center word.
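The difference between the two framings is easiest to see in how training pairs are generated. A small sketch on a toy corpus with window size 1 (both choices illustrative):

```python
tokens = ["the", "cat", "sat", "on", "mat"]
window = 1

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(tokens):
    # context = words within the window around position i, excluding i itself
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))                  # context -> center
    skipgram_pairs.extend((center, c) for c in context)   # center -> each context word
```

CBOW averages the context vectors into one prediction per position, while Skip-Gram produces one training pair per (center, context) combination, which is why Skip-Gram sees more updates per corpus pass and tends to do better on rare words.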
- Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. (2013)
- showed that subsampling of frequent words both speeds up training and improves the accuracy of the representations of less-frequent words
- introduced Negative Sampling, a simplified variant of Noise Contrastive Estimation
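The negative-sampling objective for a single (word, context) pair can be sketched as follows: maximize log sigmoid of the dot product for the true pair, and log sigmoid of the negated dot product for k sampled "negative" context words. The vectors below are random toy values; in training they would be the learned embedding rows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 5
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

v_word = rng.normal(size=d)        # center word vector
v_ctx = rng.normal(size=d)         # true context word vector
v_neg = rng.normal(size=(k, d))    # k negatives drawn from a noise distribution

# Negative log of the objective: true pair scored high, negatives scored low
loss = -(np.log(sigmoid(v_word @ v_ctx))
         + np.log(sigmoid(-(v_neg @ v_word))).sum())
```

Unlike full NCE, negative sampling drops the noise-density correction term, so it no longer estimates normalized probabilities, but it works well when the goal is only good embeddings.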
- GloVe: Global Vectors for Word Representation, Pennington et al. (2014)
- proposed that ratios of co-occurrence probabilities capture semantic relationships better than raw co-occurrence probabilities.
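The motivating observation can be illustrated with the paper's ice/steam example, using made-up co-occurrence counts (the real counts come from a large corpus): for a probe word k, the ratio P(k|ice)/P(k|steam) is large when k relates to ice, small when it relates to steam, and near 1 when it relates to both or neither.

```python
# Illustrative co-occurrence counts (not real corpus statistics)
cooc = {"ice":   {"solid": 190, "gas": 7,   "water": 300},
        "steam": {"solid": 6,   "gas": 180, "water": 290}}

def cond_prob(word, probe):
    """P(probe | word) estimated from the co-occurrence row of `word`."""
    row = cooc[word]
    return row[probe] / sum(row.values())

ratio = {k: cond_prob("ice", k) / cond_prob("steam", k)
         for k in ["solid", "gas", "water"]}
# large ratio -> associated with ice; small -> steam; near 1 -> discriminates neither
```

GloVe then fits word vectors so that differences of their dot products with probe-word vectors reproduce the logarithms of these ratios.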