In machine learning, converting input data (text, images, or time series) into a vector format (also known as embeddings) forms a key building block for enabling downstream tasks. This article explores in detail the architecture of some of the neural network based word embedding models in the literature.
Papers referred :
- Neural Probabilistic Language Model, Bengio et al 2003
- proposed a neural network architecture to jointly learn word feature vector and probability of words in a sequence.
- Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
- given that the softmax layer for finding the probability scales with the vocabulary size, proposed a hierarchical version of softmax to reduce the complexity from $O(|V|)$ to $O(\log |V|)$.
- Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012,
- Instead of directly estimating the data distribution, noise contrastive estimation estimates the probability of a sample being from the data versus from a known noise distribution.
- This approach was extended to neural language models in the paper A fast and simple algorithm for training neural probabilistic language models A Mnih et al, 2012.
- Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013.
- proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpora of data.
- introduced Continuous Bag of Words (CBOW), which predicts the center word given the context, and Skip-gram, which predicts the surrounding words given the center word.
- Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013
- the speedup provided by subsampling of frequent words also helps to improve the accuracy of the representations of less-frequent words
- a simplified variant of Noise Contrastive Estimation called Negative Sampling
- GloVe: Global Vectors for Word Representation, Pennington et al 2014
- propose that ratios of co-occurrence probabilities capture semantic information better than the co-occurrence probabilities themselves.
In this post we will cover the key aspects proposed in the above papers with supporting python code.
Neural Probabilistic Language Model (Bengio et al, 2003)
Reference : Neural Probabilistic Language Model, Bengio et al 2003
The probability of a sequence of words $w_1, w_2, \ldots, w_T$ can be expressed as a product of conditional probabilities of each word given the sequence of previous words, i.e.
$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$
For example, consider a sequence of 4 words, $w_1, w_2, w_3, w_4$.
Then, by the chain rule of probability:
$P(w_1, w_2, w_3, w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)\, P(w_4 \mid w_1, w_2, w_3)$
For a long word sequence, instead of conditioning on all previous words, it is common to approximate the probability by conditioning only on the last $n-1$ words. That is:
$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$
Neural network Architecture
The neural probabilistic language model builds on the n-gram approximation and proposes a way to
- Jointly learn word feature vectors (each word in the vocabulary has a feature vector, a real-valued vector in $\mathbb{R}^m$) and
- Learn the probability of the sequence of words in terms of sequence of word feature vectors
The objective is to learn a model that predicts the probability of the next word given the previous words, i.e.
$f(w_t, w_{t-1}, \ldots, w_{t-n+1}) = P(w_t \mid w_1, \ldots, w_{t-1})$
The model is subject to the following constraints:
- For any sequence of words, the model outputs a non-zero probability, i.e. $f > 0$.
- The sum of the probabilities over all words in the vocabulary equals 1, i.e. $\sum_{i=1}^{|V|} f(i, w_{t-1}, \ldots, w_{t-n+1}) = 1$,
where $|V|$ is the vocabulary size, and
$i$ indexes over all possible words in the vocabulary.
Note :
- Non-zero probability: Ensures that the model never completely rules out any word as a possible next word, allowing it to adapt to all possible word sequences and avoid zero-probability issues during training.
- Probabilities sum to one: Guarantees that f defines a valid probability distribution over the vocabulary for the next word, so the total probability of all possible next words is exactly 1.
The estimation of the function $f$ is done as follows :
- for any word $w_i$ in the vocabulary $V$, look up a real-valued vector $C(w_i) \in \mathbb{R}^m$
- a function $g$ maps the sequence of feature vectors for the words in the context, $\big(C(w_{t-n+1}), \ldots, C(w_{t-1})\big)$, to a conditional probability distribution over words in $V$ for the next word $w_t$
Model
The neural network model can be expressed as:
$y = b + Wx + U\tanh(d + Hx)$
where:
$x = \big(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\big)$ is the concatenated input feature vector of the previous $n-1$ words, with dimension $(n-1)m$.
$H$ is a weight matrix of size $h \times (n-1)m$, which transforms the input $x$ into the hidden layer space.
$d$ is a bias vector for the hidden layer, of dimension $h$.
$U$ is a weight matrix of size $|V| \times h$ that maps the hidden layer activations to the output layer, where $|V|$ is the vocabulary size.
$W$ is a weight matrix of size $|V| \times (n-1)m$ that connects the input $x$ directly to the output layer.
$b$ is the bias vector for the output layer, of dimension $|V|$.
$y$ is the output vector containing the unnormalized log-probabilities (scores) for each word in the vocabulary, of dimension $|V|$.
Using softmax to convert the output vector into a probability distribution over the vocabulary,
$P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_{i=1}^{|V|} e^{y_i}}$
Using the softmax layer ensures the constraints defined earlier:
- All probabilities are positive, satisfying the non-zero probability constraint.
- The probabilities sum to 1 over the vocabulary, satisfying the normalization constraint.
Loss function
The maximum likelihood estimate for selecting the target word over all the words in the vocabulary $V$ is equivalent to minimising the negative log likelihood,
$L = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$
where,
$T$ is the number of word sequence examples
As can be seen in the section on Loss for Multiclass classification (refer post on Gradients for Multiclass classification with SoftMax), the negative log likelihood is indeed the Categorical Cross Entropy Loss.
Python code
The training of a Neural Probabilistic Language Model in PyTorch involves a few key components, each corresponding to the mathematical elements discussed earlier:
- torch.nn.Embedding — implements the word feature vector lookup $C(\cdot)$. Each word index in the vocabulary maps to a dense vector in $\mathbb{R}^m$.
- torch.nn.Linear — implements the fully-connected (dense) layers, corresponding to the transformation matrices $H$ and $U$.
- torch.nn.Parameter — the direct-connection weight matrix $W$ and the bias $b$ are explicitly created.
- torch.nn.functional.log_softmax — applies the SoftMax in log space to obtain $\log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$ while maintaining numerical stability.
- torch.nn.NLLLoss — implements the Negative Log Likelihood Loss, which directly minimises $-\log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$ for the correct target word index.
These functions, combined with an optimizer such as torch.optim.SGD or torch.optim.Adam, form the complete training loop for the model.
code @ word_embeddings/neural_probabilistic_language_model.ipynb
The training loop implementing the model for a simple toy example of 20 sentences shows that the model does reasonably well in predicting the probability of the next word.
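To make the mapping from the equations to the code concrete, here is a minimal, self-contained sketch (not the notebook code); the class name and the toy sizes used at the end are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLM(nn.Module):
    """Sketch of y = b + W x + U tanh(d + H x) followed by a softmax."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, context_size):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)              # word feature vectors C(w)
        self.H = nn.Linear(context_size * emb_dim, hidden_dim)  # hidden layer: H x + d
        self.U = nn.Linear(hidden_dim, vocab_size, bias=False)  # hidden-to-output: U
        self.W = nn.Linear(context_size * emb_dim, vocab_size)  # direct connection: W x + b

    def forward(self, context):
        # context: (batch, n-1) word indices of the previous words
        x = self.C(context).flatten(start_dim=1)                # (batch, (n-1)*m)
        y = self.W(x) + self.U(torch.tanh(self.H(x)))           # unnormalized scores
        return F.log_softmax(y, dim=-1)                         # log-probabilities over V

model = NPLM(vocab_size=50, emb_dim=16, hidden_dim=32, context_size=3)
context = torch.randint(0, 50, (8, 3))                          # dummy batch of contexts
target = torch.randint(0, 50, (8,))
loss = F.nll_loss(model(context), target)                       # negative log-likelihood
```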
Hierarchical Softmax (Morin & Bengio 2005)
Reference : Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
As computing the probability of all tokens using SoftMax scales with the vocabulary size $|V|$, the paper Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005) proposed an approach to reduce the complexity from $O(|V|)$ to $O(\log |V|)$.
Based on the intuition shared in the paper Classes for Fast Maximum Entropy Training, J Goodman 2001, to compute $P(w_t \mid h)$, instead of directly computing the probability of the target word $w_t$ given the context words $h$, we decompose it hierarchically as the product of :
$P(c(w_t) \mid h)$ : probability of $w_t$ being in class $c(w_t)$ given context $h$
$P(w_t \mid c(w_t), h)$ : probability of word $w_t$, given $w_t$ is in class $c(w_t)$ AND context $h$
i.e
$P(w_t \mid h) = P(c(w_t) \mid h) \cdot P(w_t \mid c(w_t), h)$
where,
$w_t$ : the target word we want to predict (e.g., “dog”).
$h$ : the context (the surrounding words or features used to predict the next word, e.g., “the big”).
$c(w_t)$ : the cluster/class that the target word $w_t$ belongs to (e.g., dog → Noun class).
Derivation
To derive the decomposition of $P(w_t \mid h)$, let us introduce a class variable $c$, i.e. the word $w_t$ belongs to the class $c(w_t)$.
Then the probability $P(w_t \mid h)$ can be written as the sum over whether $w_t$ is in the class or not:
$P(w_t \mid h) = P(w_t, c = c(w_t) \mid h) + P(w_t, c \neq c(w_t) \mid h)$
Since each word belongs to exactly one class, the term $P(w_t, c \neq c(w_t) \mid h)$ is zero.
Hence,
$P(w_t \mid h) = P(w_t, c = c(w_t) \mid h)$
The term $P(w_t, c = c(w_t) \mid h)$ can be expanded using the chain rule of conditional probabilities as follows:
$P(w_t, c = c(w_t) \mid h) = P(c(w_t) \mid h)\, P(w_t \mid c(w_t), h)$
Summarizing,
$P(w_t \mid h) = P(c(w_t) \mid h)\, P(w_t \mid c(w_t), h)$
Thus, computing $P(w_t \mid h)$ reduces to first predicting the class $c(w_t)$ given the context $h$ and then predicting the word $w_t$ within that class conditioned on $c(w_t)$ and $h$.
Complexity
With this approach, instead of computing the probability over the entire vocabulary $|V|$, the computation is broken down into computing the probability over the classes, and then computing the probability over the words within the chosen class.
Taking the example shared in the paper, assume $|V|$ is 10000 words, broken into 100 classes with each class having 100 words. Then the computations needed are:
- Finding probability over 100 classes
- Finding probability over 100 words in the chosen class
This reduces the computation to ~200 probability calculations instead of 10000 in the flat structure. Equivalently, the complexity reduces from $O(|V|)$ to $O(\sqrt{|V|})$ operations.
Binary Tree
An alternative to class-based grouping is to arrange the vocabulary words as the leaves of a binary tree. Each internal node corresponds to a binary decision (left or right child), and each leaf corresponds to one word in the vocabulary. This hierarchical arrangement reduces the search complexity from $O(|V|)$ to $O(\log_2 |V|)$, making it efficient for large vocabularies.
For constructing the binary tree, multiple approaches are possible :
- Perfect binary tree
- Requires the number of leaves to be a power of 2 (e.g., 2, 4, 8, 16).
- If $|V|$ is not a power of 2, some leaves will remain unused
- To reach every word, it takes the same path length, i.e. $\lceil \log_2 |V| \rceil$.
- Average depth: exactly $\lceil \log_2 |V| \rceil$ since all leaves are at the same level.
- Balanced binary tree
- Tries to keep the left and right subtrees of equal size.
- When the vocabulary is not a power of 2, leaf depths differ by at most 1.
- No empty leaves; every leaf corresponds to a word.
- Average depth: approximately $\log_2 |V|$, often slightly smaller because some leaves are shallower.
- Word frequency based tree
- Constructed using a Huffman coding structure, frequent words are placed closer to the root node while rare words are deeper.
- This minimises the average number of binary decisions required to reach a word.
- Average depth: depends on the frequency distribution; it is minimised and typically much smaller than $\log_2 |V|$ for natural language vocabularies (due to Zipf’s law).
For a toy corpus of 12 words, construction of the binary tree with the above approaches is shown below. code @ word_embeddings/binary_tree.ipynb
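As an illustration of the frequency-based approach, a Huffman-style tree can be built with a heap as in the sketch below; this is a minimal stand-alone example with made-up word counts, not the notebook code.

```python
import heapq
import itertools

def huffman_codes(word_freqs):
    """Return {word: list of 0/1 decisions from root to leaf} for a frequency-based tree."""
    tie = itertools.count()  # tie-breaker so the heap never compares node dicts
    heap = [(freq, next(tie), {"word": w}) for w, freq in word_freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least-frequent subtrees ...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), {"left": left, "right": right}))  # ... are merged
    codes = {}
    def walk(node, path):
        if "word" in node:
            codes[node["word"]] = path       # leaf: record the root-to-leaf bit path
        else:
            walk(node["left"], path + [0])   # 0 = go left
            walk(node["right"], path + [1])  # 1 = go right
    walk(heap[0][2], [])
    return codes

print(huffman_codes({"the": 10, "cat": 4, "sat": 3, "on": 2, "mat": 1}))
```

Frequent words end up with short paths, which is exactly why the average depth is smaller than for a balanced tree.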
Model
The probability of the next word given the context can be written as:
$P(w \mid h) = \prod_{j=1}^{m} P\big(b_j(w) \mid b_1(w), \ldots, b_{j-1}(w), h\big)$
where,
- each word $w$ is represented by a bit vector $\big(b_1(w), b_2(w), \ldots, b_m(w)\big)$
- the path depends on the position of the word in the binary tree.
For example, if each word is represented by 4 bits, then the probability of predicting the next word given the context becomes:
$P(w \mid h) = P(b_1 \mid h)\, P(b_2 \mid b_1, h)\, P(b_3 \mid b_1, b_2, h)\, P(b_4 \mid b_1, b_2, b_3, h)$
Taking log on both sides converts the product into a summation:
$\log P(w \mid h) = \log P(b_1 \mid h) + \log P(b_2 \mid b_1, h) + \log P(b_3 \mid b_1, b_2, h) + \log P(b_4 \mid b_1, b_2, b_3, h)$
In general, for a word represented with $m$ bits:
$\log P(w \mid h) = \sum_{j=1}^{m} \log P\big(b_j(w) \mid b_1(w), \ldots, b_{j-1}(w), h\big)$
The bit vector corresponds to the path (left or right at each node) starting from the root node to the leaf node (the word). Each internal node outputs a probability of going right ($\hat{p}_j$). For the true label $b_j$, the binary cross-entropy loss at that node is:
$\ell_j = -\big[\, b_j \log \hat{p}_j + (1 - b_j) \log(1 - \hat{p}_j)\, \big]$
The total loss for predicting $w$ is the sum of the node losses along the path:
$L(w) = -\sum_{j=1}^{m} \big[\, b_j \log \hat{p}_j + (1 - b_j) \log(1 - \hat{p}_j)\, \big]$
where
$\hat{p}_j$ is the predicted probability at the j-th node along the path.
$b_j$ denotes the binary choice (0 or 1) at the j-th internal node along the path to word $w$.
This is equivalent to the negative log-likelihood of the full word probability.
Binary Node Predictor
Each internal node of the binary tree acts as a logistic classifier that decides left vs right, based on both the (n−1)-gram context and the node embedding. The conditional probability of taking the binary decision at a node, given the past context, is modelled as:
$P(b_j = 1 \mid x, n_j) = \sigma\big( b_{n_j} + u^\top \tanh(d + Hx + W N_{n_j}) \big)$
where,
- the sigmoid function is $\sigma(z) = \frac{1}{1 + e^{-z}}$.
- for any word $w_i$ in the vocabulary $V$, a real vector $C(w_i)$ is looked up
$x$ : concatenation of the previous (n−1) word embeddings,
$b_{n_j}$ : bias term specific to the node,
$u$ : projection vector applied after the hidden nonlinearity,
$d$ : bias for the hidden layer,
$H$ : weight matrix projecting the context to the hidden space,
$W$ : weight matrix projecting the node embedding,
$N_{n_j}$ : embedding vector for the current node.
The matrices $H$, $W$, the projection vector $u$ and the bias $d$ are common parameters shared across all nodes.
Each internal node has its own $b_{n_j}$ (scalar bias) and $N_{n_j}$ (node embedding). These take care of the decision boundary at each internal node.
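A minimal sketch of this per-node predictor in PyTorch is shown below; the mapping of symbols to modules follows my reading of the equation above, and the names are illustrative rather than taken from the notebook.

```python
import torch
import torch.nn as nn

class BinaryNodePredictor(nn.Module):
    """P(go right | context, node) = sigmoid(b_node + u . tanh(d + H x + W N_node))"""
    def __init__(self, num_nodes, ctx_dim, node_dim, hidden_dim):
        super().__init__()
        self.H = nn.Linear(ctx_dim, hidden_dim)                # shared H and hidden bias d
        self.Wn = nn.Linear(node_dim, hidden_dim, bias=False)  # shared node projection W
        self.u = nn.Linear(hidden_dim, 1, bias=False)          # shared projection vector u
        self.node_bias = nn.Embedding(num_nodes, 1)            # per-node scalar bias b_node
        self.node_emb = nn.Embedding(num_nodes, node_dim)      # per-node embedding N_node

    def forward(self, x, node_id):
        # x: (batch, ctx_dim) concatenated context embeddings; node_id: (batch,) node indices
        h = torch.tanh(self.H(x) + self.Wn(self.node_emb(node_id)))
        logit = self.u(h).squeeze(-1) + self.node_bias(node_id).squeeze(-1)
        return torch.sigmoid(logit)                            # probability of the "right" branch
```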
Python code – Naive implementation using for-loops
For the toy corpus, a naive implementation of hierarchical softmax using for-loops is provided.
- Defined a toy corpus of 20 sentences which has around 42 words.
- Training example is constructed as 3 context words and the corresponding target word
- Constructed a balanced binary tree, which has 41 internal nodes
- Model defined with the binary node predictor for each of the nodes
- The parameters $H$, $W$, the projection vector $u$ and the bias $d$ are shared across all nodes.
- Each internal node has its own $b_{n_j}$ and $N_{n_j}$ parameters
- For each target word in the training example, the path to the leaf node via the tree is known
- Using the binary decision at each path, the loss for each example is computed
- The loss is backpropagated to find the parameters which minimize the loss
Using the trained model, to find the probabilities of the top-k words given the context words,
- For each word in the vocabulary find the path to its leaf node
- Starting from the root node, find the probability at each node
- Based on the known decision (right vs left) at each node, use either p for going right OR (1-p) for going left
- The joint probability is the product of probabilities at each node.
- For numerical stability (loss of accuracy when many small probabilities are multiplied), log of probabilities is found and then summed
- Optionally, the final log probability is exponentiated to get back a probability
- Then the top-k candidate words are printed
code @ word_embeddings/hierarchical_probabilistic_neural_language_model.ipynb
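A sketch of this scoring procedure is shown below; it assumes a per-node predictor with the interface predictor(x, node_id) → P(go right) and precomputed per-word paths, with all names being illustrative.

```python
import torch

def word_log_prob(predictor, x, path_nodes, path_bits):
    """Accumulate log-probabilities along the root-to-leaf path of one word."""
    log_p = torch.zeros(())
    for node_id, bit in zip(path_nodes, path_bits):
        p_right = predictor(x, torch.tensor([node_id]))[0]   # P(go right) at this node
        p = p_right if bit == 1 else 1.0 - p_right           # probability of the branch taken
        log_p = log_p + torch.log(p + 1e-12)                 # sum of logs for stability
    return log_p

def top_k_words(predictor, x, paths, k=5):
    """paths: {word: (path_nodes, path_bits)}; score every word and keep the k best."""
    scores = {w: word_log_prob(predictor, x, nodes, bits).item()
              for w, (nodes, bits) in paths.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```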
Python code – Vectorized implementation
As one can imagine, using for-loops significantly slows down training. To form a vectorized implementation, the following was done.
- Path preparation
- Assign a unique id to every internal node in the binary tree.
- Precompute for each word:
- sequence of node-ids on the path to its leaf,
- binary decision targets at each node.
- Pad all paths to a fixed maximum length using a dummy (UNK) node id. Build a mask to ignore padded positions.
- Parameter lookup
- Use torch.nn.Embedding to fetch the node-specific parameters $b_{n}$ (biases) and $N_{n}$ (embeddings).
- Forward pass (vectorized)
- Context projection: compute $Hx + d$ for the batch of context vectors.
- Node projection: compute $W N_{n}$ for the node embeddings gathered along each path.
- Broadcast: add the context projection to the node projections across all path positions.
- Nonlinearity : apply $\tanh$ to the sum.
- Projection: apply the shared projection vector $u$ and add the per-node bias $b_{n}$.
- Probabilities: apply the sigmoid to obtain the per-node decision probabilities.
- Loss and masking
- Binary cross-entropy is computed between the predicted node probabilities and the binary decision targets, with the mask applied to ignore padded nodes.
Notes :
- Compute the context projection $Hx + d$ once per batch and broadcast it, instead of recomputing per path.
- UNK node parameters are trainable but excluded from loss using the mask.
code @ word_embeddings/vectorized_hierarchical_probabilistic_neural_language_model.ipynb
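The loss-and-masking step can be written compactly as in the sketch below, assuming the raw node scores along each padded path have already been computed; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_path_bce(path_logits, path_targets, path_mask):
    """Hierarchical-softmax loss over padded paths.
    path_logits  : (batch, L_max) raw node scores along each word's padded path
    path_targets : (batch, L_max) binary left/right decisions as floats
    path_mask    : (batch, L_max) 1.0 for real nodes, 0.0 for UNK/padded positions"""
    bce = F.binary_cross_entropy_with_logits(path_logits, path_targets, reduction="none")
    return (bce * path_mask).sum() / path_mask.sum()   # padded nodes do not contribute

# example shapes
logits = torch.randn(4, 8)
targets = torch.randint(0, 2, (4, 8)).float()
mask = torch.ones(4, 8); mask[:, 6:] = 0.0             # last two positions are padding
loss = masked_path_bce(logits, targets, mask)
```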
Noise contrastive estimation (Gutmann et al 2012, Mnih et al 2012)
As computing the probability using SoftMax scales with the vocabulary size $|V|$, the paper Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012, provided an approach called Noise Contrastive Estimation (NCE). Instead of directly estimating the data distribution, NCE estimates the probability of a sample being from the data versus from a known noise distribution. By learning the ratio between the data and noise distributions, and knowing the noise distribution, the data distribution can be inferred.
This approach was extended to neural language models in the paper A fast and simple algorithm for training neural probabilistic language models A Mnih et al, 2012.
Model
In the Neural Probabilistic Language Model, the estimation of the probability of the words using the SoftMax computation is,
$P_\theta(w \mid h) = \frac{\exp\big(s_\theta(w, h)\big)}{\sum_{w' \in V} \exp\big(s_\theta(w', h)\big)}$
where,
- the context words are $h$
- the term in the numerator, $\exp\big(s_\theta(w, h)\big)$, is estimated using a neural model with parameters $\theta$ for the target word $w$ using the context words $h$
- the term in the denominator is the sum over all the words in the vocabulary, i.e. $Z(h) = \sum_{w' \in V} \exp\big(s_\theta(w', h)\big)$
Let us define a set $S$, which is the union of two sets $S = S_{data} \cup S_{noise}$, where
- the class label $D = 1$ when the word is from the true target word distribution
- the class label $D = 0$ when the word is NOT from the true target word distribution
$T_d$ is the number of true (data) samples in the batch (or dataset)
$k$ is the number of noise samples generated for contrast
The formulation is,
- for each true (data) sample $(w, h)$, we draw $k$ noise samples $w_1, \ldots, w_k$ from the noise distribution $P_n(w)$
- the model has to learn a binary classification: whether the sample $w$ is from the true distribution ($D = 1$) or from the noise distribution ($D = 0$)
Further, instead of computing the denominator term for normalizing to probabilities, it is learned as a context-dependent normalizing parameter $Z_h$, i.e.
$P_\theta(w \mid h) = \frac{\exp\big(s_\theta(w, h)\big)}{Z_h}$
The probability of the sample coming from the true distribution, given the context, can be written as,
$P(w \mid D = 1, h) = P_\theta(w \mid h)$
Similarly, the probability of the word under the noise distribution is,
$P(w \mid D = 0, h) = P_n(w)$
Further, since each true sample is paired with $k$ noise samples, the class priors are
$P(D = 1) = \frac{1}{k + 1}, \qquad P(D = 0) = \frac{k}{k + 1}$
Since NCE reframes the problem as a binary classification task (distinguishing true data from noise), the class labels are modelled as independent Bernoulli variables. Consequently, the conditional log likelihood is the sum of the binary cross-entropy terms:
For a single true target word $w$ and its corresponding $k$ noise samples $w_1, \ldots, w_k$,
$J(\theta) = \log P(D = 1 \mid w, h) + \sum_{i=1}^{k} \log P(D = 0 \mid w_i, h)$
To evaluate this loss, we need to express $P(D = 1 \mid w, h)$ in terms of the model parameters. Using Bayes Rule, the probability that the class is true ($D = 1$) given the context $h$ and target word $w$ is,
$P(D = 1 \mid w, h) = \frac{P_\theta(w \mid h)}{P_\theta(w \mid h) + k\, P_n(w)}$
This gives the general probability for any word $w$. When calculating the loss for a true target word, we substitute the observed target word to get the positive sample probability.
Converting to the sigmoid form used in logistic regression,
$P(D = 1 \mid w, h) = \sigma\big( \log P_\theta(w \mid h) - \log(k\, P_n(w)) \big)$
Similarly, the probability that the class is noise ($D = 0$) given the context $h$ and target word $w$ is,
$P(D = 0 \mid w, h) = \frac{k\, P_n(w)}{P_\theta(w \mid h) + k\, P_n(w)}$
This gives the general probability for any word $w$. When calculating the loss for a noise target word, we substitute the sampled noise word to get the noise sample probability.
Converting to the sigmoid form used in logistic regression,
$P(D = 0 \mid w, h) = \sigma\big( \log(k\, P_n(w)) - \log P_\theta(w \mid h) \big)$
Plugging these terms into the log likelihood for a single example,
$J(\theta) = \log \sigma\big( \log P_\theta(w \mid h) - \log(k\, P_n(w)) \big) + \sum_{i=1}^{k} \log \sigma\big( \log(k\, P_n(w_i)) - \log P_\theta(w_i \mid h) \big)$
To obtain the objective function for the entire dataset, sum the log-likelihoods over all true training examples $t = 1, \ldots, T$. For each training example at step $t$, we have a specific context $h_t$, a true target word $w_t$, and a fresh set of $k$ noise samples $w_{t,1}, \ldots, w_{t,k}$.
The final loss function that we minimize is the negative log-likelihood over the full dataset:
$L(\theta) = -\sum_{t=1}^{T} \Big[ \log \sigma\big( \log P_\theta(w_t \mid h_t) - \log(k\, P_n(w_t)) \big) + \sum_{i=1}^{k} \log \sigma\big( \log(k\, P_n(w_{t,i})) - \log P_\theta(w_{t,i} \mid h_t) \big) \Big]$
Note :
In the paper, A fast and simple algorithm for training neural probabilistic language models, A Mnih et al, 2012, the authors mention that fixing the context-dependent normalizing factor $Z_h$ to 1, instead of learning it, did not affect the performance of downstream tasks.
Noise Distribution
The noise distribution is typically chosen proportional to the unigram frequency of words in the corpus:
$P_n(w) \propto \text{count}(w)$
Often a smoothed unigram distribution improves results:
$P_n(w) \propto \text{count}(w)^{\alpha}, \quad \text{with e.g. } \alpha = 0.75$
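A small sketch of building such a noise distribution from raw counts is shown below; the smoothing exponent 0.75 is the commonly used value, chosen here as an assumption.

```python
import numpy as np

def smoothed_unigram(word_counts, alpha=0.75):
    """Noise distribution P_n(w) proportional to count(w)**alpha."""
    counts = np.asarray(list(word_counts.values()), dtype=np.float64)
    probs = counts ** alpha                       # smooth the raw unigram counts
    probs /= probs.sum()                          # normalize to a distribution
    return dict(zip(word_counts.keys(), probs))

print(smoothed_unigram({"the": 1000, "cat": 50, "zygote": 2}))
```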
Python code
For the toy vocabulary, the code implements the Neural Probabilistic Language Model, Bengio et al, with the SoftMax head replaced by Noise Contrastive Estimation (NCE).
The code @ word_embeddings/nplm_with_noise_contrastive_estimation.ipynb
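For reference, a sketch of the NCE loss itself is shown below; it assumes the model log-scores $\log P_\theta(w \mid h)$ and the log noise probabilities have already been gathered for a batch, and the tensor names are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def nce_loss(score_pos, score_noise, log_pn_pos, log_pn_noise, k):
    """score_*  : model log-scores log P_theta(w|h); shapes (B,) and (B, k)
    log_pn_* : log P_n(w) under the noise distribution; shapes (B,) and (B, k)"""
    # logit of P(D=1 | w, h) = sigma(log P_theta(w|h) - log(k * P_n(w)))
    pos_logit = score_pos - (math.log(k) + log_pn_pos)
    noise_logit = score_noise - (math.log(k) + log_pn_noise)
    # true words should be classified as D=1, noise words as D=0
    pos_loss = F.binary_cross_entropy_with_logits(pos_logit, torch.ones_like(pos_logit))
    noise_loss = F.binary_cross_entropy_with_logits(noise_logit, torch.zeros_like(noise_logit))
    return pos_loss + k * noise_loss   # per-example average of -log sigma(.) - sum_k log(1 - sigma(.))

B, k = 8, 5
loss = nce_loss(torch.randn(B), torch.randn(B, k), torch.randn(B), torch.randn(B, k), k)
```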
Word2Vec papers (Mikolov et al, 2013)
In the paper, Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013, the authors proposed architectures to reduce the computational complexity of learning word embeddings, with the intuition that simpler models enable training on much larger corpora of data.
Two architectures were proposed.
Continuous Bag of Words (CBOW) Model
When comparing with Neural Probabilistic Language Model, Bengio et al 2003, the following simplifications are proposed.
- order of context words is ignored
- instead of concatenating embedding of previous words, averaging the word embeddings of surrounding words is proposed
- this approach is called “bag-of-words” as the order is not taken into consideration
- no non linear hidden layer
- the model uses a shared projection layer
Additionally, the context in the model includes future words too.
Equations
The neural network output is :
$z = W' h$
where:
$W'$ is the output weight matrix, mapping from the hidden dimension to the vocabulary size $|V|$.
$h$ is the averaged context embedding.
The averaged context embedding vector is computed as:
$h = \frac{1}{2n} \sum_{-n \le i \le n,\, i \ne 0} v_{w_{t+i}}$
where,
$W$ is the input embedding matrix,
$v_{w_{t+i}}$ is the embedding of the i-th context word (a row of $W$), and
$n$ is the number of words to the left or right of the target word, giving a total context size of $2n$
The probability distribution over the vocabulary is obtained using the softmax function:
$P(w_t \mid \text{context}) = \frac{\exp(z_{w_t})}{\sum_{j=1}^{|V|} \exp(z_j)}$
where:
$P(w_t \mid \text{context})$ is the predicted probability of word $w_t$ being the target word.
$z_{w_t}$ is the score for word $w_t$ from the output layer.
- The denominator sums the exponentiated scores over all vocabulary entries.
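A minimal sketch of CBOW with the full softmax output is shown below (the notebooks referenced later use negative sampling instead); names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.W_in = nn.Embedding(vocab_size, emb_dim)            # input embedding matrix W
        self.W_out = nn.Linear(emb_dim, vocab_size, bias=False)  # output weight matrix W'

    def forward(self, context):
        # context: (batch, 2n) indices of the surrounding words (order is ignored)
        h = self.W_in(context).mean(dim=1)                       # averaged context embedding
        return F.log_softmax(self.W_out(h), dim=-1)              # distribution over the vocabulary

model = CBOW(vocab_size=100, emb_dim=32)
context = torch.randint(0, 100, (8, 4))                          # 2 words on each side
center = torch.randint(0, 100, (8,))
loss = F.nll_loss(model(context), center)
```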
Continuous Skip-gram Model
The Skip-gram model tries to predict context words given the current target word. The main idea is that each word is trained to predict the words surrounding it within a context window of size n.
- Input: one-hot encoding of the target word $w_t$
- Output: probability distribution over vocabulary for each context word
- No non-linear hidden layer: uses a shared projection matrix (linear)
Given a target word $w_t$, the model tries to predict each surrounding context word $w_{t+j}$
for $-n \le j \le n,\ j \ne 0$. The training goal is to maximize the probability of all context words around each target word:
$\frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n,\, j \ne 0} \log P(w_{t+j} \mid w_t)$
Equations
The output score for a context word $w_O$ given the input (center) word $w_I$ is,
$z_{w_O} = u_{w_O}^\top v_{w_I}$
where,
$v_{w_I}$ is the embedding vector for the center word $w_I$ (a row of the input embedding matrix $W$),
$u_{w_O}$ is the embedding vector for the context word $w_O$ (a row of the output embedding matrix $W'$).
The scores are computed for each context word, and the probability of all the context words is maximized.
For either CBOW or Skip-gram, both $W$ and $W'$ are trainable. After training, either one (or their average) is used as the word embedding.
A naive way of finding the probability is using SoftMax,
$P(w_O \mid w_I) = \frac{\exp\big(u_{w_O}^\top v_{w_I}\big)}{\sum_{w=1}^{|V|} \exp\big(u_w^\top v_{w_I}\big)}$
where,
$w_O$ is the output word
$u_{w_O}^\top v_{w_I}$ is the score for the output word
- denominator = normalizing constant over vocabulary
For finding the probability, hierarchical softmax is proposed in the paper.
Negative Sampling
In the paper, Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013 introduced two concepts:
- the speedup provided by subsampling of frequent words also helps to improve the accuracy of the representations of less-frequent words
- a simplified variant of Noise Contrastive Estimation called Negative Sampling
The key intuition in Negative Sampling is that the noise contrastive loss defined earlier had terms, $\log\big(k\, P_n(w)\big)$, to normalize the scores into approximate probabilities. However, to learn word embeddings the probabilities themselves are not needed, and these terms can be ignored.
With this simplification, the negative sampling loss is,
$L = -\log \sigma\big(u_{w_O}^\top v_{w_I}\big) - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \Big[ \log \sigma\big(-u_{w_i}^\top v_{w_I}\big) \Big]$
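A sketch of this loss for the Skip-gram case is shown below, assuming the center-word input embeddings and the output embeddings of the positive and sampled negative words have already been gathered; names are illustrative.

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(v_center, u_pos, u_neg):
    """v_center: (B, d) input embeddings of the center words
    u_pos   : (B, d) output embeddings of the observed context words
    u_neg   : (B, K, d) output embeddings of K words drawn from the noise distribution"""
    pos_score = (v_center * u_pos).sum(dim=-1)                        # (B,)
    neg_score = torch.bmm(u_neg, v_center.unsqueeze(-1)).squeeze(-1)  # (B, K)
    # maximize log sigma(u_pos . v) + sum_k log sigma(-u_neg_k . v)
    loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=-1)
    return loss.mean()

B, K, d = 8, 5, 32
loss = neg_sampling_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, K, d))
```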
Python code
For the toy vocabulary, finding word vectors with
a) Continuous Bag of words (CBOW) with negative sampling
The code @ word_embeddings/cbow_negative_sampling copy.ipynb
b) Skip-gram with Negative Sampling
The code @ word_embeddings/skip_gram_negative_sampling.ipynb
GloVe Embeddings (Pennington et al 2014)
In the paper GloVe: Global Vectors for Word Representation, Pennington et al 2014, the authors propose that ratios of co-occurrence probabilities capture semantic information better than the co-occurrence probabilities themselves.
Let,
$X$ be the matrix of word co-occurrence counts,
$X_{ij}$ be the number of times word $j$ occurs in the context of word $i$,
$X_i = \sum_k X_{ik}$ be the number of times any word appears in the context of word $i$,
$P_{ij} = P(j \mid i) = X_{ij} / X_i$ be the probability that word $j$ occurs in the context of word $i$.
Authors show that on a 6 billion token corpus, for the target words ice and steam and probe words $k$ such as solid, gas and water, the co-occurrence probabilities $P(k \mid \text{ice})$ and $P(k \mid \text{steam})$ can be computed directly from the counts.
Taking the ratio of co-occurrence probabilities, $P(k \mid \text{ice}) / P(k \mid \text{steam})$ is large for solid (≈ 8.9), much smaller than 1 for gas (≈ 0.085), and close to 1 for water.
The ratios indicate that,
- solid co-occurs much more strongly with ice than with steam.
- gas is far less likely to co-occur with ice than with steam.
- water is related to both ice and steam in similar proportions
Model
To capture this ratio relationship in a vector space, the authors search for a function $F$ that satisfies:
$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$
where
$k$ is the context word,
$i, j$ are the target words,
$w$ are the word vectors and
$\tilde{w}$ are separate context vectors.
Authors enforce that the relationship should be linear (vector difference) and the result should be a scalar (dot product), leading to :
$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$
To satisfy that, authors propose choosing $F = \exp$, so that the dot product of the vector difference can be written as a ratio of probabilities,
$\exp\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{\exp\big(w_i^\top \tilde{w}_k\big)}{\exp\big(w_j^\top \tilde{w}_k\big)} = \frac{P_{ik}}{P_{jk}}$
With this choice, for a single word-context pair the model estimates the co-occurrence probability,
$\exp\big(w_i^\top \tilde{w}_k\big) = P_{ik}$
Taking the logarithm,
$w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$
Note :
The model capturing the relation between two words should not change even if the words are swapped. Even though the co-occurrence counts are identical ($X_{ik} = X_{ki}$), because the total counts of the words are not equal ($X_i \neq X_k$), the conditional probability is not symmetric ($P_{ik} \neq P_{ki}$).
The above equation is not symmetric if we swap the target word $i$ and the context word $k$, as the row-dependent term $\log X_i$ has to be handled.
To make it symmetric, the authors absorb $\log X_i$ into a learnable bias term $b_i$ and then add a corresponding bias $\tilde{b}_k$ for the context word. This ensures the model is fully symmetric, i.e.
$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$
The loss function then becomes,
$J = \sum_{i,j=1}^{|V|} \big( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \big)^2$
The key aspect of the above simplification is that training the pairs of words to minimize this loss indirectly ensures that the dot product of the vector difference of target words with a context word arrives at the ratio of co-occurrence probabilities.
Weighted Least Squares
The above loss function weighs all co-occurrences equally. Authors noted that rare co-occurrences are noisy and that around 75-95% of the entries in the co-occurrence matrix are zero, and proposed adding a weighting function to the least squares loss proposed above.
The weighting function $f(x)$ is chosen to obey the following :
$f(0) = 0$ (to handle the zero co-occurrence counts)
$f(x)$ should be non-decreasing so that rare co-occurrences are given less weight
$f(x)$ should be relatively small for large values of $x$ so that frequent co-occurrences are not over-weighted
A choice satisfying these properties, used in the paper, is
$f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$
The parameters $x_{max}$ and $\alpha$ are chosen empirically (the paper uses $x_{max} = 100$ and $\alpha = 3/4$).
Then the Weighted Least Squares loss function becomes,
$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \big( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \big)^2$
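A sketch of this weighted loss for a batch of nonzero word-context pairs is shown below; $x_{max} = 100$ and $\alpha = 3/4$ follow the paper's suggested values, and the rest of the names are illustrative.

```python
import torch

def glove_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """w_i, w_j : (B, d) target and context word vectors
    b_i, b_j : (B,)   target and context biases
    x_ij     : (B,)   co-occurrence counts for the sampled pairs (all > 0)"""
    weight = torch.clamp(x_ij / x_max, max=1.0) ** alpha               # f(X_ij)
    residual = (w_i * w_j).sum(dim=-1) + b_i + b_j - torch.log(x_ij)   # model error vs log count
    return (weight * residual ** 2).mean()

B, d = 16, 50
loss = glove_loss(torch.randn(B, d), torch.randn(B, d),
                  torch.randn(B), torch.randn(B),
                  torch.randint(1, 200, (B,)).float())
```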
Python code
The code @ word_embeddings/glove_word_embedding.ipynb
Summary
This article covers
Evolution: How we moved from Bengio’s NPLM (2003) to efficient architectures like Word2Vec and GloVe.
Math: Detailed derivations of Hierarchical Softmax (using binary trees) and Noise Contrastive Estimation (differentiating data from noise).
Architectures: A deep look at CBOW, Skip-Gram, and the intuition behind Negative Sampling.
Code: Complete Python implementations for every model discussed, including vectorized implementations for efficiency.
Acknowledgment
In addition to the primary papers listed above, this post draws inspiration from the excellent overview in the post Learning word embedding, Weng, Lilian 2017. Credit also goes to the recent Large Language Models Gemini and ChatGPT which helped to bounce thoughts and refine the drafts.