Word Embeddings using neural networks

In machine learning, converting the input data (text, images, or time series) into a vector format (also known as embeddings) is a key building block for enabling downstream tasks. This article explores in detail the architectures of some of the neural-network-based word embedding models in the literature.

Papers referenced:

  1. Neural Probabilistic Language Model, Bengio et al 2003
    • proposed a neural network architecture to jointly learn word feature vectors and the probability of words in a sequence.
  2. Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
    • given that the softmax layer for finding the probability scales with the vocabulary size, proposed a hierarchical version of softmax to reduce the complexity from $O(|V|)$ to $O(\log |V|)$.
  3. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012
    • proposed Noise Contrastive Estimation (NCE), which learns a model by distinguishing data samples from samples drawn from a known noise distribution.
  4. Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013.
    • proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpora of data.
    • introduced Continuous Bag of Words (CBOW), which predicts the center word given the context, and Skip-gram, which predicts the surrounding words given the center word.
  5. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013
    • the speedup provided by subsampling of frequent words also improves the accuracy of the representations of less-frequent words
    • introduced a simplified variant of Noise Contrastive Estimation called Negative Sampling
  6. GloVe: Global Vectors for Word Representation, Pennington et al 2014
    • proposed that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities.

In this post we will cover the key aspects proposed in the above papers, with supporting Python code.

Neural Probabilistic Language Model (Bengio et al, 2003)

Reference : Neural Probabilistic Language Model, Bengio et al 2003

The probability of a sequence of words can be expressed as a product of conditional probabilities of each word given the sequence of previous words, i.e.

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$

For example, consider a sequence of 4 words $(w_1, w_2, w_3, w_4)$, say "the cat sat down".

Then, by the chain rule of probability:

$$P(w_1, w_2, w_3, w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)\, P(w_4 \mid w_1, w_2, w_3)$$

Substituting the actual words:

$$P(\text{the}, \text{cat}, \text{sat}, \text{down}) = P(\text{the})\, P(\text{cat} \mid \text{the})\, P(\text{sat} \mid \text{the}, \text{cat})\, P(\text{down} \mid \text{the}, \text{cat}, \text{sat})$$

For a long word sequence, instead of conditioning on all previous words, it is common to approximate the probability by conditioning only on the last $n-1$ words. That is:

$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

Neural Network Architecture

The neural probabilistic language model builds on the n-gram approximation and proposes a way to

  • Jointly learn a word feature vector for each word in the vocabulary (a real-valued vector in $\mathbb{R}^m$), and
  • Learn the probability of a sequence of words expressed in terms of the sequence of word feature vectors

The objective is to learn a model $f$ that predicts the probability of the next word given the previous $n-1$ words, i.e.

$$f(w_t, w_{t-1}, \ldots, w_{t-n+1}) = P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$$

The model is subject to the following constraints:

  • For any sequence of words, the model outputs a non-zero probability, i.e. $f(w_t, w_{t-1}, \ldots, w_{t-n+1}) > 0$.
  • The sum of the probabilities over all words in the vocabulary equals 1, i.e.

$$\sum_{i=1}^{|V|} f(i, w_{t-1}, \ldots, w_{t-n+1}) = 1$$

where $|V|$ is the vocabulary size, and $i$ indexes over all possible words in the vocabulary.

Note :

  • Non-zero probability: Ensures that the model never completely rules out any word as a possible next word, allowing it to adapt to all possible word sequences and avoid zero-probability issues during training.
  • Probabilities sum to one: Guarantees that f defines a valid probability distribution over the vocabulary for the next word, so the total probability of all possible next words is exactly 1.

The estimation of the function $f$ is done as follows:

  • for any word $i$ in the vocabulary $V$, look up a real-valued feature vector $C(i) \in \mathbb{R}^m$
  • a function $g$ maps the input sequence of feature vectors of the context words, $\big(C(w_{t-n+1}), \ldots, C(w_{t-1})\big)$, to a conditional probability distribution over words in $V$ for the next word $w_t$

Model

The neural network model can be expressed as:

$$y = b + Wx + U \tanh(d + Hx)$$

where:

  • $x = \big(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\big)$ is the concatenated input feature vector of the previous $n-1$ words, with dimension $(n-1)m$.
  • $H$ is a weight matrix of size $h \times (n-1)m$, which transforms the input into the hidden layer space.
  • $d$ is the bias vector for the hidden layer, of dimension $h$.
  • $U$ is a weight matrix of size $|V| \times h$ that maps the hidden layer activations to the output layer, where $|V|$ is the vocabulary size.
  • $W$ is a weight matrix of size $|V| \times (n-1)m$ that connects the input directly to the output layer.
  • $b$ is the bias vector for the output layer, of dimension $|V|$.
  • $y$ is the output vector containing the unnormalized log-probabilities (scores) for each word in the vocabulary, of dimension $|V|$.

Using softmax to convert the output vector into a probability distribution over the vocabulary,

$$P(w_t = i \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_i}}{\sum_{j=1}^{|V|} e^{y_j}}$$

Using a softmax layer ensures the constraints defined earlier:

  • All probabilities are positive, satisfying $f > 0$.
  • The probabilities sum to 1 over the vocabulary, since the denominator normalizes the exponentiated scores.

Loss function

The maximum likelihood estimate for selecting the target word over all the words in the vocabulary is equivalent to minimising the negative log likelihood,

$$L = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$$

where,

$T$ is the number of word sequence (training) examples.

As can be seen in the section on Loss for Multiclass classification (refer post on Gradients for Multiclass classification with SoftMax), the negative log likelihood is indeed the Categorical Cross Entropy Loss.

Python code

The training of a Neural Probabilistic Language Model in PyTorch involves a few key components, each corresponding to the mathematical elements discussed earlier:

  • torch.nn.Embedding — implements the word feature vector lookup function $C$. Each word index in the vocabulary maps to a dense vector in $\mathbb{R}^m$.
  • torch.nn.Linear — implements the fully-connected (dense) layers, corresponding to the transformation matrices $H$ and $U$.
  • torch.nn.Parameter – the parameters $W$ and $b$ are explicitly created.
  • torch.nn.functional.log_softmax — applies the SoftMax in log space to obtain $\log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$ while maintaining numerical stability.
  • torch.nn.NLLLoss — implements the Negative Log Likelihood Loss, which directly minimises $-\log P(w_t \mid \text{context})$ for the correct target word index.

These functions, combined with an optimizer such as torch.optim.SGD or torch.optim.Adam, form the complete training loop for the model.

code @ word_embeddings/neural_probabilistic_language_model.ipynb
The training loop implementing the model on a simple toy example of 20 sentences shows that the model does a reasonable job of predicting the probability of the next word.
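
The sketch below is a minimal, self-contained version of the model and a single training step, assuming a toy vocabulary of 42 words and arbitrarily chosen hyperparameters (embedding size, hidden size, context size); the notebook linked above contains the full training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLM(nn.Module):
    """Sketch of y = b + Wx + U tanh(d + Hx) from Bengio et al. (2003)."""
    def __init__(self, vocab_size, emb_dim=16, context_size=3, hidden_dim=32):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)                # word feature vectors C
        self.H = nn.Linear(context_size * emb_dim, hidden_dim)    # hidden projection (bias d included)
        self.U = nn.Linear(hidden_dim, vocab_size, bias=False)    # hidden -> output
        self.W = nn.Parameter(torch.zeros(vocab_size, context_size * emb_dim))  # direct input -> output
        self.b = nn.Parameter(torch.zeros(vocab_size))            # output bias

    def forward(self, context_idx):                  # context_idx: (batch, context_size)
        x = self.C(context_idx).flatten(1)           # concatenate context embeddings: (batch, (n-1)m)
        y = self.b + x @ self.W.t() + self.U(torch.tanh(self.H(x)))
        return F.log_softmax(y, dim=-1)              # log-probabilities over the vocabulary

# one training step on random toy data (illustrative only)
model = NPLM(vocab_size=42)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
context = torch.randint(0, 42, (8, 3))               # 8 examples, 3 context word indices each
target = torch.randint(0, 42, (8,))                  # next-word indices
loss = F.nll_loss(model(context), target)            # NLLLoss on log-probabilities
loss.backward()
optimizer.step()
```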

Hierarchical Softmax (Morin & Bengio 2005)

Reference : Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)

As computing the probability of all tokens using SoftMax scales with the vocabulary size $|V|$, in the paper Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005) proposed an approach to reduce the complexity from $O(|V|)$ to $O(\log |V|)$.

Based on the intuition shared in the paper Classes for Fast Maximum Entropy Training, J Goodman 2001, to compute $P(Y = y \mid X = x)$, instead of directly computing the probability of the target word $y$ given the context words $x$, we decompose it hierarchically as the product of:

  • $P(C = c(y) \mid X = x)$ : probability of $y$ being in class $c(y)$ given the context $x$
  • $P(Y = y \mid C = c(y), X = x)$ : probability of word $y$, given that $y$ is in class $c(y)$ AND the context $x$

i.e.

$$P(Y=y \mid X=x) = P(C=c(y) \mid X=x) \cdot P(Y=y \mid C=c(y), X=x)$$

where,

  • y : the target word we want to predict (e.g., “dog”).
  • x : the context (the surrounding words or features used to predict the next word, e.g., “the big”).
  • c(y) : the cluster/class that the target word y belongs to (e.g., dog → Noun class).

Derivation

To derive the decomposition of $P(Y=y \mid X=x)$, let us introduce a class variable $c(y)$, i.e. the word $y$ belongs to the class $c(y)$.

Then the probability $P(Y=y \mid X=x)$ can be written as a sum over whether $y$ is in the class $c(y)$ or not:

$$P(Y=y \mid X=x) = P(Y=y, C=c(y) \mid X=x) + P(Y=y, C \neq c(y) \mid X=x)$$

Since each word $y$ belongs to exactly one class, the term $P(Y=y, C \neq c(y) \mid X=x)$ is zero.

Hence,

$$P(Y=y \mid X=x) = P(Y=y, C=c(y) \mid X=x)$$

The term $P(Y=y, C=c(y) \mid X=x)$ can be expanded using the chain rule of conditional probability as follows:

$$P(Y=y, C=c(y) \mid X=x) = P(C=c(y) \mid X=x)\, P(Y=y \mid C=c(y), X=x)$$

Summarizing,

$$P(Y=y \mid X=x) = P(C=c(y) \mid X=x)\, P(Y=y \mid C=c(y), X=x)$$

Thus, computing $P(Y=y \mid X=x)$ reduces to first predicting the class $c(y)$ given the context $x$, and then predicting the word $y$ within that class conditioned on $x$.

Complexity

With this approach, instead of computing the probability over the entire vocabulary $|V|$, the computation is broken down into computing the probability over the classes, and then the probability over the words within the chosen class.

Taking the example shared in the paper, assume that $|V|$ is 10000 words, broken down into 100 classes with each class having 100 words. Then the computations needed are:

  • Finding probability over 100 classes
  • Finding probability over 100 words in the chosen class

This reduces the computation to ~200 probability calculations instead of 10000 in the flat structure. Equivalently, the complexity reduces from $|V|$ to roughly $\sqrt{|V|}$ operations.

Binary Tree

An alternative to class-based grouping is to arrange the vocabulary words as the leaves of a binary tree. Each internal node corresponds to a binary decision (left or right child), and each leaf corresponds to one word in the vocabulary. This hierarchical arrangement reduces the search complexity from $O(|V|)$ to $O(\log_2 |V|)$, making it efficient for large vocabularies.

For constructing the binary tree, multiple approaches are possible :

  • Perfect binary tree
    • Requires the number of leaves to be a power of 2 (e.g. 2, 4, 8, 16, etc.).
    • If |V| is not a power of 2, some leaves will remain unused
    • To reach every word, it takes the same path length, i.e. ceil(log2(|V|)).
    • Average depth: exactly ceil(log2(|V|)) since all leaves are at the same level.
  • Balanced binary tree
    • Tries to keep the left and right subtrees of equal size.
    • When the vocabulary is not a power of 2, leaf depths differ by at most 1.
    • No empty leaves; every leaf corresponds to a word.
    • Average depth: approximately log2(|V|), often slightly smaller because some leaves are shallower.
  • Word frequency based tree
    • Constructed using a Huffman coding structure, frequent words are placed closer to the root node while rare words are deeper.
    • This minimises the average number of binary decisions required to reach a word.
    • Average depth: depends on the frequency distribution; it is minimised and typically much smaller than log2(|V|) for natural language vocabularies (due to Zipf’s law).

For a toy corpus of 12 words, construction of the binary tree with the above approaches is shown below. code @ word_embeddings/binary_tree.ipynb
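
As an illustration of the frequency-based approach, here is a small sketch that builds a Huffman tree over word counts and returns the bit path for each word; the function name and the toy corpus are just for illustration and differ from the notebook.

```python
import heapq
from collections import Counter

def build_huffman_codes(word_counts):
    """Build a Huffman tree over the vocabulary and return the binary path (code) per word.
    Frequent words get shorter codes, i.e. they sit closer to the root."""
    # heap entries: (frequency, tie_breaker, subtree) where a subtree is a word or a (left, right) pair
    heap = [(freq, i, word) for i, (word, freq) in enumerate(word_counts.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def assign(node, path):
        if isinstance(node, tuple):         # internal node: recurse left (bit 0) and right (bit 1)
            assign(node[0], path + [0])
            assign(node[1], path + [1])
        else:                               # leaf: store the bit path for this word
            codes[node] = path
    assign(heap[0][2], [])
    return codes

# hypothetical toy corpus
corpus = "the cat sat on the mat the cat ran".split()
codes = build_huffman_codes(Counter(corpus))
print(codes)   # frequent words like 'the' get shorter bit paths
```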

Model

The probability of the next word given the context can be written as:

$$P(v \mid \text{context}) = \prod_{j=1}^{p} P\big(b_j(v) \mid b_1(v), \ldots, b_{j-1}(v), \text{context}\big)$$

where,

  • each word $v$ is represented by a bit vector $\big(b_1(v), b_2(v), \ldots, b_p(v)\big)$
  • the path (and its length $p$) depends on the position of the word in the binary tree.

For example, if each word is represented by 4 bits, then the probability of predicting the next word given the context becomes:

$$P(v \mid \text{context}) = P(b_1 \mid \text{context})\, P(b_2 \mid b_1, \text{context})\, P(b_3 \mid b_1, b_2, \text{context})\, P(b_4 \mid b_1, b_2, b_3, \text{context})$$

Taking log on both sides converts the product into a summation:

$$\log P(v \mid \text{context}) = \sum_{j=1}^{4} \log P(b_j \mid b_1, \ldots, b_{j-1}, \text{context})$$

In general, for a word represented with $p$ bits:

$$\log P(v \mid \text{context}) = \sum_{j=1}^{p} \log P(b_j \mid b_1, \ldots, b_{j-1}, \text{context})$$

The bit vector corresponds to the path (left or right at each node) starting from the root node to the leaf node (the word). Each internal node outputs a probability of going right ($p_j$). For the true label $b_j$, the binary cross-entropy loss at that node is:

$$\ell_j = -\big[\, b_j \log p_j + (1 - b_j) \log (1 - p_j) \,\big]$$

The total loss for predicting word $v$ is the sum of the node losses along the path:

$$L(v) = -\sum_{j=1}^{p} \big[\, b_j \log p_j + (1 - b_j) \log (1 - p_j) \,\big]$$

where

  • $p_j$ is the predicted probability at the $j$-th node along the path.
  • $b_j$ denotes the binary choice (0 or 1) at the $j$-th internal node along the path to word $v$.

This is equivalent to the negative log-likelihood of the full word probability.

Binary Node Predictor

Each internal node of the binary tree acts as a logistic classifier that decides left vs right, based on both the (n−1)-gram context and the node embedding. The conditional probability of taking the binary decision at a node, given the past context, is modelled as:

$$P(b = 1 \mid \text{node}, x) = \sigma\big( \alpha_{\text{node}} + \beta^\top \tanh(c + Wx + U N_{\text{node}}) \big)$$

where,

  • the sigmoid function is $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
  • for any word $i$ in the vocabulary $V$, look up a real-valued feature vector $C(i) \in \mathbb{R}^m$
  • $x$ : concatenation of the previous $(n-1)$ word embeddings, $x \in \mathbb{R}^{(n-1)m}$
  • $\alpha_{\text{node}}$ : bias term specific to the node, a scalar
  • $\beta$ : projection vector applied after the hidden nonlinearity, $\beta \in \mathbb{R}^h$
  • $c$ : bias for the hidden layer, $c \in \mathbb{R}^{h \times 1}$
  • $W$ : weight matrix projecting the context to the hidden space, $W \in \mathbb{R}^{h \times (n-1)m}$
  • $U$ : weight matrix projecting the node embedding, $U \in \mathbb{R}^{h \times d_{\text{node}}}$
  • $N_{\text{node}}$ : embedding vector for the current node, $N_{\text{node}} \in \mathbb{R}^{d_{\text{node}}}$

The matrices $W$, $U$, the projection vector $\beta$ and the bias $c$ are common parameters shared across all nodes.

Each internal node has its own $\alpha_{\text{node}}$ (scalar bias) and $N_{\text{node}}$ (node embedding). These determine the decision boundary at each internal node.

Python code – Naive implementation using for-loops

For the toy corpus, a naive implementation of hierarchical softmax using for-loops is provided.

  1. Defined a toy corpus of 20 sentences which has around 42 words.
  2. Training example is constructed as 3 context words and the corresponding target word
  3. Constructed a balanced binary tree, which has 41 internal nodes
  4. Model defined with the binary node predictor for each of the nodes
    • The parameters $W$, $U$, $\beta$ (projection vector) and the bias $c$ are shared across all nodes.
    • Each internal node has its own $\alpha_{\text{node}}$ and $N_{\text{node}}$ parameters
  5. For each target word in the training example, the path to the leaf node via the tree is known
  6. Using the binary decision at each path, the loss for each example is computed
  7. The loss is back propagated to find the parameters which minimizes the loss

Using the trained model, to find the probabilities of the top-k candidate words given the context words:

  • For each word in the vocabulary find the path to its leaf node
  • Starting from the root node, find the probability at each node
  • Based on the known decision (right vs left) at each node, use either p for going right OR (1-p) for going left
  • The joint probability is the product of probabilities at each node.
  • For numerical stability (loss of accuracy when many small probabilities are multiplied), log of probabilities is found and then summed
  • Finally, the log probability is exponentiated to get back a probability (optional)
  • Then the top-k candidate words are printed

code @ word_embeddings/hierarchical_probabilistic_neural_language_model.ipynb
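
The following is a compressed sketch of the binary node predictor and the path loss described above (shared $W$, $U$, $\beta$, $c$; per-node $\alpha$ and $N$), with illustrative dimensions and tensor names; the notebook contains the full model and training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSoftmaxNode(nn.Module):
    """Sketch of sigma(alpha_node + beta^T tanh(c + W x + U N_node)) for nodes on a path."""
    def __init__(self, n_nodes, context_dim, node_dim=8, hidden_dim=16):
        super().__init__()
        self.W = nn.Linear(context_dim, hidden_dim, bias=False)   # shared context projection
        self.U = nn.Linear(node_dim, hidden_dim, bias=False)      # shared node projection
        self.c = nn.Parameter(torch.zeros(hidden_dim))            # shared hidden bias
        self.beta = nn.Parameter(torch.randn(hidden_dim) * 0.01)  # shared output projection
        self.alpha = nn.Embedding(n_nodes, 1)                     # per-node scalar bias
        self.N = nn.Embedding(n_nodes, node_dim)                  # per-node embedding

    def prob_right(self, node_ids, x):
        """Probability of going right at each node on a path, for a single context vector x."""
        h = torch.tanh(self.c + self.W(x) + self.U(self.N(node_ids)))   # (path_len, hidden)
        return torch.sigmoid(self.alpha(node_ids).squeeze(-1) + h @ self.beta)

def path_loss(model, node_ids, decisions, x):
    """Binary cross-entropy summed over the internal nodes on the path to the target word."""
    p = model.prob_right(node_ids, x)
    return F.binary_cross_entropy(p, decisions.float(), reduction="sum")

# hypothetical usage: a path of 3 internal nodes for one training example
model = HierarchicalSoftmaxNode(n_nodes=41, context_dim=24)
x = torch.randn(24)                         # concatenated context embedding (illustrative)
node_ids = torch.tensor([0, 5, 12])         # internal nodes on the path to the leaf
decisions = torch.tensor([1, 0, 1])         # right / left decisions along the path
loss = path_loss(model, node_ids, decisions, x)
```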

Python code – Vectorized implementation

As one can imagine, using for loops significantly slows down training. To build a vectorized implementation, the following was done.

  1. Path preparation
    • Assign a unique id to every internal node in the binary tree.
    • Precompute for each word:
      • sequence of node-ids on the path to its leaf,
      • binary decision targets at each node.
    • Pad all paths to a fixed length using a dummy (UNK) node id. Build a mask to ignore padded positions.
  2. Parameter lookup
    • Use torch.nn.Embedding to fetch the node-specific parameters $\alpha$ (biases) and $N$ (embeddings).
    • Shapes: $\alpha$ is (batch, path_len, 1) and $N$ is (batch, path_len, $d_{\text{node}}$).
  3. Forward pass (vectorized)
    • Context projection: $Wx$ and bias $c$.
    • Node projection: $U N_{\text{node}}$ using the node embeddings fetched for each path.
    • Broadcast: $c + Wx + U N_{\text{node}}$, with the context term broadcast across all nodes on the path.
    • Nonlinearity: $h = \tanh(c + Wx + U N_{\text{node}})$.
    • Projection: $\alpha_{\text{node}} + \beta^\top h$ with the per-node bias $\alpha_{\text{node}}$.
    • Probabilities: $p = \sigma(\alpha_{\text{node}} + \beta^\top h)$.
  4. Loss and masking
    • Binary cross-entropy is computed between $p$ and the binary decision targets $b$, with the mask $m$ applied to ignore padded nodes:

$$L = -\sum_{j} m_j \big[\, b_j \log p_j + (1 - b_j) \log(1 - p_j) \,\big]$$

Notes :

  • Compute the context projection $Wx$ once per batch and broadcast it, instead of recomputing it per path.
  • UNK node parameters are trainable but excluded from loss using the mask.

code @ word_embeddings/vectorized_hierarchical_probabilistic_neural_language_model.ipynb
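
The masking step can be sketched as follows (tensor names and shapes are illustrative; see the notebook for the full vectorized model):

```python
import torch
import torch.nn.functional as F

def masked_path_bce(p, targets, mask):
    """Vectorized hierarchical-softmax loss sketch: BCE over padded node paths with a mask.
    p       : (batch, max_path) probabilities of going right at each node on the path
    targets : (batch, max_path) binary left/right decisions (padded positions arbitrary)
    mask    : (batch, max_path) 1.0 for real nodes, 0.0 for padded (UNK) positions
    """
    per_node = F.binary_cross_entropy(p, targets, reduction="none")  # (batch, max_path)
    return (per_node * mask).sum() / mask.sum()                      # ignore padded nodes

# hypothetical usage: batch of 4 paths padded to length 6
p = torch.rand(4, 6)
targets = torch.randint(0, 2, (4, 6)).float()
mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0],
                     [1, 1, 0, 0, 0, 0],
                     [1, 1, 1, 1, 0, 0]], dtype=torch.float)
print(masked_path_bce(p, targets, mask))
```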

Noise contrastive estimation (Gutmann et al 2012, Mnih et al 2012)

As computing the probability using SoftMax scales with the vocabulary size $|V|$, the paper Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012, provides an approach called Noise Contrastive Estimation (NCE). Instead of directly estimating the data distribution, NCE estimates the probability of a sample being from the data versus from a known noise distribution. By learning the ratio between the data and noise distributions, and knowing the noise distribution, the data distribution can be inferred.

This approach was extended to neural language models in the paper A fast and simple algorithm for training neural probabilistic language models, A Mnih et al, 2012.

Model

In the Neural Probabilistic Language Model, the estimation of the probability of a word using the SoftMax computation is

$$P_\theta(w \mid h) = \frac{e^{s_\theta(w, h)}}{\sum_{w' \in V} e^{s_\theta(w', h)}}$$

where,

  • the context words are $h = (w_{t-1}, \ldots, w_{t-n+1})$
  • the term $e^{s_\theta(w, h)}$ in the numerator is estimated using a neural model with parameters $\theta$, scoring the target word $w$ given the context words $h$
  • the term in the denominator is the sum over all the words in the vocabulary, i.e. $Z(h) = \sum_{w' \in V} e^{s_\theta(w', h)}$

Let us define a set $S = S_d \cup S_n$, which is the union of two sets, where

  • $D = 1$ is the class label when the word is from the true target word distribution (the set $S_d$)
  • $D = 0$ is the class label when the word is NOT from the true target word distribution (the noise set $S_n$)
  • $T_d$ is the number of true (data) samples in the batch (or dataset)
  • $k$ is the number of noise samples generated for contrast, per true sample

The formulation is,

  • for each true (data) sample, draw $k$ noise samples from a known noise distribution $q(w)$
  • the model has to learn a binary classification: whether the sample is from the true distribution or from the noise distribution

Further, instead of computing the denominator term $Z(h)$ for normalizing to probabilities, learn it as a context-dependent normalizing parameter $Z_h$, i.e.

$$P_\theta(w \mid h) = \frac{e^{s_\theta(w, h)}}{Z_h}$$

The probability of a sample $w$ coming from the true distribution, given the context $h$, is the model distribution

$$P(w \mid D=1, h) = P_\theta(w \mid h)$$

Similarly, the probability of the word under the noise distribution is

$$P(w \mid D=0, h) = q(w)$$

Further, since each true sample is contrasted with $k$ noise samples, the prior class probabilities are

$$P(D=1) = \frac{1}{k+1}, \qquad P(D=0) = \frac{k}{k+1}$$

Since NCE reframes the problem as a binary classification task (distinguishing true data from noise), the class labels are modelled as independent Bernoulli variables. Consequently, the conditional log likelihood is the sum of the binary cross-entropy terms:

For a single true target word $w_t$ (with context $h$) and its corresponding $k$ noise samples $\tilde{w}_1, \ldots, \tilde{w}_k$,

$$J(\theta) = \log P(D=1 \mid w_t, h) + \sum_{i=1}^{k} \log P(D=0 \mid \tilde{w}_i, h)$$

To evaluate this loss, we need to express $P(D=1 \mid w, h)$ in terms of the model parameters. Using Bayes' rule, the probability that the class is true, given the context and a word, is

$$P(D=1 \mid w, h) = \frac{P(D=1)\, P(w \mid D=1, h)}{P(D=1)\, P(w \mid D=1, h) + P(D=0)\, P(w \mid D=0, h)} = \frac{P_\theta(w \mid h)}{P_\theta(w \mid h) + k\, q(w)}$$

This gives the general probability for any word $w$. When calculating the loss for a true target word, we substitute $w = w_t$ to get the positive sample probability

$$P(D=1 \mid w_t, h) = \frac{P_\theta(w_t \mid h)}{P_\theta(w_t \mid h) + k\, q(w_t)}$$

Converting to the sigmoid form which is used in logistic regression,

$$P(D=1 \mid w_t, h) = \sigma\big(\Delta s_\theta(w_t, h)\big), \quad \text{where } \Delta s_\theta(w, h) = \log P_\theta(w \mid h) - \log\big(k\, q(w)\big)$$

Similarly, for a target word from the noise distribution, i.e. the probability that the class is noise given the context and the word, is

$$P(D=0 \mid w, h) = \frac{k\, q(w)}{P_\theta(w \mid h) + k\, q(w)}$$

This gives the general probability for any word $w$. When calculating the loss for a noise target word, we substitute $w = \tilde{w}_i$ to get the noise sample probability

$$P(D=0 \mid \tilde{w}_i, h) = \frac{k\, q(\tilde{w}_i)}{P_\theta(\tilde{w}_i \mid h) + k\, q(\tilde{w}_i)}$$

Converting to the sigmoid form which is used in logistic regression,

$$P(D=0 \mid \tilde{w}_i, h) = 1 - \sigma\big(\Delta s_\theta(\tilde{w}_i, h)\big) = \sigma\big(-\Delta s_\theta(\tilde{w}_i, h)\big)$$

Plugging these terms into the log likelihood for a single example,

$$J_t(\theta) = \log \sigma\big(\Delta s_\theta(w_t, h_t)\big) + \sum_{i=1}^{k} \log \sigma\big(-\Delta s_\theta(\tilde{w}_{t,i}, h_t)\big)$$

To obtain the objective function for the entire dataset, sum the log-likelihoods over all true training examples $t = 1, \ldots, T_d$. For each training example at step $t$, we have a specific context $h_t$, a true target word $w_t$, and a fresh set of $k$ noise samples $\tilde{w}_{t,1}, \ldots, \tilde{w}_{t,k}$.

The final loss function that we minimize is the negative log-likelihood over the full dataset:

$$L_{\text{NCE}}(\theta) = -\sum_{t=1}^{T_d} \Big[ \log \sigma\big(\Delta s_\theta(w_t, h_t)\big) + \sum_{i=1}^{k} \log \sigma\big(-\Delta s_\theta(\tilde{w}_{t,i}, h_t)\big) \Big]$$

Note :
In the paper A fast and simple algorithm for training neural probabilistic language models, A Mnih et al, 2012, the authors mention that fixing the context-dependent normalizing factor $Z_h$ to 1, instead of learning it, did not affect the performance on downstream tasks.

Noise Distribution

The noise distribution $q(w)$ is typically chosen proportional to the unigram frequency of words in the corpus:

$$q(w) = \frac{\text{count}(w)}{\sum_{w'} \text{count}(w')}$$

Often a smoothed unigram distribution improves results:

$$q_\alpha(w) \propto \text{count}(w)^{\alpha}, \quad \text{e.g. } \alpha = 0.75$$

Python code

For the toy vocabulary, the code for the Neural Probabilistic Language Model, Bengio et al, with the SoftMax head replaced with Noise Contrastive Estimation (NCE), is provided.

The code @ word_embeddings/nplm_with_noise_contrastive_estimation.ipynb
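
Below is a minimal sketch of the NCE loss for a batch, assuming the model produces unnormalized scores $s_\theta(w, h)$ with $Z_h$ fixed to 1 (as noted above), and that log-probabilities under the noise distribution $q$ are available; the function name, shapes, and toy inputs are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def nce_loss(score_target, score_noise, logq_target, logq_noise, k):
    """NCE loss sketch: -[ log sigma(Delta_target) + sum_i log sigma(-Delta_noise_i) ].

    score_target : (batch,)    model scores s_theta(w_t, h); with Z_h = 1 these act as log P_theta
    score_noise  : (batch, k)  model scores for the k noise words per example
    logq_*       : log q(w) under the noise distribution, same shapes as the scores
    """
    log_k = math.log(k)
    delta_target = score_target - (log_k + logq_target)   # Delta s = log P_theta - log(k q)
    delta_noise = score_noise - (log_k + logq_noise)
    loss = -(F.logsigmoid(delta_target) + F.logsigmoid(-delta_noise).sum(dim=1))
    return loss.mean()

# illustrative usage: batch of 4 examples, k = 5 noise samples, uniform noise over 42 words
k, vocab_size = 5, 42
score_target = torch.randn(4, requires_grad=True)
score_noise = torch.randn(4, k, requires_grad=True)
logq_target = torch.full((4,), math.log(1.0 / vocab_size))
logq_noise = torch.full((4, k), math.log(1.0 / vocab_size))
nce_loss(score_target, score_noise, logq_target, logq_noise, k).backward()
```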

Word2Vec papers (Mikolov et al, 2013)

In the paper Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013 proposed architectures that reduce the computational complexity of learning word embeddings, with the intuition that simpler models enable training on much larger corpora of data.

Two architectures were proposed.

Continuous Bag of Words (CBOW) Model

When comparing with Neural Probabilistic Language Model, Bengio et al 2003, the following simplifications are proposed.

  • order of context words is ignored
    • instead of concatenating embedding of previous words, averaging the word embeddings of surrounding words is proposed
    • this approach is called “bag-of-words” as the order is not taken into consideration
  • no non linear hidden layer
    • the model uses a shared projection layer

Additionally, in this model the context includes future words too (words on both sides of the target).

Equations

The neural network output is:

$$y = Ux$$

where:

  • $U \in \mathbb{R}^{|V| \times m}$ is the output weight matrix, mapping from the hidden dimension $m$ to the vocabulary size $|V|$
  • $x \in \mathbb{R}^{m \times 1}$ is the averaged context embedding vector.

The averaged context embedding vector $x \in \mathbb{R}^{m \times 1}$ is computed as:

$$x = \frac{1}{2n} \sum_{\substack{i = -n \\ i \neq 0}}^{n} C(w_{t+i})$$

where,

  • $C \in \mathbb{R}^{|V| \times m}$ is the input embedding matrix
  • $C(w_{t+i})$ is the embedding of the $i$-th context word, and
  • $n$ is the number of words to the left or right of the target word, giving a total context size of $2n$

The probability distribution over the vocabulary is obtained using the softmax function:

$$\hat{y}_j = P(w_t = j \mid \text{context}) = \frac{e^{y_j}}{\sum_{k=1}^{|V|} e^{y_k}}$$

where:

  • $\hat{y}_j$ is the predicted probability of word $j$ being the target word.
  • $y_j$ is the score for word $j$ from the output layer.
  • The denominator sums the exponentiated scores over all vocabulary entries.
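
A minimal sketch of this CBOW forward pass (averaged context embeddings, a single linear output layer, and a softmax) is shown below; the toy vocabulary size, window size, and random inputs are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    """Minimal CBOW sketch: average the context embeddings, then a linear output layer + softmax."""
    def __init__(self, vocab_size, emb_dim=16):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)            # input embedding matrix C
        self.U = nn.Linear(emb_dim, vocab_size, bias=False)   # output weight matrix U

    def forward(self, context_idx):            # context_idx: (batch, 2n) word indices
        x = self.C(context_idx).mean(dim=1)    # average the context embeddings -> (batch, m)
        y = self.U(x)                          # scores over the vocabulary -> (batch, |V|)
        return F.log_softmax(y, dim=-1)

# hypothetical usage: window of n = 2 on each side (4 context words), toy vocabulary of 42 words
model = CBOW(vocab_size=42)
context = torch.randint(0, 42, (8, 4))
target = torch.randint(0, 42, (8,))
loss = F.nll_loss(model(context), target)
```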

Continuous Skip-gram Model

The Skip-gram model tries to predict the context words given the current target word. The main idea is that each word is trained to predict the words surrounding it within a context window of size $n$.

  • Input: one-hot encoding of the target word wt
  • Output: probability distribution over vocabulary for each context word
  • No non-linear hidden layer: uses a shared projection matrix (linear)

Given a target word $w_t$, the model tries to predict each surrounding context word $w_{t+i}$ for $-n \le i \le n$, $i \neq 0$. The training goal is to maximize the (log) probability of all context words around each target word:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-n \le i \le n \\ i \neq 0}} \log P(w_{t+i} \mid w_t)$$

Equations

The output score is

$$y = Ux$$

where,

  • $x = C(w_t)$, where $x \in \mathbb{R}^{m \times 1}$ is the embedding vector of the target word $w_t$
  • $C \in \mathbb{R}^{|V| \times m}$ is the input embedding matrix
  • $U \in \mathbb{R}^{|V| \times m}$ is the output embedding matrix

The scores are computed for each context word, and the probability of all the context words is maximized.

For either CBOW or Skip-gram, both $C$ and $U$ are trainable. After training, either one (or their average) is used as the word embedding.

The naive way of finding the probability is using SoftMax,

$$P(w_O \mid w_t) = \frac{e^{y_{w_O}}}{\sum_{j=1}^{|V|} e^{y_j}}$$

where,

  • $w_O$ is the output (context) word
  • $y_{w_O}$ is the score for the output word
  • the denominator is the normalizing constant over the vocabulary

For finding this probability efficiently, hierarchical softmax is proposed in the paper.

Negative Sampling

In the paper, Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013 introduced two concepts:

  • the speedup provided by subsampling of frequent words also improves the accuracy of the representations of less-frequent words
  • a simplified variant of Noise Contrastive Estimation called Negative Sampling

The key intuition in Negative Sampling is that the noise contrastive loss defined above includes terms that normalize the scores to approximate probabilities. However, to learn word embeddings the probabilities themselves are not needed, and the terms

$$\log\big(k\, q(w)\big) \quad \text{(and the normalizer } Z_h\text{)}$$

can be ignored, using the raw score $u_w^\top x$ directly inside the sigmoid.

With this simplification, the negative sampling loss for a target word $w_t$, an observed context word $w_O$, and $k$ negative samples $\tilde{w}_1, \ldots, \tilde{w}_k$ is,

$$L_{\text{NEG}} = -\Big[ \log \sigma\big(u_{w_O}^\top x\big) + \sum_{i=1}^{k} \log \sigma\big(-u_{\tilde{w}_i}^\top x\big) \Big]$$

where $x = C(w_t)$ and $u_w$ is the output embedding (row of $U$) for word $w$.

Python code

For the toy vocabulary, finding word vectors with

a) Continuous Bag of words (CBOW) with negative sampling

The code @ word_embeddings/cbow_negative_sampling copy.ipynb

b) Skip-gram with Negative Sampling

The code @ word_embeddings/skip_gram_negative_sampling.ipynb
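
The following is a minimal sketch of skip-gram with negative sampling, where negatives are drawn from a smoothed unigram distribution (counts raised to a commonly used exponent of 0.75); the counts, dimensions, and sampling here are illustrative and differ from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNEG(nn.Module):
    """Minimal skip-gram with negative sampling sketch: score = u_context^T C(target)."""
    def __init__(self, vocab_size, emb_dim=16):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)   # input (target word) embeddings
        self.U = nn.Embedding(vocab_size, emb_dim)   # output (context word) embeddings

    def forward(self, target, context, negatives):
        # target: (batch,), context: (batch,), negatives: (batch, k)
        x = self.C(target)                                    # (batch, m)
        pos_score = (self.U(context) * x).sum(dim=-1)         # (batch,)
        neg_score = torch.bmm(self.U(negatives), x.unsqueeze(-1)).squeeze(-1)  # (batch, k)
        # negative sampling loss: -[log sigma(pos) + sum log sigma(-neg)]
        return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=-1)).mean()

# hypothetical usage: negatives drawn from a smoothed unigram distribution (counts ** 0.75)
vocab_size, k = 42, 5
counts = torch.rand(vocab_size) * 100
noise_dist = counts.pow(0.75) / counts.pow(0.75).sum()
target = torch.randint(0, vocab_size, (8,))
context = torch.randint(0, vocab_size, (8,))
negatives = torch.multinomial(noise_dist, 8 * k, replacement=True).view(8, k)
model = SkipGramNEG(vocab_size)
loss = model(target, context, negatives)
loss.backward()
```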


GloVe Embeddings (Pennington et al 2014)

In the paper GloVe: Global Vectors for Word Representation, Pennington et al 2014, the authors propose that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities.

Let,

  • $X$ be the matrix of word-word co-occurrence counts
  • $X_{ij}$ be the number of times word $j$ occurs in the context of word $i$
  • $X_i = \sum_k X_{ik}$ be the number of times any word appears in the context of word $i$
  • $P_{ij} = P(j \mid i) = X_{ij} / X_i$ be the probability that word $j$ occurs in the context of word $i$.

The authors show that, on a 6 billion token corpus, for the target words ice and steam and probe words solid, gas and water:

  • $P(\text{solid} \mid \text{ice}) = 1.9 \times 10^{-4}$ and $P(\text{solid} \mid \text{steam}) = 2.2 \times 10^{-5}$
  • $P(\text{gas} \mid \text{ice}) = 6.6 \times 10^{-5}$ and $P(\text{gas} \mid \text{steam}) = 7.8 \times 10^{-4}$
  • $P(\text{water} \mid \text{ice}) = 3.0 \times 10^{-3}$ and $P(\text{water} \mid \text{steam}) = 2.2 \times 10^{-3}$

Taking the ratio of co-occurrence probabilities,

  • $P(\text{solid} \mid \text{ice}) / P(\text{solid} \mid \text{steam}) \approx 8.9$
  • $P(\text{gas} \mid \text{ice}) / P(\text{gas} \mid \text{steam}) \approx 8.5 \times 10^{-2}$
  • $P(\text{water} \mid \text{ice}) / P(\text{water} \mid \text{steam}) \approx 1.36$

The ratios indicate that,

  • solid is much more related to ice than to steam (ratio much greater than 1).
  • gas is far less likely to co-occur with ice than with steam (ratio much less than 1).
  • water is related to both ice and steam in similar proportions (ratio close to 1).

Model

To capture this ratio relationship in a vector space, the authors search for a function $F$ that satisfies:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

where

  • $k$ is the context word
  • $i, j$ are target words
  • $w \in \mathbb{R}^d$ are the word vectors and
  • $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors.

The authors enforce that the relationship should be linear (vector difference) and the result should be a scalar (dot product), leading to:

$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

To satisfy this, the authors propose choosing $F = \exp$, so that the dot product of the vector difference can be written as a ratio of probabilities,

$$\exp\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{\exp\big(w_i^\top \tilde{w}_k\big)}{\exp\big(w_j^\top \tilde{w}_k\big)} = \frac{P_{ik}}{P_{jk}}$$

With this choice, for a single word-context pair, $\exp\big(w_i^\top \tilde{w}_k\big)$ estimates the co-occurrence probability,

$$\exp\big(w_i^\top \tilde{w}_k\big) = P_{ik} = \frac{X_{ik}}{X_i}$$

Taking the logarithm,

$$w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$$

Note :

The model capturing the relation between two words should not change even if the words are swapped. Even though the co-occurrence counts are identical ($X_{ik} = X_{ki}$), because the total counts of the words are not equal ($X_i \neq X_k$), the conditional probability is not symmetric ($P_{ik} \neq P_{ki}$).

The above equation is not symmetric if we swap the target word and the context word, as the row-dependent term $\log X_i$ has to be handled.

To make it symmetric, the authors absorb $\log X_i$ into a learnable bias term $b_i$ and then add a corresponding bias $\tilde{b}_k$ for the context word. This ensures the model is fully symmetric, i.e.

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$

The loss function then becomes,

$$J = \sum_{i,k=1}^{|V|} \big( w_i^\top \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \big)^2$$

The key aspect of the above simplification is that training the word pairs to minimize this loss indirectly ensures that the dot product of the vector difference of target words with a context word arrives at the ratio of co-occurrence probabilities.

Weighted Least Squares

The above loss function weighs all co-occurrences equally. The authors note that rare co-occurrences are noisy and that around 75-95% of the entries of $X$ are zeros, and propose adding a weighting function $f(X_{ik})$ to the least squares loss above.

The weighting function $f(x)$ is chosen to obey the following:

  1. $f(0) = 0$ (to handle the zero co-occurrence counts)
  2. $f(x)$ should be non-decreasing so that rare co-occurrences are given less weight
  3. $f(x)$ should be relatively small for large values of $x$ so that frequent co-occurrences are not over-weighted

A choice that satisfies these properties is

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

The parameters $x_{\max}$ and $\alpha$ are chosen empirically ($x_{\max} = 100$ and $\alpha = 3/4$ in the paper).

Then the Weighted Least Squares loss function becomes,

$$J = \sum_{i,k=1}^{|V|} f(X_{ik}) \big( w_i^\top \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \big)^2$$

Python code

The code @ word_embeddings/glove_word_embedding.ipynb
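
A minimal sketch of the GloVe weighted least-squares objective over a handful of non-zero co-occurrence entries is shown below, assuming the weighting function above with $x_{\max} = 100$ and $\alpha = 0.75$; the indices and counts are illustrative and differ from the notebook.

```python
import torch
import torch.nn as nn

class GloVe(nn.Module):
    """Minimal GloVe sketch: weighted least squares over non-zero co-occurrence counts."""
    def __init__(self, vocab_size, emb_dim=16, x_max=100.0, alpha=0.75):
        super().__init__()
        self.w = nn.Embedding(vocab_size, emb_dim)        # target word vectors
        self.w_tilde = nn.Embedding(vocab_size, emb_dim)  # context word vectors
        self.b = nn.Embedding(vocab_size, 1)              # target word biases
        self.b_tilde = nn.Embedding(vocab_size, 1)        # context word biases
        self.x_max, self.alpha = x_max, alpha

    def forward(self, i, k, x_ik):
        # weighting function f(x) = (x / x_max)^alpha, capped at 1
        weight = torch.clamp(x_ik / self.x_max, max=1.0).pow(self.alpha)
        pred = (self.w(i) * self.w_tilde(k)).sum(-1) + self.b(i).squeeze(-1) + self.b_tilde(k).squeeze(-1)
        return (weight * (pred - torch.log(x_ik)) ** 2).sum()

# hypothetical usage on a few non-zero co-occurrence entries (i, k, X_ik)
model = GloVe(vocab_size=42)
i = torch.tensor([0, 1, 2])
k = torch.tensor([3, 4, 5])
x_ik = torch.tensor([10.0, 2.0, 55.0])
loss = model(i, k, x_ik)
loss.backward()
```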

Summary

This article covers

Evolution: How we moved from Bengio’s NPLM (2003) to efficient architectures like Word2Vec and GloVe.

Math: Detailed derivations of Hierarchical Softmax (using binary trees) and Noise Contrastive Estimation (differentiating data from noise).

Architectures: A deep look at CBOW, Skip-Gram, and the intuition behind Negative Sampling.

Code: Complete Python implementations for every model discussed, including vectorized implementations for efficiency.

Acknowledgment

In addition to the primary papers listed above, this post draws inspiration from the excellent overview in the post Learning word embedding, Weng, Lilian 2017. Credit also goes to the recent Large Language Models Gemini and ChatGPT which helped to bounce thoughts and refine the drafts.
