Word Embeddings using neural networks

In machine learning, converting the input data (text, images, or time series) into a vector format (also known as embeddings) is a key building block for enabling downstream tasks. This article explores in detail the architectures of some of the neural-network-based word embedding models in the literature.

Papers referenced:

  1. Neural Probabilistic Language Model, Bengio et al 2003
    • proposed a neural network architecture to jointly learn word feature vectors and the probability of words in a sequence.
  2. Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
    • given that the softmax layer for finding the probability scales with the vocabulary size, proposed a hierarchical version of softmax to reduce the complexity from $O(|V|)$ to $O(\log |V|)$.
  3. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012
    • proposed Noise Contrastive Estimation (NCE), which learns a model by distinguishing data samples from samples drawn from a known noise distribution.
  4. Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013.
    • proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpora of data.
    • introduced Continuous Bag of Words (CBOW), which predicts the center word given the context, and Skip-gram, which predicts the surrounding words given the center word.
  5. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013
    • the speedup provided by subsampling of frequent words also improves the accuracy of the representations of less-frequent words
    • introduced a simplified variant of Noise Contrastive Estimation called Negative Sampling
  6. GloVe: Global Vectors for Word Representation, Pennington et al 2014
    • proposed that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities.

In this post we will cover the key aspects proposed in the above papers, with supporting Python code.

Neural Probabilistic Language Model (Bengio et al, 2003)

Reference : Neural Probabilistic Language Model, Bengio et al 2003

The probability of a sequence of words can be expressed as a product of conditional probabilities of each word given the sequence of previous words, i.e.

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$

For example, consider a sequence of 4 words $(w_1, w_2, w_3, w_4)$, say "the cat sat down".

Then, by the chain rule of probability:

$$P(w_1, w_2, w_3, w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)\, P(w_4 \mid w_1, w_2, w_3)$$

Substituting the actual words:

$$P(\text{the}, \text{cat}, \text{sat}, \text{down}) = P(\text{the})\, P(\text{cat} \mid \text{the})\, P(\text{sat} \mid \text{the}, \text{cat})\, P(\text{down} \mid \text{the}, \text{cat}, \text{sat})$$

For a long word sequence, instead of conditioning on all previous words, it is common to approximate the probability by conditioning only on the last $n-1$ words. That is:

$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

Neural Network Architecture

The neural probabilistic language model builds on the n-gram approximation and proposes a way to

  • Jointly learn a word feature vector for each word in the vocabulary (a real-valued vector in $\mathbb{R}^m$), and
  • Learn the probability of a sequence of words expressed in terms of the sequence of word feature vectors

The objective is to learn a model $f$ that predicts the probability of the next word given the previous $n-1$ words, i.e.

$$f(w_t, w_{t-1}, \ldots, w_{t-n+1}) = P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$$

The model is subject to the following constraints:

  • For any sequence of words, the model outputs a non-zero probability, i.e. $f(w_t, w_{t-1}, \ldots, w_{t-n+1}) > 0$.
  • The sum of the probabilities over all words in the vocabulary equals 1, i.e.

$$\sum_{i=1}^{|V|} f(i, w_{t-1}, \ldots, w_{t-n+1}) = 1$$

where $|V|$ is the vocabulary size, and $i$ indexes over all possible words in the vocabulary.

Note :

  • Non-zero probability: Ensures that the model never completely rules out any word as a possible next word, allowing it to adapt to all possible word sequences and avoid zero-probability issues during training.
  • Probabilities sum to one: Guarantees that f defines a valid probability distribution over the vocabulary for the next word, so the total probability of all possible next words is exactly 1.

The estimation of the function $f$ is done as follows:

  • for any word $i$ in the vocabulary $V$, look up a real-valued feature vector $C(i) \in \mathbb{R}^m$
  • a function $g$ maps the input sequence of feature vectors of the context words, $\big(C(w_{t-n+1}), \ldots, C(w_{t-1})\big)$, to a conditional probability distribution over words in $V$ for the next word $w_t$

Model

The neural network model can be expressed as:

$$y = b + Wx + U \tanh(d + Hx)$$

where:

  • $x = \big(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\big)$ is the concatenated input feature vector of the previous $n-1$ words, with dimension $(n-1)m$.
  • $H$ is a weight matrix of size $h \times (n-1)m$, which transforms the input into the hidden layer space.
  • $d$ is the bias vector for the hidden layer, of dimension $h$.
  • $U$ is a weight matrix of size $|V| \times h$ that maps the hidden layer activations to the output layer, where $|V|$ is the vocabulary size.
  • $W$ is a weight matrix of size $|V| \times (n-1)m$ that connects the input directly to the output layer.
  • $b$ is the bias vector for the output layer, of dimension $|V|$.
  • $y$ is the output vector containing the unnormalized log-probabilities (scores) for each word in the vocabulary, of dimension $|V|$.

Using softmax to convert the output vector into a probability distribution over the vocabulary,

$$P(w_t = i \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_i}}{\sum_{j=1}^{|V|} e^{y_j}}$$

Using a softmax layer ensures the constraints defined earlier:

  • All probabilities are positive, satisfying $f > 0$.
  • The probabilities sum to 1 over the vocabulary, since the denominator normalizes the exponentiated scores.

Loss function

The maximum likelihood estimate for selecting the target word over all the words in the vocabulary is equivalent to minimising the negative log likelihood,

$$L = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$$

where,

$T$ is the number of word sequence (training) examples.

As can be seen in the section on Loss for Multiclass classification (refer post on Gradients for Multiclass classification with SoftMax), the negative log likelihood is indeed the Categorical Cross Entropy Loss.

Python code

The training of a Neural Probabilistic Language Model in PyTorch involves a few key components, each corresponding to the mathematical elements discussed earlier:

  • torch.nn.Embedding — implements the word feature vector lookup function $C$. Each word index in the vocabulary maps to a dense vector in $\mathbb{R}^m$.
  • torch.nn.Linear — implements the fully-connected (dense) layers, corresponding to the transformation matrices $H$ and $U$.
  • torch.nn.Parameter – the parameters $W$ and $b$ are explicitly created.
  • torch.nn.functional.log_softmax — applies the SoftMax in log space to obtain $\log P(w_t \mid w_{t-1}, \ldots, w_{t-n+1})$ while maintaining numerical stability.
  • torch.nn.NLLLoss — implements the Negative Log Likelihood Loss, which directly minimises $-\log P(w_t \mid \text{context})$ for the correct target word index.

These functions, combined with an optimizer such as torch.optim.SGD or torch.optim.Adam, form the complete training loop for the model.

code @ word_embeddings/neural_probabilistic_language_model.ipynb
The training loop implementing the model on a simple toy example of 20 sentences shows that the model does a reasonable job of predicting the probability of the next word.
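
The sketch below is a minimal, self-contained version of the model and a single training step, assuming a toy vocabulary of 42 words and arbitrarily chosen hyperparameters (embedding size, hidden size, context size); the notebook linked above contains the full training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLM(nn.Module):
    """Sketch of y = b + Wx + U tanh(d + Hx) from Bengio et al. (2003)."""
    def __init__(self, vocab_size, emb_dim=16, context_size=3, hidden_dim=32):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)                # word feature vectors C
        self.H = nn.Linear(context_size * emb_dim, hidden_dim)    # hidden projection (bias d included)
        self.U = nn.Linear(hidden_dim, vocab_size, bias=False)    # hidden -> output
        self.W = nn.Parameter(torch.zeros(vocab_size, context_size * emb_dim))  # direct input -> output
        self.b = nn.Parameter(torch.zeros(vocab_size))            # output bias

    def forward(self, context_idx):                  # context_idx: (batch, context_size)
        x = self.C(context_idx).flatten(1)           # concatenate context embeddings: (batch, (n-1)m)
        y = self.b + x @ self.W.t() + self.U(torch.tanh(self.H(x)))
        return F.log_softmax(y, dim=-1)              # log-probabilities over the vocabulary

# one training step on random toy data (illustrative only)
model = NPLM(vocab_size=42)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
context = torch.randint(0, 42, (8, 3))               # 8 examples, 3 context word indices each
target = torch.randint(0, 42, (8,))                  # next-word indices
loss = F.nll_loss(model(context), target)            # NLLLoss on log-probabilities
loss.backward()
optimizer.step()
```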

Hierarchical Softmax (Morin & Bengio 2005)

Reference : Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)

As computing the probability of all tokens using SoftMax scales with the vocabulary size $|V|$, in the paper Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005) proposed an approach to reduce the complexity from $O(|V|)$ to $O(\log |V|)$.

Based on the intuition shared in the paper Classes for Fast Maximum Entropy Training, J Goodman 2001, to compute $P(Y = y \mid X = x)$, instead of directly computing the probability of the target word $y$ given the context words $x$, we decompose it hierarchically as the product of:

  • $P(C = c(y) \mid X = x)$ : probability of $y$ being in class $c(y)$ given the context $x$
  • $P(Y = y \mid C = c(y), X = x)$ : probability of word $y$, given that $y$ is in class $c(y)$ AND the context $x$

i.e.

$$P(Y=y \mid X=x) = P(C=c(y) \mid X=x) \cdot P(Y=y \mid C=c(y), X=x)$$

where,

  • y : the target word we want to predict (e.g., “dog”).
  • x : the context (the surrounding words or features used to predict the next word, e.g., “the big”).
  • c(y) : the cluster/class that the target word y belongs to (e.g., dog → Noun class).

Derivation

To derive the decomposition of $P(Y=y \mid X=x)$, let us introduce a class variable $c(y)$, i.e. the word $y$ belongs to the class $c(y)$.

Then the probability $P(Y=y \mid X=x)$ can be written as a sum over whether $y$ is in the class $c(y)$ or not:

$$P(Y=y \mid X=x) = P(Y=y, C=c(y) \mid X=x) + P(Y=y, C \neq c(y) \mid X=x)$$

Since each word $y$ belongs to exactly one class, the term $P(Y=y, C \neq c(y) \mid X=x)$ is zero.

Hence,

$$P(Y=y \mid X=x) = P(Y=y, C=c(y) \mid X=x)$$

The term $P(Y=y, C=c(y) \mid X=x)$ can be expanded using the chain rule of conditional probability as follows:

$$P(Y=y, C=c(y) \mid X=x) = P(C=c(y) \mid X=x)\, P(Y=y \mid C=c(y), X=x)$$

Summarizing,

$$P(Y=y \mid X=x) = P(C=c(y) \mid X=x)\, P(Y=y \mid C=c(y), X=x)$$

Thus, computing $P(Y=y \mid X=x)$ reduces to first predicting the class $c(y)$ given the context $x$, and then predicting the word $y$ within that class conditioned on $x$.

Complexity

With this approach, instead of computing the probability over the entire vocabulary $|V|$, the computation is broken down into computing the probability over the classes, and then the probability over the words within the chosen class.

Taking the example shared in the paper, assume that $|V|$ is 10000 words, broken down into 100 classes with each class having 100 words. Then the computations needed are:

  • Finding probability over 100 classes
  • Finding probability over 100 words in the chosen class

This reduces the computation to ~200 probability calculations instead of 10000 in the flat structure. Equivalently, the complexity reduces from $|V|$ to roughly $\sqrt{|V|}$ operations.

Binary Tree

An alternative to class-based grouping is to arrange the vocabulary words as the leaves of a binary tree. Each internal node corresponds to a binary decision (left or right child), and each leaf corresponds to one word in the vocabulary. This hierarchical arrangement reduces the search complexity from $O(|V|)$ to $O(\log_2 |V|)$, making it efficient for large vocabularies.

For constructing the binary tree, multiple approaches are possible :

  • Perfect binary tree
    • Requires the number of leaves to be a power of 2 (e.g. 2, 4, 8, 16, etc.).
    • If |V| is not a power of 2, some leaves will remain unused
    • To reach every word, it takes the same path length, i.e. ceil(log2(|V|)).
    • Average depth: exactly ceil(log2(|V|)) since all leaves are at the same level.
  • Balanced binary tree
    • Tries to keep the left and right subtrees of equal size.
    • When the vocabulary is not a power of 2, leaf depths differ by at most 1.
    • No empty leaves; every leaf corresponds to a word.
    • Average depth: approximately log2(|V|), often slightly smaller because some leaves are shallower.
  • Word frequency based tree
    • Constructed using a Huffman coding structure, frequent words are placed closer to the root node while rare words are deeper.
    • This minimises the average number of binary decisions required to reach a word.
    • Average depth: depends on the frequency distribution; it is minimised and typically much smaller than log2(|V|) for natural language vocabularies (due to Zipf’s law).

For a toy corpus of 12 words, construction of the binary tree with the above approaches is shown below. code @ word_embeddings/binary_tree.ipynb
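
As an illustration of the frequency-based approach, here is a small sketch that builds a Huffman tree over word counts and returns the bit path for each word; the function name and the toy corpus are just for illustration and differ from the notebook.

```python
import heapq
from collections import Counter

def build_huffman_codes(word_counts):
    """Build a Huffman tree over the vocabulary and return the binary path (code) per word.
    Frequent words get shorter codes, i.e. they sit closer to the root."""
    # heap entries: (frequency, tie_breaker, subtree) where a subtree is a word or a (left, right) pair
    heap = [(freq, i, word) for i, (word, freq) in enumerate(word_counts.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def assign(node, path):
        if isinstance(node, tuple):         # internal node: recurse left (bit 0) and right (bit 1)
            assign(node[0], path + [0])
            assign(node[1], path + [1])
        else:                               # leaf: store the bit path for this word
            codes[node] = path
    assign(heap[0][2], [])
    return codes

# hypothetical toy corpus
corpus = "the cat sat on the mat the cat ran".split()
codes = build_huffman_codes(Counter(corpus))
print(codes)   # frequent words like 'the' get shorter bit paths
```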

Model

The probability of the next word given the context can be written as:

$$P(v \mid \text{context}) = \prod_{j=1}^{p} P\big(b_j(v) \mid b_1(v), \ldots, b_{j-1}(v), \text{context}\big)$$

where,

  • each word $v$ is represented by a bit vector $\big(b_1(v), b_2(v), \ldots, b_p(v)\big)$
  • the path (and its length $p$) depends on the position of the word in the binary tree.

For example, if each word is represented by 4 bits, then the probability of predicting the next word given the context becomes:

$$P(v \mid \text{context}) = P(b_1 \mid \text{context})\, P(b_2 \mid b_1, \text{context})\, P(b_3 \mid b_1, b_2, \text{context})\, P(b_4 \mid b_1, b_2, b_3, \text{context})$$

Taking log on both sides converts the product into a summation:

$$\log P(v \mid \text{context}) = \sum_{j=1}^{4} \log P(b_j \mid b_1, \ldots, b_{j-1}, \text{context})$$

In general, for a word represented with $p$ bits:

$$\log P(v \mid \text{context}) = \sum_{j=1}^{p} \log P(b_j \mid b_1, \ldots, b_{j-1}, \text{context})$$

The bit vector corresponds to the path (left or right at each node) starting from the root node to the leaf node (the word). Each internal node outputs a probability of going right ($p_j$). For the true label $b_j$, the binary cross-entropy loss at that node is:

$$\ell_j = -\big[\, b_j \log p_j + (1 - b_j) \log (1 - p_j) \,\big]$$

The total loss for predicting word $v$ is the sum of the node losses along the path:

$$L(v) = -\sum_{j=1}^{p} \big[\, b_j \log p_j + (1 - b_j) \log (1 - p_j) \,\big]$$

where

  • $p_j$ is the predicted probability at the $j$-th node along the path.
  • $b_j$ denotes the binary choice (0 or 1) at the $j$-th internal node along the path to word $v$.

This is equivalent to the negative log-likelihood of the full word probability.

Binary Node Predictor

Each internal node of the binary tree acts as a logistic classifier that decides left vs right, based on both the (n−1)-gram context and the node embedding. The conditional probability of taking the binary decision at a node, given the past context, is modelled as:

$$P(b = 1 \mid \text{node}, x) = \sigma\big( \alpha_{\text{node}} + \beta^\top \tanh(c + Wx + U N_{\text{node}}) \big)$$

where,

  • the sigmoid function is $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
  • for any word $i$ in the vocabulary $V$, look up a real-valued feature vector $C(i) \in \mathbb{R}^m$
  • $x$ : concatenation of the previous $(n-1)$ word embeddings, $x \in \mathbb{R}^{(n-1)m}$
  • $\alpha_{\text{node}}$ : bias term specific to the node, a scalar
  • $\beta$ : projection vector applied after the hidden nonlinearity, $\beta \in \mathbb{R}^h$
  • $c$ : bias for the hidden layer, $c \in \mathbb{R}^{h \times 1}$
  • $W$ : weight matrix projecting the context to the hidden space, $W \in \mathbb{R}^{h \times (n-1)m}$
  • $U$ : weight matrix projecting the node embedding, $U \in \mathbb{R}^{h \times d_{\text{node}}}$
  • $N_{\text{node}}$ : embedding vector for the current node, $N_{\text{node}} \in \mathbb{R}^{d_{\text{node}}}$

The matrices $W$, $U$, the projection vector $\beta$ and the bias $c$ are common parameters shared across all nodes.

Each internal node has its own $\alpha_{\text{node}}$ (scalar bias) and $N_{\text{node}}$ (node embedding). These determine the decision boundary at each internal node.

Python code – Naive implementation using for-loops

For the toy corpus, a naive implementation of hierarchical softmax using for-loops is provided.

  1. Defined a toy corpus of 20 sentences which has around 42 words.
  2. Training example is constructed as 3 context words and the corresponding target word
  3. Constructed a balanced binary tree, which has 41 internal nodes
  4. Model defined with the binary node predictor for each of the nodes
    • The parameters $W$, $U$, $\beta$ (projection vector) and the bias $c$ are shared across all nodes.
    • Each internal node has its own $\alpha_{\text{node}}$ and $N_{\text{node}}$ parameters
  5. For each target word in the training example, the path to the leaf node via the tree is known
  6. Using the binary decision at each path, the loss for each example is computed
  7. The loss is back propagated to find the parameters which minimizes the loss

Using the trained model, to find the probabilities of the top-k candidate words given the context words:

  • For each word in the vocabulary find the path to its leaf node
  • Starting from the root node, find the probability at each node
  • Based on the known decision (right vs left) at each node, use either p for going right OR (1-p) for going left
  • The joint probability is the product of probabilities at each node.
  • For numerical stability (loss of accuracy when many small probabilities are multiplied), log of probabilities is found and then summed
  • Finally, the log probability is exponentiated to get back a probability (optional)
  • Then the top-k candidate words are printed

code @ word_embeddings/hierarchical_probabilistic_neural_language_model.ipynb
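
The following is a compressed sketch of the binary node predictor and the path loss described above (shared $W$, $U$, $\beta$, $c$; per-node $\alpha$ and $N$), with illustrative dimensions and tensor names; the notebook contains the full model and training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSoftmaxNode(nn.Module):
    """Sketch of sigma(alpha_node + beta^T tanh(c + W x + U N_node)) for nodes on a path."""
    def __init__(self, n_nodes, context_dim, node_dim=8, hidden_dim=16):
        super().__init__()
        self.W = nn.Linear(context_dim, hidden_dim, bias=False)   # shared context projection
        self.U = nn.Linear(node_dim, hidden_dim, bias=False)      # shared node projection
        self.c = nn.Parameter(torch.zeros(hidden_dim))            # shared hidden bias
        self.beta = nn.Parameter(torch.randn(hidden_dim) * 0.01)  # shared output projection
        self.alpha = nn.Embedding(n_nodes, 1)                     # per-node scalar bias
        self.N = nn.Embedding(n_nodes, node_dim)                  # per-node embedding

    def prob_right(self, node_ids, x):
        """Probability of going right at each node on a path, for a single context vector x."""
        h = torch.tanh(self.c + self.W(x) + self.U(self.N(node_ids)))   # (path_len, hidden)
        return torch.sigmoid(self.alpha(node_ids).squeeze(-1) + h @ self.beta)

def path_loss(model, node_ids, decisions, x):
    """Binary cross-entropy summed over the internal nodes on the path to the target word."""
    p = model.prob_right(node_ids, x)
    return F.binary_cross_entropy(p, decisions.float(), reduction="sum")

# hypothetical usage: a path of 3 internal nodes for one training example
model = HierarchicalSoftmaxNode(n_nodes=41, context_dim=24)
x = torch.randn(24)                         # concatenated context embedding (illustrative)
node_ids = torch.tensor([0, 5, 12])         # internal nodes on the path to the leaf
decisions = torch.tensor([1, 0, 1])         # right / left decisions along the path
loss = path_loss(model, node_ids, decisions, x)
```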

Python code – Vectorized implementation

As one can imagine, using for loops significantly slows down training. To build a vectorized implementation, the following was done.

  1. Path preparation
    • Assign a unique id to every internal node in the binary tree.
    • Precompute for each word:
      • sequence of node-ids on the path to its leaf,
      • binary decision targets at each node.
    • Pad all paths to a fixed length using a dummy (UNK) node id. Build a mask to ignore padded positions.
  2. Parameter lookup
    • Use torch.nn.Embedding to fetch the node-specific parameters $\alpha$ (biases) and $N$ (embeddings).
    • Shapes: $\alpha$ is (batch, path_len, 1) and $N$ is (batch, path_len, $d_{\text{node}}$).
  3. Forward pass (vectorized)
    • Context projection: $Wx$ and bias $c$.
    • Node projection: $U N_{\text{node}}$ using the node embeddings fetched for each path.
    • Broadcast: $c + Wx + U N_{\text{node}}$, with the context term broadcast across all nodes on the path.
    • Nonlinearity: $h = \tanh(c + Wx + U N_{\text{node}})$.
    • Projection: $\alpha_{\text{node}} + \beta^\top h$ with the per-node bias $\alpha_{\text{node}}$.
    • Probabilities: $p = \sigma(\alpha_{\text{node}} + \beta^\top h)$.
  4. Loss and masking
    • Binary cross-entropy is computed between $p$ and the binary decision targets $b$, with the mask $m$ applied to ignore padded nodes:

$$L = -\sum_{j} m_j \big[\, b_j \log p_j + (1 - b_j) \log(1 - p_j) \,\big]$$

Notes :

  • Compute the context projection $Wx$ once per batch and broadcast it, instead of recomputing it per path.
  • UNK node parameters are trainable but excluded from loss using the mask.

code @ word_embeddings/vectorized_hierarchical_probabilistic_neural_language_model.ipynb
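
The masking step can be sketched as follows (tensor names and shapes are illustrative; see the notebook for the full vectorized model):

```python
import torch
import torch.nn.functional as F

def masked_path_bce(p, targets, mask):
    """Vectorized hierarchical-softmax loss sketch: BCE over padded node paths with a mask.
    p       : (batch, max_path) probabilities of going right at each node on the path
    targets : (batch, max_path) binary left/right decisions (padded positions arbitrary)
    mask    : (batch, max_path) 1.0 for real nodes, 0.0 for padded (UNK) positions
    """
    per_node = F.binary_cross_entropy(p, targets, reduction="none")  # (batch, max_path)
    return (per_node * mask).sum() / mask.sum()                      # ignore padded nodes

# hypothetical usage: batch of 4 paths padded to length 6
p = torch.rand(4, 6)
targets = torch.randint(0, 2, (4, 6)).float()
mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0],
                     [1, 1, 0, 0, 0, 0],
                     [1, 1, 1, 1, 0, 0]], dtype=torch.float)
print(masked_path_bce(p, targets, mask))
```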

Noise contrastive estimation (Gutmann et al 2012, Mnih et al 2012)

As computing the probability using SoftMax scales with the vocabulary size $|V|$, the paper Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012, provides an approach called Noise Contrastive Estimation (NCE). Instead of directly estimating the data distribution, NCE estimates the probability of a sample being from the data versus from a known noise distribution. By learning the ratio between the data and noise distributions, and knowing the noise distribution, the data distribution can be inferred.

This approach was extended to neural language models in the paper A fast and simple algorithm for training neural probabilistic language models, A Mnih et al, 2012.

Model

In the Neural Probabilistic Language Model, the estimation of the probability of a word using the SoftMax computation is

$$P_\theta(w \mid h) = \frac{e^{s_\theta(w, h)}}{\sum_{w' \in V} e^{s_\theta(w', h)}}$$

where,

  • the context words are $h = (w_{t-1}, \ldots, w_{t-n+1})$
  • the term $e^{s_\theta(w, h)}$ in the numerator is estimated using a neural model with parameters $\theta$, scoring the target word $w$ given the context words $h$
  • the term in the denominator is the sum over all the words in the vocabulary, i.e. $Z(h) = \sum_{w' \in V} e^{s_\theta(w', h)}$

Let us define a set $S = S_d \cup S_n$, which is the union of two sets, where

  • $D = 1$ is the class label when the word is from the true target word distribution (the set $S_d$)
  • $D = 0$ is the class label when the word is NOT from the true target word distribution (the noise set $S_n$)
  • $T_d$ is the number of true (data) samples in the batch (or dataset)
  • $k$ is the number of noise samples generated for contrast, per true sample

The formulation is,

  • for each true (data) sample, draw $k$ noise samples from a known noise distribution $q(w)$
  • the model has to learn a binary classification: whether the sample is from the true distribution or from the noise distribution

Further, instead of computing the denominator term $Z(h)$ for normalizing to probabilities, learn it as a context-dependent normalizing parameter $Z_h$, i.e.

$$P_\theta(w \mid h) = \frac{e^{s_\theta(w, h)}}{Z_h}$$

The probability of a sample $w$ coming from the true distribution, given the context $h$, is the model distribution

$$P(w \mid D=1, h) = P_\theta(w \mid h)$$

Similarly, the probability of the word under the noise distribution is

$$P(w \mid D=0, h) = q(w)$$

Further, since each true sample is contrasted with $k$ noise samples, the prior class probabilities are

$$P(D=1) = \frac{1}{k+1}, \qquad P(D=0) = \frac{k}{k+1}$$

Since NCE reframes the problem as a binary classification task (distinguishing true data from noise), the class labels are modelled as independent Bernoulli variables. Consequently, the conditional log likelihood is the sum of the binary cross-entropy terms:

For a single true target word $w_t$ (with context $h$) and its corresponding $k$ noise samples $\tilde{w}_1, \ldots, \tilde{w}_k$,

$$J(\theta) = \log P(D=1 \mid w_t, h) + \sum_{i=1}^{k} \log P(D=0 \mid \tilde{w}_i, h)$$

To evaluate this loss, we need to express $P(D=1 \mid w, h)$ in terms of the model parameters. Using Bayes' rule, the probability that the class is true, given the context and a word, is

$$P(D=1 \mid w, h) = \frac{P(D=1)\, P(w \mid D=1, h)}{P(D=1)\, P(w \mid D=1, h) + P(D=0)\, P(w \mid D=0, h)} = \frac{P_\theta(w \mid h)}{P_\theta(w \mid h) + k\, q(w)}$$

This gives the general probability for any word $w$. When calculating the loss for a true target word, we substitute $w = w_t$ to get the positive sample probability

$$P(D=1 \mid w_t, h) = \frac{P_\theta(w_t \mid h)}{P_\theta(w_t \mid h) + k\, q(w_t)}$$

Converting to the sigmoid form which is used in logistic regression,

$$P(D=1 \mid w_t, h) = \sigma\big(\Delta s_\theta(w_t, h)\big), \quad \text{where } \Delta s_\theta(w, h) = \log P_\theta(w \mid h) - \log\big(k\, q(w)\big)$$

Similarly, for a target word from the noise distribution, i.e. the probability that the class is noise given the context and the word, is

$$P(D=0 \mid w, h) = \frac{k\, q(w)}{P_\theta(w \mid h) + k\, q(w)}$$

This gives the general probability for any word $w$. When calculating the loss for a noise target word, we substitute $w = \tilde{w}_i$ to get the noise sample probability

$$P(D=0 \mid \tilde{w}_i, h) = \frac{k\, q(\tilde{w}_i)}{P_\theta(\tilde{w}_i \mid h) + k\, q(\tilde{w}_i)}$$

Converting to the sigmoid form which is used in logistic regression,

$$P(D=0 \mid \tilde{w}_i, h) = 1 - \sigma\big(\Delta s_\theta(\tilde{w}_i, h)\big) = \sigma\big(-\Delta s_\theta(\tilde{w}_i, h)\big)$$

Plugging these terms into the log likelihood for a single example,

$$J_t(\theta) = \log \sigma\big(\Delta s_\theta(w_t, h_t)\big) + \sum_{i=1}^{k} \log \sigma\big(-\Delta s_\theta(\tilde{w}_{t,i}, h_t)\big)$$

To obtain the objective function for the entire dataset, sum the log-likelihoods over all true training examples $t = 1, \ldots, T_d$. For each training example at step $t$, we have a specific context $h_t$, a true target word $w_t$, and a fresh set of $k$ noise samples $\tilde{w}_{t,1}, \ldots, \tilde{w}_{t,k}$.

The final loss function that we minimize is the negative log-likelihood over the full dataset:

$$L_{\text{NCE}}(\theta) = -\sum_{t=1}^{T_d} \Big[ \log \sigma\big(\Delta s_\theta(w_t, h_t)\big) + \sum_{i=1}^{k} \log \sigma\big(-\Delta s_\theta(\tilde{w}_{t,i}, h_t)\big) \Big]$$

Note :
In the paper A fast and simple algorithm for training neural probabilistic language models, A Mnih et al, 2012, the authors mention that fixing the context-dependent normalizing factor $Z_h$ to 1, instead of learning it, did not affect the performance on downstream tasks.

Noise Distribution

The noise distribution $q(w)$ is typically chosen proportional to the unigram frequency of words in the corpus:

$$q(w) = \frac{\text{count}(w)}{\sum_{w'} \text{count}(w')}$$

Often a smoothed unigram distribution improves results:

$$q_\alpha(w) \propto \text{count}(w)^{\alpha}, \quad \text{e.g. } \alpha = 0.75$$

Python code

For the toy vocabulary, the code for the Neural Probabilistic Language Model, Bengio et al, with the SoftMax head replaced with Noise Contrastive Estimation (NCE), is provided.

The code @ word_embeddings/nplm_with_noise_contrastive_estimation.ipynb
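
Below is a minimal sketch of the NCE loss for a batch, assuming the model produces unnormalized scores $s_\theta(w, h)$ with $Z_h$ fixed to 1 (as noted above), and that log-probabilities under the noise distribution $q$ are available; the function name, shapes, and toy inputs are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def nce_loss(score_target, score_noise, logq_target, logq_noise, k):
    """NCE loss sketch: -[ log sigma(Delta_target) + sum_i log sigma(-Delta_noise_i) ].

    score_target : (batch,)    model scores s_theta(w_t, h); with Z_h = 1 these act as log P_theta
    score_noise  : (batch, k)  model scores for the k noise words per example
    logq_*       : log q(w) under the noise distribution, same shapes as the scores
    """
    log_k = math.log(k)
    delta_target = score_target - (log_k + logq_target)   # Delta s = log P_theta - log(k q)
    delta_noise = score_noise - (log_k + logq_noise)
    loss = -(F.logsigmoid(delta_target) + F.logsigmoid(-delta_noise).sum(dim=1))
    return loss.mean()

# illustrative usage: batch of 4 examples, k = 5 noise samples, uniform noise over 42 words
k, vocab_size = 5, 42
score_target = torch.randn(4, requires_grad=True)
score_noise = torch.randn(4, k, requires_grad=True)
logq_target = torch.full((4,), math.log(1.0 / vocab_size))
logq_noise = torch.full((4, k), math.log(1.0 / vocab_size))
nce_loss(score_target, score_noise, logq_target, logq_noise, k).backward()
```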

Word2Vec papers (Mikolov et al, 2013)

In the paper Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013 proposed architectures that reduce the computational complexity of learning word embeddings, with the intuition that simpler models enable training on much larger corpora of data.

Two architectures were proposed.

Continuous Bag of Words (CBOW) Model

When comparing with Neural Probabilistic Language Model, Bengio et al 2003, the following simplifications are proposed.

  • order of context words is ignored
    • instead of concatenating embedding of previous words, averaging the word embeddings of surrounding words is proposed
    • this approach is called “bag-of-words” as the order is not taken into consideration
  • no non linear hidden layer
    • the model uses a shared projection layer

Additionally, in this model the context includes future words too (words on both sides of the target).

Equations

The neural network output is:

$$y = Ux$$

where:

  • $U \in \mathbb{R}^{|V| \times m}$ is the output weight matrix, mapping from the hidden dimension $m$ to the vocabulary size $|V|$
  • $x \in \mathbb{R}^{m \times 1}$ is the averaged context embedding vector.

The averaged context embedding vector $x \in \mathbb{R}^{m \times 1}$ is computed as:

$$x = \frac{1}{2n} \sum_{\substack{i = -n \\ i \neq 0}}^{n} C(w_{t+i})$$

where,

  • $C \in \mathbb{R}^{|V| \times m}$ is the input embedding matrix
  • $C(w_{t+i})$ is the embedding of the $i$-th context word, and
  • $n$ is the number of words to the left or right of the target word, giving a total context size of $2n$

The probability distribution over the vocabulary is obtained using the softmax function:

$$\hat{y}_j = P(w_t = j \mid \text{context}) = \frac{e^{y_j}}{\sum_{k=1}^{|V|} e^{y_k}}$$

where:

  • $\hat{y}_j$ is the predicted probability of word $j$ being the target word.
  • $y_j$ is the score for word $j$ from the output layer.
  • The denominator sums the exponentiated scores over all vocabulary entries.
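
A minimal sketch of this CBOW forward pass (averaged context embeddings, a single linear output layer, and a softmax) is shown below; the toy vocabulary size, window size, and random inputs are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    """Minimal CBOW sketch: average the context embeddings, then a linear output layer + softmax."""
    def __init__(self, vocab_size, emb_dim=16):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)            # input embedding matrix C
        self.U = nn.Linear(emb_dim, vocab_size, bias=False)   # output weight matrix U

    def forward(self, context_idx):            # context_idx: (batch, 2n) word indices
        x = self.C(context_idx).mean(dim=1)    # average the context embeddings -> (batch, m)
        y = self.U(x)                          # scores over the vocabulary -> (batch, |V|)
        return F.log_softmax(y, dim=-1)

# hypothetical usage: window of n = 2 on each side (4 context words), toy vocabulary of 42 words
model = CBOW(vocab_size=42)
context = torch.randint(0, 42, (8, 4))
target = torch.randint(0, 42, (8,))
loss = F.nll_loss(model(context), target)
```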

Continuous Skip-gram Model

The Skip-gram model tries to predict the context words given the current target word. The main idea is that each word is trained to predict the words surrounding it within a context window of size $n$.

  • Input: one-hot encoding of the target word wt
  • Output: probability distribution over vocabulary for each context word
  • No non-linear hidden layer: uses a shared projection matrix (linear)

Given a target word $w_t$, the model tries to predict each surrounding context word $w_{t+i}$ for $-n \le i \le n$, $i \neq 0$. The training goal is to maximize the (log) probability of all context words around each target word:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-n \le i \le n \\ i \neq 0}} \log P(w_{t+i} \mid w_t)$$

Equations

The output score is

$$y = Ux$$

where,

  • $x = C(w_t)$, where $x \in \mathbb{R}^{m \times 1}$ is the embedding vector of the target word $w_t$
  • $C \in \mathbb{R}^{|V| \times m}$ is the input embedding matrix
  • $U \in \mathbb{R}^{|V| \times m}$ is the output embedding matrix

The scores are computed for each context word, and the probability of all the context words is maximized.

For either CBOW or Skip-gram, both $C$ and $U$ are trainable. After training, either one (or their average) is used as the word embedding.

The naive way of finding the probability is using SoftMax,

$$P(w_O \mid w_t) = \frac{e^{y_{w_O}}}{\sum_{j=1}^{|V|} e^{y_j}}$$

where,

  • $w_O$ is the output (context) word
  • $y_{w_O}$ is the score for the output word
  • the denominator is the normalizing constant over the vocabulary

For finding this probability efficiently, hierarchical softmax is proposed in the paper.

Negative Sampling

In the paper, Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013 introduced two concepts:

  • the speedup provided by subsampling of frequent words also improves the accuracy of the representations of less-frequent words
  • a simplified variant of Noise Contrastive Estimation called Negative Sampling

The key intuition in Negative Sampling is that the noise contrastive loss defined above includes terms that normalize the scores to approximate probabilities. However, to learn word embeddings the probabilities themselves are not needed, and the terms

$$\log\big(k\, q(w)\big) \quad \text{(and the normalizer } Z_h\text{)}$$

can be ignored, using the raw score $u_w^\top x$ directly inside the sigmoid.

With this simplification, the negative sampling loss for a target word $w_t$, an observed context word $w_O$, and $k$ negative samples $\tilde{w}_1, \ldots, \tilde{w}_k$ is,

$$L_{\text{NEG}} = -\Big[ \log \sigma\big(u_{w_O}^\top x\big) + \sum_{i=1}^{k} \log \sigma\big(-u_{\tilde{w}_i}^\top x\big) \Big]$$

where $x = C(w_t)$ and $u_w$ is the output embedding (row of $U$) for word $w$.

Python code

For the toy vocabulary, finding word vectors with

a) Continuous Bag of words (CBOW) with negative sampling

The code @ word_embeddings/cbow_negative_sampling copy.ipynb

b) Skip-gram with Negative Sampling

The code @ word_embeddings/skip_gram_negative_sampling.ipynb
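
The following is a minimal sketch of skip-gram with negative sampling, where negatives are drawn from a smoothed unigram distribution (counts raised to a commonly used exponent of 0.75); the counts, dimensions, and sampling here are illustrative and differ from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNEG(nn.Module):
    """Minimal skip-gram with negative sampling sketch: score = u_context^T C(target)."""
    def __init__(self, vocab_size, emb_dim=16):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)   # input (target word) embeddings
        self.U = nn.Embedding(vocab_size, emb_dim)   # output (context word) embeddings

    def forward(self, target, context, negatives):
        # target: (batch,), context: (batch,), negatives: (batch, k)
        x = self.C(target)                                    # (batch, m)
        pos_score = (self.U(context) * x).sum(dim=-1)         # (batch,)
        neg_score = torch.bmm(self.U(negatives), x.unsqueeze(-1)).squeeze(-1)  # (batch, k)
        # negative sampling loss: -[log sigma(pos) + sum log sigma(-neg)]
        return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=-1)).mean()

# hypothetical usage: negatives drawn from a smoothed unigram distribution (counts ** 0.75)
vocab_size, k = 42, 5
counts = torch.rand(vocab_size) * 100
noise_dist = counts.pow(0.75) / counts.pow(0.75).sum()
target = torch.randint(0, vocab_size, (8,))
context = torch.randint(0, vocab_size, (8,))
negatives = torch.multinomial(noise_dist, 8 * k, replacement=True).view(8, k)
model = SkipGramNEG(vocab_size)
loss = model(target, context, negatives)
loss.backward()
```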


GloVe Embeddings (Pennington et al 2014)

In the paper GloVe: Global Vectors for Word Representation, Pennington et al 2014, the authors propose that ratios of co-occurrence probabilities capture semantic information better than raw co-occurrence probabilities.

Let,

  • $X$ be the matrix of word-word co-occurrence counts
  • $X_{ij}$ be the number of times word $j$ occurs in the context of word $i$
  • $X_i = \sum_k X_{ik}$ be the number of times any word appears in the context of word $i$
  • $P_{ij} = P(j \mid i) = X_{ij} / X_i$ be the probability that word $j$ occurs in the context of word $i$.

The authors show that, on a 6 billion token corpus, for the target words ice and steam and probe words solid, gas and water:

  • $P(\text{solid} \mid \text{ice}) = 1.9 \times 10^{-4}$ and $P(\text{solid} \mid \text{steam}) = 2.2 \times 10^{-5}$
  • $P(\text{gas} \mid \text{ice}) = 6.6 \times 10^{-5}$ and $P(\text{gas} \mid \text{steam}) = 7.8 \times 10^{-4}$
  • $P(\text{water} \mid \text{ice}) = 3.0 \times 10^{-3}$ and $P(\text{water} \mid \text{steam}) = 2.2 \times 10^{-3}$

Taking the ratio of co-occurrence probabilities,

  • $P(\text{solid} \mid \text{ice}) / P(\text{solid} \mid \text{steam}) \approx 8.9$
  • $P(\text{gas} \mid \text{ice}) / P(\text{gas} \mid \text{steam}) \approx 8.5 \times 10^{-2}$
  • $P(\text{water} \mid \text{ice}) / P(\text{water} \mid \text{steam}) \approx 1.36$

The ratios indicate that,

  • solid is much more related to ice than to steam (ratio much greater than 1).
  • gas is far less likely to co-occur with ice than with steam (ratio much less than 1).
  • water is related to both ice and steam in similar proportions (ratio close to 1).

Model

To capture this ratio relationship in a vector space, the authors search for a function $F$ that satisfies:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

where

  • $k$ is the context word
  • $i, j$ are target words
  • $w \in \mathbb{R}^d$ are the word vectors and
  • $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors.

The authors enforce that the relationship should be linear (vector difference) and the result should be a scalar (dot product), leading to:

$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

To satisfy this, the authors propose choosing $F = \exp$, so that the dot product of the vector difference can be written as a ratio of probabilities,

$$\exp\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{\exp\big(w_i^\top \tilde{w}_k\big)}{\exp\big(w_j^\top \tilde{w}_k\big)} = \frac{P_{ik}}{P_{jk}}$$

With this choice, for a single word-context pair, $\exp\big(w_i^\top \tilde{w}_k\big)$ estimates the co-occurrence probability,

$$\exp\big(w_i^\top \tilde{w}_k\big) = P_{ik} = \frac{X_{ik}}{X_i}$$

Taking the logarithm,

$$w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$$

Note :

The model capturing the relation between two words should not change even if the words are swapped. Even though the co-occurrence counts are identical ($X_{ik} = X_{ki}$), because the total counts of the words are not equal ($X_i \neq X_k$), the conditional probability is not symmetric ($P_{ik} \neq P_{ki}$).

The above equation is not symmetric if we swap the target word and the context word, as the row-dependent term $\log X_i$ has to be handled.

To make it symmetric, the authors absorb $\log X_i$ into a learnable bias term $b_i$ and then add a corresponding bias $\tilde{b}_k$ for the context word. This ensures the model is fully symmetric, i.e.

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$

The loss function then becomes,

$$J = \sum_{i,k=1}^{|V|} \big( w_i^\top \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \big)^2$$

The key aspect of the above simplification is that training the word pairs to minimize this loss indirectly ensures that the dot product of the vector difference of target words with a context word arrives at the ratio of co-occurrence probabilities.

Weighted Least Squares

The above loss function weighs all co-occurrences equally. The authors note that rare co-occurrences are noisy and that around 75-95% of the entries of $X$ are zeros, and propose adding a weighting function $f(X_{ik})$ to the least squares loss above.

The weighting function $f(x)$ is chosen to obey the following:

  1. $f(0) = 0$ (to handle the zero co-occurrence counts)
  2. $f(x)$ should be non-decreasing so that rare co-occurrences are given less weight
  3. $f(x)$ should be relatively small for large values of $x$ so that frequent co-occurrences are not over-weighted

A choice that satisfies these properties is

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

The parameters $x_{\max}$ and $\alpha$ are chosen empirically ($x_{\max} = 100$ and $\alpha = 3/4$ in the paper).

Then the Weighted Least Squares loss function becomes,

$$J = \sum_{i,k=1}^{|V|} f(X_{ik}) \big( w_i^\top \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \big)^2$$

Python code

The code @ word_embeddings/glove_word_embedding.ipynb
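
A minimal sketch of the GloVe weighted least-squares objective over a handful of non-zero co-occurrence entries is shown below, assuming the weighting function above with $x_{\max} = 100$ and $\alpha = 0.75$; the indices and counts are illustrative and differ from the notebook.

```python
import torch
import torch.nn as nn

class GloVe(nn.Module):
    """Minimal GloVe sketch: weighted least squares over non-zero co-occurrence counts."""
    def __init__(self, vocab_size, emb_dim=16, x_max=100.0, alpha=0.75):
        super().__init__()
        self.w = nn.Embedding(vocab_size, emb_dim)        # target word vectors
        self.w_tilde = nn.Embedding(vocab_size, emb_dim)  # context word vectors
        self.b = nn.Embedding(vocab_size, 1)              # target word biases
        self.b_tilde = nn.Embedding(vocab_size, 1)        # context word biases
        self.x_max, self.alpha = x_max, alpha

    def forward(self, i, k, x_ik):
        # weighting function f(x) = (x / x_max)^alpha, capped at 1
        weight = torch.clamp(x_ik / self.x_max, max=1.0).pow(self.alpha)
        pred = (self.w(i) * self.w_tilde(k)).sum(-1) + self.b(i).squeeze(-1) + self.b_tilde(k).squeeze(-1)
        return (weight * (pred - torch.log(x_ik)) ** 2).sum()

# hypothetical usage on a few non-zero co-occurrence entries (i, k, X_ik)
model = GloVe(vocab_size=42)
i = torch.tensor([0, 1, 2])
k = torch.tensor([3, 4, 5])
x_ik = torch.tensor([10.0, 2.0, 55.0])
loss = model(i, k, x_ik)
loss.backward()
```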

Summary

This article covers

Evolution: How we moved from Bengio’s NPLM (2003) to efficient architectures like Word2Vec and GloVe.

Math: Detailed derivations of Hierarchical Softmax (using binary trees) and Noise Contrastive Estimation (differentiating data from noise).

Architectures: A deep look at CBOW, Skip-Gram, and the intuition behind Negative Sampling.

Code: Complete Python implementations for every model discussed, including vectorized implementations for efficiency.

Acknowledgment

In addition to the primary papers listed above, this post draws inspiration from the excellent overview in the post Learning word embedding, Weng, Lilian 2017. Credit also goes to the recent Large Language Models Gemini and ChatGPT which helped to bounce thoughts and refine the drafts.
