In the seminal paper Attention Is All You Need (Vaswani et al., 2017), the authors proposed the Transformer architecture, in which all tokens in a sequence can be processed in parallel. Because the architecture processes all tokens simultaneously, positional embeddings are needed to encode the sequence order. In this post, we cover a few positional encoding techniques and techniques for extending the pre-trained context length.

Sinusoidal Positional Encoding (Vaswani et al 2017)
RoPE – Rotary Positional Encoding (Su et al., 2021)
ALiBi – Attention with Linear Biases, Ofir Press et al 2021
Extending pre-trained context window
- Position Interpolation (Chen et al 2023)
- NTK aware scaling (bloc97 2023)
- YaRN – Yet Another RoPE extensioN method (Peng et al 2023)

Sinusoidal Positional Encoding (Vaswani et al., 2017)

Prior to the Transformer architecture proposed in the paper “Attention Is All You Need” (Vaswani et al., 2017), sequence modelling tasks such as language translation were handled using recurrent neural networks (RNNs) and recurrent variants such as LSTMs. In recurrent architectures, tokens are processed sequentially, one after another.

For a sequence of $N$ tokens $t_0, t_1, \ldots, t_{N-1}$ , the hidden state evolves as,

$h_n = f(h_{n-1},x_n)$

where,

$x_n$ is the embedding of token at position $n$
$h_n$ is the hidden state after processing token $n$

Since tokens are processed sequentially, the order is naturally embedded into the computation. However, sequential processing prevents parallel computation across positions during training, because the token at position $n$ cannot be processed until all preceding tokens have been consumed.

Transformer architecture

In the Transformer architecture, all tokens in a sequence can be processed in parallel. To encode the sequence information, a positional encoding vector built from sinusoidal basis functions is defined for each element in the sequence.

In the following sections, we will go over the positional encoding scheme followed by the attention scheme. The interactions between tokens (with the injected position embeddings) in the attention layer are then used to understand the intuitions behind the definition of the positional encoding.

Positional Encoding

Each token is mapped into an embedding vector of dimension $D$ as,

$E_n = \begin{bmatrix} e_{n,0} \\ e_{n,1} \\ e_{n,2} \\ e_{n,3} \\ \vdots \\ e_{n,D-1} \end{bmatrix}^T$

Stacking the embedding (row) vectors for the sequence of length $N$ ,

$\mathbf{E} = \begin{bmatrix} E_0 \\ E_1 \\ E_2 \\ \vdots \\ E_{N-1} \end{bmatrix} \quad \in \mathbb{R}^{N \times D}$

The position embedding $P_n$ at token position $n$ is defined as :

$P_n = \begin{bmatrix} \sin(\omega_0 n) \\ \cos(\omega_0 n) \\ \sin(\omega_1 n) \\ \cos(\omega_1 n) \\ \vdots \\ \sin(\omega_{\frac{D}{2}-1} n) \\ \cos(\omega_{\frac{D}{2}-1} n) \end{bmatrix}^T, \quad \omega_i = \frac{1}{10000^{\frac{2i}{D}}}$

The positional embedding can be viewed as sine terms on the even dimensions and cosine terms on the odd dimensions, with the frequency varying across the dimension $D$ ,

$PE_{pos,2i} = \sin\left(\frac{pos}{10000^{\frac{2i}{D}}}\right) \\ PE_{pos,2i+1} = \cos\left(\frac{pos}{10000^{\frac{2i}{D}}}\right)$

Stacking the positional encoding (row) vectors for the sequence of length $N$ ,

$\mathbf{P} = \begin{bmatrix} P_0 \\ P_1 \\ P_2 \\ \vdots \\ P_{N-1} \end{bmatrix} \quad \in \mathbb{R}^{N \times D}$

To encode the sequence information, positional embeddings are added to the token embeddings.

The combined input matrix capturing the sequence of token embeddings and positional encodings is defined as,

$\mathbf{X} = \mathbf{E} + \mathbf{P} \quad \in \mathbb{R}^{N \times D}$
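The construction of $\mathbf{P}$ and $\mathbf{X}$ above can be sketched in a few lines of plain Python. This is only a minimal illustration using nested lists; the function names `sinusoidal_positional_encoding` and `add_positions` are made up for this post and are not taken from any library.

```python
import math


def sinusoidal_positional_encoding(num_positions, dim, base=10000.0):
    """Return P as a list of num_positions rows, each with dim entries."""
    if dim % 2 != 0:
        raise ValueError("dim must be even so that sin/cos pairs line up")
    table = []
    for n in range(num_positions):
        row = []
        for i in range(dim // 2):
            omega = 1.0 / (base ** (2 * i / dim))  # omega_i
            row.append(math.sin(omega * n))  # dimension 2i
            row.append(math.cos(omega * n))  # dimension 2i + 1
        table.append(row)
    return table


def add_positions(token_embeddings, base=10000.0):
    """Return X = E + P for a list of token embedding rows E."""
    num_positions = len(token_embeddings)
    dim = len(token_embeddings[0])
    pe = sinusoidal_positional_encoding(num_positions, dim, base)
    return [
        [e + p for e, p in zip(e_row, p_row)]
        for e_row, p_row in zip(token_embeddings, pe)
    ]
```

At position $n=0$ every sine entry is 0 and every cosine entry is 1, which is a quick sanity check on the layout of the pairs.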

Self-attention layer

The transformer architecture models interactions between all tokens using the self-attention mechanism. The combined input matrix $\mathbf{X}$ containing token embeddings and positional encoding is projected into three learned representations called query, key and value as,

$\mathbf{Q}=\mathbf{X}\mathbf{W}^{Q} \\ \mathbf{K}=\mathbf{X}\mathbf{W}^{K} \\ \mathbf{V}=\mathbf{X}\mathbf{W}^{V}$

where, the learned matrices $\mathbf{W}^{Q}$ , $\mathbf{W}^{K}$ and $\mathbf{W}^{V}$ project the token embeddings into different representation spaces.

Broadly,

$\mathbf{Q}$ (query) captures what information a token is searching for
$\mathbf{K}$ (key) captures what information a token contains
$\mathbf{V}$ (value) contains the information passed forward to the next layer

The interaction between every pair of tokens is computed through the query-key similarity matrix,

$\mathbf{S} = \mathbf{Q}\mathbf{K}^{T}$

where the element $S_{i,j}$ measures how strongly token $i$ attends to token $j$ ,

$S_{i,j} = Q_iK_j^T$

The interaction score is scaled by the square root of the key dimension $D_h$ and converted into attention weights using softmax (refer to the post Gradients for multi class classification with Softmax),

$\mathbf{A} = \mathrm{softmax} \left( \frac{ \mathbf{Q}\mathbf{K}^{T} } {\sqrt{D_h}} \right)$

The output of the attention layer is computed as,

$\mathbf{Y} = \mathrm{softmax} \left( \frac{ \mathbf{Q}\mathbf{K}^{T} } {\sqrt{D_h}} \right) \mathbf{V}$

Each token representation is updated by aggregating information from all other tokens in the sequence.
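To make the sequence of operations concrete, here is a small sketch of single-head self-attention on nested Python lists. It is written for readability rather than speed, the helper names (`matmul`, `softmax`, `self_attention`) are hypothetical, and it leaves out masking and batching.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (Y, A): the attention output and the attention weights."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_h = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # K^T
    scores = matmul(q, k_t)  # S = Q K^T
    weights = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
    return matmul(weights, v), weights
```

Each row of the returned weight matrix sums to 1, so every output row is a weighted average of the value vectors.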

Multi-head attention

Instead of using a single attention computation, transformer architecture employs multiple attention heads, where each head learns different relationships between tokens.

For attention head $h$ , the query, key and value projections are computed as,

$\mathbf{Q}_h = \mathbf{X}\mathbf{W}_h^Q \\ \mathbf{K}_h = \mathbf{X}\mathbf{W}_h^K \\ \mathbf{V}_h = \mathbf{X}\mathbf{W}_h^V$

Each head independently computes self-attention as,

$\mathbf{Y}_h = \mathrm{softmax} \left( \frac{ \mathbf{Q}_h \mathbf{K}_h^T } {\sqrt{D_h}} \right) \mathbf{V}_h$

The outputs of all heads are concatenated and projected as,

$\mathbf{Y} = \mathrm{Concat} ( \mathbf{Y}_1, \mathbf{Y}_2, \cdots, \mathbf{Y}_H ) \mathbf{W}^{O}$

where $H$ denotes the number of attention heads.

Using multiple attention heads enables the model to simultaneously learn different relationships such as local context, long-range dependencies, syntax and semantics.

Stacking multiple transformer layers

The first layer takes the position encoded token embeddings as input,

$\mathbf{X}^{(0)} = \mathbf{E} + \mathbf{P}$

The transformer architecture consists of multiple stacked attention layers. The output of one layer is passed as the input to the next layer, progressively refining the token representations.

If the input to layer $l$ is $\mathbf{X}^{(l)}$ , then the output of the transformer block is defined as,

$\mathbf{X}^{(l+1)} = f^{(l)} \left( \mathbf{X}^{(l)} \right)$

where $f^{(l)}(\cdot)$ denotes the transformer block consisting of

multi-head self-attention,
feed-forward network,
residual connections (refer to the paper Deep Residual Learning for Image Recognition, He et al., 2015) and
layer normalization (refer to the paper Layer Normalization, Ba et al., 2016)

Expanding the operations,

$\begin{array}{lll}\mathbf{Z}^{(l)} & = & \mathrm{LayerNorm} \left( \mathbf{X}^{(l)} + \mathrm{MHA} \left( \mathbf{X}^{(l)} \right) \right) \\ \\\mathbf{X}^{(l+1)} & = & \mathrm{LayerNorm} \left( \mathbf{Z}^{(l)} + \mathrm{FFN} \left( \mathbf{Z}^{(l)} \right) \right)\end{array}$

where,

$\mathrm{MHA}(\cdot)$ denotes the multi-head attention operation
$\mathrm{FFN}(\cdot)$ denotes the position-wise feed-forward network (see
$\mathrm{LayerNorm}(\cdot)$ denotes layer normalization

After $L$ stacked transformer layers, the final representation becomes,

$\mathbf{X}^{(L)} = f^{(L-1)} \Big( f^{(L-2)} ( \cdots f^{(0)} ( \mathbf{X}^{(0)} )) \Big)$

Parameters in Sinusoidal Positional Encoding

In the additive sinusoidal positional encoding described in the previous section, the authors make several key design choices, namely

sinusoidal basis functions
exponentially spaced frequencies
empirically chosen base frequency

In the rest of the section, let us explore the intuitions and rationale behind some of the choices.

Need for sinusoidal basis function

To understand the rationale for sine and cosine terms in positional embedding, let us consider the input to the attention block for a token at $n+k$ ,

$X_{n+k} = E_{n+k} + P_{n+k}$

In the attention layer, the interaction between two tokens at position $n$ and $n+k$ is computed as,

$\text{Score} = (X_n W^Q) \cdot (X_{n+k} W^K)^T$

Assuming that the terms $W^Q$ and $W^K$ are identity matrices, the score can be approximated as,

$\begin{array}{lll} \text{Score} & = & (X_n W^Q) \cdot (X_{n+k} W^K)^T \\ & \approx & (X_n) \cdot (X_{n+k} )^T\\ & \approx & (E_n + P_n) \cdot (E_{n+k} + P_{n+k} )^T \\ & \approx & \underbrace{E_n\cdot E_{n+k}^T}_{\text{content}} + \underbrace{E_n\cdot P_{n+k}^T + P_n\cdot E_{n+k}^T}_{\text{content position}} + \underbrace{P_n\cdot P_{n+k}^T}_{\text{position} }\end{array}$

Taking the position interaction term,

$\begin{array}{lll} P_nP_{n+k}^T &=& \begin{bmatrix} \sin(\omega_0 n) \\ \cos(\omega_0 n) \\ \sin(\omega_1 n) \\ \cos(\omega_1 n) \\ \vdots \\ \sin(\omega_{\frac{D}{2}-1} n) \\ \cos(\omega_{\frac{D}{2}-1} n) \end{bmatrix}^T\begin{bmatrix} \sin(\omega_0 (n+k)) \\ \cos(\omega_0 (n+k)) \\ \sin(\omega_1 (n+k)) \\ \cos(\omega_1 (n+k)) \\ \vdots \\ \sin(\omega_{\frac{D}{2}-1} (n+k)) \\ \cos(\omega_{\frac{D}{2}-1} (n+k)) \end{bmatrix}\\ & = & \sin(\omega_0 n)\sin(\omega_0 (n+k)) + \cos(\omega_0 n) \cos(\omega_0 (n+k)) + \\ && \sin(\omega_1 n)\sin(\omega_1 (n+k)) + \cos(\omega_1 n) \cos(\omega_1 (n+k)) + \cdots \\ && \sin(\omega_{\frac{D}{2}-1} n)\sin(\omega_{\frac{D}{2}-1} (n+k)) + \cos(\omega_{\frac{D}{2}-1} n) \cos(\omega_{\frac{D}{2}-1} (n+k)) \\\end{array}$

Applying the trigonometric identity,

$\sin(A)\sin(B) + \cos(A)\cos(B) =\cos(A-B), \\ \begin{array}{lll} \sin(\omega_0 n)\sin(\omega_0 (n+k)) + \cos(\omega_0 n) \cos(\omega_0 (n+k)) & = & \cos(\omega_0 n - \omega_0 (n+k)) \\ & = & \cos(-\omega_0 k) \\ & = & \cos(\omega_0 k) \\ \end{array}$

the position interaction term simplifies to

$\begin{array}{lll} P_nP_{n+k}^T & = & \cos(\omega_0k) + \cos(\omega_1k) + \cdots + \cos(\omega_{\frac{D}{2}-1}k) \\ & = & \sum_{i=0}^{\frac{D}{2}-1}\cos(\omega_ik)\end{array}$

The position interaction term depends only on the relative distance between the tokens $k$ .

The choice of sine and cosine terms in the positional encoding leverages the trigonometric identity $\sin(A)\sin(B) + \cos(A)\cos(B) =\cos(A-B)$ , which turns the position interaction term $P_nP_{n+k}^T$ into a sum of cosines that depends only on the relative distance $k$ between tokens and not on the absolute position $n$ .
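The result can be checked numerically. The sketch below (helper names are hypothetical) computes the dot product $P_nP_{n+k}^T$ directly and compares it with the closed form $\sum_i \cos(\omega_i k)$ .

```python
import math


def position_vector(n, dim, base=10000.0):
    """Sinusoidal positional encoding P_n as a flat list of length dim."""
    vec = []
    for i in range(dim // 2):
        omega = base ** (-2 * i / dim)
        vec.extend([math.sin(omega * n), math.cos(omega * n)])
    return vec


def position_dot(n, k, dim, base=10000.0):
    """Direct dot product between P_n and P_{n+k}."""
    p_n = position_vector(n, dim, base)
    p_nk = position_vector(n + k, dim, base)
    return sum(a * b for a, b in zip(p_n, p_nk))


def cosine_sum(k, dim, base=10000.0):
    """Closed form: sum over i of cos(omega_i * k)."""
    return sum(math.cos(base ** (-2 * i / dim) * k) for i in range(dim // 2))
```

For a fixed offset such as `k = 5`, `position_dot(0, 5, 64)`, `position_dot(7, 5, 64)` and `position_dot(123, 5, 64)` all agree with `cosine_sum(5, 64)` up to floating point error, whatever the absolute position is.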

Normalized positional encoding terms

Let us define the normalized position interaction terms as

$\begin{array}{lll} R(k) & = & \frac{2}{D}\left(P_nP_{n+k}^T\right) \\ & = & \frac{2}{D}\sum_{i=0}^{\frac{D}{2}-1}\cos(\omega_ik)\end{array}$

Plot of the normalized positional encoding terms

code @ positional_encoding/autocorrelation_positional_encoding.ipynb

We can see that $R(k)$ , which captures the relative positional similarity (autocorrelation of the positional encoding), exhibits an overall decaying envelope as the distance increases, because the multiple frequency components drift out of phase. This overall decaying trend indirectly encodes information about the relative distance between tokens.

Need for multiple frequencies

In the positional encoding, a different frequency is chosen for each pair of dimensions as

$\begin{array}{lll} PE_{n,2i} & = & \sin(\omega_i n) \\ PE_{n,2i+1} & = & \cos(\omega_i n) \\ \end{array}$

where, the frequencies are spaced exponentially as

$\omega_i = \frac{1}{10000^{\frac{2i}{D}}}$

If only a single frequency is chosen, the relative position encoding term becomes periodic. This means different relative distances can produce identical positional similarities, which causes ambiguity. Plotting the normalised positional similarity term for different numbers of frequencies, we can see that:

with few frequencies (low D) : stronger oscillations
with more frequencies (higher D) : smoother decay

Summarising, by combining multiple frequencies, the broad intuition is :

high frequencies capture local relative distances
low frequencies capture the long range distances

Together, they increase the probability of each relative distance having a unique signature, enabling the model to distinguish nearby and far away tokens.

code @ positional_encoding/multiple_frequencies_positional_encoding.ipynb

Thus, the use of multiple exponentially spaced frequencies allows the model to encode position as a combination of signals at different scales, enabling robust representation of both local and global structure.
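The effect of the number of frequencies can also be seen without a plot. The sketch below evaluates the normalized similarity $R(k)=\frac{2}{D}\sum_i\cos(\omega_i k)$ ; the function name is hypothetical and this is not the code from the notebook referenced above.

```python
import math


def normalized_similarity(k, dim, base=10000.0):
    """R(k): average of cos(omega_i * k) over the dim // 2 frequencies."""
    half = dim // 2
    total = sum(math.cos(base ** (-2 * i / dim) * k) for i in range(half))
    return total / half


# Compare a single frequency (dim=2) with 32 frequencies (dim=64).
for k in (0, 3, 44):
    print(k, normalized_similarity(k, 2), normalized_similarity(k, 64))
```

With a single frequency ( `dim=2` , $\omega_0=1$ ), the similarity at $k=44$ is almost back to 1 because 44 is very close to $14\pi$ , so this distance looks nearly identical to $k=0$ . With `dim=64` the other frequencies are out of phase at that distance and the similarity stays clearly below 1.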

The choice of the constant 10,000

The choice of 10,000 as the base in the geometric progression of frequencies was an empirical engineering decision.

The positional encoding uses frequencies

$\omega_i=\frac{1}{10000^{\frac{2i}{D}}}$

which span from high-frequency components (short wavelengths) to low-frequency components (long wavelengths).

a) At $i=0$ (shortest wavelength):

The angular frequency is $\omega_0=1$ , giving a wavelength of

$\lambda_0=\frac{2\pi}{\omega_0}=2\pi\approx 6.28$

This short wavelength changes rapidly across nearby positions, helping the model distinguish tokens that are close to each other.

b) At $i=\frac{D}{2}-1$ (longest wavelength):

The angular frequency is approximately

$\omega_{\frac{D}{2}-1}\approx\frac{1}{10000}$

giving a wavelength of

$\lambda_{\frac{D}{2}-1}\approx 2\pi\cdot 10000$

This long wavelength varies slowly across positions, helping the model encode coarse distinctions between tokens that are far apart.

Thus, the choice of 10,000 creates a spectrum of frequencies that provides both local positional resolution and long-range positional awareness.
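The two end points above are easy to reproduce. The helper below is a hypothetical name used only for this illustration; it lists the wavelength $\lambda_i = 2\pi / \omega_i$ for every frequency index.

```python
import math


def wavelengths(dim, base=10000.0):
    """Wavelength 2*pi / omega_i for i = 0 .. dim/2 - 1."""
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]


lam = wavelengths(512)
print(lam[0])   # shortest wavelength, 2*pi
print(lam[-1])  # longest wavelength, slightly below 2*pi*10000
```

The longest wavelength is slightly below $2\pi\cdot 10000$ because the last exponent is $\frac{D-2}{D}$ rather than exactly 1, which is why the text above uses the approximation sign.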

The plot of autocorrelation of positional encoding for different base frequencies of 1000, 10000 (default) and 100000 is below.

Based on the autocorrelation plot, and with the hindsight that typical sequence lengths around 2017 were about 512 tokens, a hand-waving explanation for the choice of the base is:

With a base of 1000, the relative positional similarity decays more rapidly, dropping below about 0.2 at a relative distance near 100 tokens.
With a base of 10,000, the decay is slower, and the same similarity threshold is reached only around 600 tokens.
When the base is increased further to 100,000, the decay becomes even slower, but the lower slope reduces the separation of tokens which are close by.

So, a base of 10,000 appears to be a practical compromise between distinguishing nearby tokens and retaining information across the full training context.

Code @ positional_encoding/base_frequency_positional_encoding.ipynb

RoPE – Rotary Positional Encoding (Su et al., 2021)

In the paper “RoFormer: Enhanced Transformer with Rotary Position Embedding” (Su et al., 2021), the authors proposed a multiplicative approach for positional encoding instead of the additive approach of sinusoidal positional encoding.

Let the token embedding be of dimension $D$ , defined as:

$E_n = \begin{bmatrix} e_{n,0} \\ e_{n,1} \\ e_{n,2} \\ e_{n,3} \\ \vdots \\ e_{n,D-1} \end{bmatrix}$

The token embedding at position $n$ is grouped into adjacent dimension pairs, i.e. $(e_{n,2i},e_{n,2i+1})$ . It is first projected into query and key representations using learned projection matrices $W^Q$ and $W^K$ .

$Q_n = W^Q E_n \\ K_n = W^K E_n$

The query and key vectors are then grouped into adjacent dimension pairs, i.e., $(q_{n,2i},q_{n,2i+1})$ and $(k_{n,2i},k_{n,2i+1})$ .

The rotation matrix for the $i^{th}$ pair is defined as

$R_i(n) = \begin{bmatrix} \cos(\omega_in) & -\sin(\omega_in)\\ \sin(\omega_in) & \cos(\omega_in)\\ \end{bmatrix}$

where,

$\omega_i = \frac{1}{10000^{\frac{2i}{D}}}$

The complete rotation matrix for positional encoding is formed by stacking these 2×2 rotation blocks across the diagonal, where each 2×2 block rotates a pair of embedding dimensions with a different angular frequency.

$R_n = \mathrm{diag}\left( R_0(n), R_1(n), \cdots, R_{\frac{D}{2}-1}(n) \right)$

The rotary positional encoding is then applied to the query and key vectors as,

$\tilde{Q}_n = R_n Q_n \\ \tilde{K}_n = R_n K_n$

Expanding the matrices for $\tilde{Q}_n$ ,

$\begin{array}{lll}\tilde{Q}_n & = & R_n\cdot Q_n \\ \\& = & \begin{bmatrix}\cos(\omega_0 n) & -\sin(\omega_0 n) & 0 & 0 & \cdots & 0 & 0 \\ \sin(\omega_0 n) & \cos(\omega_0 n) & 0 & 0 & \cdots & 0 & 0 \\0 & 0 & \cos(\omega_1 n) & -\sin(\omega_1 n) & \cdots & 0 & 0 \\0 & 0 & \sin(\omega_1 n) & \cos(\omega_1 n) & \cdots & 0 & 0 \\\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\0 & 0 & 0 & 0 & \cdots & \cos(\omega_{\frac{D}{2}-1} n) & -\sin(\omega_{\frac{D}{2}-1} n) \\0 & 0 & 0 & 0 & \cdots & \sin(\omega_{\frac{D}{2}-1} n) & \cos(\omega_{\frac{D}{2}-1} n)\end{bmatrix}\begin{bmatrix}q_{n,0} \\ q_{n,1} \\ q_{n,2} \\ q_{n,3} \\ \vdots \\ q_{n,D-2} \\ q_{n,D-1}\end{bmatrix}\\ \\& = &\begin{bmatrix}q_{n,0}\cos(\omega_0 n)-q_{n,1}\sin(\omega_0 n) \\q_{n,0}\sin(\omega_0 n)+q_{n,1}\cos(\omega_0 n) \\q_{n,2}\cos(\omega_1 n)-q_{n,3}\sin(\omega_1 n) \\q_{n,2}\sin(\omega_1 n)+q_{n,3}\cos(\omega_1 n) \\\vdots \\q_{n,D-2}\cos(\omega_{\frac{D}{2}-1} n) - q_{n,D-1}\sin(\omega_{\frac{D}{2}-1} n) \\q_{n,D-2}\sin(\omega_{\frac{D}{2}-1} n) + q_{n,D-1}\cos(\omega_{\frac{D}{2}-1} n)\end{bmatrix}\end{array}$

Similarly, the rotated query and key vectors for the token at position $n+k$ are defined as

$\tilde{Q}_{n+k} = R_{n+k} Q_{n+k} \\ \tilde{K}_{n+k} = R_{n+k} K_{n+k}$

Note :

Unlike sinusoidal positional encoding, which is added once to the input embeddings, RoPE is applied inside the self-attention block by rotating the query and key vectors. Since each transformer layer contains a self-attention block, the rotation operation is applied at every attention layer of the transformer architecture.
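Because $R_n$ is block diagonal, the full $D \times D$ matrix never has to be built; each pair can be rotated on its own. A minimal sketch, assuming the adjacent-pair layout used in this post and a hypothetical function name `rope_rotate` :

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle omega_i * position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(dim // 2):
        angle = position * base ** (-2 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.append(x0 * c - x1 * s)  # first row of the 2x2 block
        out.append(x0 * s + x1 * c)  # second row of the 2x2 block
    return out
```

The same function would be applied to both the query and the key vectors of a token. Since every block is a pure rotation, the length of the vector is unchanged; only its direction encodes the position.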

Interaction terms

In the attention layer, the interaction between two tokens at position $n$ and $n+k$ is computed as

$\tilde{Q}_n^{T} \tilde{K}_{n+k} = (R_nQ_n)^T (R_{n+k}K_{n+k})$ .

For intuition, assume that the terms $W^Q$ and $W^K$ are identity matrices. Then the query and key terms simplify to

$Q_n \approx E_n \\ K_n \approx E_n$

With this assumption, the score is approximated as,

$\text{Score} \approx (R_nE_n)^T (R_{n+k}E_{n+k})$

Expanding the terms,

$\begin{array}{lll}\mathrm{Score}\\& = &\begin{bmatrix}e_{n,0}\cos(\omega_0 n)-e_{n,1}\sin(\omega_0 n) \\e_{n,0}\sin(\omega_0 n)+e_{n,1}\cos(\omega_0 n) \\e_{n,2}\cos(\omega_1 n)-e_{n,3}\sin(\omega_1 n) \\e_{n,2}\sin(\omega_1 n)+e_{n,3}\cos(\omega_1 n) \\\vdots \\e_{n,D-2}\cos(\omega_{\frac{D}{2}-1} n) - e_{n,D-1}\sin(\omega_{\frac{D}{2}-1} n) \\e_{n,D-2}\sin(\omega_{\frac{D}{2}-1} n) + e_{n,D-1}\cos(\omega_{\frac{D}{2}-1} n)\end{bmatrix}^{T}\begin{bmatrix}e_{n+k,0}\cos(\omega_0 (n+k)) - e_{n+k,1}\sin(\omega_0 (n+k)) \\e_{n+k,0}\sin(\omega_0 (n+k)) + e_{n+k,1}\cos(\omega_0 (n+k)) \\e_{n+k,2}\cos(\omega_1 (n+k)) - e_{n+k,3}\sin(\omega_1 (n+k)) \\e_{n+k,2}\sin(\omega_1 (n+k)) + e_{n+k,3}\cos(\omega_1 (n+k)) \\\vdots \\e_{n+k,D-2}\cos(\omega_{\frac{D}{2}-1}(n+k)) - e_{n+k,D-1}\sin(\omega_{\frac{D}{2}-1}(n+k)) \\e_{n+k,D-2}\sin(\omega_{\frac{D}{2}-1}(n+k)) + e_{n+k,D-1}\cos(\omega_{\frac{D}{2}-1}(n+k))\end{bmatrix}\end{array}$

Let us find the score considering only the first frequency pair $\omega_0$ ,

$\begin{array}{lll}Score_0 & = &\Big( e_{n,0}\cos(\omega_0 n) - e_{n,1}\sin(\omega_0 n) \Big)\Big( e_{n+k,0}\cos(\omega_0(n+k)) - e_{n+k,1}\sin(\omega_0(n+k)) \Big) \\ \\&&+\Big( e_{n,0}\sin(\omega_0 n) + e_{n,1}\cos(\omega_0 n) \Big)\Big( e_{n+k,0}\sin(\omega_0(n+k)) + e_{n+k,1}\cos(\omega_0(n+k)) \Big)\end{array}$

Expanding the multiplication terms,

$\begin{array}{lll}\mathrm{Score}_0& = &e_{n,0}e_{n+k,0} \cos(\omega_0 n) \cos(\omega_0(n+k))\\ &&- e_{n,0}e_{n+k,1} \cos(\omega_0 n) \sin(\omega_0(n+k))\\ &&- e_{n,1}e_{n+k,0} \sin(\omega_0 n) \cos(\omega_0(n+k))\\ &&+ e_{n,1}e_{n+k,1} \sin(\omega_0 n) \sin(\omega_0(n+k))\\ &&+ e_{n,0}e_{n+k,0} \sin(\omega_0 n) \sin(\omega_0(n+k))\\ &&+ e_{n,1}e_{n+k,0} \cos(\omega_0 n) \sin(\omega_0(n+k))\\ &&+ e_{n,0}e_{n+k,1} \sin(\omega_0 n) \cos(\omega_0(n+k))\\ &&+ e_{n,1}e_{n+k,1} \cos(\omega_0 n) \cos(\omega_0(n+k))\end{array}$

Grouping common terms,

$\begin{array}{lll}\mathrm{Score}_0 & = &e_{n,0}e_{n+k,0} \Big[ \cos(\omega_0 n) \cos(\omega_0(n+k)) + \sin(\omega_0 n) \sin(\omega_0(n+k)) \Big] \\ \\&&+e_{n,1}e_{n+k,1} \Big[ \sin(\omega_0 n) \sin(\omega_0(n+k)) + \cos(\omega_0 n) \cos(\omega_0(n+k)) \Big] \\ \\&&+\Big( e_{n,0}e_{n+k,1} - e_{n,1}e_{n+k,0} \Big)\Big[ \sin(\omega_0 n) \cos(\omega_0(n+k)) - \cos(\omega_0 n) \sin(\omega_0(n+k)) \Big]\end{array}$

Applying the trigonometric identities,

$\cos(A)\cos(B)+\sin(A)\sin(B)=\cos(A-B) \\ \sin(A)\cos(B)-\cos(A)\sin(B)=\sin(A-B)$

the interaction term corresponding to the frequency $\omega_0$ becomes

$\begin{array}{lll}\mathrm{Score}_0 & = &e_{n,0}e_{n+k,0} \cos\Big( \omega_0 n - \omega_0(n+k) \Big) \\ \\&&+e_{n,1}e_{n+k,1} \cos\Big( \omega_0 n - \omega_0(n+k) \Big) \\ \\&&+\Big( e_{n,0}e_{n+k,1} - e_{n,1}e_{n+k,0} \Big)\sin\Big( \omega_0 n - \omega_0(n+k) \Big) \\&=& e_{n,0}e_{n+k,0} \cos(-\omega_0 k) \\ \\&&+e_{n,1}e_{n+k,1} \cos(-\omega_0 k) \\ \\&&+\Big( e_{n,0}e_{n+k,1} - e_{n,1}e_{n+k,0} \Big)\sin(-\omega_0 k)\end{array}$

Since $\cos(-x) = \cos(x)$ and $\sin(-x) = -\sin(x)$ , the term simplifies to

$\begin{array}{lll}\mathrm{Score}_0 & = &\Big(e_{n,0}e_{n+k,0} + e_{n,1}e_{n+k,1} \Big) \cos(\omega_0 k) \\ \\&&+\Big( e_{n,1}e_{n+k,0} - e_{n,0}e_{n+k,1} \Big)\sin(\omega_0 k)\end{array}$

Notice the key result that the sine and cosine terms depend only on $k$ , and hence the interaction term depends only on the relative distance.

Extending to all frequencies, for the frequency pair $\omega_i$ , the interaction term becomes

$\begin{array}{lll}\mathrm{Score}_i & = &\Big( e_{n,2i}e_{n+k,2i} + e_{n,2i+1}e_{n+k,2i+1} \Big) \cos(\omega_i k) \\ \\&&+\Big( e_{n,2i+1}e_{n+k,2i} - e_{n,2i}e_{n+k,2i+1} \Big) \sin(\omega_i k)\end{array}$

Summing across all frequency pairs, the overall interaction score becomes

$\begin{array}{lll}\mathrm{Score} & = & (R_nE_n)^T (R_{n+k}E_{n+k}) \\ \\& = & \sum_{i=0}^{\frac{D}{2}-1} \mathrm{Score}_i \\ \\& = & \sum_{i=0}^{\frac{D}{2}-1} \Big[ \Big( e_{n,2i}e_{n+k,2i} + e_{n,2i+1}e_{n+k,2i+1} \Big) \cos(\omega_i k) \\ \\&&\qquad +\Big( e_{n,2i+1}e_{n+k,2i} - e_{n,2i}e_{n+k,2i+1} \Big) \sin(\omega_i k)\Big]\end{array}$

Similar to the additive positional encoding defined earlier, the interaction term in multiplicative positional encoding defined in RoPE also depends only on the relative distance between tokens $k$ .

However, unlike additive sinusoidal positional encoding, multiplicative positional encoding in RoPE does not introduce token-position cross interaction terms such as $E_n\cdot P_{n+k}^T, \ P_n\cdot E_{n+k}^T$ .
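A quick numerical check of the relative-distance property, under the same simplifying assumption that $W^Q$ and $W^K$ are identity matrices. The helper names are hypothetical. The second function uses the fact that two rotations compose, $R_n^T R_{n+k} = R_k$ , so the score can also be computed by rotating only the second vector by the offset $k$ .

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Apply the block-diagonal rotation R_position to vec."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        angle = position * base ** (-2 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rope_score(e_n, e_m, n, m):
    """(R_n e_n)^T (R_m e_m): both vectors rotated by their absolute positions."""
    return dot(rotate_pairs(e_n, n), rotate_pairs(e_m, m))


def relative_score(e_n, e_m, k):
    """e_n^T (R_k e_m): only the offset k = m - n is used."""
    return dot(e_n, rotate_pairs(e_m, k))
```

For any two embeddings, `rope_score(e_a, e_b, 2, 7)` , `rope_score(e_a, e_b, 50, 55)` and `relative_score(e_a, e_b, 5)` agree up to floating point error, since all three correspond to the offset $k=5$ .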

Complex Number Interpretation

The derivation for rotary positional encoding using explicit matrix multiplication can be expressed compactly using complex numbers. The token embedding at position $n$ is grouped into adjacent dimension pairs, i.e. $(e_{n,2i},e_{n,2i+1})$ , where each pair can be represented as a complex number

$e'_{n,i} = e_{n,2i} + j\,e_{n,2i+1}$

where,

$j=\sqrt{-1}$ is the imaginary unit.

The rotary positional encoding is applied as a complex phase rotation,

$R_i(n) = e^{j \omega_i n} = \cos(\omega_in) + j\cdot \sin(\omega_in)$

Thus, the rotated embedding becomes,

$\begin{array}{lll} x'_{n,i} & = & e'_{n,i} \cdot R_i(n)\\ & = & (e_{n,2i} + j e_{n,2i+1}) \Big( \cos(\omega_i n) + j\sin(\omega_i n) \Big) \end{array}$

Expanding the multiplication terms,

$\begin{array}{lll} x'_{n,i}& = & e_{n,2i}\cos(\omega_i n) + je_{n,2i}\sin(\omega_i n)\\ && +j e_{n,2i+1}\cos(\omega_i n) + j^2 e_{n,2i+1}\sin(\omega_i n) \end{array}$

Since $j^2=-1$ ,

$\begin{array}{lll} x'_{n,i} & = & \Big( e_{n,2i}\cos(\omega_i n) - e_{n,2i+1}\sin(\omega_i n) \Big) \\ &&+ j \Big( e_{n,2i}\sin(\omega_i n) + e_{n,2i+1}\cos(\omega_i n) \Big)\end{array}$

Comparing the real and imaginary components, we observe that this is identical to the math derived earlier.

Interaction terms

In the attention layer, the interaction between two rotated token embeddings at positions $n$ and $n+k$ can be written using complex number representation.

For simplicity, assuming the query and key projection matrices are identity matrices, the rotated embedding pair at frequency $\omega_i$ is expressed as

$\begin{array}{lll} x'_{n,i} & = & e'_{n,i} e^{j\omega_i n} \\ \\x'_{n+k,i} & = & e'_{n+k,i} e^{j\omega_i (n+k)}\end{array}$

The interaction score is computed using the complex conjugate of the second term and taking the real-valued component of the output,

$\begin{array}{lll} \mathrm{Score}_i & = & \mathrm{Re} \Big(x'_{n,i} (x'_{n+k,i})^{*} \Big) \\ \\ & = & \mathrm{Re} \Big( e'_{n,i} e^{j\omega_i n} \Big( e'_{n+k,i} e^{j\omega_i(n+k)} \Big)^* \Big)\end{array}$

with,

$\Big( e^{j\omega_i(n+k)} \Big)^* = e^{-j\omega_i(n+k)}$

the interaction term becomes

$\begin{array}{lll} \mathrm{Score}_i & = & \mathrm{Re} \Big(e'_{n,i}(e'_{n+k,i})^* e^{j\omega_i n} e^{-j\omega_i(n+k)}\Big) \\ \\ & = & \mathrm{Re}\Big(e'_{n,i}(e'_{n+k,i})^*e^{-j\omega_i k}\Big)\end{array}$

The absolute token position $n$ cancels out, leaving only the relative token distance $k$ .
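Python has built-in complex numbers, so the complex form can be written almost verbatim. This is a sketch with a hypothetical function name, again assuming identity query and key projections.

```python
import cmath


def rope_score_complex(e_n, e_m, n, m, base=10000.0):
    """Sum over pairs of Re( x'_n * conj(x'_m) ), with x' = e' * exp(j * omega_i * position)."""
    dim = len(e_n)
    total = 0.0
    for i in range(dim // 2):
        omega = base ** (-2 * i / dim)
        x_n = complex(e_n[2 * i], e_n[2 * i + 1]) * cmath.exp(1j * omega * n)
        x_m = complex(e_m[2 * i], e_m[2 * i + 1]) * cmath.exp(1j * omega * m)
        total += (x_n * x_m.conjugate()).real
    return total
```

Shifting both positions by the same amount leaves the result unchanged, and for `n == m` the phase factors cancel completely, leaving the plain dot product of the two embeddings.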

Normalized Positional Encoding

Let us define the embedding interaction term as,

$e'_{n,i}(e'_{n+k,i})^* = A_ie^{j\phi_i}$

where,

$A_i$ is the embedding similarity magnitude
$\phi_i$ is the embedding dependent phase difference

To isolate the positional component independent of the token embedding, assume that

embedding interactions are normalized i.e. $A_i \approx 1$
embedding interaction phase effects are absent i.e. $\phi_i \approx 0$

With this the score simplifies to,

$\begin{array}{lll} \mathrm{Score}_i & = & \mathrm{Re}\Big(e'_{n,i}(e'_{n+k,i})^*e^{-j\omega_i k}\Big) \\ & = & \mathrm{Re}\Big(A_ie^{j\phi_i}e^{-j\omega_i k}\Big) \\ & = & \mathrm{Re}\Big(e^{-j\omega_i k}\Big) \\ & = & \cos(\omega_i k)\end{array}$

Then the normalized positional interaction over all the dimension pairs is,

$\begin{array}{lll} R(k) & = & \frac{2}{D}\sum_{i=0}^{\frac{D}{2}-1}\mathrm{Score}_i \\ & = & \frac{2}{D}\sum_{i=0}^{\frac{D}{2}-1}\cos(\omega_ik)\end{array}$

This is identical to the normalized position interaction term $R(k)$ obtained earlier for sinusoidal positional encoding.

ALiBi – Attention with Linear Biases, Ofir Press et al 2021

In the paper “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation” (Press et al., 2021), the authors note that both additive sinusoidal positional encoding and multiplicative rotary positional encoding (RoPE) degrade when evaluated on sequence lengths significantly longer than those seen during training.

They also note that RoPE extrapolates better than additive sinusoidal positional encoding, with potential reasons being :

rotary positional encoding injects positional information in every layer and not just the initial one as in additive sinusoidal positional encoding
rotary positional encoding is done only on queries ( $Q$ ) and keys ( $K$ ) and not to values ( $V$ )

With this intuition, the authors proposed Attention with Linear Biases (ALiBi), where a static, non-learned bias is added to the query-key dot product prior to the softmax computation.

Equations

Consider a sequence of length $L$ . In standard self-attention, the attention weights for a query at position $i$ over the keys are computed as

$\mathrm{softmax}\left(\frac{q_iK^T}{\sqrt{d_k}}\right)$

where,

$q_i \in \mathbb{R}^{1 \times d_k}$ is the query at position $i$ ,
$K \in \mathbb{R}^{L \times d_k}$ is the key matrix; with causal language modelling only the first $i$ keys are used for the query at position $i$ , where $1 \le i \le L$ , and
$d_k$ is the head dimension

ALiBi modifies the attention score directly by adding a position-dependent linear bias. The modified attention computation becomes,

$\mathrm{softmax}\left(\frac{q_iK^T}{\sqrt{d_k}} + m\cdot[-(i-1), \cdots, -2, -1, 0]\right)$

where,

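$m$ is a head specific scalar slope, fixed before training.

For a model with $n$ attention heads, the slopes form a geometric sequence that starts at $2^{-\frac{8}{n}}$ and uses the same value as its ratio, i.e. the slope for head $h \in \{1, \ldots, n\}$ is $m_h = 2^{-\frac{8h}{n}}$ .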

The ALiBi paper focused on the auto-regressive causal language models, where a query at position $i$ can only attend to the first $i$ keys. Under this causal constraint, the bias term between a query at position $i$ and a key at position $j$ is calculated linearly based on their directed distance:

$\mathrm{bias}(i,j) = -m\cdot(i-j), \quad 1 \le j \le i$

This creates a lower triangular bias matrix, in which the diagonal elements receive a penalty of 0 and the penalty grows increasingly negative as the key moves further back into the past tokens.

This relative position information is injected at every attention operation across layers by adding a position dependent bias.
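The slopes and the causal bias matrix are simple enough to write out directly. The sketch below is not the notebook implementation; the function names are hypothetical, positions are 0-indexed, and filling the future positions with $-\infty$ (the usual causal mask) is an assumption added here for completeness.

```python
def alibi_slopes(num_heads):
    """Head-specific slopes m_h = 2^(-8h/n) for h = 1 .. n."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len, slope):
    """Bias matrix added to the scaled scores before softmax.

    Entry (i, j) is -slope * (i - j) for j <= i.
    Future keys (j > i) are set to -inf, assuming a standard causal mask.
    """
    neg_inf = float("-inf")
    return [
        [slope * (j - i) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

For 8 heads the slopes are $\frac{1}{2}, \frac{1}{4}, \ldots, \frac{1}{256}$ , so heads with a large slope focus on very recent tokens while heads with a small slope can still look far back.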

toy implementation @ positional_encoding/alibi_causal_positional_encoding.ipynb

Extending pre-trained context window

Multiple approaches for extending the pre-trained context length have emerged in the literature. In the rest of the sections, we cover three approaches: scaling the token position index, scaling by a frequency dependent factor, and a scheme which combines the two.

Position Interpolation (Chen et al 2023)

The paper Extending Context Window of Large Language Models via Position Interpolation (Chen et al., 2023) proposes an approach to extend the context window of RoPE based LLMs, typically trained with around 2048 tokens, to around 32768 tokens (16x).

With RoPE, the rotation factor for the token at position $n$ along the $i$-th dimension pair is,

$R_i(n) = e^{j \omega_i n} = \cos(\omega_in) + j\cdot \sin(\omega_in)$

where,

$\omega_i = \frac{1}{10000^{\frac{2i}{D}}}$

With Position Interpolation, the token position $n$ is scaled down by the factor $s=L'/L$ , so that the rotation factor becomes

$R'_i(n) = e^{j \omega_i n \frac{L}{L'}} = \cos\left(\omega_in\frac{L}{L'}\right) + j\cdot \sin\left(\omega_in\frac{L}{L'}\right)$

where,

$L$ is the context length for which the model is trained
$L'$ is the desired context length
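In code, Position Interpolation amounts to dividing the position index by $s$ before computing the rotation angles. A minimal sketch, with a hypothetical function name and `scale` standing for $s = L'/L$ :

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles omega_i * (position / scale) for every dimension pair.

    scale = 1.0 gives plain RoPE; scale = L'/L gives Position Interpolation.
    """
    scaled_position = position / scale
    return [scaled_position * base ** (-2 * i / dim) for i in range(dim // 2)]


# Extending from L = 2048 to L' = 32768 means scale = 16.
extended = rope_angles(32768, 64, scale=16.0)
original = rope_angles(2048, 64)
```

Here `extended` and `original` hold the same angles: the last position of the extended window is mapped back onto the last position seen during pre-training, so the model never sees rotation angles outside its trained range. The price is that neighbouring tokens are now only $\frac{1}{s}$ of a position apart in every dimension.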

NTK aware scaling (bloc97, 2023)

In a Reddit post, the author bloc97 proposed an alternative way to extend RoPE based Large Language Models (LLMs) to longer context lengths. The key intuition comes from the Neural Tangent Kernel (NTK) perspective of neural networks, which suggests that neural networks are more sensitive to distortions in high frequency components than in low frequency ones.

So instead of uniformly compressing all frequencies as in Position Interpolation (PI), NTK aware scaling applies a frequency dependent scaling.

This is achieved by introducing a frequency dependent scaling in which the highest frequency dimension $i=0$ is preserved and lower frequencies are scaled progressively more, reaching the scaling factor $\frac{1}{s}$ at the lowest frequency $i=\frac{D}{2}-1$ .

Equations

The frequencies in RoPE are defined as,

$\omega_i = \frac{1}{10000^{\frac{2i}{D}}} =10000^{\frac{-2i}{D}}$

To scale the context window by a factor $s$ , the base $b=10000$ is modified as,

$b'=10000\cdot s^{\frac{D}{D-2}}$

Substituting,

$\begin{array} {lll} \omega_{i,NTK} & = & \left(10000\cdot s^{\frac{D}{D-2}}\right)^{\frac{-2i}{D}} \\ & = & {10000}^{\frac{-2i}{D}}\cdot s^{\frac{-2i}{D-2}} \\ & = & \omega_is^{\frac{-2i}{D-2}} \end{array}$

For the highest frequency ( $i=0$ ), the scale modifier is $s^0=1$ , preserving the high frequency local positional context.

For the lowest frequency ( $i=\frac{D}{2}-1$ ), the scale factor is $s^{-\frac{2(\frac{D}{2}-1)}{D-2}} = s^{-1}$ , scaling the lowest frequency by the desired scaling factor $\frac{1}{s}$ .
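The two end points can be confirmed numerically by changing only the base. This is a small sketch with hypothetical function names, separate from the toy notebook.

```python
def rope_frequencies(dim, base=10000.0):
    """omega_i = base^(-2i/D) for i = 0 .. D/2 - 1."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def ntk_aware_frequencies(dim, scale, base=10000.0):
    """RoPE frequencies with the base replaced by b' = b * s^(D / (D - 2))."""
    new_base = base * scale ** (dim / (dim - 2))
    return rope_frequencies(dim, new_base)


original = rope_frequencies(64)
scaled = ntk_aware_frequencies(64, scale=8.0)
ratios = [s / o for s, o in zip(scaled, original)]
```

The list `ratios` starts at 1 for $i=0$ and falls smoothly to $\frac{1}{8}$ at the last index, which is the frequency dependent scaling described above.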

toy implementation @ positional_encoding/pi_ntk_aware_scaling.ipynb

YaRN – Yet Another RoPE extensioN method (Peng et al 2023)

In the paper YaRN: Efficient Context Window Extension of Large Language Models, Peng et al 2023, authors noted the following limitations :

As Position Interpolation scales all dimensions (frequencies) equally by a factor $s$ , it alters the high frequency components of RoPE
NTK aware scaling alleviates this to a large extent by scaling high frequencies less and low frequencies more. However, the optimal base has to be found empirically, which increases the difficulty and cost of fine-tuning a model.

To address this, the authors proposed the NTK by parts approach described below.

NTK by parts

Let the wavelength at dimension $d$ be defined as,

$\lambda_d = \frac{2\pi}{\theta_d} \\ \\ \text{where, }\\ \theta_d = b^{-2d/D} \\ b = 10000$

The authors observe that dimensions whose wavelength is longer than the maximum context length seen during pre-training effectively capture absolute positional information, because the rotation never completes a full cycle within the training window. Dimensions with a short wavelength, in contrast, typically capture relative positional information.

Equations

Defining the ratio between pre-trained context length $L$ and wavelength $\lambda_d$ at dimension $d$ as,

$r(d) = \frac{L}{\lambda_d} = \frac{L}{2\pi b^{\frac{2d}{D}}}$

Based on the ratio, the following constraints are defined :

Do not interpolate the dimensions whose wavelength $\lambda_d$ is much smaller than the pre-trained context length $L$ , i.e. $\lambda_d \ll L, \quad r(d) \gg 1$
Fully interpolate the dimensions whose wavelength $\lambda_d$ is equal to or larger than the pre-trained context length $L$ , i.e. $\lambda_d \ge L, \quad r(d) \le 1$
For the dimensions in between, apply a bit of both

Capturing this in equations,

$h(\theta_d) = \left(1-\gamma(r(d))\right)\frac{\theta_d}{s} + \gamma(r(d))\theta_d$

where,

$s=L'/L$ is the scale factor to increase the context length from pre-trained length $L$ to $L'$
$\gamma(\cdot)$ is a ramp function defined as

$\gamma(r) = \begin{cases} 0, & \text{if } r \lt \alpha \\ 1, & \text{if } r \gt \beta \\ \frac{r - \alpha}{\beta - \alpha}, & \text{otherwise.} \end{cases}$

$\alpha, \quad \beta$ are thresholds to be tuned.

In the paper, the authors propose $\alpha=1, \quad \beta=32$ for the Llama family of models.
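The ramp and the blended frequency $h(\theta_d)$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's reference code: the function names, the head dimension $D=128$ and the values $L=2048$, $s=16$ are assumptions made for the example.

```python
import numpy as np


def yarn_ramp(r, alpha=1.0, beta=32.0):
    """Ramp gamma(r): 0 below alpha, 1 above beta, linear in between."""
    return np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)


def ntk_by_parts_frequencies(dim, scale, train_len, base=10000.0,
                             alpha=1.0, beta=32.0):
    """Blend interpolated (theta / s) and original (theta) frequencies per dimension."""
    d = np.arange(dim // 2)
    theta = base ** (-2.0 * d / dim)       # theta_d = b^(-2d/D)
    wavelength = 2.0 * np.pi / theta       # lambda_d = 2*pi / theta_d
    r = train_len / wavelength             # r(d) = L / lambda_d
    gamma = yarn_ramp(r, alpha, beta)
    new_theta = (1.0 - gamma) * theta / scale + gamma * theta
    return theta, new_theta, gamma


# Illustrative values (assumed): head dimension 128, L = 2048, s = 16.
theta, new_theta, gamma = ntk_by_parts_frequencies(dim=128, scale=16.0, train_len=2048)

print("untouched dimensions:         ", int((gamma == 1.0).sum()))
print("fully interpolated dimensions:", int((gamma == 0.0).sum()))
```

Using `np.clip` on the linear segment reproduces all three cases of the ramp in one expression, so no explicit branching is needed.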

Plugging in numbers, for a pre-trained context length of $L=2048$ and a scale factor of $s=16$ , i.e. extending the context length to $L'=32768$ :

Ratio $r=L/\lambda_d$	Wavelength $\lambda_d$ (for $L=2048$ )	Remark
$r\le1$ ( $\alpha$ )	$\lambda_d\ge2048$	Full interpolation (scale by $\frac{1}{s}$ )
$1<r<32$	$64<\lambda_d<2048$	Linear transition
$r\ge32$ ( $\beta$ )	$\lambda_d\le64$	No interpolation
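To see which dimensions land in each row, the thresholds can be converted into dimension indices. The following worked example assumes a head dimension of $D=128$ (an assumption for illustration; the text above does not fix $D$ ) and $b=10000$ , and uses $r_0$ as a placeholder for a threshold value. Solving $r(d)=r_0$ for $d$ gives,

$d = \frac{D}{2}\cdot\frac{\ln\left(\frac{L}{2\pi r_0}\right)}{\ln b}$

For $r_0=\beta=32$ : $d = 64\cdot\frac{\ln(10.19)}{\ln(10000)} \approx 16.1$

For $r_0=\alpha=1$ : $d = 64\cdot\frac{\ln(325.95)}{\ln(10000)} \approx 40.2$

Under these assumptions, dimensions $d=0,\ldots,16$ are left untouched, dimensions $d=17,\ldots,40$ are blended, and dimensions $d=41,\ldots,63$ are fully interpolated.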

Toy implementation: positional_encoding/yarn_scaling.ipynb

Attention scaling

In addition to the frequency scaling described above, the authors propose scaling the attention logits by a temperature $t$ , which they report lowers the perplexity of the extended model.

The scaled attention is defined as,

$\text{softmax}\left(\frac{\mathbf{q}_m^T\mathbf{k}_n}{t\sqrt{D}}\right)$

where,

$\sqrt{\frac{1}{t}} = 0.1\times\ln(s) + 1$
$s=L'/L$ is the scale factor to increase the context length from pre-trained length $L$ to $L'$

The equation for the temperature is found empirically, by fitting the value of $\sqrt{\frac{1}{t}}$ that gives the lowest perplexity against different values of $s$ .
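A minimal NumPy sketch of the attention scaling follows. It also checks that dividing the logits by $t$ is equivalent to multiplying both $\mathbf{q}$ and $\mathbf{k}$ by $\sqrt{\frac{1}{t}}$ , which is one way to apply the scaling without modifying the attention code itself. The function names, the random inputs and the sizes used here are assumptions for illustration only.

```python
import numpy as np


def yarn_attention_factor(scale):
    """Return sqrt(1/t) = 0.1 * ln(s) + 1 for a context scale factor s."""
    return 0.1 * np.log(scale) + 1.0


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
D = 64                                # illustrative head dimension (assumed)
q = rng.standard_normal((5, D))       # 5 query vectors
k = rng.standard_normal((5, D))       # 5 key vectors

factor = yarn_attention_factor(16.0)  # sqrt(1/t) for s = 16
t = 1.0 / factor**2

# Option 1: divide the logits by t, as in the equation above.
weights_direct = softmax(q @ k.T / (t * np.sqrt(D)))

# Option 2: scale q and k by sqrt(1/t) and use the unmodified attention.
weights_scaled = softmax((factor * q) @ (factor * k).T / np.sqrt(D))

print("sqrt(1/t):", factor, " t:", t)
```

Since $t<1$ for $s>1$ , the logits are multiplied by a factor larger than 1, which sharpens the attention distribution slightly as the context is extended.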

Summary

This post covers the following:

Transformer positional encodings, from additive sinusoidal vectors to multiplicative rotary matrices (RoPE) and static linear attention biases (ALiBi).
How their mathematical interactions capture relative token distance.
Context-extension methods: Position Interpolation, NTK aware scaling, and YaRN.
Code snippets for each of the schemes.
