In the seminal paper Attention Is All You Need (Vaswani et al 2017), the authors proposed the Transformer architecture, in which all tokens in a sequence can be processed in parallel. Because the architecture processes all tokens simultaneously and self-attention has no built-in notion of order, positional embeddings are needed to encode the sequence information. In this post, we cover a few positional encoding techniques, along with techniques for extending the pre-trained context length:
- Sinusoidal Positional Encoding (Vaswani et al 2017)
- RoPE – Rotary Position Embedding (Su et al., 2021)
- ALiBi – Attention with Linear Biases (Press et al 2021)
- Extending pre-trained context window
  - Position Interpolation (Chen et al 2023)
  - NTK-aware scaling (bloc97 2023)
  - YaRN – Yet Another RoPE extensioN method (Peng et al 2023)
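
To make the motivation from the introduction concrete, here is a minimal sketch of why order information has to be added explicitly. It assumes a toy single-head self-attention with identity query/key/value projections and no positional encoding (the function name `self_attention` and the toy embeddings are illustrative, not taken from any of the papers above). Shuffling the input tokens only shuffles the outputs in the same way, so the layer by itself cannot distinguish one ordering from another.

```python
import math

def self_attention(x):
    """Toy single-head self-attention.

    Assumption for illustration: Q = K = V = x (identity projections),
    and no positional information is added anywhere.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings, no position info
perm = [2, 0, 1]                               # a reordering of the sequence
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Position i of the shuffled sequence holds original token perm[i];
# its output should match that token's output in the original order.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```

Under these assumptions the check holds: every token receives the same output vector wherever it appears in the sequence. Each technique in the list above breaks this symmetry in a different way, either by modifying the token representations or by biasing the attention scores.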