RoPE is a popular Relative Positional Encoding scheme used in Large Language Models such as LLaMA.
In order to generalize our results in 2D to any $\mathbf{x}_i \in \mathbb{R}^d$ where $d$ is even, we divide the $d$-dimensional space into $d/2$ sub-spaces and combine them by the linearity of the inner product, turning $f_{\{q,k\}}$ into:

$$f_{\{q,k\}}(\mathbf{x}_m, m) = \mathbf{R}^d_{\Theta,m}\,\mathbf{W}_{\{q,k\}}\,\mathbf{x}_m$$
where

$$\mathbf{R}^d_{\Theta,m} = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}$$

is the rotary matrix with pre-defined parameters $\Theta = \{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \ldots, d/2]\}$. A graphic illustration of RoPE is shown in Figure (1).
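The block-diagonal structure is straightforward to reproduce. The following NumPy sketch builds $\mathbf{R}^d_{\Theta,m}$ and applies $f_{\{q,k\}}$ to random projections of toy vectors; the helper name `rotary_matrix` and the toy sizes are illustrative choices, not part of the method itself.

```python
import numpy as np

def rotary_matrix(m: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal rotary matrix R^d_{Theta,m} for position m (d must be even)."""
    # Pre-defined frequencies theta_i = base^{-2(i-1)/d}, i = 1, ..., d/2
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        c, s = np.cos(m * t), np.sin(m * t)
        # 2x2 rotation acting on the i-th two-dimensional sub-space
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

# f_{q,k}(x_m, m) = R^d_{Theta,m} W_{q,k} x_m on toy data
d = 8
rng = np.random.default_rng(0)
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_m, x_n = rng.normal(size=d), rng.normal(size=d)
q_m = rotary_matrix(5, d) @ (W_q @ x_m)  # query at position m = 5
k_n = rotary_matrix(9, d) @ (W_k @ x_n)  # key at position n = 9
```

In practice the full $d \times d$ matrix is never materialized; since it is block-diagonal, the same rotation is applied element-wise to the $d/2$ two-dimensional sub-spaces, which is how efficient implementations typically realize it.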
Applying our RoPE to self-attention in Equation (2), we obtain:

$$\mathbf{q}_m^\top \mathbf{k}_n = \left(\mathbf{R}^d_{\Theta,m}\mathbf{W}_q\mathbf{x}_m\right)^\top \left(\mathbf{R}^d_{\Theta,n}\mathbf{W}_k\mathbf{x}_n\right) = \mathbf{x}_m^\top \mathbf{W}_q^\top \mathbf{R}^d_{\Theta,n-m}\mathbf{W}_k\mathbf{x}_n$$

where $\mathbf{R}^d_{\Theta,n-m} = \left(\mathbf{R}^d_{\Theta,m}\right)^\top \mathbf{R}^d_{\Theta,n}$. Note that $\mathbf{R}^d_{\Theta,m}$ is an orthogonal matrix, which ensures stability during the process of encoding position information.
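Both claims can be checked numerically. The snippet below continues the sketch above (reusing `rotary_matrix` and the toy tensors with $m = 5$, $n = 9$): orthogonality gives $\mathbf{R}^\top\mathbf{R} = \mathbf{I}$, and $(\mathbf{R}^d_{\Theta,m})^\top \mathbf{R}^d_{\Theta,n} = \mathbf{R}^d_{\Theta,n-m}$ is what makes the attention score depend only on the relative offset $n - m$.

```python
# Continuing the sketch above with positions m = 5 and n = 9.
R_m, R_n = rotary_matrix(5, d), rotary_matrix(9, d)

# Orthogonality: R^T R = I, so the rotation never changes vector norms.
assert np.allclose(R_m.T @ R_m, np.eye(d))

# Relative property: R_m^T R_n = R_{n-m}.
assert np.allclose(R_m.T @ R_n, rotary_matrix(9 - 5, d))

# Hence q_m^T k_n = x_m^T W_q^T R_{n-m} W_k x_n depends only on n - m.
assert np.isclose(q_m @ k_n, x_m @ W_q.T @ rotary_matrix(9 - 5, d) @ W_k @ x_n)
```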