My work log: implementing Flash Attention using the CuTe DSL.
1. The Math
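For context, Algorithm 1 below tiles the standard attention computation. I'm writing it here without the $1/\sqrt{d}$ scaling, matching Algorithm 1 in the FlashAttention-2 paper (the scaling can be folded into $Q$). Given $Q, K, V \in \mathbb{R}^{N \times d}$:

$$S = QK^\top \in \mathbb{R}^{N \times N}, \qquad P = \mathrm{softmax}(S) \in \mathbb{R}^{N \times N}, \qquad O = PV \in \mathbb{R}^{N \times d},$$

with softmax applied row-wise. Numerically, the softmax of a row $x \in \mathbb{R}^N$ is computed in "safe" form:

$$m(x) = \max_k x_k, \qquad f(x) = \big(e^{x_1 - m(x)}, \dots, e^{x_N - m(x)}\big), \qquad \ell(x) = \textstyle\sum_k f(x)_k, \qquad \mathrm{softmax}(x) = \frac{f(x)}{\ell(x)}.$$

The decomposition that makes blockwise computation possible: for a concatenation $x = [x^{(1)}\ x^{(2)}]$,

$$m(x) = \max\big(m(x^{(1)}), m(x^{(2)})\big), \qquad \ell(x) = e^{m(x^{(1)}) - m(x)}\,\ell(x^{(1)}) + e^{m(x^{(2)}) - m(x)}\,\ell(x^{(2)}),$$

so the row max and row sum can be maintained incrementally as blocks of $S$ arrive; these are exactly the running statistics $m_i^{(j)}$ and $\ell_i^{(j)}$ in Algorithm 1. The logsumexp $L = m + \log \ell$ is what the backward pass needs to recompute $P$ without storing it.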
2. The Algorithms
Algorithm 1: FlashAttention-2 Forward Pass

Require: $Q, K, V \in \mathbb{R}^{N \times d}$ in HBM, block sizes $B_c$, $B_r$.

1. Set $T_r = \lceil N / B_r \rceil$ and $T_c = \lceil N / B_c \rceil$. Divide $Q$ into $T_r$ blocks $Q_1, \dots, Q_{T_r}$ of size $B_r \times d$ each, and divide $K, V$ into $T_c$ blocks $K_1, \dots, K_{T_c}$ and $V_1, \dots, V_{T_c}$ of size $B_c \times d$ each.
2. Divide the output $O \in \mathbb{R}^{N \times d}$ into $T_r$ blocks $O_1, \dots, O_{T_r}$ of size $B_r \times d$ each, and divide the logsumexp $L$ into $T_r$ blocks $L_1, \dots, L_{T_r}$ of size $B_r$ each.
3. for $1 \le i \le T_r$ do
4.   Load $Q_i$ from HBM to on-chip SRAM.
5.   Initialize $O_i^{(0)} = (0)_{B_r \times d}$, $\ell_i^{(0)} = (0)_{B_r}$, $m_i^{(0)} = (-\infty)_{B_r}$.
6.   for $1 \le j \le T_c$ do
7.     Load $K_j, V_j$ from HBM to on-chip SRAM.
8.     Compute $S_i^{(j)} = Q_i K_j^\top \in \mathbb{R}^{B_r \times B_c}$.
9.     Compute $m_i^{(j)} = \max\big(m_i^{(j-1)}, \mathrm{rowmax}(S_i^{(j)})\big)$, $\tilde{P}_i^{(j)} = \exp\big(S_i^{(j)} - m_i^{(j)}\big)$ (pointwise), $\ell_i^{(j)} = e^{m_i^{(j-1)} - m_i^{(j)}}\,\ell_i^{(j-1)} + \mathrm{rowsum}(\tilde{P}_i^{(j)})$.
10.    Compute $O_i^{(j)} = \mathrm{diag}\big(e^{m_i^{(j-1)} - m_i^{(j)}}\big)\,O_i^{(j-1)} + \tilde{P}_i^{(j)} V_j$.
11.  end for
12.  Compute $O_i = \mathrm{diag}\big(\ell_i^{(T_c)}\big)^{-1} O_i^{(T_c)}$.
13.  Compute $L_i = m_i^{(T_c)} + \log\big(\ell_i^{(T_c)}\big)$.
14.  Write $O_i$ to HBM as the $i$-th block of $O$.
15.  Write $L_i$ to HBM as the $i$-th block of $L$.
16. end for
17. return $O, L$
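Before touching the CuTe DSL kernel itself, a host-side reference to diff against is useful. Below is a minimal NumPy sketch of Algorithm 1 (the function name `flash_attention_fwd_ref` and the default block sizes are my own choices, not from any library): it performs the same two-level blocking and online-softmax updates on the CPU, so a kernel's $O$ and $L$ can be checked against it.

```python
import numpy as np

def flash_attention_fwd_ref(Q, K, V, Br=64, Bc=64):
    """NumPy reference for Algorithm 1 (FlashAttention-2 forward pass).

    Q, K, V: (N, d) arrays. Returns (O, L), where L is the per-row logsumexp.
    Hypothetical testing helper only: no 1/sqrt(d) scaling, no masking.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    L = np.zeros(N)

    for i0 in range(0, N, Br):                       # outer loop over Q blocks (step 3)
        Qi = Q[i0:i0 + Br]                           # load Q_i (step 4)
        rows = Qi.shape[0]
        Oi = np.zeros((rows, d))                     # O_i^{(0)} = 0      (step 5)
        li = np.zeros(rows)                          # ell_i^{(0)} = 0
        mi = np.full(rows, -np.inf)                  # m_i^{(0)} = -inf

        for j0 in range(0, N, Bc):                   # inner loop over K/V blocks (step 6)
            Kj = K[j0:j0 + Bc]                       # load K_j, V_j (step 7)
            Vj = V[j0:j0 + Bc]
            Sij = Qi @ Kj.T                          # S_i^{(j)} = Q_i K_j^T (step 8)
            mi_new = np.maximum(mi, Sij.max(axis=1)) # running row max (step 9)
            Pij = np.exp(Sij - mi_new[:, None])      # P~_i^{(j)}, pointwise
            scale = np.exp(mi - mi_new)              # correction factor for the old max
            li = scale * li + Pij.sum(axis=1)        # running row sum
            Oi = scale[:, None] * Oi + Pij @ Vj      # rescale old O, add new block (step 10)
            mi = mi_new

        O[i0:i0 + Br] = Oi / li[:, None]             # O_i = diag(ell)^{-1} O_i (step 12)
        L[i0:i0 + Br] = mi + np.log(li)              # L_i = m_i + log(ell_i)   (step 13)

    return O, L
```

As a sanity check, the result should match the direct computation `S = Q @ K.T; P = np.exp(S - S.max(1, keepdims=True)); O = (P / P.sum(1, keepdims=True)) @ V` up to floating-point tolerance.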