FlashAttention-2 was released today. It is 5-9X faster than vanilla attention and about 2X faster than the original FlashAttention. Given that many of the top open-source LLMs use FlashAttention, this is an important advancement that can make existing models much more efficient during both training and inference. Here's what you need to know...
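What is FlashAttention? In its canonical form, self-attention is an O(N^2) operation in both time and memory, where N is the length of the input sequence. Many proposals approximate self-attention so that it runs in O(N) time, but they often achieve little or no wall-clock speedup in practice. FlashAttention instead computes exact attention in an IO-aware manner: it reduces the number of reads and writes between the GPU's large but slow high-bandwidth memory and its small but fast on-chip SRAM. The result is a significant wall-clock speedup, along with memory usage that grows linearly, rather than quadratically, with sequence length.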
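To make the IO-aware idea concrete, here is a minimal pure-Python sketch of block-wise attention with an online softmax. It is an illustration of the general tiling technique, not the actual FlashAttention CUDA kernel. The function names, the `block_size` parameter, and the toy inputs are all hypothetical choices made for this example.

```python
import math
import random


def naive_attention(Q, K, V):
    """Standard attention: each query scores against all N keys at once."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        # A full row of N scores is held in memory for every query.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Block-wise attention: only block_size scores exist at any time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = -math.inf      # largest score seen so far
        running_sum = 0.0            # softmax denominator so far
        acc = [0.0] * len(V[0])      # unnormalized output so far
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new running max.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        # Normalize once, after all key/value blocks have been processed.
        out.append([a / running_sum for a in acc])
    return out


# Toy inputs (hypothetical sizes): 6 tokens, head dimension 4.
random.seed(0)
N, d = 6, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=2)
max_diff = max(abs(a - b) for r, t in zip(reference, tiled) for a, b in zip(r, t))
print("max difference:", max_diff)
```

The two functions agree up to floating-point rounding, which is the point: the tiled version is exact, not an approximation. On a GPU, the benefit comes from keeping each block in fast on-chip memory instead of writing the full N x N score matrix out to slower memory.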
Lots of LLMs use it. Because FlashAttention is ~2-4X faster than a vanilla implementation, many transformer-based applications have adopted it. For example, recent open-source LLMs (such as the Falcon and MPT models) use FlashAttention, which makes them faster at inference time and more efficient during pre-training. Falcon-40B, for instance, is reported to be roughly 5X faster at inference than GPT-3.
FlashAttention-2. An update to FlashAttention, called FlashAttention-2, has just been made available. It is about twice as fast as the original FlashAttention and 5-9X faster than a standard attention implementation. The improvement affects both training and inference speed. The sources of the speedup are summarized in the quote below.
"We (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory." - from the FlashAttention-2 paper
The only caveat. Currently, the FlashAttention-2 implementation does not support all GPUs. However, the GPUs it does support see a significant benefit; e.g., a training speed of up to 225 TFLOPs/s can be achieved on an A100 using FlashAttention-2.
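To put the 225 TFLOPs/s figure in context, the short calculation below compares it against the A100's peak throughput. The peak value of 312 TFLOPs/s is an assumption taken from NVIDIA's published dense FP16/BF16 Tensor Core specification for the A100. It is not stated in the post itself, and the variable names are illustrative.

```python
# Assumed: NVIDIA's published dense FP16/BF16 Tensor Core peak for the A100.
A100_PEAK_TFLOPS = 312.0

# Training speed reported for FlashAttention-2 on an A100.
achieved_tflops = 225.0

# Fraction of the theoretical peak that is actually achieved.
utilization = achieved_tflops / A100_PEAK_TFLOPS
print(f"Utilization of peak: {utilization:.0%}")  # prints "Utilization of peak: 72%"
```

Under that assumption, the reported training speed corresponds to roughly 72% of the hardware's theoretical maximum.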
Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4 faster and is now widely used in most LLM libraries. Recently I've been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/