Artificial intelligence researchers have recently been pushing the boundaries of language models, introducing systems with dramatically expanded context lengths, such as GPT-4, MosaicML’s MPT, and Anthropic’s Claude. Scaling up the context length of Transformers is hard, however, because their core attention layer requires runtime and memory quadratic in sequence length, a limitation that the newly released FlashAttention-2 aims to address.
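A quick back-of-the-envelope sketch of that quadratic cost: the attention score matrix S = QK^T has one entry per pair of positions, so its size grows with the square of the sequence length. The head count and dtype below are illustrative assumptions, not measurements from any particular model.

```python
# The attention score matrix S = Q @ K^T has shape (seq_len, seq_len) per
# head, so materializing it naively costs memory quadratic in sequence
# length. Numbers below are illustrative, not benchmarks.

def attn_matrix_bytes(seq_len, n_heads=16, bytes_per_el=2):  # 2 bytes = FP16
    return n_heads * seq_len * seq_len * bytes_per_el

for n in (2_048, 16_384, 131_072):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"seq_len={n:>7}: {gib:10.2f} GiB per layer")
```

Doubling the context quadruples this matrix, which is why FlashAttention-style kernels avoid materializing it in GPU memory at all.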
Authored by Tri Dao, FlashAttention-2 is a fully rewritten successor to the original FlashAttention, an algorithm designed to speed up attention and reduce memory footprint. FlashAttention-2 significantly enhances parallelism and work partitioning, leading to a near doubling of its speed, reaching up to 230 TFLOPs/s on A100 GPUs (FP16/BF16).
The new approach tweaks the original algorithm to reduce the number of non-matrix-multiplication (non-matmul) FLOPs, keeping as much of the work as possible on the GPU's specialized matmul units, which execute matmul FLOPs far faster than other operations. FlashAttention-2 also brings in better parallelism, allowing the algorithm to keep the GPU's multiprocessors busy even when dealing with long sequences.
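The non-matmul savings come largely from how the softmax is accumulated online. The scalar sketch below shows the general FlashAttention-style idea under our own simplified assumptions (real kernels operate on tiles of the score matrix, not single scores): keep an unnormalized output accumulator plus a running max and denominator, and divide by the denominator only once at the end rather than rescaling the output at every step.

```python
import math

# Scalar sketch of online-softmax accumulation (illustrative, not the
# actual kernel): maintain a running max m for numerical stability, a
# running denominator d, and an *unnormalized* accumulator acc, then
# perform a single final rescale by d.

def online_softmax_weighted_sum(scores, values):
    m = float("-inf")   # running max
    d = 0.0             # running softmax denominator
    acc = 0.0           # unnormalized weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)      # exp(-inf) == 0.0 on the first step
        d = d * scale + math.exp(s - m_new)
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / d      # single final division, not one per step

out = online_softmax_weighted_sum([0.5, 2.0, -1.0], [1.0, 2.0, 3.0])
```

The result matches an ordinary softmax-weighted sum, but the expensive division is deferred out of the inner loop, which is exactly the kind of non-matmul work FlashAttention-2 trims.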
Additionally, FlashAttention-2 improves work partitioning within each thread block, reducing the synchronization and communication between different warps (groups of threads that execute together), which results in fewer shared-memory reads and writes.
Among the new features of FlashAttention-2 are support for head dimensions up to 256 and support for multi-query attention (MQA) and grouped-query attention (GQA). These changes let models such as GPT-J, CodeGen, CodeGen2, and StableDiffusion 1.x use FlashAttention-2 and reap its benefits in speedup and memory savings.
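In MQA and GQA, several query heads share one key/value head, shrinking the KV cache during inference. The NumPy sketch below illustrates the idea with made-up shapes and our own helper name (it is not the FlashAttention-2 API); MQA is simply the extreme case of a single KV head.

```python
import numpy as np

# Illustrative grouped-query attention (GQA): n_q_heads query heads share
# n_kv_heads key/value heads. Shapes and the function name are our own
# assumptions for this sketch, not the FlashAttention-2 interface.

def gqa(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    group = q.shape[0] // k.shape[0]         # query heads per KV head
    k = np.repeat(k, group, axis=0)          # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    s = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # softmax over key positions
    return w @ v                             # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # 2 shared KV heads (group size 4)
v = rng.standard_normal((2, 4, 16))
out = gqa(q, k, v)
```

With 2 KV heads instead of 8, the KV cache here is a quarter of the size, at the cost of queries within a group attending through identical keys and values.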
The author reports that FlashAttention-2 is approximately twice as fast as its predecessor, translating into a 1.3x end-to-end training speedup over a model already optimized with the original FlashAttention. This means longer-context models can be trained at the same cost as shorter-context ones before, paving the way for improved understanding of long texts and high-resolution media.
Looking forward, Dao and his team plan to optimize FlashAttention-2 for a broader range of devices and data types, with immediate steps including optimizing for H100 GPUs and exploiting their new hardware features.
- FlashAttention-2 is a revamped version of FlashAttention, providing improved speed and work partitioning for the attention mechanism in AI models.
- The new version reduces non-matmul FLOPs, leverages better parallelism, and improves work partitioning to speed up the attention process.
- FlashAttention-2 also supports head dimensions up to 256 and introduces multi-query attention (MQA) and grouped-query attention (GQA).
- FlashAttention-2 is reported to be twice as fast as its predecessor, leading to a 1.3x end-to-end speedup for models already optimized with FlashAttention.
- Future work includes further optimizations for a range of devices and data types, opening up opportunities for training AI models with much longer contexts.