Flash Attention

Hi @ptrblck,

I just wanted to confirm the best way to ensure that only the new Flash Attention in PyTorch 2.0 is being used for scaled dot product attention:

For example:

# pytorch 2.0 flash attn: q, k, v, mask, dropout, causal, softmax_scale
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, 
    enable_math=False, 
    enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask = mask,
        dropout_p = flash_attn_dropout, 
        is_causal = causal, 
        scale = scale
    )

I greatly appreciate your help.

Thank you,

Enrico

torch.backends.cuda.enable_flash_sdp is not a context manager, so you could use torch.backends.cuda.sdp_kernel as a context manager instead, as seen here:

print(torch.backends.cuda.flash_sdp_enabled())
# True
print(torch.backends.cuda.mem_efficient_sdp_enabled())
# True
print(torch.backends.cuda.math_sdp_enabled())
# True

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    print(torch.backends.cuda.flash_sdp_enabled())
    # True
    print(torch.backends.cuda.mem_efficient_sdp_enabled())
    # False
    print(torch.backends.cuda.math_sdp_enabled())
    # False

Thank you for verifying it should be used as such:

with torch.backends.cuda.sdp_kernel(
    enable_flash=True, 
    enable_math=False, 
    enable_mem_efficient=False
):

Appreciation as always.

Best,

Enrico


Hi @EnricoShippole, did you get any performance improvement from this change forcing flash attention only?

Hi @kchoi ,

You should see improvements in both speed and memory consumption. You can check out some recent small baseline models I trained here: GitHub - conceptofmind/PaLM: An open-source implementation of Google's PaLM models

Thank you,

Enrico
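
For the memory side, a minimal sketch (shapes are made up, not a rigorous benchmark) that compares the peak memory of the forced flash backend against the math backend:

import torch
import torch.nn.functional as F

q = torch.randn(8, 16, 2048, 64, dtype=torch.float16, device='cuda')
k, v = torch.randn_like(q), torch.randn_like(q)

def peak_mib(enable_flash, enable_math, enable_mem_efficient):
    torch.cuda.reset_peak_memory_stats()
    with torch.backends.cuda.sdp_kernel(enable_flash=enable_flash,
                                        enable_math=enable_math,
                                        enable_mem_efficient=enable_mem_efficient):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# Flash never materializes the full [seq_len, seq_len] attention matrix,
# while the math backend does, so the peaks should differ noticeably.
print('flash-only peak MiB:', peak_mib(True, False, False))
print('math-only  peak MiB:', peak_mib(False, True, False))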

Thanks a lot!

I compared true, false, false (= force flash attention) vs false, true, true. As you said, I expected the former to be faster, but it's slightly slower (seconds/iteration is about 7 vs 6).

I also ran @ptrblck's code snippet and got the same results when printing the lines.

I'm using torch 2.0 on an A100-80GB. Is there anything I might be missing?

That is interesting. It should be both much faster and more memory-efficient on an A100 (80GB) due to the increased bandwidth. Have you opened an issue on the PyTorch GitHub with the benchmarks for your tests?

I have been speaking to a few different peers and they are noticing results similar to yours. I will have to test the Triton version by Tri Dao too.

That's interesting! I didn't open an issue there (yet). Just FYI (or for anyone), the training was with DeepSpeed stages 2 and 3, batch size per GPU around 2, but n_head is about 40, so the number that matters (I believe n_head * batch_size) should be large enough.

Can you try with a head dim of 128?

I wrote the following toy snippet to evaluate the flash-attention speed-up. The code outputs:

Flash attention took 0.0018491744995117188 seconds
Standard attention took 0.6876699924468994 seconds

Notice the following:

1. I am using float16 on CUDA, because flash attention supports float16 and bfloat16.
2. Flash attention aggregates multiple operations into a single fused kernel, so the more operations it replaces, the larger the savings. In my code snippet I am only doing matmul, softmax, and dropout; I believe further speed-up can be gained by adding the mask operation as well.
3. Flash attention can handle longer sequences than standard attention. For instance, my GPU can run flash attention with seq_len=4096, but throws an OOM error with standard attention.

import time
import torch
import torch.nn.functional as F


bz = 32
seq_len = 2048
dims = 64
n_heads = 8
q = torch.randn(bz, n_heads, seq_len, dims, dtype=torch.float16).cuda()
k = torch.randn(bz, n_heads, seq_len, dims, dtype=torch.float16).cuda()
v = torch.randn(bz, n_heads, seq_len, dims, dtype=torch.float16).cuda()

dropout_rate = 0.2
num_trials = 10

with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    start = time.time()
    for i in range(num_trials):
        out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_rate)
    end = time.time()
    print('Flash attention took {} seconds'.format(end - start))


start = time.time()
for i in range(num_trials):
    attn = q @ k.transpose(-2, -1)
    attn = attn.softmax(dim=-1)
    attn = F.dropout(attn, p=dropout_rate, training=True)
    x = (attn @ v).transpose(1, 2)  # .reshape(bz, seq_len, n_heads*dims)
end = time.time()
print('Standard attention took {} seconds'.format(end - start))


CUDA kernels are executed asynchronously, so you would need to synchronize the code before starting and stopping the host timers. Otherwise you would profile the dispatching, kernel launches, or implicit syncs, making your profile invalid.
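
A minimal sketch of the same measurement using torch.utils.benchmark.Timer, which handles the CUDA synchronization for you (reusing the shapes from the snippet above):

import torch
import torch.nn.functional as F
from torch.utils import benchmark

# Same shapes as the snippet above.
q = torch.randn(32, 8, 2048, 64, dtype=torch.float16, device='cuda')
k, v = torch.randn_like(q), torch.randn_like(q)

with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    # Timer inserts the required torch.cuda.synchronize() calls around the timed region.
    t = benchmark.Timer(
        stmt="F.scaled_dot_product_attention(q, k, v, dropout_p=0.2)",
        globals={"F": F, "q": q, "k": k, "v": v},
    )
    print(t.timeit(10))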


Thanks for catching this issue. I updated the code accordingly. Please let me know if you see other mistakes.

I also switched the order of standard and flash attention evaluations as a sanity check.
The current output is

Standard attention took 0.8632566928863525 seconds for 10 trials
Flash attention took 0.07728338241577148 seconds for 10 trials

The updated code snippet is

import time
import torch
import torch.nn.functional as F


bz = 32
seq_len = 2048
dims = 64
n_heads = 8
q = torch.randn(bz, n_heads, seq_len, dims, dtype=torch.float16).cuda()
k = torch.randn(bz, n_heads, seq_len, dims, dtype=torch.float16).cuda()
v = torch.randn(bz, n_heads, seq_len, dims, dtype=torch.float16).cuda()

dropout_rate = 0.2
num_trials = 10


torch.cuda.synchronize()
start = time.time()
for i in range(num_trials):
    attn = q @ k.transpose(-2, -1)
    attn = attn.softmax(dim=-1)
    attn = F.dropout(attn, p=dropout_rate, training=True)
    x = (attn @ v).transpose(1, 2)  # .reshape(bz, seq_len, n_heads*dims)
torch.cuda.synchronize()
end = time.time()
print('Standard attention took {} seconds for {} trials'.format(end - start, num_trials))

with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    torch.cuda.synchronize()
    start = time.time()
    for i in range(num_trials):
        out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_rate)
    torch.cuda.synchronize()
    end = time.time()
    print('Flash attention took {} seconds for {} trials'.format(end - start, num_trials))

with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(
        q, k, v, attn_mask=attention_mask,
        dropout_p=self.dropout if self.training else 0.0, is_causal=False
    )

I tried to run this code, but got an error. The error info is here:

<string>:1: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.h:545.)
<string>:1: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.h:338.)
<string>:1: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.h:547.)
<string>:1: UserWarning: Both fused kernels do not support non-null attn_mask. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.h:191.)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: No available kernel.  Aborting execution.

But when I remove the attn_mask parameter, it works.

with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(
        q, k, v, dropout_p=self.dropout if self.training else 0.0, is_causal=False
    )

attention_mask shape is [bz, seq_len, target_len, src_len].


If I want to add the attn_mask parameter, what should I do?


I know it's been forever, but only the math and mem_efficient kernels support the attn_mask parameter.
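
A minimal sketch of passing a non-null attn_mask by enabling those two backends instead (assuming a PyTorch release where the memory-efficient kernel accepts attn_mask; shapes are illustrative):

import torch
import torch.nn.functional as F

bz, n_heads, tgt_len, src_len, head_dim = 4, 8, 1024, 1024, 64
q = torch.randn(bz, n_heads, tgt_len, head_dim, dtype=torch.float16, device='cuda')
k = torch.randn(bz, n_heads, src_len, head_dim, dtype=torch.float16, device='cuda')
v = torch.randn_like(k)

# Additive float mask broadcastable to [bz, n_heads, tgt_len, src_len]:
# 0.0 keeps a position, -inf masks it out (here the last 100 source positions).
attn_mask = torch.zeros(bz, 1, tgt_len, src_len, dtype=torch.float16, device='cuda')
attn_mask[..., -100:] = float('-inf')

# Flash stays disabled because it rejects a non-null attn_mask.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0)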

Noob question: why can't you set everything to True, like so? Wouldn't that make it more memory efficient?
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):


flash / math / mem_efficient are different backends. By setting all of them to True, you are letting PyTorch choose the most favorable one; by setting only one to True, you are forcing that backend and letting the call fail if it is not available. Usually you want to force flash attention for the best speed and check why it may fail.
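
A minimal sketch of the two patterns (shapes are illustrative):

import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 1024, 64, dtype=torch.float16, device='cuda')
k, v = torch.randn_like(q), torch.randn_like(q)

# All backends enabled: PyTorch dispatches to whichever kernel it deems best for
# these inputs (this is also the default behaviour without the context manager).
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Flash only: fails loudly with "RuntimeError: No available kernel" (plus warnings
# explaining why) if the inputs (dtype, head dim, mask, ...) are unsupported.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)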