Scaled_dot_product_attention

In recent PyTorch versions (since which version exactly?), to get an efficient attention implementation, you can simply use torch.nn.functional.scaled_dot_product_attention, right? As I understand it, it would automatically use FlashAttention-2. The docs say it will:

> automatically select the most optimal implementation based on the inputs

I’m not sure exactly what this means, though. What exactly is the dispatch logic? In which cases would it select FlashAttention-2?
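
For example, I guess I could restrict the allowed backends myself and see whether the call still goes through. This is a minimal sketch of what I mean, assuming PyTorch 2.x on a CUDA device (the shapes are arbitrary; torch.backends.cuda.sdp_kernel is the older context manager, newer releases also have torch.nn.attention.sdpa_kernel):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), half precision on the GPU
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Allow only the flash backend; if it cannot handle this GPU / these inputs,
# the call raises a "No available kernel" RuntimeError instead of silently
# falling back to another implementation.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)
```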

Also, as far as I know, FlashAttention-2 only works on more recent Nvidia GPUs, not on older ones. (Starting with which generation? I guess a 1080 is not supported? A 2080?)

Alternatively, there is also the memory-efficient attention backend. How does it compare in speed? Does it work on a 1080?

What is the most efficient implementation for a 1080, or what would be a reasonable choice there? Should I just use torch.nn.functional.scaled_dot_product_attention, or something more custom?
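
To answer the speed question for my hardware, I suppose I can just benchmark the backends against each other on my own shapes. Something along these lines is what I have in mind (a rough sketch, again assuming PyTorch 2.x; the shapes are placeholders):

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Force one backend at a time and time it; unsupported backends raise instead.
backends = {
    "math": dict(enable_flash=False, enable_math=True, enable_mem_efficient=False),
    "mem_efficient": dict(enable_flash=False, enable_math=False, enable_mem_efficient=True),
    "flash": dict(enable_flash=True, enable_math=False, enable_mem_efficient=False),
}

for name, flags in backends.items():
    try:
        with torch.backends.cuda.sdp_kernel(**flags):
            timer = benchmark.Timer(
                stmt="F.scaled_dot_product_attention(q, k, v)",
                globals={"F": F, "q": q, "k": k, "v": v},
            )
            print(name, timer.timeit(50))
    except RuntimeError as err:
        print(name, "not available:", err)
```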

I have seen a number of other custom implementations, for example in segment_anything_fast. I’m not sure whether those are outdated (i.e. made obsolete by recent PyTorch) or still make sense. I think I have also seen a fast Triton implementation somewhere.

In some cases I also need self-attention with relative positional encoding. As far as I understand, torch.nn.functional.scaled_dot_product_attention should support that via the attn_mask argument (a float bias added to the attention scores), but I can imagine that not every backend supports it. I also need both cross-attention and self-attention. (It would be fine for me to use different implementations for each of those cases.)
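
To be concrete, this is roughly the kind of relative positional bias I mean: a learned per-head bias indexed by relative distance, passed as a float attn_mask that gets added to the attention logits. (A sketch only; the shapes and names are my own, and my understanding is that an arbitrary float mask tends to rule out the flash backend, so the memory-efficient or math implementation would be used.)

```python
import torch
import torch.nn.functional as F

batch, heads, seq, head_dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.float16)

# One learnable bias per head and relative offset in [-(seq-1), seq-1].
rel_bias = torch.nn.Parameter(
    torch.zeros(heads, 2 * seq - 1, device="cuda", dtype=torch.float16)
)

# Map positions (i, j) to the relative offset j - i, shifted to a valid index.
pos = torch.arange(seq, device="cuda")
rel_idx = pos[None, :] - pos[:, None] + (seq - 1)   # (seq, seq)
bias = rel_bias[:, rel_idx]                          # (heads, seq, seq)

# A float attn_mask is added to the attention scores before the softmax.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias.unsqueeze(0))
```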

Does torch.nn.functional.scaled_dot_product_attention also work for training, i.e. is the gradient defined?
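
In other words, I would expect a minimal example like this to just work (sketch):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

q = torch.randn(2, 4, 32, 16, device=device, requires_grad=True)
k = torch.randn(2, 4, 32, 16, device=device, requires_grad=True)
v = torch.randn(2, 4, 32, 16, device=device, requires_grad=True)

out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1)
out.sum().backward()

# If the gradient is defined, all three inputs receive gradients.
print(q.grad.shape, k.grad.shape, v.grad.shape)
```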

I saw that torch.nn.MultiheadAttention for some reason does not use the native attention function when training is enabled. Why?

The dispatching logic is here:

On pre-sm80 GPUs like the 1080 (sm61) or the 2080 (sm75), the memory-efficient attention backend can be used (depending on shapes and other constraints), but FlashAttention requires sm80 or newer.
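
A quick way to check what you have (a sketch, assuming a CUDA build of PyTorch 2.x): print the compute capability and the global backend flags. Note that the flags only say whether a backend is allowed, not whether your GPU supports it, so on a 1080 the flash flag can be True while the dispatcher still refuses to pick the flash kernel:

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: sm{major}{minor}")  # e.g. sm61 on a GTX 1080

# Global enable flags for the SDPA backends (not hardware-support checks).
print("flash allowed:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem efficient allowed:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math allowed:         ", torch.backends.cuda.math_sdp_enabled())
```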