Using cudnnMultiHeadAttnForward for attention

cuDNN provides a primitive for multi-head attention (cudnnMultiHeadAttnForward and related functions). However, while browsing the PyTorch code I realized that this cuDNN API is not used anywhere in PyTorch; in fact, I couldn't find any issue that even discusses it. I was wondering whether there's a reason for not using it. Intuitively, I would expect such a "fused" API to perform better than launching multiple separate kernels.
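For context, this is roughly the call sequence I have in mind. It's an untested sketch pieced together from the cuDNN API reference, not anything PyTorch does today; the tensor sizes, the NULL dropout descriptors, and the inference-only setup are my own assumptions:

```cpp
// Sketch of a single inference-only forward pass through cuDNN's fused
// multi-head attention API (cuDNN 7.5+). Sizes are made up for illustration.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <vector>

#define CHECK_CUDNN(call) do { cudnnStatus_t s_ = (call); \
  if (s_ != CUDNN_STATUS_SUCCESS) { \
    std::fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(s_)); std::exit(1); } } while (0)

int main() {
  // Hypothetical problem sizes.
  const int batch = 8, beam = 1, nHeads = 8, seqLenQ = 64, seqLenKV = 64;
  const int embedDim = 512;                   // qSize = kSize = vSize = oProjSize
  const int projSize = embedDim / nHeads;     // per-head q/k/v projection size

  cudnnHandle_t handle;
  CHECK_CUDNN(cudnnCreate(&handle));

  // One descriptor covers what an unfused implementation does with separate
  // projection, batched-matmul and softmax kernels.
  cudnnAttnDescriptor_t attnDesc;
  CHECK_CUDNN(cudnnCreateAttnDescriptor(&attnDesc));
  CHECK_CUDNN(cudnnSetAttnDescriptor(
      attnDesc, CUDNN_ATTN_QUERYMAP_ALL_TO_ONE | CUDNN_ATTN_DISABLE_PROJ_BIASES,
      nHeads, 1.0 / std::sqrt((double)projSize),          // softmax scaling
      CUDNN_DATA_FLOAT, CUDNN_DATA_FLOAT, CUDNN_DEFAULT_MATH,
      nullptr, nullptr,                                   // no attn/post dropout
      embedDim, embedDim, embedDim,                       // qSize, kSize, vSize
      projSize, projSize, projSize, embedDim,             // q/k/v/o projection sizes
      seqLenQ, seqLenKV, batch, beam));

  // Ask cuDNN how large the packed weight buffer and workspace must be.
  size_t weightBytes = 0, workBytes = 0, reserveBytes = 0;
  CHECK_CUDNN(cudnnGetMultiHeadAttnBuffers(handle, attnDesc, &weightBytes,
                                           &workBytes, &reserveBytes));

  // Q/K/V/O are described as 4-D "sequence data" (beam, batch, time, vector).
  std::vector<int> seqLensQ(batch * beam, seqLenQ), seqLensKV(batch * beam, seqLenKV);
  cudnnSeqDataAxis_t axes[CUDNN_SEQDATA_DIM_COUNT] = {
      CUDNN_SEQDATA_BEAM_DIM, CUDNN_SEQDATA_BATCH_DIM,
      CUDNN_SEQDATA_TIME_DIM, CUDNN_SEQDATA_VECT_DIM};
  auto makeSeqDesc = [&](int time, const std::vector<int>& lens) {
    cudnnSeqDataDescriptor_t d;
    CHECK_CUDNN(cudnnCreateSeqDataDescriptor(&d));
    int dimA[CUDNN_SEQDATA_DIM_COUNT];
    dimA[CUDNN_SEQDATA_TIME_DIM]  = time;
    dimA[CUDNN_SEQDATA_BATCH_DIM] = batch;
    dimA[CUDNN_SEQDATA_BEAM_DIM]  = beam;
    dimA[CUDNN_SEQDATA_VECT_DIM]  = embedDim;
    CHECK_CUDNN(cudnnSetSeqDataDescriptor(d, CUDNN_DATA_FLOAT,
        CUDNN_SEQDATA_DIM_COUNT, dimA, axes, lens.size(), lens.data(), nullptr));
    return d;
  };
  cudnnSeqDataDescriptor_t qDesc = makeSeqDesc(seqLenQ, seqLensQ);
  cudnnSeqDataDescriptor_t kDesc = makeSeqDesc(seqLenKV, seqLensKV);
  cudnnSeqDataDescriptor_t vDesc = makeSeqDesc(seqLenKV, seqLensKV);
  cudnnSeqDataDescriptor_t oDesc = makeSeqDesc(seqLenQ, seqLensQ);

  // Device buffers; real code would copy the input tensors and trained weights in
  // (cudnnGetMultiHeadAttnWeights locates each projection inside the flat buffer).
  size_t qBytes = (size_t)beam * batch * seqLenQ * embedDim * sizeof(float);
  size_t kvBytes = (size_t)beam * batch * seqLenKV * embedDim * sizeof(float);
  float *dQ, *dK, *dV, *dO; void *dW, *dWork; int *dLenQ, *dLenKV;
  cudaMalloc(&dQ, qBytes);  cudaMalloc(&dK, kvBytes);
  cudaMalloc(&dV, kvBytes); cudaMalloc(&dO, qBytes);
  cudaMalloc(&dW, weightBytes); cudaMalloc(&dWork, workBytes);
  cudaMalloc(&dLenQ, seqLensQ.size() * sizeof(int));
  cudaMalloc(&dLenKV, seqLensKV.size() * sizeof(int));
  cudaMemcpy(dLenQ, seqLensQ.data(), seqLensQ.size() * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dLenKV, seqLensKV.data(), seqLensKV.size() * sizeof(int), cudaMemcpyHostToDevice);

  // Per-query attention window: here every query may attend to all keys.
  std::vector<int> loWin(seqLenQ, 0), hiWin(seqLenQ, seqLenKV);

  // One call runs the Q/K/V projections, scaled dot-product, softmax and output
  // projection. currIdx = -1 processes all query time steps; a NULL reserve
  // space means inference only (nothing is saved for a backward pass).
  CHECK_CUDNN(cudnnMultiHeadAttnForward(
      handle, attnDesc, /*currIdx=*/-1, loWin.data(), hiWin.data(), dLenQ, dLenKV,
      qDesc, dQ, /*residuals=*/nullptr, kDesc, dK, vDesc, dV, oDesc, dO,
      weightBytes, dW, workBytes, dWork, /*reserveSpaceSize=*/0, nullptr));

  std::printf("fused multi-head attention forward done\n");
  return 0;  // descriptor/memory cleanup omitted for brevity
}
```

The whole attention block collapses into the single cudnnMultiHeadAttnForward call at the end, which is what makes me suspect it could beat the unfused kernel launches.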

We are exploring different options to accelerate the MultiHeadAttention layer, and cuDNN would be one possible way forward.