I want to implement Rotary Position Embeddings in PyTorch; however, it seems they need to be applied to the outputs of the linear layers right before scaled dot-product attention is computed (unlike sinusoidal positional encoding, which is applied to the word embeddings directly).
I'm wondering whether there is still a way to implement Rotary Position Embeddings that works with nn.MultiheadAttention and doesn't require reimplementing multi-head attention.
The reason I want to do this is that nn.MultiheadAttention will likely perform better than my own implementation, since it now uses FlashAttention and might have other optimizations I'm not aware of.
I was working on a PaLM model and using lucidrains' PyTorch implementation. It applies a rotary embedding between a LayerNorm and a Linear layer as part of a transformer block, so in this implementation it seems to sit before the Linear and attention layers.
If I'm reading the code correctly, line 209 computes the outputs of those pre-attention linear layers, so the rotary embeddings are applied after the linear layers and before attention, just like in other implementations.
The answer so far seems to be "no", but as it turns out I can just use torch.nn.functional.scaled_dot_product_attention to get the efficient SDPA implementations inside my custom multi-head attention, so I guess that makes this question mostly moot. I'm not sure whether I'm losing any performance by not using nn.Transformer, though.
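For anyone landing here later, this is roughly the sketch I ended up with (the helper names build_rope_cache and apply_rope are my own, and I'm using the interleaved-pair rotation from the paper with base 10000): rotate q and k, then hand them to the fused SDPA.

```python
import torch
import torch.nn.functional as F

def build_rope_cache(seq_len, head_dim, base=10000.0, device=None):
    # theta_i = base^(-2i / head_dim) for each rotated pair of dimensions, as in the paper.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len, device=device).float(), inv_freq)
    return angles.cos(), angles.sin()  # each of shape (seq_len, head_dim // 2)

def apply_rope(x, cos, sin):
    # x: (batch, n_heads, seq_len, head_dim); rotate consecutive (even, odd) dimension pairs.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# q, k, v come from my own input projections, shaped (batch, n_heads, seq_len, head_dim).
batch, n_heads, seq_len, head_dim = 2, 8, 16, 64
q, k, v = (torch.randn(batch, n_heads, seq_len, head_dim) for _ in range(3))

cos, sin = build_rope_cache(seq_len, head_dim)
q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)

# PyTorch dispatches to the fused / FlashAttention kernels here when they are available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

(Some implementations rotate the first and second halves of the head dimension instead of interleaved pairs; as far as I can tell either convention is fine as long as q and k use the same one.)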
RoPE-ing the q and k inputs to torch.nn.functional.scaled_dot_product_attention is certainly possible, but I think, if one reads the original RoPE paper (https://arxiv.org/pdf/2104.09864.pdf) carefully, they note in eq. (19) that the denominator should not get rotated, to prevent accidental divisions by zero. I am not sure whether torch.nn.functional.scaled_dot_product_attention already handles this somehow. So if one wants to stick to the original paper, there is currently (as far as I managed to understand) no other way than to "insert" the rotations by re-implementing SDPA with RoPE. Unfortunately the layer then loses the benefits of the optimized attention kernels, unless one implements a custom kernel. I have set up a little demo containing the described approach on my GitHub. Feedback welcome! :)
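For reference, and quoting eq. (19) from memory (so please double-check against the paper), it is the RoPE variant of linear attention and looks roughly like

$$
\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})_m
= \frac{\sum_{n=1}^{N} \big(R^d_{\Theta,m}\,\phi(\mathbf{q}_m)\big)^{\top} \big(R^d_{\Theta,n}\,\varphi(\mathbf{k}_n)\big)\,\mathbf{v}_n}
       {\sum_{n=1}^{N} \phi(\mathbf{q}_m)^{\top}\,\varphi(\mathbf{k}_n)},
$$

i.e. the rotation matrices enter only the numerator, while the denominator is left unrotated.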
Hey @JannisZeller
I came across your comment here while implementing RoPE in PyTorch.
Just wanted to discuss eq. (19) in the paper: according to my understanding, not rotating q and k for the denominator calculation applies to linear self-attention only, not to SDPA. Also, I went through some implementations by Hugging Face, and they seem to apply softmax as-is to the attention scores computed from the rotated q and k, implying that q and k in the denominator of the SDPA formula are rotated too.
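To spell out my reading: in standard softmax attention the rotated scores show up in both the numerator and the denominator of the softmax, and since exp(·) is strictly positive the denominator can never be zero, so there is nothing to protect against. Writing the attention weight as $a_{m,n}$ (my notation):

$$
a_{m,n}
= \frac{\exp\!\big((R^d_{\Theta,m}\mathbf{q}_m)^{\top}(R^d_{\Theta,n}\mathbf{k}_n)/\sqrt{d}\big)}
       {\sum_{j=1}^{N} \exp\!\big((R^d_{\Theta,m}\mathbf{q}_m)^{\top}(R^d_{\Theta,j}\mathbf{k}_j)/\sqrt{d}\big)}.
$$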
Hello @srishti-git1110. Thanks for pointing that out. I had not noticed the differentiation between linear attention and SDPA! Still, I guess / hope that I suggested a correct implementation of eq. (19) as it is written.