How to convert TensorFlow multi-head attention to its PyTorch equivalent?

Yasaman_Salehi · September 10, 2023, 1:47pm

Both TensorFlow and PyTorch provide a layer for the Transformer's multi-head attention, and they take the following arguments:
tensorflow.keras.layers.MultiHeadAttention(num_heads, key_dim)
torch.nn.MultiheadAttention(embed_dim, num_heads)
where num_heads is the number of heads in the multi-head attention module, and key_dim and embed_dim are explained below:

Assume that in an NLP example we have an input of size [1, 10, 512], meaning a sentence of 10 words in which each word is a [1, 512] vector after tokenization and embedding, so embed_dim is 512.
Now, if we have multi-head attention with num_heads = 8, the input is reshaped to [1, 10, 8, 64], so each head has dimension [1, 10, 64]. If I understand it correctly, key_dim in TensorFlow is this 64 per head.
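To make the shapes above concrete, here is a minimal PyTorch sketch of the head split. It only illustrates the reshape; in the actual layers the split is applied after learned query/key/value projections, which are left out here. The variable names are illustrative.

```python
import torch

batch, seq_len, embed_dim, num_heads = 1, 10, 512, 8
head_dim = embed_dim // num_heads  # 512 / 8 = 64 features per head

# Random stand-in for a sentence of 10 embedded words.
x = torch.randn(batch, seq_len, embed_dim)

# Split the last dimension into (num_heads, head_dim): [1, 10, 8, 64].
heads = x.view(batch, seq_len, num_heads, head_dim)

# Move the head axis forward so each head is a [1, 10, 64] slice: [1, 8, 10, 64].
per_head = heads.transpose(1, 2)

print(heads.shape)
print(per_head[:, 0].shape)
```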

I have a TensorFlow transformer and I'm trying to convert it to PyTorch. The input data is a [1, 136, 4] matrix, which is used for both query and key.
In the TF code, the multi-head attention gets num_heads = 8 and key_dim = 4. This means embed_dim in PyTorch must be key_dim * num_heads = 32, but my data is [1, 136, 4].
Does it mean I have to repeat my input 8 times so that the key and query have size [1, 136, 32]?
How can I give [1, 136, 4] data to a Transformer with 8 heads in PyTorch?
I should mention that the input is a sequence of 136 bounding boxes, each given by its x, y, w and h coordinates.
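For illustration, here is a sketch of one possible alternative to repeating the input: a learned linear layer lifts the 4 features to 32 before torch.nn.MultiheadAttention, and another maps the result back to 4. The Keras layer projects its inputs to num_heads * key_dim internally and by default projects the output back to the query's last dimension, whereas torch.nn.MultiheadAttention expects the query to already have embed_dim features, with embed_dim divisible by num_heads. The class name ProjectedSelfAttention and the attributes expand and reduce are hypothetical, using the same tensor for query, key and value is an assumption, and this is not claimed to reproduce the TF layer weight for weight.

```python
import torch
import torch.nn as nn

num_heads, key_dim, in_dim = 8, 4, 4
embed_dim = num_heads * key_dim  # 8 * 4 = 32

class ProjectedSelfAttention(nn.Module):  # hypothetical wrapper, not a library class
    def __init__(self, in_dim, embed_dim, num_heads):
        super().__init__()
        # Learned projection from the 4 box coordinates to 32 features.
        self.expand = nn.Linear(in_dim, embed_dim)
        # batch_first=True so inputs are [batch, seq, features] as in Keras.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Learned projection back to the original feature size.
        self.reduce = nn.Linear(embed_dim, in_dim)

    def forward(self, x):
        h = self.expand(x)                 # [1, 136, 32]
        out, weights = self.attn(h, h, h)  # assumption: query = key = value
        return self.reduce(out), weights   # [1, 136, 4], [1, 136, 136]

x = torch.randn(1, 136, 4)  # stand-in for 136 boxes with x, y, w, h
layer = ProjectedSelfAttention(in_dim, embed_dim, num_heads)
y, w = layer(x)
print(y.shape, w.shape)
```

The returned attention weights are averaged over the heads, which is the default behaviour of torch.nn.MultiheadAttention.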
Thank you in advance.