Separate the linear layer for different features

If I understand the paper correctly, the parameters of the heads should not be shared. But if you use Conv1d, you will end up sharing the weights across heads.

@Yimeng_Zhu do you confirm?

Thanks for the reply. But I don’t think the weights should be shared between features, which is what happens with a convolution.

Rather, I’m thinking about using torch.nn.ModuleList, but I’m wondering if there is a more elegant way to do that.

You’re right, the parameters in the convolution would be shared… My bad

Yes you are right. As I replied above, I don’t think conv1d is the best solution.

I’m trying torch.nn.ModuleList instead, but I’m not sure how the computation graph would be built this way and how training and backpropagation would be affected. Do you have any suggestions?

Thanks very much.
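
For reference, a minimal sketch of the ModuleList idea mentioned above. The class name PerHeadLinear and the choice of one scalar output per head are assumptions for illustration, not something taken from the paper. Autograd builds the graph through each sub-module as usual, so every head only receives gradients from its own output:

```python
import torch
import torch.nn as nn


class PerHeadLinear(nn.Module):
    """Hypothetical module: one independent nn.Linear per head."""

    def __init__(self, input_dim: int, num_heads: int):
        super().__init__()
        assert input_dim % num_heads == 0
        self.head_dim = input_dim // num_heads
        # ModuleList registers every Linear, so the optimizer sees all parameters.
        self.heads = nn.ModuleList(
            [nn.Linear(self.head_dim, 1) for _ in range(num_heads)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the last dimension into num_heads chunks of size head_dim.
        chunks = x.split(self.head_dim, dim=-1)
        # Each head only sees its own chunk; result has shape [..., num_heads].
        return torch.cat([head(c) for head, c in zip(self.heads, chunks)], dim=-1)


torch.manual_seed(0)
layer = PerHeadLinear(input_dim=12, num_heads=3)
x = torch.rand(5, 12)
out = layer(x)                # shape [5, 3]
out[:, 0].sum().backward()    # backpropagate through head 0 only

# Sum of absolute gradients per head: only head 0 should be non-zero.
grad_sums = [
    0.0 if h.weight.grad is None else h.weight.grad.abs().sum().item()
    for h in layer.heads
]
```

The Python loop over heads costs some speed when there are many heads, but nothing is shared and nothing extra is computed.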

Thanks for the reply. As I answered above, I’m trying to avoid parameter sharing. Therefore I don’t think conv1d is the best solution here.

Well, it’s not really elegant since you end up computing extra logits, but you could do:

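import torch
import torch.nn as nn

input_dim = 12
input = torch.rand(1, 1, input_dim)  # [batch, channels, length]
num_heads = 3
head_dim = input_dim // num_heads

# kernel_size = stride = head_dim, so each kernel sees non-overlapping chunks
conv = nn.Conv1d(1, num_heads, head_dim, head_dim)

h = conv(input)  # [1, num_heads, num_heads]: every kernel applied to every chunk

# keep only kernel i applied to chunk i
h = h.diagonal(dim1=1, dim2=2)  # [1, num_heads]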
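
A quick check of why the diagonal does the job, written as a sketch. The reference computation is my own addition: it applies kernel i to chunk i by hand and compares the result with the diagonal of the Conv1d output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, num_heads = 12, 3
head_dim = input_dim // num_heads

x = torch.rand(1, 1, input_dim)
conv = nn.Conv1d(1, num_heads, head_dim, head_dim)

# Conv1d output is [batch, num_heads, num_chunks]:
# entry (c, t) is kernel c applied to chunk t.
full = conv(x)
via_diagonal = full.diagonal(dim1=1, dim2=2)    # [batch, num_heads]

# Reference: apply kernel i to chunk i only.
chunks = x.view(1, num_heads, head_dim)          # [batch, num_heads, head_dim]
weight = conv.weight.squeeze(1)                  # [num_heads, head_dim]
reference = (chunks * weight).sum(dim=-1) + conv.bias

matches = torch.allclose(via_diagonal, reference, atol=1e-6)
```

Out of the num_heads * num_heads values in the Conv1d output, only the num_heads on the diagonal are kept. Those discarded values are the extra logits.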

Thank you very much!

I’m a little confused about the line:

conv = nn.Conv1d(1, num_heads, head_dim, head_dim) 

As far as I can tell from the PyTorch documentation, the 4th argument of nn.Conv1d is the stride, so here stride = head_dim. Did you set it that way on purpose, or is it just a typo?

By the way, if I have an audio segment with N frames of 12 dimensions each, should I modify your code this way:

input_dim = 12
frame_len = N
input = torch.rand(1, frame_len, input_dim)

The 4th argument is the stride:

>>> nn.Conv1d(1, num_heads, head_dim, head_dim)    
Conv1d(1, 3, kernel_size=(4,), stride=(4,))        

Python arguments can be either specified by position or by name, here I chose to rely on positional args :wink:
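
For comparison, the same layer written with keyword arguments. The parameter names in_channels, out_channels, kernel_size and stride are the ones nn.Conv1d uses.

```python
import torch.nn as nn

num_heads, head_dim = 3, 4

# Positional form, as in the snippet above.
positional = nn.Conv1d(1, num_heads, head_dim, head_dim)

# Keyword form: same layer configuration, but the intent is explicit.
keyword = nn.Conv1d(in_channels=1, out_channels=num_heads,
                    kernel_size=head_dim, stride=head_dim)

same_config = (positional.kernel_size == keyword.kernel_size
               and positional.stride == keyword.stride)
```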

Oh, I wasn’t aware of that until you told me.

Again, thanks for your hints! Really helpful!

Well, for the N frames, I think you will have to play a little trick with reshape:

bs = 1
input_dim = 12
N = 6  # the number of frames
num_heads = 3
head_dim = input_dim // num_heads

input = torch.rand(bs, N, input_dim)

conv = nn.Conv1d(N, N * num_heads, head_dim, head_dim) 

h = conv(input)
h = h.reshape(bs, N, num_heads, num_heads)
h = h.diagonal(dim1=2, dim2=3)

Please verify if that is what you want to do.

And keep in mind that there are a lot of extra calculations (it shouldn’t matter that much on the GPU if frame_len and num_heads are small), but there might be a cleaner solution.
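
One possible cleaner variant, as a sketch only. It is not from the paper or from the code above: it stores one weight vector per head in a plain parameter and uses torch.einsum, so no extra logits are computed and N does not have to be fixed. The class name PerHeadScore and the initialisation are assumptions.

```python
import torch
import torch.nn as nn


class PerHeadScore(nn.Module):
    """Hypothetical layer: one scalar score per head, no weights shared across heads."""

    def __init__(self, input_dim: int, num_heads: int):
        super().__init__()
        assert input_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = input_dim // num_heads
        # One weight vector and one bias per head (initialisation is arbitrary here).
        self.weight = nn.Parameter(torch.randn(num_heads, self.head_dim) * 0.1)
        self.bias = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, N, input_dim]; N may differ from call to call.
        bs, n, _ = x.shape
        chunks = x.reshape(bs, n, self.num_heads, self.head_dim)
        # Dot product of chunk k with weight k, for every frame: [batch, N, num_heads].
        return torch.einsum("bnkd,kd->bnk", chunks, self.weight) + self.bias


torch.manual_seed(0)
layer = PerHeadScore(input_dim=12, num_heads=3)
x_short = torch.rand(1, 6, 12)
x_long = torch.rand(1, 9, 12)
scores_short = layer(x_short)   # [1, 6, 3]
scores_long = layer(x_long)     # [1, 9, 3]

# Manual check for head 1, which owns features 4..7.
manual_head1 = (x_short[..., 4:8] * layer.weight[1]).sum(-1) + layer.bias[1]
head1_matches = torch.allclose(scores_short[..., 1], manual_head1, atol=1e-6)
```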

Hi, I finally got it done.

I can’t define the conv1d layer as

conv = nn.Conv1d(N, N * num_heads, head_dim, head_dim) 

since the frame lengths in training data are also variable. If you are interested, the following is what I did:

    def __init__(...):
        self._attention = nn.Conv1d(1, self._header_num, self._header_dim, 

    def forward(...):
        att_list = []
        for h in LastHiddenLayerOutput:
            # h is the hidden layer output with N frames of d dimensions
            # each, i.e. it has the shape [N, d]
            # transform it into shape [N, 1, d] to suit the conv1d input
            score = self._attention(h.unsqueeze(0).permute(1, 0, 2))

            # Thanks to @spanev for the hint about torch.diagonal
            score = score.diagonal(dim1=1, dim2=2)
            score = nn.Softmax(dim=0)(score)

            # Tricky part: split h into k (heads) parts and make
            # h_a.shape = [d/k, k, N]
            # At this step, score.shape = [N, k], so it can be matmul'ed with h_a
            h_a = h.view(-1, self._header_num, self._header_dim)
            h_a = h_a.permute(2, 1, 0)

            # matmul of h_a with score has shape [d/k, k, k];
            # only the diagonal of the last 2 dims holds the attention-weighted sum over all frames
            score = torch.matmul(h_a, score).diagonal(dim1=1, dim2=2)

            score = torch.flatten(score)
            # collect the pooled vector of this sequence
            att_list.append(score)

        att_h = torch.stack(att_list)
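
For anyone who wants to run this, here is a self-contained sketch of the same idea. Several things are assumptions on my side: the class name MultiHeadAttentionPooling, the constructor arguments, and the stride of the Conv1d, which I set equal to head_dim as in the earlier snippets. I also swapped the matmul-plus-diagonal step for a broadcasted weighted sum, which keeps the output features in head-major order (the flatten in the snippet above orders them differently).

```python
import torch
import torch.nn as nn


class MultiHeadAttentionPooling(nn.Module):  # hypothetical name
    def __init__(self, input_dim: int, num_heads: int):
        super().__init__()
        assert input_dim % num_heads == 0
        self._head_num = num_heads
        self._head_dim = input_dim // num_heads
        # Assumption: kernel_size == stride == head_dim.
        self._attention = nn.Conv1d(1, self._head_num, self._head_dim, self._head_dim)

    def forward(self, hidden_outputs):
        # hidden_outputs: list of tensors [N_i, d]; N_i may vary per sequence.
        att_list = []
        for h in hidden_outputs:
            score = self._attention(h.unsqueeze(1))     # [N, k, k]
            score = score.diagonal(dim1=1, dim2=2)      # [N, k]: kernel i on chunk i
            score = torch.softmax(score, dim=0)         # normalise over frames
            h_a = h.reshape(-1, self._head_num, self._head_dim)  # [N, k, d/k]
            # Attention-weighted sum over frames, separately for each head.
            pooled = (score.unsqueeze(-1) * h_a).sum(dim=0)      # [k, d/k]
            att_list.append(pooled.flatten())                    # [d]
        return torch.stack(att_list)                             # [num_sequences, d]


torch.manual_seed(0)
pool = MultiHeadAttentionPooling(input_dim=12, num_heads=3)
out = pool([torch.rand(6, 12), torch.rand(9, 12)])   # two sequences, 6 and 9 frames

# Sanity check: if every frame is identical, pooling must return that frame,
# because the attention weights of each head sum to 1 over the frames.
frame = torch.rand(12)
constant_out = pool([frame.repeat(5, 1)])
constant_ok = torch.allclose(constant_out[0], frame, atol=1e-5)
```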

You can shape the input as (3,4) and run Conv1d over it with a kernel that is also shaped (3,4). Each channel is computed once and gives one output, so you end up with three features. The weights won’t be shared, because the kernel itself has shape (3,4).

The processing works as in the picture below.

Thanks for the suggestions. I’m a little confused about your idea. How do you ensure that the different input channels won’t influence each other?

You need to set the shape of the kernel to (3,4). A convolution kernel shares its weights within a channel but not between channels, right? Every channel of both the kernel and the input has size 4 and is one-dimensional, so each input channel is combined with the kernel channel at the same index exactly once. In the end we get three feature maps, because there are three channels.

Did you try to code it? If so, could you show me the code?

I strongly suspect you are thinking along the same lines as what I discussed with @spanev previously, but just forgot to take the diagonal to extract the result.

The weights won’t be shared between channels indeed. But each of your kernels will be applied to all input channels.

And you are right. For one kernel, the results from all channels are summed into one value, so we only get one result. We could instead run the computation with one in_channel, one out_channel and a single input vector, and repeat that three times to get three results. But that isn’t like what your picture shows, where everything is computed at once. Instead, you can set three in_channels and three out_channels and use a group convolution, so that the three channels are not summed into one output.

The code goes like this:

import torch
import torch.nn as nn

# print(ind)
# conv=nn.Conv1d(in_channels=1,out_channels=1,kernel_size=4)
# for data in ind:
#     print(conv(data.unsqueeze(0).unsqueeze(0)))
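
A minimal sketch of the group convolution idea described above. This is my own illustration under the shapes used in this thread (3 heads of 4 features each), not the poster’s code. With groups equal to the number of channels, output channel i is computed from input channel i only, with its own kernel:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_heads, head_dim = 3, 4

x = torch.rand(1, num_heads * head_dim)       # [batch, 12]
x_heads = x.view(1, num_heads, head_dim)      # [batch, 3, 4]: one channel per head

# groups=num_heads: each output channel only sees the input channel with the same index.
conv = nn.Conv1d(num_heads, num_heads, kernel_size=head_dim, groups=num_heads)
out = conv(x_heads).squeeze(-1)               # [batch, 3]

# With groups=3 the weight has shape [3, 1, 4]: one private kernel per head.
reference = (x_heads * conv.weight.squeeze(1)).sum(-1) + conv.bias
grouped_matches = torch.allclose(out, reference, atol=1e-6)
```

Since the frames can sit on the batch dimension (input shape [N, 3, 4]), this also works when N varies between sequences.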

I don’t think this is how multi-head attention pooling works. I suggest you read the paper and the PyTorch documentation carefully.

The conv1d in your first post will result in parameter sharing and the second will cause incorrect output feature computation.

Furthermore, if you are really into this problem, I also strongly recommend you read the code in my previous reply, which I think might be what you actually have in mind.

Sorry, I haven’t figured out how to explain it to you…

For the last time: “In a convolution, the weights won’t be shared between channels.” You could set the input to shape (1,12), with a stride of 4 and a kernel size of 4. You will then get one sequence with three results, but the weights are shared in this case.

But they won’t be shared between channels, because each channel has its own weights. Normally the per-channel results are summed into one value, so we need a group convolution to keep the channels from being mixed.
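
To make the shared-weight case concrete, here is a small sketch of the (1,12) input with kernel size 4 and stride 4. The manual computation is my own addition for checking. There is a single kernel with 4 weights, and it is reused for all three chunks:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(1, 1, 12)                          # [batch, channels=1, length=12]
conv = nn.Conv1d(1, 1, kernel_size=4, stride=4)   # one kernel only

out = conv(x)                                     # [1, 1, 3]: three results

# Apply the one kernel to each chunk of 4 features by hand.
kernel = conv.weight.view(4)
manual = torch.stack(
    [(x[0, 0, i * 4:(i + 1) * 4] * kernel).sum() for i in range(3)]
) + conv.bias

shared = torch.allclose(out.view(3), manual, atol=1e-6)
num_weights = conv.weight.numel()                 # 4 weights serve all 3 chunks
```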

Why did you mention pooling? I don’t see how it connects with your question…

This question is about implementing the multi-head attention in this paper, which is also called multi-head attention pooling.

I finally got your point about separating the channels when I read this answer of yours again. I just realized it is correct, since you used groups to separate the input channels. Thanks for your tips, I’ll try that out.

Most of my confusion came from this answer of yours, because both in_channels and out_channels are set to 1 in your conv1d layer.

BTW, would you please delete the commented-out part in your corrected post? It could cause a lot of confusion. At least it did for me the first time I read it…

I’ve done that. That’s my fault. When I wrote the code I posted first, my idea was to use a ‘for’ loop to take each element, feeding in one vector and getting one value out. But I realised that is not what the picture you posted shows, so I redid it with a group convolution, as in the code I posted later.

Glad my answer works for you.
