Separate the linear layer for different features

Hi there,

I’m trying to do the following:

Say I have an input vector with 12 dimensions and I want to output a vector with 3 dimensions. Instead of fully connecting the input and output, I would like to compute the first output feature from the first 4 input features, the second output feature from the 5th to 8th input features, and so on.

The output dimension should not be hardcoded, but variable! Therefore I can’t simply split the input vector into 3 equal pieces.

Concretely, I want to implement the self multi-head attention pooling in this paper.

Could anyone give me a hint?

Thanks very much!

I think you can implement this with a 1D convolution with kernel_size=4 and stride=4.


I think @benoriol's method is OK.
Suppose your input shape is (12,); you can reshape it to (1, 1, 12) and then use Conv1d with in_channels = 1, out_channels = 1, kernel_size = 4, stride = 4. I just tested it with this code:

import torch

input = torch.rand(1, 1, 12)  # (batch, channels, length)
conv1 = torch.nn.Conv1d(1, 1, 4, stride=4)
out = conv1(input)

The output shape is (1, 1, 3).

If I understand the paper correctly, the parameters of the heads should not be shared. But if you use Conv1d, you will end up sharing the weights across heads.
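A quick way to see the sharing, as a minimal sketch (the repeated-chunk input is just an illustration, not something from the paper): the layer holds a single 4-tap kernel, so three identical chunks give three identical outputs.

```python
import torch
from torch import nn

conv = nn.Conv1d(1, 1, kernel_size=4, stride=4)
print(conv.weight.shape)  # torch.Size([1, 1, 4]): one kernel reused for every chunk

# Build an input whose three chunks of 4 values are identical
chunk = torch.rand(4)
x = chunk.repeat(3).reshape(1, 1, 12)
out = conv(x)  # (1, 1, 3)

# Identical chunks produce identical outputs because the weights are shared
shared = bool(
    torch.allclose(out[0, 0, 0], out[0, 0, 1])
    and torch.allclose(out[0, 0, 1], out[0, 0, 2])
)
print(shared)
```

With unshared parameters you would expect 3 x 4 weights rather than 4.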

@Yimeng_Zhu can you confirm?

Thanks for the reply. But I don’t think the weights should be shared between features, which is what the convolution does.

Rather, I’m thinking about using torch.nn.ModuleList, but I’m wondering if there is a more elegant way to do that.

You’re right, the parameters in the convolution would be shared… My bad.

Yes you are right. As I replied above, I don’t think conv1d is the best solution.

I’m trying torch.nn.ModuleList instead, but I’m not sure how the computation graph would be built this way and how training and backpropagation would be affected. Do you have any suggestions?
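For reference, a minimal sketch of the ModuleList idea (the class name PerHeadLinear and the choice of nn.Linear(head_dim, 1) per head are assumptions for illustration). Modules stored in an nn.ModuleList are registered as submodules, so autograd tracks them like any other layer, and each head only receives gradient from its own output.

```python
import torch
from torch import nn

class PerHeadLinear(nn.Module):
    """Hypothetical helper: one independent Linear(head_dim, 1) per head."""

    def __init__(self, input_dim, num_heads):
        super().__init__()
        assert input_dim % num_heads == 0
        self.head_dim = input_dim // num_heads
        self.heads = nn.ModuleList(
            [nn.Linear(self.head_dim, 1) for _ in range(num_heads)]
        )

    def forward(self, x):
        # x: (batch, input_dim) -> num_heads chunks of shape (batch, head_dim)
        chunks = x.split(self.head_dim, dim=-1)
        # each head sees only its own chunk; concatenate to (batch, num_heads)
        return torch.cat([head(c) for head, c in zip(self.heads, chunks)], dim=-1)

layer = PerHeadLinear(input_dim=12, num_heads=3)
x = torch.rand(2, 12)
out = layer(x)  # (2, 3)

# Backpropagate from the first output feature only
out[:, 0].sum().backward()
for i, head in enumerate(layer.heads):
    g = head.weight.grad
    print(i, None if g is None else g.abs().sum().item())
```

Only head 0 should show a non-zero gradient here. The Python loop over heads is the main cost of this approach when the number of heads is large.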

Thanks very much.

Thanks for the reply. As I answered above, I’m trying to avoid parameter sharing. Therefore I don’t think conv1d is the best solution here.

Well, it’s not really elegant since you end up computing extra logits, but you could do:

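import torch
from torch import nn

input_dim = 12
input = torch.rand(1, 1, input_dim)
num_heads = 3
head_dim = input_dim // num_heads

# one kernel per head; positional args are kernel_size=head_dim, stride=head_dim
conv = nn.Conv1d(1, num_heads, head_dim, head_dim)

h = conv(input)  # (1, num_heads, num_heads): every head applied to every chunk

h = h.diagonal(dim1=1, dim2=2)  # (1, num_heads): keep head i applied to chunk i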
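
To check what the diagonal keeps, here is a small sketch that compares it with applying kernel i to chunk i by hand (the manual reference and the fixed seed are only there for the check):

```python
import torch
from torch import nn

torch.manual_seed(0)
input_dim, num_heads = 12, 3
head_dim = input_dim // num_heads

x = torch.rand(1, 1, input_dim)
conv = nn.Conv1d(1, num_heads, head_dim, head_dim)

full = conv(x)                     # (1, 3, 3): entry [0, i, j] is head i on chunk j
h = full.diagonal(dim1=1, dim2=2)  # (1, 3): head i on chunk i

# Reference: multiply kernel i with chunk i and add bias i
chunks = x.reshape(num_heads, head_dim)                          # (3, 4)
manual = (conv.weight[:, 0, :] * chunks).sum(dim=1) + conv.bias  # (3,)

matches = torch.allclose(h[0], manual, atol=1e-5)
print(matches)
```

The off-diagonal entries of full are the extra logits that get thrown away.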

Thank you very much!

I’m a little confused about the line:

conv = nn.Conv1d(1, num_heads, head_dim, head_dim) 

As far as I can tell from the PyTorch documentation, the 4th argument of nn.Conv1d is the stride, so this means stride = head_dim. Did you pass it this way on purpose, or is it a typo?

By the way, if I have an audio segment with N frames of 12 dimensions each, should I modify your code this way:

input_dim = 12
frame_len = N
input = torch.rand(1, frame_len, input_dim)

The 4th argument is the stride:

>>> nn.Conv1d(1, num_heads, head_dim, head_dim)    
Conv1d(1, 3, kernel_size=(4,), stride=(4,))        

Python arguments can be specified either by position or by name; here I chose to rely on positional args :wink:
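
For completeness, a short sketch showing that the positional and keyword forms build the same layer configuration (the variable names are just for illustration):

```python
from torch import nn

num_heads, head_dim = 3, 4

by_position = nn.Conv1d(1, num_heads, head_dim, head_dim)
by_keyword = nn.Conv1d(
    in_channels=1, out_channels=num_heads, kernel_size=head_dim, stride=head_dim
)

print(by_position.stride, by_keyword.stride)  # (4,) (4,)
print(by_position.kernel_size == by_keyword.kernel_size)
```

The keyword form is longer but makes it obvious which number is the stride.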

Oh, I wasn’t aware of that until you told me.

Again, thanks for your hints! Really helpful!

Well, for the N frames, I think you will have to do a little trick with reshape:

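import torch
from torch import nn

bs = 1
input_dim = 12
N = 6  # the number of frames
num_heads = 3
head_dim = input_dim // num_heads

input = torch.rand(bs, N, input_dim)  # frames are treated as input channels

# note: with the default groups=1, each output channel sums over all N input frames
conv = nn.Conv1d(N, N * num_heads, head_dim, head_dim)

h = conv(input)  # (bs, N * num_heads, num_heads)
h = h.reshape(bs, N, num_heads, num_heads)
h = h.diagonal(dim1=2, dim2=3)  # (bs, N, num_heads)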

Please verify if that is what you want to do.

And keep in mind that there are a lot of extra calculations (it shouldn’t matter that much on the GPU if frame_len and num_heads are small), but there might be a cleaner solution.
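One possibly cleaner option, as a sketch only: keep one weight vector per head in an nn.Parameter and contract with torch.einsum. The module name PerHeadScore and its initialisation are assumptions, not something from the paper. It computes no extra logits, shares nothing between heads, and does not depend on the number of frames.

```python
import torch
from torch import nn

class PerHeadScore(nn.Module):
    """Hypothetical module: one weight vector and bias per head."""

    def __init__(self, input_dim, num_heads):
        super().__init__()
        assert input_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = input_dim // num_heads
        self.weight = nn.Parameter(torch.randn(num_heads, self.head_dim) * 0.1)
        self.bias = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x):
        # x: (batch, N, input_dim); N may differ from call to call
        bs, n, _ = x.shape
        x = x.reshape(bs, n, self.num_heads, self.head_dim)
        # dot product of head k's weights with chunk k of every frame
        return torch.einsum("bnkd,kd->bnk", x, self.weight) + self.bias

scorer = PerHeadScore(input_dim=12, num_heads=3)
print(scorer(torch.rand(1, 6, 12)).shape)  # torch.Size([1, 6, 3])
print(scorer(torch.rand(2, 9, 12)).shape)  # torch.Size([2, 9, 3])
```

Batches with different frame counts inside one batch would still need padding and masking, which this sketch does not cover.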

Hi, I finally got it done.

I can’t define the conv1d layer as

conv = nn.Conv1d(N, N * num_heads, head_dim, head_dim) 

since the frame lengths in training data are also variable. If you are interested, the following is what I did:

    def __init__(...):
        self._attention = nn.Conv1d(1, self._head_num, self._head_dim, stride=self._head_dim)

    def forward(...):
        att_list = []
        for h in LastHiddenLayerOutput:
            # h is the hidden layer output with N frames and d dimensions
            # each, i.e. it has the shape [N, d]
            # transform it into shape [N, 1, d] to suit the conv1d input
            score = self._attention(h.unsqueeze(0).permute(1, 0, 2))

            # Thanks to @spanev for the hint about diagonal
            score = score.diagonal(dim1=1, dim2=2)
            score = nn.Softmax(dim=0)(score)

            # Tricky part: split h into k (heads) parts and make
            # h_a.shape = [d/k, k, N]
            # At this step, score.shape = [N, k], so it can be matmul'ed with h_a
            h_a = h.view(-1, self._head_num, self._head_dim)
            h_a = h_a.permute(2, 1, 0)

            # matmul of h_a with score has shape [d/k, k, k];
            # only the diagonal of the last 2 dims is the attention-weighted sum over all frames
            score = torch.matmul(h_a, score).diagonal(dim1=1, dim2=2)

            score = torch.flatten(score)
            att_list.append(score)

        att_h = torch.stack(att_list)
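
For comparison, a self-contained sketch of the same pooling idea without the conv and diagonal steps. The function name, the einsum-based scoring, and the output ordering (heads concatenated one after another) are assumptions for illustration, not the code from the post. The per-head bias is left out because softmax over frames is unaffected by a constant added to every frame's score.

```python
import torch

def multi_head_attention_pool(h, weight):
    """h: (N, d) frames; weight: (k, d // k), one scoring vector per head."""
    n, d = h.shape
    k, head_dim = weight.shape
    assert k * head_dim == d
    h_heads = h.reshape(n, k, head_dim)                  # (N, k, d/k)
    score = torch.einsum("nkd,kd->nk", h_heads, weight)  # (N, k)
    alpha = torch.softmax(score, dim=0)                  # normalise over frames, per head
    pooled = (alpha.unsqueeze(-1) * h_heads).sum(dim=0)  # (k, d/k)
    return pooled.reshape(d)                             # heads concatenated

weight = torch.randn(3, 4)
utterances = [torch.rand(5, 12), torch.rand(8, 12)]      # variable frame counts
att_h = torch.stack([multi_head_attention_pool(h, weight) for h in utterances])
print(att_h.shape)  # torch.Size([2, 12])
```

In a real module, weight would be an nn.Parameter so that it gets trained.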

You can give the input a shape like (3, 4) and use Conv1d with a kernel of shape (3, 4). Each dimension is then computed once to produce one output, and you get three features in the end. The weights won’t be shared, because the kernel has shape (3, 4).

The processing looks like the following picture.


Thanks for the suggestion. I’m a little confused about your idea. How do you ensure that the different input channels won’t influence each other?

You need to set the shape of the kernel to (3, 4). A convolution kernel shares its weights within a channel but not between channels, right? Every channel of both the kernel and the input has size 4 and dimension 1, so each input channel is combined with the kernel on the same channel exactly once. Finally, we get three feature maps, because the number of channels is three.

Did you try to code it? If so, could you show me the code?

I strongly suspect you are thinking of the same approach I discussed with @spanev previously, but forgot to take the diagonal to extract the result.

The weights won’t be shared between channels indeed. But all your kernels will be applied to all input channels.

And you are right. For one kernel, all the results are added into one, so we only get one result. So we could use one in_channel and one out_channel, feed in only one vector at a time, and repeat this operation three times to get three results. But that is not like what your picture shows, where everything is computed at the same time. Instead, you can set three in_channels and three out_channels and use a group convolution, so that the outputs of the three channels are not added into one.

The code goes like this:

import torch
import torch.nn as nn

# print(ind)
# conv=nn.Conv1d(in_channels=1,out_channels=1,kernel_size=4)
# for data in ind:
#     print(conv(data.unsqueeze(0).unsqueeze(0)))
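
The group convolution mentioned above could look roughly like this. It is a sketch under the assumption that each of the 3 chunks of 4 features becomes its own channel; groups=3 then gives every channel a separate kernel and keeps the channels from being summed together.

```python
import torch
from torch import nn

num_heads, head_dim = 3, 4

# groups=num_heads: output channel i is computed from input channel i only
conv = nn.Conv1d(num_heads, num_heads, kernel_size=head_dim, groups=num_heads)
print(conv.weight.shape)  # torch.Size([3, 1, 4]): a separate kernel per channel

x = torch.rand(1, 12).reshape(1, num_heads, head_dim)  # (batch, channels, length)
out = conv(x)                                          # (1, 3, 1)

# Changing channel 2 of the input leaves outputs 0 and 1 untouched
x2 = x.clone()
x2[0, 2] += 1.0
out2 = conv(x2)
independent = torch.allclose(out[0, :2], out2[0, :2])
print(independent)
```

For N frames, the frames can be folded into the batch dimension, i.e. an input of shape (N, 3, 4), so the layer does not depend on N.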

I don’t think this is how multi-head attention pooling works. I suggest you read the paper and the PyTorch documentation carefully.

The conv1d in your first post will result in parameter sharing, and the second will compute the output features incorrectly.

Furthermore, if you are really into this problem, I also strongly recommend you read the code in my previous reply, which I think might be what you actually have in mind.