Separate the linear layer for different features

Sorry, I wasn’t sure how to explain this to you…

Last time I said: “In convolution, the weight won’t be shared between channels.” If you set the input to shape (1, 12), the stride to 4, and the kernel size to 4, you will get one sequence with three results, but the weight is shared in that case.
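A minimal sketch of the case described above (the exact shapes are my reading of the description): one input channel of length 12, kernel size 4, stride 4, which yields three outputs all produced by the same four weights.

```python
import torch

# One input channel, one output channel, kernel 4, stride 4:
# the same 4-weight kernel slides over three non-overlapping windows.
conv = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=4, stride=4)

x = torch.rand(1, 1, 12)  # (batch, channels, length)
print(conv(x).shape)      # torch.Size([1, 1, 3]) -- three results
print(conv.weight.shape)  # torch.Size([1, 1, 4]) -- one shared kernel
```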

But it won’t be shared between channels, because each channel has its own weights; in the normal case, though, the per-channel results are summed into one output value. So we need to use a grouped convolution to keep the channels from being mixed.
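A quick sketch of the difference (my own illustration, not from the original post): a normal Conv1d mixes all input channels into each output, while setting groups restricts each output channel to its own slice of inputs. The weight shapes make this visible.

```python
import torch

# Normal conv: every output channel sees all 12 input channels.
mixed = torch.nn.Conv1d(12, 3, kernel_size=1)
# Grouped conv: each of the 3 output channels sees only its own 4 inputs.
grouped = torch.nn.Conv1d(12, 3, kernel_size=1, groups=3)

print(mixed.weight.shape)    # torch.Size([3, 12, 1])
print(grouped.weight.shape)  # torch.Size([3, 4, 1])
```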

Why did you mention pooling? I don’t see how it connects to your question…

This question is about implementing the multi-head attention in this paper, which is also called multi-head attention pooling.

I finally got your point about separating the channels when I read this answer of yours again. I just realized it is correct, since you used groups to separate the input channels. Thanks for the tips, and I’ll try that out.

Most of my confusion came from that answer of yours, since both in_channels and out_channels were set to 1 in your Conv1d layer.

BTW, would you please delete the commented-out part in your corrected post? It could cause a lot of ambiguity, at least for me the first time I read it…

I’ve done that. That was my fault. When I wrote the code I posted first, my idea was to use a for loop to take each element, feeding in one value and getting one value out. But I realized that isn’t what the picture you posted shows, so I chose to use a grouped convolution instead, as in the code I posted later.

Glad my answer worked for you.


Although this is a late reply, for anyone who comes to this question later: you can solve this by setting the groups argument of the Conv1d layer class.
For example:

separate1d = torch.nn.Conv1d(12, 3, kernel_size=1, groups=3)

achieves the function required in this question, i.e., mapping 12 input features to 3 outputs with 3 independent groups of weights and biases (each group covering 4 input features).
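A short sketch of how the grouped layer above would be applied to a batch of 12-feature vectors (my assumption of the intended usage): Conv1d expects input of shape (batch, channels, length), so each feature vector becomes a length-1 “sequence”.

```python
import torch

separate1d = torch.nn.Conv1d(12, 3, kernel_size=1, groups=3)

x = torch.rand(16, 12)             # (batch, features)
out = separate1d(x.unsqueeze(-1))  # (16, 12, 1) -> (16, 3, 1)
print(out.squeeze(-1).shape)       # torch.Size([16, 3])
```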

import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # One independent Linear layer per 4-feature group
        self.ly_1 = torch.nn.Linear(in_features=4, out_features=1)
        self.ly_2 = torch.nn.Linear(in_features=4, out_features=1)
        self.ly_3 = torch.nn.Linear(in_features=4, out_features=1)

    def forward(self, x):
        # Each layer only sees its own slice of the 12 input features
        x_1 = self.ly_1(x[:, :4])
        x_2 = self.ly_2(x[:, 4:8])
        x_3 = self.ly_3(x[:, 8:])
        return torch.cat((x_1, x_2, x_3), dim=1)


a = Model()
a(torch.rand(16, 12)).shape  # torch.Size([16, 3])
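As a sanity check (my own sketch, not from the original post; the names `ly`, `out_linear`, and `out_conv` are mine): three independent Linear(4, 1) layers compute exactly the same thing as a groups=3 Conv1d once the weights and biases are copied across.

```python
import torch

torch.manual_seed(0)

# Three independent Linear(4 -> 1) layers, mirroring the Model above.
ly = [torch.nn.Linear(4, 1) for _ in range(3)]

# A grouped Conv1d with the same partitioning: 12 inputs, 3 groups of 4.
conv = torch.nn.Conv1d(12, 3, kernel_size=1, groups=3)

# Copy each Linear's parameters into the corresponding conv group.
with torch.no_grad():
    conv.weight.copy_(torch.stack([l.weight for l in ly]).view(3, 4, 1))
    conv.bias.copy_(torch.cat([l.bias for l in ly]))

x = torch.rand(16, 12)
out_linear = torch.cat([ly[i](x[:, 4 * i:4 * i + 4]) for i in range(3)], dim=1)
out_conv = conv(x.unsqueeze(-1)).squeeze(-1)
print(torch.allclose(out_linear, out_conv, atol=1e-6))  # True
```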

If you want to make sure the model works as intended, you can check its gradients with the following code.

import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.ly_1 = torch.nn.Linear(in_features=4, out_features=1)
        self.ly_2 = torch.nn.Linear(in_features=4, out_features=1)
        self.ly_3 = torch.nn.Linear(in_features=4, out_features=1)

    def forward(self, x):
        x_1 = self.ly_1(x[:, :4])
        x_2 = self.ly_2(x[:, 4:8])
        x_3 = self.ly_3(x[:, 8:])
        return torch.cat((x_1, x_2, x_3), dim=1)


a = Model()
optim = torch.optim.Adam(params=a.parameters())
loss_fn = torch.nn.MSELoss()

input_ = torch.rand(4, 12)
label = torch.rand(4, 3)
# Make the third label column equal to the model's own third output, so the
# loss for that output is zero (detach keeps the label out of the autograd graph).
label[:, 2:3] = a(input_)[:, 2:3].detach()

loss = loss_fn(a(input_), label)
optim.zero_grad()
loss.backward()

# Only ly_3, whose output exactly matches its label, gets a zero gradient;
# the other two values will vary from run to run:
print(a.ly_1.weight.grad)  # e.g. tensor([[-0.1005, -0.1412, -0.1536, -0.1327]])
print(a.ly_2.weight.grad)  # e.g. tensor([[-0.2368, -0.3531, -0.3396, -0.2425]])
print(a.ly_3.weight.grad)  # tensor([[0., 0., 0., 0.]])
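The same gradient check can be run against the grouped-Conv1d version (my own sketch, following the logic above): if the third label channel is set to the model’s own third output, the third group’s weights should receive an exactly zero gradient.

```python
import torch

conv = torch.nn.Conv1d(12, 3, kernel_size=1, groups=3)
loss_fn = torch.nn.MSELoss()

x = torch.rand(4, 12, 1)
label = torch.rand(4, 3, 1)
# Third label channel equals the model's own third output (detached),
# so the loss contribution of group 3 is exactly zero.
label[:, 2:3] = conv(x)[:, 2:3].detach()

loss = loss_fn(conv(x), label)
loss.backward()

# conv.weight has shape (3, 4, 1): one (4, 1) kernel per group.
print(conv.weight.grad[2])  # tensor([[0.], [0.], [0.], [0.]])
```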