How to use attention and encoder layers in PyTorch

I have a working neural network with two data streams as a backbone network. Now I’m working on the classification head and would like to sum the features extracted from the two streams in the backbone using an attention mechanism, but unfortunately I don’t understand how to do it. Using PyTorch, I have this code:

def forward(self, x):

    x1, x2 = self.pooling_layer(x[0]), self.pooling_layer(x[1])

    assert x1.shape == x2.shape, 'The dimension of the rgb features and the pose features should be the same'

    x1 = x1.view(x1.size(0), -1)
    x2 = x2.view(x2.size(0), -1)

    x = # Perform sum with attention

    assert x.shape[1] == self.in_channels

    if self.dropout_layer is not None:
        x = self.dropout_layer(x)

    scores = self.fc_layer(x)
    return scores

How can I implement this?

If you have x1 in (N, C) and x2 in (N, C), then you can not directly merge these two into (N, C) via attention, in my understanding, because the shape of the query is the same as the shape of the outputs: torch.cat((x1, x2)) has the shape (2N, C), so if you only use an attention layer to produce the outputs, the output should also be (2N, C). Maybe you should provide more information.

What I would like to implement is not actually a concatenation, so I wouldn’t use torch.cat((x1, x2)), but a weighted sum. I would like to implement an attention mechanism to weight the sum, so in my idea it would look more like torch.add(att_weight_1 * x1, att_weight_2 * x2). Unfortunately, this is my first time working with attention, so I may have misunderstood the way this layer works; in that case I would be grateful if you could explain to me what I got wrong.

So, could you please tell me the shapes that you have mentioned? I think x1 and x2 are both (N, C). Thus, attn_weight_1 is (N, N), right?

For attn_weight_1, what are your query, key, and value? For example, for me, maybe the x1 is the query, and x2 is the key and value. Then, this attention can model the relationship between x1 and x2. But I’m not sure whether this is what you want.

To give further information, I have two 3D CNN branches that extract features from RGB images in one branch and heatmaps in the other. As of now, the branches produce features with different shapes (I modified the branches), which are (1, 17, 1, 1, 1) for the heatmaps and (1, 3, 1, 1, 1) for the RGB after the pooling operation. After the reshaping operation, I would like to sum these two output feature tensors based on an attention-weighted sum, in order to make sure the network learns which branch carries the most information. Is it possible to do so? Or is it possible to obtain this result with an encoder layer otherwise? I hope that my problem is clear enough now.

In (1, 17, 1, 1, 1), for example, is 17 the number of channels, or something else?

If I do not misunderstand, 17 is the number of channels, then the attention layer will not make sense.

Because the attention layer is aimed at getting scores between different tokens, but the outputs of these two branches do not have multiple tokens. And I don’t think it is reasonable to apply an attention layer to channel-wise data in your situation.

Yes, 17 is the number of channels in this case. If the attention layer is not a suitable option, is there any way I can make the network learn which branch of the neural network carries the most information and perform a weighted sum based on that?

I believe there must be various designs to achieve your goal.

For me, I might consider using an adaptive weight, as I do in this paper, shown in Fig 2 (orange block).

The idea is: given an input X of shape (N, C), I use an MLP module to output a W of shape (N, C). In W, w_ij is the weight of the j-th channel for the i-th input instance. Then I use X’ = X dot W (an elementwise product) to get the weighted X, thus deciding which channel has more information.

As for your situation, you can project both x1 and x2 to the same number of channels independently. Then use the adaptive weight above to decide the weight of each branch’s information (each branch needs its own adaptive weight module). Finally, you can merge the two branches by addition.

The formulation is:
x1’ = MLP_1(x1)
x2’ = MLP_2(x2)
W_1 = MLP_w1(x1’)
W_2 = MLP_w2(x2’)
final_x = x1’ dot W_1 + x2’ dot W_2
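The formulation above can be sketched as a small PyTorch module. Note this is a minimal sketch, not the exact module from the paper: the class name, the use of `nn.Linear` for each MLP, the hidden sizes, and the `Sigmoid` on the weight heads are all my assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveWeightFusion(nn.Module):
    """Project both branches to a shared channel dim, predict per-channel
    weights with small MLPs, and fuse the branches by a weighted sum.
    (Hypothetical sketch of the formulation; layer choices are assumptions.)"""

    def __init__(self, c1: int, c2: int, c_out: int):
        super().__init__()
        # MLP_1 / MLP_2: map each branch to the same number of channels
        self.mlp_1 = nn.Sequential(nn.Linear(c1, c_out), nn.ReLU())
        self.mlp_2 = nn.Sequential(nn.Linear(c2, c_out), nn.ReLU())
        # MLP_w1 / MLP_w2: predict a per-channel weight in (N, c_out)
        self.mlp_w1 = nn.Sequential(nn.Linear(c_out, c_out), nn.Sigmoid())
        self.mlp_w2 = nn.Sequential(nn.Linear(c_out, c_out), nn.Sigmoid())

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        x1p = self.mlp_1(x1)           # x1' = MLP_1(x1)
        x2p = self.mlp_2(x2)           # x2' = MLP_2(x2)
        w1 = self.mlp_w1(x1p)          # W_1 = MLP_w1(x1')
        w2 = self.mlp_w2(x2p)          # W_2 = MLP_w2(x2')
        # final_x = x1' dot W_1 + x2' dot W_2 (elementwise products)
        return x1p * w1 + x2p * w2

# Usage with the channel counts from the thread (3 for RGB, 17 for heatmaps):
fusion = AdaptiveWeightFusion(c1=3, c2=17, c_out=64)
out = fusion(torch.randn(1, 3), torch.randn(1, 17))
print(out.shape)  # torch.Size([1, 64])
```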

And there are also some keywords that may be useful when you search the web:

  • adaptive weight
  • channel-wise convolution
  • channel-wise attention
  • channel-wise
  • channel-wise weight

In my memory, there should be a lot of work of this kind, but I can’t recall specific references right now.

Sounds like torch.nn.parameter.Parameter is all you need.

In your init, you can define:

from torch.nn.parameter import Parameter

self.att_weight_1 = Parameter(torch.randn(1))
self.att_weight_2 = Parameter(torch.randn(1))

The above would be just one learnable value per weight. You could also make it elementwise by passing in a tensor whose size matches your x1/x2 tensors.

Then you can just call in your forward pass:

x = self.att_weight_1 * x1 + self.att_weight_2 * x2

Alternatively, if you want them to always add to one, you can just use one weight as follows:

x = self.att_weight_1 * x1 + (1 - self.att_weight_1) * x2
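Putting the snippets above together, a minimal module might look like the following. The class name and the feature sizes in the usage example are made up for illustration; the learnable scalars broadcast over the (N, C) feature tensors.

```python
import torch
from torch.nn.parameter import Parameter

class WeightedSumHead(torch.nn.Module):
    """Hypothetical head that fuses two equally-shaped feature tensors
    with one learnable scalar weight per branch."""

    def __init__(self):
        super().__init__()
        # One learnable value per weight, randomly initialised as above.
        self.att_weight_1 = Parameter(torch.randn(1))
        self.att_weight_2 = Parameter(torch.randn(1))

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # The (1,)-shaped weights broadcast over the (N, C) features.
        return self.att_weight_1 * x1 + self.att_weight_2 * x2

head = WeightedSumHead()
x = head(torch.randn(4, 20), torch.randn(4, 20))
print(x.shape)  # torch.Size([4, 20])
```

Since both weights are registered as `Parameter`s, the optimizer updates them together with the rest of the network, which is exactly how the model "learns which branch carries the most information".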