Applying Attention (Single and MultiHead Attention)

Applying attention from a paper.

Suppose my hidden audio representation (after a few CNN operations/layers) has the shape

 H = torch.Size([128, 32, 64])    [BatchSize X FeatureDim X Length]

and I want to apply self-attention weights to the audio hidden frames as

A = softmax(ReLU(AttentionWeight1 * (AttentionWeight2 * H)))

In order to learn these two self-attention weight matrices, do I need to register them as Parameters in the __init__ function, like below?


class Model(nn.Module):
    def __init__(self, batch_size):
        super(Model, self).__init__()
        self.batch_size = batch_size
        self.attention1 = nn.Parameter(torch.Tensor(self.batch_size, 16, 32))
        self.attention2 = nn.Parameter(torch.Tensor(self.batch_size, 1, 16))

and in the forward, do I need to do it like this?

def forward(self, input):
    ...
    H = CNN(input)     # [B x Features x Length]
    attention = nn.Softmax(nn.ReLU(torch.mm(self.attention2, torch.mm(self.attention1, H))))
    H = H * attention
    return H


Please help. How can we apply attention here? The above code is throwing this error:

RuntimeError: matrices expected, got 3D, 3D tensors at /opt/conda/conda-bld/pytorch_1591914985702/work/aten/src/TH/generic/THTensorMath.cpp:36

torch.mm expects two matrices (2D tensors), while you seem to use two 3D tensors.
You could use torch.bmm or torch.matmul instead, which would work for these tensors.
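
A minimal sketch of the difference, using randomly initialized tensors with the shapes from your post (the names are just placeholders):

import torch

H  = torch.randn(128, 32, 64)    # [BatchSize x FeatureDim x Length]
W1 = torch.randn(128, 16, 32)    # batched weight, as defined in your __init__

# torch.mm only accepts 2D matrices, so this line raises
# "RuntimeError: matrices expected, got 3D, 3D tensors"
# out = torch.mm(W1, H)

out_bmm    = torch.bmm(W1, H)       # batched matrix multiply -> [128, 16, 64]
out_matmul = torch.matmul(W1, H)    # broadcasting matmul     -> [128, 16, 64]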

However, the parameters usually do not depend on the batch size.
Are you sure you want to initialize them with the batch size in dim0?
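
For example, here is a sketch of the parameters defined without the batch dimension (the 16-dim hidden size is taken from your code); torch.matmul broadcasts them over the batch automatically:

import torch
import torch.nn as nn

attention1 = nn.Parameter(torch.randn(16, 32))    # no batch dimension
attention2 = nn.Parameter(torch.randn(1, 16))

H = torch.randn(128, 32, 64)                      # [BatchSize x FeatureDim x Length]
scores = torch.matmul(attention2, torch.matmul(attention1, H))    # -> [128, 1, 64]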

@ptrblck
How can I make these weights learnable? Am I doing it right here?

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # weights are randomly initialized and no longer depend on the batch size
        self.attention1 = nn.Parameter(torch.randn(32, 16))   # [FeatureDim x HiddenDim]
        self.attention2 = nn.Parameter(torch.randn(16, 1))    # [HiddenDim x 1]
        self.relu = nn.ReLU()

Attention mechanism in the forward pass. The input here is the output after a few CNN operations, with shape

[BatchSize X DimFeature X Length] = [128 X 32 X 64]
       """ Get Attention Weights """
        attn = input    
        attention = attn.permute(0, 2, 1).matmul(self.attention1)
        attention = attention.matmul(self.attention2)
        attention = self.relu(attention)      
        attention = attention.view(attention.size(0), -1)
        attention = F.softmax(attention, 1)  
        """ Multiply Attention Weights with Audio Frames"""
        input = input * attention.unsqueeze(1)          #To Make it comptable with BatchSize X FeatureDim X FixedLength
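
Putting the pieces together, here is a self-contained sketch of the full module (the CNN front end is omitted and the class name FrameAttention is just a placeholder; shapes are the ones assumed above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameAttention(nn.Module):    # hypothetical name, CNN front end omitted
    def __init__(self, feature_dim=32, hidden_dim=16):
        super(FrameAttention, self).__init__()
        self.attention1 = nn.Parameter(torch.randn(feature_dim, hidden_dim))
        self.attention2 = nn.Parameter(torch.randn(hidden_dim, 1))
        self.relu = nn.ReLU()

    def forward(self, H):    # H: [BatchSize x FeatureDim x Length]
        scores = H.permute(0, 2, 1).matmul(self.attention1)    # [B x Length x hidden_dim]
        scores = self.relu(scores.matmul(self.attention2))     # [B x Length x 1]
        weights = F.softmax(scores.view(scores.size(0), -1), dim=1)    # [B x Length]
        return H * weights.unsqueeze(1)    # reweighted frames, [B x FeatureDim x Length]

H = torch.randn(128, 32, 64)
model = FrameAttention()
out = model(H)
print(out.shape)    # torch.Size([128, 32, 64])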

@ptrblck Is this okay now?
I also have another question: when I use nn.Softmax in place of F.softmax(attention, 1), why doesn't it work?

The code looks alright code-wise and you should be able to see valid gradients in model.attention1.grad and model.attention2.grad after a backward() call.
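
For instance, reusing the FrameAttention sketch from above (a hypothetical stand-in for your model), a quick gradient check could look like this:

out = model(H)
out.mean().backward()
print(model.attention1.grad.shape)    # torch.Size([32, 16]) -> gradients are populated
print(model.attention2.grad.shape)    # torch.Size([16, 1])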

nn.Softmax should work like F.softmax, but you might have forgotten to create the module before calling it via:

nn.Softmax(dim=1)(input)

What kind of error are you seeing with nn.Softmax?

Thank you for your feedback.

I think I was using the wrong syntax, as below

nn.Softmax(input, 1)

but it is actually like this

nn.Softmax(dim=1)(input)
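
For completeness, both forms give the same result once the module is created with its dim argument; a tiny check (the input tensor is just an example):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 5)
softmax_module = nn.Softmax(dim=1)    # create the module first, then call it on the tensor
print(torch.allclose(softmax_module(x), F.softmax(x, dim=1)))    # True

# nn.Softmax(x, 1) fails because the constructor expects the dim argument, not a tensor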

@shakeel608 Have you done your task?
I am using a transformer network for my audio, of course only the encoder part, with multi-head attention using key, query, and value matrices.
Could you please explain what the purpose of this H at the end is? Is it only the reweighted H, for better classification?

Regards