RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x32 and 128x32)

Hi everyone,
I’m trying to code a custom LSTM layer with an attention gate, as explained in “”, for a classification problem. But I’m struggling with this error when it comes to multiplying the result obtained from the attention gate and the input x. By removing the AtGate lines the code works just fine. Can anyone give me some help? Thanks in advance!

The custom LSTM Layer code is below:

import math
import torch
import torch.nn as nn
class AttCustomLSTM(nn.Module):
    def __init__(self, input_sz: int, hidden_sz: int):
        super().__init__()
        self.input_size = input_sz
        self.hidden_size = hidden_sz
        self.W_i = nn.Parameter(torch.Tensor(input_sz, hidden_sz))
        self.U_i = nn.Parameter(torch.Tensor(hidden_sz, hidden_sz))
        self.b_i = nn.Parameter(torch.Tensor(hidden_sz))
        self.W_f = nn.Parameter(torch.Tensor(input_sz, hidden_sz))
        self.U_f = nn.Parameter(torch.Tensor(hidden_sz, hidden_sz))
        self.b_f = nn.Parameter(torch.Tensor(hidden_sz))
        self.W_c = nn.Parameter(torch.Tensor(input_sz, hidden_sz))
        self.U_c = nn.Parameter(torch.Tensor(hidden_sz, hidden_sz))
        self.b_c = nn.Parameter(torch.Tensor(hidden_sz))
        self.W_o = nn.Parameter(torch.Tensor(input_sz, hidden_sz))
        self.U_o = nn.Parameter(torch.Tensor(hidden_sz, hidden_sz))
        self.b_o = nn.Parameter(torch.Tensor(hidden_sz))

        self.W_a = nn.Parameter(torch.Tensor(input_sz, hidden_sz))
        self.U_a = nn.Parameter(torch.Tensor(hidden_sz, hidden_sz))
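        self.b_a = nn.Parameter(torch.Tensor(hidden_sz))
        self.init_weights()

    def init_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)

    def forward(self, x, init_states=None):
        # x is expected to have shape (batch_size, sequence_size, input_size)
        bs, seq_sz, _ = x.size()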
        hidden_seq = []
        if init_states is None:
            h_t, c_t = (
                torch.zeros(bs, self.hidden_size).to(x.device),
                torch.zeros(bs, self.hidden_size).to(x.device),
            )
        else:
            h_t, c_t = init_states
        for t in range(seq_sz):
            x_t = x[:, t, :]
            # Attention gate
            a_t = torch.sigmoid(x_t @ self.W_a + h_t @ self.U_a + self.b_a)
            x_t = a_t @ x_t 

            i_t = torch.sigmoid(x_t @ self.W_i + h_t @ self.U_i + self.b_i)
            f_t = torch.sigmoid(x_t @ self.W_f + h_t @ self.U_f + self.b_f)
            g_t = torch.tanh(x_t @ self.W_c + h_t @ self.U_c + self.b_c)
            o_t = torch.sigmoid(x_t @ self.W_o + h_t @ self.U_o + self.b_o)
            c_t = f_t * c_t + i_t * g_t
            h_t = o_t * torch.tanh(c_t)
            hidden_seq.append(h_t.unsqueeze(0))
        # reshape hidden_seq for the return value
        hidden_seq = torch.cat(hidden_seq, dim=0)
        hidden_seq = hidden_seq.transpose(0, 1).contiguous()
        return hidden_seq, (h_t, c_t)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(len(encoder.vocab)+1, 32)
        self.lstm = AttCustomLSTM(32, 32)  # nn.LSTM(32, 32, batch_first=True)
        self.fc1 = nn.Linear(32, 2)
    def forward(self, x):
        x_ = self.embedding(x)
        x_, (h_n, c_n) = self.lstm(x_)
        x_ = (x_[:, -1, :])
        x_ = self.fc1(x_)
        return x_

Hi, based on your error, you would have to transpose one of the matrices for them to be multipliable, since the second dimension of the first matrix must equal the first dimension of the second matrix.

Therefore you should multiply matrices of shape 128 x 32 and 32 x 128 to get a result of shape 128 x 128.

Please check your input shapes once. I believe the error comes from the line

x_t = a_t @ x_t

because a_t would be of shape batch x hidden_size and x_t is of shape batch x input_size.

On a side note, instead of the @ operator you could call torch.mm directly for these 2-D tensors (torch.bmm only accepts 3-D batched inputs), though I would not expect a large speed difference.
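To make the shape rule concrete, here is a small sketch using random tensors with the sizes from the error message (batch 128, input and hidden size 32). The tensor values are placeholders chosen for illustration, not the original data.

```python
import torch

batch, input_sz, hidden_sz = 128, 32, 32  # sizes taken from the error message

a_t = torch.rand(batch, hidden_sz)  # stand-in for the attention gate output
x_t = torch.rand(batch, input_sz)   # stand-in for one time step of the input

# (128 x 32) @ (128 x 32): the inner dimensions 32 and 128 differ, so this raises.
try:
    a_t @ x_t
    matmul_failed = False
except RuntimeError as err:
    matmul_failed = True
    print(err)

# Transposing one operand makes the product defined, but the result
# no longer has the (batch, input_sz) shape the LSTM gates expect.
shape_a = (a_t @ x_t.T).shape  # (128, 128)
shape_b = (a_t.T @ x_t).shape  # (32, 32)
print(shape_a, shape_b)
```

Neither transposed product keeps the batch dimension paired with the feature dimension, which is a hint that a matrix product may not be the intended operation here.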

Hi, thanks for the help. I transposed the matrix to get a 128 x 128 result, but ran into another problem: it can’t be multiplied by the weight in the input gate, because the weight has shape 32 x 32.

Note: I’ve also tried with a 32 x 32 result. That multiplication works, but then the sum of the input and hidden-state terms in the input gate fails because of the shapes.
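The two follow-up failures described above can be reproduced with a short sketch. The sizes come from the thread; the helper `fails` and the random tensors are assumptions added only for illustration.

```python
import torch

batch, input_sz, hidden_sz = 128, 32, 32
a_t = torch.rand(batch, hidden_sz)
x_t = torch.rand(batch, input_sz)
h_t = torch.zeros(batch, hidden_sz)
W_i = torch.rand(input_sz, hidden_sz)
U_i = torch.rand(hidden_sz, hidden_sz)

def fails(fn):
    """Return True if calling fn raises a RuntimeError (shape mismatch)."""
    try:
        fn()
        return False
    except RuntimeError:
        return True

# 128 x 128 result: cannot be multiplied by the 32 x 32 weight.
square_fails = fails(lambda: (a_t @ x_t.T) @ W_i)

# 32 x 32 result: the product with W_i works, but adding the
# 128 x 32 hidden-state term does not broadcast.
small_product_ok = not fails(lambda: (a_t.T @ x_t) @ W_i)
small_sum_fails = fails(lambda: (a_t.T @ x_t) @ W_i + h_t @ U_i)

print(square_fails, small_product_ok, small_sum_fails)
```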

According to your code, you are not passing the vanilla input but an attended input of shape (batch, hidden_size) into your input gate. W_i, which maps from input to hidden, has shape (input, hidden), and therein lies the problem.

Also, based on a cursory reading of the paper, the attention vector should have the same dimension as the input vector. Therefore W_a should be of size (input, input) and U_a should be of size (hidden, input). The line from the paper that suggests this is as follows:

The response of an EleAttG is a vector a_t with the same dimension as the input x_t of the RNNs,

Also, I believe the authors compute the Hadamard (element-wise) product rather than a matrix multiplication to get the final x_t, so the line would change to x_t = a_t * x_t
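A minimal sketch of one time step with these changes is shown here, assuming the parameter shapes suggested above. The sizes and the random initialisation are placeholders, and hidden_sz is deliberately different from input_sz so that a wrong shape cannot slip through unnoticed.

```python
import torch
import torch.nn as nn

batch, input_sz, hidden_sz = 128, 32, 64  # hidden_sz differs on purpose

# Attention gate parameters, sized so that a_t matches x_t.
W_a = nn.Parameter(torch.randn(input_sz, input_sz) * 0.1)
U_a = nn.Parameter(torch.randn(hidden_sz, input_sz) * 0.1)
b_a = nn.Parameter(torch.zeros(input_sz))

# Input gate parameters, shaped as in the original layer.
W_i = nn.Parameter(torch.randn(input_sz, hidden_sz) * 0.1)
U_i = nn.Parameter(torch.randn(hidden_sz, hidden_sz) * 0.1)
b_i = nn.Parameter(torch.zeros(hidden_sz))

x_t = torch.rand(batch, input_sz)    # one time step of the input
h_t = torch.zeros(batch, hidden_sz)  # previous hidden state

# Attention gate: one weight in (0, 1) per input element.
a_t = torch.sigmoid(x_t @ W_a + h_t @ U_a + b_a)  # (batch, input_sz)

# Element-wise (Hadamard) product keeps the input shape.
x_att = a_t * x_t                                  # (batch, input_sz)

# The attended input now fits the unchanged input gate.
i_t = torch.sigmoid(x_att @ W_i + h_t @ U_i + b_i)  # (batch, hidden_sz)

print(a_t.shape, x_att.shape, i_t.shape)
```

In the layer from the question, this would correspond to changing the shapes of self.W_a, self.U_a and self.b_a in __init__ and replacing the @ in the attention line with *.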

I tried the change you suggested in the W_a and U_a size, and it worked perfectly. I even tried to change the size before posting this issue, but it wasn’t working, mostly because I did it wrong. I’m kinda new to DL especially with PyTorch, so I’m still struggling with some implementations and logic regarding matrix operations.
Thanks a lot for the help