RuntimeError: Expected hidden[0] size (2, 20, 256), got (2, 50, 256)

Raj · February 26, 2019, 11:39am

I’m building an LSTM model to classify text into multiple classes, and I get the following error when training the model. I’ve used a softmax activation as the final layer. There are 44 possible classes for the texts. Below is the code snippet and network architecture. Any help will be greatly appreciated. I’m using the CPU to train the model for now, after which I’ll move to GPU.

# training params

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()

# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # print('output:',output.squeeze())
        # print('labels:',labels.float())

        # calculate the loss and perform backprop
        loss = criterion(output, labels)
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)

                val_loss = criterion(output, labels)

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

# Instantiate the model w/ hyperparams

vocab_size = len(vocab_to_int)+1
output_size = 44
embedding_dim = 100
hidden_dim = 256
n_layers = 2

net = ClassificationRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

ClassificationRNN(
  (embedding): Embedding(5865, 100)
  (lstm): LSTM(100, 256, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=256, out_features=44, bias=True)
  (sof): LogSoftmax()
  (dropout): Dropout(p=0.3)
)

DoubtWang · February 26, 2019, 12:59pm

You can print the shape of hidden[0] at each step, then check it carefully.
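
A minimal sketch of that kind of shape check, in case it helps. Everything here is a toy stand-in and not the code from this thread: the dataset of 25 random sequences, the batch size of 10, and the small nn.LSTM are assumptions chosen so that the last batch comes out smaller than the hidden state expects.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins (assumptions): 25 sequences of length 20 with 8 features,
# batch size 10, so the last batch holds only 5 sequences.
data = TensorDataset(torch.randn(25, 20, 8))
loader = DataLoader(data, batch_size=10)

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

# Hidden and cell state sized for a full batch of 10: (n_layers, batch, hidden)
hidden = (torch.zeros(2, 10, 16), torch.zeros(2, 10, 16))

mismatched = []  # (step, actual batch size) for every batch that would fail
for step, (inputs,) in enumerate(loader):
    print(step, "inputs:", tuple(inputs.shape), "hidden[0]:", tuple(hidden[0].shape))
    if inputs.size(0) != hidden[0].size(1):
        # Passing this batch to the LSTM would raise the
        # "Expected hidden[0] size ..." RuntimeError, so only record it.
        mismatched.append((step, inputs.size(0)))
        continue
    _, hidden = lstm(inputs, hidden)
    hidden = tuple(h.detach() for h in hidden)  # cut the graph between batches
```

The check runs before the LSTM call, so it works even though the error itself is raised inside nn.LSTM.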

Raj · February 27, 2019, 4:22am

@DoubtWang I don’t see a way to add that for debugging purposes, as I believe this happens inside the LSTM package.

DoubtWang · February 27, 2019, 5:04am

Can you share the detailed code?
Also, what are the shapes of the inputs and labels?

Raj · February 27, 2019, 5:44am

Input is a tensor from the data loader with a batch size of 50 and a sequence length of 20. Labels is a one-dimensional tensor of possible values ranging from 0 to 43 (44 values). Please find the detailed code below.

import numpy as np
import torch
import torch.nn as nn

class ClassificationRNN(nn.Module):
    """
    The RNN model that will be used to perform multi-class text classification.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(ClassificationRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # define all layers
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size,embedding_dim)
        self.lstm = nn.LSTM(embedding_dim,hidden_dim,n_layers,dropout=drop_prob, batch_first=True)

        # fully connected layer & log-softmax
        self.fc = nn.Linear(hidden_dim,output_size)
        self.sof = nn.LogSoftmax(dim=1)

        # dropout layer
        self.dropout = nn.Dropout(0.3)

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)
        embeds = self.embedding(x)
        lstm_out,hidden= self.lstm(embeds,hidden)

        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)

        # log-softmax function
        soft_out = self.sof(out)

        # reshape to be batch_size first
        soft_out = soft_out.view(batch_size, -1)

        soft_out = soft_out[:, -1] # get last batch of labels

        # return last log-softmax output and hidden state
        return soft_out, hidden


    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data

        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())

        return hidden

# Instantiate the model w/ hyperparams

vocab_size = len(vocab_to_int)+1
output_size = 44
embedding_dim = 100
hidden_dim = 256
n_layers = 2

net = ClassificationRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

# loss and optimization functions

lr=0.001

criterion = nn.NLLLoss()

optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# training params

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()

# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        print('output shape',output.shape)

        print('labels:',labels.float())

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels)
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)

                val_loss = criterion(output.squeeze(), labels)

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

DoubtWang · February 27, 2019, 9:24am

I have run your code, but I did not find any bug.
Here is some advice:

1. If you initialize the hidden state to zero, no extra operation is required. The nn.LSTM documentation says:

If (h_0, c_0) is not provided, both h_0 and c_0 default to zero.

2. If you set batch_first=True and the shape of inputs is (batch_size, len) = (50, 20), then hidden[0] should be (2, 50, 256) instead of (2, 20, 256). The nn.LSTM documentation says:

h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch.

3. For the NLLLoss function, the expected shapes are:

Input: (N, C) where C = number of classes, and Target: (N) where each value is 0 ≤ targets[i] ≤ C−1

So the code should be:

        # reshape to (batch_size, seq_len, output_size)
        soft_out = soft_out.view(batch_size, -1, self.output_size)

        soft_out = soft_out[:, -1, :] # keep the output of the last time step

4. If you use the LSTM for a classification task, you can use the final state or self-attention, based on my experience.
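
To make the NLLLoss shapes concrete, here is a small sketch that follows the same reshape with the sizes mentioned in this thread (batch 50, sequence length 20, hidden size 256, 44 classes). The random `lstm_out` tensor is only a stand-in for a real LSTM output, and the variable names are made up for illustration.

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_dim, n_classes = 50, 20, 256, 44

# Stand-in (assumption) for the output of a batch_first LSTM
lstm_out = torch.randn(batch_size, seq_len, hidden_dim)
fc = nn.Linear(hidden_dim, n_classes)
log_softmax = nn.LogSoftmax(dim=1)

flat = lstm_out.contiguous().view(-1, hidden_dim)       # (50 * 20, 256)
log_probs = log_softmax(fc(flat))                       # (50 * 20, 44)
log_probs = log_probs.view(batch_size, -1, n_classes)   # (50, 20, 44)
last_step = log_probs[:, -1, :]                         # (50, 44): last time step only

# NLLLoss wants input (N, C) and target (N) with values in [0, C - 1]
labels = torch.randint(0, n_classes, (batch_size,))
loss = nn.NLLLoss()(last_step, labels)
print(tuple(last_step.shape), tuple(labels.shape), loss.item())
```

With `view(batch_size, -1)` followed by `[:, -1]`, the result would have shape (50,) instead, which is not the (N, C) input that NLLLoss expects.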

Raj · February 27, 2019, 11:12am

Thanks @DoubtWang. Your suggestions were very useful. I found the problem. One of the batches in the validation data loader has an incorrect dimension. Please refer to the attached image. But I don’t know how to remove that batch. Could you please suggest a solution for that?

DoubtWang · February 27, 2019, 2:31pm

You can try and see if it works.

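def init_hidden(self):
    # Create the hidden and cell state with a batch dimension of 1;
    # they are repeated to the actual batch size in the loops below
    weight = next(self.parameters()).data

    if (train_on_gpu):
        hidden = (weight.new(self.n_layers, 1, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, 1, self.hidden_dim).zero_().cuda())
    else:
        hidden = (weight.new(self.n_layers, 1, self.hidden_dim).zero_(),
                  weight.new(self.n_layers, 1, self.hidden_dim).zero_())
    return hidden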

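# in the training loop (assumes h still has a batch dimension of 1,
# i.e. it was created by init_hidden() for this batch)
batch_size = inputs.size(0)
h = tuple([each.repeat(1, batch_size, 1).data for each in h])

# in the validation loop (same assumption for val_h)
batch_size = inputs.size(0)
val_h = tuple([each.repeat(1, batch_size, 1).data for each in val_h])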

I think this approach can solve the problem.
Does that make sense?
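
A related sketch, under the assumption that the hidden state does not need to be carried from one batch to the next: size a fresh zero state from each batch with `inputs.size(0)`. The standalone `init_hidden` helper and the random inputs below are hypothetical and only for illustration. Since nn.LSTM defaults to zero states when none are passed, the last lines also check that leaving the state out gives the same output.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2, batch_first=True)

def init_hidden(batch_size, n_layers=2, hidden_dim=256):
    # Hypothetical helper: zero hidden and cell state for the given batch size
    return (torch.zeros(n_layers, batch_size, hidden_dim),
            torch.zeros(n_layers, batch_size, hidden_dim))

shapes = []
for n in (50, 50, 20):                   # the last batch is smaller
    inputs = torch.randn(n, 20, 100)     # stand-in for embedded inputs
    h = init_hidden(inputs.size(0))      # sized from the batch itself
    out, h = lstm(inputs, h)
    shapes.append(tuple(h[0].shape))

# Omitting the state is equivalent to passing zeros
out_default, _ = lstm(inputs)
same = torch.allclose(out, out_default)
print(shapes, same)
```

This trades the carried-over state for robustness to odd batch sizes, which may or may not be what a given model wants.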

Raj · February 28, 2019, 5:12am

Thanks @DoubtWang. Your solution, as I understand it, tries to fix the hidden layer, but I believe the issue I’m facing is with the input dimension of a few bad batches in the validation dataloader. If I can fix the dataloader, I believe this issue would be resolved.

Raj · February 28, 2019, 7:17am

I worked around the problem by skipping the training & validation for those batches which were not of the shape (50,20). Below is a simple if condition that helped me to solve the problem.

if( (inputs.shape[0],inputs.shape[1]) != (batch_size,seq_length)):
    print('Validation - Input Shape Issue:',inputs.shape)
    continue

1dividedby0 · June 21, 2019, 4:52am

This was so helpful! Thank you!

bkuriach · December 15, 2019, 4:19pm

This error can be resolved by setting the ‘drop_last’ parameter of DataLoader. You might be getting this error because the size of your training data is not divisible by the batch size. In that case, you will get the error at the end of the epoch. Suppose your training data has 1005 records and the batch size is 100: the last batch will have only 5 records, and you may get a dimension error. If you set ‘drop_last’ to True, those 5 records will be ignored.

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
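
A quick way to see the effect, sketched with the 1005-record example above. The all-zero TensorDataset is a placeholder (an assumption), not the data from this thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data (assumption): 1005 sequences of length 20 with one label each
train_data = TensorDataset(torch.zeros(1005, 20, dtype=torch.long),
                           torch.zeros(1005, dtype=torch.long))

keep_loader = DataLoader(train_data, shuffle=True, batch_size=100)
drop_loader = DataLoader(train_data, shuffle=True, batch_size=100, drop_last=True)

sizes_keep = [x.size(0) for x, _ in keep_loader]
sizes_drop = [x.size(0) for x, _ in drop_loader]

print(len(sizes_keep), sizes_keep[-1])   # number of batches, size of the last one
print(len(sizes_drop), sizes_drop[-1])
```

Without drop_last the loader yields one extra, smaller batch at the end; with drop_last=True every batch has the full batch size.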

Rasha_Salim · June 4, 2020, 6:16am

Thanks a lot! this was really helpful

erfan_asadi · December 13, 2021, 8:22pm

I have the same problem and this is the exact answer,
Thank you! <3

mapneto · May 27, 2022, 10:22pm

I had the same error and realized the issue when I read your contribution. Thank you very much!

mapneto · May 27, 2022, 10:22pm

very useful!! Thank you

senitent_signal · August 1, 2022, 6:20pm

Sometimes looking at a problem from a different angle reduces the computation time in the brain. I don’t know how it happens inside the brain, but it yields drop_last=True.

But my question is: what could be missed if the size of the training data is not divisible by the batch size?
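
One way to explore that question, as a rough sketch that reuses the 1005-record, batch-size-100 numbers from earlier in the thread. The dataset of plain indices is an assumption made so the skipped records can be counted. With drop_last=True, the records left out of an epoch are the ones that fall into the incomplete last batch, and with shuffle=True the order is drawn again every epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

n_records, batch_size = 1005, 100

# Each "record" is just its own index, so we can track which ones are seen
data = TensorDataset(torch.arange(n_records))
loader = DataLoader(data, batch_size=batch_size, shuffle=True, drop_last=True)

missed_per_epoch = []
for epoch in range(3):
    seen = set()
    for (idx,) in loader:
        seen.update(idx.tolist())
    missed_per_epoch.append(set(range(n_records)) - seen)

for epoch, missed in enumerate(missed_per_epoch):
    print(epoch, len(missed), sorted(missed))
```

Each epoch skips `n_records % batch_size` records (5 here). Because the shuffle changes between epochs, the skipped records are usually different ones each time.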