LSTM shape-related errors

Hi, I am relatively new to building deep learning models and I seem to be completely confused and stuck with errors related to shape and size.

Here’s the LSTM model and relevant code:

class LSTMTagger(nn.Module):

    def __init__(self):
        super(LSTMTagger, self).__init__()
        # self.lstm1 = nn.LSTM(input_size=1, hidden_size=100)
        # self.lstm2 = nn.LSTM(100, 50)
        self.embedding = nn.Embedding(wv.vectors.shape[0], 512)  # embedding_matrix.shape[1]
        self.lstm1 = nn.LSTM(input_size=512, hidden_size=64, dropout=0.1,
                             batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(p=0.25)
        self.linear1 = nn.Linear(in_features=128, out_features=64)
        self.dropout = nn.Dropout(p=0.25)
        self.linear2 = nn.Linear(in_features=64, out_features=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        X_embed = self.embedding(X)
        outr1, _ = self.lstm1(X_embed)
        xr = self.dropout(outr1)
        xr = self.linear1(xr)
        xr = self.dropout(xr)
        xr = self.linear2(xr)
        outr4 = self.sigmoid(xr)
        outr4 = outr4.view(1, -1)

        return outr4

model = LSTMTagger()
torch.multiprocessing.set_sharing_strategy('file_system')
if torch.cuda.device_count() > 1:
  print("Using ", torch.cuda.device_count(), " GPUs")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs

# model =model.load_state_dict(torch.load('best_model_state.bin'))
model = nn.DataParallel(model, device_ids=[0])
torch.cuda.empty_cache()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)

def train_epoch(
    model,
    data_loader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples
):
    model = model.train()
    losses = []
    correct_predictions = 0
    for d in data_loader:
        print(f"Input ids: {np.shape(d['input_ids'])}\n len: {len(d['input_ids'][0])}")
        input_ids = d["input_ids"].to(device)
        targets = d["targets"].to(device)
        outputs = model(input_ids)
        _, preds = torch.max(outputs, dim=1)
        print(f"outputs is {np.shape(outputs)}")
        print(f"targets is {targets}")
        loss = criterion(outputs.squeeze(), targets)
        # loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return correct_predictions.double() / n_examples, np.mean(losses)
EPOCHS = 6
optimizer = optim.Adam(model.parameters(), lr=2e-5)
total_steps = len(data_train) * EPOCHS
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
loss_fn = nn.CrossEntropyLoss().to(device)
history = defaultdict(list)
best_accuracy = 0
criterion = nn.BCELoss()


print('starting training')

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    train_acc, train_loss = train_epoch(
        model,
        data_train,
        loss_fn,
        optimizer,
        device,
        scheduler,
        len(df_train)
    )

In this instance the sample input is a tensor of size torch.Size([1, 512]), which looks like this:

 tensor([[44561,   972,  7891,    94,  2191,   131,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             ...
             0,     0]], device='cuda:0')

and the output label (targets in the train_epoch function) in this case is just a simple 1 or 0 label in tensor form, such as:

tensor([1], device='cuda:0')

I have been facing issues consistently with this approach. Initially the output was 1x512x1, so I added


        outr4 = outr4.view(1,-1)

after the sigmoid layer. Then the output shape was reduced to 1x512, and I used the squeeze function, but I still face errors such as this one:

ValueError: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([512])) is deprecated. Please ensure they have the same size.

I have spent a lot of time trying to figure out what is going on, but to no avail. Isn’t the output supposed to be either 1 or 0, instead of a 1x512 shaped tensor?
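(For reference, the mismatch can be reproduced in isolation; a minimal, self-contained sketch with made-up tensors of the same shapes:)

    import torch
    import torch.nn as nn

    criterion = nn.BCELoss()
    outputs = torch.rand(1, 512)     # stands in for the model output after view(1, -1)
    targets = torch.tensor([1.])     # a single label for the sample

    # outputs.squeeze() has shape (512,) while targets has shape (1,),
    # so BCELoss raises the "target size ... different to the input size" ValueError
    loss = criterion(outputs.squeeze(), targets)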

I am relatively new to building models, so please excuse my lack of knowledge.

The error is raised because you are providing a single target value while the model output contains 512 predictions.
Could you explain your use case a bit, i.e. what should your model try to predict?
If it’s a single prediction per sample, you would have to “reduce” the 512 values somehow.
E.g. if they represent the temporal dimension, you might want to use the last time step (or calculate the mean etc.). On the other hand, if you would like to get a prediction for each time step, your target should also contain labels for all of them.
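
For illustration, a minimal sketch of the two reductions mentioned above, assuming out is the (batch_size, seq_len, hidden) tensor returned by the LSTM:

    # out: (1, 512, 128) -> (1, 128), keeping only the last time step
    last_step = out[:, -1, :]

    # out: (1, 512, 128) -> (1, 128), averaging over the temporal dimension
    mean_step = out.mean(dim=1)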

Hi, firstly, thanks a lot for the reply. I am trying to build a classification model to classify certain sentence vectors as 1 and others as 0. So, I am passing sentence vectors, each of shape 1x512, to the model (not sure if it’s relevant, but the shape of the embedding matrix that I used was 579964x256). I am initially using 200 sentence vectors of shape 1x512 each to test the model and make sure that everything is working well before I start the actual training. The batch size is 1, though, so at each forward/backward pass the input is one sentence vector of size 1x512. I have posted a sample sentence vector above in the post.

At each layer of the forward function, here’s how the shape changes:
Initial input: 1x512
After embedding layer: 1x512x512
After LSTM layer: 1x512x128
After first dropout layer: 1x512x128
After first linear layer: 1x512x64
After second dropout layer: 1x512x64
After second linear layer: 1x512x1
After sigmoid layer: 1x512x1
After performing ‘view(1,-1)’ on the output from sigmoid layer: 1x512
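
(As a sanity check, these shapes can be reproduced with a dummy input; a small sketch assuming the posted LSTMTagger class and the wv vocabulary it references:)

    tagger = LSTMTagger()                       # fresh CPU copy of the posted model
    dummy = torch.randint(0, 1000, (1, 512))    # made-up token ids, batch_size = 1
    x = tagger.embedding(dummy)                 # (1, 512, 512)
    out, (h, c) = tagger.lstm1(x)               # out: (1, 512, 128), h: (2, 1, 64)
    print(x.shape, out.shape, h.shape)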

Now, as I mentioned earlier, I am trying to classify it as 1/0, and I was a bit perplexed that the shape of the sigmoid’s output was 1x512x1. Then I referred to a few other posts on the PyTorch forums and found the suggestion to change the output using view(1,-1), which was posted by you, @ptrblck. So, thanks for that! Anyway, after modifying the final output from the sigmoid using view(), I am still left with a 1x512 shaped vector, so I was wondering what to do. I saw your answer recommending using the last time step of the sentence vector (so, I am assuming this would be index 511 in the final output array) or calculating the mean of all 512 numbers in the vector. Is one of these inherently better than the other, and are there any other corrections needed in other parts of the code?

Since you are building a classification model, you shouldn’t use outr1 from outr1, _ = self.lstm1(X_embed) for further processing as it is. outr1 contains the hidden states of the last LSTM layer (last w.r.t. the number of LSTM layers, in case you have more than one) for every time step. This is why you have a shape of (batch_size, seq_len, hidden_size), in your case (1, 512, 128). While you can use outr1, you would need to concatenate, average, max-pool, etc. first. As a first approach, I would suggest the following: change the line(s) to

outr1, (h, c) = self.lstm1(X_embed)
h = torch.cat((h[-2], h[-1]), dim=1)
xr = self.dropout(h)

h has the shape (num_layers * num_directions, batch_size, hidden_size), so h[-2] and h[-1] are the final forward and backward hidden states (last with respect to the number of time steps). Since your LSTM is bidirectional, concatenating them gives a shape of (batch_size, 2 * hidden_size), in your case (1, 128). This is what you want and can push through the remaining Dropout and Linear layers down to (1, 1).
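
Putting the pieces together, a sketch of how the revised forward could look with this change (the layer names are the ones from the posted model; this is just one possible arrangement):

    def forward(self, X):
        X_embed = self.embedding(X)              # (1, 512, 512)
        outr1, (h, c) = self.lstm1(X_embed)      # h: (2, 1, 64) for the bidirectional LSTM
        h = torch.cat((h[-2], h[-1]), dim=1)     # (1, 128): final forward + backward states
        xr = self.dropout(h)
        xr = self.linear1(xr)                    # (1, 64)
        xr = self.dropout(xr)
        xr = self.linear2(xr)                    # (1, 1)
        return self.sigmoid(xr)                  # one prediction per sample

With a (1, 1) output and a (1,) target, the loss call in train_epoch would then line up as something like criterion(outputs.view(-1), targets.float()).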

Apart from anything else, why is your input sequence 512 items long, with most of the items being 0 (I assume for padding)? Since you use a batch size of 1 anyway, you don’t need padding, and 512 is too long for a vanilla LSTM to learn properly, I would argue.
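
(For what it’s worth, a minimal sketch of dropping the padding when the batch size is 1, assuming input_ids is the padded (1, 512) tensor and 0 is the padding index:)

    length = (input_ids != 0).sum(dim=1).max().item()   # number of real tokens, e.g. 6 here
    trimmed = input_ids[:, :length]                     # (1, length) without the trailing zeros
    outputs = model(trimmed)                            # the LSTM just runs over a shorter sequence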

Thanks a lot @vdw for your detailed answer. That makes it much clearer now. I will implement the changes that you suggested.

Yes, it is for padding. I have a large data sample with some quite long sequences. I’ve used 200 sequences from that larger dataset to make sure that everything works well before I start training the model on the actual data. The actual data does contain long sequences, and that’s the reason for padding and normalizing the input data points and converting them to length 512.

I did not know that vanilla LSTMs would have a hard time with longer sequences. I will try exploring other models like BERT, etc.

Again, thank you for your comments, I appreciate it. I would be glad to hear any further suggestions you have.