Training loss remains unchanged

Hi,
I have designed a regression model made up of a CNN and an LSTM. In the forward pass, I first apply the CNN and feed its output to the LSTM; then I compute the score.
The problem is that after the third epoch the loss stays constant. I tested different loss functions (L1Loss, MSELoss, …) and optimizers (AdamW, Adam, …), but it makes no difference.
Moreover, I tried computing the score from both the output and the last hidden state of my LSTM; again, I couldn’t see any significant change.
Here is my code:

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
print("Training for %d epochs..." % N_EPOCHS)
for epoch in range(1, N_EPOCHS + 1):
    total_loss = 0.0  # reset the running loss at the start of each epoch
    for i, (xA, xB, score) in enumerate(train_loader2, 1):
        predicted_score = model(xA, xB)
        optimizer.zero_grad()
        loss = criterion(predicted_score, score)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

        if i % 20 == 0:
            print('[{}] Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.2f}'.format(
                time_since(start), epoch, i * len(xA), len(train_loader2.dataset),
                100. * i * len(xA) / len(train_loader2.dataset),
                total_loss / i * len(xA)))

And here are my results:

[0m 3s] Train Epoch: 1 [1000/4500 (22%)]	Loss: 4.25
[0m 5s] Train Epoch: 1 [2000/4500 (44%)]	Loss: 2.97
[0m 6s] Train Epoch: 1 [3000/4500 (67%)]	Loss: 3.08
[0m 8s] Train Epoch: 1 [4000/4500 (89%)]	Loss: 2.67
[0m 11s] Train Epoch: 2 [1000/4500 (22%)]	Loss: 2.69
[0m 13s] Train Epoch: 2 [2000/4500 (44%)]	Loss: 2.19
[0m 15s] Train Epoch: 2 [3000/4500 (67%)]	Loss: 2.56
[0m 16s] Train Epoch: 2 [4000/4500 (89%)]	Loss: 2.28
.
.
[14m 0s] Train Epoch: 100 [1000/4500 (22%)]	Loss: 2.69
[14m 2s] Train Epoch: 100 [2000/4500 (44%)]	Loss: 2.19
[14m 4s] Train Epoch: 100 [3000/4500 (67%)]	Loss: 2.56
[14m 6s] Train Epoch: 100 [4000/4500 (89%)]	Loss: 2.28

I’d be grateful if someone could tell me why this might happen.

Can you post the code of your models with the forward() method, particularly the LSTM one?

Make sure you’ve checked the gradients. If they are zero (or very close to it), that means you’ve hit a local minimum. You can use this function:

import pprint

def pprint_step(model, lr: float = 0.01):
    # For each parameter: its values, the gradient direction (↑ all positive, ↓ all negative)
    # and the step Δ = grad * lr.
    all_parameters = {}
    for k, v in model.named_parameters():
        if v.grad is None:
            all_parameters[k] = str(v.detach().numpy()) + "<no grad>"
        else:
            direction = "↑" if all(v.grad > 0) else ("↓" if all(v.grad < 0) else "")
            all_parameters[k] = str(v.detach().numpy()) + direction + ", Δ: " + str(v.grad.numpy() * lr)
    s = pprint.PrettyPrinter().pformat(all_parameters)
    print(s)
    return s
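For example (a rough sketch, assuming your model, criterion, and one batch from your loader), you would call it right after loss.backward() so the gradients are populated:

predicted_score = model(xA, xB)
loss = criterion(predicted_score, score)
loss.backward()
pprint_step(model, lr=0.01)   # prints each parameter, the sign of its gradients, and grad * lr
optimizer.step()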

This is the forward function, which returns 50 scores; I have BATCH_SIZE = 50.

def forward(self, xA, xB):
    # inputs: (batch, seq_len)
    b_sizeA = xA.size(0)
    b_sizeB = xB.size(0)

    xA = xA.t()
    xB = xB.t()
    embeddedA = self.embedding(xA)  # (seq_len, batch, embed_size)
    embeddedB = self.embedding(xB)
    # back to (batch, seq_len, embed_size)
    embeddedA = embeddedA.permute(1, 0, 2)
    embeddedB = embeddedB.permute(1, 0, 2)

    # 3D -> 4D: add a channel dimension for nn.Conv2d
    embeddedA = embeddedA.unsqueeze(1)
    embeddedB = embeddedB.unsqueeze(1)

    xA = self.activation(self.convA1(embeddedA))
    xB = self.activation(self.convB1(embeddedB))

    # element-wise scale the embeddings by the (expanded) convolution output
    xA = embeddedA * xA.expand_as(embeddedA)
    xB = embeddedB * xB.expand_as(embeddedB)

    xA = torch.cat((embeddedA, xA), dim=3)
    xB = torch.cat((embeddedB, xB), dim=3)

    xA = xA.squeeze(1)  # (batch, seq_len, 2 * embed_size) after the concatenation
    xB = xB.squeeze(1)

    # initial hidden and cell states, here filled with ones
    # (Variable is deprecated; torch.ones(...) alone does the same)
    hA = Variable(torch.ones(self.num_layers * self.number_dir, b_sizeA, self.hidden_sizeA))
    cA = Variable(torch.ones(self.num_layers * self.number_dir, b_sizeA, self.hidden_sizeA))

    hB = Variable(torch.ones(self.num_layers * self.number_dir, b_sizeB, self.hidden_sizeB))
    cB = Variable(torch.ones(self.num_layers * self.number_dir, b_sizeB, self.hidden_sizeB))

    outputA, (hn_A, cn_A) = self.rnnA(xA, (hA, cA))
    outputB, (hn_B, cn_B) = self.rnnB(xB, (hB, cB))

    cos_sim_list = self.cosinesim(hn_A, hn_B)

    return cos_sim_list.float()

From the code I can see, I can only try to give some pointers that might help. Since you posted in the NLP category and use embeddings, I assume you’re working with text.

  • You’re unsqueezing your embeddings from 3D to 4D, so I assume you use nn.Conv2d. What kernel sizes did you choose? Note that you could also use nn.Conv1d, which would make the unsqueezing unnecessary (see the sketch after this list).

  • To be honest, I have no idea what the expand() method really does here or why you need it. But any method that is needed to “fix” the shape of tensors needs to be applied with great care.

  • Your setup looks a bit like a Siamese network. However, you use two different CNN layers and two different RNN layers, but only one embedding layer. In a Siamese setting, you would push both inputs through the same layers.

  • I trust that you define your LSTMs with batch_first=True :slight_smile:

  • Why do you transpose xA and xB and then permute the first two dimensions after the embedding layer? I’m pretty sure you can simply remove both .t() and both permute() operations.
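Here is a minimal sketch of the nn.Conv1d variant from the first point; the sizes and layer names are just placeholders, not taken from your model:

import torch
import torch.nn as nn

embed_size, seq_len, batch_size = 100, 28, 50             # placeholder sizes
embedded = torch.randn(batch_size, seq_len, embed_size)   # output of nn.Embedding, batch first

# nn.Conv1d slides over the sequence dimension and treats the embedding dims as channels,
# so no unsqueeze to 4D is needed.
conv = nn.Conv1d(in_channels=embed_size, out_channels=embed_size, kernel_size=5, padding=2)
out = conv(embedded.transpose(1, 2))   # (batch, embed_size, seq_len)
out = out.transpose(1, 2)              # back to (batch, seq_len, embed_size)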

Hi again Chris,
First of all, thanks a bunch for your tips.

  • As for your first hint, I used nn.Conv2d because I planned to add more channels in a later test.
  • OMG, the expand() call is a typo; it was just a trial and I shouldn’t have included it.
  • Yeah, you’re right, and I also use batch_first=True.
  • Regarding the permuting and transposing, you’re definitely right. I just wanted to prepare exactly the same shape as in the PyTorch embedding example :smile:. (I had thought that might be the problem.)
    Having said that, I still don’t know what is wrong. One more thing: my sentences have varying lengths, so I pad them to a fixed size (the same approach as in previous papers). I couldn’t use pack_padded_sequence because the network produced a higher loss with it; perhaps that’s because the two sentences in each input pair have different lengths. :slightly_frowning_face:
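For reference, the pack_padded_sequence pattern I mean is roughly this (a standalone sketch with placeholder sizes, not my exact code):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.LSTM(input_size=100, hidden_size=64, batch_first=True)   # placeholder sizes
embedded = torch.randn(50, 28, 100)                               # padded batch: (batch, max_seq_len, embed_size)
lengths = torch.randint(5, 29, (50,))                             # true lengths before padding

packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
output_packed, (hn, cn) = rnn(packed)
output, _ = pad_packed_sequence(output_packed, batch_first=True)  # hn corresponds to the last real timestep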

Hi, thanks a million for your help.
I checked it, and part of my result looks like this:

    'rnnB.weight_hh_l0':
        '[[ 0.04917971  0.06434657 -0.09789566 ... -0.08545043  0.06401704\n  '
        '-0.0960645 ]\n [ 0.04035911  0.03888848  0.00772235 ...  0.01090766  '
        '0.07309137\n   0.0360844 ]\n [-0.06821513  0.04650425  0.00499327 '
        '...  0.03282529  0.07157829\n  -0.0139528 ]\n ...\n [-0.0370195  '
        '-0.09515273  0.00834717 ... -0.07280935 -0.07355522\n   0.03910803]\n '
        '[-0.05833184  0.0991596   0.02914995 ... -0.06485003 -0.08581042\n   '
        '0.08041064]\n [ 0.05045321 -0.01596119  0.08300199 ...  0.05857178  '
        '0.06642925\n   0.059383  ]]<no grad>',

And after a while, this appeared:
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
I don’t know why this happens or which boolean value it refers to.

The choice of convolution layer type does not depend on the number of channels. Can you post the definitions of self.convA1 and self.convB1?

Sorry, I don’t know what this RuntimeError means.

Great!
Here it is:

self.convA1 = nn.Conv2d(1, 1, kernel_size=(5,100), padding=(2,0))

self.convB1 = nn.Conv2d(1, 1, kernel_size=(5,100), padding=(2,0))

The padding is (2,0) to maintain the height. Besides, each kernel covers five rows (words) at a time. I’ve also tried other shapes, for example (2,100), (28,1), …, and even symmetric ones such as (5,5), … . I want to try another dataset, simpler than the one I have now. Any suggestion is welcome. :slightly_smiling_face:

OK, that was what I was looking for. If you use nn.Conv2d over sentences, i.e., sequences of word vectors, then there are constraints on the kernel size: if your embedding vectors have a size of embed_size, then kernel_size=(x, embed_size), where x is the only value you can adjust. The issue is that it makes no semantic sense to convolve over the embedding dimension; see this older post of mine.

So if your embedding vectors have size 100, then kernel_size=(5,100) is valid while kernel_size=(5,5) is not.
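As an illustration (a small sketch with assumed sizes), you can see how the two kernel sizes behave:

import torch
import torch.nn as nn

batch_size, seq_len, embed_size = 50, 28, 100                # assumed sizes
embedded = torch.randn(batch_size, 1, seq_len, embed_size)   # 4D input for nn.Conv2d

conv_ok = nn.Conv2d(1, 1, kernel_size=(5, embed_size), padding=(2, 0))
print(conv_ok(embedded).shape)    # torch.Size([50, 1, 28, 1])  -> the embedding dim collapses to 1

conv_bad = nn.Conv2d(1, 1, kernel_size=(5, 5), padding=(2, 0))
print(conv_bad(embedded).shape)   # torch.Size([50, 1, 28, 96]) -> convolves across embedding dims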

By the way, what happens when you give the embeddings directly to the RNN(s), thus skipping the CNN part? I can’t intuitively tell what the CNN layer(s) are supposed to do.
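Just as a rough sketch of what I mean (assuming your existing embedding, rnnA/rnnB, and cosinesim modules, and that the LSTM input size matches the embedding size):

def forward_no_cnn(self, xA, xB):
    # xA, xB: (batch, seq_len) token indices
    embeddedA = self.embedding(xA)   # (batch, seq_len, embed_size) with batch_first=True
    embeddedB = self.embedding(xB)
    # default zero initial states; note this assumes rnnA/rnnB were built with input_size == embed_size
    _, (hn_A, _) = self.rnnA(embeddedA)
    _, (hn_B, _) = self.rnnB(embeddedB)
    return self.cosinesim(hn_A, hn_B).float()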