Pytorch Chatbot loss function with ignore_index instead of target's padding mask

Hello. I was going through the official PyTorch tutorial on implementing a seq2seq chatbot with attention. To gain a better understanding I used different methods to achieve similar results. One improvement, as I thought, was to use a plain nn.NLLLoss with ignore_index=padding_token. The previous solution, which used a binary mask tensor describing the padding of the target tensor, seemed too complicated; I wanted a simpler solution, but as it turned out, adding a few lines of code made everything worse.

criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')

In the definition of the train function I only made a small change in the loop where we forward propagate through the decoder one step at a time:

if use_teacher_forcing:
    for t in range(max_target_len):
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden, encoder_outputs
        )
        # Teacher forcing: next input is current target
        decoder_input = target_variable[t].view(1, -1)

        # Calculate and accumulate loss
        loss = criterion(decoder_output, target_variable[t])
        total_loss += loss
        print_losses.append(loss.item())


# Perform backpropagation
total_loss.backward()

# Clip gradients: gradients are modified in place
_ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
_ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

# Adjust model weights
encoder_optimizer.step()
decoder_optimizer.step()

return sum(print_losses)/max_target_len

Output:
Iteration: 1; Percent complete: 0.0%; Average loss: 8.5246
Iteration: 2; Percent complete: 0.1%; Average loss: 8.5403
Iteration: 3; Percent complete: 0.1%; Average loss: 8.5663
Iteration: 4; Percent complete: 0.1%; Average loss: 8.5691
Iteration: 5; Percent complete: 0.1%; Average loss: 8.4637
Iteration: 6; Percent complete: 0.1%; Average loss: 8.5378
Iteration: 7; Percent complete: 0.2%; Average loss: 8.5575
Iteration: 8; Percent complete: 0.2%; Average loss: 8.5145
Iteration: 9; Percent complete: 0.2%; Average loss: 8.5717
Iteration: 10; Percent complete: 0.2%; Average loss: 8.5122
Iteration: 11; Percent complete: 0.3%; Average loss: 8.5587
Iteration: 12; Percent complete: 0.3%; Average loss: 8.5491
Iteration: 13; Percent complete: 0.3%; Average loss: 8.5933
Iteration: 14; Percent complete: 0.4%; Average loss: 8.5231
Iteration: 15; Percent complete: 0.4%; Average loss: 8.5239
… and so on

The total loss increases or fluctuates around the same value, but it doesn’t want to decrease. Has anybody tried to use any kind of loss function with ignore_index in a seq2seq model with variable-length sequences and gotten positive results? nn.CrossEntropyLoss behaves similarly.

I would like to emphasize that the tensors passed to the loss function have the following dimensions:
loss = criterion(decoder_output, target_variable[t])
decoder_output: (batch_size, vocab_size)
target_variable[t]: (batch_size)
Maybe there is a problem here? Are these the proper dimensions, or should I reshape these tensors? In the original code they calculate the negative log manually, passing it “concatenated” tensors with final shape (batch_size):
crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
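For reference, here is a minimal comparison on dummy tensors (PAD_token, the shapes, and the token ids are assumptions matching the description above): the masked formulation on softmax probabilities and NLLLoss(ignore_index=...) on log_softmax outputs should agree for the non-padded positions.

import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_token = 0                                   # assumed padding index
batch_size, vocab_size = 4, 10

# One decoder time step: raw scores with shape (batch_size, vocab_size)
scores = torch.randn(batch_size, vocab_size)
probs = F.softmax(scores, dim=1)                # what the tutorial's decoder outputs
log_probs = F.log_softmax(scores, dim=1)        # what NLLLoss expects

# Targets for the same time step, shape (batch_size); the last entry is padding
target = torch.tensor([3, 7, 1, PAD_token])
mask = target != PAD_token

# Tutorial-style masked loss on probabilities (the gather/log line above)
crossEntropy = -torch.log(torch.gather(probs, 1, target.view(-1, 1)).squeeze(1))
masked_loss = crossEntropy.masked_select(mask).mean()

# NLLLoss with ignore_index on log-probabilities: averages over non-ignored targets only
criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')
nll_loss = criterion(log_probs, target)

print(masked_loss.item(), nll_loss.item())      # should match up to floating-point error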

I tried changing the learning rate, but it didn’t help at all.

One additional question. In my version I used:
packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True, enforce_sorted=False), so I didn’t sort the batch in descending order by sequence length. I noticed that my code runs slower than the original; could that be caused by using an unsorted batch, or should it work well anyway?
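For what it's worth, enforce_sorted=False only adds an internal sort and index permutation before packing, which is usually a minor overhead compared to the RNN itself; a rough sketch of both call styles on made-up tensors (shapes and names are assumptions, not the tutorial's code):

import torch
import torch.nn as nn

batch_size, max_len, hidden_size = 3, 5, 8
embedded = torch.randn(batch_size, max_len, hidden_size)     # batch_first=True layout
input_lengths = torch.tensor([3, 5, 2])                      # unsorted lengths

# Option 1: let pack_padded_sequence sort internally (adds a small index permutation)
packed = nn.utils.rnn.pack_padded_sequence(
    embedded, input_lengths, batch_first=True, enforce_sorted=False)

# Option 2: sort the batch yourself in descending order of length (as the tutorial does)
sorted_lengths, sort_idx = input_lengths.sort(descending=True)
packed_sorted = nn.utils.rnn.pack_padded_sequence(
    embedded[sort_idx], sorted_lengths, batch_first=True)

# Either packed sequence can be fed to an RNN and padded back afterwards
gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
outputs, hidden = gru(packed)
outputs, lengths = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)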

Any suggestion would be very appreciated.

Best regards.

I suggest you open an issue on pytorch/tutorials and ping zhangguanheng66 and SethHWeidman.

Do you need to use optimizer.zero_grad() before loss.backward()?

I don’t have to call optimizer.zero_grad() before loss.backward(); I can also call it afterwards. I think it works the same no matter which version I use, since the gradients are set to zero each time the train function is called, i.e. once per batch.
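Just to make that concrete, here is a toy sketch of the two orderings with a placeholder model and optimizer (not the chatbot code); the only requirement is that the gradients are zeroed at some point before the next backward().

import torch
import torch.nn as nn

# Toy model and optimizer, placeholders for the encoder/decoder and their optimizers
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 2)

# Variant A: zero the gradients before the backward pass
for _ in range(2):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

optimizer.zero_grad()   # clear the leftover gradients from the last step of variant A

# Variant B: zero the gradients after the optimizer step,
# which is equivalent as long as it happens before the next backward()
for _ in range(2):
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()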

I made some changes: now I accumulate the decoder outputs and then calculate the loss on the entire batch.

for input_seq, target_seq, input_lengths, target_lengths in train_loader:
    # Run a training iteration with the batch
    output = train(input_seq, target_seq, input_lengths, target_lengths, encoder, decoder, embedding,
                   encoder_optimizer, decoder_optimizer, clip, teacher_forcing_ratio, criterion)

    #print(output.contiguous().view(-1, len(voc.word_count)).shape)
    #print(target_seq.contiguous().view(-1).shape)

    loss = criterion(output.view(-1, len(voc.word_count)),
                     target_seq.contiguous().view(-1))
    loss = loss / target_seq.size(0)

    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

In the train function I loop through the words in the given sequences:

# Forward propagate through the decoder using SOS tokens
decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)

output_vectors.append(decoder_output)

# Forward the batch of sequences through the decoder one time step at a time
if use_teacher_forcing:
    for t in range(seq_len - 1):
        # Teacher forcing: current target word is the next input
        decoder_input = target_seq[:, t].view(-1, 1)  # (batch_size, 1)

        # Forward propagate through the decoder
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)

        output_vectors.append(decoder_output)

return torch.cat(output_vectors, dim=1)

When calculating the loss, the decoder inputs correspond to the targets in the following manner:
decoder_input (trained on): SOS -> 1 -> 2 -> 3 -> 4
targets: 1 -> 2 -> 3 -> 4 -> END
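As a small illustration of that shift on a dummy batch (the SOS/EOS/PAD ids are assumptions):

import torch

SOS_token, EOS_token, PAD_token = 1, 2, 0      # assumed special-token ids

# One padded target sequence: 5 -> 6 -> 7 -> 8 -> END
target_seq = torch.tensor([[5, 6, 7, 8, EOS_token]])          # (batch_size, seq_len)

# Decoder inputs are the targets shifted right by one, with SOS prepended
sos_column = torch.full((target_seq.size(0), 1), SOS_token, dtype=torch.long)
decoder_input = torch.cat([sos_column, target_seq[:, :-1]], dim=1)

print(decoder_input)   # tensor([[1, 5, 6, 7, 8]])
print(target_seq)      # tensor([[5, 6, 7, 8, 2]])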

In the loss function I used the following shapes:
loss = criterion(output.view(-1, len(voc.word_count)),
                 target_seq.contiguous().view(-1))
output shape: [batch_size * seq_len, vocab_size]
target shape: [batch_size * seq_len]
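A standalone sanity check of those shapes on dummy tensors (the PAD_token value is an assumption):

import torch
import torch.nn as nn

PAD_token = 0                                   # assumed padding index
batch_size, seq_len, vocab_size = 2, 5, 10

criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')

# Dummy accumulated decoder outputs: log-probabilities, shape (batch_size, seq_len, vocab_size)
output = torch.log_softmax(torch.randn(batch_size, seq_len, vocab_size), dim=-1)

# Dummy padded targets, shape (batch_size, seq_len)
target_seq = torch.tensor([[4, 2, 7, 1, PAD_token],
                           [3, 5, 1, PAD_token, PAD_token]])

# Flatten to (batch_size * seq_len, vocab_size) and (batch_size * seq_len)
loss = criterion(output.view(-1, vocab_size),
                 target_seq.contiguous().view(-1))
print(loss.item())

Note that with reduction='mean' and ignore_index set, NLLLoss already averages over the non-padded targets, so an extra division by the batch size would shrink the loss (and its gradients) further.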

It is a totally different approach, but the loss still fluctuates at the same level.

Any ideas what can be wrong here?

Best regards.
Radek

I checked the sum of the gradients in encoder.embedding.weight.grad during training, both for the model that uses the padding mask and for the one that uses the criterion with ignore_index.
For the model with the padding mask, the lower the loss got, the lower the gradients, but they decreased very slowly.
Iteration: 1; tensor(-0.1997)
Iteration: 35; tensor(-0.0057)

In contrast to the preceding example, the model with the criterion starts with much lower gradients, and during training the gradients diminish much faster.
Iteration: 1; tensor(-5.2681e-05)
Iteration: 35; tensor(2.0964e-10)

The very small gradients mean that the model doesn’t converge, but how can I remedy that problem? Could anybody please explain this behavior of the gradients? Any other advice would be appreciated as well.
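In case it helps with reproducing the comparison, here is a rough sketch of logging such gradient statistics after backward(), using a placeholder embedding layer and a dummy loss rather than the actual encoder; since the signed sum can cancel out, the norm may be a more informative number to track:

import torch
import torch.nn as nn

# Placeholder embedding layer standing in for encoder.embedding
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16, padding_idx=0)
inputs = torch.randint(1, 100, (4, 7))          # dummy token ids, shape (batch, seq_len)

loss = embedding(inputs).pow(2).mean()          # dummy loss, just to produce gradients
loss.backward()

grad = embedding.weight.grad
print('grad sum :', grad.sum().item())          # signed entries can cancel each other out
print('grad norm:', grad.norm().item())         # usually a more telling magnitude diagnostic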

I figured out what was wrong with my model. It turned out that, despite my loss function returning reasonable values, the loss was not calculated properly, and as a consequence the model did not learn. The output of my AttentionDecoder was passed through softmax; then I used CrossEntropyLoss or NLLLoss (I tried both), but I did not change the softmax to log_softmax in the case of NLLLoss, and in the case of CrossEntropyLoss I did not remove the softmax at all, even though CrossEntropyLoss already combines log_softmax and NLLLoss.
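For anyone hitting the same thing, a minimal sketch of the two valid pairings on dummy tensors (names and the PAD_token value are placeholders, not the tutorial's decoder):

import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_token = 0                                   # assumed padding index
batch_size, vocab_size = 4, 10

# Raw (un-normalized) decoder scores for one time step, shape (batch_size, vocab_size)
scores = torch.randn(batch_size, vocab_size)
target = torch.tensor([3, 7, 1, PAD_token])

# Pairing 1: log_softmax output + NLLLoss
log_probs = F.log_softmax(scores, dim=1)
loss_nll = nn.NLLLoss(ignore_index=PAD_token)(log_probs, target)

# Pairing 2: raw scores + CrossEntropyLoss (which applies log_softmax internally)
loss_ce = nn.CrossEntropyLoss(ignore_index=PAD_token)(scores, target)

# The two agree; feeding softmax-ed probabilities into either criterion instead
# produces a valid-looking but wrong loss, as described above.
print(loss_nll.item(), loss_ce.item())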