Pytorch Chatbot loss function with ignore_index instead of target's padding mask

Hello. I was going through the official PyTorch tutorial on implementing a seq2seq chatbot with attention. To gain a better understanding I used different methods to achieve similar results. One improvement, as I thought, was to use a plain nn.NLLLoss with ignore_index=padding_token. The previous solution, which used a binary mask tensor describing the padding of the target tensor, seemed too complicated; I wanted a simpler solution, but as it turned out, adding a few lines of code made everything worse.

criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')

In the definition of the train function I only made a small change in the loop where we forward propagate through the decoder one step at a time:

if use_teacher_forcing:
    for t in range(max_target_len):
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden, encoder_outputs
        )
        # Teacher forcing: next input is current target
        decoder_input = target_variable[t].view(1, -1)

        # Calculate and accumulate loss
        loss = criterion(decoder_output, target_variable[t])
        total_loss += loss
        print_losses.append(loss.item())


# Perform backpropagation
total_loss.backward()

# Clip gradients: gradients are modified in place
_ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
_ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

# Adjust model weights
encoder_optimizer.step()
decoder_optimizer.step()

return sum(print_losses)/max_target_len

Output:
Iteration: 1; Percent complete: 0.0%; Average loss: 8.5246
Iteration: 2; Percent complete: 0.1%; Average loss: 8.5403
Iteration: 3; Percent complete: 0.1%; Average loss: 8.5663
Iteration: 4; Percent complete: 0.1%; Average loss: 8.5691
Iteration: 5; Percent complete: 0.1%; Average loss: 8.4637
Iteration: 6; Percent complete: 0.1%; Average loss: 8.5378
Iteration: 7; Percent complete: 0.2%; Average loss: 8.5575
Iteration: 8; Percent complete: 0.2%; Average loss: 8.5145
Iteration: 9; Percent complete: 0.2%; Average loss: 8.5717
Iteration: 10; Percent complete: 0.2%; Average loss: 8.5122
Iteration: 11; Percent complete: 0.3%; Average loss: 8.5587
Iteration: 12; Percent complete: 0.3%; Average loss: 8.5491
Iteration: 13; Percent complete: 0.3%; Average loss: 8.5933
Iteration: 14; Percent complete: 0.4%; Average loss: 8.5231
Iteration: 15; Percent complete: 0.4%; Average loss: 8.5239
… and so on

The total loss increases or fluctuates around the same value, but it doesn’t want to decrease. Has anybody tried to use any kind of loss function with ignore_index in a seq2seq model with variable-length sequences and gotten positive results? nn.CrossEntropyLoss behaves similarly.

I would like to emphasize that the tensors passed to the loss function have the following dimensions:
loss = criterion(decoder_output, target_variable[t])
decoder_output: (batch_size, vocab_size)
target_variable[t]: (batch_size)
Maybe there is a problem here? Are these the proper dimensions, or should I reshape these tensors? In the original code they calculate the negative log manually, passing it “concatenated” tensors with final shape (batch_size):
crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
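For reference, here is a minimal comparison on dummy tensors (PAD_token, the shapes, and the token ids are assumptions matching the description above): the masked formulation on softmax probabilities and NLLLoss(ignore_index=...) on log_softmax outputs should agree for the non-padded positions.

import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_token = 0                                   # assumed padding index
batch_size, vocab_size = 4, 10

# One decoder time step: raw scores with shape (batch_size, vocab_size)
scores = torch.randn(batch_size, vocab_size)
probs = F.softmax(scores, dim=1)                # what the tutorial's decoder outputs
log_probs = F.log_softmax(scores, dim=1)        # what NLLLoss expects

# Targets for the same time step, shape (batch_size); the last entry is padding
target = torch.tensor([3, 7, 1, PAD_token])
mask = target != PAD_token

# Tutorial-style masked loss on probabilities (the gather/log line above)
crossEntropy = -torch.log(torch.gather(probs, 1, target.view(-1, 1)).squeeze(1))
masked_loss = crossEntropy.masked_select(mask).mean()

# NLLLoss with ignore_index on log-probabilities: averages over non-ignored targets only
criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')
nll_loss = criterion(log_probs, target)

print(masked_loss.item(), nll_loss.item())      # should match up to floating-point error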

I tried changing the learning rate, but it didn’t help at all.

One additional question. In my version I used:
packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True, enforce_sorted=False), so I didn’t sort the batch in descending order by sequence length. I noticed that my code runs slower than the original; could that be caused by using an unsorted batch, or should it work well anyway?
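For what it's worth, enforce_sorted=False only adds an internal sort and index permutation before packing, which is usually a minor overhead compared to the RNN itself; a rough sketch of both call styles on made-up tensors (shapes and names are assumptions, not the tutorial's code):

import torch
import torch.nn as nn

batch_size, max_len, hidden_size = 3, 5, 8
embedded = torch.randn(batch_size, max_len, hidden_size)     # batch_first=True layout
input_lengths = torch.tensor([3, 5, 2])                      # unsorted lengths

# Option 1: let pack_padded_sequence sort internally (adds a small index permutation)
packed = nn.utils.rnn.pack_padded_sequence(
    embedded, input_lengths, batch_first=True, enforce_sorted=False)

# Option 2: sort the batch yourself in descending order of length (as the tutorial does)
sorted_lengths, sort_idx = input_lengths.sort(descending=True)
packed_sorted = nn.utils.rnn.pack_padded_sequence(
    embedded[sort_idx], sorted_lengths, batch_first=True)

# Either packed sequence can be fed to an RNN and padded back afterwards
gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
outputs, hidden = gru(packed)
outputs, lengths = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)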

Any suggestion would be very appreciated.

Best regards.

I suggest you open an issue on pytorch/tutorials and ping zhangguanheng66 and SethHWeidman.

Do you need to use optimizer.zero_grad() before loss.backward()?

I don’t have to call optimizer.zero_grad() before loss.backward(); I can also call it afterwards. I think it works the same no matter which version I use, since the gradients are set to zero each time the train function is called, i.e. once per batch.
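Just to make that concrete, here is a toy sketch of the two orderings with a placeholder model and optimizer (not the chatbot code); the only requirement is that the gradients are zeroed at some point before the next backward().

import torch
import torch.nn as nn

# Toy model and optimizer, placeholders for the encoder/decoder and their optimizers
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 2)

# Variant A: zero the gradients before the backward pass
for _ in range(2):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

optimizer.zero_grad()   # clear the leftover gradients from the last step of variant A

# Variant B: zero the gradients after the optimizer step,
# which is equivalent as long as it happens before the next backward()
for _ in range(2):
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()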

I made some changes: now I accumulate the decoder outputs and then calculate the loss on the entire batch.

for input_seq, target_seq, input_lengths, target_lengths in train_loader:
    # Run a training iteration with the batch
    output = train(input_seq, target_seq, input_lengths, target_lengths, encoder, decoder, embedding,
                   encoder_optimizer, decoder_optimizer, clip, teacher_forcing_ratio, criterion)

    #print(output.contiguous().view(-1, len(voc.word_count)).shape)
    #print(target_seq.contiguous().view(-1).shape)

    loss = criterion(output.view(-1, len(voc.word_count)),
                     target_seq.contiguous().view(-1))
    loss = loss / target_seq.size(0)

    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

In the train function I loop through the words in the given sequences:

# Forward propagate through the decoder using SOS tokens
decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)

output_vectors.append(decoder_output)

# Forward the batch of sequences through the decoder one time step at a time
if use_teacher_forcing:
    for t in range(seq_len - 1):
        # Teacher forcing: current target word is the next input
        decoder_input = target_seq[:, t].view(-1, 1)  # (batch_size, 1)

        # Forward propagate through the decoder
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)

        output_vectors.append(decoder_output)

return torch.cat(output_vectors, dim=1)

When calculating the loss, the decoder inputs correspond to the targets in the following manner:
decoder_input (trained on): SOS -> 1 -> 2 -> 3 -> 4
targets: 1 -> 2 -> 3 -> 4 -> END
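As a small illustration of that shift on a dummy batch (the SOS/EOS/PAD ids are assumptions):

import torch

SOS_token, EOS_token, PAD_token = 1, 2, 0      # assumed special-token ids

# One padded target sequence: 5 -> 6 -> 7 -> 8 -> END
target_seq = torch.tensor([[5, 6, 7, 8, EOS_token]])          # (batch_size, seq_len)

# Decoder inputs are the targets shifted right by one, with SOS prepended
sos_column = torch.full((target_seq.size(0), 1), SOS_token, dtype=torch.long)
decoder_input = torch.cat([sos_column, target_seq[:, :-1]], dim=1)

print(decoder_input)   # tensor([[1, 5, 6, 7, 8]])
print(target_seq)      # tensor([[5, 6, 7, 8, 2]])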

In the loss function I used the following shapes:
loss = criterion(output.view(-1, len(voc.word_count)),
                 target_seq.contiguous().view(-1))
output shape: [batch_size * seq_len, vocab_size]
target shape: [batch_size * seq_len]
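A standalone sanity check of those shapes on dummy tensors (the PAD_token value is an assumption):

import torch
import torch.nn as nn

PAD_token = 0                                   # assumed padding index
batch_size, seq_len, vocab_size = 2, 5, 10

criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')

# Dummy accumulated decoder outputs: log-probabilities, shape (batch_size, seq_len, vocab_size)
output = torch.log_softmax(torch.randn(batch_size, seq_len, vocab_size), dim=-1)

# Dummy padded targets, shape (batch_size, seq_len)
target_seq = torch.tensor([[4, 2, 7, 1, PAD_token],
                           [3, 5, 1, PAD_token, PAD_token]])

# Flatten to (batch_size * seq_len, vocab_size) and (batch_size * seq_len)
loss = criterion(output.view(-1, vocab_size),
                 target_seq.contiguous().view(-1))
print(loss.item())

Note that with reduction='mean' and ignore_index set, NLLLoss already averages over the non-padded targets, so an extra division by the batch size would shrink the loss (and its gradients) further.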

It is a totally different approach, but the loss still fluctuates at the same level.

Any ideas what can be wrong here?

Best regards.
Radek

I checked the sum of the gradients in encoder.embedding.weight.grad during training, both for the model that uses the padding mask and for the one that uses the criterion with ignore_index.
For the model with the padding mask, the lower the loss got, the lower the gradients, but they decreased very slowly.
Iteration: 1; tensor(-0.1997)
Iteration: 35; tensor(-0.0057)

In contrast to the preceding example, the model with the criterion starts with much lower gradients, and during training the gradients diminish much faster.
Iteration: 1; tensor(-5.2681e-05)
Iteration: 35; tensor(2.0964e-10)

The very small gradients mean that the model doesn’t converge, but how can I remedy that problem? Could anybody please explain this behavior of the gradients? Any other advice would be appreciated as well.
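In case it helps with reproducing the comparison, here is a rough sketch of logging such gradient statistics after backward(), using a placeholder embedding layer and a dummy loss rather than the actual encoder; since the signed sum can cancel out, the norm may be a more informative number to track:

import torch
import torch.nn as nn

# Placeholder embedding layer standing in for encoder.embedding
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16, padding_idx=0)
inputs = torch.randint(1, 100, (4, 7))          # dummy token ids, shape (batch, seq_len)

loss = embedding(inputs).pow(2).mean()          # dummy loss, just to produce gradients
loss.backward()

grad = embedding.weight.grad
print('grad sum :', grad.sum().item())          # signed entries can cancel each other out
print('grad norm:', grad.norm().item())         # usually a more telling magnitude diagnostic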

I figured out what was wrong with my model. It turned out that, despite my loss function returning reasonable values, the loss was not calculated properly, and as a consequence the model did not learn. The output of my AttentionDecoder was passed through softmax; then I used CrossEntropyLoss or NLLLoss (I tried both), but I did not change the softmax to log_softmax in the case of NLLLoss, and in the case of CrossEntropyLoss I did not remove the softmax at all, even though CrossEntropyLoss already combines log_softmax and NLLLoss.
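For anyone hitting the same thing, a minimal sketch of the two valid pairings on dummy tensors (names and the PAD_token value are placeholders, not the tutorial's decoder):

import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_token = 0                                   # assumed padding index
batch_size, vocab_size = 4, 10

# Raw (un-normalized) decoder scores for one time step, shape (batch_size, vocab_size)
scores = torch.randn(batch_size, vocab_size)
target = torch.tensor([3, 7, 1, PAD_token])

# Pairing 1: log_softmax output + NLLLoss
log_probs = F.log_softmax(scores, dim=1)
loss_nll = nn.NLLLoss(ignore_index=PAD_token)(log_probs, target)

# Pairing 2: raw scores + CrossEntropyLoss (which applies log_softmax internally)
loss_ce = nn.CrossEntropyLoss(ignore_index=PAD_token)(scores, target)

# The two agree; feeding softmax-ed probabilities into either criterion instead
# produces a valid-looking but wrong loss, as described above.
print(loss_nll.item(), loss_ce.item())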