When dealing with variable-length input sequences, if we use `pack_padded_sequence`, do we still need to set the `ignore_index` parameter of the loss function so that no gradient is computed for the padding elements?
For example, in the image captioning PyTorch tutorial, the forward method of the DecoderRNN is:

```python
embeddings = self.embed(captions)
# Prepend the image feature as the first "token" of each sequence
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
hiddens, _ = self.lstm(packed)
# hiddens[0] is the flat .data tensor of the PackedSequence (padding already removed)
outputs = self.linear(hiddens[0])
```
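(For reference, here is a minimal, self-contained sketch of what I understand `pack_padded_sequence` to do with the padding; the toy shapes and values are my own, not from the tutorial.)

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Two sequences of lengths 3 and 2, padded to length 3 (batch_first=True)
padded = torch.tensor([[1., 2., 3.],
                       [4., 5., 0.]]).unsqueeze(-1)  # shape (2, 3, 1)
lengths = [3, 2]

packed = pack_padded_sequence(padded, lengths, batch_first=True)
# packed.data holds only the 3 + 2 = 5 real timesteps; the padded 0. is dropped
print(packed.data.shape)  # torch.Size([5, 1])
```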
But during training, the loop directly uses the packed targets:

```python
# Set mini-batch dataset
images = images.to(device)
captions = captions.to(device)
# Pack the targets the same way, so they align with the packed outputs
targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

# Forward, backward and optimize
features = encoder(images)
outputs = decoder(features, captions, lengths)
loss = criterion(outputs, targets)
decoder.zero_grad()
encoder.zero_grad()
loss.backward()
optimizer.step()
```
My questions are:

Q1: If we use `pack_padded_sequence` to pack the padded sequence, it seems unnecessary to set `ignore_index` in the loss function, since the padding elements never reach the loss. However, if we then use `pad_packed_sequence` to unpack the result returned by the RNN, do we need to set `ignore_index` in the loss function?
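(For context on what I mean in Q1, a sketch of the unpacked path; the layer sizes and the `PAD` index are made-up placeholders.)

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

PAD = 0  # hypothetical padding index for the targets
rnn = nn.LSTM(input_size=4, hidden_size=4, batch_first=True)
linear = nn.Linear(4, 10)

x = torch.randn(2, 3, 4)           # batch of 2, padded to length 3
lengths = [3, 2]
targets = torch.tensor([[1, 2, 3],
                        [4, 5, PAD]])  # last step of sequence 2 is padding

packed = pack_padded_sequence(x, lengths, batch_first=True)
out, _ = rnn(packed)
unpacked, _ = pad_packed_sequence(out, batch_first=True)  # padding re-inserted as zeros
logits = linear(unpacked)                                  # shape (2, 3, 10)

# After unpacking, the padded positions are back in the tensor,
# so here ignore_index is needed to mask them out of the loss:
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
loss = criterion(logits.reshape(-1, 10), targets.reshape(-1))
```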
Q2: How can I verify the answer to Q1 with a toy program, for example by printing gradients? Which element's gradient should I print to figure out whether any gradient is backpropagated to the padding elements?
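(Here is the kind of toy program I have in mind for Q2: feed a leaf tensor through `pack_padded_sequence` and an LSTM, then inspect the gradient at the padded position. The sizes are arbitrary.)

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
x = torch.randn(2, 3, 4, requires_grad=True)  # second sequence has one padded step
lengths = [3, 2]
lstm = torch.nn.LSTM(input_size=4, hidden_size=5, batch_first=True)

packed = pack_padded_sequence(x, lengths, batch_first=True)
out, _ = lstm(packed)
out.data.sum().backward()  # out.data is the flat tensor of the PackedSequence

# The padded timestep is x[1, 2]; since packing drops it from the graph,
# no gradient should accumulate there, while real timesteps do get gradient.
print(x.grad[1, 2])              # expected: all zeros
print(x.grad[1, 1].abs().sum())  # expected: nonzero
```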
Thank you so much!