I am working on an image captioning task with PyTorch.
In seq2seq models, padding is used to handle variable-length sequences.
Additionally, the per-token loss (a vector, not a scalar) is multiplied by a mask so that the padding does not affect the loss.

In TensorFlow, I can do this as below.

# targets is an int64 tensor of shape (batch_size, padded_length) which contains word indices.
# masks is a tensor of shape (batch_size, padded_length) which contains 0 or 1 (0 if pad otherwise 1).
outputs = decoder(...) # unnormalized scores of shape (batch_size, padded_length, vocab_size)
outputs = tf.reshape(outputs, (-1, vocab_size))
targets = tf.reshape(targets, [-1])
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=outputs) # loss of shape (batch_size*padded_length)
masks = tf.reshape(masks, [-1])
loss = losses * masks

In PyTorch, nn.CrossEntropyLoss() returns a scalar rather than a per-element tensor, so I cannot multiply the loss by the masks.

criterion = nn.CrossEntropyLoss()
outputs = decoder(features, inputs) # (batch_size, padded_length, vocab_size)
loss = criterion(outputs.view(-1, vocab_size), targets.view(-1)) # this gives a scalar not tensor

A non-averaged cross-entropy loss is coming soon. Until then you can write your own using log_softmax and advanced indexing.
In addition, though I don’t think it helps you here, nn.LSTM now has support for variable-length sequences without including padding, meaning that sequence model results will not be affected by the influence of padding tokens even with bidirectional RNNs. There are utility functions provided for creating the packed array data structure (~TF’s TensorArray) needed for this.

How can I use nn.LSTM with variable-length sequences and without padding?
In the code below, torch.utils.data.DataLoader stacks the individual tensors to construct each mini-batch. This forces me to pad each sequence so that the tensors have a fixed size.

cap = CocoCaptions(root='./data/train2014resized',
                   annFile='./data/annotations/captions_train2014.json',
                   vocab=vocab,
                   transform=transform,
                   target_transform=transforms.ToTensor())
data_loader = torch.utils.data.DataLoader(
    cap, batch_size=16, shuffle=True, num_workers=2)
for i, (images, input_seqs, target_seqs, masks) in enumerate(data_loader):
    # images: a tensor of shape (batch_size, 3, 256, 256).
    # input_seqs, target_seqs, masks: tensors of shape (batch_size, padded_length).

If padded is a Variable with padding in it and lengths is a tensor containing the length of each sequence, then this is how to run a (potentially bidirectional) LSTM over the sequences in a way that doesn’t include padding, then pad the result in order to use it in further computations.
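A minimal sketch of the pack, run, unpack pattern described here, written against the current tensor API rather than Variable. The sizes, the batch_first layout, and the names packed, packed_out, and out are assumptions chosen for illustration; lengths is kept sorted in decreasing order, which pack_padded_sequence expects by default.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

batch_size, max_len, input_size, hidden_size = 4, 5, 3, 6

# Padded batch in batch_first layout; positions past each length are padding.
padded = torch.randn(batch_size, max_len, input_size)
lengths = torch.tensor([5, 3, 3, 2])  # true lengths, sorted in decreasing order

lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)

# Pack so the LSTM never steps over padding positions.
packed = pack_padded_sequence(padded, lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor for use in further computations.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```

Here out has shape (batch_size, max_len, 2 * hidden_size), and the rows past each sequence's length are filled with zeros.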

Just as https://github.com/Element-Research/rnn has a MaskZero module, won’t PyTorch also need a wrapper to deal with padded inputs and gradients between the LSTM outputs and the final layer? Is there any plan for that, or is there another elegant way to deal with it?

The cross-entropy loss for a particular vector of output scores and a target index is the value at that index of the negative log softmax of the vector of scores, so you can run negative log softmax on the whole score tensor, pick out the values you want using gather (advanced indexing was briefly semi-supported for this, and will be fully supported eventually), then sum/average the results.
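The recipe above could be sketched roughly as follows. The tensor sizes, the random data, and the hand-written masks are placeholders for illustration; outputs, targets, and masks follow the shapes used in the original question.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, padded_length, vocab_size = 2, 4, 7

outputs = torch.randn(batch_size, padded_length, vocab_size)         # unnormalized scores
targets = torch.randint(0, vocab_size, (batch_size, padded_length))  # int64 word indices
masks = torch.tensor([[1., 1., 1., 0.],
                      [1., 1., 0., 0.]])  # 1 for real tokens, 0 for padding

# Negative log softmax over the vocabulary, flattened to (batch_size*padded_length, vocab_size).
neg_log_probs = -F.log_softmax(outputs.reshape(-1, vocab_size), dim=1)

# Pick the entry at each target index; the index tensor must be 2-D like neg_log_probs.
losses = neg_log_probs.gather(1, targets.reshape(-1, 1)).squeeze(1)

# Zero out the padding positions, then average over the real tokens only.
masked_losses = losses * masks.reshape(-1)
loss = masked_losses.sum() / masks.sum()
```

Dividing by masks.sum() rather than by the total number of positions keeps the average independent of how much padding a batch happens to contain.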

Could you let me know how to deal with the following example case?

Assume that I have an input where the maximum sequence length is 5, the minibatch size is 4, and there are 3 possible labels. idxs holds the effective length of each sequence in the input.

input = Variable(torch.randn(5,4,3))
idxs = [5,3,3,2]
target = Variable(torch.LongTensor(5,4))
# assume the target labels are assigned corresponding to the input.

Then, in a sequence tagging task, I’d like to get the cross-entropy errors for all time steps of the first sequence, the first 3 time steps of the second sequence, and so on, according to the values of idxs.
How could I use advanced indexing to handle this sequence tagging task with variable-length input sequences?

(I thought masked_select might be used for my purpose but I wonder what would be the most elegant one at this moment before some other features are added.)
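One possible way to handle this case is to build a boolean mask from idxs and index with it, which has the same effect as masked_select but keeps the label dimension intact. This is only a sketch: inputs stands in for the post's input, the random target is a placeholder for real labels, and the names mask, valid_scores, and valid_targets are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

max_len, batch_size, num_labels = 5, 4, 3

inputs = torch.randn(max_len, batch_size, num_labels)         # (time, batch, labels)
idxs = [5, 3, 3, 2]                                           # effective lengths
target = torch.randint(0, num_labels, (max_len, batch_size))  # placeholder labels

# mask[t, b] is True when time step t lies inside sequence b.
lengths = torch.tensor(idxs)
mask = torch.arange(max_len).unsqueeze(1) < lengths.unsqueeze(0)  # (time, batch)

# Keep only the valid time steps, then apply ordinary cross entropy to them.
valid_scores = inputs[mask]    # (sum(idxs), num_labels)
valid_targets = target[mask]   # (sum(idxs),)
loss = F.cross_entropy(valid_scores, valid_targets)
```

Because the padded steps are dropped before the loss is computed, they contribute neither to the loss value nor to the gradients.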

The pack_padded_sequence call can be placed at any point in the DAG, right? That is, given a sequence, apply some dense layer to it, then pack it and give it to the LSTM.

@pranav it’s packed in an additional structure to hold the sequence lengths, but you might take out its .data attribute, compute a function on that, and rewrap it in a new torch.nn.utils.rnn.PackedSequence object.
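A rough sketch of that rewrapping idea, assuming a recent PyTorch release in which PackedSequence also carries sorted_indices and unsorted_indices. The layer sizes and the names dense, projected, and repacked are illustrative only.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import PackedSequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

padded = torch.randn(4, 5, 3)         # (batch, time, features)
lengths = torch.tensor([5, 3, 3, 2])  # sorted in decreasing order

packed = pack_padded_sequence(padded, lengths, batch_first=True)

# Apply a dense layer to the flat (total_valid_steps, features) data tensor.
dense = nn.Linear(3, 8)
projected = dense(packed.data)

# Rewrap with the original bookkeeping so the LSTM receives a valid PackedSequence.
repacked = PackedSequence(projected, packed.batch_sizes,
                          packed.sorted_indices, packed.unsorted_indices)

lstm = nn.LSTM(8, 6, batch_first=True)
packed_out, _ = lstm(repacked)
out, _ = pad_packed_sequence(packed_out, batch_first=True)
```

Applying the dense layer to the padded tensor first and packing afterwards would also work; operating on .data simply avoids spending computation on padding positions.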

Can you help me understand torch.gather? I have used the same tensors with similar shapes, but the line torch.gather(outputs, 1, targets.view(-1,1)) gives me an error.

I used to use this masked_cross_entropy workaround, but I tested them both and I observe the same behaviour (compared to CrossEntropyLoss without ignore_index=0).

With the parameter “reduce”, we can get the loss per batch element, but how can we use the mask on it?

For example, suppose I have a minibatch whose valid sequence lengths are [3,1,2], and with “reduce” we get three loss values [L1, L2, L3]. We need to mask the last two values when calculating L2 and the last value when calculating L3. How could this be achieved?
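One way this might be done is to ask for unreduced per-token losses and apply the mask before averaging. The sketch below uses reduction='none', which replaced the older reduce flag in later PyTorch releases; the tensor sizes and random data are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, max_len, vocab_size = 3, 3, 5
lengths = torch.tensor([3, 1, 2])  # valid sequence lengths

outputs = torch.randn(batch_size, max_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, max_len))

# 1.0 where the time step is valid, 0.0 where it is padding.
mask = (torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)).float()

# Unreduced loss: one value per token instead of a single scalar.
criterion = nn.CrossEntropyLoss(reduction='none')
token_losses = criterion(outputs.reshape(-1, vocab_size), targets.reshape(-1))
token_losses = token_losses.reshape(batch_size, max_len) * mask

# L1, L2, L3: mean loss over the valid steps of each sequence.
per_sequence = token_losses.sum(dim=1) / lengths.float()
loss = per_sequence.mean()
```

If the padded targets all share a dedicated index, passing that index as ignore_index to nn.CrossEntropyLoss is an alternative that skips those positions without an explicit mask.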