I am working on image captioning task with PyTorch.

In seq2seq models, padding is used to handle the variable-length sequence problem.

Additionally, a mask is multiplied by the calculated loss (a vector, not a scalar) so that the padded positions do not contribute to the loss.

In TensorFlow, I can do this as follows:

```
# targets is an int64 tensor of shape (batch_size, padded_length) containing word indices.
# masks is a tensor of shape (batch_size, padded_length) containing 0 or 1 (0 if pad, otherwise 1).
outputs = decoder(...)  # unnormalized scores of shape (batch_size, padded_length, vocab_size)
outputs = tf.reshape(outputs, (-1, vocab_size))
targets = tf.reshape(targets, [-1])
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=outputs, labels=targets)  # losses of shape (batch_size*padded_length,)
masks = tf.reshape(masks, [-1])
loss = losses * masks
```

In PyTorch, `nn.CrossEntropyLoss()` returns a scalar, not a tensor, so I cannot multiply the loss by the masks.

```
criterion = nn.CrossEntropyLoss()
outputs = decoder(features, inputs) # (batch_size, padded_length, vocab_size)
loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))  # this gives a scalar, not a tensor
```

How can I solve this problem?
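For reference, here is a minimal sketch of what I am trying to achieve, assuming a PyTorch version where `nn.CrossEntropyLoss` accepts `reduction='none'` to keep the per-element losses (older versions used `reduce=False`). The shapes and mask values below are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
batch_size, padded_length, vocab_size = 2, 4, 10

outputs = torch.randn(batch_size, padded_length, vocab_size)  # unnormalized scores
targets = torch.randint(0, vocab_size, (batch_size, padded_length))
masks = torch.tensor([[1., 1., 1., 0.],
                      [1., 1., 0., 0.]])  # 0 at padded positions

# reduction='none' keeps one loss value per token instead of averaging to a scalar.
criterion = nn.CrossEntropyLoss(reduction='none')
losses = criterion(outputs.view(-1, vocab_size), targets.view(-1))  # (batch_size*padded_length,)

# Zero out the padded positions and average over the real tokens only.
loss = (losses * masks.view(-1)).sum() / masks.sum()
```

Is something like this the idiomatic way, or is there a built-in mechanism (e.g. an ignore-index option) that handles padding directly?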