Cross entropy for a batch with different masks

Hello everyone!

TL;DR: I want to compute the cross entropy loss for a batch in which each element may use a different masking strategy.

I have quite an interesting question. My model output is a tensor of shape [B, N, C], where B is the batch size, N is the sequence length, and C is the vocabulary size.

To compute the cross entropy loss I want to select every 2nd word, or every 3rd word, etc., depending on the original sequence length. However, I am not sure whether this operation can be parallelized.
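
For concreteness, here is a minimal sketch of the setup (all shapes, names, and stride values below are made up for illustration):

```python
import torch

B, N, C = 4, 10, 100               # batch size, sequence length, vocab size
output = torch.randn(B, N, C)      # model output (logits)
target = torch.randint(0, C, (B, N))

# Each batch element keeps a different subset of time steps, e.g. every
# 2nd word for element 0, every 3rd word for element 1, and so on.
strides = [2, 3, 2, 5]
kept = [output[i, ::s] for i, s in enumerate(strides)]  # ragged: [N/s_i, C] each
```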

You should be able to slice the model output and targets in the temporal dimension and only pass the desired time steps to the criterion. Did you try it out, and if so, did you see any issues?
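
Something along these lines should work; this is just a sketch with made-up shapes and strides:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

B, N, C = 4, 10, 100
output = torch.randn(B, N, C)      # [B, N, C] logits
target = torch.randint(0, C, (B, N))

# Uniform stride: slice output and target in the temporal dimension and
# flatten so the criterion sees shapes [B*T, C] and [B*T].
loss = criterion(output[:, ::2].reshape(-1, C), target[:, ::2].reshape(-1))

# Per-element strides: select via a boolean mask and flatten, since each
# element then keeps a different number of time steps.
mask = torch.zeros(B, N, dtype=torch.bool)
for i, stride in enumerate([2, 3, 2, 5]):  # one stride per batch element
    mask[i, ::stride] = True
loss = criterion(output[mask], target[mask])  # shapes [K, C] and [K]
```

The mask-based variant flattens everything into a single criterion call, so the loss computation itself stays parallel even though each batch element keeps a different number of time steps.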

Unrelated to this, but you would also have to permute the model output so that its shape is [batch_size, nb_classes, seq_len], since that is the layout nn.CrossEntropyLoss expects for sequence outputs.
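
For example (a minimal sketch; shapes are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

B, N, C = 4, 10, 100
output = torch.randn(B, N, C)        # model output: [B, N, C]
target = torch.randint(0, C, (B, N)) # targets: [B, N]

# Move the class dimension into second place before calling the criterion:
# [B, N, C] -> [B, C, N], i.e. [batch_size, nb_classes, seq_len].
loss = criterion(output.permute(0, 2, 1), target)
```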