The method used in the paper works by mixing two inputs and their respective targets, which requires the targets to be soft (floating-point) values.
However, PyTorch’s nll_loss (used by CrossEntropyLoss) requires the target tensor to be of type Long, i.e. integer class indices.
One idea is to compute a weighted sum of the hard losses, one per non-zero label. This seems reasonable to me, since there are exactly two such labels in this case (because two samples are mixed).
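A minimal sketch of that weighted-sum idea (the function name `mixup_criterion` and the variable names are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def mixup_criterion(pred, y_a, y_b, lam):
    # Weighted sum of two hard-label cross entropies, one per mixed sample.
    # y_a, y_b stay Long tensors, so nll_loss is happy.
    return lam * F.cross_entropy(pred, y_a) + (1 - lam) * F.cross_entropy(pred, y_b)

pred = torch.randn(4, 10)                # logits for a batch of 4
y_a = torch.tensor([0, 1, 2, 3])         # labels of the first mixed samples
y_b = torch.tensor([4, 5, 6, 7])         # labels of the second mixed samples
loss = mixup_criterion(pred, y_a, y_b, lam=0.7)
```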
What’s the recommended way to go about this?
In the paper (and the Chainer code) they used cross entropy, but the extra loss term from binary cross entropy might not be a problem. I’ll give it a try.
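For reference, the binary cross entropy route would look roughly like this: mix one-hot float targets directly and feed them to `binary_cross_entropy_with_logits`, which accepts soft targets (the shapes and `num_classes` below are just illustrative):

```python
import torch
import torch.nn.functional as F

num_classes = 5
lam = 0.7
y_a = torch.tensor([1, 3])
y_b = torch.tensor([2, 0])

# Mix the one-hot targets as floats; no Long constraint here.
mixed_targets = (lam * F.one_hot(y_a, num_classes).float()
                 + (1 - lam) * F.one_hot(y_b, num_classes).float())

logits = torch.randn(2, num_classes)
loss = F.binary_cross_entropy_with_logits(logits, mixed_targets)
```

Note this treats each class as an independent binary problem, which is where the extra loss term relative to softmax cross entropy comes from.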
I’m confused. The Chainer implementation uses softmax_cross_entropy, which, according to the docs, takes integer targets just like PyTorch’s cross entropy. What am I missing here?