The current version of cross-entropy loss only accepts one-hot vectors for target outputs.
I need to implement a version of cross-entropy loss that supports continuous target distributions, but I don’t know how to make it numerically stable.

For example, would the following implementation work well?

output = model(input) #model output is a softmax distribution over 3 categories
target = Variable(torch.FloatTensor([0.1, 0.7, 0.2])) #target distribution is continuous – not one-hot
loss = -1 * torch.sum(target * torch.log(output)) #compute the cross-entropy
loss.backward()

Is this stable? Is there a built-in version I can use?

Unfortunately, the targets for PyTorch’s current CrossEntropyLoss are one-hot in the sense that each target names a single ground-truth class with “probability” 1.
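A note for readers on a newer release: as far as I know, since PyTorch 1.10 the built-in cross-entropy also accepts class probabilities as the target, as long as the target has the same shape as the input. The sketch below assumes such a version and uses made-up logits; the model should return raw scores here, not a softmax output.

```python
import torch
import torch.nn.functional as F

# Illustrative values only: raw scores (logits) for one sample over 3 classes.
logits = torch.tensor([[2.0, 1.0, 0.1]], requires_grad=True)
# Soft target distribution, same shape as the logits.
target = torch.tensor([[0.1, 0.7, 0.2]])

# Assumes PyTorch >= 1.10, where a float target of the same shape
# is interpreted as class probabilities.
loss = F.cross_entropy(logits, target)
loss.backward()

# Manual version of the same quantity, for comparison.
manual = -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

On older versions this call raises an error, and the manual formula is the fallback.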

If the output of the model is a probability distribution, the right thing is to use cross-entropy, as it’s equivalent to MLE. Using square loss is something else (it usually assumes the noise is Gaussian).

My code is specific to target distributions that are not one-hot. I don’t know if that’s what you want, but does this help?

output = model(input) #final layer of model is LogSoftmax(), so the output is the log-probability distribution
target = Variable(torch.FloatTensor([0.1, 0.7, 0.2])) #target probability distribution
loss = -1 * torch.sum(target * output) #the crossentropy formula is -1 * sum( log(output_dist) * target_dist)
loss.backward()

I’m not sure I understand what you are asking, but the code I wrote above works. The numerical problem arises when taking torch.log of the softmax distribution: a probability can underflow to zero, and log(0) is -inf, which can turn the loss or its gradient into inf or nan.
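To make the difference concrete, here is a small sketch comparing the two routes. The logits are made up and deliberately extreme so that the underflow shows up in float32.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits with a very large gap between classes (float32).
logits = torch.tensor([[200.0, 0.0, -200.0]])
target = torch.tensor([[0.1, 0.7, 0.2]])

# Naive route: softmax first, then log. The small probabilities underflow
# to exactly 0, so log() returns -inf for them.
naive_log_probs = torch.log(F.softmax(logits, dim=1))
naive_loss = -torch.sum(target * naive_log_probs)

# Fused route: log_softmax computes the log-probabilities directly
# (via log-sum-exp) and stays finite for the same input.
stable_log_probs = F.log_softmax(logits, dim=1)
stable_loss = -torch.sum(target * stable_log_probs)

print(naive_loss.item(), stable_loss.item())
```

With these values the naive loss is infinite while the log_softmax version is an ordinary finite number.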

Hi
I have the same problem and tried this solution, but it seems like it’s not working very well.
Is there a way to achieve this while keeping the model output as a regular softmax?

If you make the output of your neural network softmax, and then take the log of it, it will be slower than logsoftmax and sometimes the output will be nan.

There are other loss functions, but cross-entropy loss is arguably the best one for probability distributions.

I meant that I don’t get good results (of course, there might be a different cause for this).
Anyway, I would feel more comfortable if the model could just output a softmax; that way I could make sure there is no problem there.
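If the model really has to output softmax probabilities, one possible workaround is to clamp them away from zero before taking the log. This is only a sketch: cross_entropy_from_probs and the eps value are my own hypothetical choices, not anything built into PyTorch.

```python
import torch

def cross_entropy_from_probs(probs, soft_targets, eps=1e-8):
    """Cross-entropy from softmax probabilities, averaged over the batch.

    eps is an assumed floor (not a tuned value) that keeps log() away from 0.
    """
    log_probs = torch.log(probs.clamp(min=eps))
    return torch.mean(torch.sum(-soft_targets * log_probs, dim=1))

# Extreme made-up logits: the softmax underflows to [1, 0, 0] in float32.
probs = torch.softmax(torch.tensor([[200.0, 0.0, -200.0]]), dim=1)
soft_targets = torch.tensor([[0.1, 0.7, 0.2]])
loss = cross_entropy_from_probs(probs, soft_targets)
```

The trade-off is that entries clamped to eps receive no gradient, so keeping logits inside the model and applying softmax only when you need probabilities is probably still the safer route.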

This assumes pred and soft_targets are both Variables with shape (batchsize, num_of_classes): each row of pred is predicted logits and each row of soft_targets is a discrete distribution.
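A minimal sketch of a loss under those assumptions could look like the following. The function name soft_cross_entropy and the example values are hypothetical, and plain tensors stand in for Variables as in recent PyTorch versions.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(pred, soft_targets):
    """Cross-entropy between logits and soft targets, averaged over the batch.

    pred:         (batchsize, num_of_classes) raw logits
    soft_targets: (batchsize, num_of_classes) rows that sum to 1
    """
    log_probs = F.log_softmax(pred, dim=1)  # stable log-probabilities
    return torch.mean(torch.sum(-soft_targets * log_probs, dim=1))

# Made-up batch of two samples over three classes.
pred = torch.tensor([[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]], requires_grad=True)
soft_targets = torch.tensor([[0.1, 0.7, 0.2], [1 / 3, 1 / 3, 1 / 3]])

loss = soft_cross_entropy(pred, soft_targets)
loss.backward()
```

With one-hot rows in soft_targets this reduces to the usual cross-entropy on class indices, which is a handy sanity check.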