How should I implement cross-entropy loss with continuous target outputs?

The current version of cross-entropy loss only accepts one-hot vectors for target outputs.
I need to implement a version of cross-entropy loss that supports continuous target distributions. What I don’t know is how to implement a version of cross-entropy loss that is numerically stable.

For example, would the following implementation work well?

output = model(input)  # model output is a softmax distribution over 3 categories
target = Variable(torch.FloatTensor([0.1, 0.7, 0.2]))  # target distribution is continuous, not one-hot
loss = -1 * torch.sum(target * torch.log(output))  # compute the cross-entropy
loss.backward()

Is this stable? Is there a built-in version I can use?

Thank you

I don’t know if it’s possible, but I’m interested in why you would want to use CrossEntropyLoss?

I have used MSELoss for similar things with good results.

Also, unless I’m very mistaken, the targets for nn.CrossEntropyLoss are class indices, not one-hot vectors.

Change softmax + log to nn.LogSoftmax and you are golden.

Unfortunately, with the current PyTorch CrossEntropyLoss, the targets are effectively one-hot: each target names a single ground-truth class with “probability” 1.

Sorry, can you explain in a little more detail what you mean by your answer?

Change softmax + log to nn.LogSoftmax and you are golden

Instead of doing torch.log(<model_softmax_output>), change the last layer of the neural network to LogSoftmax and remove the torch.log() from the loss equation.

May I see it in code?

I’m not doing

torch.log(<model_softmax_output>)

I’m doing

loss = criterion(y_pred, batch_ys)

with:

criterion = torch.nn.CrossEntropyLoss()

If you are using torch.nn.CrossEntropyLoss() then you don’t need a softmax output layer on your model.

So it would just be

output = model(input)  # logit output
criterion = torch.nn.CrossEntropyLoss()
loss = criterion(output, target)
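
To illustrate why no softmax layer is needed, here is a minimal sketch (assuming a PyTorch version where plain tensors are used instead of Variable) that checks CrossEntropyLoss on raw logits against a hand-computed negative log-softmax at the target index. The logits and targets are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw logits for a batch of 2 samples over 3 classes
logits = torch.tensor([[1.0, 2.0, 3.0],
                       [0.5, 0.1, -1.0]])
# Targets are class indices (one per sample), not one-hot vectors
target = torch.tensor([2, 0])

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(logits, target)

# The same quantity by hand: log-softmax, then pick the target entries
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), target].mean()
```

The two values should agree, which is why adding a softmax layer before CrossEntropyLoss would apply the normalization twice.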

If the output of the model is a probability distribution, the right thing is to use cross-entropy, as it’s equivalent to maximum likelihood estimation (MLE). Using squared loss is something else (it usually assumes the noise is Gaussian).
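
One way to spell out that equivalence, as a sketch: here p is the target distribution and q_θ is the model's output distribution (this notation is mine, not from the thread).

```latex
H(p, q_\theta) \;=\; -\sum_{i} p_i \log q_{\theta,i}
\;=\; H(p) + \mathrm{KL}\!\left(p \,\|\, q_\theta\right),
\qquad H(p) = -\sum_{i} p_i \log p_i
```

Since H(p) does not depend on θ, minimizing the cross-entropy is the same as minimizing the KL divergence. When p is the empirical distribution of observed labels, the sum is the average negative log-likelihood, so minimizing it is maximum likelihood estimation.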

Honestly, I’d rather not use torch.nn.CrossEntropyLoss(), and that’s why I was asking to look at the actual code you used.

My code is specific for target distributions that are not one-hot, I don’t know if that’s what you want, but does this help?

output = model(input)  # final layer of model is LogSoftmax(), so the output is the log-probability distribution
target = Variable(torch.FloatTensor([0.1, 0.7, 0.2]))  # target probability distribution
loss = -1 * torch.sum(target * output)  # the cross-entropy formula is -1 * sum(log(output_dist) * target_dist)
loss.backward()
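
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. It assumes PyTorch 0.4 or later, where plain tensors replace Variable, and the toy model (one Linear layer followed by LogSoftmax) is hypothetical, chosen only so the snippet runs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model: 4 input features -> 3 classes, ending in LogSoftmax
model = nn.Sequential(nn.Linear(4, 3), nn.LogSoftmax(dim=-1))

x = torch.randn(4)                       # a single made-up input
target = torch.tensor([0.1, 0.7, 0.2])   # target probability distribution

log_probs = model(x)                     # log-probabilities, shape (3,)
loss = -torch.sum(target * log_probs)    # cross-entropy against a soft target
loss.backward()                          # gradients flow into the Linear layer
```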

Yes, I don’t have one-hot vectors either; I’m learning a distribution, or continuous target values, as well.

Though I thought that wasn’t right (due to numerical issues), hence your question… am I right?

I’m not sure I understand what you are asking, but the code I wrote above works. The numerical problem arises when taking torch.log of the softmax output: a probability can underflow to exactly 0, and torch.log(0) is -inf, which can turn the loss or its gradient into nan.
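
A small sketch of that failure mode, using deliberately extreme logits (made up purely to force underflow) and assuming a PyTorch version with torch.softmax and torch.nn.functional.log_softmax:

```python
import torch
import torch.nn.functional as F

# Extreme logits, chosen only to force underflow in softmax
logits = torch.tensor([1000.0, 0.0, -1000.0])
target = torch.tensor([0.1, 0.7, 0.2])

# Naive: softmax underflows to exactly 0 for two classes, so log gives -inf
naive = -torch.sum(target * torch.log(torch.softmax(logits, dim=0)))

# Stable: log_softmax computes the log-probabilities directly
stable = -torch.sum(target * F.log_softmax(logits, dim=0))

# With a zero in the target, the naive form hits 0 * -inf, which is nan
one_hot = torch.tensor([1.0, 0.0, 0.0])
naive_nan = -torch.sum(one_hot * torch.log(torch.softmax(logits, dim=0)))
```

Here the naive loss is infinite (or nan once the target contains a zero), while the log_softmax version stays finite.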

Hi,
I have the same problem and tried this solution, but it seems like it’s not working very well.
Is there a way to achieve this while keeping the model output as a regular softmax?

What do you mean by “it’s not working very well”?

If you make the output of your neural network a softmax and then take its log, it will be slower than LogSoftmax, and the output will sometimes be nan.

There are other loss functions, but cross-entropy loss is arguably the best one for probability distributions.

I meant that I don’t get good results (of course, there might be a different cause for this).
Anyway I would feel more comfortable if the model could just output a softmax and this way I could make sure there is no problem there.
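
If the model really has to output a softmax, one common workaround (not something proposed in this thread, so treat it as an assumption) is to clamp the probabilities away from zero before taking the log. The helper name cross_entropy_from_probs and the eps value below are hypothetical.

```python
import torch

def cross_entropy_from_probs(probs, soft_targets, eps=1e-12):
    # Clamp away exact zeros so log never returns -inf (eps is an assumed value)
    log_probs = torch.log(probs.clamp(min=eps))
    return -torch.sum(soft_targets * log_probs, dim=-1).mean()

# Extreme logits, made up to show that the loss stays finite
logits = torch.tensor([[1000.0, 0.0, -1000.0]])
probs = torch.softmax(logits, dim=-1)
target = torch.tensor([[0.1, 0.7, 0.2]])
loss = cross_entropy_from_probs(probs, target)
```

Note that clamped entries receive no gradient, so this is a workaround rather than a replacement for computing the loss from log-softmax outputs.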

The following code should work in PyTorch 0.2:

def cross_entropy(pred, soft_targets):
    logsoftmax = nn.LogSoftmax()
    return torch.mean(torch.sum(- soft_targets * logsoftmax(pred), 1))

This assumes pred and soft_targets are both Variables with shape (batch_size, num_classes), where each row of pred holds predicted logits and each row of soft_targets is a discrete probability distribution.
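
For more recent PyTorch versions, a roughly equivalent sketch would pass the dimension explicitly and use plain tensors. The name soft_cross_entropy and the sample values are mine, for illustration only.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(pred, soft_targets):
    # pred: raw logits, shape (batch_size, num_classes)
    # soft_targets: rows are probability distributions, same shape as pred
    log_probs = F.log_softmax(pred, dim=1)
    return torch.mean(torch.sum(-soft_targets * log_probs, dim=1))

pred = torch.zeros(2, 3)  # all-zero logits, i.e. uniform predictions
soft_targets = torch.tensor([[0.1, 0.7, 0.2],
                             [1.0, 0.0, 0.0]])
loss = soft_cross_entropy(pred, soft_targets)
```

With uniform predictions over 3 classes, every log-probability is -log(3), so the loss is log(3) whatever the target rows are, as long as each row sums to 1. That makes a quick sanity check.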

Is there now an “official” PyTorch function for this, or should we still do it by hand?
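
For readers on newer releases: as far as I know, torch.nn.functional.cross_entropy (and nn.CrossEntropyLoss) accepts class probabilities as the target from PyTorch 1.10 onward. A sketch under that version assumption, with made-up values, comparing it to the hand-written formula:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.0, 0.0, 0.0]])
soft_targets = torch.tensor([[0.1, 0.7, 0.2],
                             [0.3, 0.3, 0.4]])

# Assumes PyTorch >= 1.10: a float target with the same shape as the logits
# is interpreted as per-class probabilities
builtin = F.cross_entropy(logits, soft_targets)

# Hand-written equivalent for comparison
manual = torch.mean(torch.sum(-soft_targets * F.log_softmax(logits, dim=1), dim=1))
```

On older versions the hand-written form is still needed.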

I believe you can use BCELoss, as long as your labels and outputs are represented as normalized vectors. For example:

loss_fn = nn.BCELoss()
softmax = nn.Softmax()
input = Variable(torch.randn(3))
output = softmax(input)
target = Variable(torch.FloatTensor([.1, .7, .2]))
loss = loss_fn(output, target)
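
One caveat worth checking before relying on this: BCELoss treats each class as an independent binary problem, so the value it returns is not the categorical cross-entropy discussed earlier in the thread. A small sketch with made-up numbers, assuming a PyTorch version with torch.nn.functional.binary_cross_entropy:

```python
import torch
import torch.nn.functional as F

output = torch.tensor([0.2, 0.5, 0.3])   # made-up softmax output
target = torch.tensor([0.1, 0.7, 0.2])   # soft target distribution

# BCELoss: mean over elements of -[t*log(o) + (1-t)*log(1-o)]
bce = F.binary_cross_entropy(output, target)
bce_manual = -(target * output.log() + (1 - target) * (1 - output).log()).mean()

# Categorical cross-entropy: -sum(t*log(o))
ce = -torch.sum(target * output.log())
```

Both losses are smallest when output equals target, but their values and gradients differ, so results may not match a model trained with the categorical form.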