Combining sparsemax with CrossEntropyLoss

It is known that torch.nn.CrossEntropyLoss() integrates the softmax function when calculating the loss. In my case, I want to use the sparsemax function instead, while still keeping CrossEntropyLoss as the criterion. How would this be possible?
PS: my model is a FCNN with 6 inputs, 125 output classes, and 5 hidden layers of 128 neurons each. I welcome any criticism of the model (since I am not fully convinced about it).
Thank you

Hi Yuri!

First off, I don’t think you want to do this. CrossEntropyLoss has a
logarithmic divergence when you predict a probability of zero for what
your target labels as the correct class. You will therefore get inf for
your loss function, which will quickly pollute your backpropagation and
future training.

Note that what I believe is the original sparsemax paper also proposes
a companion “sparsemax loss” that one would presumably use in place
of CrossEntropyLoss. I haven’t tried any of this, but perhaps it would
make sense for your use case.

With the proviso that it’s probably not a good idea, I see two ways:

You could write your own version of CrossEntropyLoss that applies just
log() rather than log_softmax() internally and takes probabilities as its
input. Then pass the output of sparsemax into your custom cross-entropy loss.
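A rough sketch of what such a probability-based cross entropy might look like (prob_cross_entropy is a hypothetical helper name, not anything built into pytorch; it just reuses nll_loss on log()-ed probabilities):

```python
import torch
import torch.nn.functional as F

def prob_cross_entropy(probs, target):
    # Hypothetical helper: cross entropy that takes probabilities
    # directly -- applies plain log() instead of log_softmax().
    # probs: (N, C) probabilities; target: (N,) class indices.
    return F.nll_loss(probs.log(), target)

probs = torch.full((2, 4), 0.25)          # uniform probabilities
target = torch.tensor([0, 3])
loss = prob_cross_entropy(probs, target)  # -log(0.25), about 1.386
```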

You could also use the fact that log() is the inverse of softmax() (in the
sense that t.softmax(0).log().softmax(0) == t.softmax(0))
and apply log() to the output of sparsemax before feeding it into pytorch’s
CrossEntropyLoss.
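For example (a minimal sketch; the probabilities below stand in for a sparsemax output and are strictly positive on purpose, since an exact zero would already give log(0) = -inf):

```python
import torch

# For a probability vector p that sums to 1, softmax(log(p)) == p,
# so passing log(p) to CrossEntropyLoss recovers the cross entropy
# of p itself.
probs = torch.tensor([[0.7, 0.3]])
target = torch.tensor([0])

loss = torch.nn.CrossEntropyLoss()(probs.log(), target)
# loss == -log(0.7), about 0.357
```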

Note that in both cases you will be applying log() to the output of
sparsemax, which will yield inf in your loss when the output of sparsemax is
exactly zero for the target class.

You could try clamping the output of sparsemax away from zero, but
when you’re in the clamped regime, your gradient for the clamped
probability will be zero, defeating the benefit you get from the logarithmic
divergence when your prediction is quite wrong, and degrading training.
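To see the zero-gradient problem concretely, here is a small sketch (the eps value is an arbitrary choice):

```python
import torch

eps = 1e-6  # arbitrary floor

# Sparsemax-like output where the correct class got exactly zero:
probs = torch.tensor([[0.8, 0.2, 0.0]], requires_grad=True)
target = torch.tensor([2])

loss = torch.nn.functional.nll_loss(probs.clamp(min=eps).log(), target)
loss.backward()
# loss is large (-log(eps)), but probs.grad is all zeros: the clamp
# blocks any gradient signal that would push the zero probability up.
```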


K. Frank

Thanks for your answer KFrank. First, the log in CrossEntropyLoss is preceded by the softmax, therefore no zeros.
The idea of writing an adapted version of CrossEntropyLoss sounds necessary, since PyTorch does not provide one yet.
Best regards

Hi Yuri!

Just to be clear about the logic of my previous post:

The sparsemax function outputs probabilities. (They range over [0.0, 1.0],
inclusive, and sum to 1.0.)
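(Sparsemax isn’t in core pytorch, so for reference, a minimal 1-D sketch following Martins & Astudillo (2016) – Euclidean projection of the logits onto the probability simplex – might look like this:

```python
import torch

def sparsemax(z):
    # Minimal 1-D sparsemax sketch (Martins & Astudillo, 2016):
    # project the logits z onto the probability simplex.
    z_sorted, _ = torch.sort(z, descending=True)
    k = torch.arange(1, z.numel() + 1, dtype=z.dtype)
    cumsum = z_sorted.cumsum(0) - 1.0
    k_z = ((k * z_sorted) > cumsum).sum()   # size of the support
    tau = cumsum[k_z - 1] / k_z             # threshold
    return torch.clamp(z - tau, min=0.0)

p = sparsemax(torch.tensor([1.0, 0.8, -1.0]))
# p sums to 1.0 and the last entry is exactly zero.
```

Note how the last output entry is exactly zero, not merely small – that is the whole point of sparsemax, and also the source of the trouble below.)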

Pytorch’s CrossEntropyLoss is not designed to take probabilities as its
input. Instead, it takes (unnormalized) log-probabilities (that range over
(-inf, inf)).

If you want to pass the output of sparsemax to a cross-entropy function,
you have two choices: You can pass the outputs of sparsemax to pytorch’s
CrossEntropyLoss, but this is a known mistake, because passing
probabilities to CrossEntropyLoss doesn’t train well.

In the analogous softmax() situation, this would be like:

logprobs = torch.nn.Linear(10, 10)(somedata)
probs = logprobs.softmax(dim=0)
loss = torch.nn.CrossEntropyLoss()(probs, target)   # error, should have passed in logprobs

You can do this and it will “work,” but it won’t train well.

It is true that when sparsemax outputs a zero, the softmax() that is internal
to pytorch’s CrossEntropyLoss will protect CrossEntropyLoss’s log()
and you won’t get that inf. But it’s still a mistake, because pytorch’s
CrossEntropyLoss doesn’t work properly when passed probabilities.

Or you can pass the output of sparsemax to a version of cross entropy that
accepts probabilities. Internally such a cross-entropy function will take the
log() of its inputs (because that’s how it’s defined). But now when you
pass in a probability that is exactly zero (rather than just very small), you
will get an inf that breaks things unless you stand on your head to patch
it up somehow.

The key problem is that sparsemax is designed to output probabilities that
are frequently exactly zero, whereas a cross-entropy loss doesn’t work
with probabilities that are exactly zero. (Mathematically, you could say that
cross entropy isn’t defined when an input probability is zero, or you could
say that it’s defined to be inf, but either way, your training breaks.)
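You can see the breakage directly:

```python
import torch

# Cross entropy on an exactly-zero probability for the target class:
p = torch.tensor([0.0, 1.0], requires_grad=True)
loss = -p.log()[0]   # the -log(p_target) term of cross entropy
loss.backward()
# loss is inf and p.grad contains -inf, which then poisons every
# parameter update downstream of it.
```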

Well, this depends on what you mean by “adapted.” But if your “adapted
version” remains close to a standard cross-entropy function – for example,
by adding epsilons or clamping probabilities away from zero – it won’t train
well.
The original sparsemax paper proposes a sparsemax-loss function (that
I would say is not just an “adapted” version of cross entropy). I’ve never
tried it, but it seems conceptually sound and is likely to be better than
trying to patch over the inherent incompatibility between cross entropy
and exactly-zero probabilities.
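As a self-contained sketch (hedged – I haven’t verified this against the paper beyond a numerical sanity check), the sparsemax loss is L(z; y) = -z_y + ½ Σ_{j∈S(z)} (z_j² − τ²) + ½, and its gradient with respect to z is sparsemax(z) − one_hot(y), which stays finite even for exactly-zero probabilities:

```python
import torch

def _sparsemax_with_tau(z):
    # 1-D sparsemax (Martins & Astudillo, 2016), also returning the
    # threshold tau needed by the loss below.
    z_sorted, _ = torch.sort(z, descending=True)
    k = torch.arange(1, z.numel() + 1, dtype=z.dtype)
    cumsum = z_sorted.cumsum(0) - 1.0
    k_z = ((k * z_sorted) > cumsum).sum()
    tau = cumsum[k_z - 1] / k_z
    return torch.clamp(z - tau, min=0.0), tau

def sparsemax_loss(z, y):
    # Sparsemax loss from the paper:
    #   L(z; y) = -z_y + 1/2 * sum_{j in S(z)} (z_j^2 - tau^2) + 1/2
    # Its gradient w.r.t. z is sparsemax(z) - one_hot(y), so it
    # stays finite even when some probabilities are exactly zero.
    p, tau = _sparsemax_with_tau(z)
    support = p > 0
    return -z[y] + 0.5 * (z[support] ** 2 - tau ** 2).sum() + 0.5

z = torch.tensor([1.0, 0.8, -1.0], requires_grad=True)
loss = sparsemax_loss(z, 0)
loss.backward()
# z.grad equals sparsemax(z) - one_hot(0), i.e. [-0.4, 0.4, 0.0] here.
```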


K. Frank

Dear K. Frank
once again thank you very much for your informative answer.
unfortunately, I could not find any implementation of the sparsemax loss function that could be easily used.
by “adapted” I mean to skip the log softmax part and only run the -∑ ground_truth_i * p_i part.