Combining sparsemax with CrossEntropyLoss

It is known that the `torch.nn.CrossEntropyLoss()` criterion integrates the softmax function when calculating the loss. In my case, I want to use the sparsemax function instead, while still keeping `CrossEntropyLoss` as the criterion. How would this be possible?
P.S.: My model is an FCNN with 6 inputs, 125 output classes, and 5 hidden layers of 128 neurons each. I welcome any criticism of the model (since I am not fully convinced about it).
Thank you

Hi Yuri!

First off, I don’t think you want to do this. `CrossEntropyLoss` has a
logarithmic divergence when you predict a probability of zero for the
class your `target` labels as correct. You will therefore get an `inf` loss,
which will break subsequent training.

Note that what I believe is the original sparsemax paper also proposes
a companion “sparsemax loss” that one would presumably use in place
of `CrossEntropyLoss`. I haven’t tried any of this, but perhaps it would
make sense for your use case.

With the proviso that it’s probably not a good idea, I see two ways:

You could write your own version of `CrossEntropyLoss` that applies just
`log()` rather than `log_softmax()` internally and takes probabilities as its
`input`. Then pass the output of sparsemax into your custom cross-entropy
loss.
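A minimal sketch of such a probability-accepting cross entropy (the function name is my own, not a PyTorch API):

```python
import torch

def cross_entropy_from_probs(probs, target):
    # probs:  (batch, num_classes) probabilities, e.g. the output of sparsemax
    # target: (batch,) class indices
    # Apply log() directly instead of log_softmax(); note that log(0) = -inf,
    # so an exact-zero probability for the target class yields an inf loss.
    return torch.nn.functional.nll_loss(probs.log(), target)

probs = torch.tensor([[0.7, 0.3, 0.0],
                      [0.0, 1.0, 0.0]])
target = torch.tensor([0, 1])
loss = cross_entropy_from_probs(probs, target)   # -(log(0.7) + log(1.0)) / 2
```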

You could also use the fact that `softmax()` undoes `log()` when applied to
probabilities (in the sense that `t.softmax(0).log().softmax(0) == t.softmax(0)`)
and apply `log()` to the output of sparsemax before feeding it into pytorch’s
`CrossEntropyLoss`.
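For example (illustrative probabilities standing in for a sparsemax output; none are zero here, so `log()` stays finite):

```python
import torch

probs = torch.tensor([[0.6, 0.4],
                      [0.2, 0.8]])      # rows sum to 1.0, no exact zeros
target = torch.tensor([0, 1])

# log_softmax(log(p)) == log(p) for a probability vector p, so feeding
# log(probs) to CrossEntropyLoss gives the same loss as -log(p[target]).
loss_ce = torch.nn.CrossEntropyLoss()(probs.log(), target)
loss_direct = torch.nn.functional.nll_loss(probs.log(), target)
```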

Note that in both cases you will be applying `log()` to the output of
sparsemax, which will yield `inf` when the output of sparsemax is
zero.

You could try clamping the output of sparsemax away from zero, but then no
predicted probability will be exactly zero, defeating the sparsity benefit that
sparsemax provides, and probabilities clamped to very small values still sit
near the logarithmic divergence.
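A quick illustration of why clamping discards the sparsity (the epsilon value is arbitrary):

```python
import torch

p = torch.tensor([0.7, 0.3, 0.0])        # sparsemax-style output with an exact zero
eps = 1e-6                               # arbitrary small constant
p_clamped = (p + eps) / (p + eps).sum()  # clamp away from zero and renormalize
# log() is now finite everywhere, but no entry is exactly zero any more,
# so the sparsity that sparsemax was chosen for is gone.
```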

Best.

K. Frank

The idea of writing an adapted version of `CrossEntropyLoss` sounds necessary, since PyTorch does not integrate one yet.
Best regards

Hi Yuri!

Just to be clear about the logic of my previous post:

The sparsemax function outputs probabilities. (They range over `[0.0, 1.0]`,
inclusive, and sum to `1.0`.)
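For reference, a minimal single-vector sketch of the sparsemax projection (following the Martins & Astudillo paper; an illustration, not a vetted implementation):

```python
import torch

def sparsemax(z):
    # Euclidean projection of a 1-d score vector z onto the probability simplex.
    z_sorted, _ = torch.sort(z, descending=True)
    k = torch.arange(1, z.numel() + 1, dtype=z.dtype)
    cumsum = z_sorted.cumsum(0)
    support = 1 + k * z_sorted > cumsum        # entries that stay nonzero
    k_z = int(support.sum())                   # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z          # threshold subtracted from z
    return torch.clamp(z - tau, min=0.0)

p = sparsemax(torch.tensor([2.0, 1.0, -1.0]))  # sums to 1.0, with exact zeros
```

Note that a probability vector is a fixed point of this projection, e.g. `sparsemax(torch.tensor([0.5, 0.3, 0.2]))` returns its input unchanged.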

Pytorch’s `CrossEntropyLoss` is not designed to take probabilities as its
input. Instead, it takes (unnormalized) log-probabilities (that range over
`(-inf, inf)`).
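Concretely, `CrossEntropyLoss` applies `log_softmax()` to its input itself, which is why it expects raw scores:

```python
import torch

logits = torch.randn(4, 3)               # unnormalized log-probabilities (raw scores)
target = torch.tensor([0, 2, 1, 1])

# CrossEntropyLoss is log_softmax followed by nll_loss
ce = torch.nn.CrossEntropyLoss()(logits, target)
nll = torch.nn.functional.nll_loss(logits.log_softmax(dim=1), target)
```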

If you want to pass the output of sparsemax to a cross-entropy function,
you have two choices. You can pass the output of sparsemax to pytorch’s
`CrossEntropyLoss`, but this is a mistake, because passing probabilities
to `CrossEntropyLoss` doesn’t train well.

In the analogous `softmax()` situation, this would be like:

```
logprobs = torch.nn.Linear(10, 10)(somedata)
probs = logprobs.softmax(dim=0)
loss = torch.nn.CrossEntropyLoss()(probs, target)   # error, should have passed in logprobs
```

You can do this and it will “work,” but it won’t train well.

It is true that when sparsemax outputs a zero, the `softmax()` that is internal
to pytorch’s `CrossEntropyLoss` will protect `CrossEntropyLoss`’s `log()`
and you won’t get that `inf`. But it’s still a mistake, because pytorch’s
`CrossEntropyLoss` doesn’t work properly when passed probabilities.

Or you can pass the output of sparsemax to a version of cross entropy that
accepts probabilities. Internally such a cross-entropy function will take the
`log()` of its inputs (because that’s how it’s defined). But now when you
pass in a probability that is exactly zero (rather than just very small), you
will get an `inf` that breaks things unless you stand on your head to patch
it up somehow.

The key problem is that sparsemax is designed to output probabilities that
are frequently exactly zero, whereas a cross-entropy loss doesn’t work
with probabilities that are exactly zero. (Mathematically, you could say that
cross entropy isn’t defined when an input probability is zero, or you could
say that it’s defined to be `inf`, but either way, your training breaks.)
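The breakage is easy to reproduce (using a hand-written probability vector in place of a real sparsemax output):

```python
import torch

probs = torch.tensor([[0.0, 1.0]])   # exact zero for class 0
target = torch.tensor([0])           # the "correct" class got probability 0

loss = torch.nn.functional.nll_loss(probs.log(), target)
# loss is inf, so training is broken from this step onward
```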

You can try to patch this up, but to the extent that your “adapted
version” remains close to a standard cross-entropy function – for example,
by adding epsilons or clamping probabilities away from zero – it won’t train
well.

The original sparsemax paper proposes a sparsemax-loss function (that
I would say is not just an “adapted” version of cross entropy). I’ve never
tried it, but it seems conceptually sound and is likely to be better than
trying to patch over the inherent incompatibility between cross entropy
and exactly-zero probabilities.
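For completeness, a sketch of that sparsemax loss for a batch of score vectors (my own transcription of the paper’s formula, L(z; k) = −z_k + ½ Σ_{j∈S(z)} (z_j² − τ²) + ½, where S(z) is the support of sparsemax(z) and τ its threshold; not tested against a reference implementation):

```python
import torch

def sparsemax_loss(z, target):
    # z:      (batch, num_classes) raw scores
    # target: (batch,) class indices
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, dtype=z.dtype)
    cumsum = z_sorted.cumsum(-1)
    support = 1 + k * z_sorted > cumsum
    k_z = support.sum(-1, keepdim=True)                       # support size
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)  # threshold
    # sum of z_j^2 - tau^2 over the support {j : z_j > tau}
    sq = torch.where(z > tau, z * z - tau * tau, torch.zeros_like(z)).sum(-1)
    return -z.gather(-1, target.unsqueeze(-1)).squeeze(-1) + 0.5 * sq + 0.5
```

Unlike cross entropy, it stays finite when sparsemax assigns the correct class probability zero, and it reaches exactly 0 once the correct score exceeds every other score by a margin of 1.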

Best.

K. Frank

Dear K. Frank