It is known that the torch.nn.CrossEntropyLoss() method integrates the softmax operation when calculating the loss. In my case, I want to use the sparsemax method instead and still keep CrossEntropyLoss as a criterion. How would this be possible?

PS: My model is an FCNN that has 6 inputs, 125 output classes, and 5 hidden layers, each with 128 neurons. I welcome any criticism of the model (since I am not fully convinced about it).

Thank you

Hi Yuri!

First off, I don’t think you want to do this. `CrossEntropyLoss` has a logarithmic divergence when you predict a probability of zero for what your `target` labels as the correct class. You will therefore get `inf` for your loss function, which will quickly pollute your backpropagation and subsequent training.

Note that what I believe is the original sparsemax paper also proposes a companion “sparsemax loss” that one would presumably use in place of `CrossEntropyLoss`. I haven’t tried any of this, but perhaps it would make sense for your use case.

With the proviso that it’s probably not a good idea, I see two ways:

You could write your own version of `CrossEntropyLoss` that applies just `log()` rather than `log_softmax()` internally and takes probabilities as its `input`. Then pass the output of sparsemax into your custom cross-entropy loss.
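
A minimal sketch of such a probability-input cross entropy might look like this (just an illustration; the helper name `prob_cross_entropy` is mine, and it assumes `probs` has shape `(nBatch, nClass)` and `target` holds integer class labels):

```
import torch

def prob_cross_entropy(probs, target):
    # cross entropy that takes probabilities (not logits) as input
    # note: probs.log() is -inf wherever probs is exactly zero
    return torch.nn.functional.nll_loss(probs.log(), target)
```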

You could also use the fact that `log()` is the inverse of `softmax()` (in the sense that `t.softmax(0).log().softmax(0) == t.softmax(0)`) and apply `log()` to the output of sparsemax before feeding it into pytorch’s `CrossEntropyLoss`.
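
In code, that second option would be roughly (a sketch only; `sparsemax_out` is a stand-in for whatever your sparsemax layer actually produces):

```
import torch

sparsemax_out = torch.tensor([[0.0, 0.3, 0.7]])   # stand-in for a sparsemax output (note the exact zero)
target = torch.tensor([2])                        # index of the correct class
criterion = torch.nn.CrossEntropyLoss()
loss = criterion(sparsemax_out.log(), target)     # log() yields -inf wherever sparsemax_out == 0
```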

Note that in both cases you will be applying `log()` to the output of sparsemax, which will yield `inf` when the output of sparsemax is zero.

You could try clamping the output of sparsemax away from zero, but when you’re in the clamped regime, your gradient for the clamped probability will be zero, defeating the benefit you get from the logarithmic divergence when your prediction is quite wrong, and degrading training.
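
If you did want to try it anyway, the clamping version would look something like this (a sketch; `eps` is an arbitrary small constant introduced just for illustration):

```
import torch

eps = 1.0e-6
sparsemax_out = torch.tensor([[0.0, 0.3, 0.7]])   # stand-in for a sparsemax output
target = torch.tensor([0])                        # the "correct" class got probability zero
probs_clamped = sparsemax_out.clamp(min=eps)      # gradient w.r.t. clamped entries is zero
loss = torch.nn.functional.nll_loss(probs_clamped.log(), target)   # finite, but about -log(eps)
```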

Best.

K. Frank

Thanks for your answer KFrank. First, the log in CrossEntropyLoss is preceded by the softmax, therefore there are no zeros.

The idea of writing an adapted version of CrossEntropyLoss sounds necessary, since Pytorch does not include one yet.

Best regards

Hi Yuri!

Just to be clear about the logic of my previous post:

The sparsemax function outputs probabilities. (They range over `[0.0, 1.0]`, inclusive, and sum to `1.0`.)

Pytorch’s `CrossEntropyLoss` is *not* designed to take probabilities as its input. Instead, it takes (unnormalized) log-probabilities (that range over `(-inf, inf)`).

If you want to pass the output of sparsemax to a cross-entropy function, you have two choices: You can pass the outputs of sparsemax to pytorch’s `CrossEntropyLoss`, but this is a mistake, because passing probabilities to `CrossEntropyLoss` doesn’t train well.

In the analogous `softmax()` situation, this would be like:

```
import torch

somedata = torch.randn(10)                          # a single sample with 10 features
target = torch.tensor(3)                            # index of the correct class
logprobs = torch.nn.Linear(10, 10)(somedata)        # unnormalized log-probabilities
probs = logprobs.softmax(dim=0)                     # probabilities
loss = torch.nn.CrossEntropyLoss()(probs, target)   # error, should have passed in logprobs
```

You can do this and it will “work,” but it won’t train well.

It is true that when sparsemax outputs a zero, the `softmax()` that is internal to pytorch’s `CrossEntropyLoss` will protect `CrossEntropyLoss`’s `log()` and you won’t get that `inf`. But it’s still a mistake, because pytorch’s `CrossEntropyLoss` doesn’t work properly when passed probabilities.
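
Here is a small illustration of that protection, and of why the result is no longer what you want (the probabilities get pushed through a second softmax):

```
import torch

p = torch.tensor([0.0, 0.25, 0.75])   # a sparsemax-style output with an exact zero
print(p.log())                        # contains -inf
print(p.log_softmax(dim=0))           # finite everywhere, but no longer log(p)
```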

Or you can pass the output of sparsemax to a version of cross entropy that accepts probabilities. Internally such a cross-entropy function will take the `log()` of its inputs (because that is how it is defined). But now when you pass in a probability that is exactly zero (rather than just very small), you will get an `inf` that breaks things unless you stand on your head to patch it up somehow.

The key problem is that sparsemax *is designed* to output probabilities that are frequently exactly zero, whereas a cross-entropy loss doesn’t work with probabilities that are exactly zero. (Mathematically, you could say that cross entropy isn’t defined when an input probability is zero, or you could say that it’s defined to be `inf`, but either way, your training breaks.)

Well, this depends on what you mean by “adapted.” But if your “adapted version” remains close to a standard cross-entropy function (for example, by adding epsilons or clamping probabilities away from zero), it won’t train well.

The original sparsemax paper proposes a sparsemax-loss function (that I would say is *not* just an “adapted” version of cross entropy). I’ve never tried it, but it seems conceptually sound and is likely to be better than trying to patch over the inherent incompatibility between cross entropy and exactly-zero probabilities.
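
For what it’s worth, here is a rough, untested sketch of how such a sparsemax loss might look in pytorch, based on my reading of the paper (the function names and details are my own and should be checked against the paper before you rely on them):

```
import torch

def sparsemax(z):
    # euclidean projection of logits z (shape (nBatch, nClass)) onto the probability simplex
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cumsum = z_sorted.cumsum(dim=-1)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    support = 1.0 + k * z_sorted > cumsum            # which sorted entries are in the support
    k_z = support.sum(dim=-1, keepdim=True)          # support size k(z)
    tau = (cumsum.gather(-1, k_z - 1) - 1.0) / k_z   # threshold tau(z)
    return torch.clamp(z - tau, min=0.0)

def sparsemax_loss(z, target):
    # sparsemax loss of Martins & Astudillo (2016); z: (nBatch, nClass), target: (nBatch,)
    p = sparsemax(z)
    z_y = z.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    # on the support p_j = z_j - tau, so p_j * (2 * z_j - p_j) = z_j**2 - tau**2;
    # off the support p_j = 0, so the term vanishes
    support_term = 0.5 * (p * (2.0 * z - p)).sum(dim=-1)
    return (-z_y + support_term + 0.5).mean()
```

You would then call `sparsemax_loss(model(input), target)` on the raw logits in place of `torch.nn.CrossEntropyLoss()`, with no separate sparsemax layer in the model.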

Best.

K. Frank

Dear K. Frank

Once again, thank you very much for your informative answer.

Unfortunately, I could not find any implementation of the sparsemax loss function that could be easily exploited.

By “adapted” I mean skipping the log-softmax part and only running the -∑ ground_truth_i * p_i part.

Regards