I’m working on a system in which I want the network to predict one of the classes with a very high score and the others a very low score. These scores are attention weights, and they select different sections of the network downstream, so I want the network to choose a single downstream section at once and not a mixture of different ones. I have lowered the temperature on the softmax, but I am wondering how to implement an additional loss to incentivize the network to favor a single class, and not a mixture of many. Please let me know if you need any clarification.
Part of me wonders whether this is a good idea. You’re not able
to train your network to make these predictions with confidence,
so you want to fake it, and I wonder if this isn’t a little fishy.
Having said that:
I would have thought that tweaking the softmax temperature would
help. How did that not give you what you were looking for?
Working with probabilities (after the softmax, not logits), you could
simply take new_preds = torch.nn.functional.softmax(alpha * old_preds, dim=-1),
with alpha > 0. Increasing alpha will drive the largest of your predicted
probabilities toward 1, and the others toward 0.
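As a rough illustration of that sharpening step, here is a minimal sketch. The helper name sharpen, the example probabilities, and the choice of dim=-1 (classes on the last axis) are assumptions made for the example, not part of the original suggestion.

```python
import torch
import torch.nn.functional as F


def sharpen(probs: torch.Tensor, alpha: float) -> torch.Tensor:
    """Re-apply softmax to alpha-scaled probabilities (hypothetical helper).

    Assumes the class dimension is the last one.
    """
    return F.softmax(alpha * probs, dim=-1)


# Example probabilities, made up for illustration.
old_preds = torch.tensor([0.5, 0.3, 0.2])

# Larger alpha concentrates more of the mass on the largest entry.
for alpha in (1.0, 10.0, 100.0):
    print(alpha, sharpen(old_preds, alpha))
```

Note that alpha = 1 here actually flattens the distribution relative to old_preds, because the inputs are probabilities in [0, 1] rather than logits; the sharpening only appears once alpha is fairly large.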
If you really want to add a loss term:
Thinking in terms of probabilities, for a problem with nClass classes,
all probabilities being equal to 1 / nClass is the least “confident”
prediction, while one probability being 1 and the others 0 is the
most “confident.” So we build a function of each individual probability, p, that peaks at 1 / nClass and goes to zero at 0 and 1.
First, p - p**2 peaks at p = 1/2 and goes to zero at 0 and 1.
Next, p**(log (2) / log (nClass)) maps the interval [0.0, 1.0] to itself (monotonically), while mapping (1 / nClass) to 1/2.
So, for a tensor of predicted probabilities, p, set q = p**(log (2) / log (nClass)) and calculate your additional loss as loss = alpha * (q - q**2).sum(), where alpha just tunes how
strong you want this effect to be. Minimizing this term pushes each probability away from 1 / nClass and toward 0 or 1.
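A minimal sketch of this penalty in PyTorch might look like the following. The function name confidence_penalty, the eps clamp, and the assumption that classes lie on the last dimension (with at least two classes) are illustrative choices, not part of the recipe above. The clamp is there because the exponent is below 1 whenever nClass > 2, so the gradient of p**exponent is infinite at exactly p = 0.

```python
import math

import torch


def confidence_penalty(probs: torch.Tensor, alpha: float = 1.0,
                       eps: float = 1e-8) -> torch.Tensor:
    """Penalty that is largest at p = 1 / nClass and near zero at p = 0 or 1.

    Hypothetical helper: assumes probs holds post-softmax probabilities with
    the classes on the last dimension, and at least two classes.
    """
    n_class = probs.shape[-1]
    exponent = math.log(2.0) / math.log(n_class)
    # Clamp away from 0 (an added assumption) to keep the gradient finite.
    q = probs.clamp_min(eps) ** exponent
    # q - q**2 peaks at q = 1/2, which corresponds to p = 1 / nClass.
    return alpha * (q - q ** 2).sum()


uniform = torch.full((3,), 1.0 / 3.0)      # least confident
mixed = torch.tensor([0.5, 0.3, 0.2])      # somewhere in between
one_hot = torch.tensor([1.0, 0.0, 0.0])    # most confident

for name, p in (("uniform", uniform), ("mixed", mixed), ("one_hot", one_hot)):
    print(name, confidence_penalty(p).item())
```

In training, this would be added to the main objective, for example total_loss = task_loss + confidence_penalty(attn_weights, alpha), where task_loss and attn_weights stand in for whatever your model already produces. Because .sum() runs over every element, the penalty grows with batch size; dividing by the batch size is one way to keep alpha comparable across batch sizes.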
This loss function has the properties you want. As to whether your
overall scheme will work, I don’t know.
I know it sounds odd, but this classification is not the final answer of the network, only an intermediate selection of other downstream parts of the network. I am currently using a lower temperature on the softmax, and it does pretty well, but I don’t want the network to try to compensate for the colder softmax and still achieve a mixture of answers. This is why I want to incentivize the network to stick with one choice and not a mixture. Thanks again for everything!