Hi,

I know that the softmax function outputs probabilities that sum to 1. However, if we give it a probability vector (which already sums to 1), why doesn’t it return the same values? For example, if I input [0.1, 0.8, 0.1] to softmax, it returns [0.2491, 0.5017, 0.2491]. Isn’t this wrong in some sense?

It is because of the way softmax is calculated. When you compute `exp(0.1) / (exp(0.1) + exp(0.8) + exp(0.1))`, the value turns out to be 0.2491.
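As a quick check, here is a minimal softmax sketch in plain Python that reproduces those numbers:

```python
import math

def softmax(xs):
    # Exponentiate each entry, then normalize by the sum of exponentials.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.1, 0.8, 0.1])
print([round(p, 4) for p in probs])  # [0.2491, 0.5017, 0.2491]
```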

Thanks for the answer. Yes, that much I know. But my question is: isn’t it wrong in some sense?

Softmax is an activation function. Its purpose is not just to ensure that the values are normalized (or rescaled) to sum to 1, but also to allow the output to be used as input to a cross-entropy loss (hence the function needs to be differentiable).

For your case, the inputs can be arbitrary values (not necessarily probability vectors). It is possible that there’s a mix of positive and negative values which still sum to 1 (e.g., [0.3, 0.9, -0.2]).

Softmax rescales the values softly while preserving which class has the highest value, hence the name ‘soft’-‘max’.
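To illustrate the point about arbitrary inputs, here is a small PyTorch example (the vector below has a negative entry but still sums to 1.0):

```python
import torch

# A vector with a negative entry that still sums to 1.0.
x = torch.tensor([0.3, 0.9, -0.2])
probs = torch.softmax(x, dim=0)

print(probs)           # all entries are positive
print(probs.sum())     # sums to 1
print(probs.argmax())  # the largest input is still the largest output
```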

Hi mbehzad!

Well, I suppose it depends on what your expectations are …

But you might wish to base your expectations on some other functions:

`x**2` maps `(-inf, inf)` to `[0.0, inf)`, but we don’t expect `x**2 = x` to hold true for `x >= 0.0`, that is, for values of x in `[0.0, inf)`.

Or, back in the pytorch activation-function world, `torch.sigmoid()` maps `(-inf, inf)` to `(0.0, 1.0)`, but `torch.sigmoid(torch.sigmoid(x))` isn’t equal to `torch.sigmoid(x)`.
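You can see this non-idempotence directly:

```python
import torch

x = torch.linspace(-2.0, 2.0, 5)
once = torch.sigmoid(x)
twice = torch.sigmoid(torch.sigmoid(x))

# sigmoid maps everything into (0, 1), so applying it a second time
# squashes the values into sigmoid((0, 1)), roughly (0.5, 0.731).
print(once)
print(twice)
print(torch.allclose(once, twice))  # False
```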

Here’s another thing to consider:

`softmax([0.0 + delta, 1.0 - delta])`

How would you like `softmax()` to behave when a negative `delta` becomes zero and then crosses over to become positive? Bear in mind, you want this behavior to be usefully differentiable to support backpropagation.
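A short sketch of that thought experiment: sweep `delta` across zero and check, via autograd, that softmax’s output and gradient stay smooth and finite (the helper `two_class_softmax` below is just for illustration):

```python
import torch

def two_class_softmax(d):
    # softmax of [0 + delta, 1 - delta], with autograd tracking delta.
    delta = torch.tensor(d, requires_grad=True)
    probs = torch.softmax(torch.stack([0.0 + delta, 1.0 - delta]), dim=0)
    probs[0].backward()
    return probs.detach(), delta.grad

# As delta crosses zero, the output and its gradient change smoothly,
# with no kink -- unlike a function that returned probability vectors unchanged.
for d in [-0.1, 0.0, 0.1]:
    probs, grad = two_class_softmax(d)
    print(f"delta={d:+.1f}  probs={probs}  d(probs[0])/d(delta)={grad}")
```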

K. Frank