I want to reimplement Softmax so I can customize it.
I followed this post by ptrblck.
There is a lot of discussion about numeric stability (see here for example). Is this the case in the provided solution?
Why is it necessary to subtract the max of x?
Thanks for your help!
This is exactly for numerical stability. After subtracting the max, x is non-positive, so exp(x) can never overflow. The max entry contributes exp(0) = 1 (a constant) to the denominator, so the sum is always at least 1. Underflow is the more benign failure: the worst-case output is 0 instead of infinity.
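To make this concrete, here is a minimal sketch in plain Python (standard library only, not PyTorch's actual implementation). The function names `naive_softmax` and `stable_softmax` are made up for illustration. The shift is safe because exp(x_i - c) / sum_j exp(x_j - c) = exp(x_i) / sum_j exp(x_j) for any constant c: the common factor exp(-c) cancels.

```python
import math


def naive_softmax(xs):
    # Direct definition: exp(x_i) / sum_j exp(x_j).
    # math.exp raises OverflowError once x is too large for a float.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(xs):
    # Shift by the max so every exponent is <= 0.
    m = max(xs)
    # Each term is at most 1; very negative exponents underflow to 0.0.
    exps = [math.exp(x - m) for x in xs]
    # The max entry contributes exp(0) = 1, so total >= 1 (no division by zero).
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 10.0, 1000.0]

try:
    naive_softmax(x)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = stable_softmax(x)
print(naive_overflowed)  # True
print(probs)             # [0.0, 0.0, 1.0]
```

In plain Python the naive version fails with an exception. Tensor libraries typically return inf from exp instead, and the following inf / inf division then yields nan, which is harder to notice.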
softmax(x) = exp(x) / sum(exp(x)). If we have x = [1, 10, 1000, 10000, 10000000], exp(x) would be too large for the computer to store (it may return inf). After subtracting the max of x, x is in the interval (-inf, 0] and
exp(x) is in the interval