Two questions:
There is a lot of discussion about numeric stability (see here for example). Is this the case in the provided solution?
Why is it necessary to subtract the max of x?
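This is exactly for numerical stability. After subtracting the max, every entry of x is non-positive, so exp(x) can never overflow. The largest entry becomes 0, so exp(0) = 1 is always included in the denominator, which is therefore at least 1 and never zero. Underflow is the more benign failure: the worst case is an output of 0 instead of Infinity.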
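To see why the shift does not change the result, here is a short derivation. The symbol c is introduced here for illustration and stands for any constant, in particular c = max(x):

```latex
\[
\frac{e^{x_i - c}}{\sum_j e^{x_j - c}}
  = \frac{e^{-c}\, e^{x_i}}{e^{-c} \sum_j e^{x_j}}
  = \frac{e^{x_i}}{\sum_j e^{x_j}},
  \qquad c = \max_j x_j .
\]
```

The factor e^{-c} cancels between numerator and denominator, so the shifted and unshifted softmax are mathematically identical; only the floating-point behaviour differs.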
Note that softmax(x) = exp(x)/sum(exp(x)). If we have x = [1, 10, 1000, 10000, 10000000], exp(x) would be too large for the computer to store in floating point (it may return inf). After subtracting the max of x, every entry of x is in the interval (-inf, 0] and exp(x) is in the interval (0, 1] (very negative entries may underflow to 0 in floating point, which is harmless).
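A minimal sketch of the difference, assuming NumPy and float64 inputs. The function names naive_softmax and stable_softmax are made up for this example and are not taken from the solution being discussed:

```python
import numpy as np


def naive_softmax(x):
    # Direct definition: exp overflows to inf for large entries.
    e = np.exp(x)
    return e / e.sum()


def stable_softmax(x):
    # Shift so the largest entry is 0; every exponent is then <= 0.
    shifted = x - np.max(x)
    e = np.exp(shifted)
    # The denominator contains exp(0) = 1, so it is at least 1.
    return e / e.sum()


x = np.array([1.0, 10.0, 1000.0, 10000.0, 10000000.0])

# Silence the overflow / invalid-value warnings raised by the naive version.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(x)

stable = stable_softmax(x)

print(naive)   # contains nan, because inf / inf is undefined
print(stable)  # all mass on the largest entry, no inf or nan
```

On small inputs such as [1, 2, 3] both versions agree up to rounding, so the shift only matters when the entries are large.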