How to use the gradient of softmax

0xfede · March 28, 2018, 1:58am

I’m trying to understand how to use the gradient of softmax.

For example, if I had an input x = [1,2] to a Sigmoid activation instead (let’s call it SIG), the forward pass would return the vector [1/(1+e^-1), 1/(1+e^-2)] and the backward pass would return dSIG/dx = [dSIG/dx1, dSIG/dx2] = [SIG(1)(1-SIG(1)), SIG(2)(1-SIG(2))]. That is, the gradient of Sigmoid with respect to x has the same number of components as x.
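To make the sigmoid case concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `sigmoid_backward` and the `upstream` argument (the gradient arriving from the next layer) are my own assumptions for illustration, not something from a particular library.

```python
import math

def sigmoid(x):
    """Element-wise logistic sigmoid: 1 / (1 + e^(-x))."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def sigmoid_backward(x, upstream):
    """Gradient w.r.t. x, given the upstream gradient (same length as x).

    Each output depends only on its own input, so the Jacobian is
    diagonal and the backward pass is an element-wise product.
    """
    s = sigmoid(x)
    return [g * si * (1.0 - si) for g, si in zip(upstream, s)]

x = [1.0, 2.0]
out = sigmoid(x)                         # forward pass
grad = sigmoid_backward(x, [1.0, 1.0])   # upstream of ones gives SIG * (1 - SIG)
print(out, grad)
```

With an upstream gradient of all ones, `grad` is exactly [SIG(1)(1-SIG(1)), SIG(2)(1-SIG(2))], one component per input.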

But with the softmax (let’s call it SMAX), the gradient dSMAX(x_i)/dx_j is usually defined as SMAX(i)*(1-SMAX(j)) if i = j, else -SMAX(i) * SMAX(j). I understand this as meaning that softmax returns the full Jacobian matrix:
[[dSMAX(x_1)/dx_1, dSMAX(x_1)/dx_2], [dSMAX(x_2)/dx_1, dSMAX(x_2)/dx_2]].

These are 4 components instead of 2 but x has 2 components only. So should I be summing dSMAX(x_1)/dx_1 + dSMAX(x_2)/dx_1 and dSMAX(x_1)/dx_2 + dSMAX(x_2)/dx_2?
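One way to look at this question numerically is to build the 2x2 Jacobian and compute those sums directly. This is only a sketch in plain Python; `softmax` and `softmax_jacobian` are hypothetical helper names, and the layout J[i][j] = dSMAX(x_i)/dx_j is the convention assumed here.

```python
import math

def softmax(x):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """J[i][j] = dSMAX(x_i)/dx_j = s_i * (delta_ij - s_j)."""
    s = softmax(x)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

x = [1.0, 2.0]
J = softmax_jacobian(x)

# The sums asked about: for each input j, add up dSMAX(x_i)/dx_j over all i.
column_sums = [sum(J[i][j] for i in range(len(x))) for j in range(len(x))]
print(J)
print(column_sums)
```

Because the softmax outputs always add up to 1, each of these unweighted sums comes out as zero (up to rounding), so summing the entries without weighting them by a gradient from the next layer carries no information.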

hughperkins · March 28, 2018, 3:13am

Softmax has the same number of elements in the input and output vector. The i and the j bit is because each output element doesn't depend just on the single corresponding input element, as with sigmoid, but on all the input elements. Without looking at all the input elements, how else could it normalize itself? Therefore, each output element is a function of all the input elements, and hence the gradients flow from each output element to all of the input elements.
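As a rough sketch of what "gradients flow from each output element to all of the input elements" looks like in code: the backward pass weights each Jacobian entry by the gradient arriving at that output and sums over outputs, which gives one number per input. The names `softmax_backward` and `upstream` are assumptions made for this example, written in plain Python.

```python
import math

def softmax(x):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(x, upstream):
    """grad_j = sum_i upstream_i * dSMAX(x_i)/dx_j.

    With dSMAX(x_i)/dx_j = s_i * (delta_ij - s_j) this simplifies to
    grad_j = s_j * (upstream_j - sum_i upstream_i * s_i).
    """
    s = softmax(x)
    dot = sum(g * si for g, si in zip(upstream, s))
    return [sj * (gj - dot) for sj, gj in zip(s, upstream)]

x = [1.0, 2.0]
# Hypothetical upstream gradient: only the first output affects the loss.
grad = softmax_backward(x, [1.0, 0.0])
print(grad)
```

Even though only the first output receives a nonzero upstream gradient here, both inputs get a nonzero gradient, and the result has the same number of components as x.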

I wrote this rambling kind of thing also at https://stats.stackexchange.com/questions/215521/how-to-find-derivative-of-softmax-function-for-the-purpose-of-gradient-descent/328095#328095

0xfede · April 3, 2018, 7:59am

and hence the gradients flow from each output element to all of the input elements

I think by this you mean that the real ‘derivative’ would be the full Jacobian matrix. That is, each output with respect to each input (all combinations).

After working through some more derivatives on paper, however, I found that in practice you only want each output with respect to its corresponding input. That is, for each smax(x_i), you want dsmax(x_i)/dx_i, which is the first case of the piecewise version of the Jacobian: smax(x_i) * (1-smax(x_i)).