How can I calculate the correct softmax gradient?

How does PyTorch compute the gradient of the input to softmax? I followed this link https://community.deeplearning.ai/t/calculating-gradient-of-softmax-function/1897/3 to implement the derivative function myself, but the gradients it produces are all 0. How do I correctly propagate the gradient through a softmax layer?

import torch

def softmax_forward(x):
    # numerically stable softmax along dim=1
    e_x = torch.exp(x - torch.amax(x, dim=1, keepdim=True))
    softmax = e_x / torch.sum(e_x, dim=1, keepdim=True)
    return softmax

def softmax_backward(softmax, grad_output):
    # build the softmax Jacobian diag(s) - s^T s, then do a vector-Jacobian product
    shape = softmax.shape
    softmax = torch.reshape(softmax, (1, -1))
    grad_output = torch.reshape(grad_output, (1, -1))
    d_softmax = softmax * torch.eye(softmax.numel()) - softmax.T @ softmax
    grad = (grad_output @ d_softmax).reshape(shape)
    return grad

data = torch.tensor([1,2,3,4], dtype=torch.float32, requires_grad=True).view(1, 4)
data.retain_grad()
out = torch.softmax(data, dim=-1)
out.retain_grad()

sim_out = softmax_forward(data)
print(f"out_diff: {torch.abs(sim_out - out).max()}")

y = torch.sum(out, dim=-1)
y.backward()

print(f"data_grad: {data.grad}“)
sim_grad = softmax_backward(out, out.grad)
print(f"grad_diff: {torch.abs(sim_grad - data.grad).max()}”)

out_diff: 0.0
data_grad: tensor([[0., 0., 0., 0.]])
grad_diff: 2.9802322387695312e-08
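
(For reference, here is a small extra check that is not part of my script above: it compares the matrix d_softmax built inside softmax_backward against the Jacobian that autograd computes with torch.autograd.functional.jacobian.)

import torch

# the matrix built in softmax_backward should equal the Jacobian
# ds_i/dx_j = s_i * (delta_ij - s_j) that autograd computes
x = torch.tensor([1., 2., 3., 4.])
s = torch.softmax(x, dim=-1).reshape(1, -1)
manual_jac = s * torch.eye(s.numel()) - s.T @ s
auto_jac = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=-1), x)
print(torch.allclose(manual_jac, auto_jac, atol=1e-6))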

Hi Hao!

I haven’t looked at the details of your code, but softmax() has a property that
will cause your particular gradients to be zero. Namely, softmax() returns a
set of probabilities that sum to one, which, being a constant, has zero gradient.

Consider:

>>> torch.softmax (torch.tensor ([1., 2., 3., 4.]), dim  = -1)
tensor([0.0321, 0.0871, 0.2369, 0.6439])
>>> torch.softmax (torch.tensor ([1., 2., 3., 4.]), dim  = -1).sum()
tensor(1.)

Instead of calling backward() on the sum of out, you might try calling it on a single
element, e.g. out[0, 0].backward(). An individual component of the result of softmax()
is not a constant, so you will get a non-trivial gradient.
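
For example, a quick sketch along those lines (reusing the softmax_backward helper from
your post, and picking out the same component with a one-hot upstream gradient) might
look like this:

import torch

data = torch.tensor([[1., 2., 3., 4.]], requires_grad=True)
out = torch.softmax(data, dim=-1)
out[0, 0].backward()              # a single component is not a constant
print(data.grad)                  # non-zero: s_0 * (delta_0j - s_j)

# same thing via your manual backward, with a one-hot grad_output
grad_output = torch.zeros(1, 4)
grad_output[0, 0] = 1.0
print(softmax_backward(out.detach(), grad_output))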

Best.

K. Frank

Thank you very much for your reply.
Yes, that is exactly the situation I was describing. After calling backward() through the softmax function, I find that the gradient of the softmax input is always 0 when I differentiate the sum of the softmax output. So my question is: if softmax is an intermediate layer in a network, how is its gradient computed? Is the gradient only taken with respect to the maximum value, or is there some other method? How is this part implemented in PyTorch?

Hi Hao!

This isn’t true.

First note that applying softmax() to, say, a one-dimensional tensor returns a
one-dimensional tensor. You can’t compute a gradient of a tensor (of length greater
than one), so you have to take the gradient of some scalar function of the tensor.
If that scalar function happens to be sum(), then the result will be 1.0, a constant,
so the gradient of that constant will be zero.

However, if you use some other scalar function that doesn’t always return 1.0, you
will get a non-zero gradient, as you expect.

Consider:

>>> import torch
>>> torch.__version__
'2.5.1'
>>> t = torch.arange (5.0, requires_grad = True)
>>> (t.softmax (dim = 0)**2).sum().backward()
>>> t.grad
tensor([-0.0106, -0.0277, -0.0658, -0.1097,  0.2139])

(Again, the point is that softmax() has the special property that the result of
softmax() sums to one. It’s the fact that you are summing the result of softmax(),
instead of doing something else with it, that leads to the zero gradient.)
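
As for softmax() sitting in the middle of a network: autograd chains the vector-Jacobian
products through it automatically, so you don't have to do anything special. A rough
sketch (with made-up layer sizes and targets, just for illustration):

import torch

torch.manual_seed(0)
lin = torch.nn.Linear(4, 3)             # some layer feeding into softmax
x = torch.randn(2, 4)
target = torch.tensor([0, 2])

probs = torch.softmax(lin(x), dim=-1)   # softmax as an intermediate step
loss = -torch.log(probs[torch.arange(2), target]).mean()   # negative log-likelihood
loss.backward()
print(lin.weight.grad)                  # non-zero; autograd backpropagates through softmax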

Best.

K. Frank

Thank you very much for your answer. After your suggestion, I realized that because I called sum() every time, the result was always the constant 1, and so the gradient of the input tensor was 0. I just reproduced your code and the gradient is indeed non-zero. Thank you very much, and I wish you a happy life~