At first the soft_old_output is the same as soft_new_output, so the gradient of soft_new_output is all zeros. That is correct.
But when I try to use some of the values to do it, like this:

The small absolute error is most likely caused by the limited numerical precision of float32; you should get a smaller error using float64.
Also note that torch.log(torch.softmax(...)) is numerically less stable than F.log_softmax, as the latter applies the log-sum-exp trick to increase stability.
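To illustrate the difference, here is a minimal sketch (the logit values are made up to force an underflow):

```python
import torch
import torch.nn.functional as F

# With a large spread in the logits, softmax underflows to exactly 0.0
# for the smallest entry in float32, so torch.log(...) returns -inf.
logits = torch.tensor([100.0, 0.0, -100.0])

naive = torch.log(torch.softmax(logits, dim=0))
stable = F.log_softmax(logits, dim=0)

print(naive)   # last entry is -inf due to log(0)
print(stable)  # all entries finite thanks to the log-sum-exp trick
```

F.log_softmax computes x - logsumexp(x) directly, which never takes the log of an underflowed probability.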

Thank you very much~
And I have another question now. While trying to find the problem myself, I tried to track the gradient changes, but I don't know where to find the grad_fn of each variable. Where can I find them? I mean, I want to know how they work so that I can analyze which part leads to this problem, but I can't find them.

I know that, but what does MulBackward0 do? That one is easy to understand, but what about the other grad_fns? Do they just work the way the mathematical formulas say, or do they use any tricks? I want to find their code and have a look.

Yes, they should be implemented using the correct mathematical derivatives.
You can find the implementations in derivatives.yaml, where these methods are either defined directly or point to the name of the implementation.
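From Python you can also walk the backward graph directly via grad_fn and next_functions. A minimal sketch (the tensors here are made up for illustration):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
out = (a * b).exp()

# Every non-leaf tensor stores the backward node that created it.
print(out.grad_fn)                  # ExpBackward0
print(out.grad_fn.next_functions)   # ((MulBackward0, 0),)

# Walking back from MulBackward0 reaches the AccumulateGrad nodes,
# which write the final gradients into a.grad and b.grad.
mul_node = out.grad_fn.next_functions[0][0]
print(mul_node.next_functions)
```

This only shows the graph structure; the actual derivative formulas behind each node are the ones generated from derivatives.yaml.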