Here is my code; it's similar to the official example.

```python
import torch
import torch.nn.functional as F

# Note: log_softmax followed by nll_loss is equivalent to nn.CrossEntropyLoss
feature = torch.ones(3, 5, requires_grad=True)        # input logits z
target = torch.empty(3, dtype=torch.long).random_(5)  # random class labels
output_sm = F.log_softmax(feature, dim=1)
output_nll = F.nll_loss(output_sm, target)
output = output_nll
output.backward()
print("Input:\n", feature)
print("Target:\n", target)
print("Gradient in feature:\n", feature.grad)
```
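To compare the two sides directly, here is a small check of my own (a sketch, not part of the original script; the seed is arbitrary) that prints the autograd gradient next to the hand-computed `p - y`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed so the run is reproducible
feature = torch.ones(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)

F.nll_loss(F.log_softmax(feature, dim=1), target).backward()

p = F.softmax(feature, dim=1)                 # softmax output (uniform 0.2 here)
y = F.one_hot(target, num_classes=5).float()  # one-hot version of the target
print("p - y:\n", (p - y).detach())
print("autograd gradient:\n", feature.grad)
```

With uniform logits, `p` is 0.2 everywhere, so `p - y` has 0.2 in every column except −0.8 at the target index; note that `F.nll_loss` uses `reduction='mean'` by default, which averages the loss over the batch of 3, so the autograd gradient comes out as `(p - y) / 3`.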

The picture below shows my results.

The result of the theoretical deduction is as follows:

∂L/∂z = p − y

where L is the loss function, z is the input feature (the logits), p is the output of softmax, and y is the one-hot target.

I won't include the detailed derivation of this formula here, since it is easy to obtain.
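For a single example the formula can be checked directly with autograd; a minimal sketch (the logit values are chosen arbitrarily), where no averaging over a batch is involved:

```python
import torch
import torch.nn.functional as F

# single example (batch size 1), so the mean reduction changes nothing
z = torch.tensor([[1.0, 2.0, 3.0]], requires_grad=True)
y = torch.tensor([2])

F.nll_loss(F.log_softmax(z, dim=1), y).backward()

p = torch.softmax(z, dim=1).detach()          # softmax output
y_onehot = F.one_hot(y, num_classes=3).float()
print(z.grad)        # should match p - y for a single example
print(p - y_onehot)
```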

What confuses me is that the experimental results do not match the theoretical values.

Can you give me some suggestions? Thank you!