Hi,
Is there any line in my loss function that breaks backpropagation? After training starts, the loss value stays close to its initial value and does not change significantly.
I could not find why.
Thanks in advance.
I can’t see any line of code, which would detach a tensor from the graph.
If you are concerned about detaching, you could check the .grad attribute of all parameters after the backward call. If they contain valid values, your graph wasn’t detached, and I would recommend trying to overfit a small data sample as a quick test.
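As a rough sketch of both checks, something like the following could work. The model, data, learning rate, and step count are hypothetical stand-ins and not taken from the original post, so swap in your own model and loss function.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and data; replace with your own.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(8, 10)
target = torch.randint(0, 2, (8,))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 1) Check that every parameter received a gradient after backward().
loss = criterion(model(x), target)
loss.backward()
missing = [name for name, p in model.named_parameters() if p.grad is None]
print("parameters without a gradient:", missing)

# 2) Try to overfit the small batch; the loss should drop clearly.
first_loss = loss.item()
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), target)
    loss.backward()
    optimizer.step()
last_loss = loss.item()
print(f"loss: {first_loss:.4f} -> {last_loss:.4f}")
```

If the list of parameters without a gradient is not empty, or the loss barely moves even on a handful of samples, the loss function or the update step is the first place to look.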
It returns None. Is it normal?
I checked it with a predefined loss function. The training loss decreases as expected, but loss.grad is still None.
Should I check .grad on the layers of the network model? I don’t understand what "all parameters" means.
Also, my dataset is small: 500 training, 250 validation, and 250 test samples.
When I use the big dataset, the loss value still does not decrease.
The .grad attribute is retained only for leaf tensors by default.
If you want to print it for the loss, you would need to call loss.retain_grad() before calling loss.backward().
However, note that this gradient will be 1. by default if you didn’t pass a manual gradient argument to loss.backward(gradient=).
Here is a small example:
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18()
x = torch.randn(1, 3, 224, 224)
target = torch.zeros(1).long()
criterion = nn.CrossEntropyLoss()
out = model(x)
loss = criterion(out, target)
loss.retain_grad() # use this to print the grad
loss.backward()
print(loss.grad)
> tensor(1.)
To print the gradient of all parameters, you could use this code snippet after calling backward:
# print grads of all parameters
for name, param in model.named_parameters():
    print(name, param.grad.abs().max())
@ptrblck hi again,
As you said, loss.grad returns 1. However, there seem to be some problems with the gradients of the parameters.
All parameters keep the same weight and bias values regardless of the epoch, except the FC.bias tensor.
At least, they look the same, but I used your code that was mentioned here:
Many parameters were printed, which means there are some changes between the new and old state, but they are very small, for example:
That might be the reason. As explained in the other topic, the sign method could kill the gradients.
You could try to increase the learning rate and compare how large the updates would get (you can of course also increase the learning rate for a specific parameter set only).
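A minimal sketch of a per-parameter learning rate using optimizer parameter groups is shown here. The two-part model, the names "features" and "fc", and the learning rate values are made up for illustration and are not from the original post.

```python
import torch
import torch.nn as nn

# Hypothetical two-part model; "features" and "fc" are illustrative names.
model = nn.ModuleDict({
    "features": nn.Linear(10, 16),
    "fc": nn.Linear(16, 2),
})

optimizer = torch.optim.SGD(
    [
        {"params": model["features"].parameters(), "lr": 1e-1},  # larger lr
        {"params": model["fc"].parameters()},  # falls back to the default lr
    ],
    lr=1e-3,
)

x = torch.randn(4, 10)
target = torch.randint(0, 2, (4,))
before = {n: p.detach().clone() for n, p in model.named_parameters()}

out = model["fc"](torch.relu(model["features"](x)))
loss = nn.functional.cross_entropy(out, target)
loss.backward()
optimizer.step()

# Compare how large the update of each parameter was.
for name, p in model.named_parameters():
    print(name, (p.detach() - before[name]).abs().max().item())
```

Printing the maximum absolute change per parameter after one step gives a quick impression of whether the updates are large enough to matter.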
Alternatively, you could try to use smooth approximations of the sign function, which wouldn’t yield a zero gradient.
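To illustrate the difference, here is a small sketch that compares the gradient of torch.sign with that of tanh(k * x), which is one possible smooth approximation. The choice of tanh and the sharpness factor k are assumptions for this example, not something prescribed in the thread.

```python
import torch

x = torch.tensor([-0.5, -0.1, 0.1, 0.5], requires_grad=True)

# torch.sign has a zero derivative everywhere, so nothing flows back to x.
torch.sign(x).sum().backward()
sign_grad = x.grad.clone()
print(sign_grad)

x.grad = None  # reset the accumulated gradient before the second test

# tanh(k * x) approaches sign(x) as k grows, but keeps a non-zero gradient.
k = 5.0  # hypothetical sharpness factor
torch.tanh(k * x).sum().backward()
tanh_grad = x.grad.clone()
print(tanh_grad)
```

Keep in mind that tanh saturates: for large values of |k * x| its gradient becomes very small again, so k is a trade-off between closeness to sign and usable gradient magnitude.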