loss.grad is constant 1

Hi,

I am using torch.nn.BCELoss() (and have also tried F.binary_cross_entropy_with_logits), but loss.grad is always 1.

Do you know why this happens and how to solve it?
Thanks!

        optimizer.zero_grad()
        A_pred, z = model(features, adj_norm)

        # per-element weights for BCELoss
        weight_tensor = torch.tensor(weight_tensor, dtype=torch.float).cuda()

        criterion = torch.nn.BCELoss(weight=weight_tensor)
        log_lik = norm * criterion(A_pred.view(-1), adj_label.to_dense().view(-1))
        loss = log_lik
        loss.retain_grad()
        loss.backward()
        print(loss.grad)  # always shows 1
        optimizer.step()

Why is this a problem?

I tried four different models, but none of them is learning… then I checked the loss gradient and saw that it is constant 1. I assumed it would at least change if the model were training properly.

Why do you call loss.retain_grad() ?

Without retain_grad(), the loss gradient is None. So I added retain_grad(), and then the gradient always shows 1.

From what I understand, loss.retain_grad() initializes loss.grad to 1. This is why you get 1 as the value of loss.grad. (Edit: this understanding was wrong; see my comment below.)

This has no relation to whether your network is learning or not, since (as far as I understand) calling loss.backward() has no effect on loss.grad.

A better way to figure out if your network is learning is to print out the value of loss and see whether it changes over iterations.
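
For example, something along these lines (a self-contained toy setup with a stand-in linear model and MSE loss, not your actual code) makes it easy to see whether the loss is trending down:

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(64, 10)
y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(64, 1)

for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        # loss.item() gives the scalar value; it should trend downward if the model is learning
        print(f"epoch {epoch}: loss = {loss.item():.4f}")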

For the sake of completeness:
The backward() operation will use the default gradient value of 1 for scalar loss values (for other shapes, you need to manually pass the gradient to backward).
Here is a small example showing that loss.grad is defined by the passed gradient:

import torch
from torch import nn

model = nn.Linear(1, 1)
x = torch.randn(1, 1)

# default gradient
out = model(x)
loss = out**2
loss.retain_grad()

loss.backward()
print(loss.grad)
> tensor([[1.]])

# pass gradient manually
out = model(x)
loss = out**2
loss.retain_grad()

loss.backward(torch.tensor(10.).view_as(loss))
print(loss.grad)
> tensor([[10.]])
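
To also illustrate the "other shapes" part: calling backward() on a non-scalar loss fails unless you pass the gradient explicitly (small self-contained example, redefining the toy model so it runs on its own):

import torch
from torch import nn

model = nn.Linear(1, 1)
out = model(torch.randn(2, 1))
loss = out**2                          # shape [2, 1] -> more than one element

try:
    loss.backward()                    # RuntimeError: grad can be implicitly created only for scalar outputs
except RuntimeError as e:
    print(e)

loss.backward(torch.ones_like(loss))   # works once the gradient is passed explicitly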

Besides that, I agree with @gphilip that checking the loss.grad value is not particularly useful for telling whether the model is learning.
As a debugging step I would also suggest checking the .grad attribute of all parameters of the model and making sure they show valid (or the expected) values.
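
E.g. a minimal sketch of such a check (the tiny linear model is just a stand-in for your actual model), run after loss.backward() and before optimizer.step():

import torch
from torch import nn

model = nn.Linear(4, 1)                          # stand-in for your model
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

for name, param in model.named_parameters():
    if param.grad is None:
        print(f"{name}: grad is None (parameter not reached by backward?)")
    else:
        print(f"{name}: grad norm = {param.grad.norm().item():.3e}")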

To complete my understanding, based on what @ptrblck said: loss.retain_grad() initializes loss.grad to None, not to 1 as I said above. It is the subsequent call to loss.backward() that sets the grad to 1. This is analogous to what happens when we specify requires_grad=True: the grad value is initially set to None, and is set to numerical values after a call to backward().

Here is a mini-tutorial in code that I wrote to get my head around these concepts:

import torch
from torch import nn

model = nn.Linear(1, 1)

print("#"*7)
print("Case 1")
print("#"*7)

# Case 1: Neither x nor loss saves a gradient, so nothing is
# retained on them after backward()

# This is a leaf, but it does not require grad, so no gradient will
# be saved in x.grad
x = torch.randn(1, 1)

print(x)
print(x.is_leaf) # True
print(x.grad) # None

# Forward pass, compute loss
out = model(x)
loss = out**2

# The following will print None, since we haven't computed any gradients
# yet (and neither of these tensors will ever save one here anyway)
print(loss.grad) # None
print(x.grad) # None


# This computes the gradients of loss w.r.t the leaf tensors that
# require grad (here, only the model's parameters; x does not require
# grad, so no gradient is computed for it). Since loss is a 1-element
# tensor, backward() starts from an implicit gradient of
# torch.ones_like(loss). That value is not stored, because loss is a
# non-leaf and we did not call retain_grad(). So loss.grad stays None.
loss.backward()

# The following will print None: loss does not retain its grad, and
# no gradient was computed for x since it does not require grad
print(loss.grad) # None
print(x.grad) # None

print("")
print("*"*7)
print("Case 2")
print("*"*7)
# Case 2: The leaf tensor requires grad, so its gradient is saved
# after backward()

# This is a leaf, and it requires grad because we said so.
# Its grad attribute starts out as None.
y = torch.randn(1, 1, requires_grad=True)

print(y)
print(y.is_leaf) # True
print(y.grad) # None

# Forward pass, compute loss
out = model(y)
loss = out**2

# The following will print None, since we haven't computed any gradients 
# yet
print(loss.grad) # None
print(y.grad) # None

# This computes the gradients of loss w.r.t the leaf tensors that
# require grad (y and the model's parameters). Since loss is a
# 1-element tensor, backward() starts from an implicit gradient of
# torch.ones_like(loss). That value is not stored, because loss is a
# non-leaf and we did not call retain_grad(). So loss.grad stays None.
loss.backward()

# The following will print None: we did compute the gradient of loss
# w.r.t y, but loss itself does not retain its grad
print(loss.grad) # None

# This will print a [1, 1] float tensor, which is the gradient of 
# loss w.r.t y 
print(y.grad) 

print("")
print("*"*7)
print("Case 3")
print("*"*7)
# Case 3: Both the leaf tensor and the loss (which is not a leaf)
# retain their gradients, so both are saved after backward()

# This is a leaf, and it requires grad because we said so.
# Its grad attribute starts out as None.
z = torch.randn(1, 1, requires_grad=True)

print(z)
print(z.is_leaf) # True
print(z.grad) # None

# Forward pass, compute loss
out = model(z)
loss = out**2

# Ask autograd to retain the gradient of the non-leaf tensor loss.
loss.retain_grad()

# The following will print None, since we haven't computed any gradients 
# yet
print(loss.grad) # None
print(z.grad) # None

# This computes the gradients of loss w.r.t the leaf tensors that
# require grad (z and the model's parameters). Since loss is a
# 1-element tensor, backward() starts from an implicit gradient of
# torch.ones_like(loss). Because we called retain_grad() on loss,
# that value is saved in loss.grad after the call returns.
loss.backward()

# The following will print a [1, 1] float tensor whose element is 1.0. 
# This is the value which was implicitly set by the call to loss.backward()
print(loss.grad) # [[1.0]]

# This will print a [1, 1] float tensor, which is the gradient of loss 
# w.r.t z 
print(z.grad) 

# Case 4: Same as Case 3, except we explicitly pass a large gradient
# to loss.backward().
# Observe that the grad of the leaf scales proportionately.
print("")
print("*"*7)
print("Case 4")
print("*"*7)

w = torch.randn(1, 1, requires_grad=True)

print(w)
print(w.is_leaf) # True
print(w.grad) # None


out = model(w)
loss = out**2

loss.retain_grad()


print(loss.grad) # None
print(w.grad) # None


loss.backward(torch.tensor(1e5).view_as(loss))


print(loss.grad) # [[100000.0]]
print(w.grad) # 1e5 times the gradient of loss w.r.t w



I guess you may have misunderstood what loss.grad means here. Backprop runs from the loss all the way back to the leaf nodes in the graph, so the starting point is the loss itself. loss.grad is therefore the gradient of the loss w.r.t the loss, which is 1. On top of that, the gradient passed to backward() for a scalar loss also defaults to 1, so you end up with 1 * 1 = 1.
For other nodes it is different. Take, for example:
y = 2 * x
Here x.grad is the gradient of the loss w.r.t x in the computation graph; denote it dL/dx. Then:
dL/dx = dL/dy * dy/dx = dL/dy * 2
For loss.grad, on the other hand, the quantity is dL/dL, and:
dL/dL = 1
Multiplying by the default upstream gradient of 1 still gives 1, as in the quick check below.
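
A quick check of this in code (the value of x and the factor 2 are arbitrary):

import torch

x = torch.tensor([[3.0]], requires_grad=True)
y = 2 * x
loss = y.sum()       # scalar loss, so backward() uses an implicit upstream gradient of 1
loss.retain_grad()
loss.backward()

print(loss.grad)     # tensor(1.)     -> dL/dL = 1
print(x.grad)        # tensor([[2.]]) -> dL/dx = dL/dy * dy/dx = 1 * 2
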
As already mentioned by @ptrblck and @gphilip, if you are trying to debug whether your model is learning or not, you should probably check the loss value itself.

@ptrblck @gphilip @BruceDai003 Thanks for your reply.

I followed your suggestions, but unfortunately it is still not training… I checked the loss and the weights, and they take different values in each epoch, but the model does not learn within 50 epochs. Besides, I also tried different models, architectures, and hyperparameters. The code is at the following link; could you please help check it? I appreciate your help.