How to print the computed gradient values for a network

I want to print the gradient values before and after doing backpropagation, but I have no idea how to do it.

If I do loss.grad, it gives me None.

Can I get the gradient for each weight in the model (with respect to that weight)?

Sample code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.conv11 = nn.Conv2d(3, 64, 3, padding=1)
        self.pool1 = nn.AvgPool2d(2, 2)

        self.conv21 = nn.Conv2d(64, 64*2, 3, padding=1)
        self.pool2 = nn.AvgPool2d(2, 2)

        self.conv52 = nn.Conv2d(64*2, 10, 1)
        self.pool5 = nn.AvgPool2d(8, 8)
        
    def forward(self, x):
        
        x = F.relu(self.conv11(x))
        x = self.pool1(x)

        x = F.relu(self.conv21(x))
        x = self.pool2(x)
        
        x = self.conv52(x)
        x = self.pool5(x)
        
        x = x.view(-1, 10)
        return x
    

net = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
net.to(device)
inputs = torch.rand(4,3,32,32)
labels = torch.rand(4)*10//5
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
inputs = inputs.to(device)
labels = labels.to(device)

outputs = net(inputs)

loss = criterion(outputs, labels.long())

print(loss.grad)
loss.backward()
print(loss.grad)

optimizer.step()
 


Before the first backward call, all grad attributes are set to None. After the first backward you should see some gradient values. Thereafter the gradients will be either zero (after optimizer.zero_grad()) or valid values.
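
For example, a minimal sketch reusing the net, criterion, inputs and labels defined in the first post:

net = Net().to(device)
print(net.conv11.weight.grad)                 # None: no backward call yet
loss = criterion(net(inputs), labels.long())
loss.backward()
print(net.conv11.weight.grad)                 # now a tensor filled with gradient values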


I understand, but why is it not showing the gradient values? :confused:
Am I doing something wrong?


# Initialization
net = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(device)
# defining loss
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

# some random inputs and labels
inputs = torch.rand(4,3,32,32)
labels = torch.rand(4)*10//5
inputs, labels= inputs.to(device), labels.to(device)

# zero_grad
net.zero_grad()
optimizer.zero_grad()

outputs = net(inputs)
loss = criterion(outputs, labels.long())
print(loss.data)
print(loss.grad)
loss.backward()
print(loss.grad)
optimizer.step()
print(loss.grad)

output:

tensor(2.3276, device='cuda:0')
None
None
None

Yes, you can get the gradient for each weight in the model w.r.t. that weight, like this:

print(net.conv11.weight.grad) 
print(net.conv21.bias.grad)

The reason loss.grad gives you None is that “loss” is not passed to the optimizer; only “net.parameters()” is:

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

And “loss” is not a leaf node in the computation graph, so you can’t add it to the optimizer directly.
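
As a small side note, one way to print every parameter’s gradient at once is to iterate over net.named_parameters() after the backward call (just a sketch):

for name, param in net.named_parameters():
    print(name, param.grad)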


Sorry for the misunderstanding. I hadn’t realized you would like to see the gradient of your loss.
In that case, @zhl515 is right, and you would need to use hooks to get the gradients w.r.t. some intermediate values (i.e. calculated from leaf variables).
Could you try to add loss.register_hook(lambda grad: print(grad)) before the backward call?


@ptrblck when I put loss.register_hook(lambda grad: print(grad)) before loss.backward(), it gives me tensor(1., device='cuda:0'). Is that what it is supposed to show? With respect to which intermediate value is it computing the gradient?

@zhl515 and @ptrblck
I have a follow up question:

print(net.conv11.weight.grad) 

lets me print the grad values for conv11.weight. If I want to set these gradient values to zero, I thought I could do this:

Temp = net.conv11.weight.grad = net.conv11.weight.grad.clone()
net.conv11.weight.grad = torch.zeros(Temp.size())

but it is throwing

RuntimeError: assigned grad has data of a different type

Can you please let me know your suggestion on that?

thanks

Update:

I noticed that the second issue is solved when I do the following :slight_smile:

net.conv11.weight.grad = torch.zeros(Temp.size()).to(device)
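
An alternative sketch (my suggestion, not from the thread) is torch.zeros_like, which matches the shape, dtype and device of the existing gradient automatically:

net.conv11.weight.grad = torch.zeros_like(net.conv11.weight.grad)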

Yes, as you are calculating dLoss/dLoss = 1. Note that you can also pass this gradient directly to your backward call.
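
For example, a minimal sketch passing this initial gradient explicitly (equivalent to calling loss.backward() on a scalar loss):

loss.backward(gradient=torch.ones_like(loss))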


wow :sunglasses:
so cool!
thanks

The following code made it clearer for me; maybe it helps others too:

from torch import FloatTensor
from torch.autograd import Variable


# Define the leaf nodes
a = Variable(FloatTensor([4]))

weights = [Variable(FloatTensor([i]), requires_grad=True) for i in (2, 5, 9, 7)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d)

L.register_hook(lambda grad: print(grad)) 
d.register_hook(lambda grad: print(grad)) 
b.register_hook(lambda grad: print(grad)) 
c.register_hook(lambda grad: print(grad)) 
b.register_hook(lambda grad: print(grad)) 

L.backward()


for index, weight in enumerate(weights, start=1):
    gradient, *_ = weight.grad.data
    print(f"Gradient of w{index} w.r.t to L: {gradient}")

tensor([1.])
tensor([-1.])
tensor([-7.])
tensor([-9.])
tensor([-9.])
Gradient of w1 w.r.t to L: -36.0
Gradient of w2 w.r.t to L: -28.0
Gradient of w3 w.r.t to L: -8.0
Gradient of w4 w.r.t to L: -20.0


Print the model’s state_dict() keys, then print the value for a specific key.


Hi @ptrblck. Before training we zero out the gradients via the optimizer. What is the need for it? I understand why we do it after updating the weights, but I don’t know why we do it before training.

I would argue it depends a bit on your coding style. We recently had a discussion about it here.
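
For reference, a sketch of the common pattern (generic names, not tied to the code in this thread) that zeroes the gradients at the start of every iteration, which also covers the very first one:

for x, y in train_dataloader:
    optimizer.zero_grad()          # clear gradients from the previous iteration (a no-op on the first)
    loss = criterion(net(x), y)
    loss.backward()                # accumulate fresh gradients
    optimizer.step()               # update the parameters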


This is not a gradient value; in fact it is a parameter value.


Here you print the grad inside the register_hook callback, so how can I keep the grad of d, b and c as a variable?
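
One possible approach (my suggestion, not from the thread) is Tensor.retain_grad(), which makes autograd store the gradient of a non-leaf tensor in its .grad attribute, e.g. in the example above before calling L.backward():

d.retain_grad()
b.retain_grad()
c.retain_grad()
L.backward()
print(d.grad, b.grad, c.grad)   # gradients are now kept on the tensors themselves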

Hey @ptrblck, I know this is an old thread, but I am wondering if there is a way to print gradients even if they’re nested away from the training script. So far I have only been able to get the weights if they are immediately available (sorry for the vague language, I’m not too familiar with the syntax I should be using), but hopefully this will make it clearer.
In my program, this instantiates the model and is part of my run.py script:
net = Transformer(window_size=window_size,
                  timestep_in=d_input,
                  channel_in=d_channel,
                  heads=heads,
                  d_model=d_model,
                  device=DEVICE,
                  dropout=dropout,
                  class_num=d_output,
                  stack=stack,
                  p=p,
                  ).to(DEVICE)

The transformer model is part of my transformer.py script. This is the call to the model during training:
for i, (x, y) in enumerate(train_dataloader):
    x, y = x.to(DEVICE), y.to(DEVICE)
    optimizer.zero_grad()
    y_pre = net(x)
    loss = loss_function(y_pre, y)
    loss.backward()

This is the code inside my transformer.py script that calls PyTorch’s implementation of the encoder layer:

self.channel_tower = ModuleList([
    TransformerEncoderLayer(
        d_model=d_model,
        nhead=heads,
        dim_feedforward=4 * d_model,
        dropout=dropout,
        activation=F.gelu,
        batch_first=True,
        norm_first=True,
        device=device
    ) for _ in range(stack)
])

# Timestep Init
self.timestep_tower = ModuleList([
    TransformerEncoderLayer(
        d_model=d_model,
        nhead=heads,
        dim_feedforward=4 * d_model,
        dropout=dropout,
        activation=F.gelu,
        batch_first=True,
        norm_first=True,
        device=device
    ) for _ in range(stack)
])

And then the actual weight layers themselves are hidden inside torch.nn.TransformerEncoderLayer. In this way the actual layers are sort of abstracted away and seemingly hard to get hold of and peek at. Do you know of a way to peek at them?

You can still directly access all parameters of this module. Here is a small example accessing some of them:

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder_layer.self_attn.in_proj_weight
encoder_layer.linear2.weight
encoder_layer.linear1.weight
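
To see the gradients of those nested layers after a backward pass, one sketch (assuming net is the Transformer instance from above) is to iterate over named_parameters(), which recurses into every submodule, including the TransformerEncoderLayer stacks:

for name, param in net.named_parameters():
    if param.grad is not None:
        print(name, param.grad.abs().mean())   # one summary value per parameter tensor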

wandb.watch can help with monitoring these parameters; that could also be an alternative.
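
A minimal sketch (assuming wandb is installed and configured; the project name is made up):

import wandb

wandb.init(project="my-project")             # hypothetical project name
wandb.watch(net, log="all", log_freq=100)    # log gradients and parameters during training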