How to print the computed gradient values for a network


#1

I want to print the gradient values before and after doing backpropagation, but I have no idea how to do it.

If I do loss.grad it gives me None.

Can I get the gradient for each weight in the model (with respect to that weight)?

Sample code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.conv11 = nn.Conv2d(3, 64, 3, padding=1)
        self.pool1 = nn.AvgPool2d(2, 2)

        self.conv21 = nn.Conv2d(64, 64*2, 3, padding=1)
        self.pool2 = nn.AvgPool2d(2, 2)

        self.conv52 = nn.Conv2d(64*2, 10, 1)
        self.pool5 = nn.AvgPool2d(8, 8)
        
    def forward(self, x):
        
        x = F.relu(self.conv11(x))
        x = self.pool1(x)

        x = F.relu(self.conv21(x))
        x = self.pool2(x)
        
        x = self.conv52(x)
        x = self.pool5(x)
        
        x = x.view(-1, 10)
        return x
    

net = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
net.to(device)
inputs = torch.rand(4, 3, 32, 32)
labels = torch.rand(4) * 10 // 5  # random class indices in {0., 1.}
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
inputs = inputs.to(device)
labels = labels.to(device)

outputs = net(inputs)

loss = criterion(outputs, labels.long())

print(loss.grad)
loss.backward()
print(loss.grad)

optimizer.step()
 


#2

Before the first backward call, all grad attributes are set to None. After the first backward you should see some gradient values. Thereafter the gradients will be either zero (after optimizer.zero_grad()) or valid values.
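
A minimal sketch of that behaviour, reusing the net, inputs, labels, and criterion defined in post #1 (any leaf parameter works; I just grab the first one):

first_param = next(net.parameters())   # a leaf tensor (the first conv weight)
print(first_param.grad)                # None before the first backward

outputs = net(inputs)
loss = criterion(outputs, labels.long())
loss.backward()

print(first_param.grad.shape)          # now a tensor with the same shape as the weight
print(first_param.grad.abs().sum())    # typically non-zero after backward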


#3

I understand, but why is it not showing the gradient values :confused:
Am I doing something wrong?


# Initialization
net = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(device)
# defining loss
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

# some random inputs and labels
inputs = torch.rand(4, 3, 32, 32)
labels = torch.rand(4) * 10 // 5  # random class indices in {0., 1.}
inputs, labels= inputs.to(device), labels.to(device)

# zero_grad
net.zero_grad()
optimizer.zero_grad()

outputs = net(inputs)
loss = criterion(outputs, labels.long())
print(loss.data)
print(loss.grad)
loss.backward()
print(loss.grad)
optimizer.step()
print(loss.grad)

Output:

tensor(2.3276, device='cuda:0')
None
None
None

(Zenghao Liu) #4

Yes, you can get the gradient for each weight in the model w.r.t. that weight, like this:

print(net.conv11.weight.grad) 
print(net.conv21.bias.grad)

The reason loss.grad gives you None is that “loss” is not in the optimizer; only “net.parameters()” is passed to the optimizer:

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

And “loss” is not a leaf node in the computation graph, so you can’t add it to the optimizer directly, and its .grad attribute is not populated by default.
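
A convenience sketch (not from the original posts): a small loop over named_parameters() prints the gradient, or None, for every weight and bias in the model.

for name, param in net.named_parameters():
    # param.grad is None before the first backward, a tensor afterwards
    grad = param.grad
    print(name, None if grad is None else grad.norm().item())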


#5

Sorry for the misunderstanding. I hadn’t realized you would like to see the gradient of your loss.
In that case, @zhl515 is right, and you would need to use hooks to get the gradients w.r.t. intermediate values (i.e. tensors calculated from leaf variables).
Could you try to add loss.register_hook(lambda grad: print(grad)) before the backward call?
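
In the training snippet from post #3 that placement would look roughly like this (a sketch; the hook just prints the gradient that flows into loss during backward):

outputs = net(inputs)
loss = criterion(outputs, labels.long())

# register the hook before backward; it receives dLoss/dLoss
loss.register_hook(lambda grad: print(grad))

loss.backward()
optimizer.step()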


#6

@ptrblck When I put loss.register_hook(lambda grad: print(grad)) before loss.backward() it gives me tensor(1., device='cuda:0'). Is that what it is supposed to show? With respect to what intermediate value is it computing the gradient?

@zhl515 and @ptrblck
I have a follow up question:

print(net.conv11.weight.grad) 

lets me print the grad values for conv11.weight. If I want to set these gradient values to zero, I thought I could do this:

Temp = net.conv11.weight.grad.clone()
net.conv11.weight.grad = torch.zeros(Temp.size())

but it is throwing

RuntimeError: assigned grad has data of a different type

Can you please let me know your suggestion on that?

Thanks

Update:

I noticed that the second question is solved when I do the following :slight_smile:

net.conv11.weight.grad = torch.zeros(Temp.size()).to(device)
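
A slightly safer variant (my suggestion, not from the thread) is torch.zeros_like, which matches both the dtype and the device of the existing gradient automatically:

net.conv11.weight.grad = torch.zeros_like(net.conv11.weight.grad)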

#7

Yes, as you are calculating dLoss/dLoss = 1. Note that you can also pass this gradient directly to your backward call.
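
To illustrate that last point (a sketch, not from the thread): for a scalar loss, calling backward() with no argument is the same as seeding it with a gradient of 1.

grad_seed = torch.tensor(1.0, device=loss.device)  # dLoss/dLoss
loss.backward(grad_seed)  # equivalent to loss.backward() for a scalar loss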


#8

wow :sunglasses:
so cool!
thanks

The following code made it clearer for me; maybe it helps others too.

import torch


# Define the leaf nodes
a = torch.tensor([4.0])

weights = [torch.tensor([float(i)], requires_grad=True) for i in (2, 5, 9, 7)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d)

# each hook prints the gradient flowing into that tensor during backward
L.register_hook(lambda grad: print(grad))
d.register_hook(lambda grad: print(grad))
b.register_hook(lambda grad: print(grad))
c.register_hook(lambda grad: print(grad))
b.register_hook(lambda grad: print(grad))  # b has two hooks, so its gradient prints twice

L.backward()


for index, weight in enumerate(weights, start=1):
    gradient, *_ = weight.grad
    print(f"Gradient of w{index} w.r.t to L: {gradient}")

Output:

tensor([1.])
tensor([-1.])
tensor([-7.])
tensor([-9.])
tensor([-9.])
Gradient of w1 w.r.t to L: -36.0
Gradient of w2 w.r.t to L: -28.0
Gradient of w3 w.r.t to L: -8.0
Gradient of w4 w.r.t to L: -20.0
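
For reference (my addition, not part of the original post), these numbers agree with the chain rule worked by hand: dL/dd = -1, dL/db = dL/dd * w3 = -9, dL/dc = dL/dd * w4 = -7, and then dL/dw1 = dL/db * a = -36, dL/dw2 = dL/dc * a = -28, dL/dw3 = dL/dd * b = -8, dL/dw4 = dL/dd * c = -20.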


(Paul Gureghian) #9

Print the model’s state_dict() keys, then print the value stored under a specific key.
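
A quick sketch of that approach, again reusing the Net from post #1 (note that state_dict() holds the weight values themselves, not their gradients):

print(net.state_dict().keys())            # e.g. odict_keys(['conv11.weight', 'conv11.bias', ...])
print(net.state_dict()['conv11.weight'])  # the values stored under one key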