retain_graph also retains the grad values and adds them to the new ones!

After noticing unexpected gradient values during model training, I ran the experiment below. I expected to get the same gradient values in both scenarios, but that was not the case. The code below is ready to run. In the first scenario I run loss1.backward(retain_graph=True) and then loss2.backward().
In the second scenario I do it the other way around (loss2.backward(retain_graph=True) and then loss1.backward()).
The printed values were not the same.

import torch
import torch.nn as nn
import torch.nn.functional as F

dtype = torch.float32
X = torch.tensor([[1, 2, 3, 4, 5, 6]], dtype=dtype)
Y = torch.tensor([[1, 4, 9, 16, 25, 36]], dtype=dtype)


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        torch.manual_seed(3)
        self.base_l1 = torch.nn.Linear(6, 6, bias=True)
        self.base_l2 = torch.nn.Linear(6, 6, bias=True)
        self.l3 = torch.nn.Linear(6, 6, bias=True)
        self.l4 = torch.nn.Linear(6, 6, bias=True)

    def forward(self, x):
        x1 = self.base_l1(x)
        x1 = F.relu(x1)
        x1 = self.base_l2(x1)
        x2 = x1
        x2 = F.relu(x2)
        x2 = self.l3(x2)
        x2 = F.relu(x2)
        x2 = self.l4(x2)
        return x2, x1


model = Model()
Loss = nn.MSELoss()
y_pred2, y_pred1 = model(X)
print('grad0', model.base_l1.weight.grad)
loss1 = Loss(y_pred1, Y)
loss2 = Loss(y_pred2, Y)
# First scenario: loss1.backward() first, then loss2.backward().
# To try the second scenario, comment out these four lines,
# uncomment the block below, and rerun.
loss1.backward(retain_graph=True)
print('grad1', model.base_l1.weight.grad)
loss2.backward()
print('grad2', model.base_l1.weight.grad)

# Second scenario: loss2.backward() first, then loss1.backward().
'''
loss2.backward(retain_graph=True)
print('grad2', model.base_l1.weight.grad)
loss1.backward()
print('grad1', model.base_l1.weight.grad)
'''

Here we can clearly see that retain_graph=True saves all the information needed to compute the gradient again, but it also preserves the grad values! The new gradient is added to the old one.
I do not think this is desired when we want to compute a brand-new gradient.

Hello,

You need to explicitly reset the gradients between backward passes with optimizer.zero_grad() or model.zero_grad(); otherwise you are doing gradient accumulation. That’s the expected behavior and it has nothing to do with retain_graph. Are you sure you need retain_graph=True? Why do you need it?
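To illustrate the accumulation described above, here is a minimal sketch. It uses a hypothetical toy layer (`layer`, not the model from the question) and runs independent forward passes, so retain_graph never comes into play, yet .grad still accumulates until it is reset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)   # hypothetical toy layer, not the model from the question
x = torch.ones(1, 3)

# Two independent forward passes: each backward call has its own graph,
# so retain_graph is never needed here.
layer(x).sum().backward()
first = layer.weight.grad.clone()    # gradient of sum(Wx + b) w.r.t. W is x

layer(x).sum().backward()
second = layer.weight.grad.clone()   # accumulated: first + first

layer.zero_grad()                    # reset before the next backward call
layer(x).sum().backward()
third = layer.weight.grad.clone()    # a fresh gradient again

print(first, second, third)
```

The second backward call adds to the stored gradient, and only the explicit reset brings it back to a single-pass value.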

retain_graph can be used, among other things, to call backward multiple times on the same loss, or to compute a backward pass on a loss that is itself computed from some gradient (for example, the gradient penalty in WGAN-GP models). In most cases you won’t need it, and you don’t need it for gradient accumulation.
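As a small sketch of the first use case (calling backward more than once on the same loss), assuming a hypothetical tensor `w` that is not part of the original code:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)   # hypothetical parameter
loss = (w ** 2).sum()

loss.backward(retain_graph=True)   # keep the saved tensors for another pass
loss.backward()                    # second pass; the saved tensors are freed afterwards

# .grad holds the sum of both passes: 2 * (2 * w)
grad_twice = w.grad.clone()

# A third call fails because the previous call did not retain the graph.
try:
    loss.backward()
    third_call_failed = False
except RuntimeError:
    third_call_failed = True

print(grad_twice, third_call_failed)
```

Note that the two passes still accumulate into .grad; retain_graph only controls whether the graph's saved tensors survive the backward call.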

Best regards,
Thomas

If you zero the gradients after loss1.backward(), then the optimizer will have only the gradient values from loss2.backward() to work with.
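A minimal sketch of that effect, assuming two hypothetical layers (`base` shared by both losses, `head` used only by the second one) rather than the full model from the question:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(2, 2)   # hypothetical layer shared by both losses
head = nn.Linear(2, 2)   # hypothetical layer used only by loss2
x = torch.tensor([[1.0, 2.0]])
y = torch.zeros(1, 2)
mse = nn.MSELoss()

h = base(x)
loss1 = mse(h, y)
loss2 = mse(head(h), y)

# Reference: gradient of loss2 alone w.r.t. the shared weight.
# torch.autograd.grad returns the gradient without writing into .grad.
(ref,) = torch.autograd.grad(loss2, base.weight, retain_graph=True)

loss1.backward(retain_graph=True)   # fills base.weight.grad with loss1's gradient
base.zero_grad()                    # discards the contribution of loss1
loss2.backward()
only_loss2 = base.weight.grad.clone()

print(torch.allclose(only_loss2, ref))
```

After the reset, the stored gradient matches the gradient of loss2 alone, so an optimizer step taken at that point would ignore loss1 entirely.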

Hey Thomas,
Thank you for your answer.
Below is a snippet of my code.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
dtype = torch.float32

X = torch.tensor([[1, 2, 3, 4, 5, 6]], dtype=dtype)
Y = torch.tensor([[1, 4, 9, 16, 25, 36]], dtype=dtype)


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.base_l1 = torch.nn.Linear(6, 6, bias=True)
        self.base_l2 = torch.nn.Linear(6, 6, bias=True)
        self.l3 = torch.nn.Linear(6, 6, bias=True)
        self.l4 = torch.nn.Linear(6, 6, bias=True)

    def forward(self, x):
        x1 = self.base_l1(x)
        x1 = F.relu(x1)
        x1 = self.base_l2(x1)
        x2 = x1
        x2 = F.relu(x2)
        self.l3.weight.data = x1.repeat(6, 1)
        x2 = self.l3(x2)
        x2 = F.relu(x2)
        x2 = self.l4(x2)
        return x2, x1


model = Model()
# Split the parameters by name: the "base" layers versus all remaining layers.
base_prms = [p for name, p in model.named_parameters() if 'base' in name]
prms = [p for name, p in model.named_parameters() if 'base' not in name]

optimizer1 = optim.SGD(base_prms, lr=.05, momentum=0.9)
optimizer2 = optim.SGD(prms, lr=.05, momentum=0.9)

Loss = nn.MSELoss()
l1 = []
l2 = []
n_iters = 100
for epoch in range(n_iters):
    print(epoch)
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    y_pred2, y_pred1 = model(X)
    loss2 = Loss(y_pred2, Y)
    l2.append(loss2)
    loss2.backward()
    optimizer2.step()
    # reinject the newly updated parameters into y_pred1
    y_pred1.data = model.l3.weight[0, :].data.unsqueeze(0)
    loss1 = Loss(y_pred1, Y)
    l1.append(loss1)
    loss1.backward()
    optimizer1.step()

I want both the final output x2 and the intermediate output x1 of my model to be optimized toward the target Y.
However, x1 is penalized by how far l3.weight is from Y, which means I first have to update layer l3 with loss2.backward(retain_graph=True), then collect its new value, assign it to x1, and only then execute loss1.backward().
Can you see another way to achieve this without setting retain_graph to True?
Thank you.

Thank you for your answer.
The optimizer step should be executed directly after each backward().
Please see the original code snippet that I just posted.

Call zero_grad() after optimizer2.step().

Thanks.
Is there a way to manage this without retain_graph=True,
or is it legitimate to use retain_grad in this case?
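One possible way to avoid retain_graph=True, assuming it is acceptable to run the base layers twice per iteration, is to build a fresh graph for each loss. The sketch below is not the code from this thread: `base` and `head` are hypothetical stand-ins for the base_* and l3/l4 layers, the reinjection of l3.weight into y_pred1 is deliberately left out, and the use of detach() to keep loss2 away from the base layers is an assumption about the intended behavior.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# Hypothetical stand-ins for base_l1/base_l2 and l3/l4.
base = nn.Sequential(nn.Linear(6, 6), nn.ReLU(), nn.Linear(6, 6))
head = nn.Sequential(nn.ReLU(), nn.Linear(6, 6), nn.ReLU(), nn.Linear(6, 6))
opt_base = optim.SGD(base.parameters(), lr=1e-3)
opt_head = optim.SGD(head.parameters(), lr=1e-3)
mse = nn.MSELoss()

X = torch.tensor([[1., 2., 3., 4., 5., 6.]])
Y = torch.tensor([[1., 4., 9., 16., 25., 36.]])

losses1, losses2 = [], []
for step in range(5):
    # Pass 1: update only the head. detach() cuts the graph,
    # so loss2 never reaches the base layers (assumed intent).
    opt_head.zero_grad()
    loss2 = mse(head(base(X).detach()), Y)
    loss2.backward()
    opt_head.step()

    # Pass 2: a fresh forward pass through the base layers builds a new graph,
    # so this backward call does not depend on the one above.
    opt_base.zero_grad()
    loss1 = mse(base(X), Y)
    loss1.backward()
    opt_base.step()

    # Store plain floats so the lists do not keep graphs alive.
    losses1.append(loss1.item())
    losses2.append(loss2.item())

print(losses1, losses2)
```

The trade-off is an extra forward pass through the base layers per iteration in exchange for two independent graphs, each of which is freed right after its own backward call.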