Combining gradients

I have two networks, N1 and N2, that share some layers but also have layers that are different, and they are trained on different data. I want to get the gradients for the common layers, say l1, l2, and l5, add them together, and continue the training process.

I was thinking of getting the gradients from for name, parameter in model.named_parameters():, filtering by name, adding them, and then calling optimizer.step(). But model.named_parameters() seems to be read only. How do I access the gradients and perform operations on them?
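To be concrete, this is roughly the kind of in-place update on the gradients that I'm hoping to do for the common layers (a sketch; the Sequential and the layer names are just placeholders):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 3), nn.Linear(3, 3))  # stand-in for N1
loss = net(torch.randn(4, 3)).sum()
loss.backward()

common = {'0.weight', '0.bias'}  # placeholder names for the common layers
for name, param in net.named_parameters():
  if name in common and param.grad is not None:
    # .grad is an ordinary tensor, so in-place operations on it are allowed
    param.grad += torch.ones_like(param.grad)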

Do you mean something like this?

import torch
import torch.nn as nn

class A(nn.Module):
  def __init__(self):
    super().__init__()
    self.lin = nn.Linear(3, 3)
  def forward(self, x):
    return self.lin(x)

class B(nn.Module):
  def __init__(self):
    super().__init__()
    self.lin = nn.Linear(3, 3)
  def forward(self, x):
    return self.lin(x)

net1 = A()
net2 = B()
optimizer1 = torch.optim.SGD(net1.parameters(), lr=0.01)
optimizer2 = torch.optim.SGD(net2.parameters(), lr=0.01)

for i in range(2):
  optimizer1.zero_grad()
  optimizer2.zero_grad()
  x = torch.randn(3, 3)
  loss = (net1(x) + net2(x)).sum()
  loss.backward()
  # Add net1's gradients into net2's for parameters that share a name.
  for name1, param1 in net1.named_parameters():
    for name2, param2 in net2.named_parameters():
      if name1 == name2:
        print(name1, param1.grad, name2, param2.grad)
        param2.grad += param1.grad  # .grad is a plain tensor, so in-place ops work
        print(name1, param1.grad, name2, param2.grad)
  optimizer1.step()
  optimizer2.step()

@vainaijr Yes, that's exactly right. It's weird that the grad wasn't changing in my code; I'll see what's going on.

When you say “common”, do you mean that they should be the same (always keep the same parameter values)? Or just that gradients from one should be added to the gradients of the other?

The gradients from one should be added to the other. So in the solution above, it would be param2.grad = param1.grad = param1.grad + param2.grad.
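Concretely, something like this (a sketch, with plain Linear layers standing in for the A and B instances above):

import torch
import torch.nn as nn

net1 = nn.Linear(3, 3)  # stands in for the A instance
net2 = nn.Linear(3, 3)  # stands in for the B instance

x = torch.randn(3, 3)
(net1(x) + net2(x)).sum().backward()

# Give both copies of each matching parameter the summed gradient.
for (name1, p1), (name2, p2) in zip(net1.named_parameters(), net2.named_parameters()):
  if name1 == name2 and p1.grad is not None and p2.grad is not None:
    summed = p1.grad + p2.grad
    p1.grad.copy_(summed)
    p2.grad.copy_(summed)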

Note that it is also possible to just share the same module.

class A(nn.Module):
  def __init__(self):
    super().__init__()
    self.lin = nn.Linear(3, 3)
  def forward(self, x):
    return self.lin(x)

class B(nn.Module):
  def __init__(self, lin):
    super().__init__()
    self.lin = lin  # the very same module as net1.lin
  def forward(self, x):
    return self.lin(x)

net1 = A()
net2 = B(net1.lin)

optimizer1 = torch.optim.SGD(net1.parameters(), lr=0.01)
# No need for a second optimizer as all the parameters are already in net1.

for i in range(2):
  optimizer1.zero_grad()
  x = torch.randn(3, 3)
  loss = (net1(x) + net2(x)).sum()
  loss.backward()
  optimizer1.step()

If only part of the model is shared, you can use a trick like this for B:

class B(nn.Module):
  def __init__(self, lin):
    super().__init__()
    self.lin = [lin,]  # Putting this in a python list hides it from .parameters()
    self.lin2 = nn.Linear(3, 3)
  def forward(self, x):
    return self.lin2(self.lin[0](x))

net2 = B(net1.lin)
optimizer2 = torch.optim.SGD(net2.parameters(), lr=0.01)
# Note that the parameters of net2.lin won't be in net2.parameters() because it is in a list!
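For completeness, a rough usage sketch for this variant, assuming A and the list-trick B above are already defined; only optimizer1 touches the shared lin:

import torch

net1 = A()
net2 = B(net1.lin)

optimizer1 = torch.optim.SGD(net1.parameters(), lr=0.01)
optimizer2 = torch.optim.SGD(net2.parameters(), lr=0.01)  # covers net2.lin2 only

for i in range(2):
  optimizer1.zero_grad()
  optimizer2.zero_grad()
  x = torch.randn(3, 3)
  loss = (net1(x) + net2(x)).sum()
  loss.backward()
  optimizer1.step()  # updates net1.lin, which net2 also uses
  optimizer2.step()  # updates net2.lin2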