How to compute the gradient of gradient if I have two models?

Hi, I am working on a problem where I have two models, namely a Teacher model (A) and a student model (B).

Phase 1

The Teacher network is used to generate pseudo-labels for a set of unlabelled train set X1. The pseudo-labels are used as ground truth to train the student network. The student network is updated based on the loss computed using the prediction from student network and the pseudo-labels.

Phase 2

Given labelled train set, (X2, Y2), we use the updated student model B2 and perform forward propagation. Then a loss is computed between the prediction from B2 and Y2. Rather than updating the student network again, now I would like to update the Teacher model A. In other words, I would like to compute the gradient of loss in phase 2 w.r.t model A. However, in my implementation, when I do loss.backward() in phase 2, only the gradient w.r.t model B is computed. May I ask how I can compute the gradient w.r.t model A???

Below is a much more detailed explanation and I have also pasted my code at the very end.

Given a set of inputs, X1,

Y_A1 = A(X1)
Y_B1 = B(X1)
loss1 = crossEntropy(Y_A1, Y_B1)
loss1.backward()

By calling loss1.backward() will computes the gradient of loss1 w.r.t all the parameters in A and B. Then to update only model B,

optimB.step() 
B -> B2

is called, where optimB can be something like

optimB = torch.optim.SGD(B.parameters(), lr=0.01)

Then, given another set of input X2, and labels Y2

Y_B2 = B2(X2)
loss2 = crossEntropy(Y_B2, Y2)
loss2.backward()

I was under the assumption that by calling loss2.backward(), gradient w.r.t. to parameters in model A will also be computed (second order derivative of loss1 w.r.t model A) because

Y_B2 = B2(X2)
B2 = B - lr * dloss1/dB

To update A -> A2
A2 = A - lr * dloss2/dA
A2 = A - lr * d(crossEntropy(Y_B2, Y2))/dA
A2 = A - lr * d(crossEntropy(B2(X2), Y2))/dA

Therefore, in order to update A we need to compute d(dloss1/db)/dA. However, in my implementation, when I called loss2.backward(), only gradient of dloss2/dB is computed.

I have pasted my code below:

X1 = torch.randn(5,2, dtype=torch.float32)
X2 = torch.randn(5,2, dtype=torch.float32)
Y2 = torch.randint(0,2, (5,1), dtype=torch.float32)

# define linear model
class linearModel(nn.Module):
    def __init__(self):
        super(linearModel, self).__init__()

        self.layer1 = nn.Linear(2, 2, bias=False)
        self.layer2 = nn.Linear(2, 1, bias=False)
        self.activation = nn.Tanh()

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        return torch.sigmoid(x)

# define loss function: Cross Entropy
def loss_fcn(pred, target):
     loss = -torch.mean(target*torch.log(pred) - (1-target)*torch.log(1-pred))
  return loss

modelA = linearModel()
modelB = linearModel() 

optimA = torch.optim.SGD(modelA.parameters(), lr=0.01)
optimB = torch.optim.SGD(modelB.parameters(), lr=0.01)

# Phase 1
Y_A1 = modelA(X1)
Y_B1 = modelB(X1)
loss1 = loss_fcn(Y_B1, Y_A1)

# Compute gradient w.r.t. B and update modelB 
loss1.backward()
optimB.step()

# Phase 2
Y_B2 = modelB(X2)   # model B has been updated 
loss2 = loss_fcn(Y_B2, Y2)
# Compute gradients w.r.t. model A and update model A
loss2.backward()
optimA.step()

However, modelA.layer1.weight.grad returns None.

Based on your code, your modelA()'s parameters didn’t get involved in the computation graph of calculating loss2, so when you print it gives you None

Hi, thanks for your reply. Is there a way to manually compute the gradients of loss2 w.r.t modelA’s parameters? Thank you

In this case, you should pass X2 to modelA():

Y_A2 = modelA(X2)
loss2 = loss_fnc(Y_A2, Y2)
loss2.backward()
optimA.step()