Backprop One Layer at a Time

Hi all,

I'm trying to write a somewhat custom optimization algorithm, but I'm having a really tough time figuring out how to optimize one layer at a time without running the whole loss.backward() for every layer. I saw someone had basically the same question here: Computing Backward one layer at a time, but the answer wasn't descriptive enough for me to really understand it. I also found this gem: Bit of fun with gradients, which gets me close, but I don't know how to grab the gradient that lands on the detached copy of the previous layer's output so that I can pass it to that layer's backward call.
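
To make the hand-off I mean concrete, here is my best guess at the basic mechanism on a tiny toy example (w, a, a_det are just placeholder names, and I may well have the wrong idea):

import torch

w = torch.randn(3, requires_grad=True)    # stand-in for a "previous layer" parameter
a = w * 2                                 # stand-in for that layer's output
a_det = a.detach().requires_grad_(True)   # the cut in the graph
loss = (a_det ** 2).sum()                 # stand-in for "the rest of the network + loss"

loss.backward()                           # only fills a_det.grad, since the graph was cut
a.backward(gradient=a_det.grad)           # hand that gradient back across the cut
print(w.grad)                             # the "previous layer" now has its gradient

What I can't work out is how to wire this up for the actual layers below.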

Here are snippets of the relevant (not-quite-working) code:

import torch
import torch.nn as nn

class testNet(nn.Module):
    def __init__(self, n_observations, n_actions):
        super(testNet, self).__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)
        self.activate = nn.LeakyReLU(0.01)

    def forward(self, x):
        x1 = self.activate(self.layer1(x))
        # cut the graph after layer1; detach().requires_grad_(True) avoids the
        # copy-construct warning that torch.tensor(x1.detach(), ...) raises
        x2 = self.activate(self.layer2(x1.detach().requires_grad_(True)))
        # cut the graph again after layer2
        return self.layer3(x2.detach().requires_grad_(True))
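
For context, the direction I've been experimenting with is to keep references to the detached tensors inside forward so their .grad can be read back later. This is only a sketch: the class name testNetStaged and the x1 / x1_det / x2 / x2_det attribute names are just my own invention, and it assumes the testNet class above.

class testNetStaged(testNet):
    """Same layers as testNet above, but forward() stashes each detach boundary."""
    def forward(self, x):
        # attached output of layer1 (still connected to layer1's parameters)
        self.x1 = self.activate(self.layer1(x))
        # detached copy that becomes the leaf input of the next graph segment
        self.x1_det = self.x1.detach().requires_grad_(True)
        # same pattern for layer2
        self.x2 = self.activate(self.layer2(self.x1_det))
        self.x2_det = self.x2.detach().requires_grad_(True)
        return self.layer3(self.x2_det)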



params = list(test_net.parameters())
params.reverse()

slope = nn.MSELoss()
loss = slope(state_action_values, expected_state_action_values.unsqueeze(1))

loss.backward(retain_graph=True)

# continue the backward pass layer by layer
for i in range(1, len(params), 2):
    # optimize this layer's parameters -- but what gradient do I pass here?
    params[i + 1].backward(gradient=HOW_CAN_I_GET_THIS, retain_graph=True)
It would be nice to solve the mystery of where values are stored in the computation graph, but any implementation details for a layer-by-layer backprop would be appreciated.
Thank you for your time.