Do gradients propagate through forward hooks?

For a binary classification problem with a 5-layer feedforward NN, I want to build a joint loss function that includes predictive outputs from intermediate layers. I'm trying to work out the best way to access those intermediate outputs, and one option would be to register forward hooks on the layers in question. If I add a hook to an intermediate layer and pass the captured output through a linear layer into a loss function, will loss.backward() propagate gradients through that path properly?
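
To make the setup concrete, here is a minimal sketch of what I have in mind (the layer sizes, the hooked layer, and the auxiliary head are just illustrative):

import torch
import torch.nn as nn

acts = {}
def save_act(name):
    def hook(module, inp, out):
        acts[name] = out
    return hook

# hypothetical 5-layer feedforward net for binary classification
net = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
net[6].register_forward_hook(save_act("hidden"))  # hook on the 4th linear layer

aux_head = nn.Linear(32, 1)  # auxiliary head on the intermediate output

x = torch.randn(8, 16)
y = torch.randint(0, 2, (8, 1)).float()

criterion = nn.BCEWithLogitsLoss()
main_loss = criterion(net(x), y)
aux_loss = criterion(aux_head(acts["hidden"]), y)

loss = main_loss + aux_loss  # joint loss
loss.backward()              # does this reach the layers feeding the hook?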

Yes, gradients will propagate, since the forward hook stores the output tensor together with its autograd history. The small example below uses a hook to grab an intermediate activation, passes it through a separate custom layer, computes a loss, and backpropagates through all layers involved:

import torch
import torch.nn as nn
import torchvision.models as models

# store hooked activations by name
act = {}
def get_hook(name):
    def hook(m, input, output):
        # the output tensor keeps its grad_fn, so it stays attached to the autograd graph
        act[name] = output
    return hook

model = models.resnet18()
model.layer1[1].conv2.register_forward_hook(get_hook("layer1_conv2"))

x = torch.randn(1, 3, 224, 224)
out = model(x)

# verify that no grads are set
for name, param in model.named_parameters():
    print("param {}, grad {}".format(name, param.grad))
# param conv1.weight, grad None
# param bn1.weight, grad None
# param bn1.bias, grad None
# ...
# param layer4.1.bn2.bias, grad None
# param fc.weight, grad None
# param fc.bias, grad None


# pass the hooked activation (64 channels from layer1) through a new custom layer
custom_layer = nn.Conv2d(64, 1, 3, 1, 1)
act_out = custom_layer(act["layer1_conv2"])

loss = act_out.mean()
loss.backward()

# the custom layer, as well as all layers used to create the layer1_conv2 output, should now have valid grads
for name, param in custom_layer.named_parameters():
    grad = param.grad.abs().sum() if param.grad is not None else None
    print("param {}, grad {}".format(name, grad))
# param weight, grad 222.42572021484375
# param bias, grad 1.000019907951355

for name, param in model.named_parameters():
    grad = param.grad.abs().sum() if param.grad is not None else None
    print("param {}, grad {}".format(name, grad))
# param conv1.weight, grad 30.616352081298828
# param bn1.weight, grad 0.10833178460597992
# param bn1.bias, grad 0.06121913343667984
# param layer1.0.conv1.weight, grad 23.733787536621094
# param layer1.0.bn1.weight, grad 0.046802934259176254
# param layer1.0.bn1.bias, grad 0.03454311192035675
# param layer1.0.conv2.weight, grad 22.624469757080078
# param layer1.0.bn2.weight, grad 0.036605916917324066
# param layer1.0.bn2.bias, grad 0.04052506014704704
# param layer1.1.conv1.weight, grad 37.31376266479492
# param layer1.1.bn1.weight, grad 2.3493189811706543
# param layer1.1.bn1.bias, grad 2.9538869857788086
# param layer1.1.conv2.weight, grad 902.742919921875
# param layer1.1.bn2.weight, grad None
# param layer1.1.bn2.bias, grad None
# param layer2.0.conv1.weight, grad None
# ...
# param layer4.1.bn2.bias, grad None
# param fc.weight, grad None
# param fc.bias, grad None
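
For your joint-loss use case you would then just add the auxiliary loss to the main loss before calling backward(). A minimal sketch continuing the example above (the cross-entropy target, the zero_grad calls, and the 0.1 weighting are only placeholders):

import torch.nn.functional as F

target = torch.randint(0, 1000, (1,))  # dummy classification target

model.zero_grad()
custom_layer.zero_grad()

out = model(x)  # the forward pass also refreshes act["layer1_conv2"] via the hook
main_loss = F.cross_entropy(out, target)
aux_loss = custom_layer(act["layer1_conv2"]).mean()

loss = main_loss + 0.1 * aux_loss  # joint loss with an arbitrary weighting
loss.backward()                    # now all model parameters (and custom_layer) receive gradients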

Excellent, thanks a lot for the example! This is precisely what I was after.