I’m training a network (say it consists of `module1` and `module2`) with two losses, `loss1` and `loss2`. I want to backpropagate `loss1` only into `module1`, while `loss2` should backpropagate through the whole network. Can I do something like the following to achieve this? Is there a more elegant way?
```python
optimizer.zero_grad()
loss1.backward(retain_graph=True)  # first BP loss1
model.module2.zero_grad()  # zero out module2's grads, so only module1 keeps loss1's gradients
loss2.backward()  # loss2 BPs to both modules
optimizer.step()
```
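For reference, recent PyTorch versions also let `backward()` restrict gradient accumulation via its `inputs` argument, which avoids the zero-out step entirely. A minimal sketch of that pattern (the two `Linear` modules and the losses below are placeholders, not the actual model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
module1 = nn.Linear(4, 4)   # stand-in for model.module1
module2 = nn.Linear(4, 2)   # stand-in for model.module2

x = torch.randn(8, 4)
out = module2(module1(x))
loss1 = out.pow(2).mean()   # placeholder for the loss meant for module1 only
loss2 = out.abs().mean()    # placeholder for the full-network loss

# Accumulate loss1's gradients into module1's parameters only.
loss1.backward(inputs=list(module1.parameters()), retain_graph=True)
assert module2.weight.grad is None      # loss1 never touched module2
assert module1.weight.grad is not None

loss2.backward()                        # loss2 reaches both modules
assert module2.weight.grad is not None
```

This computes the same selective update without mutating gradients between the two backward calls.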
My second question: apart from `module.zero_grad()`, can I use `param.grad.data.zero_()` for individual parameters that don’t belong to any module?
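On that point, a free-standing `nn.Parameter` has its own `.grad` that can be zeroed directly; note the `.data` idiom is discouraged in modern PyTorch, so `param.grad.zero_()` (or setting `param.grad = None`) is the usual form. A minimal sketch:

```python
import torch

# A standalone parameter that doesn't live inside any nn.Module.
p = torch.nn.Parameter(torch.ones(3))
loss = (p * 2.0).sum()
loss.backward()
assert torch.equal(p.grad, torch.full((3,), 2.0))

# Clears the accumulated gradient in place, no .data needed.
p.grad.zero_()
assert torch.equal(p.grad, torch.zeros(3))
```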
And my final question: how does this extend to AMP training? Will the following code work? (My main concern is whether I need an explicit `scaler.unscale_(optimizer)` before calling `module2.zero_grad()`.)
```python
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss1, loss2 = calculate_loss(xxx)
scaler.scale(loss1).backward(retain_graph=True)
model.module2.zero_grad()
scaler.scale(loss2).backward()
scaler.step(optimizer)
scaler.update()
```
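As far as I understand, zeroing is scale-invariant: a zeroed scaled gradient unscales to zero, so no explicit `scaler.unscale_(optimizer)` should be needed just to clear `module2`'s gradients between the two backward calls. A toy illustration of that invariance (the tensor below merely stands in for a scaled gradient):

```python
import torch

scale = 2.0 ** 16                 # typical GradScaler initial scale
g = torch.randn(4) * scale        # stands in for a scaled .grad tensor
g.zero_()                         # what module2.zero_grad() would do
assert torch.equal(g / scale, torch.zeros(4))  # unscaling zero is still zero
```

The only caveat I’m aware of is the usual one: `scaler.unscale_(optimizer)` may be called at most once per step, and `scaler.step(optimizer)` unscales automatically if you haven’t done it yourself.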