I’m training a network (say it consists of `module1` and `module2`) with two losses, `loss1` and `loss2`. I want to backpropagate `loss1` only into `module1`, while `loss2` should backpropagate through the whole network. Can I do something like the following to achieve this? Is there a more elegant way?
```python
optimizer.zero_grad()
loss1.backward(retain_graph=True)  # first BP loss1
model.module2.zero_grad()  # zero out module2's grads, so only module1 keeps loss1's gradients
loss2.backward()  # loss2 BPs to both modules
optimizer.step()
```
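For reference, recent PyTorch versions also let `backward()` restrict gradient accumulation via its `inputs` argument, which avoids the zero-out step entirely. A minimal sketch of that pattern (the two `Linear` modules and the losses below are placeholders, not the actual model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
module1 = nn.Linear(4, 4)   # stand-in for model.module1
module2 = nn.Linear(4, 2)   # stand-in for model.module2

x = torch.randn(8, 4)
out = module2(module1(x))
loss1 = out.pow(2).mean()   # placeholder for the loss meant for module1 only
loss2 = out.abs().mean()    # placeholder for the full-network loss

# Accumulate loss1's gradients into module1's parameters only.
loss1.backward(inputs=list(module1.parameters()), retain_graph=True)
assert module2.weight.grad is None      # loss1 never touched module2
assert module1.weight.grad is not None

loss2.backward()                        # loss2 reaches both modules
assert module2.weight.grad is not None
```

This computes the same selective update without mutating gradients between the two backward calls.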
My second question: apart from `module.zero_grad()`, can I use `param.grad.data.zero_()` for individual parameters that don’t belong to any module?
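On that point, a free-standing `nn.Parameter` has its own `.grad` that can be zeroed directly; note the `.data` idiom is discouraged in modern PyTorch, so `param.grad.zero_()` (or setting `param.grad = None`) is the usual form. A minimal sketch:

```python
import torch

# A standalone parameter that doesn't live inside any nn.Module.
p = torch.nn.Parameter(torch.ones(3))
loss = (p * 2.0).sum()
loss.backward()
assert torch.equal(p.grad, torch.full((3,), 2.0))

# Clears the accumulated gradient in place, no .data needed.
p.grad.zero_()
assert torch.equal(p.grad, torch.zeros(3))
```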
And my final question: how does this extend to AMP training? Will the following code work? (My main concern is whether I need an explicit `scaler.unscale_(optimizer)` before calling `module2.zero_grad()`.)
```python
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss1, loss2 = calculate_loss(xxx)
scaler.scale(loss1).backward(retain_graph=True)
model.module2.zero_grad()
scaler.scale(loss2).backward()
scaler.step(optimizer)
scaler.update()
```
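As far as I understand, zeroing is scale-invariant: a zeroed scaled gradient unscales to zero, so no explicit `scaler.unscale_(optimizer)` should be needed just to clear `module2`'s gradients between the two backward calls. A toy illustration of that invariance (the tensor below merely stands in for a scaled gradient):

```python
import torch

scale = 2.0 ** 16                 # typical GradScaler initial scale
g = torch.randn(4) * scale        # stands in for a scaled .grad tensor
g.zero_()                         # what module2.zero_grad() would do
assert torch.equal(g / scale, torch.zeros(4))  # unscaling zero is still zero
```

The only caveat I’m aware of is the usual one: `scaler.unscale_(optimizer)` may be called at most once per step, and `scaler.step(optimizer)` unscales automatically if you haven’t done it yourself.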