I have an array of models, with corresponding optimizers and losses. Without mixed precision, I sum the losses and then call backward once:
(loss1 + loss2 + ...).backward()
Now, while doing mixed precision training with torch.cuda.amp, according to the sample code given, I should apply backward on the individual losses:
scaler.scale(loss1).backward()
scaler.scale(loss2).backward()
and so on.
So I wanted to know which is the correct way: applying backward() on the individual losses, or on their sum? Is the backward function linear, i.e. does backpropagating through a sum of losses produce the same gradients as backpropagating through each loss separately and letting them accumulate?
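To make the question concrete, here is a minimal toy check of that linearity (the tensor names are made up for illustration, not from my actual code):

import torch

x = torch.randn(4, requires_grad=True)

# Backward on each loss separately; gradients accumulate into x.grad.
loss1 = (x ** 2).sum()
loss2 = (3 * x).sum()
loss1.backward(retain_graph=True)
loss2.backward()
grad_separate = x.grad.clone()

# Reset and backward once on the sum.
x.grad = None
loss1 = (x ** 2).sum()
loss2 = (3 * x).sum()
(loss1 + loss2).backward()

print(torch.allclose(grad_separate, x.grad))  # prints True if the two are equivalent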
Code for reference:
from torch.cuda.amp import autocast

scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()

        with autocast():
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # retain_graph=True because both backward() calls share sections
        # of the graph (loss0 and loss1 both depend on output0 and output1).
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # You can choose which optimizers receive explicit unscaling, if you
        # want to inspect or modify the gradients of the params they own.
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)
        scaler.update()
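For comparison, this is the summed variant I had in mind for the backward section. It is a sketch under the assumption that the two approaches are equivalent, not something the sample code confirms:

# Hypothetical alternative: scale the summed loss and call backward once.
# retain_graph=True would no longer be needed here, since the shared
# graph is only traversed once.
scaler.scale(loss0 + loss1).backward()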