I’d just like to confirm that DDP behaves correctly in the following use case. If I’m not mistaken, upon `loss.backward()` each parameter should accumulate the **combined** gradient from `loss_1` and `loss_2`, so gradient synchronization should happen only once. If that’s right, does this extend to composite losses with an arbitrary number of components?

```
model = DDP(some_model)
for x1, x2 in loader:
    optimizer.zero_grad()
    p1 = model(x1)
    p2 = model(x2)
    loss = loss_1(p1) + loss_2(p2)
    loss.backward()  # gradients from both terms accumulate, then sync once
    optimizer.step()
```
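
As a sanity check of the single-process part of the question (leaving DDP’s synchronization aside), here is a minimal sketch showing that `backward()` on a composite loss accumulates the sum of the per-term gradients in one pass. The toy `Linear` model and the `mean()` losses are placeholders standing in for `some_model`, `loss_1`, and `loss_2`.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x1, x2 = torch.randn(8, 4), torch.randn(8, 4)

# Gradient of the combined loss in a single backward pass.
model.zero_grad()
(model(x1).mean() + model(x2).mean()).backward()
combined = model.weight.grad.clone()

# Gradients of each term computed separately, then summed.
model.zero_grad()
model(x1).mean().backward()
g1 = model.weight.grad.clone()
model.zero_grad()
model(x2).mean().backward()
g2 = model.weight.grad.clone()

# The combined gradient matches the sum of the per-term gradients.
print(torch.allclose(combined, g1 + g2))
```

Under DDP, that single `backward()` call is also what triggers the (one) all-reduce per bucket, which is the behavior the question is asking about.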

Quick update: I’ve read this thread, and it seems this will not work. Please let me know if anything has changed since then.