Composite loss from multiple forward passes with DDP

EachOneChew · March 26, 2023, 8:19pm

I’d just like to confirm that DDP would behave correctly in the following use case. If I’m not mistaken, upon loss.backward() each parameter should accumulate the combined gradient from loss_1 and loss_2, so the gradient synchronization should only happen once? If I’m correct, Does this extend to composite losses with an arbitrary number of components?

model = DDP(some_model)

training loop:
    p1 = model(x1)
    p2 = model(x2)

    loss = loss_1(p1) + loss_2(p2)
    loss.backward()
    optimizer.step()

Quick update, I’ve read this thread and it seems this will not work. Please let me know if anything has changed since then.

rvarm1 · March 27, 2023, 3:00pm

Gradient synchronization should indeed happen only once.

In general, DDP’s usability should be the same as that of local training, so if this approach isn’t giving correct results, please file a github issue with a minimal repro.

EachOneChew · March 28, 2023, 8:39am

I see, thank you. Unfortunately it does seem like running forward twice does not work under DDP; I got around it by wrapping both forward calls into a single parent module’s forward, but I might have made a mistake somewhere else.