Calling DistributedDataParallel on multiple Modules?

mdlockyer · February 23, 2019, 10:54pm

I’m wondering if anyone has some insight into the effects of wrapping two modules in DDP under a single process group initialization. A good example of this would be a GAN, where there are two distinct models. Can they both safely be wrapped in DDP? I suppose a dummy container module could be made that encases both models and only requires a single DDP wrapper. I looked around but didn’t see anything posted about this.
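A minimal sketch of the container idea, to make the question concrete. The class name `GANContainer`, the toy `nn.Linear` networks, and the combined forward signature are all hypothetical, and nothing in this thread confirms that such a wrapper trains correctly under DDP. It only shows the shape of the module that a single DDP wrapper would receive.

```python
import torch
import torch.nn as nn


class GANContainer(nn.Module):
    """Hypothetical wrapper that holds both networks in one module."""

    def __init__(self, generator: nn.Module, discriminator: nn.Module):
        super().__init__()
        self.generator = generator
        self.discriminator = discriminator

    def forward(self, z: torch.Tensor, real: torch.Tensor):
        # Run both networks inside a single forward call.
        fake = self.generator(z)
        return fake, self.discriminator(fake), self.discriminator(real)


# Toy networks, chosen only for illustration.
container = GANContainer(nn.Linear(8, 16), nn.Linear(16, 1))
fake, d_fake, d_real = container(torch.randn(4, 8), torch.randn(4, 16))

# Under DDP the container would be wrapped once. This is not run here
# because it needs an initialized process group:
# ddp = torch.nn.parallel.DistributedDataParallel(container, device_ids=[local_rank])
```

One caveat with this layout: a training step that only produces gradients for part of the container (for example, a discriminator step on detached fakes) may need DDP's `find_unused_parameters=True` option.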

krumo · April 23, 2019, 10:51am

Hi, I’m also curious about this question. Have you solved it?

pietern · April 23, 2019, 4:44pm

Using the same process group for multiple DDP-wrapped modules may work, but only if they are used independently and no single call to backward involves both models. For GANs this may be the case, where you alternate between training the discriminator and the generator. That said, if a single call to backward involves gradient accumulation for more than one DDP-wrapped module, then you’ll have to use a different process group for each of them to avoid interference.

You can do this as follows:

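import torch
from torch.nn.parallel import DistributedDataParallel

# Each DDP-wrapped model gets its own process group spanning all ranks.
# Every process must call new_group, and in the same order.
pg1 = torch.distributed.new_group(range(torch.distributed.get_world_size()))
model1 = DistributedDataParallel(
    create_model1(),
    device_ids=[args.local_rank],
    process_group=pg1)

# A second, separate group keeps model2's allreduce calls apart from model1's.
pg2 = torch.distributed.new_group(range(torch.distributed.get_world_size()))
model2 = DistributedDataParallel(
    create_model2(),
    device_ids=[args.local_rank],
    process_group=pg2)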

mdlockyer · April 23, 2019, 6:48pm

This is interesting. With most GAN architectures, the backward pass of the generator does indeed include both G and D. Generally you will get some prediction from the discriminator using the fake samples and then call backward on that, propagating that output through D, then G. It may actually be a requirement then that GANs use separate process groups for each model. Granted, in this architecture, weight updates only occur for one model at a time, but gradients should be accumulated for both D and G during the G backward/update stage.
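A small single-process check of that claim, using plain modules with no DDP involved. The toy `nn.Linear` networks and the loss function are assumptions picked for illustration. It shows that one backward call in the generator step populates gradients on both G and D.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Linear(8, 16)  # toy generator
D = nn.Linear(16, 1)  # toy discriminator

# Generator step: score fake samples with D, then backpropagate through D and G.
fake = G(torch.randn(4, 8))
g_loss = nn.functional.binary_cross_entropy_with_logits(D(fake), torch.ones(4, 1))
g_loss.backward()

# Both networks now hold gradients, even though only G would be stepped.
g_has_grad = all(p.grad is not None for p in G.parameters())
d_has_grad = all(p.grad is not None for p in D.parameters())
print(g_has_grad, d_has_grad)
```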

mdlockyer · April 23, 2019, 6:52pm

If it is indeed the case that GANs need separate process groups for G and D, then that is something that definitely needs to be in the docs. I’ve had some strange training results while using DDP and this may be the cause.

@pietern Do you know if the interference you speak of would cause an exception, or just produce incorrect gradients?

I want to put together a test bed for this and see if there are indeed different gradients when using 1 vs 2 PGs.

pietern · June 24, 2019, 11:55am

@mdlockyer I think bad behavior would result in crashes or hangs. Calls to allreduce that are mixed and matched with different dimensionality across processes are a recipe for out-of-bounds memory access and undefined behavior.

Would it be possible to use autograd.grad for the discriminator part and autograd.backward for the generator part? This would avoid gradient accumulation and reduction for the discriminator. You’d still have to stitch them together at the boundary of course.

mdlockyer · June 24, 2019, 3:58pm

@pietern that’s an interesting idea! Only backward() will trigger the reducer, right? Off the top of your head, what would the stitch look like between grad and backward?

pietern · June 25, 2019, 7:17pm

Yes, only backward() interacts with the reducer.

I imagine combining grad and backward like this (but YMMV and I don’t know if this works).

G_out_grad = torch.autograd.grad(D_loss, G_out)
torch.autograd.backward(G_out, G_out_grad)

This wouldn’t trigger any reduction for the discriminator and only do so for the generator.
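A single-process sketch of that stitch, using plain modules instead of DDP, so it says nothing about reducer behavior. The toy networks and loss are assumptions. It only checks the autograd side: `torch.autograd.grad` leaves the discriminator's `.grad` fields untouched, and the generator ends up with the same gradients as an ordinary backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Linear(8, 16)  # toy generator
D = nn.Linear(16, 1)  # toy discriminator
z = torch.randn(4, 8)
target = torch.ones(4, 1)
loss_fn = nn.functional.binary_cross_entropy_with_logits

# Reference: ordinary backward through D and then G.
loss_fn(D(G(z)), target).backward()
reference = [p.grad.clone() for p in G.parameters()]
for p in list(G.parameters()) + list(D.parameters()):
    p.grad = None  # reset before the stitched run

# Stitched version: grad() through D, backward() through G.
G_out = G(z)
D_loss = loss_fn(D(G_out), target)
(G_out_grad,) = torch.autograd.grad(D_loss, G_out)  # grad() returns a tuple
torch.autograd.backward(G_out, G_out_grad)

# D accumulated nothing; G matches the reference gradients.
d_untouched = all(p.grad is None for p in D.parameters())
g_matches = all(torch.allclose(p.grad, r) for p, r in zip(G.parameters(), reference))
print(d_untouched, g_matches)
```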

mdlockyer · June 25, 2019, 7:22pm

Very cool. When I get a second I’m going to implement this and test it out.

arinaruck · September 30, 2020, 11:34pm

Hi! Is this the only adjustment needed to make DDP work with two separate models?
I am trying to train two models in two stages: first I train DDP model A for k steps; then I run its encoder under torch.no_grad(), pass the result through its projection layers with gradients enabled, and feed that into a second DDP model B.
For this I use two data loaders with distributed data samplers and two optimizers, each tracking only one model.
So in the training loop I have:

for k steps:
    pred = modelA(X)
    loss = lossA(pred, y)
    loss.backward()
    optimizerA.step()
....
for m steps:
    with torch.no_grad():
        x = modelA.module.encoder(X)
    x = modelA.module.proj(x)
    pred = modelB(x)
    loss = lossB(pred, y)
    loss.backward()
    optimizerA.step()
    optimizerB.step()

If I do not use two process groups, I get CUDA memory access errors, as you said. With two process groups, training just seems to hang after modelA has trained for some number of steps.

What should I do?

seungjun · October 3, 2020, 4:16pm

I use a single process group for both generator and discriminator.
An example here.

You can just wrap each module with DDP separately and then use them as normal models.