Optimizer initialization in Distributed Data Parallel

Hi, I am new to PyTorch’s DistributedDataParallel (DDP) module. I want to convert my GAN model to DDP training, but I’m not confident about what I should modify.

My original toy script looks like this:

import torch.optim as optim

# Initialization
G = Generator()
D = Discriminator()
G.cuda()
D.cuda()
opt_G = optim.SGD(G.parameters(), lr=0.001)
opt_D = optim.SGD(D.parameters(), lr=0.001)

G_train = GeneratorOperation(G, D)  # an nn.Module that computes all training losses for G
D_train = DiscriminatorOperation(G, D)  # an nn.Module that computes all training losses for D

# Training
for i in range(10000):
    loss_D = D_train()
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    loss_G = G_train()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

My question is: when I add DDP to the above script, should I modify it like this:

from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group is already initialized and local_rank is set
torch.cuda.set_device(local_rank)
G = Generator()
D = Discriminator()
G.cuda()
D.cuda()
G_ddp = DDP(G, device_ids=[local_rank], output_device=local_rank)
D_ddp = DDP(D, device_ids=[local_rank], output_device=local_rank)
opt_G = optim.SGD(G_ddp.parameters(), lr=0.001)
opt_D = optim.SGD(D_ddp.parameters(), lr=0.001)

G_train = GeneratorOperation(G_ddp, D_ddp)
D_train = DiscriminatorOperation(G_ddp, D_ddp)

or like this:

torch.cuda.set_device(local_rank)
G = Generator()
D = Discriminator()
G.cuda()
D.cuda()
opt_G = optim.SGD(G.parameters(), lr=0.001)
opt_D = optim.SGD(D.parameters(), lr=0.001)

G_train = DDP(GeneratorOperation(G, D), device_ids=[local_rank], output_device=local_rank)
D_train = DDP(DiscriminatorOperation(G, D), device_ids=[local_rank], output_device=local_rank)

Should I prefer one of the above over the other, or are they equivalent? I’d appreciate it if you could also explain this in detail. Thanks!

Per the example here: examples/main.py at master · pytorch/examples · GitHub, the DDP model is created first and then its parameters are passed as an argument when creating the optimizer (like the first option shown).
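
One detail worth noting (easy to verify yourself): DDP keeps a reference to the wrapped module rather than copying it, so G_ddp.parameters() yields the very same tensors as G.parameters(), and the optimizer updates the underlying model either way. A minimal sanity check, assuming the option-1 setup above:

# DDP stores the original module as .module and shares its parameters,
# so the optimizer steps the same tensors regardless of which handle
# you pass to it.
assert G_ddp.module is G
assert next(G_ddp.parameters()) is next(G.parameters())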

A more specific answer would depend on what GeneratorOperation and DiscriminatorOperation do. If you want the functionality in these modules to be replicated across ranks and the corresponding gradients to be synchronized during the backward pass, then they should be passed to DDP.
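
For reference, here is a minimal sketch of the first option with the usual process-group boilerplate filled in. Everything beyond the snippets above (the torchrun launcher, the LOCAL_RANK environment variable, the NCCL backend) is an assumption on my part, so adjust it to your launcher:

import os
import torch
import torch.distributed as dist
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, e.g. launched with:
#   torchrun --nproc_per_node=NUM_GPUS train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

G = Generator().cuda()
D = Discriminator().cuda()

# Wrap first, then build the optimizers from the wrapped handles
# (equivalent to using G.parameters()/D.parameters(), see above).
G_ddp = DDP(G, device_ids=[local_rank], output_device=local_rank)
D_ddp = DDP(D, device_ids=[local_rank], output_device=local_rank)
opt_G = optim.SGD(G_ddp.parameters(), lr=0.001)
opt_D = optim.SGD(D_ddp.parameters(), lr=0.001)

# Run every forward pass through the DDP wrappers so that the hooks
# that all-reduce gradients across ranks fire during backward().
G_train = GeneratorOperation(G_ddp, D_ddp)
D_train = DiscriminatorOperation(G_ddp, D_ddp)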