Hi, I am new to PyTorch’s DistributedDataParallel (DDP) module. I want to convert my GAN training to DDP, but I’m not confident about what I should modify.
My original toy script looks like this:
# Initialization
G = Generator()
D = Discriminator()
G.cuda()
D.cuda()
opt_G = optim.SGD(G.parameters(), lr=0.001)
opt_D = optim.SGD(D.parameters(), lr=0.001)
G_train = GeneratorOperation(G, D) # a PyTorch module to calculate all training losses for G.
D_train = DiscriminatorOperation(G, D) # a PyTorch module to calculate all training losses for D.
# Training
for i in range(10000):
    loss_D = D_train()
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    loss_G = G_train()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
My question is: when I add DDP to the script above, should I wrap the two models separately, like this:
torch.cuda.set_device(local_rank)
G = Generator()
D = Discriminator()
G.cuda()
D.cuda()
G_ddp = DDP(G, device_ids=[local_rank], output_device=local_rank)
D_ddp = DDP(D, device_ids=[local_rank], output_device=local_rank)
opt_G = optim.SGD(G_ddp.parameters(), lr=0.001)
opt_D = optim.SGD(D_ddp.parameters(), lr=0.001)
G_train = GeneratorOperation(G_ddp, D_ddp)
D_train = DiscriminatorOperation(G_ddp, D_ddp)
or wrap the loss-computing modules instead, like this:
torch.cuda.set_device(local_rank)
G = Generator()
D = Discriminator()
G.cuda()
D.cuda()
opt_G = optim.SGD(G.parameters(), lr=0.001)
opt_D = optim.SGD(D.parameters(), lr=0.001)
G_train = DDP(GeneratorOperation(G, D), device_ids=[local_rank], output_device=local_rank)
D_train = DDP(DiscriminatorOperation(G, D), device_ids=[local_rank], output_device=local_rank)
Should I prefer one of the above over the other, or are they equivalent? I’d appreciate a detailed explanation as well. Thanks!
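For context, here is a stripped-down, runnable sketch of the first option that I can test on a single CPU process (assumptions: the gloo backend with world_size=1 stands in for the real multi-GPU launch, and nn.Linear layers stand in for my Generator/Discriminator; in the real run torchrun would set RANK/WORLD_SIZE/LOCAL_RANK):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process bootstrap; torchrun would normally provide these env vars.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Toy stand-ins for my Generator and Discriminator.
G = nn.Linear(8, 8)
D = nn.Linear(8, 1)

# Option 1: wrap each model in DDP separately
# (no device_ids on CPU; on GPU I would pass device_ids=[local_rank]).
G_ddp = DDP(G)
D_ddp = DDP(D)

x = torch.randn(4, 8)
out = D_ddp(G_ddp(x))  # forward through both wrapped models
print(out.shape)  # torch.Size([4, 1])

dist.destroy_process_group()
```

This at least confirms that wrapping the two models independently runs end to end; my uncertainty is about whether gradient synchronization behaves correctly this way versus wrapping the combined loss modules.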