How can I use DistributedDataParallel instead of DataParallel?

When I replaced DataParallel with DistributedDataParallel, the result on the validation set became very poor, as in the case of overfitting. I used 4 GPUs, one process per GPU, keeping the learning rate and batch size unchanged. The following is all the code related to DDP:

device = torch.device("cuda", args.local_rank)

train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size,
        num_workers=args.workers, sampler=train_sampler, pin_memory=True,
        shuffle=(train_sampler is None))
val_sampler = torch.utils.data.distributed.DistributedSampler(val_set)
val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=args.batch_size,
        num_workers=args.workers, pin_memory=True, shuffle=False,
        sampler=val_sampler)

model = models.__dict__[args.arch](network_data).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
cudnn.benchmark = True
for epoch in tqdm(range(args.start_epoch, args.epochs)):
    # train for one epoch

    dist.reduce(train_loss, 0, op=dist.ReduceOp.SUM)

    dist.reduce(test_loss, 0, op=dist.ReduceOp.SUM)
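One detail worth checking in a loop like this is calling `train_sampler.set_epoch(epoch)` at the top of each epoch; without it, `DistributedSampler` produces the same shuffle order every epoch. A minimal single-process sketch (passing `num_replicas`/`rank` explicitly so no process group is needed):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 16 items, split across 4 "ranks".
dataset = TensorDataset(torch.arange(16))
samplers = [DistributedSampler(dataset, num_replicas=4, rank=r) for r in range(4)]

# Each rank sees a disjoint quarter; together they cover the dataset.
all_indices = sorted(i for s in samplers for i in s)
print(all_indices == list(range(16)))  # True

# Without set_epoch, every epoch replays the same shuffled order...
epoch0 = list(samplers[0])
print(list(samplers[0]) == epoch0)  # True

# ...while set_epoch reseeds the shuffle for the next epoch.
samplers[0].set_epoch(1)
print(list(samplers[0]) != epoch0)  # True
```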

(figure: loss curves; the blue curve is the validation-set result)

Hey @111344

If each DDP (DistributedDataParallel) process uses the same batch size you passed to DataParallel, then I think you need to divide the reduced loss by world_size; otherwise, you are summing together losses from world_size batches.
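A minimal sketch of that fix; the loss values are made up, and the sum emulates what `dist.reduce(loss, 0, op=dist.ReduceOp.SUM)` leaves on rank 0:

```python
import torch

# Per-process mean losses from a hypothetical world_size = 4 run.
world_size = 4
per_process_losses = [torch.tensor(0.9), torch.tensor(1.1),
                      torch.tensor(1.0), torch.tensor(1.2)]

# dist.reduce with ReduceOp.SUM puts the sum of all ranks' losses on rank 0.
reduced = torch.stack(per_process_losses).sum()
print(round(reduced.item(), 4))           # 4.2 -- world_size times too large

# Divide by world_size to recover the global mean loss for logging.
global_mean_loss = reduced / world_size
print(round(global_mean_loss.item(), 4))  # 1.05
```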

Another thing is that the batch size and learning rate might need to change when switching to DDP. Check out the discussions below:

  1. Should we split batch_size according to ngpu_per_node when DistributedDataparallel
  2. Is average the correct way for the gradient in DistributedDataParallel with multi nodes?

And this briefly explains how DDP works:

Thanks for your answer, it helped me a lot. :smiley:
One conclusion I got from these materials is that I should set the per-process batch size to batch_size / ngpus while keeping the learning rate at 1x lr.
Is this correct?

Yes, this should let the DDP gang collectively process the same number of samples as in the single-process case. But it may or may not stay mathematically equivalent, depending on the loss function. DDP takes the average of gradients across processes. So if the loss function computes the sum of the losses over all samples, or if (loss(x) + loss(y)) / 2 != loss([x, y]), it won't be mathematically equivalent. Hence, it might take some effort to tune the lr and batch size when using DDP.
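A single-process sketch of that point, using a tiny linear model with a squared loss (all names here are made up for illustration): with mean reduction, averaging the per-rank gradients reproduces the full-batch gradient, while with sum reduction it is off by a factor of world_size.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x1, x2 = torch.randn(2, 3), torch.randn(2, 3)  # two "ranks", 2 samples each
x = torch.cat([x1, x2])                        # the single-process batch of 4

def grad_of(loss_fn, data):
    # Return d(loss)/dw for one backward pass on `data`.
    w.grad = None
    loss_fn(data).backward()
    return w.grad.clone()

mean_loss = lambda d: (d @ w).pow(2).mean()
sum_loss = lambda d: (d @ w).pow(2).sum()

# Mean reduction: averaging per-rank grads matches the full-batch grad.
g_full = grad_of(mean_loss, x)
g_avg = (grad_of(mean_loss, x1) + grad_of(mean_loss, x2)) / 2
print(torch.allclose(g_full, g_avg))  # True

# Sum reduction: the averaged grad is only half of the full-batch grad.
g_full_sum = grad_of(sum_loss, x)
g_avg_sum = (grad_of(sum_loss, x1) + grad_of(sum_loss, x2)) / 2
print(torch.allclose(g_full_sum, g_avg_sum))  # False
```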

Hey, sorry for the late reply.
My loss function is defined as follows:
loss = torch.norm(target_flow - input_flow, 2, 1)/batch_size
There are some discussions on how to calculate the loss; it seems that DDP will automatically do the batch-size averaging on the loss, so do I need to manually average the loss?
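For context, a quick sketch of what that expression produces; the (N, 2, H, W) flow shapes here are an assumption, as is the final `.sum()`. `torch.norm(..., 2, 1)` reduces only the channel dimension, so an extra reduction to a scalar is still needed before `backward()`:

```python
import torch

# Hypothetical shapes: batch of 4 two-channel flow maps, 8x8 pixels.
batch_size = 4
input_flow = torch.randn(batch_size, 2, 8, 8)
target_flow = torch.randn(batch_size, 2, 8, 8)

# The L2 norm over dim 1 leaves a per-pixel tensor, not a scalar.
epe = torch.norm(target_flow - input_flow, 2, 1)
print(epe.shape)  # torch.Size([4, 8, 8])

# A scalar is needed for backward(); summing then dividing by the
# per-process batch size matches the loss discussed in this thread.
loss = epe.sum() / batch_size
print(loss.dim())  # 0
```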

No, you don't need to manually average the loss. When using DDP, losses stay local to each process, and DDP automatically averages the gradients of all parameters across processes using AllReduce communication.

My loss function is defined as follows:
loss = torch.norm(target_flow - input_flow, 2, 1)/batch_size

The batch_size here is the per-process input batch size, right?

Yes, it's the per-process batch_size.
In fact, I think the problem is basically solved after dividing the batch size by ngpus (performance is still slightly behind DP, but that should be a tuning problem).
Thank you for your help. Best wishes!
