Hey @TT_YY, I took a closer look at the code and noticed that you converted BatchNorm to SyncBatchNorm for DDP, which might be the source of the slowness. If you look at SyncBatchNorm's implementation (see below), it launches its own communication, which is not handled by DDP. This additional communication leads to ~10% slowdown in your program when running on 2 GPUs. When I use BatchNorm instead of SyncBatchNorm, DDP is faster than DP. In general, when comparing DDP and DP speed, we need to make sure that they run the same model.
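For context, here is roughly where that extra communication comes from on the user side (a toy sketch with a placeholder model, not your script and not SyncBatchNorm's internals):

import torch.nn as nn

# placeholder model with a BatchNorm layer
net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# convert_sync_batchnorm replaces every BatchNorm*d layer with SyncBatchNorm.
# During forward, each SyncBatchNorm layer synchronizes its batch statistics
# across processes with its own collective calls; that communication runs in
# addition to the gradient all-reduce DistributedDataParallel issues in backward.
net = nn.SyncBatchNorm.convert_sync_batchnorm(net)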
This is how I measure the latency.
# run one iteration to warm up
optimizer.zero_grad()
outputs = net(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
loss_val = loss.item()
# measure latency of the second iteration
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
optimizer.zero_grad()
outputs = net(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
loss_val = loss.item()
end.record()
# CUDA ops are asynchronous; wait for both events to complete before reading the elapsed time
torch.cuda.synchronize()
print(f"world size = {args.world_size}, batch size = {batch_size}, latency = {start.elapsed_time(end)}")
I tried to run the DDP script with the following configs on two GPUs (latency numbers are in milliseconds):

- Run as is:

  world size = 2, batch size = 2048, latency = 506.9587707519531
  world size = 2, batch size = 2048, latency = 506.40606689453125

- Comment out the following line, as SyncBatchNorm has its own way to communicate buffers, which can be slower:

  #net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)

  world size = 2, batch size = 2048, latency = 456.42352294921875
  world size = 2, batch size = 2048, latency = 457.8104248046875

- Made the following edits and set args.n_gpus = 1, so the program runs DataParallel:

  #net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)
  ...
  #net = nn.parallel.DistributedDataParallel(net, device_ids=[gpu])
  net = nn.parallel.DataParallel(net)

  world size = 1, batch size = 4096, latency = 496.3483581542969
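To keep the DDP vs. DP comparison apples-to-apples, both runs should wrap the same plain-BatchNorm model. Below is a rough sketch of the two wrappings, reusing the names from your script (net, gpu, args.n_gpus); the exact branching is just illustrative, not your code:

if args.n_gpus > 1:
    # DDP run: one process per GPU, keep plain BatchNorm (no convert_sync_batchnorm)
    net = nn.parallel.DistributedDataParallel(net.cuda(gpu), device_ids=[gpu])
else:
    # DP run: a single process replicates the same plain-BatchNorm model across GPUs
    net = nn.parallel.DataParallel(net.cuda())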