I am not sure whether this is a bug in DDP training.
With the default value of bucket_cap_mb, my model runs fine on 1 GPU, but gets stuck (hanging at loss.backward()) when using multiple GPUs.

However, when I set bucket_cap_mb=50, the training runs successfully on multiple gpus.

Another workaround is to keep bucket_cap_mb unchanged and use a smaller model. The training also succeeds that way.

Is this expected? It seems like when using a larger model, we need a larger bucket_cap_mb to make it train correctly?
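
For context on what bucket_cap_mb controls: DDP groups gradients into buckets and launches one allreduce per bucket, with a default cap of 25 MB. The sketch below is a simplified greedy packing in plain Python, not DDP's actual bucketing logic (which also depends on parameter order, dtype, and device). The function name pack_buckets and the layer sizes are hypothetical; it only illustrates that a larger cap means fewer, larger allreduce calls.

```python
def pack_buckets(grad_sizes_mb, bucket_cap_mb=25):
    """Greedily pack gradient sizes (in MB) into buckets of at most bucket_cap_mb.

    Simplified illustration only: a gradient larger than the cap gets its own bucket.
    """
    buckets = []
    current, current_size = [], 0.0
    for size in grad_sizes_mb:
        # Start a new bucket when adding this gradient would exceed the cap.
        if current and current_size + size > bucket_cap_mb:
            buckets.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical per-layer gradient sizes in MB (assumed values, not from a real model).
layer_grads_mb = [10, 10, 10, 10, 10, 10]

default_buckets = pack_buckets(layer_grads_mb)      # cap of 25 MB
larger_buckets = pack_buckets(layer_grads_mb, 50)   # cap of 50 MB

# Each bucket corresponds to one allreduce in this simplified model.
print(len(default_buckets), len(larger_buckets))
```

In this toy model the 50 MB cap produces fewer buckets than the 25 MB cap, so changing bucket_cap_mb changes how many allreduce calls are issued during backward and when they fire.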

Hmm, I’d be surprised if this were the case. DDP makes sure that all ranks launch the same set of allreduce operations in the same order.

Besides DDP, does any part of your model/program also launch collective communications? I ask because the symptom looks like another collective is being fired somewhere during the backward pass, causing a desync when combined with DDP’s allreduce.
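
To make the desync scenario concrete, here is a minimal pure-Python sketch (no torch.distributed involved). Each rank is modeled as an ordered list of collective calls, and a collective can only complete when every rank issues the same call at the same position. The function first_mismatch and the call labels are hypothetical names chosen for illustration.

```python
def first_mismatch(calls_per_rank):
    """Return the index of the first position where ranks disagree on which
    collective they issue, or None if all ranks issue identical sequences."""
    longest = max((len(calls) for calls in calls_per_rank), default=0)
    for i in range(longest):
        # A rank that has run out of calls is represented by None.
        issued = {calls[i] if i < len(calls) else None for calls in calls_per_rank}
        if len(issued) > 1:
            return i
    return None


# Both ranks log first, then run the bucket allreduces in the same order.
in_sync = [
    ["allreduce:log", "allreduce:bucket0", "allreduce:bucket1"],
    ["allreduce:log", "allreduce:bucket0", "allreduce:bucket1"],
]

# Rank 1 issues an extra logging allreduce that rank 0 skips, so rank 0's
# bucket0 call lines up with rank 1's logging call.
desynced = [
    ["allreduce:bucket0", "allreduce:bucket1"],
    ["allreduce:log", "allreduce:bucket0", "allreduce:bucket1"],
]

print(first_mismatch(in_sync))
print(first_mismatch(desynced))
```

In a real job a mismatch like the second case can show up as a hang rather than an error, because each rank waits for peers that never arrive at the same collective.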

Thanks for the reply.
As you mentioned, I did launch some communications elsewhere.

I am training a multi-task learning model, and I want to log each head’s loss collected from all GPUs. So before loss.backward(), I first dist.all_reduce() the losses and also some stats from the dataloader.

My training step looks like this:

def dist_reduce_sum(tensor):
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    return rt

def train_one_iter(inp, gt):
    optimizer.zero_grad()
    # forward model
    output = model(inp)
    # calc loss for each head
    l1 = loss_head1(output[head1], gt[head1])
    l2 = loss_head2(output[head2], gt[head2])
    l3 = loss_head3(output[head3], gt[head3])
    # barrier
    torch.distributed.barrier()
    # this is for logging (collect each head loss from all gpu, and also some stats from dataloader)
    stat1 = dist_reduce_sum(inp[head1][some_stat])
    stat2 = dist_reduce_sum(inp[head2][some_stat])
    stat3 = dist_reduce_sum(inp[head3][some_stat])
    l1_all = dist_reduce_sum(l1) / stat1
    l2_all = dist_reduce_sum(l2) / stat2
    l3_all = dist_reduce_sum(l3) / stat3
    logging.info(l1_all, l2_all, l3_all)
    # backward
    total_loss = l1 + l2 + l3
    total_loss.backward()  # !! this is where it hangs !!
    # step
    optimizer.step()

Is my training procedure above problematic for DDP?
Thanks again for the help! @mrshenli

GPUs are 8xA100 (using only 4 of them) running in a docker container with Ubuntu 18.04.6 LTS.

I tried another environment, and the training also hangs there:
pytorch 1.12.1 py3.9_cuda11.3_cudnn8.3.2_0
torchvision 0.13.1 py39_cu113 pytorch
cudatoolkit 11.3.1 h2bc3f7f_2 defaults