Does the default bucket_cap_mb cause training to hang for large models?

I am not sure whether this is a bug in DDP training.
With the default value of bucket_cap_mb, my model runs smoothly on 1 GPU, but hangs at loss.backward() when using multiple GPUs.

However, when I set bucket_cap_mb=50, the training runs successfully on multiple gpus.
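
For reference, the setting described above is passed to the DistributedDataParallel constructor; the documented default for bucket_cap_mb is 25. The following is only a minimal sketch under stated assumptions: it uses a single-process gloo group on CPU so that it runs without a launcher, and the tiny nn.Linear model and the name run_single_process_demo are placeholders, not the setup from this post.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_single_process_demo() -> float:
    """Wrap a toy model in DDP with bucket_cap_mb=50 and run one backward pass."""
    # Assumption: a one-process "gloo" group stands in for the real multi-GPU NCCL job.
    init_dir = tempfile.mkdtemp()
    init_path = os.path.join(init_dir, "ddp_init")  # created by the file store
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_path}",
        rank=0,
        world_size=1,
    )
    try:
        # bucket_cap_mb sets the size (in MB) of DDP's gradient buckets.
        model = DDP(nn.Linear(4, 2), bucket_cap_mb=50)
        loss = model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()  # DDP all-reduces gradient buckets during this call
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(run_single_process_demo())
```

With a single process there is nothing to desynchronize, so this only shows where the parameter goes; it does not reproduce the hang.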

Another workaround is to keep bucket_cap_mb unchanged and use a smaller model. In that case, the training also succeeds.

Is this expected? It seems that a larger model needs a larger bucket_cap_mb to train correctly?

hmm, I’d be surprised if this is the case. DDP makes sure that all ranks launch the same set of allreduce in the same order.

Besides DDP, does any part of your model/program also launch collective communications? I ask because the symptom looks as if another collective is fired somewhere during the backward pass and causes a desync when combined with DDP’s allreduce.
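
To make the ordering requirement concrete, here is a toy model in plain Python (no PyTorch, no NCCL). It assumes that a collective at step i can complete only if every rank issues the same operation at step i; the schedule contents and the function name first_mismatch are hypothetical and only illustrate the idea. Real backends are more subtle than this.

```python
from typing import List, Optional


def first_mismatch(schedules: List[List[str]]) -> Optional[int]:
    """Return the first step at which ranks disagree on the collective, or None.

    Toy model: one list of operation names per rank, issued in order.
    A real backend would typically block at the mismatching step.
    """
    longest = max(len(ops) for ops in schedules)
    for step in range(longest):
        # A rank that has run out of operations counts as issuing nothing.
        issued = {ops[step] if step < len(ops) else "<nothing>" for ops in schedules}
        if len(issued) > 1:
            return step
    return None


# Every rank issues the same collectives in the same order.
in_sync = [
    ["log_allreduce", "grad_bucket_0", "grad_bucket_1"],
    ["log_allreduce", "grad_bucket_0", "grad_bucket_1"],
]

# Hypothetical desync: rank 1 fires an extra collective during backward.
out_of_sync = [
    ["log_allreduce", "grad_bucket_0", "grad_bucket_1"],
    ["log_allreduce", "grad_bucket_0", "extra_allreduce", "grad_bucket_1"],
]

print(first_mismatch(in_sync))      # None
print(first_mismatch(out_of_sync))  # 2
```

In this toy model, collectives issued outside DDP are harmless as long as every rank issues them at the same point relative to DDP's own allreduce calls.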

Thanks for the reply.
As you mentioned, I did launch some communications elsewhere.

I am training a multi-task learning model, and I want to log each head’s loss collected from all GPUs. So before loss.backward(), I first call dist.all_reduce() on the losses and on some stats from the dataloader.

My training step looks like this:

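import torch.distributed as dist

def dist_reduce_sum(tensor):
    # sum a copy of the tensor across all ranks; the input itself is not modified
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    return rt

def train_one_iter(inp, gt):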
    # forward model
    output = model(inp)

    # calc loss for each head
    l1 = loss_head1(output[head1], gt[head1])
    l2 = loss_head2(output[head2], gt[head2])
    l3 = loss_head3(output[head3], gt[head3])

    # barrier

    # this is for logging (collect each head loss from all gpu, and also some stats from dataloader)
    stat1 = dist_reduce_sum(inp[head1][some_stat])
    stat2 = dist_reduce_sum(inp[head2][some_stat])
    stat3 = dist_reduce_sum(inp[head3][some_stat])
    l1_all = dist_reduce_sum(l1) / stat1
    l2_all = dist_reduce_sum(l2) / stat2
    l3_all = dist_reduce_sum(l3) / stat3
    # l1_all, l2_all, l3_all are then logged

    # backward
    total_loss = l1 + l2 + l3
    total_loss.backward() # !! this is where it hangs !!

    # step

Is the training procedure above problematic for DDP?
Thanks again for the help! @mrshenli

btw, I use the timm Regnet800M model and convert the model to SyncBN using their func here.

From @Yanli_Zhao,

Hey @lzhbrian, which PyTorch version are you using? For some old PyTorch releases, this desync might be caused by DDP rebuild bucket logic.

I am using:

nvidia driver: 460.27.04
cuda: cuda_11.1.TC455_06.29190527_0

NCCL is used as backend for DDP.

GPUs are 8xA100 (using only 4 of them) running in a docker container with Ubuntu 18.04.6 LTS.

I tried another environment, and the training hangs there too:
pytorch 1.12.1 py3.9_cuda11.3_cudnn8.3.2_0
torchvision 0.13.1 py39_cu113 pytorch
cudatoolkit 11.3.1 h2bc3f7f_2 defaults

Another observation update:

If I disable SyncBN, the training also succeeds …

Hi @mrshenli, any update on this?