Default bucket_cap_mb param causes training to hang for large model?

lzhbrian · October 13, 2022, 7:28pm

I am not sure whether this is a bug in DDP training.
When using the default value for bucket_cap_mb, my model runs smoothly on 1 GPU, but gets stuck (hanging at loss.backward()) when using multiple GPUs.

However, when I set bucket_cap_mb=50, the training runs successfully on multiple GPUs.

Another workaround is to keep bucket_cap_mb unchanged and use a smaller model. In that case, the training also succeeds.

Is this expected? It seems like when using a larger model, we need a larger bucket_cap_mb to make it train correctly?
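
For context, bucket_cap_mb controls how DDP groups gradients into buckets, with one allreduce launched per bucket, so the cap changes how many collectives fire during backward. The sketch below is a simplified toy packing in plain Python, not DDP's actual algorithm (real DDP differs in details such as a smaller first bucket and rebuilding buckets after the first iteration). The function name pack_into_buckets and the parameter sizes are hypothetical.

```python
from typing import List

MIB = 1024 * 1024


def pack_into_buckets(param_bytes: List[int], bucket_cap_mb: float) -> List[List[int]]:
    """Greedily group parameter sizes (in bytes) into buckets of at most
    bucket_cap_mb MiB, walking the list in reverse like a backward pass.

    Simplified assumption: a parameter larger than the cap gets its own bucket.
    """
    cap = int(bucket_cap_mb * MIB)
    buckets: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in reversed(param_bytes):
        # Close the current bucket if adding this parameter would exceed the cap.
        if current and current_size + size > cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical model: 30 gradient tensors of 10 MiB each.
params = [10 * MIB] * 30
buckets_at_25 = pack_into_buckets(params, bucket_cap_mb=25)
buckets_at_50 = pack_into_buckets(params, bucket_cap_mb=50)
print(len(buckets_at_25), len(buckets_at_50))
```

In this toy model a larger cap simply means fewer, larger allreduce calls per backward pass. It does not by itself explain a hang, but it shows why changing the cap can change how DDP's collectives interleave with anything else the program launches.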

mrshenli · October 18, 2022, 3:48am

Is this expected? It seems like when using a larger model, we need a larger bucket_cap_mb to make it train correctly?

hmm, I’d be surprised if this is the case. DDP makes sure that all ranks launch the same set of allreduce in the same order.

Besides DDP, does any part of your model/program also launch collective communications? Asking because the symptom looks like another collective is fired somewhere during the backward pass and causes a desync problem when combined with DDP's allreduce.
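
To illustrate the kind of desync described above: collectives on a process group are, roughly speaking, paired across ranks by the order in which they are issued, so if one rank interleaves an extra collective at a different point, the ranks end up waiting on different operations. The following is a toy model in plain Python, not PyTorch code. The helper first_mismatch and the trace labels are hypothetical.

```python
from typing import Dict, List, Optional


def first_mismatch(calls_per_rank: Dict[int, List[str]]) -> Optional[int]:
    """Return the index of the first collective where ranks disagree, or None
    if every rank issues the same sequence.

    Assumption of this toy model: the k-th call on one rank pairs with the
    k-th call on every other rank.
    """
    sequences = list(calls_per_rank.values())
    if not sequences:
        return None
    longest = max(len(seq) for seq in sequences)
    for step in range(longest):
        ops = {seq[step] if step < len(seq) else "<nothing>" for seq in sequences}
        if len(ops) > 1:
            return step
    return None


# Hypothetical traces: the strings are labels only, not real identifiers.
ddp_only = ["allreduce(bucket0)", "allreduce(bucket1)"]
in_sync = {0: ddp_only, 1: ddp_only}

# Rank 1 fires an extra collective in the middle of backward.
out_of_sync = {
    0: ["allreduce(bucket0)", "allreduce(bucket1)", "allreduce(extra)"],
    1: ["allreduce(bucket0)", "allreduce(extra)", "allreduce(bucket1)"],
}

print(first_mismatch(in_sync))
print(first_mismatch(out_of_sync))
```

In a real job the mismatched step would typically show up as a hang rather than an error, because each rank blocks waiting for peers that are executing a different operation.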

lzhbrian · October 18, 2022, 5:50am

Thanks for the reply.
As you mentioned, I did launch some communications elsewhere.

I am training a multi-task learning model, and I want to log each head's loss collected from all GPUs. So before loss.backward(), I first dist.all_reduce() the losses and also some stats from the dataloader.

My training step looks like this:

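import logging

import torch
import torch.distributed as dist

def dist_reduce_sum(tensor):
    # sum a copy of the tensor across all ranks, leaving the input untouched
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    return rt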

def train_one_iter(inp, gt):

    optimizer.zero_grad()

    # forward model
    output = model(inp)

    # calc loss for each head
    l1 = loss_head1(output[head1], gt[head1])
    l2 = loss_head2(output[head2], gt[head2])
    l3 = loss_head3(output[head3], gt[head3])

    # barrier
    torch.distributed.barrier()

    # for logging: collect each head's loss from all GPUs, plus some stats from the dataloader
    stat1 = dist_reduce_sum(inp[head1][some_stat])
    stat2 = dist_reduce_sum(inp[head2][some_stat])
    stat3 = dist_reduce_sum(inp[head3][some_stat])
    l1_all = dist_reduce_sum(l1) / stat1
    l2_all = dist_reduce_sum(l2) / stat2
    l3_all = dist_reduce_sum(l3) / stat3
    logging.info("head losses: %s %s %s", l1_all, l2_all, l3_all)

    # backward
    total_loss = l1 + l2 + l3
    total_loss.backward() # !! this is where it hangs !!

    # step
    optimizer.step()

Is the training procedure above problematic for DDP?
Thanks again for the help! @mrshenli

lzhbrian · October 18, 2022, 9:14am

By the way, I use the timm Regnet800M model and convert it to SyncBN using their function here:

https://github.com/rwightman/pytorch-image-models/blob/4f72bae43be26d9764a08d83b88f8bd4ec3dbe43/timm/models/layers/norm_act.py#L130

    # Do not create this module directly or use the PyTorch conversion function.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = super().forward(x)  # SyncBN doesn't work with torchscript anyways, so this is fine
        if hasattr(self, "drop"):
            x = self.drop(x)
        if hasattr(self, "act"):
            x = self.act(x)
        return x


def convert_sync_batchnorm(module, process_group=None):
    # convert both BatchNorm and BatchNormAct layers to Synchronized variants
    module_output = module
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        if isinstance(module, BatchNormAct2d):
            # convert timm norm + act layer
            module_output = SyncBatchNormAct(
                module.num_features,
                module.eps,
                module.momentum,
                module.affine,

mrshenli · October 18, 2022, 6:47pm

From @Yanli_Zhao,

Hey @lzhbrian, which PyTorch version are you using? In some older PyTorch releases, this desync might be caused by DDP's bucket-rebuilding logic.

lzhbrian · October 19, 2022, 1:07am

I am using:

torch==1.10.0+cu111
torchvision==0.11.0+cu111
python==3.8.12
nvidia driver: 460.27.04
cuda: cuda_11.1.TC455_06.29190527_0

NCCL is used as backend for DDP.

GPUs are 8xA100 (using only 4 of them) running in a docker container with Ubuntu 18.04.6 LTS.

I also tried another environment, and the training hangs there too:
pytorch 1.12.1 py3.9_cuda11.3_cudnn8.3.2_0
torchvision 0.13.1 py39_cu113 pytorch
cudatoolkit 11.3.1 h2bc3f7f_2 defaults
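
When chasing a hang like this, it may help to turn on the distributed debug output before the process group is created. The sketch below only sets environment variables that NCCL and torch.distributed document for this purpose. The helper name distributed_debug_env is hypothetical, and how much TORCH_DISTRIBUTED_DEBUG reports depends on the PyTorch version, so treat this as a starting point rather than a guaranteed diagnosis.

```python
import os
from typing import Dict


def distributed_debug_env() -> Dict[str, str]:
    """Environment variables commonly used to get more detail on a hang."""
    return {
        # NCCL's own logging (communicator setup, collective calls).
        "NCCL_DEBUG": "INFO",
        # Extra consistency checks and logging in torch.distributed.
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
        # Include C++ stack traces in error messages.
        "TORCH_SHOW_CPP_STACKTRACES": "1",
    }


# Apply in every training process before torch.distributed is initialized.
os.environ.update(distributed_debug_env())
```

The same variables can instead be exported in the shell that launches the job, which avoids any ordering concerns inside the training script.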

lzhbrian · October 19, 2022, 2:28am

Another observation:

If I disable SyncBN, the training also succeeds …

lzhbrian · October 26, 2022, 3:55pm

Hi @mrshenli, any update on this?

Thanks!