Training diverges for multi-GPU DDP with gradient accumulation

I am trying to train a neural network with DistributedDataParallel (DDP). Each frame (not image) from the dataset has a variable size and consists of several patches. Since the optimizer step should be done once per frame, gradient accumulation needs to be incorporated. In my DDP setup, batch_size=1, which means one frame per GPU, and the Adam optimizer with learning rate 1e-3 is used. After 15-20 epochs, the training/validation loss starts to increase significantly and training fails to converge.

From an additional test, the single-GPU case works well with the same script: the training loss decreases monotonically. My current guess is that the problem is connected with the unavoidable sync between GPUs during the gradient accumulation stage.

I’ve tested with different versions of PyTorch (the older 1.4.0 and the newer 1.8.0).
Could you help me please? Thanks!

Here is my code:
(train.py)

        # accumulate gradients locally for all patches except the last one;
        # no_sync() skips DDP's gradient allreduce for these backward passes
        with ddp_net.no_sync():
            for patch_ind in range(patch_size - 1):
                # extract patch
                patch_node_features = batch_node_features[patch_ind].cuda(non_blocking=True)
                patch_gt = batch_gt[patch_ind].cuda(non_blocking=True)
                patch_output = ddp_net(patch_node_features)

                # loss calculation (accumulate the raw loss for logging)
                train_loss = training_criterion(patch_output, patch_gt)
                train_loss_mean += train_loss.detach().cpu().item()
                train_loss = train_loss / train_node_num
                train_loss.backward()


        # last patch outside no_sync(): this backward() triggers the
        # gradient allreduce across ranks
        patch_node_features = batch_node_features[patch_size - 1].cuda(non_blocking=True)
        patch_gt = batch_gt[patch_size - 1].cuda(non_blocking=True)
        patch_output = ddp_net(patch_node_features)
        train_loss = training_criterion(patch_output, patch_gt)
        train_loss_mean += train_loss.detach().cpu().item()
        train_loss = train_loss / train_node_num
        train_loss.backward()                               # with sync

        # optimizer step (once per frame), then reset accumulated gradients
        iteration_count += 1
        optimizer.step()
        ddp_net.zero_grad()

There shouldn’t be any sync between the GPUs going on if you’ve disabled gradient synchronization with the no_sync context manager. You can verify this by gathering the gradients from all ranks and observing that they are different.
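For example, a quick check could look something like the sketch below (the helper name check_grads_differ is just for illustration, and it assumes every rank calls it right after a backward() that ran under no_sync()):

    import torch
    import torch.distributed as dist

    def check_grads_differ(ddp_net):
        # hypothetical helper: gather each parameter's gradient from all ranks
        # and report (on rank 0) whether they are identical; with no_sync()
        # active during backward(), they should generally differ
        world_size = dist.get_world_size()
        for name, param in ddp_net.named_parameters():
            if param.grad is None:
                continue
            grads = [torch.zeros_like(param.grad) for _ in range(world_size)]
            dist.all_gather(grads, param.grad)
            if dist.get_rank() == 0:
                same = all(torch.allclose(grads[0], g) for g in grads[1:])
                print(f"{name}: gradients identical across ranks -> {same}")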

When switching to distributed training, there may be a need to tune certain parameters to improve the accuracy. Have you tried tuning the gradient sync interval, batch size, learning rate, etc.?
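On the learning rate point, one common heuristic (just a sketch, and base_lr is my own placeholder name) is to scale the single-GPU learning rate by the world size, since DDP averages gradients over world_size frames per optimizer step instead of one:

    import torch
    import torch.distributed as dist

    base_lr = 1e-3                                   # the single-GPU setting
    scaled_lr = base_lr * dist.get_world_size()      # linear scaling heuristic
    optimizer = torch.optim.Adam(ddp_net.parameters(), lr=scaled_lr)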

@rvarm1, thanks for your response!

In my case, batch_size=1, which corresponds to one frame per GPU, and the gradient accumulation interval includes all patches (one frame is a set of multiple patches). Basically, multi-GPU allows me to speed up training time, which is the main bottleneck for my task.

The training script is configured to support both multi-GPU and single-GPU runs with torch.multiprocessing.spawn; a rough sketch of that setup is shown below. When I train with 1 GPU, training works well. But for multi-GPU it fails, and the train/val loss starts to increase after a few epochs. The only difference between the successful and failed cases is the gradient sync.
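Roughly, the setup looks like this (simplified; main_worker, build_model, and the address/port are placeholders, and the dataset/dataloader code is omitted):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main_worker(rank, world_size):
        # placeholder rendezvous settings
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        net = build_model().cuda(rank)            # build_model is a placeholder
        ddp_net = DDP(net, device_ids=[rank])
        optimizer = torch.optim.Adam(ddp_net.parameters(), lr=1e-3)
        # ... build the dataset/dataloader (batch_size=1, DistributedSampler)
        # and run the per-frame accumulation loop from train.py above ...

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()    # 1 for single-GPU, >1 for DDP
        mp.spawn(main_worker, args=(world_size,), nprocs=world_size)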

I also tried different learning rates, but there was no change.