Skipping a training iteration in a DDP setting results in stalled training

Hi All,

I want to skip a training iteration on a single GPU when the training loss is greater than 10,000 (for example). The naive way of doing it is as follows.

    for x in train_dataloader:
        optimizer.zero_grad()

        out = model(x)
        out_criterion = criterion(out, x)
        out_criterion.backward()

        if out_criterion > 10000:  # skip the optimizer update on this rank
            continue

        optimizer.step()

However, this results in “stalled training” due to a synchronization issue between the GPUs. Is there a more efficient way of achieving this, especially in the DDP setting?

Maybe you could allreduce the losses, check the reduced value, and skip the step() on all ranks, instead of checking the loss on each rank separately, which would make the runs diverge.


Thank you for your reply. Do you mean something like this?

        dist.barrier()
        dist.all_reduce(out_criterion, op=dist.ReduceOp.SUM)

The drawback of this is that if we have N GPUs for training and only a single GPU has e.g. out_criterion = 10000 while the others have normal losses, then the gradients of all those other GPUs will also be wasted, since we are not taking the optimization step.

Yes, but this is also expected, since the parameters on all ranks are equal in each iteration. If you skipped the update on only one rank, your training would diverge, since the parameters would differ across ranks.
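
For reference, here is a rough, untested sketch of the pattern described above (the 10,000 threshold, the mean reduction, and the variable names are just illustrative; you could also reduce with ReduceOp.MAX if you want to skip whenever any single rank exceeds the threshold):

    import torch
    import torch.distributed as dist

    for x in train_dataloader:
        optimizer.zero_grad()

        out = model(x)
        out_criterion = criterion(out, x)
        out_criterion.backward()

        # Average the loss over all ranks so every rank sees the same value
        loss_all = out_criterion.detach().clone()
        dist.all_reduce(loss_all, op=dist.ReduceOp.SUM)
        loss_all /= dist.get_world_size()

        # Every rank makes the same decision, so no rank is left waiting
        if loss_all.item() > 10000:
            continue

        optimizer.step()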

Got it, thanks for your response.